
Quantizing V cache not working yet #4425

Closed
@CISC


Quantizing the K cache (`-ctk`) works, but quantizing the V cache (`-ctv`) does not. I've tried q4_0, q4_1, q8, etc.
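For reference, this is the kind of invocation that reproduces it (a sketch; the model path, `-ngl` layer count, and prompt are placeholders, and any quantized `-ctv` type fails the same way):

```sh
# hypothetical reproduction: K cache left at f16, V cache quantized to q4_1
./main -m model.gguf -ngl 99 -ctv q4_1 -p "test"
```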

Using the `cublas-cu12.2.0` release build, I get the following error:

```
llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)

CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"
```
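For background, my guess at the cause (an assumption, not confirmed in this issue): the V cache is stored transposed, so appending one token writes single elements at a large stride, whereas quantized formats like q4_1 pack 32 values into a block with a shared scale and min, so a single element cannot be written without re-quantizing its whole block. A simplified C sketch of the q4_1 block layout (ggml's actual struct stores `d` and `m` as fp16, not float) illustrates this:

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define QK4_1 32  // q4_1 packs 32 values per block

// Simplified version of ggml's block_q4_1: per-block scale and min,
// plus 16 bytes holding 32 4-bit quantized values.
typedef struct {
    float   d;               // scale (fp16 in ggml)
    float   m;               // min   (fp16 in ggml)
    uint8_t qs[QK4_1 / 2];   // two 4-bit values per byte
} block_q4_1;

// Quantize one contiguous block of 32 floats.
// Key point: d and m depend on ALL 32 inputs, so a single element
// cannot be updated in place without re-quantizing the whole block —
// which is exactly what a strided, one-element-per-row write into a
// transposed V cache would require.
static void quantize_block_q4_1(const float *x, block_q4_1 *y) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK4_1; i++) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = d;
    y->m = min;
    for (int i = 0; i < QK4_1 / 2; i++) {
        const uint8_t q0 = (uint8_t)fminf(15.0f, roundf((x[2*i + 0] - min) * id));
        const uint8_t q1 = (uint8_t)fminf(15.0f, roundf((x[2*i + 1] - min) * id));
        y->qs[i] = q0 | (q1 << 4);
    }
}

int main(void) {
    float v[QK4_1];
    for (int i = 0; i < QK4_1; i++) v[i] = sinf((float)i);
    block_q4_1 b;
    quantize_block_q4_1(v, &b);
    printf("d=%f m=%f first byte=0x%02x\n", b.d, b.m, b.qs[0]);
    return 0;
}
```

If that is indeed the failing path, it would explain why the K cache (written contiguously per token) quantizes fine while the V cache hits `invalid argument` in the CUDA copy.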
