
Quantizing V cache not working yet #4425

Closed
@CISC


Quantizing the K cache (`-ctk`) works, but quantizing the V cache (`-ctv`) does not. I've tried q4_0, q4_1, q8, etc.
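For reference, this is the kind of invocation that reproduces it (a sketch; the model path, `-ngl` layer count, and prompt are placeholders, and any quantized `-ctv` type fails the same way):

```sh
# hypothetical reproduction: K cache left at f16, V cache quantized to q4_1
./main -m model.gguf -ngl 99 -ctv q4_1 -p "test"
```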

Using the `cublas-cu12.2.0` release build, I get the following error:

```
llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)

CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"
```
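For background, my guess at the cause (an assumption, not confirmed in this issue): the V cache is stored transposed, so appending one token writes single elements at a large stride, whereas quantized formats like q4_1 pack 32 values into a block with a shared scale and min, so a single element cannot be written without re-quantizing its whole block. A simplified C sketch of the q4_1 block layout (ggml's actual struct stores `d` and `m` as fp16, not float) illustrates this:

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define QK4_1 32  // q4_1 packs 32 values per block

// Simplified version of ggml's block_q4_1: per-block scale and min,
// plus 16 bytes holding 32 4-bit quantized values.
typedef struct {
    float   d;               // scale (fp16 in ggml)
    float   m;               // min   (fp16 in ggml)
    uint8_t qs[QK4_1 / 2];   // two 4-bit values per byte
} block_q4_1;

// Quantize one contiguous block of 32 floats.
// Key point: d and m depend on ALL 32 inputs, so a single element
// cannot be updated in place without re-quantizing the whole block —
// which is exactly what a strided, one-element-per-row write into a
// transposed V cache would require.
static void quantize_block_q4_1(const float *x, block_q4_1 *y) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK4_1; i++) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = d;
    y->m = min;
    for (int i = 0; i < QK4_1 / 2; i++) {
        const uint8_t q0 = (uint8_t)fminf(15.0f, roundf((x[2*i + 0] - min) * id));
        const uint8_t q1 = (uint8_t)fminf(15.0f, roundf((x[2*i + 1] - min) * id));
        y->qs[i] = q0 | (q1 << 4);
    }
}

int main(void) {
    float v[QK4_1];
    for (int i = 0; i < QK4_1; i++) v[i] = sinf((float)i);
    block_q4_1 b;
    quantize_block_q4_1(v, &b);
    printf("d=%f m=%f first byte=0x%02x\n", b.d, b.m, b.qs[0]);
    return 0;
}
```

If that is indeed the failing path, it would explain why the K cache (written contiguously per token) quantizes fine while the V cache hits `invalid argument` in the CUDA copy.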
