Closed
Description
Name and Version
b4365 onward
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
I'm observing a generation-speed slowdown between b4363 and b4365 that persists in current builds. I tested two models:
https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/blob/main/gemma-2-27b-it-Q5_K_L.gguf
https://huggingface.co/tensorblock/Qwen2.5-32B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-32B-Instruct-abliterated-Q5_K_M.gguf
Results (generation speed):

| build | qwen | gemma |
|--------|----------|----------|
| b4363 | 31.7 t/s | 36.1 t/s |
| b4365 | 24.5 t/s | 22.7 t/s |
| change | -23% | -37% |
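For reference, the percentage drops above follow directly from the measured t/s values; a minimal sketch of the arithmetic (values taken from the table):

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Measured tokens/s: (b4363, b4365)
results = {"qwen": (31.7, 24.5), "gemma": (36.1, 22.7)}

for model, (before, after) in results.items():
    print(f"{model}: {pct_change(before, after):+.0f}%")
# qwen: -23%, gemma: -37%
```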
Command used:
```
.\llama-server.exe --model <model> --ctx-size 8192 --threads 10 --no-mmap --mlock --n-gpu-layers 999 --log-disable --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
```
Windows 10
First Bad Commit
between b4363 and b4365
Relevant log output
No response