Misc. bug: Large performance regression since version b4365 #10977

Closed
@GlasslessPizza

Description

Name and Version

b4365 onward

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

I'm observing a slowdown between b4363 and b4365 that persists to this day. I tried two models:

https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/blob/main/gemma-2-27b-it-Q5_K_L.gguf
https://huggingface.co/tensorblock/Qwen2.5-32B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-32B-Instruct-abliterated-Q5_K_M.gguf

Results:

      |   qwen   |   gemma
-----------------------------
b4363 | 31.7 t/s | 36.1 t/s
b4365 | 24.5 t/s | 22.7 t/s
-----------------------------
      |  -23%    |  -37%
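The percentage drops in the last row follow directly from the measured throughputs; a quick sketch to recompute them (model names and numbers taken from the table above):

```python
# Measured tokens/s from the table: b4363 (before) vs b4365 (after)
rates = {
    "qwen": (31.7, 24.5),
    "gemma": (36.1, 22.7),
}

for model, (before, after) in rates.items():
    # Relative change, rounded to whole percent
    drop = round((after - before) / before * 100)
    print(f"{model}: {drop}%")  # qwen: -23%, gemma: -37%
```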

Command used:

.\llama-server.exe --model <model> --ctx-size 8192 --threads 10 --no-mmap --mlock --n-gpu-layers 999 --log-disable --flash-attn --cache-type-k q8_0 --cache-type-v q8_0

Windows 10

First Bad Commit

between b4363 and b4365

Relevant log output

No response
