
kv-cache : use ggml_set_rows #14285


Open · ggerganov wants to merge 9 commits into master from gg/kv-cache-use-set-rows
Conversation

@ggerganov (Member) commented Jun 19, 2025

Depends on #14274.

Utilize ggml_set_rows() for updating the KV cache.

  • Make the graph static with respect to the KV cells head offset
  • Relax the requirement for contiguous KV slots for the input ubatch, making the defrag logic almost obsolete

Currently this is enabled only if the environment variable LLAMA_SET_ROWS is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of contiguous slots. This is needed until the ggml_set_rows() implementation is finalized and supported by all backends.
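
For reference, a rough sketch of the new update path, assuming the ggml_set_rows(ctx, a, b, c) API described later in this thread (names like k_all, k_cur and kv_idxs are illustrative, not taken from the PR; ctx, gf and n_tokens are assumed to be in scope during graph build):

// Before: the ubatch's K rows had to land in one contiguous slot, so the
// update was a view at the current head offset followed by a copy, which
// made the graph depend on the head value.
//
// After: each row is addressed through an I64 index tensor, so the cells
// need not be contiguous and the graph stays identical across decode steps.

// k_all : the K cache viewed as a 2D tensor [n_embd_k, kv_size]
// k_cur : K of the current ubatch,           [n_embd_k, n_tokens]
struct ggml_tensor * kv_idxs = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, n_tokens);
ggml_set_input(kv_idxs); // filled per ubatch with the destination cell ids

// scatter the new rows into the cache at arbitrary positions
ggml_build_forward_expand(gf, ggml_set_rows(ctx, k_all, k_cur, kv_idxs));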

Will merge after #14274.

Testing

# regular
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa
# SWA
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/gemma-3-4b/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

Next PRs

  • Introduce the concept of "virtual sequences"
  • Extend llama_kv_cache_unified to support virtual sequences
  • Extend the attention graph to support virtual sequences
  • Enable efficient multi-sequence decoding

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 19, 2025
@ggerganov force-pushed the gg/model-rework-out-ids branch from 1e86597 to 2b940c0 on June 20, 2025 07:16
Base automatically changed from gg/model-rework-out-ids to master on June 20, 2025 07:50
@rgerganov (Collaborator) commented:

I tried this PR with the following change in the RPC backend:

diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash.

In any case, this looks like great progress!
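
For reference, the fnv_hash used in the diff above is a 64-bit FNV-1a over the serialized bytes; a generic version looks roughly like this (the actual helper in ggml-rpc.cpp may differ in details):

#include <cstdint>
#include <cstddef>

// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime
static uint64_t fnv_hash(const uint8_t * data, size_t len) {
    const uint64_t fnv_prime = 0x00000100000001B3ULL;
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis

    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= fnv_prime;
    }
    return hash;
}

Because the input includes raw tensor addresses, two graphs that are structurally identical but allocated at different addresses hash differently, which is consistent with the gemma3 observation above.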

@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 8f1c5e3 to 5f87f28 on June 20, 2025 07:59
@ggerganov (Member, Author) commented:

> However, the hash does change with gemma3 models.

Yes, this is expected. I've applied the change only for the unified cache. The unified+iswa cache is still using ggml_cpy:

https://github.com/ggml-org/llama.cpp/pull/14285/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR1290-R1291

@ggerganov (Member, Author) commented:

Should work with Gemma now as well.

@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 4d0c0ea to db0cd69 on June 20, 2025 09:01
@ggerganov (Member, Author) commented:

The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

@ggerganov (Member, Author) commented:

> The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with a CPU-only build, so the implementation should be good enough, I think.

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.

ref: #8366
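
A toy illustration of these copy semantics (assumed shapes and illustrative names a, b, c; the index tensor is I64):

// a: destination [n_embd, 8], b: source rows [n_embd, 3]
// with c = {5, 0, 2}: row 0 of b goes to row 5 of a,
// row 1 of b to row 0 of a, and row 2 of b to row 2 of a
struct ggml_tensor * c = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, 3);
ggml_set_input(c); // filled with {5, 0, 2} before graph compute
struct ggml_tensor * r = ggml_set_rows(ctx, a, b, c);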
@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from a0c0fb6 to d40f705 on June 21, 2025 06:09
@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from d40f705 to d1da992 on June 21, 2025 06:20
@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 1031a5d to 14554a8 on June 21, 2025 12:29
@ggerganov marked this pull request as ready for review on June 21, 2025 13:42