
kv-cache : use ggml_set_rows #14285


Open · ggerganov wants to merge 9 commits into master from gg/kv-cache-use-set-rows
Conversation

@ggerganov (Member) commented Jun 19, 2025

Depends on #14274.

Utilize ggml_set_rows() for updating the KV cache.

  • Make the graph static with respect to the KV cells head offset
  • Relax the requirement for contiguous KV slots for the input ubatch, making the defrag logic almost obsolete

Currently this is enabled only if the environment variable LLAMA_SET_ROWS is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of contiguous slots. This is needed until the ggml_set_rows() implementation is finalized and supported by all backends.
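
For reference, a rough sketch of the new update path, assuming the ggml_set_rows(ctx, a, b, c) API described later in this thread (names like k_all, k_cur and kv_idxs are illustrative, not taken from the PR; ctx, gf and n_tokens are assumed to be in scope during graph build):

// Before: the ubatch's K rows had to land in one contiguous slot, so the
// update was a view at the current head offset followed by a copy, which
// made the graph depend on the head value.
//
// After: each row is addressed through an I64 index tensor, so the cells
// need not be contiguous and the graph stays identical across decode steps.

// k_all : the K cache viewed as a 2D tensor [n_embd_k, kv_size]
// k_cur : K of the current ubatch,           [n_embd_k, n_tokens]
struct ggml_tensor * kv_idxs = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, n_tokens);
ggml_set_input(kv_idxs); // filled per ubatch with the destination cell ids

// scatter the new rows into the cache at arbitrary positions
ggml_build_forward_expand(gf, ggml_set_rows(ctx, k_all, k_cur, kv_idxs));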

Will merge after #14274.

Testing

# regular
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa
# SWA
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/gemma-3-4b/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

Next PRs

  • Introduce the concept of "virtual sequences"
  • Extend llama_kv_cache_unified to support virtual sequences
  • Extend the attention graph to support virtual sequences
  • Enable efficient multi-sequence decoding

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 19, 2025
@ggerganov force-pushed the gg/model-rework-out-ids branch from 1e86597 to 2b940c0 on June 20, 2025 07:16
Base automatically changed from gg/model-rework-out-ids to master on June 20, 2025 07:50
@rgerganov (Collaborator) commented:

I tried this PR with the following change in the RPC backend:

diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash.

In any case, this looks like great progress!
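
For reference, the fnv_hash used in the diff above is a 64-bit FNV-1a over the serialized bytes; a generic version looks roughly like this (the actual helper in ggml-rpc.cpp may differ in details):

#include <cstdint>
#include <cstddef>

// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime
static uint64_t fnv_hash(const uint8_t * data, size_t len) {
    const uint64_t fnv_prime = 0x00000100000001B3ULL;
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis

    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= fnv_prime;
    }
    return hash;
}

Because the input includes raw tensor addresses, two graphs that are structurally identical but allocated at different addresses hash differently, which is consistent with the gemma3 observation above.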

@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 8f1c5e3 to 5f87f28 on June 20, 2025 07:59
@ggerganov (Member, Author) commented:

> However, the hash does change with gemma3 models.

Yes, this is expected. I've applied the change only for the unified cache. The unified+iswa cache is still using ggml_cpy:

https://github.com/ggml-org/llama.cpp/pull/14285/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR1290-R1291

@ggerganov (Member, Author) commented:

Should work with Gemma now as well.

@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 4d0c0ea to db0cd69 on June 20, 2025 09:01
@ggerganov (Member, Author) commented:

The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

@ggerganov (Member, Author) commented:

> The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with a CPU-only build, so the implementation should be good enough, I think.

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.

ref: #8366
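
A toy illustration of these copy semantics (assumed shapes and illustrative names a, b, c; the index tensor is I64):

// a: destination [n_embd, 8], b: source rows [n_embd, 3]
// with c = {5, 0, 2}: row 0 of b goes to row 5 of a,
// row 1 of b to row 0 of a, and row 2 of b to row 2 of a
struct ggml_tensor * c = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, 3);
ggml_set_input(c); // filled with {5, 0, 2} before graph compute
struct ggml_tensor * r = ggml_set_rows(ctx, a, b, c);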
@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from a0c0fb6 to d40f705 on June 21, 2025 06:09
@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from d40f705 to d1da992 on June 21, 2025 06:20
@ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 1031a5d to 14554a8 on June 21, 2025 12:29
@ggerganov marked this pull request as ready for review on June 21, 2025 13:42