ggml : add ggml_set_rows #14274
Conversation
So far so good: #14285
```cpp
switch (src0->type) {
    case GGML_TYPE_F32:
        {
            ggml_compute_forward_set_rows_f32(params, dst);
        } break;
```
We should aim to reuse the existing `cpy`/`dup` code in order to support F32 -> any type, not just F16.
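A possible shape for that reuse on the CPU side, assuming the `from_float` callback from ggml's CPU type traits can be applied here (illustrative sketch, not the actual patch):

```cpp
#include "ggml-cpu.h"

// Hypothetical sketch: convert one F32 source row into a destination row of
// any type by reusing ggml's existing conversion callbacks, instead of
// hard-coding an F32 -> F16 path.
static void set_rows_row_generic(
        const float  * src_row,   // one row of 'b' (always F32)
        void         * dst_row,   // the destination row inside 'a'
        int64_t        ne0,       // number of elements per row
        enum ggml_type dst_type) {
    const ggml_from_float_t from_float = ggml_get_type_traits_cpu(dst_type)->from_float;

    // converts/quantizes the whole row at once; for quantized destination
    // types, ne0 must be a multiple of the block size
    from_float(src_row, dst_row, ne0);
}
```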
Do we need support for quantized data types, or just F16, F32, BF16?
We should support all possible KV cache quantization types (lines 819 to 830 in 28ee6d2):
```cpp
const std::vector<ggml_type> kv_cache_types = {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
    GGML_TYPE_BF16,
    GGML_TYPE_Q8_0,
    GGML_TYPE_Q4_0,
    GGML_TYPE_Q4_1,
    GGML_TYPE_IQ4_NL,
    GGML_TYPE_Q5_0,
    GGML_TYPE_Q5_1,
};
```
Ideally, it should work for all ggml types that also work with `ggml_cpy`.
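For example, coverage could be checked with a loop over that same table (sketch only; `test_set_rows` is a hypothetical helper, not part of this PR, that builds and runs a small graph for one destination type):

```cpp
// Sketch: exercise the new op for every KV cache type.
for (ggml_type type : kv_cache_types) {
    if (!test_set_rows(type)) {
        fprintf(stderr, "ggml_set_rows failed for type %s\n", ggml_type_name(type));
    }
}
```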
```cpp
struct ggml_tensor * ggml_set_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        struct ggml_tensor  * c) {
    GGML_ASSERT(b->ne[2] == c->ne[1]);
    GGML_ASSERT(c->ne[3] == 1);
    GGML_ASSERT(a->type == GGML_TYPE_F16);
    GGML_ASSERT(b->type == GGML_TYPE_F32);
    GGML_ASSERT(c->type == GGML_TYPE_I64);
```
We might want to allow broadcasting `c` into `b`. It would avoid this `ggml_repeat_4d` here (llama.cpp/src/llama-kv-cache-unified.cpp, lines 795 to 799 in a0c0fb6):
```cpp
v_cur = ggml_cont_3d(ctx, v_cur, 1, v_cur->ne[0], v_cur->ne[1]);
kv_idxs = ggml_repeat_4d(ctx, kv_idxs, v_cur->ne[1], v_cur->ne[2], 1, 1);
return ggml_set_rows(ctx, v_view, v_cur, kv_idxs);
```
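For reference, the broadcast could be expressed by relaxing the shape asserts roughly like this (a sketch of the idea only, mirroring how other ggml ops check broadcastability; the exact rule is an assumption):

```cpp
// Hypothetical relaxed checks: let a smaller 'c' be repeated across 'b'
// instead of requiring an explicit ggml_repeat_4d on the caller side.
GGML_ASSERT(c->ne[3] == 1);
GGML_ASSERT(b->ne[2] % c->ne[1] == 0); // c's rows broadcast over b's dim 2
GGML_ASSERT(b->ne[3] % c->ne[2] == 0); // and over b's dim 3
```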
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

ref: ggml-org#8366
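A minimal usage sketch of the op as asserted above (F16 destination, F32 source, I64 indices); this only builds the graph node, and the sizes are arbitrary:

```cpp
#include "ggml.h"

// Build (but do not compute) a node that writes 8 F32 rows into an F16
// destination at the row positions given by 'c'.
struct ggml_init_params ip = {
    /*.mem_size   =*/ 16*1024*1024,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ false,
};
struct ggml_context * ctx = ggml_init(ip);

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 128); // dst: 128 rows of 64
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64,   8); // src:   8 rows of 64
struct ggml_tensor * c = ggml_new_tensor_1d(ctx, GGML_TYPE_I64,  8);      // dst row index per src row

// after filling b and c, each row b[i] is copied into a at row c[i]
struct ggml_tensor * out = ggml_set_rows(ctx, a, b, c);
```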