ggml: adds CONV_2D op and direct GEMM Vulkan implementation #14316

Open · wants to merge 1 commit into master
Conversation

etasnadi (Contributor):

This patch adds support for direct computation of 2D convolution on the Vulkan backend. It is implemented as a custom GEMM that loads the relevant kernel and input data into shared memory, so it does not need to materialize the convolution matrix in global memory with im2col, which saves a lot of memory, similar to how the op is implemented in cuDNN. This approach can in theory produce faster kernels than im2col->matmul, because the full helper matrix never has to be transferred between global memory and registers, and the repeated elements of the (virtual) helper matrix can be pulled from L2.
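
For illustration only, a minimal CPU-side sketch of the index mapping this kind of implicit GEMM relies on: element (crs, npq) of the virtual im2col matrix is decomposed back into kernel and input coordinates on the fly, so only small tiles ever need to be staged in shared memory. All names, the scalar layout, and the helper itself are illustrative, not the actual shader code:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical convolution parameters (names are illustrative only).
struct Conv2DParams {
    uint32_t C, H, W;                 // input channels / height / width
    uint32_t KH, KW;                  // kernel height / width
    uint32_t OH, OW;                  // output height / width
    uint32_t s0, s1, p0, p1, d0, d1;  // strides, paddings, dilations
};

// Returns element (crs, npq) of the *virtual* im2col matrix for batch element n
// without ever materializing that matrix: the flat indices are decomposed into
// (c, kh, kw) and (oh, ow), mapped to an input coordinate, and read directly
// from the input tensor (0 is returned for the implicit zero padding).
float virtual_B(const std::vector<float> & input, const Conv2DParams & p,
                uint32_t n, uint32_t crs, uint32_t npq) {
    const uint32_t c  = crs / (p.KH * p.KW);
    const uint32_t kh = (crs / p.KW) % p.KH;
    const uint32_t kw = crs % p.KW;
    const uint32_t oh = npq / p.OW;
    const uint32_t ow = npq % p.OW;
    const int64_t ih = (int64_t)oh * p.s1 - p.p1 + (int64_t)kh * p.d1;
    const int64_t iw = (int64_t)ow * p.s0 - p.p0 + (int64_t)kw * p.d0;
    if (ih < 0 || ih >= p.H || iw < 0 || iw >= p.W) {
        return 0.0f; // implicit zero padding
    }
    // input assumed laid out as [W, H, C, N], contiguous (ggml-style ne order)
    return input[((n * p.C + c) * p.H + ih) * p.W + iw];
}
```

The GEMM then computes O[k][npq] as the sum over crs of Kernel[k][crs] * B[crs][npq], with both operands tiled through shared memory (K x CRS and CRS x NPQ tiles, matching the shader's Ash/Bsh buffers below).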

The performance is roughly 2x that of im2col->matmul on an RTX 2060 (4.10 TFLOPS vs. 2.15 TFLOPS according to test-backend-ops; the theoretical max is ~6 TFLOPS):

$ GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

  CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       16 runs - 64065.69 us/run - 137.42 GFLOP/run -   2.15 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

$ ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

  CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       46 runs - 21751.26 us/run - 137.42 GFLOP/run -   6.32 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

$ GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_DIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

  CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 30 runs - 33534.17 us/run - 137.42 GFLOP/run -   4.10 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.

The PR also adds support for comparing ops with different implementation graphs in test-backend-ops, so one can compare/test the actual op under development (potentially fused and optimized) against a reference graph, even when the new op does not have a CPU implementation yet. This makes op development faster.
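
Conceptually the check boils down to running the op under test and the reference graph on the same random inputs and comparing the outputs with an error metric. A minimal sketch of such a metric (not the actual test-backend-ops code) could look like:

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error between the output of the op under test and
// the output of the reference graph; small values (e.g. < 1e-6 for f32)
// indicate the two implementations agree. Illustrative only.
double nmse(const std::vector<float> & test, const std::vector<float> & ref) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double)test[i] - (double)ref[i];
        err  += d * d;
        norm += (double)ref[i] * (double)ref[i];
    }
    return norm > 0.0 ? err / norm : err;
}
```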

…mory efficient instant GEMM based Vulkan implementation

* ggml: adds op CONV_2D, ggml_conv_2d_direct,

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs,
@github-actions bot added labels on Jun 21, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@jeffbolznv (Collaborator) left a comment:

Very cool!

@@ -1858,6 +1858,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_im2col_back_f32(params, tensor);
} break;
case GGML_OP_CONV_2D:
{
GGML_ABORT("Op not supported on CPU yet.");
Collaborator:

I think a CPU implementation is generally required.
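
For reference, a naive (unoptimized) CPU version of the op is straightforward. Below is a sketch assuming f32 data and a ggml-style contiguous [W, H, C, N] / [KW, KH, C, OC] layout; a real ggml_compute_forward_conv_2d (name assumed from the existing naming pattern) would of course need to follow the backend's tensor and threading conventions:

```cpp
#include <cstdint>

// Naive direct 2D convolution, f32, no vectorization or threading.
// input:  [W, H, C, N], kernel: [KW, KH, C, OC], output: [OW, OH, OC, N]
// (ggml-style ne order, contiguous; parameter names are illustrative).
void conv_2d_ref(const float * input, const float * kernel, float * output,
                 int64_t W, int64_t H, int64_t C, int64_t N,
                 int64_t KW, int64_t KH, int64_t OC,
                 int s0, int s1, int p0, int p1, int d0, int d1) {
    const int64_t OW = (W + 2*p0 - d0*(KW - 1) - 1) / s0 + 1;
    const int64_t OH = (H + 2*p1 - d1*(KH - 1) - 1) / s1 + 1;
    for (int64_t n = 0; n < N; ++n)
    for (int64_t oc = 0; oc < OC; ++oc)
    for (int64_t oh = 0; oh < OH; ++oh)
    for (int64_t ow = 0; ow < OW; ++ow) {
        float acc = 0.0f;
        for (int64_t c = 0; c < C; ++c)
        for (int64_t kh = 0; kh < KH; ++kh)
        for (int64_t kw = 0; kw < KW; ++kw) {
            const int64_t ih = oh*s1 - p1 + kh*d1;
            const int64_t iw = ow*s0 - p0 + kw*d0;
            if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue; // zero padding
            acc += input [((n*C  + c)*H  + ih)*W  + iw]
                 * kernel[((oc*C + c)*KH + kh)*KW + kw];
        }
        output[((n*OC + oc)*OH + oh)*OW + ow] = acc;
    }
}
```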

const uint32_t Bsh_len = BS_CRS*Bsh_stride;

shared float Ash[Ash_len]; // K x CRS
shared float Bsh[Bsh_len]; // CRS x NPQ
Collaborator:

What is the total shared memory needed? Do we need a runtime check?

Contributor (author):

A 128x16 tile size works well, so 16.125 KB of shared memory is the minimum; I don't think a check is absolutely needed. But it would be good to test, to be sure that we have the required amount in all kernels.

Collaborator:

Vulkan requires a minimum of 16KB, so we probably should have a check.
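
A runtime check could compare the shader's shared-memory footprint against the device limit reported in VkPhysicalDeviceLimits::maxComputeSharedMemorySize. A hedged sketch (the actual ggml-vulkan code already queries the device properties at init, so it would just compare against the cached value; the helper below is illustrative):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Returns true if the device has enough shared (workgroup) memory for the
// conv2d tiles. shmem_bytes would be computed from the chosen BS_K/BS_CRS/
// BS_NPQ tile parameters; the function itself is only a sketch.
bool conv2d_shmem_fits(VkPhysicalDevice dev, uint64_t shmem_bytes) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(dev, &props);
    return shmem_bytes <= props.limits.maxComputeSharedMemorySize;
}
```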


void main(){
if(gl_WorkGroupID.x == gl_NumWorkGroups.x-1 || gl_WorkGroupID.y == gl_NumWorkGroups.y-1){
mainLoopBoundaryCheck();
@jeffbolznv (Collaborator), Jun 21, 2025:

Is it possible to just pass in a boolean true/false here and && it with the condition? I think it'll be more readable than all the macros.

Contributor (author):

I tested adding the bool expression in the loop but it slows down the execution considerably: 3.7 TFLOPS compared to 4.16 TFLOPS with macros.

https://gist.github.com/etasnadi/a8a3a67fc904c86f35de74c3f8ef819c#file-conv2d_mm-comp

Collaborator:

You should write it with the branch in the main function and each side using true/false. The compiler will inline both copies and fold the constants.
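
In C++ terms (the shader itself is GLSL), the suggested pattern looks like the sketch below: one parameterized loop body, two call sites with literal arguments, so the compiler inlines each copy and folds the boundary check to a constant instead of evaluating it per iteration. Names and data are illustrative, not the shader code:

```cpp
#include <cstdio>

// One loop body, parameterized on whether boundary checks are needed.
// Called only with compile-time literals, so each inlined copy has the
// condition folded away.
inline float main_loop(const float * data, int n, int limit, bool boundary_check) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (!boundary_check || i < limit) { // folds to 'true' in the fast path
            acc += data[i];
        }
    }
    return acc;
}

int main() {
    float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    bool last_block = true; // e.g. gl_WorkGroupID.x == gl_NumWorkGroups.x - 1
    float r = last_block ? main_loop(data, 8, 6, true)   // boundary-checked copy
                         : main_loop(data, 8, 6, false); // fast copy, check folded away
    std::printf("%f\n", r);
    return 0;
}
```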

@etasnadi (Contributor, author), Jun 22, 2025:

Thanks, this finally helped to produce better code. I don't want to trigger the pipelines so I pushed the updates to a different branch (etasnadi@fe85b44), but it seems that the FLOPS are much higher:

Vulkan TFLOPS:
device                  direct  indirect
------                  ------  --------
RTX 2060                5.33    3.40
GTX 1060 (Notebook)     2.2     1.73

Command: GGML_VK_CONV_2D_CONFIG=256,128,16,128,16 GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf

@etasnadi (Contributor, author), Jun 22, 2025:

In the meantime I implemented the CUDA version and here's the performance:

CUDA TFLOPS:
device                  direct  indirect
------                  ------  --------
GTX 1060 (Notebook)     2.2     2.5
RTX 2060                5.02    6.14

This suggests that my proposed algorithm is slower than the indirect algorithm when the latter is well optimized, and that the indirect Vulkan kernels can either be optimized further or their parameters are not general enough to perform well on my devices.

Edit: CUDA branch added: etasnadi@c71890e

Contributor (author):

Turns out that the CUDA matmul uses cuBLAS if possible:

cublasSgemm(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
so the 13.6% improvement can be attributed to the matmul tricks cuBLAS uses. Adding vectorized loads would bring my kernel to this performance, but I will do that when I prepare the CUDA patch.

@netrunnereve (Collaborator) commented Jun 21, 2025:

> As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.

On my RX 470 the indirect op is faster as well. IMO it's worth testing with more input and kernel sizes, like what we have for im2col, and the real test is to get this set up with stablediffusion.cpp (though that thing hasn't been updated for months) to see how it does with an actual model.

CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       14 runs - 72823.29 us/run - 137.42 GFLOP/run -   1.89 TFLOPS
CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 11 runs - 96444.18 us/run - 137.42 GFLOP/run -   1.42 TFLOPS

@etasnadi (Contributor, author) commented Jun 21, 2025:

> As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.
>
> On my RX 470 the indirect op is faster as well. IMO it's worth testing with more input and kernel sizes, like what we have for im2col, and the real test is to get this set up with stablediffusion.cpp (though that thing hasn't been updated for months) to see how it does with an actual model.
>
> CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       14 runs - 72823.29 us/run - 137.42 GFLOP/run -   1.89 TFLOPS
> CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 11 runs - 96444.18 us/run - 137.42 GFLOP/run -   1.42 TFLOPS

Sure, older models might introduce other bottlenecks that cause the shader to slow down, but the memory saving is still a considerable advantage. I'm thinking about reimplementing the shader in CUDA so I can profile it with Nsight to see what causes the issue (hopefully it still supports ancient cards).

void mainLoop ## FUNC_NAME_SUFFIX(){\
initReg();\
/* Advance block in CRS dim */\
for(uint32_t B_idx_CRS = 0; B_idx_CRS < NB_CRS; B_idx_CRS++){\
Collaborator:

You might need to add [[unroll]] on loops with constant trip count. Sometimes the compiler will do this automatically, but when there are nested loops sometimes it won't.

Contributor (author):

I annotated all the loops except this one (because the main loop is sequential), and unfortunately I get the same FLOPS.

}

void outProdReg(){
for(uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++){
Collaborator:

Can you use coopmat here to do the outer products in parallel?

Contributor (author):

If you're asking whether the algorithm supports coopmats: yes, we can add that later, but for now I'm focusing on achieving good enough FLOPS relative to what can be achieved with the scalar matmul kernel.
