ggml: adds CONV_2D op and direct GEMM Vulkan implementation #14316

Open · wants to merge 1 commit into master
Conversation

etasnadi (Contributor):

This patch adds support for direct computation of 2D convolution on the Vulkan backend. It is implemented as a custom GEMM that loads the relevant kernel and input data into shared memory, so it does not need to materialize the convolution matrix in global memory with im2col, which saves a lot of memory, similar to how the op is implemented in cuDNN. This approach can in theory produce faster kernels than im2col->matmul, because the full helper matrix never has to be transferred between global memory and registers, and the repeated elements of the (virtual) helper matrix can be pulled from L2.
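
For illustration only, a minimal CPU-side sketch of the index mapping this kind of implicit GEMM relies on: element (crs, npq) of the virtual im2col matrix is decomposed back into kernel and input coordinates on the fly, so only small tiles ever need to be staged in shared memory. All names, the scalar layout, and the helper itself are illustrative, not the actual shader code:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical convolution parameters (names are illustrative only).
struct Conv2DParams {
    uint32_t C, H, W;                 // input channels / height / width
    uint32_t KH, KW;                  // kernel height / width
    uint32_t OH, OW;                  // output height / width
    uint32_t s0, s1, p0, p1, d0, d1;  // strides, paddings, dilations
};

// Returns element (crs, npq) of the *virtual* im2col matrix for batch element n
// without ever materializing that matrix: the flat indices are decomposed into
// (c, kh, kw) and (oh, ow), mapped to an input coordinate, and read directly
// from the input tensor (0 is returned for the implicit zero padding).
float virtual_B(const std::vector<float> & input, const Conv2DParams & p,
                uint32_t n, uint32_t crs, uint32_t npq) {
    const uint32_t c  = crs / (p.KH * p.KW);
    const uint32_t kh = (crs / p.KW) % p.KH;
    const uint32_t kw = crs % p.KW;
    const uint32_t oh = npq / p.OW;
    const uint32_t ow = npq % p.OW;
    const int64_t ih = (int64_t)oh * p.s1 - p.p1 + (int64_t)kh * p.d1;
    const int64_t iw = (int64_t)ow * p.s0 - p.p0 + (int64_t)kw * p.d0;
    if (ih < 0 || ih >= p.H || iw < 0 || iw >= p.W) {
        return 0.0f; // implicit zero padding
    }
    // input assumed laid out as [W, H, C, N], contiguous (ggml-style ne order)
    return input[((n * p.C + c) * p.H + ih) * p.W + iw];
}
```

The GEMM then computes O[k][npq] as the sum over crs of Kernel[k][crs] * B[crs][npq], with both operands tiled through shared memory (K x CRS and CRS x NPQ tiles, matching the shader's Ash/Bsh buffers below).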

The performance is roughly 2x that of im2col->matmul on an RTX 2060 (4.10 TFLOPS vs. 2.15 TFLOPS according to test-backend-ops; the theoretical max is ~6 TFLOPS):

$ GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

  CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       16 runs - 64065.69 us/run - 137.42 GFLOP/run -   2.15 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

$ ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

  CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       46 runs - 21751.26 us/run - 137.42 GFLOP/run -   6.32 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

$ GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_DIRECT_IMPL -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

  CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 30 runs - 33534.17 us/run - 137.42 GFLOP/run -   4.10 TFLOPS
  Backend Vulkan0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.

The PR also adds support for comparing ops with different implementation graphs in test-backend-ops, so one can compare/test the actual op under development (potentially fused and optimized) against a reference graph, even when the new op does not have a CPU implementation yet. This makes op development faster.
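
Conceptually the check boils down to running the op under test and the reference graph on the same random inputs and comparing the outputs with an error metric. A minimal sketch of such a metric (not the actual test-backend-ops code) could look like:

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error between the output of the op under test and
// the output of the reference graph; small values (e.g. < 1e-6 for f32)
// indicate the two implementations agree. Illustrative only.
double nmse(const std::vector<float> & test, const std::vector<float> & ref) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double)test[i] - (double)ref[i];
        err  += d * d;
        norm += (double)ref[i] * (double)ref[i];
    }
    return norm > 0.0 ? err / norm : err;
}
```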

…mory efficient instant GEMM based Vulkan implementation

* ggml: adds op CONV_2D, ggml_conv_2d_direct,

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs,
@github-actions bot added labels on Jun 21, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@jeffbolznv (Collaborator) left a comment:

Very cool!

@@ -1858,6 +1858,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_im2col_back_f32(params, tensor);
} break;
case GGML_OP_CONV_2D:
{
GGML_ABORT("Op not supported on CPU yet.");
Collaborator:

I think a CPU implementation is generally required.
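
For reference, a naive (unoptimized) CPU version of the op is straightforward. Below is a sketch assuming f32 data and a ggml-style contiguous [W, H, C, N] / [KW, KH, C, OC] layout; a real ggml_compute_forward_conv_2d (name assumed from the existing naming pattern) would of course need to follow the backend's tensor and threading conventions:

```cpp
#include <cstdint>

// Naive direct 2D convolution, f32, no vectorization or threading.
// input:  [W, H, C, N], kernel: [KW, KH, C, OC], output: [OW, OH, OC, N]
// (ggml-style ne order, contiguous; parameter names are illustrative).
void conv_2d_ref(const float * input, const float * kernel, float * output,
                 int64_t W, int64_t H, int64_t C, int64_t N,
                 int64_t KW, int64_t KH, int64_t OC,
                 int s0, int s1, int p0, int p1, int d0, int d1) {
    const int64_t OW = (W + 2*p0 - d0*(KW - 1) - 1) / s0 + 1;
    const int64_t OH = (H + 2*p1 - d1*(KH - 1) - 1) / s1 + 1;
    for (int64_t n = 0; n < N; ++n)
    for (int64_t oc = 0; oc < OC; ++oc)
    for (int64_t oh = 0; oh < OH; ++oh)
    for (int64_t ow = 0; ow < OW; ++ow) {
        float acc = 0.0f;
        for (int64_t c = 0; c < C; ++c)
        for (int64_t kh = 0; kh < KH; ++kh)
        for (int64_t kw = 0; kw < KW; ++kw) {
            const int64_t ih = oh*s1 - p1 + kh*d1;
            const int64_t iw = ow*s0 - p0 + kw*d0;
            if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue; // zero padding
            acc += input [((n*C  + c)*H  + ih)*W  + iw]
                 * kernel[((oc*C + c)*KH + kh)*KW + kw];
        }
        output[((n*OC + oc)*OH + oh)*OW + ow] = acc;
    }
}
```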

const uint32_t Bsh_len = BS_CRS*Bsh_stride;

shared float Ash[Ash_len]; // K x CRS
shared float Bsh[Bsh_len]; // CRS x NPQ
Collaborator:

What is the total shared memory needed? Do we need a runtime check?

Contributor (author):

A 128x16 tile size works well, so 16.125 KB of shared memory is the minimum; I don't think a check is absolutely needed. But it would be good to test, to be sure that we have the required amount in all kernels.

Collaborator:

Vulkan requires a minimum of 16KB, so we probably should have a check.
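
A runtime check could compare the shader's shared-memory footprint against the device limit reported in VkPhysicalDeviceLimits::maxComputeSharedMemorySize. A hedged sketch (the actual ggml-vulkan code already queries the device properties at init, so it would just compare against the cached value; the helper below is illustrative):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Returns true if the device has enough shared (workgroup) memory for the
// conv2d tiles. shmem_bytes would be computed from the chosen BS_K/BS_CRS/
// BS_NPQ tile parameters; the function itself is only a sketch.
bool conv2d_shmem_fits(VkPhysicalDevice dev, uint64_t shmem_bytes) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(dev, &props);
    return shmem_bytes <= props.limits.maxComputeSharedMemorySize;
}
```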


void main(){
if(gl_WorkGroupID.x == gl_NumWorkGroups.x-1 || gl_WorkGroupID.y == gl_NumWorkGroups.y-1){
mainLoopBoundaryCheck();
@jeffbolznv (Collaborator), Jun 21, 2025:

Is it possible to just pass in a boolean true/false here and && it with the condition? I think it'll be more readable than all the macros.

Contributor (author):

I tested adding the bool expression in the loop but it slows down the execution considerably: 3.7 TFLOPS compared to 4.16 TFLOPS with macros.

https://gist.github.com/etasnadi/a8a3a67fc904c86f35de74c3f8ef819c#file-conv2d_mm-comp

Collaborator:

You should write it with the branch in the main function and each side using true/false. The compiler will inline both copies and fold the constants.
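
In C++ terms (the shader itself is GLSL), the suggested pattern looks like the sketch below: one parameterized loop body, two call sites with literal arguments, so the compiler inlines each copy and folds the boundary check to a constant instead of evaluating it per iteration. Names and data are illustrative, not the shader code:

```cpp
#include <cstdio>

// One loop body, parameterized on whether boundary checks are needed.
// Called only with compile-time literals, so each inlined copy has the
// condition folded away.
inline float main_loop(const float * data, int n, int limit, bool boundary_check) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (!boundary_check || i < limit) { // folds to 'true' in the fast path
            acc += data[i];
        }
    }
    return acc;
}

int main() {
    float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    bool last_block = true; // e.g. gl_WorkGroupID.x == gl_NumWorkGroups.x - 1
    float r = last_block ? main_loop(data, 8, 6, true)   // boundary-checked copy
                         : main_loop(data, 8, 6, false); // fast copy, check folded away
    std::printf("%f\n", r);
    return 0;
}
```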

@etasnadi (Contributor, author), Jun 22, 2025:

Thanks, this finally helped to produce better code. I don't want to trigger the pipelines so I pushed the updates to a different branch (etasnadi@fe85b44), but it seems that the FLOPS are much higher:

Vulkan TFLOPS:
device                  direct  indirect
------                  ------  --------
RTX 2060                5.33    3.40
GTX 1060 (Notebook)     2.2     1.73

Command: GGML_VK_CONV_2D_CONFIG=256,128,16,128,16 GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf

@etasnadi (Contributor, author), Jun 22, 2025:

In the meantime I implemented the CUDA version and here's the performance:

CUDA TFLOPS:
device                  direct  indirect
------                  ------  --------
GTX 1060 (Notebook)     2.2     2.5
RTX 2060                5.02    6.14

This suggests that my proposed algorithm is slower than the indirect algorithm when the latter is well optimized, and that the indirect Vulkan kernels can either be optimized further or their parameters are not general enough to perform well on my devices.

Edit: CUDA branch added: etasnadi@c71890e

Contributor (author):

Turns out that the CUDA matmul uses cuBLAS if possible:

cublasSgemm(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
so the 13.6% improvement can be attributed to the matmul tricks cuBLAS uses. Adding vectorized loads would bring my kernel to this performance, but I will do that when I prepare the CUDA patch.

@netrunnereve (Collaborator) commented Jun 21, 2025:

> As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.

On my RX 470 the indirect op is faster as well. IMO it's worth testing with more input and kernel sizes, like what we have for im2col, and the real test is to get this set up with stablediffusion.cpp (though that thing hasn't been updated for months) to see how it does with an actual model.

CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       14 runs - 72823.29 us/run - 137.42 GFLOP/run -   1.89 TFLOPS
CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 11 runs - 96444.18 us/run - 137.42 GFLOP/run -   1.42 TFLOPS

@etasnadi (Contributor, author) commented Jun 21, 2025:

> As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.
>
> On my RX 470 the indirect op is faster as well. IMO it's worth testing with more input and kernel sizes, like what we have for im2col, and the real test is to get this set up with stablediffusion.cpp (though that thing hasn't been updated for months) to see how it does with an actual model.
>
> CONV_2D_INDIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                       14 runs - 72823.29 us/run - 137.42 GFLOP/run -   1.89 TFLOPS
> CONV_2D_DIRECT_IMPL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 11 runs - 96444.18 us/run - 137.42 GFLOP/run -   1.42 TFLOPS

Sure, older models might introduce other bottlenecks that cause the shader to slow down, but the memory saving is still a considerable advantage. I'm thinking about reimplementing the shader in CUDA so I can profile it with Nsight to see what causes the issue (hopefully it still supports ancient cards).

void mainLoop ## FUNC_NAME_SUFFIX(){\
initReg();\
/* Advance block in CRS dim */\
for(uint32_t B_idx_CRS = 0; B_idx_CRS < NB_CRS; B_idx_CRS++){\
Collaborator:

You might need to add [[unroll]] on loops with constant trip count. Sometimes the compiler will do this automatically, but when there are nested loops sometimes it won't.

Contributor (author):

I annotated all the loops except this one (because the main loop is sequential), and unfortunately I get the same FLOPS.

}

void outProdReg(){
for(uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++){
Collaborator:

Can you use coopmat here to do the outer products in parallel?

Contributor (author):

If you're asking whether the algorithm supports coopmats: yes, we can add that later, but for now I'm focusing on achieving good enough FLOPS relative to what can be achieved with the scalar matmul kernel.
