
Bug: false sharing in threadpool makes ggml_barrier() needlessly slow #9588

Closed
@wtarreau

Description

What happened?

I was surprised to see libgomp take most of the CPU in perf top on an 80-core ARM server with 6 DDR4 channels. I found that I could disable libgomp by building with GGML_NO_OPENMP; the CPU time then moved to ggml_barrier(), on a load instruction. Looking closer, I found the cause: the barrier is implemented with two separate variables that unfortunately sit in the same cache line. As a result, all the threads waiting for the last one to arrive prevent the remaining threads from quickly incrementing the thread count, and the cache line bounces back and forth between all cores. In addition, all threads perform an atomic write to the barrier_passed counter, while only one of them ever writes a non-zero value there, causing heavy serialization again.
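To make the idea concrete, here is a minimal sketch of the two remedies (the struct, field and function names are simplified stand-ins, not the exact ggml threadpool code; the attached patch is what actually applies): each counter is forced onto its own cache line, and the waiters poll n_barrier_passed with a plain atomic load instead of an atomic read-modify-write.

/* Sketch only -- simplified names, not the exact ggml threadpool code. */
#include <stdatomic.h>

#define CACHE_LINE_SIZE 64

struct barrier_state {
    /* Force each counter onto its own cache line so that waiters
     * spinning on n_barrier_passed no longer steal the line that
     * arriving threads must modify to bump n_barrier. */
    _Alignas(CACHE_LINE_SIZE) atomic_int n_barrier;
    _Alignas(CACHE_LINE_SIZE) atomic_int n_barrier_passed;
};

static void barrier_wait(struct barrier_state * bs, int n_threads) {
    if (n_threads == 1) {
        return;
    }

    /* Plain atomic load: the problematic code effectively did an
     * atomic read-modify-write here, so every thread wrote to the
     * line on every poll even though only the last arrival ever
     * changes the value. */
    int passed_old = atomic_load(&bs->n_barrier_passed);

    if (atomic_fetch_add(&bs->n_barrier, 1) == n_threads - 1) {
        /* Last thread to arrive: reset the arrival count and release
         * the waiters with a single write to n_barrier_passed. */
        atomic_store(&bs->n_barrier, 0);
        atomic_fetch_add(&bs->n_barrier_passed, 1);
    } else {
        /* Other threads spin read-only until the generation changes. */
        while (atomic_load(&bs->n_barrier_passed) == passed_old) {
            /* a cpu "pause"/"yield" hint would go here */
        }
    }
}

With that layout the arriving threads only ever write the first line and the waiters only ever read the second one, so the line being incremented is no longer pulled away by the spinners.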

Addressing only half of the issue obviously doesn't fully unlock the performance; doing both at once, however, gives significant gains (+21% vs base, +3% vs openmp) on text generation, tested like this:

$ ./llama-cli --temp 0.1 -m ../models/Mistral-Small-Instruct-2409-Q5_K_M.gguf -p "write about anything" -n 40

The results are:

  • base (openmp): 7.04 token/s
  • GGML_NO_OPENMP=1: 5.98 token/s
  • GGML_NO_OPENMP=1 + fix: 7.27 token/s

On smaller models it's even more visible:

$ ./llama-cli --temp 0.1 -m ../models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p "write about anything" -n 100

  • base (openmp): 20.4 token/s
  • GGML_NO_OPENMP=1: 16.01 token/s
  • GGML_NO_OPENMP=1 + fix: 20.87 token/s (+30.3% and +2.3% respectively)

I'm not observing any relevant gain on x86, however; the only x86 machines I have access to have few cores and their performance is limited by DRAM bandwidth. It may well make a difference on EPYC parts with multiple CCDs, though.

I'm attaching the patch that addresses it here. I could send a PR if needed, but that takes much more time and I won't have time for it until next weekend; I know you don't mind applying patches once they're explained, Georgi.

0001-threadpool-avoid-false-sharing-between-n_barrier_pas.patch.txt

Name and Version

$ ./llama-cli --version
version: 3802 (a5b57b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Before (with openmp):
llama_perf_sampler_print:    sampling time =       2.01 ms /    44 runs   (    0.05 ms per token, 21857.92 tokens per second)
llama_perf_context_print:        load time =    2201.99 ms
llama_perf_context_print: prompt eval time =     229.43 ms /     4 tokens (   57.36 ms per token,    17.43 tokens per second)
llama_perf_context_print:        eval time =    5537.76 ms /    39 runs   (  141.99 ms per token,     7.04 tokens per second)
llama_perf_context_print:       total time =    5771.76 ms /    43 tokens


Before (without openmp):
llama_perf_sampler_print:    sampling time =       2.03 ms /    44 runs   (    0.05 ms per token, 21664.20 tokens per second)
llama_perf_context_print:        load time =    2253.97 ms
llama_perf_context_print: prompt eval time =     258.66 ms /     4 tokens (   64.66 ms per token,    15.46 tokens per second)
llama_perf_context_print:        eval time =    6520.80 ms /    39 runs   (  167.20 ms per token,     5.98 tokens per second)
llama_perf_context_print:       total time =    6787.95 ms /    43 tokens


After (without openmp):
llama_perf_sampler_print:    sampling time =       2.06 ms /    44 runs   (    0.05 ms per token, 21369.60 tokens per second)
llama_perf_context_print:        load time =    2189.62 ms
llama_perf_context_print: prompt eval time =     231.70 ms /     4 tokens (   57.92 ms per token,    17.26 tokens per second)
llama_perf_context_print:        eval time =    5362.74 ms /    39 runs   (  137.51 ms per token,     7.27 tokens per second)
llama_perf_context_print:       total time =    5602.92 ms /    43 tokens
