
Bug: false sharing in threadpool makes ggml_barrier() needlessly slow #9588

Closed
@wtarreau

Description

What happened?

I was surprised to see libgomp take most of the CPU in perf top on an 80-core ARM server with 6 DDR4 channels. I found that I could disable libgomp by building with GGML_NO_OPENMP; the CPU time then moved to ggml_barrier(), on a load instruction. Looking closer, I found the cause: the barrier is implemented with two separate variables that unfortunately sit in the same cache line. As a result, all the threads waiting for the last one to arrive prevent the remaining threads from quickly incrementing the thread count, and the cache line bounces back and forth between all cores. In addition, all threads perform an atomic write to the barrier_passed counter, while only one of them ever writes a non-zero value there, causing heavy serialization again.
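To make the idea concrete, here is a minimal sketch of the two remedies (the struct, field and function names are simplified stand-ins, not the exact ggml threadpool code; the attached patch is what actually applies): each counter is forced onto its own cache line, and the waiters poll n_barrier_passed with a plain atomic load instead of an atomic read-modify-write.

/* Sketch only -- simplified names, not the exact ggml threadpool code. */
#include <stdatomic.h>

#define CACHE_LINE_SIZE 64

struct barrier_state {
    /* Force each counter onto its own cache line so that waiters
     * spinning on n_barrier_passed no longer steal the line that
     * arriving threads must modify to bump n_barrier. */
    _Alignas(CACHE_LINE_SIZE) atomic_int n_barrier;
    _Alignas(CACHE_LINE_SIZE) atomic_int n_barrier_passed;
};

static void barrier_wait(struct barrier_state * bs, int n_threads) {
    if (n_threads == 1) {
        return;
    }

    /* Plain atomic load: the problematic code effectively did an
     * atomic read-modify-write here, so every thread wrote to the
     * line on every poll even though only the last arrival ever
     * changes the value. */
    int passed_old = atomic_load(&bs->n_barrier_passed);

    if (atomic_fetch_add(&bs->n_barrier, 1) == n_threads - 1) {
        /* Last thread to arrive: reset the arrival count and release
         * the waiters with a single write to n_barrier_passed. */
        atomic_store(&bs->n_barrier, 0);
        atomic_fetch_add(&bs->n_barrier_passed, 1);
    } else {
        /* Other threads spin read-only until the generation changes. */
        while (atomic_load(&bs->n_barrier_passed) == passed_old) {
            /* a cpu "pause"/"yield" hint would go here */
        }
    }
}

With that layout the arriving threads only ever write the first line and the waiters only ever read the second one, so the line being incremented is no longer pulled away by the spinners.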

Addressing only half of the issue obviously doesn't fully unlock the performance; doing both at once, however, gives significant gains (+21% vs base, +3% vs openmp) on text generation, tested like this:

$ ./llama-cli --temp 0.1 -m ../models/Mistral-Small-Instruct-2409-Q5_K_M.gguf -p "write about anything" -n 40

The results are:

  • base (openmp): 7.04 token/s
  • GGML_NO_OPENMP=1: 5.98 token/s
  • GGML_NO_OPENMP=1 + fix: 7.27 token/s

On smaller models it's even more visible:

$ ./llama-cli --temp 0.1 -m ../models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p "write about anything" -n 100

  • base (openmp): 20.4 token/s
  • GGML_NO_OPENMP=1: 16.01 token/s
  • GGML_NO_OPENMP=1 + fix: 20.87 token/s (+30.3% and +2.3% respectively)

I'm not observing any relevant gain on x86, however; the only x86 machines I have access to have few cores and their performance is limited by DRAM bandwidth. It may well make a difference on EPYC parts with multiple CCDs, though.

I'm attaching the patch that addresses it here. I could send a PR if needed, but that takes much more time and I won't have time for it until next weekend; I know you don't mind applying patches once they're explained, Georgi.

0001-threadpool-avoid-false-sharing-between-n_barrier_pas.patch.txt

Name and Version

$ ./llama-cli --version
version: 3802 (a5b57b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Before (with openmp):
llama_perf_sampler_print:    sampling time =       2.01 ms /    44 runs   (    0.05 ms per token, 21857.92 tokens per second)
llama_perf_context_print:        load time =    2201.99 ms
llama_perf_context_print: prompt eval time =     229.43 ms /     4 tokens (   57.36 ms per token,    17.43 tokens per second)
llama_perf_context_print:        eval time =    5537.76 ms /    39 runs   (  141.99 ms per token,     7.04 tokens per second)
llama_perf_context_print:       total time =    5771.76 ms /    43 tokens


Before (without openmp):
llama_perf_sampler_print:    sampling time =       2.03 ms /    44 runs   (    0.05 ms per token, 21664.20 tokens per second)
llama_perf_context_print:        load time =    2253.97 ms
llama_perf_context_print: prompt eval time =     258.66 ms /     4 tokens (   64.66 ms per token,    15.46 tokens per second)
llama_perf_context_print:        eval time =    6520.80 ms /    39 runs   (  167.20 ms per token,     5.98 tokens per second)
llama_perf_context_print:       total time =    6787.95 ms /    43 tokens


After (without openmp):
llama_perf_sampler_print:    sampling time =       2.06 ms /    44 runs   (    0.05 ms per token, 21369.60 tokens per second)
llama_perf_context_print:        load time =    2189.62 ms
llama_perf_context_print: prompt eval time =     231.70 ms /     4 tokens (   57.92 ms per token,    17.26 tokens per second)
llama_perf_context_print:        eval time =    5362.74 ms /    39 runs   (  137.51 ms per token,     7.27 tokens per second)
llama_perf_context_print:       total time =    5602.92 ms /    43 tokens
