Support of Docker/Kubernetes CPU limit/reservation #170

@bfreuden

Description

Feature request

Docker (swarm) and Kubernetes have a way to limit CPU usage of a container.

Docker (swarm):

version: '3.4'
services:
    text-embeddings:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
        deploy:
            resources:
                limits:
                    cpus: '2'

Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: text-embeddings
spec:
  containers:
  - name: text-embeddings
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
    resources:
      requests:
        cpu: "2000m"
      limits:
        cpu: "2000m"

However, for this to work optimally (see the motivation below), the application inside the container has to be aware of the limit and size its thread pools accordingly.
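A minimal sketch of what that detection could look like in Rust (the server's language), assuming a cgroup v2 host; `effective_cpus` is a hypothetical helper, not an existing text-embeddings-inference function:

```rust
use std::fs;
use std::thread;

/// Hypothetical helper: best-effort effective CPU count under a
/// cgroup v2 CPU quota.
fn effective_cpus() -> usize {
    // Host fallback (also used when cpu.max contains "max", i.e. no limit).
    let host = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // cgroup v2 exposes "<quota> <period>" in microseconds,
    // e.g. "200000 100000" for cpus=2. cgroup v1 uses
    // cpu.cfs_quota_us / cpu.cfs_period_us instead (not handled here).
    if let Ok(contents) = fs::read_to_string("/sys/fs/cgroup/cpu.max") {
        let mut parts = contents.split_whitespace();
        if let (Some(quota), Some(period)) = (parts.next(), parts.next()) {
            if let (Ok(q), Ok(p)) = (quota.parse::<f64>(), period.parse::<f64>()) {
                if p > 0.0 {
                    return ((q / p).ceil() as usize).clamp(1, host);
                }
            }
        }
    }
    host
}

fn main() {
    println!("effective CPUs: {}", effective_cpus());
}
```

Reading cpu.max directly keeps the behavior explicit rather than depending on what a particular toolchain version does.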

Motivation

If thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).

For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following Apache Bench command (a request containing a single 17 KB text, processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):

ab -k -n24 -c4 -p req-en-17k-b1-huggingface.json -T application/json localhost:18083/embed
| configuration     | reqs/sec | avg. CPU usage (top) |
|-------------------|----------|----------------------|
| no cpu limit      | 16.87    | 465%                 |
| cpuset=0,1        | 11.48    | 185%                 |
| cpus=2            | 1.82     | 200%                 |
| cpus=2 + env vars | 11.03    | 150%                 |

You can see in row 3 that with cpus=2 (without environment variables), performance is 6 times slower than with cpuset=0,1.
The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option.
You can see in row 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with cpuset=0,1).

no cpu limit configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024

cpuset=0,1 configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024
        cpuset: "0,1"

cpus=2 configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024
        deploy:
            resources:
                limits:
                    cpus: '2'

cpus=2 + env vars configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024
            # interesting variables below
            - TOKIO_WORKER_THREADS=1
            - NUM_RAYON_THREADS=1
            - MKL_NUM_THREADS=1
            - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
            - OMP_NUM_THREADS=1
            - MKL_DYNAMIC="FALSE"
            - OMP_DYNAMIC="FALSE"            
        deploy:
            resources:
                limits:
                    cpus: '2'
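If the server detected the quota itself, it could size its pools from one value instead of requiring these library-specific variables. A hedged sketch, assuming the server uses Tokio for I/O and Rayon for inference (as the TOKIO_WORKER_THREADS and NUM_RAYON_THREADS variables suggest); `effective_cpus()` is the hypothetical detector sketched earlier:

```rust
// Sketch: derive pool sizes from the detected quota instead of
// seven hand-set environment variables.
fn init_pools() -> std::io::Result<tokio::runtime::Runtime> {
    let cpus = effective_cpus();

    // Rayon: the CPU-bound inference pool should match the quota.
    rayon::ThreadPoolBuilder::new()
        .num_threads(cpus)
        .build_global()
        .expect("rayon global pool already initialized");

    // Tokio: mostly I/O-bound here; a single worker often suffices
    // when the heavy lifting happens on the Rayon pool.
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .enable_all()
        .build()
}
```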

Your contribution

I'm afraid I can't do much more than this.

Please note that I don't really know which of my environment variables actually have an impact on performance, since I am not familiar with the internals of text-embeddings-inference.

The issue is well known, though:
https://danluu.com/cgroup-throttling/
https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb

Some ecosystems have begun to take this into account.

For instance, since Python 3.13 you can make Python believe it has fewer CPUs using an environment variable:

docker run --rm -it --name py13 -e PYTHON_CPU_COUNT=2  python:3.13.0a4-slim  python -c "import os; print(os.cpu_count())"
2

Java does this automatically (the JVM has been container-aware since JDK 10):

docker run --rm -it --name java23 --entrypoint /bin/bash openjdk:23-slim
root@31e4b2de8fad:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
8

docker run --rm -it --name java23 --cpus=2 --entrypoint /bin/bash openjdk:23-slim
root@1935b08ebcf7:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
2
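For what it's worth, recent Rust toolchains do something similar: std::thread::available_parallelism takes cgroup CPU quotas into account on Linux, so a sketch like the one below should print 2 under --cpus=2 (assuming a recent toolchain; behavior varies by version):

```rust
fn main() {
    // On recent Rust toolchains this honors cgroup CPU quotas on Linux,
    // so under `docker run --cpus=2` it should print 2 instead of 8.
    let n = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("available parallelism: {n}");
}
```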
