### Feature request
Docker (Swarm) and Kubernetes both provide a way to limit the CPU usage of a container.

Docker (Swarm):

```yaml
version: '3.4'
services:
  text-embeddings:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
    deploy:
      resources:
        limits:
          cpus: '2'
```
Kubernetes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: text-embeddings
spec:
  containers:
    - name: text-embeddings
      image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
      resources:
        requests:
          cpu: "2000m"
        limits:
          cpu: "2000m"
```
However, for this to work optimally (see motivation below), the application inside the container has to be aware of the limit and size its thread pools accordingly.
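For illustration, here is a minimal sketch of how a Rust program could detect such a limit; this is not text-embeddings-inference's actual code, and it assumes cgroup v2 with the usual /sys/fs/cgroup mount, falling back to the OS-reported count when no quota is set:

```rust
use std::fs;
use std::thread;

/// Derive an effective CPU count from the container's cgroup v2 CPU quota
/// instead of trusting the host's core count (sketch, assumptions above).
fn effective_cpus() -> usize {
    // cpu.max contains "<quota> <period>", e.g. "200000 100000" for cpus=2,
    // or "max 100000" when no limit is set (parsing "max" fails on purpose,
    // so we fall through to the fallback below).
    if let Ok(content) = fs::read_to_string("/sys/fs/cgroup/cpu.max") {
        let mut parts = content.split_whitespace();
        if let (Some(quota), Some(period)) = (parts.next(), parts.next()) {
            if let (Ok(q), Ok(p)) = (quota.parse::<f64>(), period.parse::<f64>()) {
                if p > 0.0 {
                    return (q / p).ceil() as usize;
                }
            }
        }
    }
    // Fallback: whatever the OS reports (the host's cores inside a container).
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    println!("effective CPUs: {}", effective_cpus());
}
```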
### Motivation
If the thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).
For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following Apache Bench command (a request containing a single 17 KB text, processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):

```console
$ ab -k -n24 -c4 -p req-en-17k-b1-huggingface.json -T application/json localhost:18083/embed
```
| configuration | reqs/sec | avg. CPU usage (top) |
|---|---|---|
| no cpu limit | 16.87 | 465% |
| cpuset=0,1 | 11.48 | 185% |
| cpus=2 | 1.82 | 200% |
| cpus=2 + env vars | 11.03 | 150% |
You can see on row 3 that with `cpus=2` (without environment variables), performance is 6 times slower than with `cpuset=0,1`. The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option. You can see on row 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with `cpuset=0,1`). (With cgroup v2, the throttling itself can be confirmed by watching the nr_throttled counter in the container's /sys/fs/cgroup/cpu.stat.)
**no cpu limit** configuration:

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
```
**cpuset=0,1** configuration:

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
    cpuset: "0,1"
```
**cpus=2** configuration:

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
    deploy:
      resources:
        limits:
          cpus: '2'
```
**cpus=2 + env vars** configuration:

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - TOKIO_WORKER_THREADS=1
      - NUM_RAYON_THREADS=1
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - OMP_NUM_THREADS=1
      - MKL_DYNAMIC="FALSE"
      - OMP_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```
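Those variables target the thread pools of the runtimes involved (Tokio, Rayon, MKL, OpenMP). For what it's worth, here is a hedged sketch of how a Rust service could apply such a limit programmatically instead of through environment variables; it assumes the tokio and rayon crates plus an effective_cpus() helper like the one sketched above, and is not text-embeddings-inference's actual startup code:

```rust
use rayon::ThreadPoolBuilder;

fn main() {
    // e.g. the value derived from the cgroup quota (hypothetical helper above)
    let cpus = 2;

    // Cap the Rayon pool used for CPU-bound work
    // (the programmatic counterpart of a *_NUM_THREADS variable).
    ThreadPoolBuilder::new()
        .num_threads(cpus)
        .build_global()
        .expect("failed to build global Rayon pool");

    // Cap the Tokio async runtime (counterpart of TOKIO_WORKER_THREADS).
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(cpus)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    rt.block_on(async {
        // serve requests here
    });
}
```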
### Your contribution
I'm afraid I can't do much more than this. Please note that I don't really know which of my environment variables actually have an impact on performance, since I am totally unaware of the internals of text-embeddings-inference.
The issue is known though:

- https://danluu.com/cgroup-throttling/
- https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb
Some ecosystems have begun to take this into account. For instance, since Python 3.13 you can make Python believe it has fewer CPUs using an environment variable:

```console
$ docker run --rm -it --name py13 -e PYTHON_CPU_COUNT=2 python:3.13.0a4-slim python -c "import os; print(os.cpu_count())"
2
```
Java does this automatically since Java 15:

```console
$ docker run --rm -it --name java23 --entrypoint /bin/bash openjdk:23-slim
root@31e4b2de8fad:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
8

$ docker run --rm -it --name java23 --cpus=2 --entrypoint /bin/bash openjdk:23-slim
root@1935b08ebcf7:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
2
```
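Rust has similar support in its standard library: on recent toolchains, std::thread::available_parallelism() also inspects Linux cgroup CPU quotas, so a sketch like the one below should print 2 inside a --cpus=2 container (the exact behaviour depends on the toolchain version and the cgroup setup):

```rust
use std::thread;

fn main() {
    // Inside `docker run --cpus=2 ...` this reports 2 on a recent toolchain,
    // because available_parallelism() takes cgroup CPU quotas into account.
    match thread::available_parallelism() {
        Ok(n) => println!("available parallelism: {}", n),
        Err(e) => eprintln!("could not determine parallelism: {}", e),
    }
}
```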