Support of Docker/Kubernetes CPU limit/reservation #170

@bfreuden

Description

Feature request

Docker (swarm) and Kubernetes have a way to limit CPU usage of a container.

Docker (swarm):

version: '3.4'
services:
    text-embeddings:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
        deploy:
            resources:
                limits:
                    cpus: '2'

Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: text-embeddings
spec:
  containers:
  - name: text-embeddings
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
    resources:
      requests:
        cpu: "2000m"
      limits:
        cpu: "2000m"

However, for this to work optimally (see the motivation below), the application inside the container has to be aware of the limit and size its thread pools accordingly.
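A minimal sketch of what that detection could look like in Rust (the server's language), assuming a cgroup v2 host; `effective_cpus` is a hypothetical helper, not an existing text-embeddings-inference function:

```rust
use std::fs;
use std::thread;

/// Hypothetical helper: best-effort effective CPU count under a
/// cgroup v2 CPU quota.
fn effective_cpus() -> usize {
    // Host fallback (also used when cpu.max contains "max", i.e. no limit).
    let host = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // cgroup v2 exposes "<quota> <period>" in microseconds,
    // e.g. "200000 100000" for cpus=2. cgroup v1 uses
    // cpu.cfs_quota_us / cpu.cfs_period_us instead (not handled here).
    if let Ok(contents) = fs::read_to_string("/sys/fs/cgroup/cpu.max") {
        let mut parts = contents.split_whitespace();
        if let (Some(quota), Some(period)) = (parts.next(), parts.next()) {
            if let (Ok(q), Ok(p)) = (quota.parse::<f64>(), period.parse::<f64>()) {
                if p > 0.0 {
                    return ((q / p).ceil() as usize).clamp(1, host);
                }
            }
        }
    }
    host
}

fn main() {
    println!("effective CPUs: {}", effective_cpus());
}
```

Reading cpu.max directly keeps the behavior explicit rather than depending on what a particular toolchain version does.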

Motivation

If thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).

For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following Apache Bench command (a request containing a single 17 KB text, processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):

ab -k -n24 -c4 -p req-en-17k-b1-huggingface.json -T application/json localhost:18083/embed
| configuration     | reqs/sec | avg. CPU usage (top) |
|-------------------|----------|----------------------|
| no cpu limit      | 16.87    | 465%                 |
| cpuset=0,1        | 11.48    | 185%                 |
| cpus=2            | 1.82     | 200%                 |
| cpus=2 + env vars | 11.03    | 150%                 |

You can see in row 3 that with cpus=2 (without environment variables), performance is 6 times slower than with cpuset=0,1.
The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option.
You can see in row 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with cpuset=0,1).

no cpu limit configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024

cpuset=0,1 configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024
        cpuset: "0,1"

cpus=2 configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024
        deploy:
            resources:
                limits:
                    cpus: '2'

cpus=2 + env vars configuration:

version: '3.4'
services:
    multiminilml12v2:
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
        environment:
            - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
            - NVIDIA_DISABLE_REQUIRE=1
            - RUST_BACKTRACE=full
            - JSON_OUTPUT=true
            - PORT=18083
            - MAX_BATCH_TOKENS=65536
            - MAX_CLIENT_BATCH_SIZE=1024
            # interesting variables below
            - TOKIO_WORKER_THREADS=1
            - NUM_RAYON_THREADS=1
            - MKL_NUM_THREADS=1
            - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
            - OMP_NUM_THREADS=1
            - MKL_DYNAMIC="FALSE"
            - OMP_DYNAMIC="FALSE"            
        deploy:
            resources:
                limits:
                    cpus: '2'
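If the server detected the quota itself, it could size its pools from one value instead of requiring these library-specific variables. A hedged sketch, assuming the server uses Tokio for I/O and Rayon for inference (as the TOKIO_WORKER_THREADS and NUM_RAYON_THREADS variables suggest); `effective_cpus()` is the hypothetical detector sketched earlier:

```rust
// Sketch: derive pool sizes from the detected quota instead of
// seven hand-set environment variables.
fn init_pools() -> std::io::Result<tokio::runtime::Runtime> {
    let cpus = effective_cpus();

    // Rayon: the CPU-bound inference pool should match the quota.
    rayon::ThreadPoolBuilder::new()
        .num_threads(cpus)
        .build_global()
        .expect("rayon global pool already initialized");

    // Tokio: mostly I/O-bound here; a single worker often suffices
    // when the heavy lifting happens on the Rayon pool.
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .enable_all()
        .build()
}
```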

Your contribution

I'm afraid I can't do much more than this.

Please note that I don't really know which of my environment variables actually have an impact on performance, since I am not familiar with the internals of text-embeddings-inference.

The issue is well known, though:
https://danluu.com/cgroup-throttling/
https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb

Some ecosystems have begun to take this into account.

For instance, since Python 3.13 you can make Python believe it has fewer CPUs using an environment variable:

docker run --rm -it --name py13 -e PYTHON_CPU_COUNT=2  python:3.13.0a4-slim  python -c "import os; print(os.cpu_count())"
2

Java does this automatically (the JVM has been container-aware since JDK 10):

docker run --rm -it --name java23 --entrypoint /bin/bash openjdk:23-slim
root@31e4b2de8fad:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
8

docker run --rm -it --name java23 --cpus=2 --entrypoint /bin/bash openjdk:23-slim
root@1935b08ebcf7:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
2
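For what it's worth, recent Rust toolchains do something similar: std::thread::available_parallelism takes cgroup CPU quotas into account on Linux, so a sketch like the one below should print 2 under --cpus=2 (assuming a recent toolchain; behavior varies by version):

```rust
fn main() {
    // On recent Rust toolchains this honors cgroup CPU quotas on Linux,
    // so under `docker run --cpus=2` it should print 2 instead of 8.
    let n = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("available parallelism: {n}");
}
```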
