
[Bug] chatqna: xeon pipeline fails (serious performance drop) when CPU affinity of tei and teirerank containers is managed #763

@askervin


Priority

P2-High

OS type

Ubuntu

Hardware type

Xeon-SPR

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

Observed with latest chatqna.yaml (git 67394b8) where tei and teirerank containers use image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5

**ctr -n k8s.io images ls | grep text-embeddings**
ghcr.io/huggingface/text-embeddings-inference:cpu-1.5                                                                               application/vnd.oci.image.index.v1+json                   sha256:0502794a4d86974839e701dadd6d06e693ec78a0f6e87f68c391e88c52154f3f 48.2 MiB  linux/amd64                                                                                                                        io.cri-containerd.image=managed
ghcr.io/huggingface/text-embeddings-inference@sha256:0502794a4d86974839e701dadd6d06e693ec78a0f6e87f68c391e88c52154f3f               application/vnd.oci.image.index.v1+json                   sha256:0502794a4d86974839e701dadd6d06e693ec78a0f6e87f68c391e88c52154f3f 48.2 MiB  linux/amd64                                                                                                                        io.cri-containerd.image=managed

Description

When CPU affinity is managed on a node (with NRI resource policies or the Kubernetes CPU manager) and ChatQnA/kubernetes/manifests/xeon/chatqna.yaml is deployed, the tei and teirerank containers do not handle their internal threading and thread-CPU affinities properly.

They appear to create a thread for every CPU in the system, whereas they should create a thread only for each CPU allowed for the container.
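A quick way to see the discrepancy from inside an affected container (e.g. via kubectl exec) is to compare the allowed-CPU count with the system-wide count: nproc honors the affinity mask, while getconf reports every online CPU, so a correctly sized thread pool should follow nproc.

```shell
# CPUs this process may actually use (respects cgroup cpuset / affinity mask):
nproc
# All online CPUs in the system (ignores cgroup restrictions):
getconf _NPROCESSORS_ONLN
# The raw allowed set, as also shown in the /proc output below:
grep Cpus_allowed_list /proc/self/status
```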

In the logs it looks like this:

**kubectl logs -n benchmark chatqna-teirerank-674b878d9c-sdkg9**
...
2024-09-06T07:10:06.082735Z  INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-06T07:10:06.095067Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 80, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-06T07:10:06.095106Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 81, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-06T07:10:06.095128Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 82, index: 2, mask: {3, 67, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...
2024-09-06T07:10:06.260526Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 88, index: 8, mask: {9, 73, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-06T07:10:08.576066Z  WARN text_embeddings_router: router/src/lib.rs:267: Backend does not support a batch size > 8
2024-09-06T07:10:08.576082Z  WARN text_embeddings_router: router/src/lib.rs:268: forcing `max_batch_requests=8`
2024-09-06T07:10:08.576195Z  WARN text_embeddings_router: router/src/lib.rs:319: Invalid hostname, defaulting to 0.0.0.0
2024-09-06T07:10:08.579399Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1778: Starting HTTP server: 0.0.0.0:2082
2024-09-06T07:10:08.579418Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1779: Ready

And in the system's process/thread's CPU affinity level like this:

**grep Cpus_allowed_list /proc/2370247/task/2370*/status**
...
/proc/2370247/task/2370368/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370369/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370370/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370371/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370372/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370373/status:Cpus_allowed_list:    40
/proc/2370247/task/2370374/status:Cpus_allowed_list:    41
/proc/2370247/task/2370375/status:Cpus_allowed_list:    42
/proc/2370247/task/2370376/status:Cpus_allowed_list:    43
/proc/2370247/task/2370377/status:Cpus_allowed_list:    44
/proc/2370247/task/2370378/status:Cpus_allowed_list:    45
/proc/2370247/task/2370379/status:Cpus_allowed_list:    46
/proc/2370247/task/2370380/status:Cpus_allowed_list:    47
/proc/2370247/task/2370381/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370382/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370383/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370384/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370385/status:Cpus_allowed_list:    40-47
/proc/2370247/task/2370386/status:Cpus_allowed_list:    40-47
...

That is, only a few threads got correct CPU pinning; the rest (and there are far too many of them) run on all CPUs allowed for the container. As a result, this destroys the performance of tei and teirerank on CPU.

From the log it looks like the ort library tries to create a thread and set affinity for every CPU in the system, whereas it should not touch any CPUs beyond those allowed (limited by cgroups cpuset.cpus). I cannot say whether the root cause is in the ort library itself or in how it is used here.
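The "error code: 22" (EINVAL) in the warnings is the kernel's standard response when an affinity mask contains no CPU the task may run on: a mask like {1, 65} has no overlap with the container's cpuset (40-47). The same error can be provoked outside Kubernetes by requesting a CPU that is not available at all, which the kernel rejects by the same rule (assumes util-linux taskset and a machine with fewer than 10000 CPUs):

```shell
# sched_setaffinity(2) fails with EINVAL when the requested mask contains no
# CPU that is both online and permitted by the task's cpuset -- here simulated
# with a CPU number that does not exist on the machine.
taskset -c 9999 true 2>&1 || echo "EINVAL, as in the tei log"
```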

Reproduce steps

  1. Install the balloons NRI policy to manage CPUs.
helm repo add nri-plugins https://containers.github.io/nri-plugins
helm install balloons nri-plugins/nri-resource-policy-balloons --set patchRuntimeConfig=true
  2. Replace the default balloons configuration with one that runs tei/teirerank on dedicated CPUs.
cat > chatqna-balloons.yaml << EOF
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  allocatorTopologyBalancing: true
  balloonTypes:
  - name: tgi
    allocatorPriority: high
    minCPUs: 32
    minBalloons: 1
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: Equals
      values: ["tgi"]
  - name: embedding
    allocatorPriority: high
    minCPUs: 16
    minBalloons: 2
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: In
      values:
      - tei
      - teirerank
  - allocatorPriority: normal
    minCPUs: 14
    hideHyperthreads: false
    name: default
    namespaces:
    - "*"
  log:
    debug: ["policy"]
  pinCPU: true
  pinMemory: false
  reservedPoolNamespaces:
  - kube-system
  reservedResources:
    cpu: "2"
EOF
kubectl delete -n kube-system balloonspolicy default
kubectl create -n kube-system -f chatqna-balloons.yaml
  3. Deploy the chatqna.yaml manifest.
kubectl create -f ChatQnA/kubernetes/manifests/xeon/chatqna.yaml
  4. Follow the logs from the chatqna-tei and chatqna-teirerank pods.
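To make the last step concrete, per-thread affinities can be summarized on the node; `$$` below is a placeholder for the PID of text-embeddings-router (e.g. from `pgrep -f text-embeddings-router`). With the bug present, most threads report the whole container cpuset instead of one dedicated CPU each.

```shell
# Count how many threads share each allowed-CPU set; replace $$ with the
# PID of text-embeddings-router on the node.
pid=$$
grep -h Cpus_allowed_list /proc/$pid/task/*/status | sort | uniq -c
```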

Raw log

No response
