Slow inference time running on CPU #31

@fozziethebeat

Description

System Info

I'm running the top listed model (BAAI/bge-large-en-v1.5) on a Linux x86 machine and I'm seeing extremely slow inference times and low throughput when using CPU. Everything is fast and as expected when running on GPU.

On CPU, I'm running with:

docker run -p 5252:80 -v ./data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-0.2.2 --model-id BAAI/bge-large-en-v1.5

When sending traffic to it via locust, I'm consistently seeing response times of over 1 second.
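
For reference, the load test is roughly the following shape of locustfile (a sketch; the payload text and wait times here are placeholders rather than the exact values I used). Both the TEI container and the comparison server below accept POST /embed with a JSON body of {"inputs": "..."}.

from locust import HttpUser, task, between


class EmbedUser(HttpUser):
    wait_time = between(0.1, 0.5)  # placeholder pacing

    @task
    def embed(self):
        # Hit the embedding endpoint with a single-input request.
        self.client.post(
            "/embed",
            json={"inputs": "An example sentence to embed for the benchmark."},
        )

Run with: locust -f locustfile.py --host http://localhost:5252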

I'm comparing this to a pretty low-effort sentence-transformers server:

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
from sentence_transformers import SentenceTransformer

app = FastAPI()


class EmbedRequest(BaseModel):
    inputs: str


class EmbedResponse(BaseModel):
    embedding: List[float]


@app.post("/embed")
def embed(request: EmbedRequest):
    print(request.inputs)
    # Encode the single input and return its embedding as a list of floats.
    embeddings = app.model.encode([request.inputs])
    return EmbedResponse(embedding=embeddings[0].tolist())


if __name__ == "__main__":
    app.model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    uvicorn.run(app, host="localhost", port=5252, log_level="info")

This really basic server is getting average response times of about 450ms.
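
For a quick single-request sanity check outside of locust, something like the following works against either server on port 5252 (the response JSON differs: TEI returns a list of embeddings, while the server above returns an object with an "embedding" field):

import requests

# One-off timing check; assumes a server is listening on localhost:5252.
resp = requests.post(
    "http://localhost:5252/embed",
    json={"inputs": "An example sentence to embed."},
    timeout=30,
)
print(resp.status_code, resp.elapsed.total_seconds())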

Am I using the CPU version correctly? Should it be significantly slower than inference with sentence-transformers for the same model?

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

See above

Expected behavior

Faster throughput on CPU than a naive sentence-transformers server using FastAPI.
