System Info
I'm running BAAI/bge-large-en-v1.5 (the model listed below) on a Linux x86 machine, and I'm seeing extremely slow inference times and low throughput when using CPU. Everything is fast and as expected when running on GPU.
On CPU, I'm running with
docker run -p 5252:80 -v ./data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-0.2.2 --model-id BAAI/bge-large-en-v1.5
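Requests go to the /embed route as plain POSTs, roughly like this (a minimal sketch; the sample text is just a placeholder):

import requests

# Single request against the TEI container started above.
# The /embed route takes {"inputs": <string or list of strings>}.
resp = requests.post(
    "http://localhost:5252/embed",
    json={"inputs": "This is a sample sentence to embed."},
)
resp.raise_for_status()
print(len(resp.json()[0]))  # one embedding vector per input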
When sending traffic to it via locust, I'm consistently seeing response times of over 1 second.
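The locust test is essentially just this (a rough sketch of the locustfile; the wait time and sample text are placeholders):

from locust import HttpUser, between, task

class EmbedUser(HttpUser):
    # Each simulated user posts one sentence at a time to /embed.
    wait_time = between(0.1, 0.5)

    @task
    def embed(self):
        self.client.post(
            "/embed",
            json={"inputs": "This is a sample sentence to embed."},
        )

# Started with: locust -f locustfile.py --host http://localhost:5252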
I'm comparing this to a pretty low-effort sentence-transformers server:
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
from sentence_transformers import SentenceTransformer

app = FastAPI()

class EmbedRequest(BaseModel):
    inputs: str

class EmbedResponse(BaseModel):
    embedding: List[float]

@app.post("/embed")
def embed(request: EmbedRequest):
    print(request.inputs)
    embeddings = app.model.encode([request.inputs])
    return EmbedResponse(embedding=[])  # embeddings[0]

if __name__ == "__main__":
    app.model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    uvicorn.run(app, host="localhost", port=5252, log_level="info")
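Both servers listen on port 5252 and accept the same {"inputs": ...} payload at /embed, so the same locust traffic can be pointed at either one.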
This really basic server is getting average response times of about 450ms.
Am I using the CPU version correctly? Should it be significantly slower than inference with sentence-transformers for the same model?
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
See above
Expected behavior
Faster throughput on CPU than a naive sentence-transformers server using FastAPI.