System Info
Apple M2 Pro, macOS 14.2.1 (23C71)
cargo 1.75.0 (1d8b05cdd 2023-11-20)
```json
{
  "model_id": "llmrails/ember-v1",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 512,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 12,
  "version": "1.2.0",
  "sha": "eef2912b318fef33df736f048d769df7056cea16",
  "docker_label": null
}
```
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
- Generate a large garbage string of text. In JavaScript: `const content = '.'.repeat(1e6)`
- Call the /tokenize endpoint with that text about 20 times. (I used the ember_v1 model.)
- Notice that the text-embeddings-inference process consumes all available CPU and RAM. (A minimal repro sketch follows below.)
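
A minimal repro sketch, assuming a local text-embeddings-inference server on http://localhost:8080 and Node 18+ for the built-in fetch; the `{ inputs: ... }` body shape follows the TEI /tokenize API, so adjust if your version differs:

```js
// Repro sketch: flood /tokenize with a huge garbage string.
// Assumes TEI is listening on localhost:8080 (adjust to your deployment).
const content = '.'.repeat(1e6);

for (let i = 0; i < 20; i++) {
  // Fire requests without awaiting so they pile up concurrently.
  fetch('http://localhost:8080/tokenize', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ inputs: content }),
  }).catch(console.error);
}
```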
Expected behavior
Options:
- /tokenize gets a timeout constraint (either hardcoded, set by an env var, or passed as an argument).
- A validator sits in front of the model to detect nonsensical or nefarious inputs.
- /tokenize returns a 413 when the content exceeds, e.g., max_batch_tokens * 5 (or similar). This is the least preferred option, since a complete token count is a useful starting point for chunking strategies. (A sketch of this option follows below.)
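
For illustration only, a sketch of the third option as a thin proxy in front of TEI. The route, the 5x character budget, the port, and the forwarding stub are all hypothetical assumptions, not existing TEI behavior:

```js
// Hypothetical sketch of option 3: return 413 before an oversized payload
// reaches the real /tokenize endpoint. LIMIT and the 5x factor are
// illustrative, not part of TEI.
const express = require('express');

const MAX_BATCH_TOKENS = 16384;     // from the server config above
const LIMIT = MAX_BATCH_TOKENS * 5; // rough character budget (assumption)

const app = express();
app.use(express.json({ limit: '50mb' }));

app.post('/tokenize', (req, res) => {
  const inputs = req.body && req.body.inputs;
  // /tokenize can take a single string or a batch; normalize to an array.
  const texts = Array.isArray(inputs) ? inputs : [String(inputs ?? '')];
  if (texts.some((t) => t.length > LIMIT)) {
    return res.status(413).json({ error: `input exceeds ${LIMIT} characters` });
  }
  // Forward to the real TEI instance here (omitted in this sketch).
  res.status(501).json({ error: 'forwarding not implemented in this sketch' });
});

app.listen(3000);
```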