inference through text-embeddings-inference server is slower than inference directly using model #144

@Mohit0928

Description

System Info

System: Linux
AWS instance: m5.xlarge
CPU: 4 vCPUs
Memory: 16 GB

I ran the BAAI/bge-base-en-v1.5 embedding model with text-embeddings-inference (TEI), hosting it as a Docker container using the commands below:

model=BAAI/bge-base-en-v1.5 
revision=refs/pr/5 
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run 
 
docker run -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 --model-id $model
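As a quick sanity check on what the container returns (assuming the default TEI behavior), `/embed` should respond with a JSON array containing one embedding vector per input. A minimal parsing helper — the function name is my own, not part of TEI:

```python
import numpy as np

def parse_embed_response(payload):
    # TEI's /embed should return a JSON array of embedding vectors,
    # one per input, so the parsed payload is expected to be 2-D.
    arr = np.asarray(payload, dtype=np.float32)
    assert arr.ndim == 2, "expected one vector per input"
    return arr
```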

Now, I used the code below to process a PDF and generate its embeddings:

import requests
import fitz
import time
import traceback
import json
import boto3
import numpy as np


url = "http://127.0.0.1:8080/embed"
headers = {"Content-Type": "application/json"}
s3 = boto3.client('s3')

def process_pdf(buf):
    
    try:
        start_time = time.time()
        pdf_document = fitz.open(stream=buf, filetype="pdf")
        page_count = pdf_document.page_count
        

        for page_num in range(page_count):
            
            page = pdf_document[page_num]
            text = page.get_text()
            words = text.split()
            unique_words = list(set(words))
            
            unique_text = ' '.join(unique_words)
            data = {"inputs": unique_text}
            response = requests.post(url, json=data, headers=headers)
            embeddings = np.array(response.json())  # parse the JSON body, not the Response object
        

        pdf_document.close()

        embed_time = time.time()
        print(embed_time - start_time)


    except Exception as e:
        print(f"An error occurred while processing the PDF: {e}")
        traceback.print_exc()  # Print the traceback information
        return


def process_data_sources(s3_path):
    
    try:
        if s3_path.endswith('.pdf'):
            # Get the PDF file content from S3
            print("Processing:", s3_path)
            response = s3.get_object(Bucket="state-medicaid-public-pending", Key=s3_path)
            buf = response['Body'].read()
            # Process the PDF file
            process_pdf(buf)

    except Exception as e:
        print(f"Outer exception occurred: {e}")
        traceback.print_exc()  # Print traceback information

def main():

    s3_path = 'crawldata/2024013001/KY/content/KY-1-RR%20318.pdf'
    process_data_sources(s3_path)


if __name__ == "__main__":
    main()

It took around 19 seconds.
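One likely source of overhead in the script above is the per-page HTTP round trip. TEI's `/embed` endpoint also accepts a list of strings under `"inputs"`, returning one embedding per input, so the pages can be sent in batches. A rough sketch — the helper names and the batch size of 8 are my own choices, not from the original script:

```python
import requests
import numpy as np

def chunks(items, size):
    # Yield successive batches of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_pages(page_texts, url="http://127.0.0.1:8080/embed", batch_size=8):
    # Sending several pages per request amortizes the HTTP overhead;
    # /embed returns one embedding vector per input string.
    vectors = []
    for batch in chunks(page_texts, batch_size):
        resp = requests.post(url, json={"inputs": batch})
        resp.raise_for_status()
        vectors.extend(resp.json())
    return np.array(vectors)
```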

Then I ran the same document without TEI, embedding locally via llama_index:

import fitz
import time
import os
import traceback
import json
import boto3
import pandas as pd
import numpy as np
from llama_index.embeddings import HuggingFaceEmbedding

s3 = boto3.client('s3')
embedding_model = 'BAAI/bge-base-en-v1.5'

# Define all embeddings and rerankers
EMBEDDINGS = {
    "embed-model": HuggingFaceEmbedding(model_name=embedding_model, pooling='cls', trust_remote_code=True),
}

embed_model = EMBEDDINGS['embed-model']


def process_pdf(buf):
    
    try:
        start_time = time.time()
        pdf_document = fitz.open(stream=buf, filetype="pdf")
        page_count = pdf_document.page_count
        

        for page_num in range(page_count):
            
            page = pdf_document[page_num]
            text = page.get_text()
            words = text.split()
            unique_words = list(set(words))
            unique_text = ' '.join(unique_words)
            embeddings = embed_model.get_text_embedding(unique_text)
            embeddings = np.array([embeddings])
        

        pdf_document.close()

        embed_time = time.time()
        print(embed_time - start_time)


    except Exception as e:
        print(f"An error occurred while processing the PDF: {e}")
        traceback.print_exc()  # Print the traceback information
        return


def process_data_sources(s3_path):
    
    try:
        if s3_path.endswith('.pdf'):
            # Get the PDF file content from S3
            print("Processing:", s3_path)
            response = s3.get_object(Bucket="state-medicaid-public-pending", Key=s3_path)
            buf = response['Body'].read()
            # Process the PDF file
            process_pdf(buf)

    except Exception as e:
        print(f"Outer exception occurred: {e}")
        traceback.print_exc()  # Print traceback information

def main():

    s3_path = 'crawldata/2024013001/KY/content/KY-1-RR%20318.pdf'
    process_data_sources(s3_path)


if __name__ == "__main__":
    main()

It took around 15 seconds.

I'm seeing slower inference through TEI. Am I missing something, or is it actually slower?
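For a fairer comparison, both paths are probably best timed after a warm-up, since the first calls pay one-time costs (model load, tokenizer init, connection setup). A rough timing helper, assuming any callable that embeds a single text — the function is illustrative, not from either script:

```python
import time

def time_embedding(embed_fn, texts, warmup=2, repeats=3):
    # Warm up so one-time setup costs don't count against either backend.
    for t in texts[:warmup]:
        embed_fn(t)
    # Average over several full passes for a steadier number.
    start = time.perf_counter()
    for _ in range(repeats):
        for t in texts:
            embed_fn(t)
    return (time.perf_counter() - start) / repeats
```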

Dependencies:

boto3
pymupdf
transformers
llama_index
torch

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Updated above

Expected behavior

Inference through the TEI server should be at least as fast as running the model directly.
