
Encoding issue with non-ASCII characters in Ollama.chat responses (Spanish language issue) #168

Open
@jasp402

Description


When using the Ollama.chat method to interact with the llama3 model, responses containing special characters (e.g., accented characters like á, é, í, ó, ú, ü and punctuation like ¿, ¡) are improperly encoded. While standard ASCII characters work fine, non-ASCII characters are returned with encoding artifacts, making them unreadable.

The problem persists despite attempts to re-decode or otherwise process the responses in client code, suggesting it may lie in how the library or the server handles UTF-8 encoding.


Observed Behavior

The following responses were received when interacting with the llama3 model via Ollama.chat:

assistant: Hola! ¿Cómo estás?

This was expected to be:

assistant: Hola! ¿Cómo estás?
Raw response from Ollama: {
  model: 'llama3',
  created_at: '2024-11-30T04:57:41.4175287Z',
  message: { role: 'assistant', content: 'Hola! ¿Cómo estás?' },
  done_reason: 'stop',
  done: true,
  total_duration: 2413650900,
  load_duration: 24432200,
  prompt_eval_count: 12,
  prompt_eval_duration: 288000000,
  eval_count: 8,
  eval_duration: 2100000000
}

Additional Context

  • English works fine: Messages containing only English characters are processed correctly.
  • Special characters fail: Any character outside the ASCII range (e.g., accented vowels, ¿, ¡) results in encoding artifacts.

Direct API Output
Testing with curl shows that the server streams its response as a series of JSON fragments:

curl -X POST http://localhost:11434/api/generate \
     -H "Content-Type: application/json" \
     -d '{
         "model": "llama3",
         "prompt": "¡Hola! ¿Cómo estás?"
     }'

Response

{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:16.7882944Z",
    "response": "¡",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:17.1132195Z",
    "response": "h",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:17.5597785Z",
    "response": "ola",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:19.5772488Z",
    "response": "?",
    "done": true
}

This suggests the fragments are being returned correctly in terms of structure but not properly encoded.

Attempts to Resolve

UTF-8 Decoding Using Buffer: Tried re-encoding the response content as latin1 and decoding the resulting bytes as UTF-8:

const content = Buffer.from(response.message.content, 'latin1').toString('utf8');
console.log('Decoded content:', content);

Result:

Hola! ´┐¢C´┐¢mo est´┐¢s?

Environment Details

OS: Windows 11
Node.js Version: 20.x
Library Version: Latest (installed via npm)
Model Used: llama3
API Host: http://127.0.0.1:11434
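One environment-specific detail worth ruling out (an assumption about this setup, not a confirmed diagnosis): the `´┐¢` pattern above is exactly what the UTF-8 encoding of U+FFFD (bytes `EF BF BD`) looks like when a Windows console renders it with a legacy code page such as 850, so switching the terminal to UTF-8 (`chcp 65001`) before re-testing would show whether the console is part of the problem. A small check of that byte pattern:

```javascript
// U+FFFD (the replacement character) encodes to the bytes EF BF BD in UTF-8.
// In DOS code page 850, 0xEF renders as '´', 0xBF as '┐', and 0xBD as '¢',
// which matches the "´┐¢" artifacts shown above.
const bytes = Buffer.from('\uFFFD', 'utf8');
console.log(bytes); // <Buffer ef bf bd>
```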

Request

  • Confirm UTF-8 Handling: Verify that the server and library are properly handling UTF-8 characters in both streaming and assembled responses.
  • Document Encoding Expectations: Clarify if clients need to perform additional decoding steps or if the library should natively handle this.
  • Provide Guidance: If this issue is expected behavior, please provide steps or examples for properly decoding responses with non-ASCII characters.

Thank you for addressing this issue. If more information or debugging steps are needed, feel free to reach out!
