Description
When using the Ollama.chat method to interact with the llama3 model, responses containing special characters (e.g., accented characters like á, é, í, ó, ú, ü and punctuation like ¿, ¡) are improperly encoded. While standard ASCII characters work fine, non-ASCII characters are returned with encoding artifacts, making them unreadable.
The problem persists across attempts to decode or re-process the responses in client code, suggesting it may be related to how the library or the server handles UTF-8.
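For context, a minimal script along these lines reproduces the behavior (a sketch using the standard ollama npm client; the prompt is illustrative):
import ollama from 'ollama';

// Any prompt that elicits accented characters triggers the artifacts.
const response = await ollama.chat({
  model: 'llama3',
  messages: [{ role: 'user', content: '¡Hola! ¿Cómo estás?' }],
});

// The ASCII portions print fine; the accented characters do not.
console.log(`${response.message.role}: ${response.message.content}`);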
Observed Behavior
The following response was received when interacting with the llama3 model via Ollama.chat:
assistant: Hola! ┬┐C├│mo est├ís?
This was expected to be:
assistant: Hola! ¿Cómo estás?
Raw response from Ollama: {
model: 'llama3',
created_at: '2024-11-30T04:57:41.4175287Z',
message: { role: 'assistant', content: 'Hola! ┬┐C├│mo est├ís?' },
done_reason: 'stop',
done: true,
total_duration: 2413650900,
load_duration: 24432200,
prompt_eval_count: 12,
prompt_eval_duration: 288000000,
eval_count: 8,
eval_duration: 2100000000
}
Additional Context
- English works fine: Messages containing only English characters are processed correctly.
- Special characters fail: Any character outside the ASCII range (e.g., accented vowels, ¿, ¡) results in encoding artifacts.
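This split is exactly what mis-handled bytes look like: ASCII characters occupy a single byte that is identical in every common code page, while anything above U+007F is multi-byte in UTF-8 and therefore breaks visibly. A quick illustration (example strings only):
// ASCII survives any single-byte/UTF-8 confusion; accented characters do not.
console.log(Buffer.byteLength('Hola', 'utf8'));          // 4  — one byte per character
console.log(Buffer.byteLength('¿Cómo estás?', 'utf8'));  // 15 — 12 characters, 3 of them 2 bytes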
Direct API Output
Testing with curl (the /api/generate endpoint streams NDJSON fragments by default) shows the response arriving token by token:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"prompt": "¡Hola! ¿Cómo estás?"
}'
Response
{
"model": "llama3",
"created_at": "2024-11-30T05:00:16.7882944Z",
"response": "¡",
"done": false
}
{
"model": "llama3",
"created_at": "2024-11-30T05:00:17.1132195Z",
"response": "h",
"done": false
}
{
"model": "llama3",
"created_at": "2024-11-30T05:00:17.5597785Z",
"response": "ola",
"done": false
}
{
"model": "llama3",
"created_at": "2024-11-30T05:00:19.5772488Z",
"response": "?",
"done": true
}
This suggests the streamed fragments are structurally correct, but the non-ASCII characters still display with the same encoding artifacts.
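One way such artifacts can arise in client code is decoding each streamed chunk independently: a multi-byte UTF-8 sequence that happens to be split across two chunks then decodes to replacement characters. The following sketch consumes /api/generate with a stateful decoder, which rules this failure mode in or out (assumes Node 18+ with global fetch; this is a diagnostic, not the library's internal code):
const res = await fetch('http://127.0.0.1:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'llama3', prompt: '¡Hola! ¿Cómo estás?' }),
});

// { stream: true } makes TextDecoder buffer an incomplete multi-byte
// sequence at a chunk boundary instead of emitting U+FFFD.
const decoder = new TextDecoder('utf-8');
let buffered = '';
for await (const chunk of res.body) {
  buffered += decoder.decode(chunk, { stream: true });
  let nl;
  while ((nl = buffered.indexOf('\n')) !== -1) {
    const line = buffered.slice(0, nl).trim();
    buffered = buffered.slice(nl + 1);
    if (line) process.stdout.write(JSON.parse(line).response);
  }
}
process.stdout.write('\n');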
Attempts to Resolve
UTF-8 re-decoding using Buffer: tried reinterpreting the string as latin1 bytes and decoding those bytes as UTF-8 (the usual fix when UTF-8 text has been mis-decoded as latin1):
// Re-encode the string as latin1 bytes, then decode those bytes as UTF-8.
const content = Buffer.from(response.message.content, 'latin1').toString('utf8');
console.log('Decoded content:', content);
Result:
Hola! ´┐¢C´┐¢mo est´┐¢s?
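The per-character artifact here is telling: ´┐¢ is how U+FFFD (the Unicode replacement character) renders under the legacy Windows console code page 850. Getting one replacement character per accented character out of a latin1 → utf8 round trip means the string in memory already held single code points above U+007F, i.e. it may already be correct, with the artifacts introduced only when the terminal displays it. A diagnostic sketch to distinguish the two (assumes response is the object returned by Ollama.chat, as in the snippet above):
import { writeFileSync } from 'node:fs';

const content = response.message.content;

// If the accented characters show up as bf, f3, e1 (¿, ó, á), the string
// is already correct Unicode in memory.
console.log([...content].map((c) => c.codePointAt(0).toString(16)).join(' '));

// Bypass the terminal: open out.txt in a UTF-8-aware editor. If the file is
// correct, the artifacts come from the console code page, not the library
// (running `chcp 65001` before the script switches the console to UTF-8).
writeFileSync('out.txt', content, { encoding: 'utf8' });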
Environment Details
- OS: Windows 11
- Node.js Version: 20.x
- Library Version: Latest (installed via npm)
- Model Used: llama3
- API Host: http://127.0.0.1:11434
Request
- Confirm UTF-8 Handling: Verify that the server and library are properly handling UTF-8 characters in both streaming and assembled responses.
- Document Encoding Expectations: Clarify if clients need to perform additional decoding steps or if the library should natively handle this.
- Provide Guidance: If this issue is expected behavior, please provide steps or examples for properly decoding responses with non-ASCII characters.
Thank you for addressing this issue. If more information or debugging steps are needed, feel free to reach out!