Fix excessive token usage with Unicode text in realtime event serialization #2444


Open: wants to merge 1 commit into main

Conversation


@josharsh josharsh commented Jul 4, 2025

Non-ASCII characters in realtime event data (such as Cyrillic, Chinese, or Arabic text) were being unnecessarily escaped during JSON serialization, causing significant token overhead.

This fix adds ensure_ascii=False to the json.dumps() calls used when sending realtime WebSocket events, preserving Unicode characters in their original form.

Token savings:

  • 54-60% size reduction for Unicode-heavy schemas
  • ~116+ tokens saved per typical function schema with Cyrillic descriptions
  • Backward compatible: outputs valid JSON that parses identically

Fixes issue #2428 where Pydantic schema descriptions with Cyrillic text caused 3.6x token overhead.

The fix updates both the sync and async realtime connection send() methods to use ensure_ascii=False, which is the modern standard for JSON serialization of Unicode content.
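The effect of the change can be illustrated with a minimal sketch. The payload below is illustrative only, not the SDK's actual event schema; the point is the size difference between the default escaped output and ensure_ascii=False:

```python
import json

# Illustrative realtime-style event with a Cyrillic instruction string
# (hypothetical payload, not the SDK's actual schema).
event = {
    "type": "session.update",
    "session": {"instructions": "Ты полезный ассистент."},
}

# Default behavior: every non-ASCII character becomes a 6-character
# \uXXXX escape sequence, inflating the serialized payload.
escaped = json.dumps(event)

# With ensure_ascii=False, Unicode characters are kept as-is.
preserved = json.dumps(event, ensure_ascii=False)

print(len(escaped), len(preserved))
# Both forms parse back to the identical object, so the change is
# backward compatible for any spec-conforming JSON consumer.
```

For Unicode-heavy schemas the escaped form can be several times larger, which is where the reported token savings come from.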

  • I understand that this repository is auto-generated and my pull request may not be merged

Changes being requested

Additional context & links

@josharsh josharsh requested a review from a team as a code owner July 4, 2025 21:11