unicode: defs: conv: Implement conversion rules of character encodings #10464

Open
wants to merge 5 commits into
base: master
from

Conversation

cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Jun 12, 2025

I implemented an encoding conversion engine for Fluent Bit that supports the following encodings (each listed with its Windows code page number):

East Asian Encodings

  • ShiftJIS: 932
  • GB18030: 54936
  • GBK: 936
  • UHC (Unified Hangul Code): 949
  • Big5: 950

Windows (ANSI) Encodings

  • Win1250 (Central European): 1250
  • Win1251 (Cyrillic): 1251
  • Win1252 (Western European / Latin): 1252
  • Win1253 (Greek): 1253
  • Win1254 (Turkish): 1254
  • Win1255 (Hebrew): 1255
  • Win1256 (Arabic): 1256

DOS (OEM) Encodings

  • Win866 (Cyrillic - DOS): 866
  • Win874 (Thai): 874

Encoding conversion is especially important in CJK language environments, because the byte sequences produced by these commonly used encodings are not always compatible with the way UTF-8 represents characters.
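
To illustrate the incompatibility, here is a minimal sketch that uses Python's standard-library codecs as a stand-in. It does not use or represent the Fluent Bit converter added in this PR, and the helper name `is_valid_utf8` is hypothetical. It shows that the code page 932 bytes for a Japanese greeting are not valid UTF-8 and must be converted before they can be handled as UTF-8 text.

```python
# Illustration only: Python's built-in codecs, NOT Fluent Bit's converter.
text = "こんにちは"

sjis_bytes = text.encode("cp932")   # Windows code page 932 (Shift_JIS variant)
utf8_bytes = text.encode("utf-8")

# The same five characters have different lengths and byte values.
print(len(sjis_bytes), sjis_bytes.hex(" "))
print(len(utf8_bytes), utf8_bytes.hex(" "))


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The raw CP932 bytes are rejected by a UTF-8 decoder.
print(is_valid_utf8(sjis_bytes))

# Decoding with the correct source encoding and re-encoding yields UTF-8.
converted = sjis_bytes.decode("cp932").encode("utf-8")
print(converted == utf8_bytes)
```

A log pipeline that assumes UTF-8 would either reject such input or store mis-decoded text, which is why the conversion has to happen on the collector side.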

In addition, this could be a first milestone toward removing the obstacles to migrating from other log collectors that already support CJK-related character encodings.

This capability is very important for Asian languages, especially in CJK environments.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Only the Valgrind logs for the added unit test assertions are included here:

==3345944== Memcheck, a memory error detector
==3345944== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3345944== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==3345944== Command: bin/flb-it-unicode
==3345944== 
Test generic_converters...                      [ OK ]
==3345945== Warning: invalid file descriptor -1 in syscall close()
==3345945== 
==3345945== HEAP SUMMARY:
==3345945==     in use at exit: 0 bytes in 0 blocks
==3345945==   total heap usage: 2,170 allocs, 2,170 frees, 288,384 bytes allocated
==3345945== 
==3345945== All heap blocks were freed -- no leaks are possible
==3345945== 
==3345945== For lists of detected and suppressed errors, rerun with: -s
==3345945== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test generic_converters_alias...                [ OK ]
==3345947== Warning: invalid file descriptor -1 in syscall close()
==3345947== 
==3345947== HEAP SUMMARY:
==3345947==     in use at exit: 0 bytes in 0 blocks
==3345947==   total heap usage: 2,170 allocs, 2,170 frees, 288,384 bytes allocated
==3345947== 
==3345947== All heap blocks were freed -- no leaks are possible
==3345947== 
==3345947== For lists of detected and suppressed errors, rerun with: -s
==3345947== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test generic_conversions_sjis...                UTF-8: こんにちは[ OK ]
==3345948== Warning: invalid file descriptor -1 in syscall close()
==3345948== 
==3345948== HEAP SUMMARY:
==3345948==     in use at exit: 0 bytes in 0 blocks
==3345948==   total heap usage: 2,172 allocs, 2,172 frees, 288,431 bytes allocated
==3345948== 
==3345948== All heap blocks were freed -- no leaks are possible
==3345948== 
==3345948== For lists of detected and suppressed errors, rerun with: -s
==3345948== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test generic_conversions_gbk...                 [ OK ]
==3345949== Warning: invalid file descriptor -1 in syscall close()
==3345949== 
==3345949== HEAP SUMMARY:
==3345949==     in use at exit: 0 bytes in 0 blocks
==3345949==   total heap usage: 2,172 allocs, 2,172 frees, 288,404 bytes allocated
==3345949== 
==3345949== All heap blocks were freed -- no leaks are possible
==3345949== 
==3345949== For lists of detected and suppressed errors, rerun with: -s
==3345949== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test generic_conversions_big5...                [ OK ]
==3345985== Warning: invalid file descriptor -1 in syscall close()
==3345985== 
==3345985== HEAP SUMMARY:
==3345985==     in use at exit: 0 bytes in 0 blocks
==3345985==   total heap usage: 2,172 allocs, 2,172 frees, 288,404 bytes allocated
==3345985== 
==3345985== All heap blocks were freed -- no leaks are possible
==3345985== 
==3345985== For lists of detected and suppressed errors, rerun with: -s
==3345985== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test generic_conversions_gb18030...             [ OK ]
==3345986== Warning: invalid file descriptor -1 in syscall close()
==3345986== 
==3345986== HEAP SUMMARY:
==3345986==     in use at exit: 0 bytes in 0 blocks
==3345986==   total heap usage: 2,172 allocs, 2,172 frees, 288,408 bytes allocated
==3345986== 
==3345986== All heap blocks were freed -- no leaks are possible
==3345986== 
==3345986== For lists of detected and suppressed errors, rerun with: -s
==3345986== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test generic_conversions_all...                 --- Testing [SJIS]: "こんにちは" ---
  SJIS to UTF-8: OK
  UTF-8 to SJIS: OK

--- Testing [CP866]: "Привет" ---
  CP866 to UTF-8: OK
  UTF-8 to CP866: OK

--- Testing [CP874]: "สวัสดี" ---
  CP874 to UTF-8: OK
  UTF-8 to CP874: OK

--- Testing [CP932]: "こんにちは" ---
  CP932 to UTF-8: OK
  UTF-8 to CP932: OK

--- Testing [Windows-31J]: "こんにちは" ---
  Windows-31J to UTF-8: OK
  UTF-8 to Windows-31J: OK

--- Testing [CP949]: "안녕하세요" ---
  CP949 to UTF-8: OK
  UTF-8 to CP949: OK

--- Testing [CP1250]: "Děkuji" ---
  CP1250 to UTF-8: OK
  UTF-8 to CP1250: OK

--- Testing [CP1251]: "Спасибо" ---
  CP1251 to UTF-8: OK
  UTF-8 to CP1251: OK

--- Testing [CP1252]: "¡Hola!" ---
  CP1252 to UTF-8: OK
  UTF-8 to CP1252: OK

--- Testing [CP1253]: "Ευχαριστώ" ---
  CP1253 to UTF-8: OK
  UTF-8 to CP1253: OK

--- Testing [CP1254]: "Teşekkürler" ---
  CP1254 to UTF-8: OK
  UTF-8 to CP1254: OK

--- Testing [CP1255]: "תודה" ---
  CP1255 to UTF-8: OK
  UTF-8 to CP1255: OK

--- Testing [CP1256]: "شكرا" ---
  CP1256 to UTF-8: OK
  UTF-8 to CP1256: OK

[ OK ]
==3345987== Warning: invalid file descriptor -1 in syscall close()
==3345987== 
==3345987== HEAP SUMMARY:
==3345987==     in use at exit: 0 bytes in 0 blocks
==3345987==   total heap usage: 2,196 allocs, 2,196 frees, 288,867 bytes allocated
==3345987== 
==3345987== All heap blocks were freed -- no leaks are possible
==3345987== 
==3345987== For lists of detected and suppressed errors, rerun with: -s
==3345987== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
SUCCESS: All unit tests have passed.
==3345944== 
==3345944== HEAP SUMMARY:
==3345944==     in use at exit: 0 bytes in 0 blocks
==3345944==   total heap usage: 1,316 allocs, 1,316 frees, 194,088 bytes allocated
==3345944== 
==3345944== All heap blocks were freed -- no leaks are possible
==3345944== 
==3345944== For lists of detected and suppressed errors, rerun with: -s
==3345944== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

No leaks were detected.
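
The round-trip checks reported by `generic_conversions_all` in the log above can be sketched as follows. This is an approximation under stated assumptions: it relies on Python's standard-library codecs rather than the Fluent Bit converter, the names `SAMPLES` and `round_trip` are hypothetical, and the Python codec names stand in for the labels shown in the log.

```python
# Illustration only: mirrors the shape of the round-trip test with Python
# codecs. It does not call or reproduce Fluent Bit's conversion code.

# Hypothetical sample table: Python codec name -> sample text from the log.
SAMPLES = {
    "cp932": "こんにちは",
    "cp866": "Привет",
    "cp874": "สวัสดี",
    "cp949": "안녕하세요",
    "cp1250": "Děkuji",
    "cp1251": "Спасибо",
    "cp1252": "¡Hola!",
    "cp1253": "Ευχαριστώ",
    "cp1254": "Teşekkürler",
    "cp1255": "תודה",
    "cp1256": "شكرا",
}


def round_trip(codec: str, text: str) -> bool:
    """Check legacy -> UTF-8 and UTF-8 -> legacy for one sample."""
    legacy = text.encode(codec)          # bytes as a legacy producer emits them
    utf8 = text.encode("utf-8")          # expected UTF-8 form

    to_utf8 = legacy.decode(codec).encode("utf-8")    # legacy -> UTF-8
    to_legacy = utf8.decode("utf-8").encode(codec)    # UTF-8 -> legacy

    return to_utf8 == utf8 and to_legacy == legacy


results = {codec: round_trip(codec, text) for codec, text in SAMPLES.items()}

for codec, ok in results.items():
    print(f"{codec}: {'OK' if ok else 'FAILED'}")
```

Checking both directions matters because a mapping table can be correct for decoding while still missing entries needed for encoding.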

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-conversion-rules-of-character-encodings branch from 68b218b to 9aca9bf Compare June 12, 2025 08:21
@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-conversion-rules-of-character-encodings branch from 3158015 to c314755 Compare June 12, 2025 09:35
@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-conversion-rules-of-character-encodings branch from c314755 to fe52a3e Compare June 12, 2025 09:41
@pwhelan
Contributor

pwhelan commented Jun 12, 2025

The failure seems to be a weird flake in the filter_rewrite_tag test on macOS:

Test heavy_input_pause_emitter...               [2025/06/12 11:12:02] [ info] [input] pausing emitter_for_rewrite_tag.0
[ FAILED ]
  filter_rewrite_tag.c:354: Check heavy_loop > got... failed
    expect: 100000 got: 100000
Test issue_4518...                              [2025/06/12 11:12:05] [ info] [input] pausing emitter_for_rewrite_tag.0

As far as I can tell it is unrelated to the code in this PR.

@cosmo0920
Contributor Author

As far as I can tell it is unrelated to the code in this PR.

Yes, that test case is flaky on macOS. Thanks for pointing that out.

Note: This converter implementation in Fluent Bit mainly includes the
conversion rules related to CJK encodings.

Signed-off-by: Hiroshi Hatake <[email protected]>
2 participants