Closed
Description
Component(s)
receiver/filelog, pkg/stanza
Is your feature request related to a problem? Please describe.
The default log drivers of container runtimes, such as Docker and Containerd, may split long log lines at a byte boundary instead of a rune boundary.
As a result, a multi-byte UTF-8 character may be split across two log entries.
For example, the original log by application:
<content with 16KB -1 bytes>方...
The UTF-8 encoding of "方": \xE6\x96\xB9
The output if running in Containerd:
2025-01-01T00:00:00.000000000+00:00 stdout P <content with 16KB -1 bytes>\xE6
2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9...
- note: "\x96" means a byte with value 0x96.
The output if running in Docker:
{"log":"<content with 16KB -1 bytes>\ufffd","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
{"log":"\ufffd\ufffd...","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
The collected message:
<content with 16KB -1 bytes>���...
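The effect can be reproduced with a minimal Go sketch. The helper decodeLossy is hypothetical and only stands in for a decoder that replaces each invalid byte with U+FFFD; it is not the collector's actual decoder. It relies on the fact that converting a Go string to []rune yields U+FFFD for every invalid byte.

```go
package main

import "fmt"

// decodeLossy is a hypothetical stand-in for a lossy decoder: converting a
// string to []rune turns every invalid UTF-8 byte into U+FFFD.
func decodeLossy(b []byte) string {
	return string([]rune(string(b)))
}

func main() {
	raw := []byte("方") // \xE6\x96\xB9

	// Split inside the character, as happens at the 16KB boundary.
	first, second := raw[:1], raw[1:]

	// Decoding each fragment on its own loses the character:
	// one replacement character for \xE6, two for \x96\xB9.
	fmt.Println(decodeLossy(first) + decodeLossy(second))

	// Joining the raw bytes before decoding restores the character.
	joined := append(append([]byte{}, first...), second...)
	fmt.Println(decodeLossy(joined))
}
```

This suggests the fragments have to be merged while they are still raw bytes, before any lossy decoding takes place.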
Describe the solution you'd like
For Docker, it seems that we can do nothing, because the invalid bytes have already been replaced with \ufffd in the JSON output.
But for Containerd (maybe also CRI-O), we should try to merge the bytes to recover the original character.
- First, when decoding, all invalid bytes are replaced with �. This can be prevented by setting encoding to nop. Maybe this should be the default behavior.
- Then, the regex parser, which is used by the container parser, will run into an error because it only supports strings. We should add support for []byte, including invalid UTF-8 bytes.
- Finally, after the container parser recombines the entries, we should replace any invalid bytes that still exist, to ensure that the final output is a valid UTF-8 string and to prevent other operators from running into errors.
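The three steps above could look roughly like the following Go sketch. It is only an illustration under stated assumptions: the pattern criLine and the function mergeCRI are hypothetical names and are not the expression or code used by the actual container parser. It relies on Go's regexp package matching directly on []byte and returning submatches as slices of the input, so invalid bytes pass through unchanged.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"unicode/utf8"
)

// criLine is a hypothetical pattern for "<time> <stream> <P|F> <message>".
var criLine = regexp.MustCompile(`^(\S+) (stdout|stderr) ([PF]) (.*)$`)

// mergeCRI concatenates the raw message bytes of partial (P) lines up to and
// including the final (F) line, then replaces any bytes that are still invalid.
func mergeCRI(lines [][]byte) (string, error) {
	var buf bytes.Buffer
	for _, line := range lines {
		m := criLine.FindSubmatch(line) // matches on bytes, no string decoding
		if m == nil {
			return "", fmt.Errorf("unrecognized line: %q", line)
		}
		buf.Write(m[4]) // raw message bytes, possibly ending mid-character
		if string(m[3]) == "F" {
			break
		}
	}
	// Sanitize only after recombining, so split characters are restored first.
	return string(bytes.ToValidUTF8(buf.Bytes(), []byte("\uFFFD"))), nil
}

func main() {
	lines := [][]byte{
		[]byte("2025-01-01T00:00:00.000000000+00:00 stdout P abc\xE6"),
		[]byte("2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9def"),
	}
	merged, err := mergeCRI(lines)
	fmt.Println(merged, utf8.ValidString(merged), err)
}
```

Sanitizing as the last step keeps the output valid UTF-8 even when a partial line is never completed, which is the case the final bullet is concerned with.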
Describe alternatives you've considered
No response
Additional context
No response