If a Unicode character is split by container runtime, we should merge it when recombining #39653

Closed
@h0cheung

Description

Component(s)

receiver/filelog, pkg/stanza

Is your feature request related to a problem? Please describe.

The default log drivers of container runtimes, such as Docker and Containerd, may split long log lines at a byte boundary rather than at a character (rune) boundary.
As a result, a multi-byte UTF-8 character may be split across two log entries.
For example, suppose the application writes this log line:

<content with 16KB -1 bytes>方...

The UTF-8 encoding of "方" is \xE6\x96\xB9.

The output if running in Containerd:

2025-01-01T00:00:00.000000000+00:00 stdout P <content with 16KB -1 bytes>\xE6
2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9...
  • note: "\x96" means a byte with value 0x96.

The output if running in Docker:

{"log":"<content with 16KB -1 bytes>\ufffd","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
{"log":"\ufffd\ufffd...","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}

The collected message:

<content with 16KB -1 bytes>���...
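The effect can be reproduced outside the collector. The following is a minimal sketch, not code from pkg/stanza: `decodeLossy` is a hypothetical helper that stands in for any decoder that replaces each invalid byte with U+FFFD. It shows that decoding the two halves separately loses the character, while joining the raw bytes first preserves it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeLossy is a hypothetical stand-in for a decoder that replaces
// every invalid byte with U+FFFD. Converting to []rune does exactly that:
// each invalid byte becomes one replacement rune.
func decodeLossy(b []byte) string {
	return string([]rune(string(b)))
}

// mergeThenDecode joins the raw chunks first and decodes only once.
func mergeThenDecode(chunks ...[]byte) string {
	var buf []byte
	for _, c := range chunks {
		buf = append(buf, c...)
	}
	return decodeLossy(buf)
}

func main() {
	full := []byte("方") // UTF-8 bytes: \xE6\x96\xB9
	first, second := full[:1], full[1:]

	// Neither half is valid UTF-8 on its own.
	fmt.Println(utf8.Valid(first), utf8.Valid(second))

	// Decoding each half separately yields three replacement characters.
	fmt.Printf("%q\n", decodeLossy(first)+decodeLossy(second))

	// Joining the bytes before decoding recovers the original character.
	fmt.Printf("%q\n", mergeThenDecode(first, second))
}
```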

Describe the solution you'd like

For Docker, it seems that nothing can be done, since its JSON log already contains \ufffd in place of the original bytes.
But for Containerd (and maybe also CRI-O), we should try to merge the bytes to recover the original Unicode character.

  • First, when decoding, all invalid bytes are replaced with �. This can be prevented by setting encoding to nop. Maybe this should be the default behavior.
  • Then the regex parser, which is used by the container parser, returns an error because it only supports string input. We should add support for []byte, including invalid UTF-8 bytes.
  • Finally, after the container parser recombines the entries, we should replace any invalid bytes that still remain, so that the final output is a valid UTF-8 string and other operators do not run into errors.
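The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the container parser's actual implementation: `criLine`, `parseCRI`, and `recombine` are hypothetical names, the line format is simplified to `<timestamp> <stream> <P|F> <content>`, and the parsing uses a plain byte split instead of a regex. The key point is that content stays as `[]byte` until the final entry arrives, and invalid bytes are replaced only once, after recombination.

```go
package main

import (
	"bytes"
	"fmt"
)

// criLine is a hypothetical, simplified view of one CRI-format log line.
type criLine struct {
	partial bool   // true for tag "P", false for tag "F"
	content []byte // raw bytes, possibly ending in the middle of a character
}

// parseCRI splits a raw line on its first three spaces without decoding it,
// so invalid UTF-8 in the content is passed through untouched.
func parseCRI(raw []byte) (criLine, error) {
	parts := bytes.SplitN(raw, []byte(" "), 4)
	if len(parts) != 4 {
		return criLine{}, fmt.Errorf("malformed CRI line: %q", raw)
	}
	tag := string(parts[2])
	if tag != "P" && tag != "F" {
		return criLine{}, fmt.Errorf("unknown tag %q", tag)
	}
	return criLine{partial: tag == "P", content: parts[3]}, nil
}

// recombine appends partial contents as bytes and sanitizes each message
// once, after its final ("F") line. A trailing partial line with no final
// line is dropped in this sketch.
func recombine(lines [][]byte) ([]string, error) {
	var out []string
	var buf []byte
	for _, raw := range lines {
		l, err := parseCRI(raw)
		if err != nil {
			return nil, err
		}
		buf = append(buf, l.content...)
		if !l.partial {
			// Replace any bytes that are still invalid after merging.
			clean := bytes.ToValidUTF8(buf, []byte("\uFFFD"))
			out = append(out, string(clean))
			buf = nil
		}
	}
	return out, nil
}

func main() {
	lines := [][]byte{
		[]byte("2025-01-01T00:00:00.000000000+00:00 stdout P abc\xE6"),
		[]byte("2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9def"),
	}
	msgs, err := recombine(lines)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", msgs)
}
```

Sanitizing only at the end keeps the output valid UTF-8 for downstream operators while still giving split characters a chance to be reassembled.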

Describe alternatives you've considered

No response

Additional context

No response
