If a Unicode character is split by container runtime, we should merge it when recombining

### Component(s)

receiver/filelog, pkg/stanza

### Is your feature request related to a problem? Please describe.

The default log drivers of container runtimes such as Docker and containerd may split long log lines at a byte boundary instead of a rune boundary.
As a result, a multi-byte Unicode character may be split across two log entries.
For example, suppose the application writes the following log:
```
<content with 16KB -1 bytes>方...
```
The UTF-8 encoding of "方" is `\xE6\x96\xB9`.

The output when running under containerd:
```
2025-01-01T00:00:00.000000000+00:00 stdout P <content with 16KB -1 bytes>\xE6
2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9...
```
- Note: `\x96` denotes a single byte with value 0x96.

The output when running under Docker:
```
{"log":"<content with 16KB -1 bytes>\ufffd","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
{"log":"\ufffd\ufffd...","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
```
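To illustrate why the Docker case looks unrecoverable, here is a minimal Go sketch that decodes lines shaped like the ones above. The `dockerEntry` type and `decodeLog` helper are hypothetical names used only for illustration, and `abc` stands in for the long prefix. After JSON decoding, the `log` field holds the three-byte encoding of U+FFFD (`ef bf bd`), so the original `\xE6` byte is no longer present.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dockerEntry mirrors the three fields visible in the Docker lines above.
// (Hypothetical type for illustration only.)
type dockerEntry struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

// decodeLog extracts the "log" field from one JSON log line.
func decodeLog(line string) (string, error) {
	var e dockerEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return "", err
	}
	return e.Log, nil
}

func main() {
	line := `{"log":"abc\ufffd","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}`
	log, err := decodeLog(line)
	if err != nil {
		panic(err)
	}
	// The trailing bytes are the encoding of U+FFFD, not the original 0xE6.
	fmt.Printf("% x\n", log)
}
```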

The collected message:
```
<content with 16KB -1 bytes>���...
```
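The difference between decoding each fragment separately and merging the raw bytes first can be reproduced with a minimal Go sketch. The fragment contents (`abc` in place of the long prefix) and the helper names `decodeEach` and `mergeThenDecode` are illustrative assumptions, not part of the collector.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// toValid converts raw bytes to a string, turning every invalid byte
// into U+FFFD. Converting through []rune replaces each invalid byte
// individually, which matches the three replacement characters shown above.
func toValid(b []byte) string {
	return string([]rune(string(b)))
}

// decodeEach sanitizes every fragment on its own and then joins the results.
func decodeEach(parts [][]byte) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(toValid(p))
	}
	return sb.String()
}

// mergeThenDecode joins the raw bytes first and sanitizes only afterwards,
// so a character split across fragments is restored.
func mergeThenDecode(parts [][]byte) string {
	return toValid(bytes.Join(parts, nil))
}

func main() {
	parts := [][]byte{
		[]byte("abc\xE6"),   // partial fragment ending with the first byte of 方
		[]byte("\x96\xB9"), // next fragment starting with the remaining two bytes
	}
	fmt.Println(decodeEach(parts))      // abc followed by three U+FFFD
	fmt.Println(mergeThenDecode(parts)) // abc方
}
```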

### Describe the solution you'd like

For Docker, it seems that nothing can be done, because the split bytes already appear as `\ufffd` in the log file.
But for containerd (and possibly CRI-O), we should try to merge the bytes to recover the original character.

- First, when decoding, all invalid bytes are replaced with �. This can be prevented by setting the encoding to `nop`. Maybe this should be the default behavior.
- Then, the regex parser, which is used by the container parser, returns an error because it only supports strings. We should add support for `[]byte`, including invalid UTF-8 bytes.
- Finally, after the container parser recombines the lines, we should replace any invalid bytes that remain, so that the final output is a valid UTF-8 string and other operators do not run into errors.
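The three steps above could look roughly like the following Go sketch. It is not the stanza implementation: the `criLine` pattern and the `recombine` function are hypothetical, and the pattern only covers the line layout shown in the containerd example. It relies on Go's `regexp` package accepting `[]byte` input and on `bytes.ToValidUTF8` for the final cleanup.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// criLine is a hypothetical pattern for lines of the form
// "<time> <stream> <P|F> <content>". Matching on []byte returns
// subslices of the input, so invalid UTF-8 bytes are preserved.
var criLine = regexp.MustCompile(`(?s)^\S+ (stdout|stderr) ([PF]) (.*)$`)

// recombine joins the content of consecutive partial ("P") lines with the
// closing full ("F") line, working on raw bytes throughout.
// A trailing partial line without a closing "F" line is ignored in this sketch.
func recombine(lines [][]byte) ([]string, error) {
	var out []string
	var buf bytes.Buffer
	for _, line := range lines {
		m := criLine.FindSubmatch(line)
		if m == nil {
			return nil, fmt.Errorf("unexpected line format: %q", line)
		}
		buf.Write(m[3])
		if string(m[2]) == "F" {
			// Sanitize only after merging, so split characters are restored
			// and any remaining invalid bytes become U+FFFD.
			valid := bytes.ToValidUTF8(buf.Bytes(), []byte("\uFFFD"))
			out = append(out, string(valid))
			buf.Reset()
		}
	}
	return out, nil
}

func main() {
	lines := [][]byte{
		[]byte("2025-01-01T00:00:00.000000000+00:00 stdout P abc\xE6"),
		[]byte("2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9def"),
	}
	logs, err := recombine(lines)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", logs)
}
```

Sanitizing only at the end keeps the output valid UTF-8 for downstream operators while still giving the recombination step a chance to restore split characters.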

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

If a Unicode character is split by container runtime, we should merge it when recombining #39653
