Skip to content

[extension/awslogs_encoding] Add support for VPC flow logs in plain text file format #38897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Apr 1, 2025

Conversation

constanca-m
Copy link
Contributor

@constanca-m constanca-m commented Mar 24, 2025

Description

Add support for VPC flow logs sent to S3 in plain text format.

Link to tracking issue

Fixes #38896.

Testing

There are new unit tests added.

Documentation

Comments in the code and unit tests should be enough.

Fields are mapped this way:

Flow log field Attribute Available? Currently supported?
version aws.vpc.flow.log.version 🔴
account-id cloud.account.id 🟢
interface-id aws.eni.id 🔴
srcaddr source.address or network.peer.address 🔴
pkt-srcaddr source.address 🟢
dstaddr destination.address or network.peer.address 🔴
pkt-dstaddr destination.address 🟢
srcport source.port 🟢
dstport destination.port 🟢
protocol network.protocol.name 🟢
packets aws.vpc.flow.packets 🔴
bytes aws.vpc.flow.bytes 🔴
start aws.vpc.flow.start 🔴
end log timestamp N/A
action aws.vpc.flow.action 🔴
log-status aws.vpc.flow.status 🔴
vpc-id aws.vpc.id 🔴
subnet-id aws.vpc.subnet.id 🔴
instance-id host.id 🔴
tcp-flags network.tcp.flags 🔴
type network.type 🟢
region cloud.region 🟢
az-id aws.az.id 🔴
sublocation-type aws.sublocation.type 🔴
sublocation-id aws.sublocation.id 🔴
pkt-src-aws-service aws.vpc.flow.source.service 🔴
pkt-dst-aws-service aws.vpc.flow.destination.service 🔴
flow-direction network.io.direction 🟢
traffic-path aws.vpc.flow.traffic_path 🔴
ecs-cluster-arn aws.ecs.cluster.arn 🟢 🔴
ecs-cluster-name aws.ecs.cluster.name 🔴 🔴
ecs-container-instance-arn aws.ecs.container.instance.arn 🔴 🔴
ecs-container-instance-id aws.ecs.container.instance.id 🔴 🔴
ecs-container-id aws.ecs.container.id 🔴 🔴
ecs-second-container-id aws.ecs.second.container.arn 🔴 🔴
ecs-service-name aws.ecs.service.name 🔴 🔴
ecs-task-definition-arn aws.ecs.task.definition.arn 🔴 🔴
ecs-task-arn aws.ecs.task.arn 🟢 🔴
ecs-task-id aws.ecs.task.id 🟢 🔴
reject-reason aws.vpc.flow.reject_reason 🔴

formatVPCFlowLog = "vpc_flow_log"

fileFormatPlainText = "plain-text"
fileFormatParquet = "parquet"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotta document those somewhere eventually. Not in this PR, it's fine.

Copy link
Contributor

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but I think we should be translating fields to semantic conventions, and considering additions to SemConv where necessary.

Copy link
Contributor

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, nearly there.

Comment on lines +67 to +83
var errGzipReader error
gzipReader, ok := v.gzipPool.Get().(*gzip.Reader)
if !ok {
gzipReader, errGzipReader = gzip.NewReader(bytes.NewReader(content))
} else {
errGzipReader = gzipReader.Reset(bytes.NewReader(content))
}
if errGzipReader != nil {
if gzipReader != nil {
v.gzipPool.Put(gzipReader)
}
return plog.Logs{}, fmt.Errorf("failed to decompress content: %w", errGzipReader)
}
defer func() {
_ = gzipReader.Close()
v.gzipPool.Put(gzipReader)
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now, but at some point we should look at extracting this into some common code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Too many unmarshalers having this :)

Comment on lines 289 to 292
case "srcaddr":
record.Attributes().PutStr("source.layer.address", value)
case "dstaddr":
record.Attributes().PutStr("destination.layer.address", value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default format only includes srcaddr and dstaddr, and not pkt-srcaddr or pkt-dstaddr. I think in that case we should still set the standard SemConv fields where we can. Also, as mentioned in the other thread, I think if all 4 are available then we should set https://opentelemetry.io/docs/specs/semconv/attributes-registry/network/#network-peer-address depending on the traffic direction.

Unfortunately the default format doesn't include traffic direction either, otherwise I think we could use that to pick the relevant address for source.address or destination.address, and the other for network.peer.address.

The only other option I can think of is: if pkt-srcaddr and pkt-dstaddr are not present, use srcaddr and dstaddr for source.address and destination.address. May be incorrect in the presence of NAT, but we can document this and how to resolve it (i.e. by including pkt-*addr).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default format only includes srcaddr and dstaddr, and not pkt-srcaddr or pkt-dstaddr.

We aren't settling for the default format. Even a custom format is included in the first line. That's what we are following, so I think we should handle this case already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we'll handle all the formats, what I meant is that a user may have configured a VPC flow log with just the default format, in which case pkt-* won't be present.

// protocolNames are needed to know the name of the protocol number given by the field
// protocol in a flow log record.
// See https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml.
var protocolNames = [143]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the indices are all correct now. https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml says "nsh" is 145, but we only have 143 entries.

What I was suggesting before is to use this syntax:

var protocolNames = [...]string{
  0: "hopopt",
  // etc.
  255: "reserved",
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you are absolutely correct 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then the easier way would be to have a map[int]string, don't you think? Otherwise we need empty values for numbers that do not map to any protocol name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a fair tradeoff when there's only 256 entries.

Copy link
Contributor Author

@constanca-m constanca-m Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added empty entries. I used a unit test to compare that everything was still correct (not committed). If the protocol is (empty), then the attribute protocol name will be empty. I decided to still have the protocol attribute for those cases. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, but I think the syntax I used above is a little safer. You don't have to specify a value for every index that way - the unspecified ones will get an empty string. See https://go.dev/play/p/RKYse-UDfM6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unaware this existed. Thanks!!

Copy link
Contributor

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nits. Thanks for your persistence! 😅

constanca-m and others added 3 commits March 28, 2025 08:21
…er/vpc-flow-log/vpc_flow_log_unmarshaler.go

Co-authored-by: Andrew Wilkins <[email protected]>
…er/vpc-flow-log/vpc_flow_log_unmarshaler.go

Co-authored-by: Andrew Wilkins <[email protected]>
@constanca-m
Copy link
Contributor Author

constanca-m commented Mar 28, 2025

I added a benchmark test yesterday as well. All good there @axw ? It benchmarks unmarshalPlainTextLogs and not UnmarshalLogs, because the gzip took a lot of the statistics to itself.

Copy link
Contributor

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@constanca-m I'm glad you asked about the benchmark. It looked fine, but I just ran it locally and found the time of each of the sizes was the same, which is surprising - they should be 10x as expensive at each level.

Copy link
Contributor

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I think the 80MB test is probably unnecessary. It might be useful to have 8K and 8MB in case there's any constant per-file overhead, which we should see amortised by larger files. We can drop the 80MB one later though, it's fine.

@constanca-m
Copy link
Contributor Author

Alright, I don't mind dropping the 80MB file now, if you prefer? Which quantity of logs would fit the benchmark best? @axw

@axw
Copy link
Contributor

axw commented Mar 28, 2025

@constanca-m perhaps just two extremes: 1, and 1000. With 1 we'll see the maximum per-file overhead , and with 1000 it should be drowned out.

@constanca-m
Copy link
Contributor Author

constanca-m commented Mar 28, 2025

I'm going to add some documentation in an upcoming PR when we get more clarity on all fields. For now it's in the description of the PR, so at least it won't be lost.

Can you merge this PR @atoulme once you have the time please? Thank you!

@constanca-m
Copy link
Contributor Author

@axw I have removed the strings.split like you mentioned, and replaced with strings.Index. It worked much better:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/extension/encoding/awslogsencodingextension/internal/unmarshaler/vpc-flow-log
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                                             │ current.txt │            use-index.txt            │
                                             │   sec/op    │   sec/op     vs base                │
UnmarshalUnmarshalPlainTextLogs/1_log-16       2.265µ ± 2%   2.195µ ± 1%   -3.09% (p=0.000 n=10)
UnmarshalUnmarshalPlainTextLogs/1000_logs-16   660.7µ ± 1%   585.4µ ± 0%  -11.40% (p=0.000 n=10)
geomean                                        38.68µ        35.84µ        -7.34%

                                             │ current.txt  │            use-index.txt             │
                                             │     B/op     │     B/op      vs base                │
UnmarshalUnmarshalPlainTextLogs/1_log-16       5.617Ki ± 0%   5.398Ki ± 0%   -3.89% (p=0.000 n=10)
UnmarshalUnmarshalPlainTextLogs/1000_logs-16   756.4Ki ± 0%   537.6Ki ± 0%  -28.92% (p=0.000 n=10)
geomean                                        65.18Ki        53.87Ki       -17.35%

                                             │ current.txt │           use-index.txt            │
                                             │  allocs/op  │  allocs/op   vs base               │
UnmarshalUnmarshalPlainTextLogs/1_log-16        28.00 ± 0%    27.00 ± 0%  -3.57% (p=0.000 n=10)
UnmarshalUnmarshalPlainTextLogs/1000_logs-16   11.03k ± 0%   10.03k ± 0%  -9.07% (p=0.000 n=10)
geomean                                         555.7         520.3       -6.36%

Thanks for another great suggestion!

Copy link
Contributor

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@constanca-m there are several other performance improvements we can make, let's get this merged as it is and defer any more changes to followups please.

@constanca-m
Copy link
Contributor Author

@atoulme Sorry for persistence, can we get this merged?

@songy23 songy23 merged commit 39ad979 into open-telemetry:main Apr 1, 2025
171 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 1, 2025
dmathieu pushed a commit to dmathieu/opentelemetry-collector-contrib that referenced this pull request Apr 8, 2025
…ext file format (open-telemetry#38897)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

Add support for VPC flow logs sent to S3 in plain text format.

<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes open-telemetry#38896.

<!--Describe what testing was performed and which tests were added.-->
#### Testing

There are new unit tests added.

<!--Describe the documentation added.-->
#### Documentation

Comments in the code and unit tests should be enough.

Fields are mapped this way:
| Flow log field | Attribute | Available? | Currently supported? |

|------------------------------|------------------------------------|----------------|----------------------|
| `version` | `aws.vpc.flow.log.version` | 🔴 | |
| `account-id` | `cloud.account.id` | 🟢 | |
| `interface-id` | `aws.eni.id` | 🔴 | |
| `srcaddr` | `source.address` or `network.peer.address` | 🔴
| |
| `pkt-srcaddr` | `source.address` | 🟢 | |
| `dstaddr` | `destination.address` or `network.peer.address` |
🔴 | |
| `pkt-dstaddr` | `destination.address` | 🟢 | |
| `srcport` | `source.port` | 🟢 | |
| `dstport` | `destination.port` | 🟢 | |
| `protocol` | `network.protocol.name` | 🟢 | |
| `packets` | `aws.vpc.flow.packets` | 🔴 | |
| `bytes` | `aws.vpc.flow.bytes` | 🔴 | |
| `start` | `aws.vpc.flow.start` | 🔴 | |
| `end` | log timestamp | N/A | |
| `action` | `aws.vpc.flow.action` | 🔴 | |
| `log-status` | `aws.vpc.flow.status` | 🔴 | |
| `vpc-id` | `aws.vpc.id` | 🔴 | |
| `subnet-id` | `aws.vpc.subnet.id` | 🔴 | |
| `instance-id` | `host.id` | 🔴 | |
| `tcp-flags` | `network.tcp.flags` | 🔴 | |
| `type` | `network.type` | 🟢 | |
| `region` | `cloud.region` | 🟢 | |
| `az-id` | `aws.az.id` | 🔴 | |
| `sublocation-type` | `aws.sublocation.type` | 🔴 | |
| `sublocation-id` | `aws.sublocation.id` | 🔴 | |
| `pkt-src-aws-service` | `aws.vpc.flow.source.service` | 🔴 |
|
| `pkt-dst-aws-service` | `aws.vpc.flow.destination.service` |
🔴 | |
| `flow-direction` | `network.io.direction` | 🟢 | |
| `traffic-path` | `aws.vpc.flow.traffic_path` | 🔴 | |
| `ecs-cluster-arn` | `aws.ecs.cluster.arn` | 🟢 |
🔴 |
| `ecs-cluster-name` | `aws.ecs.cluster.name` | 🔴 |
🔴 |
| `ecs-container-instance-arn` | `aws.ecs.container.instance.arn` |
🔴 | 🔴 |
| `ecs-container-instance-id` | `aws.ecs.container.instance.id` |
🔴 | 🔴 |
| `ecs-container-id` | `aws.ecs.container.id` | 🔴 |
🔴 |
| `ecs-second-container-id` | `aws.ecs.second.container.arn` |
🔴 | 🔴 |
| `ecs-service-name` | `aws.ecs.service.name` | 🔴 |
🔴 |
| `ecs-task-definition-arn` | `aws.ecs.task.definition.arn` |
🔴 | 🔴 |
| `ecs-task-arn` | `aws.ecs.task.arn` | 🟢 | 🔴 |
| `ecs-task-id` | `aws.ecs.task.id` | 🟢 | 🔴 |
| `reject-reason` | `aws.vpc.flow.reject_reason` | 🔴 | |

---------

Co-authored-by: Andrew Wilkins <[email protected]>
Fiery-Fenix pushed a commit to Fiery-Fenix/opentelemetry-collector-contrib that referenced this pull request Apr 24, 2025
…ext file format (open-telemetry#38897)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

Add support for VPC flow logs sent to S3 in plain text format.

<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes open-telemetry#38896.

<!--Describe what testing was performed and which tests were added.-->
#### Testing

There are new unit tests added.

<!--Describe the documentation added.-->
#### Documentation

Comments in the code and unit tests should be enough.

Fields are mapped this way:
| Flow log field | Attribute | Available? | Currently supported? |

|------------------------------|------------------------------------|----------------|----------------------|
| `version` | `aws.vpc.flow.log.version` | 🔴 | |
| `account-id` | `cloud.account.id` | 🟢 | |
| `interface-id` | `aws.eni.id` | 🔴 | |
| `srcaddr` | `source.address` or `network.peer.address` | 🔴
| |
| `pkt-srcaddr` | `source.address` | 🟢 | |
| `dstaddr` | `destination.address` or `network.peer.address` |
🔴 | |
| `pkt-dstaddr` | `destination.address` | 🟢 | |
| `srcport` | `source.port` | 🟢 | |
| `dstport` | `destination.port` | 🟢 | |
| `protocol` | `network.protocol.name` | 🟢 | |
| `packets` | `aws.vpc.flow.packets` | 🔴 | |
| `bytes` | `aws.vpc.flow.bytes` | 🔴 | |
| `start` | `aws.vpc.flow.start` | 🔴 | |
| `end` | log timestamp | N/A | |
| `action` | `aws.vpc.flow.action` | 🔴 | |
| `log-status` | `aws.vpc.flow.status` | 🔴 | |
| `vpc-id` | `aws.vpc.id` | 🔴 | |
| `subnet-id` | `aws.vpc.subnet.id` | 🔴 | |
| `instance-id` | `host.id` | 🔴 | |
| `tcp-flags` | `network.tcp.flags` | 🔴 | |
| `type` | `network.type` | 🟢 | |
| `region` | `cloud.region` | 🟢 | |
| `az-id` | `aws.az.id` | 🔴 | |
| `sublocation-type` | `aws.sublocation.type` | 🔴 | |
| `sublocation-id` | `aws.sublocation.id` | 🔴 | |
| `pkt-src-aws-service` | `aws.vpc.flow.source.service` | 🔴 |
|
| `pkt-dst-aws-service` | `aws.vpc.flow.destination.service` |
🔴 | |
| `flow-direction` | `network.io.direction` | 🟢 | |
| `traffic-path` | `aws.vpc.flow.traffic_path` | 🔴 | |
| `ecs-cluster-arn` | `aws.ecs.cluster.arn` | 🟢 |
🔴 |
| `ecs-cluster-name` | `aws.ecs.cluster.name` | 🔴 |
🔴 |
| `ecs-container-instance-arn` | `aws.ecs.container.instance.arn` |
🔴 | 🔴 |
| `ecs-container-instance-id` | `aws.ecs.container.instance.id` |
🔴 | 🔴 |
| `ecs-container-id` | `aws.ecs.container.id` | 🔴 |
🔴 |
| `ecs-second-container-id` | `aws.ecs.second.container.arn` |
🔴 | 🔴 |
| `ecs-service-name` | `aws.ecs.service.name` | 🔴 |
🔴 |
| `ecs-task-definition-arn` | `aws.ecs.task.definition.arn` |
🔴 | 🔴 |
| `ecs-task-arn` | `aws.ecs.task.arn` | 🟢 | 🔴 |
| `ecs-task-id` | `aws.ecs.task.id` | 🟢 | 🔴 |
| `reject-reason` | `aws.vpc.flow.reject_reason` | 🔴 | |

---------

Co-authored-by: Andrew Wilkins <[email protected]>
@constanca-m constanca-m deleted the text-plain-vpc-flow-log branch May 14, 2025 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for VPC flow log in plain text format
5 participants