Description
Component(s)
connector/servicegraph
What happened?
Description
The `failed` label fails to distinguish successful edges from failed edges in servicegraph metrics.
I found these strange graph metrics:
The two metrics have the same label set except for `failed`, yet their values stay very close over six hours, which should be impossible.
Besides, `traces_service_graph_request_total` contains a label `failed=true`, which also looks like a bug.
After reading the component code, I found that the `failed` dimension is not included in the `metricKey`:
opentelemetry-collector-contrib/connector/servicegraphconnector/connector.go
Lines 605 to 618 in 0a12ede
As a result, the component picks up the wrong label set when it collects metrics:
I wrote a unit test to demonstrate it: test failed label not work
In the unit test, I simulate trace data for two services, foo and bar. foo calls bar three times: two calls succeed and one fails.
I expect these sample traces to generate the following graph metrics:
traces_service_graph_request_total{client="foo", server="bar", connection_type="", failed="false"} 2
traces_service_graph_request_total{client="foo", server="bar", connection_type="", failed="true"} 1
...
However, the component produces the following incorrect metrics:
resourceMetrics:
  - resource: {}
    scopeMetrics:
      - metrics:
          - name: traces_service_graph_request_total
            sum:
              aggregationTemporality: 2
              dataPoints:
                - asInt: "3"
                  attributes:
                    - key: client
                      value:
                        stringValue: foo
                    - key: connection_type
                      value:
                        stringValue: ""
                    - key: failed
                      value:
                        boolValue: false
                    - key: server
                      value:
                        stringValue: bar
                  startTimeUnixNano: "1000000"
                  timeUnixNano: "2000000"
              isMonotonic: true
          - name: traces_service_graph_request_failed_total
            sum:
              aggregationTemporality: 2
              dataPoints:
                - asInt: "1"
                  attributes:
                    - key: client
                      value:
                        stringValue: foo
                    - key: connection_type
                      value:
                        stringValue: ""
                    - key: failed
                      value:
                        boolValue: false
                    - key: server
                      value:
                        stringValue: bar
                  startTimeUnixNano: "1000000"
                  timeUnixNano: "2000000"
              isMonotonic: true
In detail, the key problem is that the `metricKey` omits the `failed` label, so the same key can refer to different label sets in some cases.
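The collision can be illustrated with a minimal sketch. The `edge` struct and both key helpers below are hypothetical stand-ins, not the connector's actual code; they only assume that the key is built by concatenating the client, server, and connection type, as the `foobar` example in this issue suggests.

```go
package main

import (
	"fmt"
	"strconv"
)

// edge is a hypothetical stand-in for the connector's edge type.
type edge struct {
	ClientService  string
	ServerService  string
	ConnectionType string
	Failed         bool
}

// keyWithoutFailed mimics a key that ignores the failed dimension.
func keyWithoutFailed(e edge) string {
	return e.ClientService + e.ServerService + e.ConnectionType
}

// keyWithFailed appends the failed flag so both outcomes get distinct keys.
func keyWithFailed(e edge) string {
	return keyWithoutFailed(e) + strconv.FormatBool(e.Failed)
}

func main() {
	ok := edge{ClientService: "foo", ServerService: "bar"}
	bad := edge{ClientService: "foo", ServerService: "bar", Failed: true}

	// Same key for both edges: their label sets share one map slot.
	fmt.Println(keyWithoutFailed(ok) == keyWithoutFailed(bad))
	// Different keys: each label set gets its own slot.
	fmt.Println(keyWithFailed(ok) == keyWithFailed(bad))
}
```

Appending the flag is only one possible way to make the key unique; any scheme that encodes every dimension of the label set would avoid the collision.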
I can demonstrate it:
First, assume this is the first span to pass through the connector, and an edge finishes with these values (without error): e.ClientService=foo, e.ServerService=bar, e.ConnectionType=, e.Failed=false
Its `metricKey` will be `foobar`, and that key refers to its dimensions (stored in the `keyToMetric` map): {client:foo, server:bar, connection_type: , failed: false}
At this point, `reqTotal` will be {"foobar": 1}. After collecting metrics, the resulting metrics will be:
traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: false} 1
...
Then, the second edge finishes with these values (with an error): e.ClientService=foo, e.ServerService=bar, e.ConnectionType=, e.Failed=true
This edge also generates the key `foobar`, and its dimensions will be {client:foo, server:bar, connection_type: , failed: true}
After this step, the value for `foobar` in `keyToMetric` is overwritten, and this is where the bug occurs. Now `reqTotal` is {"foobar": 2} and `reqFailedTotal` is {"foobar": 1}. After collection, the metrics will be:
traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: true} 2
traces_service_graph_request_failed_total{client:foo, server:bar, connection_type: , failed: true} 1
...
In the metrics backend, you will see:
traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: false} 1
traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: true} 2
traces_service_graph_request_failed_total{client:foo, server:bar, connection_type: , failed: true} 1
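The sequence above can be reproduced with a small simulation. This is a sketch under assumptions: the `simulator` type, its `record` and `collect` methods, and the series formatting are hypothetical and only model the maps named in this issue (`keyToMetric`, `reqTotal`, `reqFailedTotal`); they are not the connector's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// edge is a hypothetical, simplified edge with only the fields used here.
type edge struct {
	Client, Server, ConnType string
	Failed                   bool
}

// simulator models the three maps described in this issue.
type simulator struct {
	keyToMetric    map[string]string // key -> rendered label set
	reqTotal       map[string]int
	reqFailedTotal map[string]int
}

func newSimulator() *simulator {
	return &simulator{
		keyToMetric:    map[string]string{},
		reqTotal:       map[string]int{},
		reqFailedTotal: map[string]int{},
	}
}

// record aggregates one finished edge under a key that ignores e.Failed.
func (s *simulator) record(e edge) {
	key := e.Client + e.Server + e.ConnType
	// The label set is overwritten on every edge, so the last edge wins.
	s.keyToMetric[key] = fmt.Sprintf(
		"{client:%s, server:%s, connection_type:%s, failed:%t}",
		e.Client, e.Server, e.ConnType, e.Failed)
	s.reqTotal[key]++
	if e.Failed {
		s.reqFailedTotal[key]++
	}
}

// collect renders the current series, sorted for stable output.
func (s *simulator) collect() []string {
	var out []string
	for key, n := range s.reqTotal {
		out = append(out, fmt.Sprintf(
			"traces_service_graph_request_total%s %d", s.keyToMetric[key], n))
	}
	for key, n := range s.reqFailedTotal {
		out = append(out, fmt.Sprintf(
			"traces_service_graph_request_failed_total%s %d", s.keyToMetric[key], n))
	}
	sort.Strings(out)
	return out
}

func main() {
	s := newSimulator()

	s.record(edge{Client: "foo", Server: "bar"}) // successful edge
	fmt.Println(s.collect())

	s.record(edge{Client: "foo", Server: "bar", Failed: true}) // failed edge
	fmt.Println(s.collect())
}
```

In this model the first collection reports the total with `failed:false`, while the second reports the same counter with `failed:true`. A backend that treats each label set as its own series would therefore keep both, which matches the three series listed above.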
Collector version
main
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response