Description
Component(s)
exporter/loadbalancing
Is your feature request related to a problem? Please describe.
I discovered, when trying to configure an OTel Collector to use the loadbalancing exporter to ship to another OTel Collector deployment, that it fails when Istio is involved.
Currently, the DNS lookups made by the loadbalancing exporter fail to work with Istio services because the resolver does an A record lookup rather than accepting an SRV record.
When the loadbalancing exporter is set up to talk to a k8s headless service, it makes an A record lookup and then uses the pod IP address as the hostname, which fails because Istio doesn't allow routing via pod IPs.
This is true even if you use the pod hostname:
2023-02-06T22:23:31.510Z warn zapgrpc/zapgrpc.go:195 [core] [Channel #9 SubChannel #10] grpc: addrConn.createTransport failed to connect to {
"Addr": "10.17.57.172:4137",
"ServerName": "10.17.57.172:4137",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: read tcp 10.17.161.21:48420->10.17.57.172:4137: read: connection reset by peer" {"grpc_log": true}
Istio proxy logs from the request being made:
{"upstream_host":"10.17.141.119:4137","response_flags":"UF,URX","response_code":0,"istio_policy_status":null,"upstream_cluster":"InboundPassthroughClusterIpv4","requested_server_name":null,"request_id":null,"route_name":null,"orig_path":null,"downstream_local_address":"10.17.141.119:4137","downstream_remote_address":"10.17.161.13:60252","authority":null,"method":null,"cf_ray":null,"x_forwarded_for":null,"bytes_sent":0,"bytes_received":0,"user_agent":null,"upstream_service_time":null,"protocol":null,"b3_parentspan_id":null,"w3c_traceparent":null,"ml_loadtest":null,"upstream_local_address":null,"b3_span_id":null,"path":null,"ml_faulttest":null,"start_time":"2023-02-06T22:00:30.915Z","b3_trace_id":null,"duration":1,"upstream_transport_failure_reason":null,"response_code_details":null}
Example of loadbalancing config:
loadbalancing:
  routing_key: traceID
  protocol:
    otlp:
      timeout: 1s
      tls:
        insecure: true
  resolver:
    dns:
      hostname: otel-sampler-headless
      port: 4137
It works properly if you configure the receiving otel-collector as a StatefulSet and then use each pod name, because the SRV record will come back matching.
Describe the solution you'd like
Support an option similar to https://thanos.io/tip/thanos/service-discovery.md/#dns-service-discovery, where you can set +dnssrvnoa, so that we can use Istio with a Deployment of the receiving otel-collector.
thanos-io/thanos@432785e is the Thanos code that does this.
The general ask is that the loadbalancing resolver performs an SRV lookup and then treats the returned hostnames as if they had been filled into the static resolver section.
Showing the difference:
dig +short _tcp-otlp._tcp.otel-sampler.big-brother.svc.cluster.local
otel-sampler-2.otel-sampler.big-brother.svc.cluster.local.
10.17.135.68
10.17.141.147
10.17.163.65
# Even with a DEPLOYMENT
dig +short _grpc-otlp._tcp.otel-collector-headless.big-brother.svc.cluster.local SRV
10 25 4317 3430636431363132.otel-collector-headless.big-brother.svc.cluster.local.
10 25 4317 3131383736653061.otel-collector-headless.big-brother.svc.cluster.local.
10 25 4317 3936663563343666.otel-collector-headless.big-brother.svc.cluster.local.
10 25 4317 3636643866356166.otel-collector-headless.big-brother.svc.cluster.local.
Notice the SRV records in the second response; this is what the OTel resolver should accept as an alternative to the A record.
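A minimal sketch of the proposed behaviour, using Go's standard resolver and the SRV record shown above (not the exporter's actual code): resolve the SRV name and then treat the returned per-pod hostnames exactly like entries in the static resolver's hostnames list.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	// SRV lookup for _grpc-otlp._tcp.otel-collector-headless.big-brother.svc.cluster.local
	_, srvs, err := net.LookupSRV("grpc-otlp", "tcp",
		"otel-collector-headless.big-brother.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	endpoints := make([]string, 0, len(srvs))
	for _, srv := range srvs {
		host := strings.TrimSuffix(srv.Target, ".")
		endpoints = append(endpoints, fmt.Sprintf("%s:%d", host, srv.Port))
	}
	// endpoints now holds per-pod hostnames such as
	// 3430636431363132.otel-collector-headless.big-brother.svc.cluster.local:4317,
	// equivalent to hand-written entries in the static resolver below.
	fmt.Println(endpoints)
}
```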
Describe alternatives you've considered
Running as a StatefulSet works, but doesn't autoscale properly. Right now we have to make do with manually listing out the pods in the StatefulSet:
resolver:
  static:
    hostnames:
      - otel-sampler-0.otel-sampler.big-brother.svc.cluster.local:4317
      - otel-sampler-1.otel-sampler.big-brother.svc.cluster.local:4317
      - otel-sampler-2.otel-sampler.big-brother.svc.cluster.local:4317
Additional context
No response