Support a DNSSRVNOA Option on the loadbalancing Exporter #18412

Closed
@knechtionscoding

Description

Component(s)

exporter/loadbalancing

Is your feature request related to a problem? Please describe.

While configuring an OTel Collector to use the loadbalancing exporter to ship to another OTel Collector deployment, I discovered that it fails when Istio is involved.

Currently, DNS requests made by the loadbalancing exporter fail to work with Istio services because the exporter does an A record lookup rather than accepting an SRV record.

When a loadbalancing exporter is set up to talk to a k8s headless service, it makes an A record lookup, which means it uses the pod IP address as the host name. That fails because Istio doesn't allow routing via pod IPs.

This is true even if you use the pod hostname:

2023-02-06T22:23:31.510Z	warn	zapgrpc/zapgrpc.go:195	[core] [Channel #9 SubChannel #10] grpc: addrConn.createTransport failed to connect to {
  "Addr": "10.17.57.172:4137",
  "ServerName": "10.17.57.172:4137",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: read tcp 10.17.161.21:48420->10.17.57.172:4137: read: connection reset by peer"	{"grpc_log": true}

Istio proxy logs from the request being made:

{"upstream_host":"10.17.141.119:4137","response_flags":"UF,URX","response_code":0,"istio_policy_status":null,"upstream_cluster":"InboundPassthroughClusterIpv4","requested_server_name":null,"request_id":null,"route_name":null,"orig_path":null,"downstream_local_address":"10.17.141.119:4137","downstream_remote_address":"10.17.161.13:60252","authority":null,"method":null,"cf_ray":null,"x_forwarded_for":null,"bytes_sent":0,"bytes_received":0,"user_agent":null,"upstream_service_time":null,"protocol":null,"b3_parentspan_id":null,"w3c_traceparent":null,"ml_loadtest":null,"upstream_local_address":null,"b3_span_id":null,"path":null,"ml_faulttest":null,"start_time":"2023-02-06T22:00:30.915Z","b3_trace_id":null,"duration":1,"upstream_transport_failure_reason":null,"response_code_details":null}

Example of loadbalancing config:

  loadbalancing:
      routing_key: traceID
      protocol:
        otlp:
          timeout: 1s
          tls:
            insecure: true
      resolver:
        dns:
          hostname: otel-sampler-headless
          port: 4137

It works properly if you configure the receiving otel-collector as a StatefulSet and then use each pod name, because the SRV record will come back matching.

Describe the solution you'd like

Support an option similar to https://thanos.io/tip/thanos/service-discovery.md/#dns-service-discovery, where you can set +dnssrvnoa, so that Istio can be used with a Deployment of the receiving otel-collector.

thanos-io/thanos@432785e is the thanos code that does this.

The general ask is that the loadbalancing exporter resolves the SRV records and then acts as if the resulting hostnames had been filled into the static section.

Showing the difference:

dig +short _tcp-otlp._tcp.otel-sampler.big-brother.svc.cluster.local
otel-sampler-2.otel-sampler.big-brother.svc.cluster.local.
10.17.135.68
10.17.141.147
10.17.163.65
# Even with a DEPLOYMENT
dig +short _grpc-otlp._tcp.otel-collector-headless.big-brother.svc.cluster.local SRV
10 25 4317 3430636431363132.otel-collector-headless.big-brother.svc.cluster.local.
10 25 4317 3131383736653061.otel-collector-headless.big-brother.svc.cluster.local.
10 25 4317 3936663563343666.otel-collector-headless.big-brother.svc.cluster.local.
10 25 4317 3636643866356166.otel-collector-headless.big-brother.svc.cluster.local.

Notice the SRV argument at the end of the second command: the SRV answers it returns are what OTel should accept as an alternative to the A record.

Describe alternatives you've considered

Running as a StatefulSet works, but doesn't autoscale properly. Right now we have to make do with manually listing out the pods in the StatefulSet:

      resolver:
        static:
          hostnames:
          - otel-sampler-0.otel-sampler.big-brother.svc.cluster.local:4317
          - otel-sampler-1.otel-sampler.big-brother.svc.cluster.local:4317
          - otel-sampler-2.otel-sampler.big-brother.svc.cluster.local:4317

Additional context

No response
