OTel-Arrow receiver admission control broken in several ways #36074

Closed
@jmacd

Component(s)

receiver/otelarrow

What happened?

Description

Several aspects of the OTel-Arrow admission control mechanism are broken or not working as intended.

  1. As a matter of design, admission.BoundedQueue carries a source of complexity that forces it to expose fallible APIs. The Arrow admission path calls Acquire twice, once with the compressed size and once with the uncompressed size. This leads to error handling that could be avoided if it were not for the complication.
  2. The fallible APIs as used in the OTLP code path (internal/{traces,logs,metrics}) return control before the call to obsrecv is finished, so nothing is reported when admission control fails. This is a major bug.
  3. There is a race condition in the context-cancelled exit path from Acquire(): the waiter may already have been admitted by a concurrent call to Release(), in which case the semaphore can leak. This is a minor bug.
  4. The semaphore obeys FIFO discipline, but not intentionally. The internal-to-Lightstep code on which this is modeled uses LIFO for reasons documented here. This is not working as intended.

Proposed solution

First, eliminate the complication that necessitates fallible APIs. The problem is the two calls to Acquire(), once with the compressed size and once with the uncompressed size. Because the compressed size is typically so much smaller than the uncompressed size, the advantage of these two Acquire calls does not outweigh the complexity cost.

Therefore, we can eliminate fallible APIs from the admission package. This may be done by having Acquire() return a closure that performs the correct release, which greatly reduces the potential for misuse.

The bounded queue implementation should transition to LIFO. Avoiding fallible APIs, transitioning to LIFO, and fixing the race condition together amount to a substantial change. The BoundedQueue tests will be completely rewritten.

Finally, the OTel-Arrow receiver should perform admission control Acquire() once after it computes the uncompressed size of the request, meaning it will stop using the otlp-pdata-size header. The OTel-Arrow exporter should continue to emit this header while older receivers are still in use, but it can be removed eventually.

Collector version

v0.111.0

Environment information

Environment

Any.

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

Labels

bug
