Describe the bug
The persistent queue removes items from storage after they're successfully exported. This removal happens in a transaction which also updates the list of currently dispatched items. Depending on the implementation details of the underlying storage, this transaction may fail if the storage device is full.
As a result, items can still be taken out of the queue, but they are never actually removed from storage. No space is freed, so no new items can be put in.
Steps to reproduce
See the unit test in the linked PR.
Additional context
I've confirmed that filestorage can behave this way via the following test: open-telemetry/opentelemetry-collector-contrib@dbe3105. I suspect that this will be true of any transactional storage engine, as some amount of transaction data needs to be persisted to disk before it can be committed.
How often this can happen in practice is difficult to estimate. It depends heavily on how the size of queue items aligns with available disk space. Anecdotally, I've seen it happen during an incident, on a volume with multiple queues sharing space.