Footgun with futures unordered #131

@nikomatsakis

Brief summary

@farnz opened a really interesting issue rust-lang/futures-rs#2387:

This would make a great status quo story!

We've found a nasty footgun when we use FuturesUnordered (or buffered etc) to get concurrency from a set of futures.

Because FuturesUnordered only polls its contents when it is itself polled, futures lurking in the queue can go a long time between polls, even though no individual future spends a long time in poll(). This causes issues in two cases:

  1. When interfacing with an external system via the network; if you take a result from the stream with while let Some(res) = stream.next().await and then spend significant wall-clock time inside the loop body (even if very little CPU time is involved, because you're awaiting another network service), you can hit the external system's timeouts and fail unexpectedly.

  2. When using an async-friendly semaphore (like the one Tokio provides), you can deadlock yourself: the futures waiting in the FuturesUnordered own all the semaphore permits, while an item in a .for_each() block after buffer_unordered() needs a permit to make progress (a sketch of this deadlock follows this list).
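
Here is a minimal sketch of that second failure mode, assuming a Tokio runtime and tokio::sync::Semaphore; the permit count, buffer width, and sleep duration are illustrative rather than taken from the original report:

use std::{sync::Arc, time::Duration};
use futures::stream::{self, StreamExt};
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Two permits, shared between the buffered futures and the loop body.
    let sem = Arc::new(Semaphore::new(2));

    stream::iter(0..3u32)
        .map(|i| {
            let sem = Arc::clone(&sem);
            async move {
                // Each buffered future holds a permit across an await point and
                // only releases it once it has been polled to completion.
                let _permit = sem.acquire_owned().await.unwrap();
                tokio::time::sleep(Duration::from_millis(10)).await;
                i
            }
        })
        .buffer_unordered(3)
        .for_each(|i| {
            let sem = Arc::clone(&sem);
            async move {
                // The loop body also needs a permit, but both permits are held
                // by (or already promised to) futures sitting unpolled in the
                // buffer. While this await is pending the buffer is never
                // polled, so those permits are never released: deadlock.
                let _permit = sem.acquire_owned().await.unwrap();
                println!("processed item {i}");
            }
        })
        .await;
}

Run under Tokio, this prints nothing and hangs: the permit released by the first completed future is handed to the third future still queued in the buffer (Tokio's semaphore is FIFO-fair), that future is never polled again, and the loop body's acquire never succeeds.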

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=f58e77ba077b40eba40636a4e32b5710 shows the effect. Naïvely, you'd expect all the 10 to 20ms sleep futures to complete in under 40ms, and the 100ms sleep futures to take 100ms to 200ms. However, while the sleep threads complete on the expected timescale and send the wakeup to the future that spawned them, some of the short async sleep_for futures take over 100ms to complete: the thread has signalled them to wake up, but the loop is .awaiting a long sleep future and does not get round to polling the stream again for some time.
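
The same effect can be reproduced without explicit threads; here is a small self-contained version, assuming a Tokio runtime and using tokio::time::sleep both for the short futures and for the slow work in the loop body (durations are illustrative):

use std::time::{Duration, Instant};
use futures::stream::{FuturesUnordered, StreamExt};
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    let start = Instant::now();

    // Five short sleeps; naively you would expect all of them to be observed
    // as complete within roughly 20ms.
    let mut short_sleeps: FuturesUnordered<_> = (0..5u64)
        .map(|i| async move {
            sleep(Duration::from_millis(10 + 2 * i)).await;
            i
        })
        .collect();

    while let Some(i) = short_sleeps.next().await {
        println!("short sleep {i} observed after {:?}", start.elapsed());
        // The loop body awaits something slow. While this await is pending,
        // nothing polls `short_sleeps`, so the remaining short sleeps are only
        // observed ~100ms (or more) later, even though their timers fired
        // long ago.
        sleep(Duration::from_millis(100)).await;
    }
}

Each short sleep after the first is reported roughly 100ms later than the one before it, mirroring the over-100ms completions in the playground output.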

We've found this in practice with things where the loop body is "nice" in the sense that it doesn't run for very long inside its poll function, but the total time spent in the loop body is large. The futures being polled by FuturesUnordered do:

async fn do_select<T>(database: &Database, query: Query) -> Result<Vec<T>> {
    let conn = database.get_conn().await?;
    conn.select_query(query).await
}

and the main work looks like:

async fn do_work(database: &Database) -> Result<()> {
    // Find the work items, then run up to 5 queries concurrently.
    let work = do_select(database, FIND_WORK_QUERY).await?;
    stream::iter(work.into_iter())
        .map(|item| do_select(database, work_from_item(item)))
        .buffered(5)
        .for_each(|work_item| do_giant_work(work_item))
        .await;
    Ok(())
}

do_giant_work can take 20 seconds of wall-clock time for big work items. It's possible for get_conn to open the connection (which has a 10 second idle timeout) for each Future in the buffered set, send the first handshake packet, and then return Poll::Pending as it waits for the reply. When the first of the 5 in the buffered set returns Poll::Ready(item), the code then runs do_giant_work, which takes 20 seconds. While do_giant_work is in control, nothing re-polls the buffered set of Futures, so the idle timeout kicks in server-side and all four of the other open connections get dropped, because we've opened a connection and then never completed the handshake.

We can mitigate the problem by using spawn_with_handle to ensure that the do_select work happens whenever the do_giant_work Future awaits something, but this behaviour has surprised my team more than once (despite enough experience to diagnose this after the fact).
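
Here is a sketch of that mitigation against the hypothetical API above, assuming a Tokio runtime (with Send + 'static data) and using tokio::spawn in place of spawn_with_handle, plus an Arc<Database> so that the spawned futures can be 'static; error handling is elided:

use std::sync::Arc;
use futures::stream::{self, StreamExt};

async fn do_work_spawned(database: Arc<Database>) -> Result<()> {
    let work = do_select(&database, FIND_WORK_QUERY).await?;
    stream::iter(work.into_iter())
        .map(|item| {
            // Spawn each query as its own task: the runtime keeps driving it
            // even while do_giant_work monopolises this task. The spawned
            // future must be 'static, hence the Arc clone per item.
            let database = Arc::clone(&database);
            tokio::spawn(async move { do_select(&database, work_from_item(item)).await })
        })
        .buffered(5)
        .for_each(|spawned| async move {
            // A JoinHandle resolves to Result<_, JoinError>; errors from the
            // query or from a panicked task are ignored in this sketch.
            if let Ok(Ok(work_item)) = spawned {
                do_giant_work(work_item).await;
            }
        })
        .await;
    Ok(())
}

The buffered set now only holds JoinHandles, so even if do_giant_work blocks this task for 20 seconds, the spawned do_select tasks keep making progress on the executor.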

I'm not sure that a perfect technical solution is possible; the issue is that FuturesUnordered is a sub-executor driven by the main executor, and if it is not polled, it can't poll its own set of pending futures. Meanwhile, the external code is under no obligation to poll the FuturesUnordered in a timely fashion. Spawning the futures before putting them in the sub-executor works because the main executor then drives them, and the sub-executor merely picks up the final results, but futures have to be 'static to be spawned.

Optional details

  • (Optional) Which character(s) would be the best fit and why?
    • Alan: the experienced "GC'd language" developer, new to Rust
    • Grace: the systems programming expert, new to Rust
    • Niklaus: new programmer from an unconventional background
    • Barbara: the experienced Rust developer
  • (Optional) Which project(s) would be the best fit and why?
    • List some projects here.
  • (Optional) What are the key points or morals to emphasize?
    • Write some morals here.
