Closed
🚀 The feature
Similar to #1044 (thanks @ejguan!), I propose adding a new datapipe that uses `ThreadPoolExecutor` to multithread mapping.
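For illustration, a minimal sketch of what such a datapipe might do internally (the `threaded_map` name is hypothetical, not an existing torchdata API):

```python
from concurrent.futures import ThreadPoolExecutor


def threaded_map(fn, iterable, max_workers=4):
    """Map fn over iterable using a thread pool.

    Executor.map preserves input order, which is what a drop-in
    multithreaded Mapper replacement would want.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # yield lazily so this composes with other iterable datapipes
        yield from pool.map(fn, iterable)


results = list(threaded_map(lambda x: x * x, range(5)))
```

This only parallelizes the function calls themselves; it is mainly useful for I/O-bound `fn` because of the GIL.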
Motivation, pitch
Speed up mapping by using multithreading.
Alternatives
Three possible implementations come to my mind.

1. Similar to Implement BatchAsyncMapper #1044: construct batches, use `Executor.map()`, and then unbatch again. One disadvantage of this is that the first item can only be returned once all operations in the batch have finished. This may change in a future Python version; see "Make Executor.map work with infinite/large inputs correctly" (python/cpython#74028) and "bpo-29842: Make Executor.map less eager so it handles large/unbounded…" (python/cpython#18566).
2. Only allow batches as input and apply the operation to each element in the batch, then return the processed batch.
3. Use `concurrent.futures.as_completed` with a parameter like `scheduled_tasks` to schedule a finite number of tasks. This would return results as soon as they are completed, but would not preserve order.
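Option 3 could be sketched roughly as follows (the `threaded_map_unordered` helper and its `scheduled_tasks` parameter are illustrative assumptions, not an existing API):

```python
from concurrent.futures import (
    FIRST_COMPLETED,
    ThreadPoolExecutor,
    as_completed,
    wait,
)


def threaded_map_unordered(fn, iterable, scheduled_tasks=4, max_workers=4):
    """Keep at most `scheduled_tasks` futures in flight at once.

    Results are yielded as soon as they complete, so input order is
    not preserved, but memory stays bounded even for infinite inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = set()
        for item in iterable:
            pending.add(pool.submit(fn, item))
            if len(pending) >= scheduled_tasks:
                # Block until at least one task finishes, then drain it.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    yield fut.result()
        # Input exhausted: drain the remaining futures as they complete.
        for fut in as_completed(pending):
            yield fut.result()


results = sorted(threaded_map_unordered(lambda x: x * x, range(10)))
```

Bounding the number of in-flight futures is what distinguishes this from naively submitting everything up front, which would not work for large or infinite datapipes.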
Which option do you prefer? We could, of course, also implement more than one, e.g. both options 1 and 3.
Additional context
I am not sure how (if at all) the `ThreadPoolExecutor` interferes or interacts with the multiprocessing used in the DataLoader.