🚀 Feature
Motivation
- To avoid the pitfall of shuffling and sharding datapipes in the wrong order in distributed training environments.
- To ensure a consistent experience with TorchData-based datasets across domains.
Pitch
TorchText datasets return datapipes. To perform distributed computing, users typically apply a sharding filter so that the data is sharded across ranks. To avoid shuffling data only within the corresponding shards, it is important that the sharding filter is applied after shuffling. As per the investigations from TorchVision, this ordering is not always obvious to users and can lead to suboptimal results if not done properly.
We could do this by simply wrapping the datapipe at the very end of the dataset implementation:
def MyDataSet(...):
    dp = ...
    # Shuffle first; shuffling is disabled by default and is toggled
    # via DataLoader(shuffle=True) at load time.
    dp = dp.shuffle().set_shuffle(False)
    # Shard across ranks/workers only after shuffling.
    dp = dp.sharding_filter()
    return dp
When users want to shuffle the dataset, they simply set shuffle=True in the DataLoader. Furthermore, since the sharding filter is already applied, users do not have to call it explicitly when doing distributed training.
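As a minimal sketch of the resulting user-facing workflow (assuming a TorchText dataset such as AG_NEWS and the standard torch.utils.data.DataLoader; batch size and worker count are arbitrary):

from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

# The dataset pipeline already ends with shuffle().set_shuffle(False).sharding_filter(),
# so shuffling is enabled simply by passing shuffle=True to the DataLoader,
# and no extra sharding_filter() call is needed for distributed training.
train_dp = AG_NEWS(split="train")
loader = DataLoader(train_dp, batch_size=8, shuffle=True, num_workers=2)

for label, text in loader:
    ...  # training step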
Alternatives
Keep the dataset implementations as they are and educate users (through tutorials and documentation) to perform shuffling before sharding, as in the sketch below.
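Under this alternative, each user would need to apply the two steps manually in the correct order, roughly as in this sketch (dataset choice and parameters are illustrative):

from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

train_dp = AG_NEWS(split="train")
# Shuffle must come before sharding; reversing the order would shuffle
# only within each rank's shard.
train_dp = train_dp.shuffle()
train_dp = train_dp.sharding_filter()
loader = DataLoader(train_dp, batch_size=8, num_workers=2)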
Additional context
- In addition to ensuring that shuffling is always done before sharding, this also has the benefit that shuffling can happen before the datapipe contains heavy objects (like images), since the shuffle datapipe maintains an internal buffer of data items. Hence, for vision datasets this is more than just a convenience/helper utility.
- If shuffling and sharding are done internally, we must document the usage so that users do not apply shuffling and sharding again.
- We also want users to have a similar experience across domains, and hence we should provide consistent solutions for common pitfalls.
cc: @NicolasHug , @ejguan , @kevinchn , @Nayef211 , @abhinavarora , @VirgileHlav