Regarding adding shuffling and sharding datapipes to in-built datasets #1727

@parmeet

🚀 Feature

Motivation

  • To avoid pitfalls with shuffling and sharding of datapipes in distributed training environments
  • To ensure consistent experience of TorchData based datasets across domains.

Pitch

TorchText datasets return datapipes. For distributed training, users typically apply a sharding filter to shard the data across ranks. To avoid shuffling data only within each shard, the sharding filter must be applied after shuffling. As per the investigations from TorchVision, this is not always obvious to users, and doing it in the wrong order could lead to suboptimal results.

We could do this by simply wrapping the datapipe at the very end:

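def MyDataSet(...):
    dp = ...
    # Add the shuffler but leave it disabled; shuffle=True in DataLoader turns it on.
    dp = dp.shuffle().set_shuffle(False)
    # Shard after shuffling so that shuffling is not confined to a single shard.
    dp = dp.sharding_filter()
    return dp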

When users want to shuffle the dataset, they would simply set shuffle=True in DataLoader. Furthermore, since the sharding filter is already applied, users do not have to call it explicitly when doing distributed training.
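
To illustrate why the order matters, here is a minimal sketch in plain Python. It does not use the TorchData API: the helpers `shuffle`, `shard`, `shuffle_then_shard`, and `shard_then_shuffle` are hypothetical stand-ins, and it assumes round-robin sharding and a per-epoch seed shared by all ranks.

```python
import random


def shuffle(items, seed):
    """Return a shuffled copy of items using a seeded RNG."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


def shard(items, rank, world_size):
    """Round-robin sharding: keep every world_size-th item, starting at rank."""
    return [x for i, x in enumerate(items) if i % world_size == rank]


def shuffle_then_shard(data, rank, world_size, epoch):
    # All ranks use the same seed for a given epoch (an assumption of this
    # sketch), so they agree on the permutation before splitting it.
    return shard(shuffle(data, seed=epoch), rank, world_size)


def shard_then_shuffle(data, rank, world_size, epoch):
    # Each rank first takes its fixed slice and only shuffles within it.
    return shuffle(shard(data, rank, world_size), seed=epoch)


data = list(range(8))
world_size = 2

for epoch in range(3):
    good = [shuffle_then_shard(data, r, world_size, epoch) for r in range(world_size)]
    bad = [shard_then_shuffle(data, r, world_size, epoch) for r in range(world_size)]
    print(f"epoch {epoch}: shuffle->shard {good} | shard->shuffle {bad}")
```

With shard-then-shuffle, rank 0 only ever sees the even-indexed items, whatever the epoch. With shuffle-then-shard, the ranks still cover the data without overlap, but the items assigned to each rank can change from epoch to epoch.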

Alternatives

Keep the dataset implementations as they are and educate users (through tutorials and documentation) to shuffle before sharding.

Additional context

  • In addition to ensuring that shuffling always happens before sharding, this has the benefit that shuffling can be done before the datapipe contains heavy objects (like images), since the shuffle datapipe creates an internal buffer to hold the data items it shuffles. Hence, for vision datasets, this is more than just a convenience utility.
  • If shuffling and sharding are done internally, we must document the usage so that users do not apply shuffling and sharding again.
  • We also want to ensure that users have similar experiences across domains and hence should have consistent solutions for common pitfalls.
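
The first point above, about shuffling before the datapipe holds heavy objects, can be sketched with a small buffer-based shuffle in plain Python. This is only an illustration under stated assumptions: `buffered_shuffle` is a hypothetical simplification and not the actual shuffle datapipe, and `fake_decode` is a made-up stand-in for image decoding that attaches a 1 KiB payload to each item.

```python
import random


def buffered_shuffle(items, buffer_size, seed, stats):
    """Yield items in shuffled order using a fixed-size buffer.

    stats["peak_bytes"] records the largest total payload held in the buffer.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        stats["peak_bytes"] = max(stats["peak_bytes"], sum(len(x) for x in buffer))
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


def fake_decode(path):
    # Hypothetical stand-in for decoding an image: path plus a 1 KiB payload.
    return path.encode() + bytes(1024)


paths = [f"img_{i:03d}.jpg" for i in range(100)]

# Shuffle the lightweight paths first, then decode.
early = {"peak_bytes": 0}
early_out = [fake_decode(p) for p in buffered_shuffle(paths, 16, seed=0, stats=early)]

# Decode first, then shuffle the heavy objects.
late = {"peak_bytes": 0}
late_out = list(buffered_shuffle((fake_decode(p) for p in paths), 16, seed=0, stats=late))

print("buffer peak when shuffling paths:  ", early["peak_bytes"])
print("buffer peak when shuffling decoded:", late["peak_bytes"])
```

Both orders produce the same set of decoded items, but the buffer only has to hold short path strings when shuffling happens first.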

cc: @NicolasHug, @ejguan, @kevinchn, @Nayef211, @abhinavarora, @VirgileHlav
