Add Shuffle and sharding datapipes to datasets #1729


Merged
parmeet merged 3 commits into pytorch:main from shuffle_shard on May 18, 2022

Conversation

@parmeet
Contributor

parmeet commented May 17, 2022

Reference Issue: #1727

@parmeet
Contributor Author

parmeet commented May 17, 2022

@ejguan, @kevinchn It seems the unit tests on Linux are failing due to the following error:
ImportError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /root/project/env/lib/python3.7/site-packages/torchdata/_torchdata.so)

I wonder if anything has changed recently? cc: @Nayef211

@parmeet parmeet requested review from vcm2114 and Nayef211 May 17, 2022 13:32
@ejguan
Contributor

ejguan commented May 17, 2022

I enabled the AWS extension last night. It seems the C extension was compiled against glibc 2.29. Will do a quick fix.

@NicolasHug
Member

NicolasHug left a comment

Nice, LGTM.

Perhaps it might make sense to add a small test that makes sure all datapipes come with a sharding filter and a shuffler? We have that in torchvision:

https://github.com/pytorch/vision/blob/08c8f0e0b68195a4a5a21cdd3f86106f59c2e854/test/test_prototype_builtin_datasets.py#L140-L146

@parmeet
Contributor Author

parmeet commented May 17, 2022

Perhaps it might make sense to add a small test that makes sure all datapipes come with a sharding filter and a shuffler?

Thanks @NicolasHug for the suggestion. I think it is indeed a good idea. Let me add it as well.

@vcm2114
Contributor

vcm2114 left a comment

Do we expect shuffle and sharding to vary between datasets? For now, all datasets use .shuffle().set_shuffle(False).sharding_filter(). If not, we could instead use a decorator to wrap this call on the returned object and clean up the code.
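
(For illustration, a minimal sketch of such a decorator — the name wrap_shuffle_shard and the tuple handling are assumptions, not part of this PR:)

import functools

def wrap_shuffle_shard(fn):
    # Hypothetical decorator: appends the shuffle/shard datapipes to
    # whatever datapipe(s) the dataset function returns.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if isinstance(result, tuple):  # some datasets return one datapipe per split
            return tuple(dp.shuffle().set_shuffle(False).sharding_filter() for dp in result)
        return result.shuffle().set_shuffle(False).sharding_filter()
    return wrapper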

@NicolasHug
Member

Good point @VirgileHlav.

In general you want to shard and shuffle on light objects (before decoding, before transforms) to avoid unnecessary computations, and to save memory. For now torchtext datasets yield light objects (simple text), but maybe in the future this will change?

In torchvision, we wrap in different places for each dataset, so using a decorator isn't an option.
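
(To make the ordering concrete, a hedged sketch — decode and transform below are placeholder stages, not torchtext APIs:)

# Shard and shuffle while items are still light (raw text), so the heavy
# per-sample work below runs only on each worker's own shard.
dp = dp.shuffle().set_shuffle(False).sharding_filter()
dp = dp.map(decode)     # placeholder: e.g. decode bytes into a sample
dp = dp.map(transform)  # placeholder: e.g. tokenize / augment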

@vcm2114
Contributor

vcm2114 left a comment

@NicolasHug thanks for the clarification; in that case, let's keep it as is for now. @parmeet I will also add this to the dataset effort in #1710.

Otherwise LGTM

@parmeet
Contributor Author

parmeet commented May 18, 2022

Thanks @VirgileHlav, SGTM! As @NicolasHug mentioned, it could vary from dataset to dataset. If in some cases processing is needed at the sample level, you could shard the pipe before that processing; otherwise, adding it at the end is fine.

Comment on lines +1 to +25
from parameterized import parameterized
from torch.utils.data.graph import traverse
from torch.utils.data.graph_settings import get_all_graph_pipes
from torchdata.datapipes.iter import Shuffler, ShardingFilter
from torchtext.datasets import DATASETS

from ..common.torchtext_test_case import TorchtextTestCase


class TestShuffleShardDatasetWrapper(TorchtextTestCase):
    # Note: TorchData will provide a linter warning for ordering (i.e., shuffle before sharding).
    # Modify this test once the linter warning is available.
    @parameterized.expand(list(DATASETS.items()))
    def test_shuffle_shard_wrapper(self, dataset_name, dataset_fn):
        dp = dataset_fn()
        # Some datasets return a tuple of datapipes (one per split); normalize to a list.
        if isinstance(dp, tuple):
            dp = list(dp)
        else:
            dp = [dp]

        for dp_split in dp:
            dp_graph = get_all_graph_pipes(traverse(dp_split))
            for annotation_dp_type in [Shuffler, ShardingFilter]:
                if not any(isinstance(dp, annotation_dp_type) for dp in dp_graph):
                    raise AssertionError(f"The dataset doesn't contain a {annotation_dp_type.__name__}() datapipe.")
Contributor Author

@Nayef211 Just FYI, in case we can do something similar for pickle :). @ejguan I left a comment to update the test once linter warnings are available; it is not a blocker for landing this PR.

cc: @NicolasHug

Contributor

Thanks. I will do a fix for the manylinux1 wheel first, then add the linter for you.

@parmeet parmeet merged commit 2a712f4 into pytorch:main May 18, 2022
@parmeet parmeet deleted the shuffle_shard branch May 18, 2022 17:01
class TestShuffleShardDatasetWrapper(TorchtextTestCase):
    # Note: TorchData will provide a linter warning for ordering (i.e., shuffle before sharding).
    # Modify this test once the linter warning is available.
    @parameterized.expand(list(DATASETS.items()))
Contributor

QQ: it looks like we're not using dataset_name anywhere. Why don't we just pass dataset_fn to the test by doing something like

@parameterized.expand(list(DATASETS.values()))

Contributor Author

Ahh my bad, thanks for the catch. Will fix it!

Contributor

I actually realized that the parameterized.expand decorator complains when passing in the list of dataset_fn values. Let me know if you're able to figure out how to resolve the error.

Contributor Author

Yupp, the problem is we need to pass tuples inside the list. Just created a PR to fix it: #1733
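
(For context, a minimal sketch of the fix, assuming the standard parameterized API where each parameter set is a tuple; the actual change is in #1733:)

# parameterized.expand expects an iterable of argument tuples;
# wrapping each dataset_fn in a one-element tuple avoids the error.
@parameterized.expand([(fn,) for fn in DATASETS.values()])
def test_shuffle_shard_wrapper(self, dataset_fn):
    ...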

Contributor

Awesome. Just incorporated this in my PR #1732

facebook-github-bot pushed a commit to pytorch/data that referenced this pull request May 20, 2022
Summary:
This PR introduces a linter function to validate that there is a shuffle operation before sharding. (Required by TorchText in pytorch/text#1729)

- When `sharding_filter` is not present in the graph, this function always returns `True`
- For a single-path graph, `shuffle` needs to be placed before `sharding_filter`
- For a multi-path graph, every `sharding_filter` requires a `shuffle` before it along its path

This linter function won't check whether there are multiple `sharding_filter` instances in the graph or whether `sharding_filter` is in the right place.
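
(A minimal sketch of what such a check could look like, reusing the traverse helper from the test above; this is illustrative and assumes traverse returns a nested dict mapping each datapipe to its upstream datapipes — it is not the actual pytorch/data implementation:)

from torch.utils.data.graph import traverse
from torchdata.datapipes.iter import Shuffler, ShardingFilter

def _has_shuffler(subgraph):
    # True if any datapipe in this upstream subgraph is a Shuffler.
    return any(
        isinstance(dp, Shuffler) or _has_shuffler(parents)
        for dp, parents in subgraph.items()
    )

def shuffle_before_sharding(subgraph):
    # Every ShardingFilter must have a Shuffler somewhere upstream of it.
    # If no ShardingFilter exists, this trivially returns True.
    for dp, parents in subgraph.items():
        if isinstance(dp, ShardingFilter) and not _has_shuffler(parents):
            return False
        if not shuffle_before_sharding(parents):
            return False
    return True

# Usage sketch:
# assert shuffle_before_sharding(traverse(datapipe))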

Pull Request resolved: #429

Reviewed By: NivekT

Differential Revision: D36529167

Pulled By: ejguan

fbshipit-source-id: 56e734eac98b2ddadcd7707ee92ea4032a896969