Add Shuffle and sharding datapipes to datasets #1729


Merged
parmeet merged 3 commits into pytorch:main from shuffle_shard on May 18, 2022

Conversation

@parmeet
Contributor

parmeet commented May 17, 2022

Reference Issue: #1727

@parmeet
Contributor Author

parmeet commented May 17, 2022

@ejguan, @kevinchn It seems the unit tests on Linux are failing due to the following error:
ImportError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /root/project/env/lib/python3.7/site-packages/torchdata/_torchdata.so)

I wonder if anything has changed recently? cc: @Nayef211

@parmeet parmeet requested review from vcm2114 and Nayef211 May 17, 2022 13:32
@ejguan
Contributor

ejguan commented May 17, 2022

I enabled the AWS extension last night. It seems the C extension was compiled against glibc 2.29. Will do a quick fix.

@NicolasHug
Member

NicolasHug left a comment

Nice, LGTM.

Perhaps it might make sense to add a small test that makes sure all datapipes come with a sharding filter and a shuffler? We have that in torchvision:

https://github.com/pytorch/vision/blob/08c8f0e0b68195a4a5a21cdd3f86106f59c2e854/test/test_prototype_builtin_datasets.py#L140-L146

@parmeet
Contributor Author

parmeet commented May 17, 2022

Perhaps it might make sense to add a small test that makes sure all datapipes come with a sharding filter and a shuffler?

Thanks @NicolasHug for the suggestion. I think it is indeed a good idea. Let me add it as well.

@vcm2114
Contributor

vcm2114 left a comment

Do we expect shuffle and sharding to vary between datasets? For now, all datasets use .shuffle().set_shuffle(False).sharding_filter(). If not, we could instead use a decorator to wrap this call on the returned object and clean up the code.
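
(For illustration, a minimal sketch of such a decorator — the name wrap_shuffle_shard and the tuple handling are assumptions, not part of this PR:)

import functools

def wrap_shuffle_shard(fn):
    # Hypothetical decorator: appends the shuffle/shard datapipes to
    # whatever datapipe(s) the dataset function returns.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if isinstance(result, tuple):  # some datasets return one datapipe per split
            return tuple(dp.shuffle().set_shuffle(False).sharding_filter() for dp in result)
        return result.shuffle().set_shuffle(False).sharding_filter()
    return wrapper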

@NicolasHug
Member

Good point @VirgileHlav.

In general you want to shard and shuffle on light objects (before decoding, before transforms) to avoid unnecessary computations, and to save memory. For now torchtext datasets yield light objects (simple text), but maybe in the future this will change?

In torchvision, we wrap in different places for each dataset, so using a decorator isn't an option.
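
(To make the ordering concrete, a hedged sketch — decode and transform below are placeholder stages, not torchtext APIs:)

# Shard and shuffle while items are still light (raw text), so the heavy
# per-sample work below runs only on each worker's own shard.
dp = dp.shuffle().set_shuffle(False).sharding_filter()
dp = dp.map(decode)     # placeholder: e.g. decode bytes into a sample
dp = dp.map(transform)  # placeholder: e.g. tokenize / augment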

@vcm2114
Contributor

vcm2114 left a comment

@NicolasHug thanks for the clarification; in that case, let's keep it as is for now. @parmeet I will also add this to the dataset effort in #1710.

Otherwise LGTM

@parmeet
Contributor Author

parmeet commented May 18, 2022

Thanks @VirgileHlav, SGTM! As @NicolasHug mentioned, it could vary from dataset to dataset. If in some cases processing is needed at the sample level, you could shard the pipe before that processing; otherwise, adding it at the end is fine.

Comment on lines +1 to +25
from parameterized import parameterized
from torch.utils.data.graph import traverse
from torch.utils.data.graph_settings import get_all_graph_pipes
from torchdata.datapipes.iter import Shuffler, ShardingFilter
from torchtext.datasets import DATASETS

from ..common.torchtext_test_case import TorchtextTestCase


class TestShuffleShardDatasetWrapper(TorchtextTestCase):
    # Note: TorchData will provide a linter warning for ordering (i.e., shuffle before sharding).
    # Modify this test once the linter warning is available.
    @parameterized.expand(list(DATASETS.items()))
    def test_shuffle_shard_wrapper(self, dataset_name, dataset_fn):
        dp = dataset_fn()
        # Some datasets return a tuple of datapipes (one per split); normalize to a list.
        if isinstance(dp, tuple):
            dp = list(dp)
        else:
            dp = [dp]

        for dp_split in dp:
            dp_graph = get_all_graph_pipes(traverse(dp_split))
            for annotation_dp_type in [Shuffler, ShardingFilter]:
                if not any(isinstance(dp, annotation_dp_type) for dp in dp_graph):
                    raise AssertionError(f"The dataset doesn't contain a {annotation_dp_type.__name__}() datapipe.")
Contributor Author

@Nayef211 Just FYI, in case we can do something similar for pickle :). @ejguan I left a comment to update the test once linter warnings are available; it is not a blocker for landing this PR.

cc: @NicolasHug

Contributor

Thanks. I will do a fix for the manylinux1 wheel first, then add the linter for you.

@parmeet parmeet merged commit 2a712f4 into pytorch:main May 18, 2022
@parmeet parmeet deleted the shuffle_shard branch May 18, 2022 17:01
class TestShuffleShardDatasetWrapper(TorchtextTestCase):
    # Note: TorchData will provide a linter warning for ordering (i.e., shuffle before sharding).
    # Modify this test once the linter warning is available.
    @parameterized.expand(list(DATASETS.items()))
Contributor

QQ: it looks like we're not using dataset_name anywhere. Why don't we just pass dataset_fn to the test by doing something like

@parameterized.expand(list(DATASETS.values()))

Contributor Author

Ahh my bad, thanks for the catch. Will fix it!

Contributor

I actually realized that the parameterized.expand decorator complains when passing in the list of dataset_fn values. Let me know if you're able to figure out how to resolve the error.

Contributor Author

Yupp, the problem is we need to pass tuples inside the list. Just created a PR to fix it: #1733
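
(For context, a minimal sketch of the fix, assuming the standard parameterized API where each parameter set is a tuple; the actual change is in #1733:)

# parameterized.expand expects an iterable of argument tuples;
# wrapping each dataset_fn in a one-element tuple avoids the error.
@parameterized.expand([(fn,) for fn in DATASETS.values()])
def test_shuffle_shard_wrapper(self, dataset_fn):
    ...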

Contributor

Awesome. Just incorporated this in my PR #1732

facebook-github-bot pushed a commit to pytorch/data that referenced this pull request May 20, 2022
Summary:
This PR introduces a linter function to validate that there is a shuffle operation before sharding. (Required by TorchText in pytorch/text#1729)

- When `sharding_filter` is not present in the graph, this function always returns `True`
- For a single-path graph, `shuffle` needs to be placed before `sharding_filter`
- For a multi-path graph, every `sharding_filter` requires a `shuffle` before it along its path

This linter function won't check whether there are multiple `sharding_filter` instances in the graph or whether `sharding_filter` is in the right place.
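
(A minimal sketch of what such a check could look like, reusing the traverse helper from the test above; this is illustrative and assumes traverse returns a nested dict mapping each datapipe to its upstream datapipes — it is not the actual pytorch/data implementation:)

from torch.utils.data.graph import traverse
from torchdata.datapipes.iter import Shuffler, ShardingFilter

def _has_shuffler(subgraph):
    # True if any datapipe in this upstream subgraph is a Shuffler.
    return any(
        isinstance(dp, Shuffler) or _has_shuffler(parents)
        for dp, parents in subgraph.items()
    )

def shuffle_before_sharding(subgraph):
    # Every ShardingFilter must have a Shuffler somewhere upstream of it.
    # If no ShardingFilter exists, this trivially returns True.
    for dp, parents in subgraph.items():
        if isinstance(dp, ShardingFilter) and not _has_shuffler(parents):
            return False
        if not shuffle_before_sharding(parents):
            return False
    return True

# Usage sketch:
# assert shuffle_before_sharding(traverse(datapipe))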

Pull Request resolved: #429

Reviewed By: NivekT

Differential Revision: D36529167

Pulled By: ejguan

fbshipit-source-id: 56e734eac98b2ddadcd7707ee92ea4032a896969