Skip to content

Replacing lambda functions with regular functions in all datasets #1718

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 11, 2022

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented May 11, 2022

Replacing lambda functions with regular functions in all datasets because lambda functions are not serializable by pickle and serialization is necessary for DataLoader's multiprocessing feature.

If the changes are correct, it should not break any CI.

Fixes part of pytorch/data#397 (TorchData team plans to improve DataLoader as well, see the comments within the issue)

Fixes #1716

Copy link
Contributor

@parmeet parmeet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NivekT . LGTM! Feel free to land once the CI passes w.r.t datasets.

@NivekT NivekT merged commit dfb53af into pytorch:main May 11, 2022
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry that I didn't have chance to look at this PR before landing. We need to replace local functions by global functions.

Comment on lines +55 to +60
def _filepath_fn(_=None):
return os.path.join(root, split + ".csv")

def _modify_res(t):
return int(t[0]), " ".join(t[1:])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use global function rather than local function here. Local functions are still non-picklable for pickle module.

Comment on lines +61 to +72
def _filepath_fn(_=None):
return os.path.join(root, _PATH)

def _extracted_filepath_fn(_=None):
return os.path.join(root, _EXTRACTED_FILES[split])

def _filter_fn(x):
return _EXTRACTED_FILES[split] in x[0]

def _modify_res(t):
return int(t[0]), " ".join(t[1:])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@parmeet
Copy link
Contributor

parmeet commented May 16, 2022

I am sorry that I didn't have chance to look at this PR before landing. We need to replace local functions by global functions.

Oops Thanks for the catch @ejguan. I think we should add make amendment in our datatsets test suit (here ) about pickability. cc: @Nayef211

@NivekT
Copy link
Contributor Author

NivekT commented May 16, 2022

Let me fix this with another PR. Thanks for catching it.

@Nayef211
Copy link
Contributor

Oops Thanks for the catch @ejguan. I think we should add make amendment in our datatsets test suit (here ) about pickability. cc: @Nayef211

@parmeet would you mind clarifying what you mean by "make amendment in our datatsets test suite"?

@parmeet
Copy link
Contributor

parmeet commented May 16, 2022

@parmeet would you mind clarifying what you mean by "make amendment in our datatsets test suite"?

I meant to add tests to make sure we we can dump and load dataset objects using pickle.

so for instance below code give error:

from torchtext.datasets import AG_NEWS
import pickle
dp = AG_NEWS(split='train')
pickle.dump(dp, open("temp.pkl",'w'))

Error: AttributeError: Can't pickle local object 'AG_NEWS.._filepath_fn'

And in our tests we could just add another unit test that would dump and load the datapipe..

@Nayef211
Copy link
Contributor

I meant to add tests to make sure we we can dump and load dataset objects using pickle.

so for instance below code give error:

from torchtext.datasets import AG_NEWS
import pickle
dp = AG_NEWS(split='train')
pickle.dump(dp, open("temp.pkl",'w'))

Error: AttributeError: Can't pickle local object 'AG_NEWS.._filepath_fn'

And in our tests we could just add another unit test that would dump and load the datapipe..

Gotcha thanks for clarifying. Let me create a PR to add these tests to our datasets!

@parmeet
Copy link
Contributor

parmeet commented May 17, 2022

Let me create a PR to add these tests to our datasets!

Thanks @Nayef211 . One suggestion here would be to add a common test parameterized over all the datasets added to the repo instead of adding it to individual test suit :)

@Nayef211
Copy link
Contributor

Thanks @Nayef211 . One suggestion here would be to add a common test parameterized over all the datasets added to the repo instead of adding it to individual test suit :)

Thanks for the advice, I'll try to add a parameterized test for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Torchtext datasets failed during multi-processing due to lambda functions usage
5 participants