Huggingface all options #952

Closed
wants to merge 6 commits into from

Contributor

@SvenDS9 SvenDS9 commented Jan 19, 2023

Fixes #944

Changes

  • Changed the test setup for HuggingFaceHubReader: the tests no longer run against production but instead ensure that load_dataset (from HuggingFace) is called with the correct parameters
  • Include HuggingFaceHubReader in documentation

@ejguan could you please have a look
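
The testing approach described in the change list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual test: `read_from_hub` is a hypothetical stand-in for the reader, and the loader is injected as an argument so the example runs without the `datasets` package or network access (a real test would more likely patch the loader with `unittest.mock.patch`).

```python
from unittest.mock import MagicMock


def read_from_hub(load_dataset, path, **config_kwargs):
    """Hypothetical stand-in for the reader: forward all options to the loader."""
    dataset = load_dataset(path=path, **config_kwargs)
    return iter(dataset)


def run_reader_with_mock():
    # The mock replaces the real loader, so nothing is fetched from the Hub.
    mock_load_dataset = MagicMock(
        return_value=[{"package_name": "com.mantz_it.rfanalyzer"}]
    )
    iterator = read_from_hub(
        mock_load_dataset,
        "lhoestq/demo1",
        streaming=False,
        split="train",
        revision="branch",
        use_auth_token=True,
    )
    first = next(iterator)
    # Verify that every option reached the loader unchanged.
    mock_load_dataset.assert_called_with(
        path="lhoestq/demo1",
        streaming=False,
        split="train",
        revision="branch",
        use_auth_token=True,
    )
    return first, mock_load_dataset
```

Asserting on the call arguments keeps the test deterministic: it checks what the reader passes along, not what the Hub happens to return.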

@facebook-github-bot added the CLA Signed label Jan 19, 2023
Contributor

@ejguan ejguan left a comment

Thank you! Overall LGTM with a few nit comments

torchdata/datapipes/iter/load/huggingface.py: three resolved review threads (two on outdated code)
Comment on lines 37 to 44
elem = next(iter(datapipe))
assert type(elem) is dict
assert elem["package_name"] == "com.mantz_it.rfanalyzer"
mock_load_dataset.assert_called_with(
    path="lhoestq/demo1", streaming=False, split="train", revision="branch", use_auth_token=True
)
Contributor

Could you please add one more line to test if there is only one element yielded from the datapipe?

Contributor Author

What exactly do you mean? _get_response_from_huggingface_hub() returns an iterator over the dataset, and we look at the first element in lines 37-39.

Contributor

This test validates that the first output is the right one. I also want to check that StopIteration is raised when next is called on the iterator one more time.
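
The extra check requested here could look something like the sketch below. It is an illustration only: `single_element_pipe` is a hypothetical stand-in for a datapipe backed by a one-row mocked dataset, and `yields_exactly_one` is a helper name made up for this example.

```python
def single_element_pipe():
    """Hypothetical stand-in for a datapipe over a one-row mocked dataset."""
    yield {"package_name": "com.mantz_it.rfanalyzer"}


def yields_exactly_one(iterable):
    """Return True if the iterable is exhausted after its first element."""
    iterator = iter(iterable)
    next(iterator)  # consume the first element (raises StopIteration if empty)
    try:
        next(iterator)
    except StopIteration:
        return True  # a second call to next() found nothing, as expected
    return False
```

In a pytest-based test, the same check is often written inline as `with pytest.raises(StopIteration): next(iterator)` right after the first element has been read.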

SvenDS9 added a commit to SvenDS9/PytorchData that referenced this pull request Jan 23, 2023
self.config_kwargs = config_kwargs
warnings.warn(
    "default behavior of HuggingFaceHubReader will change in version 0.7", DeprecationWarning, stacklevel=2
)
Contributor

@ejguan ejguan Jan 23, 2023

Suggested change
)
if "split" not in self.config_kwargs:
    warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning)

Contributor

@ejguan ejguan left a comment

LGTM. @NivekT, do you want to chime in with any concerns?

Comment on lines +78 to +81
if "split" not in self.config_kwargs:
    warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning)
if "revision" not in self.config_kwargs:
    warnings.warn("Default value of `revision` will be changed to None in version 0.7", FutureWarning)
Contributor

Looks like the default arguments are changed. @ejguan it will be slightly BC-breaking. WDYT?

Contributor

Wait, the default arguments remain the same, right?
split="train"
revision="main"
streaming=True

The default arguments are assigned in _get_response_from_huggingface_hub.
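
The pattern discussed in this thread, where options arrive as keyword arguments, a FutureWarning is emitted for keys the caller left out, and the existing defaults are still applied, might be sketched as below. The function name `resolve_config` and the `_DEFAULTS` dictionary are assumptions made for illustration; the actual code assigns the defaults inside `_get_response_from_huggingface_hub`, whose body is not shown here.

```python
import warnings

# Default values as quoted in the thread; this dict itself is illustrative.
_DEFAULTS = {"split": "train", "revision": "main", "streaming": True}


def resolve_config(**config_kwargs):
    """Warn about upcoming default changes, then fill in the current defaults."""
    for key in ("split", "revision"):
        if key not in config_kwargs:
            warnings.warn(
                f"Default value of `{key}` will be changed to None in version 0.7",
                FutureWarning,
                stacklevel=2,
            )
    # Explicit arguments take precedence over the defaults.
    return {**_DEFAULTS, **config_kwargs}
```

Because the defaults live inside the function body and not in the signature, behavior stays backward compatible while the warning only fires for callers who rely on the implicit values.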

Contributor

Oh, in that case I'm fine with it. The only downside is that users will not be able to see those default arguments in IDE autocomplete, which is suboptimal but not a blocker.

Is the warning for Streaming missing or we want it to stay True? The default is False in HuggingFace's version.

Contributor

Is the warning for Streaming missing or we want it to stay True?

Good question. I personally prefer streaming=True, since it matches the large-dataset use case.

Contributor

ejguan commented Jan 23, 2023

@SvenDS9 Could you please rebase onto the main branch?

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines 20 to 21
split: Union[str, datasets.Split] = "train",
revision: Union[str, datasets.Version] = "main",
Contributor

Let's keep the annotation as str so that datasets remains an optional dependency.
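
The reasoning behind this suggestion can be illustrated with a small sketch. Annotations such as `datasets.Split` are evaluated when the function is defined, so they force `datasets` to be importable at module import time; plain `str` annotations combined with a deferred import avoid that. The function name `load_from_hub` and the error message are hypothetical and not taken from the PR.

```python
def load_from_hub(path: str, split: str = "train", revision: str = "main"):
    """Hypothetical loader that treats `datasets` as an optional dependency."""
    # Plain `str` annotations: nothing from `datasets` is referenced when this
    # function is defined, so importing this module never requires the package.
    try:
        import datasets  # deferred import, only needed when the function runs
    except ImportError as err:
        raise ModuleNotFoundError(
            "The `datasets` package is required to load from the Hub."
        ) from err
    return datasets.load_dataset(path=path, split=split, revision=revision)
```

Users who never call the function can import the surrounding module without installing `datasets`, at the cost of slightly less precise type hints.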

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ejguan merged this pull request in f7242a4.

@SvenDS9 SvenDS9 deleted the huggingface_all_options branch February 15, 2023 15:01
Labels: CLA Signed, Merged

Successfully merging this pull request may close these issues.

Full Support for HuggingFace-Datasets
4 participants