-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Huggingface all options #952
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Overall LGTM with a few nit comments
test/test_huggingface_datasets.py
Outdated
elem = next(iter(datapipe)) | ||
assert type(elem) is dict | ||
assert elem["package_name"] == "com.mantz_it.rfanalyzer" | ||
mock_load_dataset.assert_called_with( | ||
path="lhoestq/demo1", streaming=False, split="train", revision="branch", use_auth_token=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add one more line to test if there is only one element yielded from the datapipe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly do you mean? _get_response_from_huggingface_hub()
returns an iterator over the dataset and we look at the first element in line 37-39.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test validates the first output is the right one. I want to check a StopIteration should be raised when calling next over the iterator one more time
self.config_kwargs = config_kwargs | ||
warnings.warn( | ||
"default behavior of HuggingFaceHubReader will change in version 0.7", DeprecationWarning, stacklevel=2 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | |
if "split" not in self.config_kwargs: | |
warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @NivekT Do you want to chime in for any concern?
if "split" not in self.config_kwargs: | ||
warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning) | ||
if "revision" not in self.config_kwargs: | ||
warnings.warn("Default value of `revision` will be changed to None in version 0.7", FutureWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the default arguments are changed. @ejguan it will be slightly BC-breaking. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, the default arguments remain the same right?
split="train"
revision="main"
streaming=True
The default arguments are assigned in _get_response_from_huggingface_hub
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh in that case I'm fine with it, only that users will not be able to see those default arguments from IDE autocomplete. which is suboptimal but not a blocker.
Is the warning for Streaming missing or we want it to stay True
? The default is False
for HG's version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the warning for Streaming missing or we want it to stay
True
?
Good question. I personally like streaming=True to incorporate the style of large-dataset.
@SvenDS9 Could you please do a rebase onto main branch? |
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
6b7ec81
to
45536d2
Compare
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
split: Union[str, datasets.Split] = "train", | ||
revision: Union[str, datasets.Version] = "main", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the annotation as str
to make sure datasets
as optional dependency
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Fixes #944
Changes
load_dataset
(from HuggingFace) is called with correct parameters@ejguan could you please have a look