Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dataset builder default version #4356

Merged

Conversation

albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented May 16, 2022

Currently, when using a custom config (subclass of BuilderConfig), default version set at the builder level is ignored: we must set default version in the custom config class.

However, when loading a dataset with config_kwargs (for a configuration not present in BUILDER_CONFIGS), the default version set in the custom config is ignored and "0.0.0" is used instead:

ds = load_dataset("wikipedia", language="co", date="20220501", beam_runner="DirectRunner")

generates the following config:

WikipediaConfig(name='20220501.co', version=0.0.0, data_dir=None, data_files=None, description='Wikipedia dataset for co, parsed from 20220501 dump.')

with version "0.0.0" instead of "2.0.0".

See as a counter-example, when the config is present in BUILDER_CONFIGS:

ds = load_dataset("wikipedia", "20220301.fr", beam_runner="DirectRunner")

generates the following config:

WikipediaConfig(name='20220301.fr', version=2.0.0, data_dir=None, data_files=None, description='Wikipedia dataset for fr, parsed from 20220301 dump.')

with correct version "2.0.0", as set in the custom config class.

The reason for this is that DatasetBuilder has a default VERSION ("0.0.0") that overwrites the default version set at the custom config class.

This PR:

  • Removes the default VERSION at DatasetBuilder (set to None, so that the class attribute exists but it does not override the custom config default version).
  • Note that the BuilderConfig class already sets a default version = "0.0.0"; no need to pass this from the builder.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 16, 2022

The documentation is not available anymore as the PR was closed or merged.

@albertvillanova
Copy link
Member Author

This PR requires one of these other PRs being merged first:

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good catch ! Maybe it's worth adding a test to make sure a custom config gets the expected version ?

src/datasets/commands/dummy_data.py Show resolved Hide resolved
tests/test_dataset_common.py Show resolved Hide resolved
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

tests/test_builder.py Outdated Show resolved Hide resolved
@albertvillanova albertvillanova merged commit 4ed5bb2 into huggingface:master May 30, 2022
@albertvillanova albertvillanova deleted the fix-dataset-default-version branch May 30, 2022 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants