Support downloading specific splits in load_dataset
#6832
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice! A few comments, but nothing major:
```python
_dataset_name = self.name if self._check_legacy_cache() else self.dataset_name
splits: Optional[List[str]] = None
cached_split_filepatterns = []
supports_partial_generation = self._supports_partial_generation()
```
(nit) maybe rename `_supports_partial_generation` -> `_supports_split_by_split_generation` (or alternatively `isinstance(DatasetSplitBuilder)`, and `DatasetSplitBuilder` would implement the method to list the splits and have the extended `_split_generators` signature, but maybe it's too much)
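The subclassing alternative sketched above could look roughly like this; `DatasetSplitBuilder` and `MyBuilder` are illustrative names taken from the comment, not real classes in `datasets`, and the return values are simplified stand-ins:

```python
from typing import List, Optional


class DatasetSplitBuilder:
    """Hypothetical base class marking builders that can generate split by split."""

    def _available_splits(self) -> List[str]:
        raise NotImplementedError

    def _split_generators(self, splits: Optional[List[str]] = None):
        raise NotImplementedError


class MyBuilder(DatasetSplitBuilder):
    def _available_splits(self) -> List[str]:
        return ["train", "test"]

    def _split_generators(self, splits: Optional[List[str]] = None):
        # Toy stand-in: return the split names to generate instead of
        # real SplitGenerator objects.
        return splits or self._available_splits()


# The isinstance check would replace the _supports_partial_generation() probe.
print(isinstance(MyBuilder(), DatasetSplitBuilder))  # True
```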
```python
split_names = [rel_instr.splitname for rel_instr in split._relative_instructions]
splits.extend(split_names)
splits = list(unique_values(splits))  # remove duplicates
available_splits = self._available_splits()
```
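For context, the dedup step relies on an order-preserving `unique_values` helper; a minimal stand-in with the same behavior (assumed, not the actual implementation in `datasets`):

```python
def unique_values(values):
    """Yield each value once, preserving first-seen order."""
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            yield v


print(list(unique_values(["train", "test", "train", "test"])))  # ['train', 'test']
```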
it can also be a property `.splits`
```python
# We cannot use info as the source of truth if the builder supports partial generation
# as the info can be incomplete in that case
requested_splits_exist = not splits if supports_partial_generation else info_exists
```
is this just an optimization to avoid checking for files for all the splits?
```python
shutil.rmtree(dirname)
# LocalFileSystem.mv does copy + rm, it is more efficient to simply rename a local directory
shutil.move(tmp_dir, dirname)
for root, dirnames, filenames in os.walk(dirname, topdown=False):
```
iirc this directory is a flat directory, no? (no subdirectories)
maybe this code can be simplified a bit
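A sketch of that simplification, assuming the directory really is flat: a single `os.listdir` pass replaces the `os.walk` loop. The filter predicate here is illustrative, not the PR's actual pattern matching:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Simulate a flat cache directory with two split files.
    for name in ("data-train.arrow", "data-test.arrow"):
        open(os.path.join(dirname, name), "w").close()

    # One flat listing is enough when there are no subdirectories;
    # the "train" substring check stands in for the real file patterns.
    stale = [f for f in os.listdir(dirname) if "train" in f]
    print(stale)  # ['data-train.arrow']
```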
```python
# We need to update the info in case some splits were added in the meantime
# for example when calling load_dataset from multiple workers.
self.info = self._load_info()
_dataset_name = self.name if self._check_legacy_cache() else self.dataset_name
splits: Optional[List[str]] = None
cached_split_filepatterns = []
```
maybe rename to `patterns_of_split_files_to_overwrite`
```diff
@@ -1031,8 +1110,14 @@ def incomplete_dir(dirname):
             **download_and_prepare_kwargs,
         )
         # Sync info
         if supports_partial_generation and self.info.download_checksums is not None:
```
(nit) not sure `supports_partial_generation` is needed here
```diff
- if supports_partial_generation and self.info.download_checksums is not None:
+ if self.info.download_checksums is not None:
```
Friendly ping on this! This feature would be really helpful and useful to me (and likely others with limited download speed and storage space!). Thanks so much!
No one is working on this atm afaik :/
No worries! I've patched the ImageNet dataset in: https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/refs%2Fpr%2F20/imagenet-1k.py

Together with:

```python
dataset = load_dataset(
    "ILSVRC/imagenet-1k",
    split="validation",
    data_files={"val": "data/val_images.tar.gz"},
    revision="refs/pr/20",
    trust_remote_code=True,
    download_config=DownloadConfig(resume_download=True),
    verification_mode=VerificationMode.NO_CHECKS,
)
```

It only downloads the validation set this way (NO_CHECKS is a bit annoying because I'd rather have md5 checks, but I guess I can't have everything) ^^'

The patch is not perfect, but it does the job.
This PR builds on #6639 to support downloading only the specified splits in `load_dataset`. For this to work, a builder's `_split_generators` needs to be able to accept the requested splits (as a list) via a `splits` argument to avoid processing the non-requested ones. Also, the builder has to define a `_available_splits` method that lists all the possible `splits` values.

Close #4101, close #2538 (I'm probably missing some)

Should also make it possible to address #6793
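A toy sketch of the contract described above; `DummyBuilder` and the simplified `SplitGenerator` stand-in are illustrative, not the actual `datasets` API:

```python
from typing import List, Optional


class SplitGenerator:
    """Stand-in for datasets.SplitGenerator (illustrative only)."""

    def __init__(self, name: str):
        self.name = name


class DummyBuilder:
    """Toy builder showing the splits-aware contract from this PR."""

    def _available_splits(self) -> List[str]:
        # Lists every split this dataset can produce.
        return ["train", "validation", "test"]

    def _split_generators(self, splits: Optional[List[str]] = None) -> List[SplitGenerator]:
        # Only build generators for the requested splits; None means all.
        requested = splits if splits is not None else self._available_splits()
        return [SplitGenerator(n) for n in self._available_splits() if n in requested]


builder = DummyBuilder()
print([g.name for g in builder._split_generators(["validation"])])  # ['validation']
```

With this contract, `download_and_prepare` can pass the requested splits down and skip downloading the rest entirely.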