
Support downloading specific splits in load_dataset #6832

Open · wants to merge 18 commits into base: main
Conversation

@mariosasko (Collaborator) commented Apr 23, 2024

This PR builds on #6639 to support downloading only the specified splits in load_dataset. For this to work, a builder's _split_generators needs to be able to accept the requested splits (as a list) via a splits argument, to avoid processing the non-requested ones. The builder also has to define an _available_splits method that lists all the possible splits values.

Closes #4101, closes #2538 (I'm probably missing some)

Should also make it possible to address #6793
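The shape of the proposed builder API can be sketched in plain Python (DummyBuilder and its split names are illustrative stand-ins, not code from this PR):

```python
# Hypothetical sketch of the proposed API: DummyBuilder stands in for a
# real GeneratorBasedBuilder subclass from `datasets`.

class DummyBuilder:
    def _available_splits(self):
        # Lists all the possible split values this builder can produce.
        return ["train", "validation", "test"]

    def _split_generators(self, dl_manager=None, splits=None):
        # Only process the requested splits; None means "generate everything".
        requested = splits if splits is not None else self._available_splits()
        return [name for name in requested if name in self._available_splits()]

builder = DummyBuilder()
builder._split_generators(splits=["validation"])  # processes only "validation"
```

In a real builder, _split_generators would return SplitGenerator objects and use dl_manager to download only the files each requested split needs.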

@HuggingFaceDocBuilderDev commented: The docs for this PR live at a preview endpoint. All of your documentation changes will be reflected on that endpoint, and the docs remain available until 30 days after the last update.

@lhoestq (Member) left a comment


Nice! A few comments, but nothing major:

_dataset_name = self.name if self._check_legacy_cache() else self.dataset_name
splits: Optional[List[str]] = None
cached_split_filepatterns = []
supports_partial_generation = self._supports_partial_generation()

(nit) maybe rename _supports_partial_generation -> _supports_split_by_split_generation

(or alternatively isinstance(DatasetSplitBuilder) and DatasetSplitBuilder would implement the method to list the splits and have the extended _split_generators signature, but maybe it's too much)

split_names = [rel_instr.splitname for rel_instr in split._relative_instructions]
splits.extend(split_names)
splits = list(unique_values(splits)) # remove duplicates
available_splits = self._available_splits()

it can also be a property .splits
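The property variant the reviewer mentions could look like this minimal sketch (class and split names are hypothetical, not from the PR):

```python
# Hypothetical sketch of exposing the available splits as a read-only
# property instead of an _available_splits() method.

class DummyBuilder:
    @property
    def splits(self):
        # All split values this builder can produce.
        return ["train", "validation", "test"]

DummyBuilder().splits  # accessed like an attribute, no call parentheses
```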

Comment on lines +1017 to +1019
# We cannot use info as the source of truth if the builder supports partial generation
# as the info can be incomplete in that case
requested_splits_exist = not splits if supports_partial_generation else info_exists

is this just an optimization to avoid checking for files for all the splits?

shutil.rmtree(dirname)
# LocalFileSystem.mv does copy + rm, it is more efficient to simply rename a local directory
shutil.move(tmp_dir, dirname)
for root, dirnames, filenames in os.walk(dirname, topdown=False):

iirc this directory is flat, no? (no subdirectories)

maybe this code can be simplified a bit
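Assuming the directory really is flat, one possible simplification is a single os.scandir pass instead of the os.walk loop (a sketch of the idea, not the PR's actual code; list_flat_dir is a hypothetical helper):

```python
import os

def list_flat_dir(dirname):
    # Single-level listing: equivalent to iterating os.walk when `dirname`
    # contains no subdirectories.
    return [entry.name for entry in os.scandir(dirname) if entry.is_file()]
```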

# We need to update the info in case some splits were added in the meantime
# for example when calling load_dataset from multiple workers.
self.info = self._load_info()
_dataset_name = self.name if self._check_legacy_cache() else self.dataset_name
splits: Optional[List[str]] = None
cached_split_filepatterns = []

maybe rename cached_split_filepatterns -> patterns_of_split_files_to_overwrite

@@ -1031,8 +1110,14 @@ def incomplete_dir(dirname):
**download_and_prepare_kwargs,
)
# Sync info
if supports_partial_generation and self.info.download_checksums is not None:

(nit) not sure supports_partial_generation is needed here

Suggested change
if supports_partial_generation and self.info.download_checksums is not None:
if self.info.download_checksums is not None:

@BlackHC commented Aug 19, 2024

Friendly ping on this! This feature would be really helpful and useful to me (and likely others with limited download speed and storage space!). Thanks so much!

@lhoestq (Member) commented Aug 19, 2024

No one is working on this atm afaik :/

@BlackHC commented Aug 19, 2024

No worries! I've patched the ImageNet dataset in: https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/refs%2Fpr%2F20/imagenet-1k.py

Together with:

from datasets import DownloadConfig, VerificationMode, load_dataset

dataset = load_dataset(
    "ILSVRC/imagenet-1k",
    split="validation",
    data_files={"val": "data/val_images.tar.gz"},
    revision="refs/pr/20",
    trust_remote_code=True,
    download_config=DownloadConfig(resume_download=True),
    verification_mode=VerificationMode.NO_CHECKS,
)

It only downloads the validation set this way (NO_CHECKS is a bit annoying because I'd rather have md5 checks, but I guess I can't have everything) ^^' The patch is not perfect, but it does the job.
