Support downloading specific splits in load_dataset
#6832
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice! A few comments, but nothing major:
```python
_dataset_name = self.name if self._check_legacy_cache() else self.dataset_name
splits: Optional[List[str]] = None
cached_split_filepatterns = []
supports_partial_generation = self._supports_partial_generation()
```
(nit) maybe rename `_supports_partial_generation` -> `_supports_split_by_split_generation` (or alternatively `isinstance(DatasetSplitBuilder)`, and `DatasetSplitBuilder` would implement the method to list the splits and have the extended `_split_generators` signature, but maybe it's too much)
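The subclassing alternative sketched above could look roughly like this; `DatasetSplitBuilder` and `MyBuilder` are illustrative names taken from the comment, not real classes in `datasets`, and the return values are simplified stand-ins:

```python
from typing import List, Optional


class DatasetSplitBuilder:
    """Hypothetical base class marking builders that can generate split by split."""

    def _available_splits(self) -> List[str]:
        raise NotImplementedError

    def _split_generators(self, splits: Optional[List[str]] = None):
        raise NotImplementedError


class MyBuilder(DatasetSplitBuilder):
    def _available_splits(self) -> List[str]:
        return ["train", "test"]

    def _split_generators(self, splits: Optional[List[str]] = None):
        # Toy stand-in: return the split names to generate instead of
        # real SplitGenerator objects.
        return splits or self._available_splits()


# The isinstance check would replace the _supports_partial_generation() probe.
print(isinstance(MyBuilder(), DatasetSplitBuilder))  # True
```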
```python
split_names = [rel_instr.splitname for rel_instr in split._relative_instructions]
splits.extend(split_names)
splits = list(unique_values(splits))  # remove duplicates
available_splits = self._available_splits()
```
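For context, the dedup step relies on an order-preserving `unique_values` helper; a minimal stand-in with the same behavior (assumed, not the actual implementation in `datasets`):

```python
def unique_values(values):
    """Yield each value once, preserving first-seen order."""
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            yield v


print(list(unique_values(["train", "test", "train", "test"])))  # ['train', 'test']
```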
it can also be a property `.splits`
```python
# We cannot use info as the source of truth if the builder supports partial generation
# as the info can be incomplete in that case
requested_splits_exist = not splits if supports_partial_generation else info_exists
```
is this just an optimization to avoid checking for files for all the splits?
```python
shutil.rmtree(dirname)
# LocalFileSystem.mv does copy + rm, it is more efficient to simply rename a local directory
shutil.move(tmp_dir, dirname)
for root, dirnames, filenames in os.walk(dirname, topdown=False):
```
iirc this directory is a flat directory, no? (no subdirectories)
maybe this code can be simplified a bit
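A sketch of that simplification, assuming the directory really is flat: a single `os.listdir` pass replaces the `os.walk` loop. The filter predicate here is illustrative, not the PR's actual pattern matching:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Simulate a flat cache directory with two split files.
    for name in ("data-train.arrow", "data-test.arrow"):
        open(os.path.join(dirname, name), "w").close()

    # One flat listing is enough when there are no subdirectories;
    # the "train" substring check stands in for the real file patterns.
    stale = [f for f in os.listdir(dirname) if "train" in f]
    print(stale)  # ['data-train.arrow']
```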
```python
# We need to update the info in case some splits were added in the meantime
# for example when calling load_dataset from multiple workers.
self.info = self._load_info()
_dataset_name = self.name if self._check_legacy_cache() else self.dataset_name
splits: Optional[List[str]] = None
cached_split_filepatterns = []
```
maybe rename to `patterns_of_split_files_to_overwrite`
```diff
@@ -1031,8 +1110,14 @@ def incomplete_dir(dirname):
             **download_and_prepare_kwargs,
         )
         # Sync info
         if supports_partial_generation and self.info.download_checksums is not None:
```
(nit) not sure `supports_partial_generation` is needed here
```diff
- if supports_partial_generation and self.info.download_checksums is not None:
+ if self.info.download_checksums is not None:
```
Friendly ping on this! This feature would be really helpful and useful to me (and likely others with limited download speed and storage space!). Thanks so much!
No one is working on this atm afaik :/
No worries! I've patched the ImageNet dataset in: https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/refs%2Fpr%2F20/imagenet-1k.py

Together with:

```python
dataset = load_dataset(
    "ILSVRC/imagenet-1k",
    split="validation",
    data_files={"val": "data/val_images.tar.gz"},
    revision="refs/pr/20",
    trust_remote_code=True,
    download_config=DownloadConfig(resume_download=True),
    verification_mode=VerificationMode.NO_CHECKS,
)
```

It only downloads the validation set this way (NO_CHECKS is a bit annoying because I'd rather have md5 checks, but I guess I can't have everything) ^^'

The patch is not perfect, but it does the job.
This PR builds on #6639 to support downloading only the specified splits in `load_dataset`. For this to work, a builder's `_split_generators` needs to be able to accept the requested splits (as a list) via a `splits` argument to avoid processing the non-requested ones. Also, the builder has to define a `_available_splits` method that lists all the possible `splits` values.

Close #4101, close #2538 (I'm probably missing some)

Should also make it possible to address #6793
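A toy sketch of the contract described above; `DummyBuilder` and the simplified `SplitGenerator` stand-in are illustrative, not the actual `datasets` API:

```python
from typing import List, Optional


class SplitGenerator:
    """Stand-in for datasets.SplitGenerator (illustrative only)."""

    def __init__(self, name: str):
        self.name = name


class DummyBuilder:
    """Toy builder showing the splits-aware contract from this PR."""

    def _available_splits(self) -> List[str]:
        # Lists every split this dataset can produce.
        return ["train", "validation", "test"]

    def _split_generators(self, splits: Optional[List[str]] = None) -> List[SplitGenerator]:
        # Only build generators for the requested splits; None means all.
        requested = splits if splits is not None else self._available_splits()
        return [SplitGenerator(n) for n in self._available_splits() if n in requested]


builder = DummyBuilder()
print([g.name for g in builder._split_generators(["validation"])])  # ['validation']
```

With this contract, `download_and_prepare` can pass the requested splits down and skip downloading the rest entirely.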