losing DatasetInfo in Dataset.map when num_proc > 1 #6585

Open
JochenSiegWork opened this issue Jan 12, 2024 · 2 comments

Describe the bug

Hello and thanks for developing this package!

When I process a Dataset with map using multiple processes (num_proc > 1), some attributes set on the DatasetInfo are lost and come back as None in the resulting Dataset.

Steps to reproduce the bug

from datasets import Dataset, DatasetInfo


def run_map(num_proc):
    dataset = Dataset.from_dict(
        {"col1": [0, 1], "col2": [3, 4]},
        info=DatasetInfo(
            dataset_name="my_dataset",
        ),
    )

    ds = dataset.map(lambda x: x, num_proc=num_proc)
    print(ds.info.dataset_name)


run_map(1)
run_map(2)

This prints:

Map: 100%|██████████| 2/2 [00:00<00:00, 724.66 examples/s]
my_dataset
Map (num_proc=2): 100%|██████████| 2/2 [00:00<00:00, 18.25 examples/s]
None

Expected behavior

I expect the DatasetInfo to be kept as it was and there should be no difference in the output of running map with num_proc=1 and num_proc=2.

Expected output:

Map: 100%|██████████| 2/2 [00:00<00:00, 724.66 examples/s]
my_dataset
Map (num_proc=2): 100%|██████████| 2/2 [00:00<00:00, 18.25 examples/s]
my_dataset

Environment info

  • datasets version: 2.16.1
  • Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.17
  • Python version: 3.8.18
  • huggingface_hub version: 0.20.2
  • PyArrow version: 12.0.1
  • Pandas version: 2.0.3
  • fsspec version: 2023.9.2

lhoestq commented Jan 12, 2024

Hi! This issue comes from the fact that map() with num_proc > 1 shards the dataset into multiple chunks to be processed (one per process) and then merges them. The DatasetInfos of the chunks are merged as well, but for some fields, such as dataset_name, the merge has not been implemented, so they default to None.
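
To illustrate the mechanism, here is a toy sketch in plain Python that does not use the datasets library. ToyInfo and merge_infos are hypothetical stand-ins, not the real DatasetInfo code: the merge only combines the fields it knows about, so every other field falls back to its default of None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyInfo:
    # Hypothetical stand-in for DatasetInfo, reduced to two fields.
    description: str = ""
    dataset_name: Optional[str] = None


def merge_infos(infos: List[ToyInfo]) -> ToyInfo:
    # Only `description` is merged here. `dataset_name` is never copied over,
    # so the merged object keeps the dataclass default (None).
    unique = list(dict.fromkeys(info.description for info in infos))
    return ToyInfo(description="\n\n".join(unique))


# One info per shard, as if the dataset had been split for two processes.
shard_infos = [ToyInfo("demo", "my_dataset"), ToyInfo("demo", "my_dataset")]
merged = merge_infos(shard_infos)
print(merged.dataset_name)  # None
```

Both shards carry the same dataset_name, yet the merged info loses it, which matches the symptom in the report.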

The DatasetInfo merge is defined here, in case you'd like to contribute an improvement:

@classmethod
def from_merge(cls, dataset_infos: List["DatasetInfo"]):
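
One possible direction, in the spirit of the commit message "try not to merge DatasetInfos if they're equal" referenced in this thread, is to skip the field-by-field merge when all infos are identical. The sketch below is an assumption about the idea only, written against a hypothetical ToyInfo class rather than the real DatasetInfo, and is not the code that was merged.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyInfo:
    # Hypothetical stand-in for DatasetInfo, reduced to two fields.
    description: str = ""
    dataset_name: Optional[str] = None

    @classmethod
    def from_merge(cls, infos: List["ToyInfo"]) -> "ToyInfo":
        infos = [info for info in infos if info is not None]
        # If every shard carries the same info, nothing needs merging:
        # return a copy so all fields (including dataset_name) survive.
        if infos and all(info == infos[0] for info in infos):
            return copy.deepcopy(infos[0])
        # Otherwise fall back to a field-by-field merge (description only here).
        unique = list(dict.fromkeys(info.description for info in infos))
        return cls(description="\n\n".join(unique))


same = ToyInfo.from_merge([ToyInfo("demo", "my_dataset"), ToyInfo("demo", "my_dataset")])
mixed = ToyInfo.from_merge([ToyInfo("a", "x"), ToyInfo("b", "y")])
print(same.dataset_name)   # my_dataset
print(mixed.dataset_name)  # None
```

The equality check covers the map() case, where every shard inherits the same info. Infos that actually differ still go through the lossy merge in this sketch.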

JochenSiegWork commented

#self-assign

JochenSiegWork added a commit to JochenSiegWork/datasets that referenced this issue Jan 12, 2024
* try not to merge DatasetInfos if they're equal

* fixes losing DatasetInfo during parallel Dataset.map
lhoestq added a commit that referenced this issue Jan 26, 2024
* try not to merge DatasetInfos if they're equal

* fixes losing DatasetInfo during parallel Dataset.map

Co-authored-by: Quentin Lhoest <[email protected]>