losing DatasetInfo in Dataset.map when num_proc > 1 #6585

JochenSiegWork · 2024-01-12T13:39:19Z

Describe the bug

Hello and thanks for developing this package!

When I process a Dataset with the map function using multiple processes, some attributes set on the DatasetInfo get lost and are None in the resulting Dataset.

Steps to reproduce the bug

from datasets import Dataset, DatasetInfo


def run_map(num_proc):
    dataset = Dataset.from_dict(
        {"col1": [0, 1], "col2": [3, 4]},
        info=DatasetInfo(
            dataset_name="my_dataset",
        ),
    )

    ds = dataset.map(lambda x: x, num_proc=num_proc)
    print(ds.info.dataset_name)


run_map(1)
run_map(2)

This prints:

Map: 100%|██████████| 2/2 [00:00<00:00, 724.66 examples/s]
my_dataset
Map (num_proc=2): 100%|██████████| 2/2 [00:00<00:00, 18.25 examples/s]
None

Expected behavior

I expect the DatasetInfo to be kept as it was and there should be no difference in the output of running map with num_proc=1 and num_proc=2.

Expected output:

Map: 100%|██████████| 2/2 [00:00<00:00, 724.66 examples/s]
my_dataset
Map (num_proc=2): 100%|██████████| 2/2 [00:00<00:00, 18.25 examples/s]
my_dataset

Environment info

datasets version: 2.16.1
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.17
Python version: 3.8.18
huggingface_hub version: 0.20.2
PyArrow version: 12.0.1
Pandas version: 2.0.3
fsspec version: 2023.9.2


lhoestq · 2024-01-12T13:47:03Z

Hi! This issue comes from the fact that map() with num_proc>1 shards the dataset into multiple chunks to be processed (one per process) and then merges them. The DatasetInfos of the chunks are merged together, but for some fields, like dataset_name, merging has not been implemented, so they default to None.

The DatasetInfo merge is defined here, in case you'd like to contribute an improvement:

datasets/src/datasets/info.py

Lines 269 to 270 in d2e0034

    
    @classmethod
    def from_merge(cls, dataset_infos: List["DatasetInfo"]):
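
To make the mechanism described above concrete, here is a minimal toy model, not the library's actual code. `ToyInfo` and `lossy_merge` are hypothetical names invented for illustration. The sketch assumes a merge that only knows how to combine `description` and never copies `dataset_name`, so that field falls back to its default.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyInfo:
    """Hypothetical stand-in for DatasetInfo with only two fields."""

    description: str = ""
    dataset_name: Optional[str] = None


def lossy_merge(infos: List[ToyInfo]) -> ToyInfo:
    """Merge per-shard infos, handling only the `description` field."""
    # Deduplicate descriptions while keeping their original order.
    unique_descriptions = list(dict.fromkeys(info.description for info in infos))
    # `dataset_name` is never passed on, so it stays at its default (None).
    return ToyInfo(description="\n\n".join(unique_descriptions).strip())


# Two shards carrying identical info, as produced by sharding one dataset.
shards = [ToyInfo(description="demo", dataset_name="my_dataset") for _ in range(2)]
merged = lossy_merge(shards)
print(merged.dataset_name)  # prints: None
```

Every shard carries the same name, yet the merged result loses it because the merge has no rule for that field.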

JochenSiegWork · 2024-01-12T14:08:14Z

#self-assign

github-actions bot assigned JochenSiegWork Jan 12, 2024

JochenSiegWork added a commit to JochenSiegWork/datasets that referenced this issue Jan 12, 2024

keep more info in DatasetInfo.from_merge huggingface#6585

f57bb9b

* try not to merge DatasetInfos if they're equal
* fixes losing DatasetInfo during parallel Dataset.map

thiagobarbosa mentioned this issue Jan 15, 2024

Error with the huggingface hf-speech-bench huggingface/blog#1623

Open

lhoestq added a commit that referenced this issue Jan 26, 2024

keep more info in DatasetInfo.from_merge #6585 (#6586)

ca76ca1

* try not to merge DatasetInfos if they're equal
* fixes losing DatasetInfo during parallel Dataset.map

Co-authored-by: Quentin Lhoest
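
The idea in the commit message ("try not to merge DatasetInfos if they're equal") can be sketched on a toy model. This is an illustration under stated assumptions, not the code from the commit: `ToyInfo` and `merge_keep_equal` are hypothetical names, and the fallback branch is a deliberately simplified field-by-field merge.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class ToyInfo:
    """Hypothetical stand-in for DatasetInfo with only two fields."""

    description: str = ""
    dataset_name: Optional[str] = None


def merge_keep_equal(infos: List[Optional[ToyInfo]]) -> ToyInfo:
    """Return a copy of the first info if all are equal, else merge."""
    present = [info for info in infos if info is not None]
    # Short-circuit: identical infos need no merging, so nothing is lost.
    if present and all(info == present[0] for info in present):
        return replace(present[0])  # shallow copy of the first info
    # Simplified fallback: only `description` is merged in this sketch.
    unique_descriptions = list(dict.fromkeys(info.description for info in present))
    return ToyInfo(description="\n\n".join(unique_descriptions).strip())


same = [ToyInfo(description="demo", dataset_name="my_dataset") for _ in range(2)]
print(merge_keep_equal(same).dataset_name)  # prints: my_dataset
```

With the short-circuit, shards that come from the same dataset keep every field. Infos that really differ still go through the field-by-field path, where unhandled fields can still end up as None.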
