Fix for Incorrect ex_iterable used with multi num_worker #6582

kq-chen · 2024-01-11T08:49:43Z

Corrects an issue where self._ex_iterable was erroneously used instead of ex_iterable, when both Distributed Data Parallel (DDP) and multi num_worker are used concurrently. This improper usage led to the generation of incorrect shards_indices, subsequently causing issues with the control flow responsible for worker creation. The fix ensures the appropriate iterable is used, thus providing a more accurate determination of whether a new worker should be instantiated or not.

Corrects an issue where `self._ex_iterable` was erroneously used instead of `ex_iterable`, when both Distributed Data Parallel (DDP) and multi num_worker are used concurrently. This improper usage led to the generation of incorrect `shards_indices`, subsequently causing issues with the control flow responsible for worker creation. The fix ensures the appropriate iterable is used, thus providing a more accurate determination of whether a new worker should be instantiated or not.

kq-chen · 2024-01-11T09:06:06Z

A toy example to reveal the bug.

"""
DATASETS_VERBOSITY=debug torchrun --nproc-per-node 2 main.py 
"""
import torch.utils.data
import torch.distributed
import datasets.distributed
import datasets

# num shards = 4
shards = [(0, 100), (100, 200), (200, 300), (300, 400)]


def gen(shards):
    for st, ed in shards:
        yield from range(st, ed)

torch.distributed.init_process_group()

# want to create total worker = world_size * 8
ds = datasets.IterableDataset.from_generator(gen, gen_kwargs={'shards': shards})
ds = datasets.distributed.split_dataset_by_node(
    ds,
    rank=torch.distributed.get_rank(),
    world_size=torch.distributed.get_world_size(),
)
dl = torch.utils.data.DataLoader(ds, batch_size=10, num_workers=8)

for x in dl:
    print(f"RANK={torch.distributed.get_rank()} {x}")

lhoestq

Good catch ! We'll do a release asap and include this fix

github-actions · 2024-03-01T19:09:13Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005401 / 0.011353 (-0.005952)	0.004023 / 0.011008 (-0.006985)	0.064601 / 0.038508 (0.026093)	0.028567 / 0.023109 (0.005457)	0.245476 / 0.275898 (-0.030422)	0.292727 / 0.323480 (-0.030752)	0.003080 / 0.007986 (-0.004905)	0.002779 / 0.004328 (-0.001549)	0.050046 / 0.004250 (0.045796)	0.043906 / 0.037052 (0.006854)	0.273896 / 0.258489 (0.015407)	0.308430 / 0.293841 (0.014589)	0.028442 / 0.128546 (-0.100104)	0.010694 / 0.075646 (-0.064953)	0.209048 / 0.419271 (-0.210223)	0.036062 / 0.043533 (-0.007471)	0.242689 / 0.255139 (-0.012450)	0.261695 / 0.283200 (-0.021504)	0.018519 / 0.141683 (-0.123163)	1.122735 / 1.452155 (-0.329420)	1.172680 / 1.492716 (-0.320036)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093827 / 0.018006 (0.075820)	0.302650 / 0.000490 (0.302161)	0.000218 / 0.000200 (0.000018)	0.000045 / 0.000054 (-0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018778 / 0.037411 (-0.018633)	0.067516 / 0.014526 (0.052990)	0.079693 / 0.176557 (-0.096864)	0.125907 / 0.737135 (-0.611228)	0.081771 / 0.296338 (-0.214568)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.281809 / 0.215209 (0.066600)	2.773937 / 2.077655 (0.696283)	1.443622 / 1.504120 (-0.060497)	1.334359 / 1.541195 (-0.206836)	1.364813 / 1.468490 (-0.103677)	0.561670 / 4.584777 (-4.023107)	2.338292 / 3.745712 (-1.407420)	2.807595 / 5.269862 (-2.462267)	1.734162 / 4.565676 (-2.831514)	0.063681 / 0.424275 (-0.360594)	0.004934 / 0.007607 (-0.002673)	0.336781 / 0.226044 (0.110737)	3.311744 / 2.268929 (1.042815)	1.826802 / 55.444624 (-53.617822)	1.579604 / 6.876477 (-5.296872)	1.620526 / 2.142072 (-0.521546)	0.647061 / 4.805227 (-4.158166)	0.117729 / 6.500664 (-6.382935)	0.042216 / 0.075469 (-0.033253)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.994289 / 1.841788 (-0.847499)	12.266185 / 8.074308 (4.191877)	9.634035 / 10.191392 (-0.557357)	0.144521 / 0.680424 (-0.535902)	0.013787 / 0.534201 (-0.520414)	0.288353 / 0.579283 (-0.290930)	0.262183 / 0.434364 (-0.172181)	0.336960 / 0.540337 (-0.203378)	0.441142 / 1.386936 (-0.945794)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005678 / 0.011353 (-0.005675)	0.004011 / 0.011008 (-0.006998)	0.049319 / 0.038508 (0.010811)	0.032543 / 0.023109 (0.009434)	0.276389 / 0.275898 (0.000491)	0.298495 / 0.323480 (-0.024985)	0.004192 / 0.007986 (-0.003794)	0.002765 / 0.004328 (-0.001563)	0.048739 / 0.004250 (0.044489)	0.046212 / 0.037052 (0.009160)	0.286614 / 0.258489 (0.028125)	0.315949 / 0.293841 (0.022108)	0.029833 / 0.128546 (-0.098714)	0.010762 / 0.075646 (-0.064884)	0.058489 / 0.419271 (-0.360783)	0.052258 / 0.043533 (0.008725)	0.275873 / 0.255139 (0.020734)	0.288668 / 0.283200 (0.005468)	0.018828 / 0.141683 (-0.122855)	1.140196 / 1.452155 (-0.311959)	1.229500 / 1.492716 (-0.263217)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.094161 / 0.018006 (0.076155)	0.303519 / 0.000490 (0.303030)	0.000219 / 0.000200 (0.000019)	0.000043 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022088 / 0.037411 (-0.015324)	0.076376 / 0.014526 (0.061850)	0.088705 / 0.176557 (-0.087851)	0.127602 / 0.737135 (-0.609533)	0.088689 / 0.296338 (-0.207649)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.292363 / 0.215209 (0.077154)	2.859215 / 2.077655 (0.781561)	1.566389 / 1.504120 (0.062270)	1.439195 / 1.541195 (-0.102000)	1.463805 / 1.468490 (-0.004685)	0.551660 / 4.584777 (-4.033116)	2.427462 / 3.745712 (-1.318250)	2.712372 / 5.269862 (-2.557490)	1.811331 / 4.565676 (-2.754346)	0.061539 / 0.424275 (-0.362736)	0.005062 / 0.007607 (-0.002545)	0.341984 / 0.226044 (0.115940)	3.352171 / 2.268929 (1.083242)	1.917550 / 55.444624 (-53.527074)	1.642668 / 6.876477 (-5.233809)	1.817204 / 2.142072 (-0.324868)	0.630849 / 4.805227 (-4.174379)	0.115788 / 6.500664 (-6.384876)	0.041041 / 0.075469 (-0.034428)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.017725 / 1.841788 (-0.824062)	12.976994 / 8.074308 (4.902686)	10.307414 / 10.191392 (0.116022)	0.141090 / 0.680424 (-0.539334)	0.015548 / 0.534201 (-0.518653)	0.288184 / 0.579283 (-0.291099)	0.276409 / 0.434364 (-0.157955)	0.328289 / 0.540337 (-0.212048)	0.429138 / 1.386936 (-0.957798)

lhoestq approved these changes Mar 1, 2024

View reviewed changes

lhoestq merged commit 31ae21f into huggingface:main Mar 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for Incorrect ex_iterable used with multi num_worker #6582

Fix for Incorrect ex_iterable used with multi num_worker #6582

kq-chen commented Jan 11, 2024

kq-chen commented Jan 11, 2024

lhoestq left a comment

github-actions bot commented Mar 1, 2024

Benchmark: benchmark_array_xd.json

Benchmark: benchmark_getitem_100B.json

Benchmark: benchmark_indices_mapping.json

Benchmark: benchmark_iterating.json

Benchmark: benchmark_map_filter.json

Benchmark: benchmark_array_xd.json

Benchmark: benchmark_getitem_100B.json

Benchmark: benchmark_indices_mapping.json

Benchmark: benchmark_iterating.json

Benchmark: benchmark_map_filter.json

Fix for Incorrect ex_iterable used with multi num_worker #6582

Fix for Incorrect ex_iterable used with multi num_worker #6582

Conversation

kq-chen commented Jan 11, 2024

kq-chen commented Jan 11, 2024

lhoestq left a comment

Choose a reason for hiding this comment

github-actions bot commented Mar 1, 2024

Benchmark: benchmark_array_xd.json

Benchmark: benchmark_getitem_100B.json

Benchmark: benchmark_indices_mapping.json

Benchmark: benchmark_iterating.json

Benchmark: benchmark_map_filter.json

Benchmark: benchmark_array_xd.json

Benchmark: benchmark_getitem_100B.json

Benchmark: benchmark_indices_mapping.json

Benchmark: benchmark_iterating.json

Benchmark: benchmark_map_filter.json