Update index.rst (#3636)
VioletteLepercq authored Jan 26, 2022
1 parent 3a44eb7 commit e2e96ff
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -6,7 +6,7 @@ Datasets

🤗 Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.

- Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the `Hugging Face Hub <https://huggingface.co/datasets>`_, allowing you to easily load and share a dataset with the wider NLP community. There are currently over 900 datasets, and more than 25 metrics available.
+ Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the `Hugging Face Hub <https://huggingface.co/datasets>`_, allowing you to easily load and share a dataset with the wider NLP community. There are currently over 2658 datasets, and more than 34 metrics available.

Find your dataset today on the `Hugging Face Hub <https://huggingface.co/datasets>`_, or take an in-depth look inside a dataset with the live `Datasets Viewer <https://huggingface.co/datasets/viewer/>`_.


1 comment on commit e2e96ff

@github-actions
Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009181 / 0.011353 (-0.002172) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004058 / 0.011008 (-0.006950) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028441 / 0.038508 (-0.010067) |
| read_batch_unformated after write_array2d | 0.033870 / 0.023109 (0.010761) |
| read_batch_unformated after write_flattened_sequence | 0.286788 / 0.275898 (0.010890) |
| read_batch_unformated after write_nested_sequence | 0.338439 / 0.323480 (0.014959) |
| read_col_formatted_as_numpy after write_array2d | 0.007997 / 0.007986 (0.000012) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004954 / 0.004328 (0.000626) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008359 / 0.004250 (0.004108) |
| read_col_unformated after write_array2d | 0.046610 / 0.037052 (0.009557) |
| read_col_unformated after write_flattened_sequence | 0.280974 / 0.258489 (0.022485) |
| read_col_unformated after write_nested_sequence | 0.334330 / 0.293841 (0.040489) |
| read_formatted_as_numpy after write_array2d | 0.028987 / 0.128546 (-0.099559) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009043 / 0.075646 (-0.066603) |
| read_formatted_as_numpy after write_nested_sequence | 0.229647 / 0.419271 (-0.189624) |
| read_unformated after write_array2d | 0.046166 / 0.043533 (0.002633) |
| read_unformated after write_flattened_sequence | 0.271353 / 0.255139 (0.016214) |
| read_unformated after write_nested_sequence | 0.316319 / 0.283200 (0.033120) |
| write_array2d | 0.098141 / 0.141683 (-0.043542) |
| write_flattened_sequence | 1.636487 / 1.452155 (0.184332) |
| write_nested_sequence | 1.674978 / 1.492716 (0.182262) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.312632 / 0.018006 (0.294626) |
| get_batch_of_1024_rows | 0.541178 / 0.000490 (0.540688) |
| get_first_row | 0.009307 / 0.000200 (0.009108) |
| get_last_row | 0.000257 / 0.000054 (0.000203) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034318 / 0.037411 (-0.003093) |
| shard | 0.021270 / 0.014526 (0.006744) |
| shuffle | 0.027026 / 0.176557 (-0.149531) |
| sort | 0.063856 / 0.737135 (-0.673279) |
| train_test_split | 0.027338 / 0.296338 (-0.269001) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.368730 / 0.215209 (0.153521) |
| read 50000 | 3.699655 / 2.077655 (1.622001) |
| read_batch 50000 10 | 1.594150 / 1.504120 (0.090030) |
| read_batch 50000 100 | 1.414401 / 1.541195 (-0.126793) |
| read_batch 50000 1000 | 1.521547 / 1.468490 (0.053057) |
| read_formatted numpy 5000 | 0.400002 / 4.584777 (-4.184775) |
| read_formatted pandas 5000 | 4.408245 / 3.745712 (0.662533) |
| read_formatted tensorflow 5000 | 2.295555 / 5.269862 (-2.974307) |
| read_formatted torch 5000 | 0.946961 / 4.565676 (-3.618716) |
| read_formatted_batch numpy 5000 10 | 0.055285 / 0.424275 (-0.368990) |
| read_formatted_batch numpy 5000 1000 | 0.012672 / 0.007607 (0.005065) |
| shuffled read 5000 | 0.517797 / 0.226044 (0.291753) |
| shuffled read 50000 | 5.191382 / 2.268929 (2.922453) |
| shuffled read_batch 50000 10 | 2.284224 / 55.444624 (-53.160401) |
| shuffled read_batch 50000 100 | 1.938079 / 6.876477 (-4.938398) |
| shuffled read_batch 50000 1000 | 2.016782 / 2.142072 (-0.125290) |
| shuffled read_formatted numpy 5000 | 0.572589 / 4.805227 (-4.232638) |
| shuffled read_formatted_batch numpy 5000 10 | 0.130783 / 6.500664 (-6.369881) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.068302 / 0.075469 (-0.007167) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.536816 / 1.841788 (-0.304972) |
| map fast-tokenizer batched | 14.480530 / 8.074308 (6.406222) |
| map identity | 26.979537 / 10.191392 (16.788145) |
| map identity batched | 0.841455 / 0.680424 (0.161031) |
| map no-op batched | 0.463033 / 0.534201 (-0.071168) |
| map no-op batched numpy | 0.446153 / 0.579283 (-0.133130) |
| map no-op batched pandas | 0.492124 / 0.434364 (0.057760) |
| map no-op batched pytorch | 0.324753 / 0.540337 (-0.215584) |
| map no-op batched tensorflow | 0.341314 / 1.386936 (-1.045623) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008627 / 0.011353 (-0.002726) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004382 / 0.011008 (-0.006626) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.030193 / 0.038508 (-0.008315) |
| read_batch_unformated after write_array2d | 0.036103 / 0.023109 (0.012994) |
| read_batch_unformated after write_flattened_sequence | 0.333774 / 0.275898 (0.057876) |
| read_batch_unformated after write_nested_sequence | 0.342543 / 0.323480 (0.019063) |
| read_col_formatted_as_numpy after write_array2d | 0.006717 / 0.007986 (-0.001269) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003973 / 0.004328 (-0.000356) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007793 / 0.004250 (0.003542) |
| read_col_unformated after write_array2d | 0.044042 / 0.037052 (0.006990) |
| read_col_unformated after write_flattened_sequence | 0.319437 / 0.258489 (0.060948) |
| read_col_unformated after write_nested_sequence | 0.343817 / 0.293841 (0.049976) |
| read_formatted_as_numpy after write_array2d | 0.032061 / 0.128546 (-0.096486) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009790 / 0.075646 (-0.065857) |
| read_formatted_as_numpy after write_nested_sequence | 0.258361 / 0.419271 (-0.160910) |
| read_unformated after write_array2d | 0.052498 / 0.043533 (0.008965) |
| read_unformated after write_flattened_sequence | 0.313304 / 0.255139 (0.058165) |
| read_unformated after write_nested_sequence | 0.338594 / 0.283200 (0.055394) |
| write_array2d | 0.100761 / 0.141683 (-0.040922) |
| write_flattened_sequence | 1.784883 / 1.452155 (0.332728) |
| write_nested_sequence | 1.865838 / 1.492716 (0.373122) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.284010 / 0.018006 (0.266003) |
| get_batch_of_1024_rows | 0.541021 / 0.000490 (0.540531) |
| get_first_row | 0.002035 / 0.000200 (0.001835) |
| get_last_row | 0.000095 / 0.000054 (0.000040) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.035174 / 0.037411 (-0.002238) |
| shard | 0.024293 / 0.014526 (0.009767) |
| shuffle | 0.032208 / 0.176557 (-0.144348) |
| sort | 0.069898 / 0.737135 (-0.667237) |
| train_test_split | 0.032937 / 0.296338 (-0.263402) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.430395 / 0.215209 (0.215186) |
| read 50000 | 4.306861 / 2.077655 (2.229206) |
| read_batch 50000 10 | 1.878802 / 1.504120 (0.374682) |
| read_batch 50000 100 | 1.685058 / 1.541195 (0.143863) |
| read_batch 50000 1000 | 1.824969 / 1.468490 (0.356479) |
| read_formatted numpy 5000 | 0.443397 / 4.584777 (-4.141380) |
| read_formatted pandas 5000 | 4.680743 / 3.745712 (0.935031) |
| read_formatted tensorflow 5000 | 3.268837 / 5.269862 (-2.001025) |
| read_formatted torch 5000 | 0.959844 / 4.565676 (-3.605832) |
| read_formatted_batch numpy 5000 10 | 0.053731 / 0.424275 (-0.370544) |
| read_formatted_batch numpy 5000 1000 | 0.012419 / 0.007607 (0.004811) |
| shuffled read 5000 | 0.543838 / 0.226044 (0.317794) |
| shuffled read 50000 | 5.412277 / 2.268929 (3.143348) |
| shuffled read_batch 50000 10 | 2.394151 / 55.444624 (-53.050474) |
| shuffled read_batch 50000 100 | 1.994777 / 6.876477 (-4.881700) |
| shuffled read_batch 50000 1000 | 2.140788 / 2.142072 (-0.001285) |
| shuffled read_formatted numpy 5000 | 0.561053 / 4.805227 (-4.244175) |
| shuffled read_formatted_batch numpy 5000 10 | 0.125548 / 6.500664 (-6.375116) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.063421 / 0.075469 (-0.012048) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.442746 / 1.841788 (-0.399041) |
| map fast-tokenizer batched | 14.198510 / 8.074308 (6.124202) |
| map identity | 26.847560 / 10.191392 (16.656168) |
| map identity batched | 0.853060 / 0.680424 (0.172636) |
| map no-op batched | 0.515112 / 0.534201 (-0.019089) |
| map no-op batched numpy | 0.496890 / 0.579283 (-0.082393) |
| map no-op batched pandas | 0.506981 / 0.434364 (0.072617) |
| map no-op batched pytorch | 0.317450 / 0.540337 (-0.222887) |
| map no-op batched tensorflow | 0.327314 / 1.386936 (-1.059622) |
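
Every benchmark value above is printed as `new / old (diff)`. The numbers are consistent with `diff` being `new - old`, up to rounding in the last digit; that reading is an inference from the figures, not something the bot comment states. A minimal sketch for parsing such a cell and checking that relation follows. The names `parse_cell` and `is_consistent` are hypothetical helpers written for this illustration and are not part of the benchmark tooling.

```python
import re

# Hypothetical helper pattern: matches one "new / old (diff)" cell
# exactly as it is printed in the benchmark output above.
CELL = re.compile(r"^\s*(-?\d+\.\d+)\s*/\s*(-?\d+\.\d+)\s*\((-?\d+\.\d+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for one benchmark cell."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check the assumed relation diff == new - old within a rounding tolerance."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


# Two cells copied from the PyArrow==3.0.0 tables above.
cells = {
    "get_first_row": "0.009307 / 0.000200 (0.009108)",
    "sort": "0.063856 / 0.737135 (-0.673279)",
}

for name, cell in cells.items():
    new, old, diff = parse_cell(cell)
    direction = "increase" if diff > 0 else "decrease"
    print(name, direction, is_consistent(cell))
```

The tolerance of `2e-6` is an assumption chosen because the printed values carry six decimal places and some differences are off by one in the last digit.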
