2.3.0
Datasets Changes
- New: ImageNet-Sketch by @nateraw in #4301
- New: Biwi Kinect Head Pose by @dnaveenr in #3903
- New: enwik8 by @HallerPatrick in #4321
- New: LCCC dataset by @silverriver in #4416
- New: TruthfulQA by @jon-tow in #4159
- New: BIG-bench by @andersjohanandreassen in #4125
- New: QuickDraw by @mariosasko in #3592
- New: SST-2 by @albertvillanova in #4473
- Update: imagenet-1k - remove manual download by @mariosasko in #4299
  - ImageNet can now be loaded in Python with load_dataset without requiring a manual download!
  - It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)
- Update: spider - Remove Google Drive URL by @albertvillanova in #4410
- Update: blended_skill_talk - add missing columns by @mariosasko in #4437
- Update: multi-news - Use newer version with fixes by @JohnGiorgi in #4451
- Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/44554459
- Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/
- Update: udhr - update metadata by @leondz in #4362
- Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in #4469
- Update: PASS - update dataset version by @mariosasko in #4488
- Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in #4389
- Fix: GEM - fix URL for totto config by @albertvillanova in #4396
- Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in #4424
- Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in #4425
- Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in #4436
- Fix: iwslt2017 by @lhoestq in #4481
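The imagenet-1k update listed above can be tried along the following lines. This is a minimal sketch, not an official recipe: `take` and `stream_imagenet` are hypothetical helper names, and it assumes `datasets` 2.3.0 or later is installed and that the caller already has access to the dataset on the Hub (any authentication arguments your setup needs can be forwarded through `load_kwargs`).

```python
from itertools import islice


def take(examples, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))


def stream_imagenet(split="train", **load_kwargs):
    """Open imagenet-1k in streaming mode, so nothing is downloaded up front.

    Assumption: the `datasets` library is installed and the Hub grants access
    to this dataset; extra keyword arguments are passed to load_dataset as-is.
    """
    # Imported lazily so that `take` stays usable without the library.
    from datasets import load_dataset

    return load_dataset("imagenet-1k", split=split, streaming=True, **load_kwargs)
```

With access in place, `take(stream_imagenet(), 3)` would fetch only the first three training examples rather than the whole archive.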
Dataset Features
- to_tf_dataset rewrite by @Rocketknight1 in #4170
  - see more in the documentation
- Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in #4375
  - see more in the documentation
- Added stratify option to train_test_split by @nandwalritik in #4322
- Re-add support for Apache Beam functionality by @albertvillanova in #4328
- Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in #4402
- Support nested/complex feature types as features in packaged loaders by @mariosasko in #4364
- Optimize contiguous shard and select by @lhoestq in #4466
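To illustrate what the stratify option for train_test_split is meant to achieve, here is a plain-Python sketch of a stratified split. It is not the library's implementation: `stratified_split` is a hypothetical helper that only shows the idea of splitting each label group separately so that class proportions are preserved in both parts.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with each label keeping its share in both parts."""
    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # shuffle within the group only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 8 "cat" examples and 4 "dog" examples (a 2:1 ratio).
labels = ["cat"] * 8 + ["dog"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
```

In the toy data the 2:1 ratio of the two labels is kept in both the train and the test indices. In the library itself, the equivalent call should look roughly like `ds.train_test_split(test_size=0.25, stratify_by_column="label")`, assuming the `label` column is a `ClassLabel` feature; check the train_test_split documentation for the exact requirements.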
Dataset Cards
- Minor fixes/improvements in scene_parse_150 card by @mariosasko in #4447
- Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in #4378
- Fix example in opus_ubuntu, Add license info by @leondz in #4360
- Update README.md of fquad by @lhoestq in #4450
Documentation
- Add API code examples for loading methods by @stevhliu in #4300
- Add API code examples for remaining main classes by @stevhliu in #4292
- Generalize tutorials for audio and vision by @stevhliu in #4468
- [Docs] How to use with PyTorch page by @lhoestq in #4474
- First draft of the docs for TF + Datasets by @Rocketknight1 in #4457
Other improvements and bug fixes
- Update CI deprecated legacy image by @albertvillanova in #4393
- remove int documentation from logging docs by @lvwerra in #4392
- Fix docstring in DatasetDict::shuffle by @felixdivo in #4344
- Fix Version equality by @albertvillanova in #4359
- Set builder name from module instead of class by @albertvillanova in #4388
- Test dill by @albertvillanova in #4385
- Refactor download by @albertvillanova in #4384
- Fix dependency on dill version by @albertvillanova in #4397
- Support remote cache_dir by @albertvillanova in #4347
- Update imagenet gate by @lhoestq in #4408
- Fix dataset builder default version by @albertvillanova in #4356
- Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in #4403
- Rename DatasetBuilder config_name by @albertvillanova in #4414
- Fix metadata validation by @albertvillanova in #4390
- Add HF.co for PRs/Issues for specific datasets by @lhoestq in #4427
- Fix type hint and documentation for new_fingerprint by @fxmarty in #4326
- Skip hidden files/directories in data files resolution and iter_files by @mariosasko in #4412
- Fix docstring of inspect_dataset by @albertvillanova in #4438
- Fix builder docstring by @albertvillanova in #4432
- Fix kwargs in docstrings by @albertvillanova in #4444
- Fix missing args in docstring of load_dataset_builder by @albertvillanova in #4445
- Add missing kwargs to docstrings by @albertvillanova in #4446
- Add extractor for bzip2-compressed files by @asivokon in #4421
- Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in #4434
- Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in #4415
- Update builder docstring for deprecated/added arguments by @albertvillanova in #4429
- Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in #4464
- Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in #4433
- Fix bigbench config names by @lhoestq in #4465
- Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in #4472
- Reorder returned validation/test splits in script template by @albertvillanova in #4470
- Better ImportError message when a dataset script dependency is missing by @lhoestq in #4484
- Fix cast to null by @lhoestq in #4485
- Update _format_columns in remove_columns by @alvarobartt in #4411
- Fix wrong map parameter name in cache docs by @h4iku in #4293
- Pin the revision in imagenet download links by @lhoestq in #4492
- Refactor column mappings for question answering datasets by @lewtun in #4391
New Contributors
- @leondz made their first contribution in #4378
- @felixdivo made their first contribution in #4344
- @nandwalritik made their first contribution in #4322
- @fxmarty made their first contribution in #4326
- @HallerPatrick made their first contribution in #4321
- @silverriver made their first contribution in #4416
- @asivokon made their first contribution in #4421
- @andersjohanandreassen made their first contribution in #4125
Full Changelog: 2.2.2...2.3.0