Release v0.1.2: more core functions are available now. · modelscope/data-juicer

New OPs

nlpaug_en_mapper: simple data augmentation using the nlpaug library for English corpora. #17
nlpcda_zh_mapper: simple data augmentation using the nlpcda library for Chinese corpora. #17
token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24
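
To illustrate what a token-count filter does, here is a minimal sketch. It is not the token_num_filter implementation: the function names, the parameters (min_num, max_num, text_key) and the whitespace tokenizer are assumptions made for this example. Any callable that maps a string to a list of tokens, such as the `tokenize` method of a Hugging Face tokenizer, could be passed in instead.

```python
from typing import Callable, Iterable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption for this sketch): split on whitespace."""
    return text.split()


def filter_by_token_num(
    samples: Iterable[dict],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    min_num: int = 2,
    max_num: int = 5,
    text_key: str = "text",
) -> List[dict]:
    """Keep only samples whose token count lies in [min_num, max_num]."""
    return [
        sample
        for sample in samples
        if min_num <= len(tokenize(sample[text_key])) <= max_num
    ]


samples = [
    {"text": "hi"},
    {"text": "a short sample"},
    {"text": "this one has far too many tokens to keep"},
]
kept = filter_by_token_num(samples)
print(kept)
```

With the default bounds, only the three-token sample passes. Swapping the tokenizer changes the counts, so the bounds should be chosen with the target tokenizer in mind.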

OP Fusion #14
- Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time needed to process datasets.
Cache management #19
- Cache management now works in Data-Juicer thanks to a new serialization method.
- Cache compression is supported: caches are automatically compressed when they are not in use and decompressed when needed, which saves up to 50% of disk space.
Distributed data processing with Ray is supported now. #21
Config sys optimization:
- Keep only text_keys and remove the previous, misleading args text_key(s)_to_process/load. #13
- A new argument, export_in_parallel, is added to control whether to export the result datasets in parallel. #17
- Display the config table after config parsing is ready. #17
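
The idea behind the OP Fusion item above can be pictured with a small sketch. This is not Data-Juicer's fusion mechanism: the filter functions, the context dictionary and the "words" key are hypothetical names chosen for illustration. The assumption is that two filters need the same intermediate result (here, a word list), so running them together on each sample lets that result be computed once and shared.

```python
from typing import Callable, Dict, List

Context = Dict[str, List[str]]
Filter = Callable[[str, Context], bool]

# Counts how often the shared intermediate result is actually computed.
call_count = {"split": 0}


def get_words(text: str, context: Context) -> List[str]:
    """Return the word list, computing it only if the context lacks it."""
    if "words" not in context:
        call_count["split"] += 1
        context["words"] = text.split()
    return context["words"]


def word_num_ok(text: str, context: Context) -> bool:
    """Hypothetical filter: keep texts with 2 to 10 words."""
    return 2 <= len(get_words(text, context)) <= 10


def no_long_word(text: str, context: Context) -> bool:
    """Hypothetical filter: keep texts with no word longer than 12 characters."""
    return all(len(word) <= 12 for word in get_words(text, context))


def run_unfused(texts: List[str], filters: List[Filter]) -> List[str]:
    """Apply filters one pass at a time; each call gets a fresh context."""
    kept = texts
    for f in filters:
        kept = [t for t in kept if f(t, {})]
    return kept


def run_fused(texts: List[str], filters: List[Filter]) -> List[str]:
    """Apply all filters per sample, sharing one context between them."""
    kept = []
    for t in texts:
        context: Context = {}
        if all(f(t, context) for f in filters):
            kept.append(t)
    return kept


texts = ["ok sample here", "single", "has an extraordinarilylongword inside"]
filters = [word_num_ok, no_long_word]

call_count["split"] = 0
unfused_result = run_unfused(texts, filters)
unfused_splits = call_count["split"]

call_count["split"] = 0
fused_result = run_fused(texts, filters)
fused_splits = call_count["split"]

print(unfused_result, unfused_splits)
print(fused_result, fused_splits)
```

Both runs keep the same samples, but the fused run computes the word list at most once per sample. How much time this saves in practice depends on how expensive the shared computation is relative to the rest of the pipeline.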

Replace original string constants with constant enums. #13
Expand the checkpoint protection range to cover the exporting process. #14
Remove the storage of extra intermediate variables in document_simhash_deduplicator to save more memory. #14
Docs updates. #15 #16
PyPI package is available. You can now install data-juicer with pip install py-data-juicer. #23
Docker building is available now. The official docker image for Docker Hub is in progress. #23
Deploy the unit tests for Data-Juicer. #29