Release v0.1.2: more core functions are available now.

@HYLcool HYLcool released this 28 Sep 06:32
5bd715d

New OPs

  • nlpaug_en_mapper: simple data augmentation for English corpora using the nlpaug library. #17
  • nlpcda_zh_mapper: simple data augmentation for Chinese corpora using the nlpcda library. #17
  • token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24
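
To illustrate the idea behind a token-count filter, here is a minimal sketch. It is not Data-Juicer's implementation: the function name filter_by_token_num, its parameters, and the whitespace tokenizer are all hypothetical stand-ins. In practice the tokenize callable could wrap an HF tokenizer instead.

```python
from typing import Callable, Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): splits on whitespace only.
    return text.split()


def filter_by_token_num(
    samples: List[Dict[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    min_num: int = 1,
    max_num: int = 10,
    text_key: str = "text",
) -> List[Dict[str, str]]:
    # Keep only samples whose token count lies in [min_num, max_num].
    kept = []
    for sample in samples:
        num_tokens = len(tokenize(sample[text_key]))
        if min_num <= num_tokens <= max_num:
            kept.append(sample)
    return kept


samples = [
    {"text": "hi"},                        # 1 token: too short
    {"text": "a short but valid sample"},  # 5 tokens: kept
    {"text": "word " * 50},                # 50 tokens: too long
]
kept = filter_by_token_num(samples, min_num=2, max_num=10)
```

Passing the tokenizer as a callable keeps the counting logic independent of any particular tokenizer library.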

New features

  • OP Fusion #14
    • Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time spent processing datasets.
  • Cache management #19
    • Cache management now works in Data-Juicer thanks to a new serialization method.
    • Cache compression is supported: caches are automatically compressed when they are no longer in use and decompressed when needed, which saves up to 50% of disk space.
  • Distributed data processing with Ray is supported now. #21
  • Config system optimization:
    • Keep only text_keys and remove the previous misleading args text_key(s)_to_process/load. #13
    • A new argument export_in_parallel is added to control whether the result datasets are exported in parallel. #17
    • Display the config table after config parsing is ready. #17
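
The OP Fusion item above can be pictured with a small sketch. This is an illustration under stated assumptions, not Data-Juicer's actual fusion code: the two filters, the split_words helper, and the call counter are hypothetical. The shared "contextual variable" here is the word list, which the unfused run computes once per filter and the fused run computes once per sample.

```python
from typing import Dict, List

# Counts how often the shared intermediate result is computed (illustration only).
calls = {"split_words": 0}


def split_words(text: str) -> List[str]:
    calls["split_words"] += 1
    return text.split()


def word_count_ok(words: List[str], min_words: int = 3) -> bool:
    return len(words) >= min_words


def repetition_ok(words: List[str], max_ratio: float = 0.5) -> bool:
    # Fraction of words that are repeats of an earlier word.
    if not words:
        return False
    return 1 - len(set(words)) / len(words) <= max_ratio


def run_unfused(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # Each filter recomputes the word list on its own.
    kept = []
    for s in samples:
        ok_len = word_count_ok(split_words(s["text"]))
        ok_rep = repetition_ok(split_words(s["text"]))
        if ok_len and ok_rep:
            kept.append(s)
    return kept


def run_fused(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # The word list is computed once and shared by both filters.
    kept = []
    for s in samples:
        words = split_words(s["text"])
        if word_count_ok(words) and repetition_ok(words):
            kept.append(s)
    return kept


samples = [
    {"text": "the quick brown fox"},
    {"text": "spam spam spam spam"},
    {"text": "too short"},
]

calls["split_words"] = 0
unfused_result = run_unfused(samples)
unfused_calls = calls["split_words"]

calls["split_words"] = 0
fused_result = run_fused(samples)
fused_calls = calls["split_words"]
```

Both runs keep the same samples, but the fused run performs half as many word splits in this toy case. The actual saving depends on how expensive the shared computation is relative to the rest of the pipeline.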

Others

  • Replace original string constants with constant enums. #13
  • Expand the checkpoint protection range to cover the exporting process. #14
  • Remove the storage of extra intermediate variables in document_simhash_deduplicator to save more memory. #14
  • Docs updates. #15 #16
  • The PyPI package is available. You can now install Data-Juicer with pip install py-data-juicer. #23
  • Docker builds are available now. The official Docker image for Docker Hub is in progress. #23
  • Deploy the unit tests for Data-Juicer. #29