Sanitizer is a Python library for dealing with duplicated training data. It uses the concurrent.futures module, available in the standard library since Python 3.2, to minimize the overall processing time.
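To illustrate how concurrent.futures can speed up this kind of work, here is a minimal sketch that hashes files in parallel and groups those with identical content. It is not Sanitizer's actual implementation: the function names `file_digest` and `find_duplicates`, the choice of SHA-256, and the use of a thread pool are all assumptions made for the example.

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents (hypothetical helper)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_duplicates(paths: list[Path]) -> dict[str, list[Path]]:
    """Group files by content hash and keep only groups with more than one file."""
    with ThreadPoolExecutor() as pool:
        # Executor.map() preserves input order, so digests line up with paths.
        digests = list(pool.map(file_digest, paths))

    groups: dict[str, list[Path]] = {}
    for path, digest in zip(paths, digests):
        groups.setdefault(digest, []).append(path)
    return {digest: files for digest, files in groups.items() if len(files) > 1}


if __name__ == "__main__":
    # Demo on throwaway files: a.txt and b.txt share the same content.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("same")
        (root / "b.txt").write_text("same")
        (root / "c.txt").write_text("different")
        for digest, files in find_duplicates(sorted(root.iterdir())).items():
            print(digest[:8], [f.name for f in files])
```

A thread pool is used here because reading files is mostly I/O-bound; for CPU-heavy comparisons, `ProcessPoolExecutor` from the same module offers the same `map()` interface.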
First, adapt the labels.json and config.json files to your needs.
labels.json
- Maps each input folder name (key) to its related id and ASCII symbol (value), in JSON format.
config.json
- Maps each symbol value from labels.json (key) to the name of its result folder (value), in JSON format.
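As a rough illustration of how labels.json and config.json relate, the sketch below shows one possible shape for each file and resolves every input folder to its result folder through the shared symbol. The exact schema is an assumption: the folder names, the `"id"` and `"symbol"` field names, and the helper `resolve_output_folders` are hypothetical, so check the files shipped with the project for the real layout.

```python
import json

# Hypothetical labels.json content: input folder name -> id and ASCII symbol.
LABELS_JSON = """
{
  "cats_raw": {"id": 0, "symbol": "c"},
  "dogs_raw": {"id": 1, "symbol": "d"}
}
"""

# Hypothetical config.json content: symbol from labels.json -> result folder name.
CONFIG_JSON = """
{
  "c": "cats_clean",
  "d": "dogs_clean"
}
"""


def resolve_output_folders(labels: dict, config: dict) -> dict:
    """Map each input folder to its result folder via the symbol both files share."""
    resolved = {}
    for folder, meta in labels.items():
        symbol = meta["symbol"]
        if symbol not in config:
            # Every symbol used in labels.json needs an entry in config.json.
            raise KeyError(f"symbol {symbol!r} is missing from config.json")
        resolved[folder] = config[symbol]
    return resolved


labels = json.loads(LABELS_JSON)
config = json.loads(CONFIG_JSON)
print(resolve_output_folders(labels, config))
```

Checking that every symbol in labels.json has a matching key in config.json before a run catches mismatched edits to the two files early.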
To run Sanitizer with Poetry:
poetry run sanitizer
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.