We used TextRecognitionDataGenerator and SynthText to generate the MJST+ dataset. The inputs and the generation process are outlined below.
- list_gallery: 700,000 text entries randomly selected from the corpora.
- background imgs: 8,000 natural scene images taken from SynthText, used as backgrounds.
Install the required PyPI packages:
cd generate_data_by_trgd/TextRecognitionDataGenerator
pip install -r requirements.txt
For convenience, assume that you have placed the corpora data in raw_gallery_data_folder. Use the following script to generate a list_gallery.txt containing a specified number of entries.
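raw_gallery_data_folder=/path/to/your/raw_gallery_data_folder # placeholder: folder holding the corpora data
list_gallery_path=/path/to/your/list_gallery.txt
count=60000000 # number of corpus entries to generate
python gen_gallery.py "$raw_gallery_data_folder" "$list_gallery_path" "$count"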
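The sampling step above can be pictured with a minimal sketch. This is not the repository's gen_gallery.py. The function name `sample_gallery`, the assumption that the corpora are stored as UTF-8 `.txt` files with one entry per line, and the `seed` parameter are all hypothetical and serve only to illustrate randomly drawing a fixed number of entries into a list file.

```python
import random
from pathlib import Path


def sample_gallery(raw_folder, out_path, count, seed=0):
    """Hypothetical stand-in for a gallery sampling step (not gen_gallery.py)."""
    # Collect non-empty lines from every .txt file under raw_folder (assumed layout).
    lines = []
    for txt in sorted(Path(raw_folder).rglob("*.txt")):
        with open(txt, encoding="utf-8") as f:
            lines.extend(s.strip() for s in f if s.strip())
    if not lines:
        raise ValueError(f"no text lines found under {raw_folder}")

    rng = random.Random(seed)
    # Sample without replacement when enough lines exist, otherwise with replacement.
    if count <= len(lines):
        picked = rng.sample(lines, count)
    else:
        picked = rng.choices(lines, k=count)

    # Write one entry per line, mirroring a list_gallery.txt style file.
    Path(out_path).write_text("\n".join(picked) + "\n", encoding="utf-8")
    return len(picked)
```

This sketch loads every line into memory, which may not be practical for very large corpora. A streaming approach such as reservoir sampling would be an alternative in that case.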
Assuming you have placed the background images in BG_IMG_PATH, run the following script to generate the data:
sh run_gen_data.sh
The final format of the generated data is as follows:
.
├── data
│   ├── Label_data.txt
│   ├── Label_data_clean.txt
│   └── imgs
└── labels
    └── details
        └── list_gallery.txt
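After generation finishes, a quick check that the expected files exist can catch a failed run early. The sketch below is only an illustration: the helper `missing_outputs` is hypothetical, and it assumes that list_gallery.txt sits under labels/details, as the directory tree above suggests.

```python
from pathlib import Path

# Expected entries, taken from the directory tree above. The nesting of
# labels/details/list_gallery.txt is an assumption.
EXPECTED = [
    "data/Label_data.txt",
    "data/Label_data_clean.txt",
    "data/imgs",
    "labels/details/list_gallery.txt",
]


def missing_outputs(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list returned by `missing_outputs("/path/to/output")` would indicate that every expected entry is present. The path here is a placeholder.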
We are grateful for the generation tools provided by TextRecognitionDataGenerator and SynthText.