MJST+ Dataset Generation

We used TextRecognitionDataGenerator and SynthText to generate the MJST+ dataset. The generation process is outlined below.

1. Preparation

  1. list_gallery: 700,000 text entries randomly selected from the source corpora.
  2. background imgs: 8,000 natural scene images taken from SynthText and used as backgrounds.

2. Installation

Install the Python dependencies:

 cd generate_data_by_trgd/TextRecognitionDataGenerator
 pip install -r requirements.txt

3. Run to generate data

For convenience, assume that you have placed the corpus data in raw_gallery_data_folder. Use the following script to generate list_gallery.txt with a specified number of entries.

raw_gallery_data_folder=/path/to/your/raw_gallery_data_folder
list_gallery_path=/path/to/your/list_gallery.txt
count=60000000 # number of corpus entries to generate
python gen_gallery.py "$raw_gallery_data_folder" "$list_gallery_path" "$count"
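
To make the sampling step more concrete, here is a minimal sketch of what a gallery sampler could look like. It is not the repository's gen_gallery.py: the function name sample_gallery, the assumption that the corpus is stored as UTF-8 .txt files with one entry per line, and the fallback to sampling with replacement when count exceeds the corpus size are all hypothetical choices made for illustration.

```python
import random
from pathlib import Path


def sample_gallery(raw_folder, out_path, count, seed=0):
    """Write `count` randomly chosen corpus lines to `out_path`, one per line.

    Hypothetical helper: assumes the corpus is a folder of UTF-8 .txt files
    with one text entry per line. Returns the number of lines written.
    """
    rng = random.Random(seed)
    lines = []
    # Collect every non-empty line from the .txt files under raw_folder.
    for txt in sorted(Path(raw_folder).rglob("*.txt")):
        with open(txt, encoding="utf-8") as f:
            lines.extend(s.strip() for s in f if s.strip())
    if not lines:
        raise ValueError(f"no text found under {raw_folder}")
    # Sample without replacement when the corpus is large enough,
    # otherwise fall back to sampling with replacement.
    if count <= len(lines):
        picked = rng.sample(lines, count)
    else:
        picked = rng.choices(lines, k=count)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(p + "\n" for p in picked)
    return len(picked)
```

Fixing the seed keeps the generated list reproducible across runs. Keep the output file outside the raw corpus folder so that a rerun does not read the previous output back in as corpus data.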

Assuming the background images are in BG_IMG_PATH, run the following script to generate the data.

sh run_gen_data.sh

The final format of the generated data is as follows:

.
├── data
│   ├── Label_data.txt
│   ├── Label_data_clean.txt
│   └── imgs
└── labels
    └── details
        └── list_gallery.txt
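
After a run, it can be useful to confirm that the output matches the tree above. The following is a small sketch of such a check. The helper name missing_outputs and the idea of passing the output root as an argument are assumptions for illustration; only the relative paths come from the layout shown above.

```python
from pathlib import Path

# Relative paths taken from the directory tree shown above.
EXPECTED = [
    "data/Label_data.txt",
    "data/Label_data_clean.txt",
    "data/imgs",
    "labels/details/list_gallery.txt",
]


def missing_outputs(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list from missing_outputs("/path/to/your/output") means every expected file and folder is present. It does not check the contents of the label files.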

Acknowledgement

We are grateful for the generation tools provided by TextRecognitionDataGenerator and SynthText.