OCR with swin-transformer
A simple, easy-to-follow swin-transformer OCR project. The model in this repository relies heavily on high-level open-source projects such as timm and x_transformers, and the training procedure is easy to follow thanks to the readability of pytorch-lightning.
The model encodes the input image into a context vector using shifted windows, the swin-transformer encoding mechanism, and decodes that vector with a standard auto-regressive transformer.
If you are not familiar with the transformer OCR structure, transformer-ocr may be easier to understand, because it uses a traditional convolutional network (ResNet-v2) as the encoder.
On a private Korean handwritten text dataset, the accuracy (exact match) is 97.6%.
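To make the shifted-window idea concrete, here is a minimal pure-Python sketch. It is not the encoder used in this repository (that comes from timm); the helper names `window_partition` and `cyclic_shift` are made up for illustration, and the 4x4 grid of patch indices stands in for a real feature map. It only shows how rolling the grid by half a window before partitioning makes each new window straddle several of the previous ones.

```python
def window_partition(grid, window):
    """Split a 2D grid (list of rows) into non-overlapping window x window blocks."""
    height, width = len(grid), len(grid[0])
    assert height % window == 0 and width % window == 0
    blocks = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            blocks.append([row[left:left + window] for row in grid[top:top + window]])
    return blocks


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


# Hypothetical 4x4 "feature map" whose entries are just patch indices 0..15.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
window = 2

# Regular layer: attention would be computed inside each of these blocks.
regular = window_partition(feature_map, window)
# Shifted layer: roll by half a window first, then partition again.
shifted = window_partition(cyclic_shift(feature_map, window // 2), window)

print(regular[0])  # -> [[0, 1], [4, 5]]
print(shifted[0])  # -> [[5, 6], [9, 10]]
```

Patches 5, 6, 9 and 10 each belong to a different regular window, so alternating the two partitions lets information flow across window borders. In the real swin-transformer, windows that contain wrapped-around patches also use an attention mask so that patches that are not actually neighbours do not attend to each other; that part is left out of this sketch.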
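For reference, exact match means a prediction counts as correct only when the whole predicted string equals the label. A minimal sketch of that metric follows; the function name `exact_match_accuracy` is hypothetical, and the evaluation code in this repository may differ in details such as whitespace or case normalization.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of samples whose predicted text equals the label exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(pred == target for pred, target in zip(predictions, targets))
    return hits / len(targets)


# Made-up example data: one prediction differs by a single character.
predictions = ["Hello World.", "vision-transformer-ocr", "Hello world.", "ocr"]
targets = ["Hello World.", "vision-transformer-ocr", "Hello World.", "ocr"]
print(exact_match_accuracy(predictions, targets))  # -> 0.75
```

A single wrong character makes the whole sample count as a miss, so this metric is stricter than a character-level error rate.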
./dataset/
├─ preprocessed_image/
│ ├─ cropped_image_0.jpg
│ ├─ cropped_image_1.jpg
│ ├─ ...
├─ train.txt
└─ val.txt
# in train.txt
cropped_image_0.jpg\tHello World.
cropped_image_1.jpg\tvision-transformer-ocr
...
You should preprocess the data first. Crop each image to a word-level or sentence-level region, and put all of the images in one directory. Provide the ground truth in a txt file: on each line, write the image file name and its label, separated by \t.
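The label file format described above can be read with a few lines of Python. This is a sketch under stated assumptions, not the loader used by this repository: the names `parse_label_line` and `read_label_file` are hypothetical, the file is assumed to be UTF-8, and everything after the first tab is treated as the label.

```python
from pathlib import Path


def parse_label_line(line):
    """Split one 'file_name<TAB>label' line into (file_name, label)."""
    line = line.rstrip("\r\n")
    name, sep, label = line.partition("\t")
    if not sep:
        raise ValueError(f"missing tab separator in line: {line!r}")
    return name, label


def read_label_file(label_path, image_dir):
    """Return a list of (image_path, label) pairs from a train.txt / val.txt file."""
    samples = []
    with open(label_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            name, label = parse_label_line(line)
            samples.append((Path(image_dir) / name, label))
    return samples
```

With the layout shown earlier, a call such as `read_label_file("dataset/train.txt", "dataset/preprocessed_image")` would return pairs like `(Path("dataset/preprocessed_image/cropped_image_0.jpg"), "Hello World.")`.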
In the settings/ directory, you can find default.yaml. You can set almost every hyper-parameter in that file. Copy it and edit the copy for each experiment version. I recommend running with the default settings first, before changing anything.
python run.py --version 0 --setting settings/default.yaml --num_workers 16 --batch_size 128
You can check your training log with tensorboard.
tensorboard --logdir tb_logs --bind_all
Once training finishes, you can use the model for prediction.
python predict.py --setting <your_setting.yaml> --target <image_or_directory> --tokenizer <your_tokenizer_pkl> --checkpoint <saved_checkpoint>
You can export your model to ONNX format, which is straightforward with pytorch-lightning. See the pytorch-lightning documentation on ONNX export.
@misc{liu-2021,
title = {Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
author = {Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng Zhang and Stephen Lin and Baining Guo},
year = {2021},
eprint = {2103.14030},
archivePrefix = {arXiv}
}