This is the pytorch implementation of Paper: ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer (ICCV 2023). The paper is available at this link.
2024.04.09
We release a new text spotting pipeline Bridge Text Spotting that combines the advantages of end-to-end and two-step text spotting. Code
2023.07.21
Code is available.
Python 3.8 + PyTorch 1.10.0 + CUDA 11.3 + torchvision=0.11.0 + Detectron2 (v0.2.1) + OpenCV for visualization
conda create -n ESTS python=3.8 -y
conda activate ESTS
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
git clone https://github.com/mxin262/ESTextSpotter.git
cd detectron2-0.2.1
python setup.py build develop
pip install opencv-python
cd models/ests/ops
sh make.sh
Please download TotalText, CTW1500, MLT, ICDAR2013, ICDAR2015, and CurvedSynText150k according to the guide provided by SPTS v2: README.md.
Please download the MLT 2019 in Images / Annotations.
Extract all the datasets and make sure you organize them as follows
- datasets
| - CTW1500
| | - annotations
| | - ctwtest_text_image
| | - ctwtrain_text_image
| - totaltext (or icdar2015)
| | - test_images
| | - train_images
| | - test.json
| | - train.json
| - mlt2017 (or syntext1, syntext2)
| - annotations
| - images
Dataset | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
---|---|---|---|---|---|---|
Pretrain | 90.7 | 85.3 | 87.9 | 73.8 | 85.5 | OneDrive |
Total-Text | 91.8 | 88.2 | 90.0 | 80.9 | 87.1 | OneDrive |
CTW1500 | 91.3 | 88.6 | 89.9 | 65.0 | 83.9 | OneDrive |
Dataset | Det-P | Det-R | Det-F1 | E2E-S | E2E-W | E2E-G | Weights |
---|---|---|---|---|---|---|---|
ICDAR2015 | 95.1 | 88 | 91.4 | 88.5 | 83.1 | 78.1 | OneDrive |
Dataset | H-mean | Weights |
---|---|---|
VinText | 73.6 | OneDrive |
Dataset | Det-P | Det-R | Det-H | 1-NED | Weights |
---|---|---|---|---|---|
ICDAR 2019 ReCTS | 94.1 | 91.3 | 92.7 | 78.1 | OneDrive |
Dataset | R | P | H | AP | Arabic | Latin | Chinese | Japanese | Korean | Bangla | Hindi | Weights |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MLT | 75.5 | 83.37 | 79.24 | 72.52 | 52.00 | 77.34 | 48.20 | 48.42 | 63.56 | 38.26 | 50.83 | OneDrive |
We use 8 GPUs for training and 2 images each GPU by default.
- Pretrain
bash scripts/Pretrain.sh /path/to/your/dataset
- Fine-tune model on the mixed real dataset
bash scripts/Joint_train.sh /path/to/your/dataset
- Fine-tune model
bash scripts/TT_finetune.sh /path/to/your/dataset
0 for Text Detection; 1 for Text Spotting.
bash scripts/test.sh config/ESTS/ESTS_5scale_tt_finetune.py /path/to/your/dataset 1 /path/to/your/checkpoint /path/to/your/test dataset
e.g.:
bash scripts/test.sh config/ESTS/ESTS_5scale_tt_finetune.py ../datasets 1 totaltext_checkpoint.pth totaltext_val
Visualize the detection and recognition results
python vis.py
This repository can only be used for non-commercial research purpose.
For commercial use, please contact Prof. Lianwen Jin ([email protected]).
Copyright 2023, Deep Learning and Vision Computing Lab, South China University of Technology.
AdelaiDet, DINO, Detectron2, TESTR
If our paper helps your research, please cite it in your publications:
@InProceedings{Huang_2023_ICCV,
author = {Huang, Mingxin and Zhang, Jiaxin and Peng, Dezhi and Lu, Hao and Huang, Can and Liu, Yuliang and Bai, Xiang and Jin, Lianwen},
title = {ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {19495-19505}
}