DONUT: Reading order for pseudo-OCR pre-training task #478

Open
mustaszewski opened this issue Jan 16, 2025 · 1 comment

Comments

@mustaszewski

mustaszewski commented Jan 16, 2025

I would like to train the Donut base model for a few more epochs on the pre-training pseudo-OCR task using a custom dataset. In what reading order should the individual words of the document image be passed to the model? The Donut paper states:

The model is trained to read all texts in the image in reading order (from top-left to bottom-right, basically). [...] This task can be interpreted as a pseudo-OCR task.

What does "top-left to bottom-right" mean for multi-column text? For instance, consider the attached dummy document with one heading and two text columns:
[Image: 000a_readingorder]
Should the document be transcribed as:

  • Word1 Col1w1 Col1w2 Col2w1 Col2w2, or
  • Word1 Col1w1 Col2w1 Col1w2 Col2w2 ?

I imagine that any dataset used for the pseudo-OCR pre-training task should follow the same reading-order policy as the pre-trained Donut base model. Unfortunately, I could not find any information on the exact implementation of "top-left to bottom-right" in the paper, the paper supplement, or the source code.
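
To make the two candidate orderings concrete, here is a minimal Python sketch that produces both from word boxes. It is only an illustration: the coordinates, the `Word` type, the `row_major` and `column_aware` helpers, and the `heading_bottom` / `column_split` thresholds are all hypothetical, and nothing here is taken from the Donut code base. Which of the two (if either) matches the pre-trained model is exactly the open question.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word box (hypothetical pixel units)
    y: float  # top edge of the word box (hypothetical pixel units)


# Hypothetical layout mirroring the dummy document:
# one heading on top, two text columns below it.
WORDS = [
    Word("Col2w2", 300, 200),
    Word("Col1w1", 50, 100),
    Word("Word1", 50, 20),
    Word("Col2w1", 300, 100),
    Word("Col1w2", 50, 200),
]


def row_major(words: List[Word]) -> List[str]:
    """Sort purely by position: top to bottom, then left to right."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_aware(words: List[Word], heading_bottom: float, column_split: float) -> List[str]:
    """Read the heading first, then each column top to bottom.

    heading_bottom and column_split are assumed to be known in advance;
    a real pipeline would need layout analysis to find them.
    """
    heading = [w for w in words if w.y < heading_bottom]
    body = [w for w in words if w.y >= heading_bottom]
    left = [w for w in body if w.x < column_split]
    right = [w for w in body if w.x >= column_split]

    ordered = sorted(heading, key=lambda w: (w.y, w.x))
    for column in (left, right):
        ordered += sorted(column, key=lambda w: (w.y, w.x))
    return [w.text for w in ordered]


if __name__ == "__main__":
    print(" ".join(column_aware(WORDS, heading_bottom=60, column_split=175)))
    print(" ".join(row_major(WORDS)))
```

With this layout, `column_aware` yields the first transcription listed above and `row_major` yields the second.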

@NielsRogge
Owner

Hi,

It would be best to contact the Donut author about this: @gwkrsrch
