
Plain Text vs. Format Output #182

Open
ep0p opened this issue Nov 7, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments


ep0p commented Nov 7, 2024

I've encountered an issue with GOT text recognition during inference. Using 'plain text' often results in awkwardly spaced letters within words. However, switching to 'format' produces well-structured and accurate text. My document is in French, and GOT handles French unexpectedly well without fine-tuning. Here are the examples:

Here is the document:
[image]

Here is the inference in plain text:
[image]

Here is the inference with 'format':
[image]

@Ucas-HaoranWei (Owner)

Ha ha, the 'plain text' mode only uses the Fitz-extracted data, without formatting.
I am delighted that GOT's zero-shot ability in French is good. Thank you for the test.
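For reference, the two behaviours correspond to the `ocr_type` argument in the project's Hugging Face usage example. This is a sketch from memory; the model id and the `chat()` signature are assumptions to verify against the current README:

```python
# Sketch: run GOT-OCR2.0 in both modes, based on the project's Hugging Face
# usage example. The model id and chat() signature are assumptions; check
# them against the repo's README before relying on this.

def load_got(model_id="ucaslcl/GOT-OCR2_0"):
    # Imports deferred so the sketch can be read without transformers installed.
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        device_map="cuda",
        use_safetensors=True,
        pad_token_id=tokenizer.eos_token_id,
    ).eval()
    return model, tokenizer

def ocr_both(model, tokenizer, image_file):
    # 'ocr' is the plain-text path; 'format' requests structured output.
    plain = model.chat(tokenizer, image_file, ocr_type="ocr")
    formatted = model.chat(tokenizer, image_file, ocr_type="format")
    return plain, formatted
```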

@Ucas-HaoranWei Ucas-HaoranWei added the good first issue Good for newcomers label Nov 9, 2024

ep0p commented Nov 12, 2024

Would there be any way to make it handle the awkwardly spaced letters within words, since using 'format' sometimes does not recognize all the blocks of text on a page?
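In the meantime, I've been patching the spaced-letter runs with a small post-processing heuristic (my own workaround, not part of GOT): collapse any run of single characters separated by single spaces into one word.

```python
import re

def collapse_spaced_letters(text, min_run=3):
    # Heuristic workaround (not part of GOT): join runs of at least min_run
    # single characters separated by single spaces, e.g. "B o n j o u r"
    # becomes "Bonjour", while normal short words like "le" are left alone.
    pattern = re.compile(rf"(?<!\S)(?:\S ){{{min_run - 1},}}\S(?!\S)")
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)
```

It can over-merge genuine sequences of one-letter tokens, so raising `min_run` trades recall for safety.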

Also, since I cannot get access to WeChat, I have some questions for which maybe you can provide an answer :)

  1. When testing various document types, I’ve observed that some text blocks, particularly in documents with complex layouts, are occasionally missed or ignored by the OCR.
    Are there any workarounds or best practices to improve accuracy and ensure these blocks are detected?
    Would fine-tuning the model address or reduce this issue?

  2. What is the recommended VRAM capacity for fine-tuning the OCR model efficiently?
    Specifically, could you share the minimum or ideal VRAM needed to handle a reasonable batch size without significant performance delays?

  3. If I fine-tune the model using complex images and specify the desired reading order of the text, can the model learn and adapt to this sequence?
