Images of documents get extracted as document text #751

Open
Bain-OS opened this issue Jan 15, 2025 · 1 comment
Labels
bug Something isn't working

Comments

Bain-OS commented Jan 15, 2025

Bug

Figures/images that contain text are incorrectly broken up into separate text elements instead of being kept as a single image.

Steps to reproduce

For example, here is some debug output obtained by running docling --debug-visualize-cells --debug-visualize-layout https://arxiv.org/pdf/2408.09869

image

Notice that the figure is an image of a document. Docling detects the text within the image as document text; I would have expected just two elements on this page: the image and its description.

Is there a means to achieve this with configuration?
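
To make the expected behaviour concrete, here is a minimal post-processing sketch, independent of any Docling configuration. It assumes the layout output has already been reduced to plain bounding boxes per page; the `Box` class, `overlap_fraction`, `drop_text_inside_pictures`, and the 0.9 threshold are all hypothetical names and values chosen for illustration, not part of Docling's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box on one page (top-left origin assumed)."""
    page: int
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer` (0.0 if on different pages)."""
    if inner.page != outer.page or inner.area() == 0.0:
        return 0.0
    width = min(inner.right, outer.right) - max(inner.left, outer.left)
    height = min(inner.bottom, outer.bottom) - max(inner.top, outer.top)
    if width <= 0 or height <= 0:
        return 0.0
    return (width * height) / inner.area()


def drop_text_inside_pictures(text_boxes, picture_boxes, threshold=0.9):
    """Keep only text boxes that are not (almost) fully covered by a picture box."""
    return [
        text
        for text in text_boxes
        if not any(overlap_fraction(text, pic) >= threshold for pic in picture_boxes)
    ]


# Illustrative data: one figure, one text line inside it, one caption below it.
picture = Box(page=1, left=100, top=100, right=400, bottom=500)
text_in_figure = Box(page=1, left=120, top=150, right=380, bottom=170)
caption = Box(page=1, left=100, top=510, right=400, bottom=530)

kept = drop_text_inside_pictures([text_in_figure, caption], [picture])
```

With this filter, only the caption survives next to the figure, which matches the two elements I would expect on the page. Whether the same effect is available through a built-in option is exactly the open question.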

Docling version

> docling --version
Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0

Python version

Python 3.11.10
Bain-OS added the bug label Jan 15, 2025

wcool1 commented Jan 16, 2025

2 participants