Images of documents get extracted as document text #751

Bain-OS · 2025-01-15T04:17:32Z

Bug

Figures/images that contain text are incorrectly broken up into separate text elements instead of being kept as a single image.

Steps to reproduce

For example, here is some debug output obtained by running: docling --debug-visualize-cells --debug-visualize-layout https://arxiv.org/pdf/2408.09869

Notice that the figure is an image of a document. Docling detects the text within the image as document text. I would have expected just two elements on this page: the image and its description.

Is there a way to achieve this through configuration?
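
To make the expected behaviour concrete, here is a minimal post-processing sketch. It does not use Docling's API: the Box and Element classes, the label names, and the 0.9 overlap threshold are all hypothetical stand-ins for whatever layout elements a converter returns. The idea is simply to drop any text element whose bounding box lies almost entirely inside a picture's bounding box, while keeping the picture and its caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with a top-left origin (hypothetical type)."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Box") -> float:
        width = min(self.r, other.r) - max(self.l, other.l)
        height = min(self.b, other.b) - max(self.t, other.t)
        return max(0.0, width) * max(0.0, height)


@dataclass(frozen=True)
class Element:
    """One detected layout element on a page (hypothetical type)."""
    label: str  # assumed labels: "picture", "caption", "text"
    box: Box
    text: str = ""


def drop_text_inside_pictures(elements, min_overlap=0.9):
    """Return elements without "text" items that sit inside a picture.

    A text element is dropped when at least `min_overlap` of its own
    area is covered by a single picture box. Captions and pictures are
    always kept, as are text elements with zero area.
    """
    pictures = [e.box for e in elements if e.label == "picture"]
    kept = []
    for e in elements:
        if e.label == "text" and e.box.area() > 0:
            covered = any(
                e.box.intersection_area(p) / e.box.area() >= min_overlap
                for p in pictures
            )
            if covered:
                continue
        kept.append(e)
    return kept


# Illustrative page: one figure, two text boxes inside it, one caption.
page = [
    Element("picture", Box(100, 100, 500, 600)),
    Element("text", Box(120, 150, 480, 170), "text inside the figure"),
    Element("text", Box(120, 200, 480, 400), "more text inside the figure"),
    Element("caption", Box(100, 610, 500, 640), "figure description"),
]
filtered = drop_text_inside_pictures(page)
print([e.label for e in filtered])  # ['picture', 'caption']
```

This only filters the output after the fact. Whether Docling exposes a configuration option that does the same thing is exactly the open question above.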

Docling version

> docling --version
Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0

Python version

Python 3.11.10


wcool1 · 2025-01-16T09:49:16Z

https://github.com/opendatalab/MinerU

Bain-OS added the bug label on Jan 15, 2025
