Cannot index JPEG 2000 #1306

sarce666 · 2021-11-11T13:30:09Z

Hello,

I use Tesseract to index text in images.
I use fscrawler-es7-2.7 on Debian.
I put jai-imageio-jpeg2000-1.4.0.jar in lib.
When I try to run my indexing, I get this error:
13:03:22,034 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [auto] and tesseract was found.
13:03:23,402 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

And I can see in Elasticsearch that my image is not indexed.
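
For anyone hitting the same warning, one way to narrow it down is to check whether the JVM can see a JPEG 2000 reader at all when the jar is on the classpath. The following is only a minimal sketch, not part of FSCrawler: the class name `Jpeg2000ReaderCheck` is made up for illustration, and the format name `jpeg2000` is an assumption about how the jai-imageio plugin registers itself with `ImageIO`.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.Iterator;

// Hypothetical helper, not part of FSCrawler, Tika or PDFBox.
public class Jpeg2000ReaderCheck {

    /** Returns true if ImageIO has at least one reader registered for the given format name. */
    public static boolean hasReader(String formatName) {
        // Re-scan the classpath so that plugin jars are picked up.
        ImageIO.scanForPlugins();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // "jpeg2000" is an assumed format name for the jai-imageio JPEG 2000 plugin.
        System.out.println("JPEG 2000 reader available: " + hasReader("jpeg2000"));
    }
}
```

If this prints `false` when run with the same `lib` directory on the classpath, the jar is probably not being loaded, which would be consistent with the warning in the log.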

Second part: I use the default PDF OCR strategy,
but I get a lot of stuck tesseract processes:
21995 root 20 0 150696  88920 12108 R 60.6 0.5 0:24.99 tesseract
22042 root 20 0 124344  62444 11932 R 53.6 0.4 0:05.24 tesseract
22027 root 20 0 122868  60280 11484 R 52.6 0.4 0:18.10 tesseract
21980 root 20 0 137740  75304 12044 R 52.3 0.5 0:39.46 tesseract
22005 root 20 0 125392  62964 11604 R 52.3 0.4 0:28.97 tesseract
21984 root 20 0 136204  73912 11612 R 48.7 0.5 0:38.29 tesseract
22001 root 20 0 172164 108144 11524 R 48.0 0.7 0:25.79 tesseract
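
To tell whether such processes are really stuck or just slow, it can help to sample their accumulated CPU time twice and compare. Below is a rough sketch using the standard `ProcessHandle` API (Java 9+). The class name `TesseractProcesses` is hypothetical, and matching on the command path ending in `tesseract` is an assumption about how the binary is installed.

```java
import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of FSCrawler.
public class TesseractProcesses {

    /** Lists live processes whose executable path is, or ends with, the given command name. */
    public static List<ProcessHandle> find(String commandName) {
        return ProcessHandle.allProcesses()
                .filter(ProcessHandle::isAlive)
                .filter(p -> p.info().command()
                        .map(c -> c.equals(commandName) || c.endsWith("/" + commandName))
                        .orElse(false))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (ProcessHandle p : find("tesseract")) {
            // CPU time may be unavailable on some platforms; fall back to zero.
            Duration cpu = p.info().totalCpuDuration().orElse(Duration.ZERO);
            System.out.println(p.pid() + " cpu=" + cpu.toSeconds() + "s");
        }
    }
}
```

Running it a few seconds apart shows whether the CPU time of each PID keeps growing (the process is still working) or stays flat (it is likely blocked).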

And my logs stay stuck at:

`13:11:50,462 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [auto] and tesseract was found.
13:11:52,525 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
13:11:53,633 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
13:11:53,657 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
`

Is there a way to make tesseract lighter, maybe?

dadoonet (Maintainer) · 2021-11-11T21:28:53Z

It's a bug which should be fixed with 2.8.
You could try 2.8-SNAPSHOT in the meantime.

See #1271 (comment).
