-
It's a bug that should be fixed in 2.8. See #1271 (comment).
-
Hello,
I use tesseract to index text in images.
I use fscrawler-es7-2.7 on Debian.
I put jai-imageio-jpeg2000-1.4.0.jar in lib.
When I try to run my indexing job, I get this error:
13:03:22,034 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [auto] and tesseract was found.
13:03:23,402 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
and I can see in Elasticsearch that my image is not indexed.
Second part: I use the default PDF strategy,
but I get a lot of stuck tesseract processes:
21995 root 20 0 150696 88920 12108 R 60.6 0.5 0:24.99 tesseract
22042 root 20 0 124344 62444 11932 R 53.6 0.4 0:05.24 tesseract
22027 root 20 0 122868 60280 11484 R 52.6 0.4 0:18.10 tesseract
21980 root 20 0 137740 75304 12044 R 52.3 0.5 0:39.46 tesseract
22005 root 20 0 125392 62964 11604 R 52.3 0.4 0:28.97 tesseract
21984 root 20 0 136204 73912 11612 R 48.7 0.5 0:38.29 tesseract
22001 root 20 0 172164 108144 11524 R 48.0 0.7 0:25.79 tesseract
and my logs are stuck at:
`13:11:50,462 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [auto] and tesseract was found.
13:11:52,525 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
13:11:53,633 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
13:11:53,657 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
`
Is there a way to make tesseract lighter, maybe?