Bump tika-parsers from 1.24.1 to 1.26 #16410

Merged
gsmet merged 1 commit into main from dependabot/maven/org.apache.tika-tika-parsers-1.26
Apr 10, 2021

Conversation

dependabot[bot]
Contributor

@dependabot dependabot bot commented on behalf of github Apr 9, 2021

Bumps tika-parsers from 1.24.1 to 1.26.

Changelog

Sourced from tika-parsers's changelog.

Release 2.0.0-ALPHA - 01/13/2021

BREAKING CHANGES in 2.0.0

  • General

    • OCR is now triggered automatically for PDFs if tesseract is on the user's path; see https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr for how to disable OCR.
    • Removed deprecated Metadata keys/properties (TIKA-1974).
    • Removed dangerous calls to read an inputstream or convert to bytes without specifying a charset
    • Parsers can be configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser.
    • For those parsers that can be configured per parse via a config object passed in through the ParseContext, the config object will only update those fields that the user has modified. The config object will no longer fully reset all settings to the defaults on each parse. This gives the more intuitive behavior of updating the base/configured settings with only what has been changed in the config object.
  • tika-parsers

    • The parser modules have been broken into three main modules: tika-parsers-classic, tika-parsers-extended and tika-parsers-advanced. Users may now need to add tika-parsers-extended to tika-app and tika-server to include parsers that used to be included by default (for example: envi, gdal, grib, isatab, netcdf).
    • ChmParser was moved to org.apache.tika.parser.microsoft.chm
    • RTFParser was moved to org.apache.tika.parser.microsoft.rtf
  • tika-app

  • tika-server

    • tika-server now by default forks a process to isolate the parsing in the forked process (this was called the -spawnChild option in tika-1.x). Clients must now expect that tika-server will restart on OOM, timeouts, crashes or after parsing a large number of files. When this happens tika-server will restart and not receive connections for brief periods. The less robust, legacy behavior of not forking a process is available with "-noFork".

    • tika-server's /metadata endpoint requires tika-server-classic to write XMP/rdf output. This output is not available in tika-server-core.

Release 1.26 - ??/??/????

  • The "writeLimit" header now pertains to the combined characters written per container document (and embedded documents) in the /rmeta endpoint in tika-server (TIKA-3325); it no longer functions only

... (truncated)

Commits
  • 2e83fd4 [maven-release-plugin] prepare release 1.26-rc1
  • 1842758 fix rat and imports for 1.26 release
  • 8c21fba Update CHANGES.txt for 1.26 release
  • da05576 TIKA-3334 -- fix thread safety bug in handling embedded docs in open office p...
  • 2b8c9a3 TIKA-3336 -- new zip bombs detect in 1.26-SNAPSHOT compared with 1.25 -- bug,...
  • b1e8641 TIKA-3335 -- handle bad xml more robustly when checking for encryption
  • b29cce5 TIKA-3244 -- general upgrades for 1.26
  • b63072c Merge remote-tracking branch 'origin/branch_1x' into branch_1x
  • 8bf65c0 TIKA-3332 -- recursively process the embedded file tree in PDFs.
  • 6a27f3e TIKA-3244: update spring
  • Additional commits viewable in compare view

@dependabot dependabot bot added the area/dependencies label Apr 9, 2021
@quarkus-bot

quarkus-bot bot commented Apr 9, 2021

This workflow status is outdated as a new workflow run has been triggered.

Failing Jobs - Building 2695022

Name | Step | Test failures | Logs | Raw logs
Initial JDK 11 Build | Build | ⚠️ Check → | Logs | Raw logs

@gastaldi gastaldi requested a review from sberyozkin April 10, 2021 01:06
@gastaldi gastaldi force-pushed the dependabot/maven/org.apache.tika-tika-parsers-1.26 branch 2 times, most recently from 71043ea to 5710642 Compare April 10, 2021 01:21
@quarkus-bot

quarkus-bot bot commented Apr 10, 2021

This workflow status is outdated as a new workflow run has been triggered.

Failing Jobs - Building 71043ea

Name | Step | Test failures | Logs | Raw logs
Initial JDK 11 Build | Build | ⚠️ Check → | Logs | Raw logs

@gastaldi
Contributor

This is the error I see in the integration tests:

2021-04-09 22:24:04,596 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (executor-thread-1) HTTP Request to /parse/text failed, error id: 87ede0df-0c73-4273-a651-8993ba13df24-1: org.jboss.resteasy.spi.UnhandledException: io.quarkus.tika.TikaParseException: Unable to parse the stream
	at org.jboss.resteasy.core.ExceptionHandler.handleApplicationException(ExceptionHandler.java:106)
	at org.jboss.resteasy.core.ExceptionHandler.handleException(ExceptionHandler.java:372)
	at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:218)
	at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:519)
	at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
	at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
	at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
	at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
	at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
	at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
	at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:138)
	at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.access$000(VertxRequestHandler.java:41)
	at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:93)
	at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2415)
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1452)
	at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
	at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
	at java.lang.Thread.run(Thread.java:834)
	at org.jboss.threads.JBossThread.run(JBossThread.java:501)
	at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:519)
	at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:192)
Caused by: io.quarkus.tika.TikaParseException: Unable to parse the stream
	at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:114)
	at io.quarkus.tika.TikaParser.parse(TikaParser.java:44)
	at io.quarkus.tika.TikaParser.parse(TikaParser.java:40)
	at io.quarkus.tika.TikaParser.parse(TikaParser.java:32)
	at io.quarkus.it.tika.TikaParserResource.extractText(TikaParserResource.java:29)
	at java.lang.reflect.Method.invoke(Method.java:566)
	at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
	at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
	at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:646)
	at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:510)
	at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:460)
	at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
	at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:462)
	at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:420)
	at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:394)
	at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:69)
	at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
	... 17 more
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@18e37c7
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
	at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:85)
	... 33 more
Caused by: java.lang.IllegalArgumentException: java.io.IOException: resource '/org/apache/pdfbox/resources/afm/Helvetica.afm' not found
	at org.apache.pdfbox.pdmodel.font.Standard14Fonts.getAFM(Standard14Fonts.java:189)
	at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:113)
	at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:180)
	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
	at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
	at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
	at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125)
	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:986)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:177)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 35 more
Caused by: java.io.IOException: resource '/org/apache/pdfbox/resources/afm/Helvetica.afm' not found
	at org.apache.pdfbox.pdmodel.font.Standard14Fonts.loadMetrics(Standard14Fonts.java:120)
	at org.apache.pdfbox.pdmodel.font.Standard14Fonts.getAFM(Standard14Fonts.java:185)
	... 52 more
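
The innermost cause in the trace above is a failed classpath lookup: PDFBox asks for /org/apache/pdfbox/resources/afm/Helvetica.afm and gets nothing back, which is the usual symptom of a resource that was not included in a native image. A minimal, stdlib-only sketch of that kind of lookup follows. The class name ClasspathResourceProbe and the method isPresent are hypothetical helpers for illustration, not part of Quarkus, Tika, or PDFBox, and the sketch assumes the resource is resolved through Class.getResourceAsStream.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of Quarkus, Tika, or PDFBox): checks whether
// an absolute classpath resource can be opened from the current runtime.
final class ClasspathResourceProbe {

    // Returns true if the resource at the given absolute path (leading '/')
    // is visible to the class loader that loaded this class.
    static boolean isPresent(String absolutePath) {
        try (InputStream in = ClasspathResourceProbe.class.getResourceAsStream(absolutePath)) {
            // getResourceAsStream returns null when the resource cannot be found.
            return in != null;
        } catch (IOException e) {
            // Only closing the stream can throw here; treat that as "not usable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the path reported in the stack trace; override via args[0].
        String path = args.length > 0
                ? args[0]
                : "/org/apache/pdfbox/resources/afm/Helvetica.afm";
        System.out.println(path + " present: " + isPresent(path));
    }
}
```

In JVM mode with PDFBox on the classpath, this should report the resource as present. The same lookup in a native executable would be expected to fail unless the resource directory has been registered for inclusion in the image.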

@quarkus-bot

quarkus-bot bot commented Apr 10, 2021

This workflow status is outdated as a new workflow run has been triggered.

Failing Jobs - Building 5710642

Name | Step | Test failures | Logs | Raw logs
JVM Tests - JDK 11 | Build | Test failures | Logs | Raw logs
JVM Tests - JDK 15 | Build | Test failures | Logs | Raw logs
JVM Tests - JDK 8 | Build | Test failures | Logs | Raw logs
Native Tests - Misc2 | Build | Test failures | Logs | Raw logs
Native Tests - Windows - hibernate-validator | Build | ⚠️ Check → | Logs | Raw logs

Test Failures

⚙️ JVM Tests - JDK 11

📦 integration-tests/kafka

io.quarkus.it.kafka.SaslKafkaConsumerTest line 48
io.quarkus.it.kafka.SslKafkaConsumerTest line 56

⚙️ JVM Tests - JDK 15

📦 integration-tests/kafka

io.quarkus.it.kafka.SaslKafkaConsumerTest line 48
io.quarkus.it.kafka.SslKafkaConsumerTest line 56

⚙️ JVM Tests - JDK 8

📦 integration-tests/kafka

io.quarkus.it.kafka.SaslKafkaConsumerTest line 48
io.quarkus.it.kafka.SslKafkaConsumerTest line 56

⚙️ Native Tests - Misc2

📦 integration-tests/tika

io.quarkus.it.tika.NativeTikaParserIT

@sberyozkin
Member

Hey @gastaldi Thanks for looking into it, I think adding

resource.produce(new NativeImageResourceDirectoryBuildItem("org/apache/pdfbox/resources/afm"));

should fix it - I can take care of it a bit later - as I'm struggling with Yubico setup :-)

@gastaldi gastaldi force-pushed the dependabot/maven/org.apache.tika-tika-parsers-1.26 branch from 5710642 to d1fc1a6 Compare April 10, 2021 14:03
@gastaldi
Contributor

@sberyozkin that seems to do the trick, thanks! Pushed

@sberyozkin
Member

@gastaldi Thanks George :-)

@gsmet gsmet merged commit 6f0a58a into main Apr 10, 2021
@quarkus-bot quarkus-bot bot added this to the 2.0 - main milestone Apr 10, 2021
@dependabot dependabot bot deleted the dependabot/maven/org.apache.tika-tika-parsers-1.26 branch April 10, 2021 19:22
Labels: area/dependencies (Pull requests that update a dependency file), area/tika