Apache Tika can not parse Microsoft Docx format in native mode #6549

Closed
tpenakov opened this issue Jan 14, 2020 · 62 comments
Labels: area/tika, kind/bug (Something isn't working)

Comments

@tpenakov

A test project that reproduces the problem is available here.

Steps to reproduce:

  • create a native executable: ./mvnw package -Pnative
  • start the binary: ./target/otaibe-apache-tika-docx-native-1.0-SNAPSHOT-runner
  • call the service
    • Option 1 : curl -v -H "Content-Type: application/octet-stream" -X POST --data-binary @src/test/resources/test_bg.docx http://localhost:11025/parse
    • Option 2 : mvn package -D%test.service.http.port=11025
  • the binary then logs the following exception:
2020-01-14 14:43:40,589 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (executor-thread-1) HTTP Request to /parse failed, error id: 7eca2481-63eb-44e0-8c4c-4d57968f69ec-1: org.jboss.resteasy.spi.UnhandledException: org.apache.xerces.parsers.ObjectFactory$ConfigurationError: Provider org.apache.xerces.parsers.XIncludeAwareParserConfiguration not found
        at org.jboss.resteasy.core.ExceptionHandler.handleApplicationException(ExceptionHandler.java:106)
        at org.jboss.resteasy.core.ExceptionHandler.handleException(ExceptionHandler.java:372)
        at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:209)
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:496)
        at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:252)
        at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:153)
        at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:363)
        at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:156)
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:238)
        at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
        at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:120)
        at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.access$000(VertxRequestHandler.java:36)
        at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:85)
        at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
        at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2011)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1535)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1426)
        at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
        at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
        at java.lang.Thread.run(Thread.java:748)
        at org.jboss.threads.JBossThread.run(JBossThread.java:479)
        at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:460)
        at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:193)
Caused by: org.apache.xerces.parsers.ObjectFactory$ConfigurationError: Provider org.apache.xerces.parsers.XIncludeAwareParserConfiguration not found
        at org.apache.xerces.parsers.ObjectFactory.newInstance(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.<init>(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
        at org.apache.poi.ooxml.util.DocumentHelper.newDocumentBuilder(DocumentHelper.java:91)
        at org.apache.poi.ooxml.util.DocumentHelper.readDocument(DocumentHelper.java:165)
        at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.parseContentTypesFile(ContentTypeManager.java:392)
        at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.<init>(ContentTypeManager.java:104)
        at org.apache.poi.openxml4j.opc.internal.ZipContentTypeManager.<init>(ZipContentTypeManager.java:54)
        at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:258)
        at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:721)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:302)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:85)
        at io.quarkus.tika.TikaParser.getMetadata(TikaParser.java:68)
        at io.quarkus.tika.TikaParser.getMetadata(TikaParser.java:64)
        at org.otaibe.apache.tika.docx.nerror.TikaParserResource.getContentType(TikaParserResource.java:52)
        at org.otaibe.apache.tika.docx.nerror.TikaParserResource.hello(TikaParserResource.java:38)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:151)
        at org.jboss.resteasy.core.MethodInjectorImpl.lambda$invoke$3(MethodInjectorImpl.java:122)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:110)
        at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:122)
        at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:594)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:468)
        at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:421)
        at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:363)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:423)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:391)
        at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invoke$1(ResourceMethodInvoker.java:365)
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:110)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:365)
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:477)
        ... 19 more
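
For anyone who prefers to drive the reproduction from Java rather than curl, the request from Option 1 can be sketched with the JDK's java.net.http API (Java 11+). The class and method names below are hypothetical; the URL, path and Content-Type header are taken from the curl command above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of the test project.
final class ParseRequestSketch {

    /** Builds the same POST that the curl command in Option 1 sends. */
    static HttpRequest buildParseRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/parse"))
                .header("Content-Type", "application/octet-stream")
                // raw bytes, the equivalent of curl's --data-binary
                .POST(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

Sending it would look like HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), with the bytes read from src/test/resources/test_bg.docx.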
@tpenakov tpenakov added the kind/bug label Jan 14, 2020
@sberyozkin
Member

@tpenakov thanks, we've covered some cases, but since so many formats are supported, it is likely that not all code paths have been covered.

@sberyozkin sberyozkin self-assigned this Jan 14, 2020
@sberyozkin
Member

Adding a step which loads org.apache.xerces.xni.parser.XMLParserConfiguration provider resource in TikaProcessor should fix it.
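
As a rough illustration of what loading such a provider resource involves, here is a minimal sketch that extracts class names from the text of a provider-configuration file. It assumes the usual layout of one class name per line with # starting a comment. The class and method names are hypothetical, and this is not the TikaProcessor implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses the content of a provider-configuration file.
final class ProviderFileSketch {

    /** Returns the provider class names listed in the given file content. */
    static List<String> providerNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#'); // everything after '#' is a comment
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

In a native image the names returned this way would still have to be registered at build time, since the file is not discovered by classpath scanning at runtime.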

@tpenakov
Author

@sberyozkin thanks - Is it possible to do it on my project via configuration?

@sberyozkin
Member

@tpenakov maybe with the SubstrateVM configuration, something like -H:ReflectionConfigurationFiles=reflect-config.json, but I can't find an example where one would set it to include some extra META-INF/services resource.
@dmlloyd, @gsmet do you know if this is possible?

Just for the record, it is org.apache.tika.parser.microsoft.ooxml.OOXMLParser which is not working in native mode.

@tpenakov
Author

@sberyozkin - I've tried to dig into it myself and ended up with the configuration below, but the error is still there :( Only now a different configuration is missing.

Here is my reflection-config.json:

[
  {
    "name" : "org.apache.xerces.parsers.XIncludeAwareParserConfiguration",
    "allDeclaredConstructors" : true,
    "allPublicConstructors" : true,
    "allDeclaredMethods" : true,
    "allPublicMethods" : true,
    "allDeclaredFields" : true,
    "allPublicFields" : true
  },
  {
    "name" : "org.apache.xerces.impl.dv.ObjectFactory",
    "allDeclaredConstructors" : true,
    "allPublicConstructors" : true,
    "allDeclaredMethods" : true,
    "allPublicMethods" : true,
    "allDeclaredFields" : true,
    "allPublicFields" : true
  },
  {
    "name" : "org.apache.poi.xwpf.usermodel.XWPFStyles",
    "allDeclaredConstructors" : true,
    "allPublicConstructors" : true,
    "allDeclaredMethods" : true,
    "allPublicMethods" : true,
    "allDeclaredFields" : true,
    "allPublicFields" : true
  },
  {
    "name" : "org.apache.xerces.impl.dv.dtd.DTDDVFactoryImpl",
    "allDeclaredConstructors" : true,
    "allPublicConstructors" : true,
    "allDeclaredMethods" : true,
    "allPublicMethods" : true,
    "allDeclaredFields" : true,
    "allPublicFields" : true
  }
]

@tpenakov
Author

@sberyozkin - if needed I can add the reflection-config.json to the test project?

@sberyozkin
Member

@tpenakov thanks, I'll try to fix it at the processor level when I get to it

@gsmet
Member

gsmet commented Jan 14, 2020

Maybe @tpenakov would be interested in contributing?

@tpenakov
Author

Thank you @gsmet ,
I do not know how to do this.
Could you please send me some links/guides?
I can try to read them and then to make a decision...

@gsmet gsmet changed the title from "Apache Tika does not working in native mode" to "Apache Tika does not work in native mode" Jan 14, 2020
@geoand
Contributor

geoand commented Jan 14, 2020

@tpenakov You can find some information at: https://quarkus.io/guides/writing-native-applications-tips

The real hard-core information however can be found here: https://quarkus.io/guides/writing-extensions.

The people on the Quarkus team would be glad to help you out should you decide to take this on

@tpenakov
Author

@geoand , @gsmet , @sberyozkin - I can try to do it.

@gsmet
Member

gsmet commented Jan 15, 2020

@tpenakov let us know about your progress. If you don't get to it, maybe @irenakezic will be able to help.

@tpenakov
Author

tpenakov commented Jan 16, 2020

@gsmet and @sberyozkin ,
I have made some progress, but I am a little bit stuck. Now the exception is:

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@33956e0
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:85)
        ... 43 more
Caused by: org.apache.poi.ooxml.POIXMLException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart
        at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:66)
        at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:657)
        at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:180)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:137)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
        at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:224)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 46 more
Caused by: java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart
        at java.lang.Class.getConstructor0(DynamicHub.java:3082)
        at java.lang.Class.getDeclaredConstructor(DynamicHub.java:2178)
        at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:56)
        at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:63)
        ... 54 more

Adding this code to TikaProcessor does not fix it:

    @BuildStep
    ReflectiveClassBuildItem reflectionXWPFStyles() {
        //https://github.com/quarkusio/quarkus/issues/6549
        return new ReflectiveClassBuildItem(true, true, true, "org.apache.poi.xwpf.usermodel.XWPFStyles");
    }

    @BuildStep
    ReflectiveClassBuildItem reflectionPackagePart() {
        //https://github.com/quarkusio/quarkus/issues/6549
        return new ReflectiveClassBuildItem(true, true, "org.apache.poi.openxml4j.opc.PackagePart");
    }

    @BuildStep
    ReflectiveClassBuildItem reflectionZipPackagePart() {
        //https://github.com/quarkusio/quarkus/issues/6549
        return new ReflectiveClassBuildItem(true, true, "org.apache.poi.openxml4j.opc.ZipPackagePart");
    }

I will need more time to take a closer look. Maybe on Friday and during the weekend...
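
For context, the NoSuchMethodException in the trace above comes from a reflective constructor lookup via getDeclaredConstructor. The sketch below shows that pattern with hypothetical stand-in classes (Part and Settings are not POI types). On a regular JVM the lookup succeeds whenever the constructor exists; in a native image it also requires that the target class and its constructors were registered for reflection at build time. Note that the trace names XWPFSettings while the build steps above register XWPFStyles; whether that mismatch was the cause is not confirmed in the thread.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-ins illustrating the lookup pattern seen in the trace.
final class ReflectiveFactorySketch {

    /** Stand-in for a package part type. */
    static class Part { }

    /** Stand-in for a document part that is built from a Part. */
    static class Settings {
        final Part part;

        Settings(Part part) {
            this.part = part;
        }
    }

    /** Looks up the (Part) constructor reflectively and invokes it. */
    static <T> T create(Class<T> type, Part part) throws ReflectiveOperationException {
        // Throws NoSuchMethodException if no such constructor is visible to reflection.
        Constructor<T> ctor = type.getDeclaredConstructor(Part.class);
        ctor.setAccessible(true);
        return ctor.newInstance(part);
    }
}
```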

@sberyozkin
Member

sberyozkin commented Jan 16, 2020

@tpenakov Thanks for starting to look into it.
How did you register org.apache.xerces.xni.parser.XMLParserConfiguration ?

Does updating registerTikaProviders with something like

serviceProvider.produce(
                new ServiceProviderBuildItem("org.apache.xerces.xni.parser.XMLParserConfiguration",
                        getProviderNames("org.apache.xerces.xni.parser.XMLParserConfiguration")));

work?

Can you also please check that you don't have some competing dependencies, as the Caused by: java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart error may imply?

thanks

@tpenakov
Author

@sberyozkin - Thanks for the quick feedback.
About the service registration:
I've registered it in a similar way:

        serviceProvider.produce(
                new ServiceProviderBuildItem(XMLParserConfiguration.class.getName(),
                        Arrays.asList("org.apache.xerces.parsers.XIncludeAwareParserConfiguration")));

This is because of the content of the file at:

org/apache/xerces/parsers/org.apache.xerces.xni.parser.XMLParserConfiguration

Should I update the registration to do it your way?

About the competing dependencies:
You were right - there are competing dependencies.
One comes from kogito:

[INFO] +- org.kie.kogito:drools-decisiontables:jar:0.6.1:compile
[INFO] |  \- org.drools:drools-decisiontables:jar:7.29.0.Final:compile
[INFO] |     +- org.apache.poi:poi-ooxml:jar:3.17:compile
[INFO] |     |  +- org.apache.poi:poi-ooxml-schemas:jar:3.17:compile

And one comes from tika-parsers:

[INFO] +- org.apache.tika:tika-parsers:jar:1.22:compile
[INFO] |  +- org.apache.poi:poi-ooxml:jar:4.0.1:compile
[INFO] |  |  +- org.apache.poi:poi-ooxml-schemas:jar:4.0.1:compile
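
A conflict like this can also be spotted mechanically. The sketch below groups the versions seen per groupId:artifactId in dependency tree lines such as the ones above and keeps only artifacts that appear with more than one version. It is a hypothetical helper, assumes coordinates of the form groupId:artifactId:type:version:scope, and does not handle coordinates that carry a classifier.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scanning dependency tree output.
final class DependencyTreeConflicts {

    // Coordinate at the end of a tree line: groupId:artifactId:type:version:scope
    private static final Pattern COORD =
            Pattern.compile("([\\w.\\-]+):([\\w.\\-]+):[\\w\\-]+:([^:\\s]+):(\\w+)\\s*$");

    /** Collects every version seen for each groupId:artifactId. */
    static Map<String, Set<String>> versionsByArtifact(List<String> lines) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String line : lines) {
            Matcher m = COORD.matcher(line);
            if (m.find()) {
                String key = m.group(1) + ":" + m.group(2);
                versions.computeIfAbsent(key, k -> new TreeSet<>()).add(m.group(3));
            }
        }
        return versions;
    }

    /** Keeps only the artifacts that occur with two or more distinct versions. */
    static Map<String, Set<String>> conflicts(List<String> lines) {
        Map<String, Set<String>> result = new TreeMap<>(versionsByArtifact(lines));
        result.values().removeIf(v -> v.size() < 2);
        return result;
    }
}
```

Fed the two excerpts above, it would report poi-ooxml and poi-ooxml-schemas, each with versions 3.17 and 4.0.1.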

I've tried to exclude it from the tika runtime pom.xml file (extensions/tika/runtime/pom.xml) :

       <dependency>
           <groupId>org.apache.tika</groupId>
           <artifactId>tika-parsers</artifactId>
           <exclusions>
               <exclusion>
                   <groupId>org.slf4j</groupId>
                   <artifactId>jcl-over-slf4j</artifactId>
               </exclusion>
               <exclusion>
                   <groupId>javax.annotation</groupId>
                   <artifactId>javax.annotation-api</artifactId>
               </exclusion>
               <exclusion>
                   <groupId>jakarta.xml.bind</groupId>
                   <artifactId>jakarta.xml.bind-api</artifactId>
               </exclusion>
               <exclusion>
                   <groupId>org.apache.poi</groupId>
                   <artifactId>poi-ooxml</artifactId>
               </exclusion>
           </exclusions>
       </dependency>

But now org.apache.tika.parser.microsoft.ooxml.OOXMLParser is not working, because the class structure differs between poi-ooxml:3.17 and poi-ooxml:4.0.1.

I've also tried to exclude the poi-ooxml:3.17 dependency in the quarkus-bom pom.xml file:

            <dependency>
                <groupId>org.kie.kogito</groupId>
                <artifactId>drools-decisiontables</artifactId>
                <version>${kogito.version}</version>
                <exclusions>
                    <exclusion>
                        <groupId>org.apache.poi</groupId>
                        <artifactId>poi-ooxml</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>

Now the poi-ooxml:3.17 dependency is gone, however I am still getting the java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart. For this error I am not quite sure that my setup is correct.

Is it possible to arrange a quick call with you or with someone else from the team in order to set up my environment properly? This would help me a lot, because it would save me a lot of time and make me much more productive in the end.

It would be great if we could arrange such a call :)

@tpenakov
Author

tpenakov commented Jan 17, 2020

@sberyozkin - from my previous message, please ignore the part about dependencies and about the dedicated call in order to be more productive. @geoand helped me a lot with the 'productivity' setup.
@geoand - thank you for that!
About the java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart - it is fixed now.
The next challenge is:

Caused by: java.util.MissingResourceException: Resource bundle not found org.apache.xerces.impl.msg.SAXMessages. Register the resource bundle using the option -H:IncludeResourceBundles=org.apache.xerces.impl.msg.SAXMessages.
        at com.oracle.svm.core.jdk.LocalizationSupport.getCached(LocalizationSupport.java:66)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:63)
        at org.apache.xerces.util.SAXMessageFormatter.formatMessage(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.getProperty(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.setProperty(Unknown Source)
        at org.apache.xmlbeans.impl.common.SAXHelper.trySetXercesSecurityManager(SAXHelper.java:119)
        at org.apache.xmlbeans.impl.common.SAXHelper.newXMLReader(SAXHelper.java:49)
        at org.apache.xmlbeans.impl.store.Locale.getSaxLoader(Locale.java:3055)
        ... 57 more

@sberyozkin
Member

@tpenakov Nice, and indeed thanks to Georgios :-)

@sberyozkin
Member

sberyozkin commented Jan 17, 2020

@tpenakov this probably can be added with SubstrateResourceBuildItem

@tpenakov
Author

@sberyozkin ,
I've found this way:

    @BuildStep
    public void registerResourceBundles(BuildProducer<NativeImageResourceBundleBuildItem> resource) throws Exception {
        resource.produce(new NativeImageResourceBundleBuildItem("org.apache.xerces.impl.msg.SAXMessages"));
    }
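
The MissingResourceException discussed above is raised by a plain ResourceBundle.getBundle call. As a small sketch, a check like the following could be run against the bundle base names an application needs, to find out early which ones are not resolvable. The helper is hypothetical and only wraps the standard JDK lookup.

```java
import java.util.MissingResourceException;
import java.util.ResourceBundle;

// Hypothetical helper around the standard bundle lookup.
final class BundleCheckSketch {

    /** Returns true if a bundle with the given base name can be loaded. */
    static boolean bundleAvailable(String baseName) {
        try {
            ResourceBundle.getBundle(baseName);
            return true;
        } catch (MissingResourceException e) {
            // Same exception type as in the trace above.
            return false;
        }
    }
}
```

Calling bundleAvailable("org.apache.xerces.impl.msg.SAXMessages") in the native binary should then reflect whether the bundle was registered at build time.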

@sberyozkin
Member

@tpenakov by the way, if you add an ooxml shortcut and use it in the configuration then it will save a ton of MBs in the native image size :-)

@sberyozkin
Member

@tpenakov Super, I'm learning with you along the way :-)

@tpenakov
Author

@sberyozkin - I've added it as a docx shortcut, but you are right that ooxml is much more correct :)

@sberyozkin
Member

@tpenakov yes, I just saw Tim (Tika lead) referring to the whole family as ooxml in one of the issues.
Re NativeImageResourceBundleBuildItem vs SubstrateResourceBuildItem, I was looking at the old version of TikaProcessor and forgot the latter was renamed :-)

@sberyozkin
Member

sberyozkin commented Jan 17, 2020

@tpenakov one other thing: you may want to import https://github.com/sberyozkin/quarkus/blob/master/extensions/tika/deployment/src/main/java/io/quarkus/tika/deployment/TikaParsersConfigBuildItem.java in all the step functions dealing with OOXML and check that the list value returned from its map (the key is a parser name) is not null. If it is null, the user has set some shortcuts that do not involve OOXML, and in that case whatever the OOXML step does can be skipped. The same goes for PDF-related resources: at the moment they are likely adding to the native image size even if you don't want to read PDF. It can be optimized later though.
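
The guard described above boils down to a null check on a map lookup. Here is a minimal sketch, assuming the configuration is available as a map from parser class name to a list of values; the class, method and constant names are hypothetical and this is not the TikaParsersConfigBuildItem API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical guard for parser-specific build steps.
final class ParserStepGuardSketch {

    static final String OOXML_PARSER = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";

    /** Returns true if the given parser has an entry in the configuration map. */
    static boolean parserConfigured(Map<String, List<String>> parsersConfig, String parserName) {
        // A null value means the user's shortcuts do not involve this parser,
        // so the corresponding registrations can be skipped.
        return parsersConfig.get(parserName) != null;
    }
}
```

An OOXML build step would return early when parserConfigured(config, OOXML_PARSER) is false, and the same check with the PDF parser's class name would apply to the PDF-related steps.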

@tpenakov
Author

@sberyozkin - ok about the OOXML.
About the PDF - do you want me to do it in the same task, or should I open a separate one for it?

@sberyozkin
Member

sberyozkin commented Jan 17, 2020

@tpenakov please do it for PDF as well because it would make a diff for your case :-)

@tpenakov
Author

Hi @sberyozkin ,

I've managed to get it working for the xlsx and pptx file types, but now I have 3 serious problems. The first two popped up after I rebased my Quarkus fork last week. IMO they are not related to the current task and I will definitely need help in order to resolve them (a quick and dirty solution is applied for the moment :) ). Here are the problems together with some explanations:

  1. Added the xalan dependency to extensions/arc/runtime/pom.xml. I am certain that this is not the right place, but if I remove it the native tests fail. The error is described in my previous posts. This is just a hack in order to stay focused on the current task.
  2. Added --report-unsupported-elements-at-runtime to integration-tests/tika/src/main/resources/application.properties. The error is shown in my previous post. This is just a hack in order to stay focused on the current task.
  3. Trying to register the supported classes for all 3 file types (docx, xlsx and pptx) for the OOXML parser leads to an OutOfMemoryError on my PC. Could you please advise me how to proceed with that?

The code is published on the same fork

Need some help here.
Thanks in advance!

@sberyozkin
Member

sberyozkin commented Jan 27, 2020

@tpenakov Hi, sorry for the delay, and thanks for continuing to spend time on the issue, it is really appreciated. I'm subscribed but I did not get a single notification... In fact I'm not getting notifications at all, which is strange...
Well, how about going ahead with a new clean branch against the latest master and starting with a PR supporting only the Docx format, based on the work you showed me last week, just to move forward step by step, as it appears every new format in the OOXML family brings new issues?
What do you think?
Cheers

@tpenakov
Author

@sberyozkin - no problem about the clean start with docx only.
I am almost certain that problems 1 and 2 from my previous post will be present there too.
I will let you know when I reach that point.

@sberyozkin
Member

sberyozkin commented Jan 28, 2020

@tpenakov Yes, sounds good, let's get docx only working for the moment, I'm sure we will make it work :-). But please wait till #6752 is merged.

@sberyozkin
Member

@tpenakov Hi, when you get time please start from a clean master, #6752 has been merged now, so it might also help with avoiding a few of the issues you've seen recently. As agreed let's do DOCx first, thanks

@tpenakov
Author

Thank you @sberyozkin
Will let you know about the progress.

@tpenakov
Author

tpenakov commented Feb 5, 2020

Hi @sberyozkin ,
I was thinking about the problem where the number of classes registered for docx, xlsx and pptx for native compilation becomes too big and the result is an OutOfMemoryError.
What if we create a separate apache-tika extension per ooxml format? That way we would have apache-tika-ooxml-docx, apache-tika-ooxml-xlsx and apache-tika-ooxml-pptx extensions.
What do you think - is there a chance this would solve the OutOfMemoryError?

@sberyozkin
Member

Hi @tpenakov, OOM won't happen just because the native image is too big. Besides, with the parser configuration optimizations the tika extension will have a much slimmer native image, for example for PDF only, for DOCx only, etc.
Thanks

@sberyozkin sberyozkin changed the title from "Apache Tika does not work in native mode" to "Apache Tika can not parse OOXML formats in native mode" Feb 9, 2020
@sberyozkin sberyozkin changed the title from "Apache Tika can not parse OOXML formats in native mode" to "Apache Tika can not parse Microsoft Docx format in native mode" Feb 9, 2020
@sberyozkin
Member

sberyozkin commented Feb 9, 2020

@tpenakov Hi, I've renamed this issue to keep it focused on the specific problem you reported with the Docx format. I will create a follow-up issue to check other OOXML formats in native mode. Thanks

@tpenakov
Author

tpenakov commented Feb 9, 2020

Hi @sberyozkin ,
Yep - that seems reasonable.
I wasn't able to work on this one this week, but I will hopefully finish it next week.

@sberyozkin
Member

@tpenakov Hi, no problems, happy you are still OK with looking at this issue :-)

@tpenakov
Author

Hi @sberyozkin ,
PR is created: #7198
However, there are a few things to point out:

  • the xalan dependency is included - probably in the wrong place: extensions/arc/runtime/pom.xml
  • an additional config property is added for the supported file types - we have support for the pptx and xlsx file types as well, however if we include all of them the native build ends with an OOM error.
  • Trying to use @ConfigProperty in io.quarkus.it.tika.TikaEmbeddedContentTest leads to an NPE for the native build. Issue #2061 (@ConfigProperty without @Inject does not work in test) claims that this is fixed, but I am still seeing it.

@sberyozkin
Member

Hi @tpenakov As noted in the PR, it is appreciated that you've spent so much time on this issue :-), I'll try to help now as well. By the way, please also watch #7171, which, if implemented, may help you more. Though as far as this extension is concerned the POI issues will have to be fixed anyway.
I'll keep you up to date once I get to testing your PR, cheers

@tpenakov
Author

Thank you @sberyozkin - I am watching #7171 already. I also suggested that Apache POI together with XMLBeans become a separate extension.
BTW - the bigger part of Apache POI inclusion is done in this task...

tpenakov added a commit to tpenakov/quarkus that referenced this issue Feb 18, 2020
…ative mode - move reflections maven artifact under the tika-deployment module
@slpereira

Hi guys, is there a date when this will be fixed? I am planning to use Apache Tika with Quarkus in a microservice environment, and this bug is preventing the deployment of our stack.

@sberyozkin
Member

sberyozkin commented Oct 4, 2020

https://github.com/apache/poi/blob/trunk/src/java/org/apache/poi/poifs/nio/CleanerUtil.java#L180 has to be addressed. I had to add --report-unsupported-elements-at-runtime to bypass the problem in order to upgrade to Tika 1.24.1 - which is ok-ish since POI does not work yet in native mode. See also oracle/graal#2761.

Update: a cleaner workaround is in place now thanks to @Sanne providing a CleanerUtil substitution.

@sberyozkin
Member

sberyozkin commented Oct 4, 2020

@slpereira I don't have enough time to prioritize Tika issues; however, slowly but surely some issues are being addressed. I'll pick up this issue during the next round when I start looking at Tika issues. Thanks

@mzellho
Contributor

mzellho commented Aug 31, 2021

Hello everybody, any news on this issue? I just ran into it using Quarkus 2.1.2.Final in native mode as well, non-native is working nicely. I could provide stack traces and such if needed ... thank you!

@geoand
Contributor

geoand commented Apr 11, 2023

I am going to close this as the Tika extension has been moved to the Quarkiverse

@geoand geoand closed this as not planned Apr 11, 2023