Apache Tika can not parse Microsoft Docx format in native mode #6549
Comments
@tpenakov thanks, we've covered some cases, but since so many formats are supported, it is likely that not all code paths have been covered |
Adding a step which loads |
@sberyozkin thanks - Is it possible to do it on my project via configuration? |
@tpenakov maybe with the SubstrateVM configuration, something like just for my record, it is |
@sberyozkin - I've tried to dig into it myself and ended up with this configuration (below), but the error is still there :( Just a different configuration is missing. Here is my reflection-config.json
|
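For readers following along: a GraalVM (SubstrateVM) reflection configuration file is a JSON array with one entry per class that is accessed reflectively. The sketch below only illustrates that file shape by generating it from a list of class names. `com.example.FirstReflectiveClass` and `com.example.SecondReflectiveClass` are hypothetical placeholders, not the classes that were actually missing in this issue, and the choice of flags (`allDeclaredConstructors`, `allPublicMethods`) is an assumption rather than the configuration posted above.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ReflectionConfigSketch {

    // Renders one entry of the reflection configuration array.
    // The two flags are an illustrative choice; a real project may need others.
    static String entry(String className) {
        return "  {\n"
             + "    \"name\": \"" + className + "\",\n"
             + "    \"allDeclaredConstructors\": true,\n"
             + "    \"allPublicMethods\": true\n"
             + "  }";
    }

    // Joins all entries into a single JSON array.
    static String render(List<String> classNames) {
        return classNames.stream()
                .map(ReflectionConfigSketch::entry)
                .collect(Collectors.joining(",\n", "[\n", "\n]"));
    }

    public static void main(String[] args) {
        // Hypothetical class names, used only to show the output format.
        List<String> classNames = List.of(
                "com.example.FirstReflectiveClass",
                "com.example.SecondReflectiveClass");
        System.out.println(render(classNames));
    }
}
```

The printed JSON would be saved as `reflection-config.json` and handed to the native image build; the native applications tips guide linked later in this thread should describe how to wire that into a Quarkus project.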
@sberyozkin - if needed I can add the reflection-config.json to the test project? |
@tpenakov thanks, I'll try to fix it at the processor level when I get to it |
Maybe @tpenakov would be interested in contributing? |
Thank you @gsmet , |
@tpenakov You can find some information at: https://quarkus.io/guides/writing-native-applications-tips The real hard-core information however can be found here: https://quarkus.io/guides/writing-extensions. The people on the Quarkus team would be glad to help you out should you decide to take this on |
@geoand , @gsmet , @sberyozkin - I can try to do it. |
@tpenakov let us know about your progress. If you don't get to it, maybe @irenakezic will be able to help. |
@gsmet and @sberyozkin ,
Adding this code to TikaProcessor does not fix it:
I will need more time to take a closer look. Maybe on Friday and during the weekend... |
@tpenakov Thanks for starting to look into it. Does updating
work? Can you also please check that you don't have some competing dependencies as thanks |
@sberyozkin - Thanks for the quick feedback.
this is because of the file content on:
Should I update the registration your way? About the competing dependencies:
And one used from tika parsers:
I've tried to exclude it from the tika runtime pom.xml file (
But now the Also I've tried to exclude the
Now the Is it possible to arrange a quick call with you or with someone else from the team in order to set up my environment properly? This would help me a lot, because that way I would save a lot of time and be much more productive in the end. It would be great if we could arrange such a call :) |
@sberyozkin - from my previous message, please ignore the part about dependencies and about the dedicated call in order to be more productive. @geoand helped me a lot with the 'productivity' setup.
|
@tpenakov Nice, and indeed thanks to Georgios :-) |
@tpenakov this probably can be added with |
@sberyozkin ,
|
@tpenakov by the way, if you add an |
@tpenakov Super, I'm learning with you along the way :-) |
@sberyozkin - I've added it as |
@tpenakov yes, I just saw Tim (Tika lead) referring to the whole family as |
@tpenakov one other thing, you may want to import https://github.com/sberyozkin/quarkus/blob/master/extensions/tika/deployment/src/main/java/io/quarkus/tika/deployment/TikaParsersConfigBuildItem.java in all the step functions in dealing with OOXML and check if the list value returned from a map (the key is a parser name) is not |
@sberyozkin - ok about the OOXML. |
@tpenakov please do it for PDF as well because it would make a diff for your case :-) |
Hi @sberyozkin , I've managed to get it working for xlsx and pptx file types, but now I have 3 serious problems. The first two popped up after the code rebase in my Quarkus fork last week. IMO they are not related to the current task and I will definitely need help in order to resolve them (a quick and dirty solution is applied for the moment :) ). Here are the problems together with some explanations:
The code is published on the same fork. I need some help here. |
@tpenakov Hi, sorry for the delay, and thanks for continuing to spend time on the issue, it is really appreciated. I'm subscribed but I did not get a single notification... In fact I'm actually not getting the notifications at all, this is strange... |
@sberyozkin - no problem about the clean start with docx only. |
Thank you @sberyozkin |
Hi @sberyozkin , |
Hi @tpenakov, OOM won't happen just because the native image is too big. Besides, with the parser configuration optimizations the Tika extension will have a much slimmer native image, for example for PDF only, for DOCX only, etc. |
@tpenakov Hi, I've renamed this issue to focus it on the specific problem you reported with the Docx format. I will create a follow-up issue to check other OOXML formats in native mode. Thanks |
Hi @sberyozkin , |
@tpenakov Hi, no problems, happy you are still OK with looking at this issue :-) |
Hi @sberyozkin ,
|
Hi @tpenakov As noted in the PR, it is appreciated that you've spent so much time on this issue :-), I'll try to help now as well. By the way, please also watch #7171, which, if implemented, may help you more. Though as far as this extension is concerned, the POI issues will have to be fixed anyway. |
Thank you @sberyozkin - I am watching #7171 already. I also suggested that Apache POI, together with XMLBeans, become a separate extension. |
…ative mode - move reflections maven artifact under the tika-deployment module
Hi guys, is there a date when this will be fixed? I am planning to use Apache Tika with Quarkus in a microservice environment, and this bug is preventing the deployment of our stack. |
https://github.com/apache/poi/blob/trunk/src/java/org/apache/poi/poifs/nio/CleanerUtil.java#L180 has to be addressed, I had to add Update: a cleaner workaround is in place now thanks to @Sanne providing a |
@slpereira I don't have enough time to prioritize Tika issues; however, slowly but surely some issues are being addressed. I'll pick up this issue during the next round, when I start looking at Tika issues. Thanks |
Hello everybody, any news on this issue? I just ran into it using Quarkus 2.1.2.Final in native mode as well, non-native is working nicely. I could provide stack traces and such if needed ... thank you! |
I am going to close this as the Tika extension has been moved to the Quarkiverse |
A test project that reproduces the problem is available here.
Steps to reproduce:
./mvnw package -Pnative
./target/otaibe-apache-tika-docx-native-1.0-SNAPSHOT-runner
curl -v -H "Content-Type: application/octet-stream" -X POST --data-binary @src/test/resources/test_bg.docx http://localhost:11025/parse
mvn package -D%test.service.http.port=11025
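The `curl` call in the steps above can also be issued from Java, which may be handy for scripting the reproduction. This is a minimal sketch using the JDK's built-in `java.net.http` client (Java 11+); the class name `ParseRequestSketch` and the helper `buildRequest` are made up for illustration, while the URL, header, and file path are taken from the `curl` command.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ParseRequestSketch {

    // Builds the same request as the curl command: a POST with the raw
    // document bytes and an application/octet-stream content type.
    static HttpRequest buildRequest(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/octet-stream")
                .POST(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Path and port come from the reproduction steps; run this from the
        // test project root while the native runner is listening.
        byte[] docx = Files.readAllBytes(Path.of("src/test/resources/test_bg.docx"));
        HttpRequest request = buildRequest(URI.create("http://localhost:11025/parse"), docx);

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the status and body so a failing parse is visible.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Keeping the request construction in its own method means it can be checked without a running server.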