Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#6549 - Apache Tika can not parse Microsoft Docx format in native mode #7198

Conversation

tpenakov
Copy link

@tpenakov tpenakov commented Feb 14, 2020

Fixes: #6549
A few things to points out:

  • xalan dependency is included - probably on wrong place: extensions/arc/runtime/pom.xml
  • additional config property is added for supported file types - we have support for pptx ans xlsx file types as well, however if we include all of them - the native build ends with OOM error.
  • Trying to use @ConfigProperty in io.quarkus.it.tika.TikaEmbeddedContentTest leads to NPE for native build. This one (@ConfigProperty without @Inject does not work in test #2061) claims that is fixed, but I am receiving it.

@boring-cyborg boring-cyborg bot added area/arc Issue related to ARC (dependency injection) area/core area/dependencies Pull requests that update a dependency file area/tika labels Feb 14, 2020
@geoand geoand requested a review from sberyozkin February 14, 2020 12:26
@@ -2762,6 +2775,32 @@
<artifactId>quarkus-banner</artifactId>
<version>${project.version}</version>
</dependency>

<!-- reflections -->
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure we don't want to add this dependency to the core. We have basically done without it for all this time, so I don't really think it's warranted to add it now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @geoand - if you check the code the reflection framework is used in more than one place. I've just put it on more centralized place. Anyway - what I've tried to achieve here is to register for reflection all classes in given package. Could you please advice is there an other way to achieve it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use Jandex instead to query the application classes structures, see the examples in other extensions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tpenakov like @gastaldi said, we usually don't use reflection in Quarkus extensions but instead rely on Jandex (which is effectively an index for Java code) to look up various metadata.

One question, do you really need to register all classes of a package for reflection?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @geoand ,

The package which I want to register contains the implementation of components for parsing docx format (org.openxmlformats.schemas.wordprocessingml.x2006.main.impl). Also will need to include all classes in packagess for parsing xlsx and pptx components.
I've tried to google this for Jandex, but without success. If you know how to do it in Jandex (to register all classes for given package) - this will save me a lot of time and I will remove the reflection framework :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Jandex does this TBH. If you must register all classes in a package for reflection, then sure go ahead and use the reflections module (if @sberyozkin agrees), but please only use it in your extension and no where else

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @geoand , @gastaldi and @sberyozkin,
Reflections maven artifact is now moved under the tika-deployment.

@@ -27,6 +27,17 @@
<groupId>org.eclipse.microprofile.context-propagation</groupId>
<artifactId>microprofile-context-propagation-api</artifactId>
</dependency>
<dependency>
<groupId>xalan</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ARC needs to depend on Xalan?

Copy link
Author

@tpenakov tpenakov Feb 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gastaldi ,
If I put it on the other places the tika-integration-tests are failed.
I've added a few more notes it he PR description.

import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.StringUtils;

import java.lang.reflect.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use star imports, make sure to use the IDE config file as described in https://github.com/quarkusio/quarkus/blob/master/CONTRIBUTING.md#ide-config-and-code-style

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right @gastaldi -the IDE config file wasn't included. Now it is, however the wildcard import is still there :(

@@ -79,6 +79,10 @@
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @geoand mentioned, I don't think it's a good idea to add this library to core

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gastaldi ,
if you check the code the reflection framework is used in more than one place. I've just put it on more centralized place. Anyway - what I've tried to achieve here is to register for reflection all classes in given package. Could you please advice is there an other way to achieve it?

Not a problem to move it only on Tika related deploynment, but as I said this library is used on more than one place. In my experience if such things is happened, we have to move the dependency on more centralized place, or to not used it at all and to replace it with the similar functionality.

Comment on lines +45 to +50
<exclusions>
<exclusion>
<groupId>org.apache.xmlbeans</groupId>
<artifactId>xmlbeans</artifactId>
</exclusion>
</exclusions>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this exclusion and the dependency below (since it's resolved transitively)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gastaldi - There was a difference in the versions of XML Beans. I found only this way to deal with it.

@sberyozkin
Copy link
Member

sberyozkin commented Feb 14, 2020

@tpenakov Thanks for giving it a go, appreciated. I agree though with @geoand and @gastaldi is that the way it is done it would cause too many side-effects. Let me try to find some time asap and experiment with your PR locally, I'll keep you uptodate, thank you !

@tpenakov
Copy link
Author

Thank you @sberyozkin - I've added some comments/answers to @geoand and @gastaldi and added and the explanations in the PR. I also have some concerns (written in PR description)

…ative mode - move reflections maven artifact under the tika-deployment module
@gsmet
Copy link
Member

gsmet commented Oct 12, 2020

Closing this one as out of date. Let's create a new PR if we still need it.

@gsmet gsmet closed this Oct 12, 2020
@gsmet gsmet added the triage/out-of-date This issue/PR is no longer valid or relevant label Oct 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/arc Issue related to ARC (dependency injection) area/core area/dependencies Pull requests that update a dependency file area/tika triage/out-of-date This issue/PR is no longer valid or relevant
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apache Tika can not parse Microsoft Docx format in native mode
5 participants