Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate Tika XML configuration during build time #6752

Merged
merged 1 commit into from
Jan 31, 2020

Conversation

machi1990
Copy link
Member

test: TikaProcessorTest should use Quarkus test framework

Fixes #6746
Fixes #5700

test: TikaProcessorTest should use Quarkus test framework

Fixes quarkusio#6746
Fixes quarkusio#5700
void initializeTikaParser(BeanContainerBuildItem beanContainer, TikaRecorder recorder,
BuildProducer<ServiceProviderBuildItem> serviceProvider, TikaConfiguration configuration)
throws Exception {
Map<String, List<TikaParserParameter>> parsers = getSupportedParserConfig(configuration.tikaConfigPath,
Copy link
Member

@sberyozkin sberyozkin Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machi1990 first of all, thanks for this effort :-)
I'm not sure about removing a build item containing this Map. My plan, with the current master, has been to make this Map visible to various steps dealing with the individual parsers. For example, I suggested to @tpenakov in the issue where he works on OOXML support to group all PDF related build steps into a single method like preparePdfParser and check this map if PDF parser key is available and only then do all those PDF specific native registrations, the same for OOXML.

For example, this method, this one and this one are specific to PDF and the plan has been to make a single method preparePDFParser out of 3 of them and make all of that optional depending on whether org.apache.tika.parser.pdf.PDFParser key is in the map or not.

So I think that build item still has to be retained. Let me know please if it makes sense. CC @gsmet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machi1990 first of all, thanks for this effort :-)

Hi @sberyozkin thank you.

I'm not sure about removing a build item containing this Map. My plan, with the current master, has been to make this Map visible to various steps dealing with the individual parsers. For example, I suggested to @tpenakov in the issue where he works on OOXML support to group all PDF related build steps into a single method like preparePdfParser and check this map if PDF parser key is available and only then do all those PDF specific native registrations, the same for OOXML.

Oh sorry, I was not aware of the future plans around this class.

If I understood the description above correctly, it seems to be that these various methods (e.g preparePdfParser) will be consuming this build item from within the TikaProcessor class, correct?

If so, the same be achieved with an invocation of a private method passing the parsers' map as a parameter ?

Copy link
Member

@sberyozkin sberyozkin Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machi1990 sorry, had to sign off for a bit.
Yes, I actually thought about it while being offline :-), but then TikaProcessor would just have a large single method :-), calling preparePdfParser, prepareOOXmlParser, etc. It just looks a bit cleaner to me to have dedicated build steps per specific parser, or does the new CL model does not allow for it as far as Tika is concerned ? Not a big deal but wonder what is the best approach here

Copy link
Member

@sberyozkin sberyozkin Jan 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @machi1990 @gsmet so what do you think; I've been thinking today, I'd like, I think, to have a build step per each parser which requires some support. The main build step is the one which generates the tika config. may be it can be called as such, generateTikaConfig or createTikaConfig, it would return a Map with the build item, and then there would be PDF build step, OOXML build step, etc as required which will optionally register the extra bits if the map has the parser key. Otherwise it would indeed have to be this master build step keeping adding private calls to preparePDF.etc and it would feel a bit less cool to me :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machi1990 sorry, had to sign off for a bit.

It is okay, I missed your ping too.

Yes, I actually thought about it while being offline :-), but then TikaProcessor would just have a large single method :-), calling preparePdfParser, prepareOOXmlParser, etc. It just looks a bit cleaner to me to have dedicated build steps per specific parser, or does the new CL model does not allow for it as far as Tika is concerned ?

Hi @sberyozkin the change is not CL related but on whether we need the BuildItem to achieve the above.

Hi @machi1990 @gsmet so what do you think; I've been thinking today, I'd like, I think, to have a build step per each parser which requires some support. The main build step is the one which generates the tika config. may be it can be called as such, generateTikaConfig or createTikaConfig, it would return a Map with the build item, and then there would be PDF build step, OOXML build step, etc as required which will optionally register the extra bits if the map has the parser key. Otherwise it would indeed have to be this master build step keeping adding private calls to preparePDF.etc and it would feel a bit less cool to me :-)

If you prefer to keep it around, I can bring it back. @gsmet WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machi1990 sorry, missed your comment. Yeah, it would be nice, I reckon it is not really related to the class loading issue, though I can see why Guillaume saw it being redundant, because the optimizations plans I'm referring to above are not yet implemented :-) and that item was only used to facilitate the sub-optimal tika config generation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about being redundant. We certainly cannot pass those classes from build time to runtime and expect them to work.

I don't see the need for a build item if everything is done in the same extension.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gsmet This PR itself has preserved a reasonable number of build steps, one generates the configuration now, there are few build steps which deal with loading some resources, common to all of the Tika parsers, and some - common to PDF, and there would be OOXML specific build step coming in, we are not going ahead with collapsing them all into a single step in this PR. So it is clean and nice. Now, if this build item is removed, then for me to implement the optimization idea I had in mind I'd have to basically collapse everything (but a single step involving the Tika-common resources) into a single build step (generate the config, optionally load PDF/OOXML specific resources).
I appreciate a build item is great for coordinating between different extensions, but in this case it would help the tika processor keep the parser specific build steps separate, keeping the native optimization in mind. Np if you don't agree, please go ahead with the merge.
@machi1990 thanks for fix, cheers

@sberyozkin sberyozkin self-requested a review January 28, 2020 12:17
Copy link
Member

@sberyozkin sberyozkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gsmet @machi1990 Lets go ahead with the merge as this fix may help @tpenakov in this issue.

I can reintroduce the build item in a follow up PR if I feel like it :-)

Thanks

Copy link
Member

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

@gsmet gsmet merged commit f0426bd into quarkusio:master Jan 31, 2020
@gsmet gsmet added this to the 1.3.0 milestone Jan 31, 2020
@gsmet gsmet changed the title feat(tika): generate tika xml configuration during build time Generate Tika xml configuration during build time Jan 31, 2020
@gsmet gsmet changed the title Generate Tika xml configuration during build time Generate Tika XML configuration during build time Jan 31, 2020
@machi1990 machi1990 deleted the fix-5700 branch January 31, 2020 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tika extension doesn't like the new class loading TikaProcessorTest should use Quarkus test framework
3 participants