-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate Tika XML configuration during build time #6752
Conversation
test: TikaProcessorTest should use Quarkus test framework Fixes quarkusio#6746 Fixes quarkusio#5700
void initializeTikaParser(BeanContainerBuildItem beanContainer, TikaRecorder recorder, | ||
BuildProducer<ServiceProviderBuildItem> serviceProvider, TikaConfiguration configuration) | ||
throws Exception { | ||
Map<String, List<TikaParserParameter>> parsers = getSupportedParserConfig(configuration.tikaConfigPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@machi1990 first of all, thanks for this effort :-)
I'm not sure about removing a build item containing this Map. My plan, with the current master, has been to make this Map visible to various steps dealing with the individual parsers. For example, I suggested to @tpenakov in the issue where he works on OOXML support to group all PDF related build steps into a single method like preparePdfParser
and check this map if PDF parser key is available and only then do all those PDF specific native registrations, the same for OOXML.
For example, this method, this one and this one are specific to PDF and the plan has been to make a single method preparePDFParser
out of 3 of them and make all of that optional depending on whether org.apache.tika.parser.pdf.PDFParser
key is in the map or not.
So I think that build item still has to be retained. Let me know please if it makes sense. CC @gsmet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@machi1990 first of all, thanks for this effort :-)
Hi @sberyozkin thank you.
I'm not sure about removing a build item containing this Map. My plan, with the current master, has been to make this Map visible to various steps dealing with the individual parsers. For example, I suggested to @tpenakov in the issue where he works on OOXML support to group all PDF related build steps into a single method like
preparePdfParser
and check this map if PDF parser key is available and only then do all those PDF specific native registrations, the same for OOXML.
Oh sorry, I was not aware of the future plans around this class.
If I understood the description above correctly, it seems to be that these various methods (e.g preparePdfParser
) will be consuming this build item from within the TikaProcessor
class, correct?
If so, the same be achieved with an invocation of a private
method passing the parsers' map as a parameter ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@machi1990 sorry, had to sign off for a bit.
Yes, I actually thought about it while being offline :-), but then TikaProcessor would just have a large single method :-), calling preparePdfParser
, prepareOOXmlParser
, etc. It just looks a bit cleaner to me to have dedicated build steps per specific parser, or does the new CL model does not allow for it as far as Tika is concerned ? Not a big deal but wonder what is the best approach here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @machi1990 @gsmet so what do you think; I've been thinking today, I'd like, I think, to have a build step per each parser which requires some support. The main build step is the one which generates the tika config. may be it can be called as such, generateTikaConfig
or createTikaConfig
, it would return a Map with the build item, and then there would be PDF build step, OOXML build step, etc as required which will optionally register the extra bits if the map has the parser key. Otherwise it would indeed have to be this master build step keeping adding private calls to preparePDF
.etc and it would feel a bit less cool to me :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@machi1990 sorry, had to sign off for a bit.
It is okay, I missed your ping too.
Yes, I actually thought about it while being offline :-), but then TikaProcessor would just have a large single method :-), calling
preparePdfParser
,prepareOOXmlParser
, etc. It just looks a bit cleaner to me to have dedicated build steps per specific parser, or does the new CL model does not allow for it as far as Tika is concerned ?
Hi @sberyozkin the change is not CL related but on whether we need the BuildItem to achieve the above.
Hi @machi1990 @gsmet so what do you think; I've been thinking today, I'd like, I think, to have a build step per each parser which requires some support. The main build step is the one which generates the tika config. may be it can be called as such, generateTikaConfig or createTikaConfig, it would return a Map with the build item, and then there would be PDF build step, OOXML build step, etc as required which will optionally register the extra bits if the map has the parser key. Otherwise it would indeed have to be this master build step keeping adding private calls to preparePDF.etc and it would feel a bit less cool to me :-)
If you prefer to keep it around, I can bring it back. @gsmet WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@machi1990 sorry, missed your comment. Yeah, it would be nice, I reckon it is not really related to the class loading issue, though I can see why Guillaume saw it being redundant, because the optimizations plans I'm referring to above are not yet implemented :-) and that item was only used to facilitate the sub-optimal tika config generation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about being redundant. We certainly cannot pass those classes from build time to runtime and expect them to work.
I don't see the need for a build item if everything is done in the same extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsmet This PR itself has preserved a reasonable number of build steps, one generates the configuration now, there are few build steps which deal with loading some resources, common to all of the Tika parsers, and some - common to PDF, and there would be OOXML specific build step coming in, we are not going ahead with collapsing them all into a single step in this PR. So it is clean and nice. Now, if this build item is removed, then for me to implement the optimization idea I had in mind I'd have to basically collapse everything (but a single step involving the Tika-common resources) into a single build step (generate the config, optionally load PDF/OOXML specific resources).
I appreciate a build item is great for coordinating between different extensions, but in this case it would help the tika processor keep the parser specific build steps separate, keeping the native optimization in mind. Np if you don't agree, please go ahead with the merge.
@machi1990 thanks for fix, cheers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gsmet @machi1990 Lets go ahead with the merge as this fix may help @tpenakov in this issue.
I can reintroduce the build item in a follow up PR if I feel like it :-)
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
test: TikaProcessorTest should use Quarkus test framework
Fixes #6746
Fixes #5700