Tika's legacy behavior was to concatenate the content of embedded documents into one handler and ignore metadata from embedded documents. This was probably driven by the desire to allow Tika to handle reads and writes in a streaming fashion.
If you're willing to forgo streaming and to hold the extracted content in memory, you might consider Jukka Zitting's and Nick Burch's "RecursiveParserWrapper", which returns a list of Metadata objects for each input file. The first Metadata object in the list represents the container document, and the rest represent the embedded documents. The "text" for each document/embedded document is stored in its Metadata object under the RecursiveParserWrapper.TIKA_CONTENT key.
You can see the output in JSON format via tika-app's -J option or the /rmeta endpoint in tika-server.
See recursiveParserWrapperExample() in this example. You can specify whether you want the content as text, HTML or XHTML via the BasicContentHandlerFactory.HANDLER_TYPE.
This is critical for preserving metadata from embedded objects. As one use case, imagine you have a zip of JPEGs with lat/longs: this will allow you to index each image individually.
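To make the zip-of-JPEGs case concrete, here is a minimal plain-Java sketch of the difference between the legacy concatenated view and the per-document list. It does not call Tika: a `Map<String, String>` stands in for Tika's Metadata object, and the class name, the helper methods, the sample values, and the `resourceName`, `geo:lat`, and `geo:long` keys are illustrative assumptions. `X-TIKA:content` is assumed to be the string behind RecursiveParserWrapper.TIKA_CONTENT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedMetadataSketch {
    // Assumed string value behind RecursiveParserWrapper.TIKA_CONTENT.
    public static final String CONTENT_KEY = "X-TIKA:content";

    // Hypothetical helper: builds one stand-in "Metadata" entry.
    public static Map<String, String> entry(String name, String text, String lat, String lon) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("resourceName", name);
        m.put(CONTENT_KEY, text);
        if (lat != null) m.put("geo:lat", lat);
        if (lon != null) m.put("geo:long", lon);
        return m;
    }

    // Made-up sample: index 0 is the container, the rest are embedded files.
    public static List<Map<String, String>> sample() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(entry("photos.zip", "", null, null));
        docs.add(entry("a.jpg", "caption A", "48.8584", "2.2945"));
        docs.add(entry("b.jpg", "caption B", "40.6892", "-74.0445"));
        return docs;
    }

    // Legacy-style view: all text joined into one blob, per-file metadata lost.
    public static String concatenate(List<Map<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> doc : docs) {
            String text = doc.get(CONTENT_KEY);
            if (text != null && !text.isEmpty()) {
                sb.append(text).append('\n');
            }
        }
        return sb.toString();
    }

    // Per-document view: skip the container and keep each embedded entry whole,
    // so every image can be indexed with its own coordinates.
    public static List<Map<String, String>> embeddedOnly(List<Map<String, String>> docs) {
        if (docs.size() <= 1) {
            return new ArrayList<>();
        }
        return new ArrayList<>(docs.subList(1, docs.size()));
    }

    public static void main(String[] args) {
        for (Map<String, String> doc : embeddedOnly(sample())) {
            System.out.println(doc.get("resourceName") + " @ "
                    + doc.get("geo:lat") + "," + doc.get("geo:long"));
        }
    }
}
```

With the concatenated view, the two captions survive but nothing says which coordinates belong to which image. The per-document list keeps that association.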
See SOLR-7229 for work to integrate this into Solr's DIH...I haven't gotten around to submitting a PR for that. :(