Allow indexing of embedded documents/attachments as individual docs #361

tballison · 2016-10-13T13:16:08Z

Tika's legacy behavior was to concatenate the content of embedded documents into one handler and ignore metadata from embedded documents. This was probably driven by the desire to allow Tika to handle reads and writes in a streaming fashion.

If you're willing to forgo streaming and hold the extracted content in memory, you might consider Jukka Zitting's and Nick Burch's "RecursiveParserWrapper", which returns a list of Metadata objects for each input file. The first Metadata object in the list represents the container document, and each of the rest represents one embedded document. The extracted text for each document, container or embedded, is stored in its own Metadata object under the RecursiveParserWrapper.TIKA_CONTENT key.

You can see the output in JSON format via tika-app's -J option or the /rmeta endpoint in tika-server.

See recursiveParserWrapperExample() in this example. You can specify whether you want the content as text, HTML or XHTML via the BasicContentHandlerFactory.HANDLER_TYPE enum.

This is critical for maintaining metadata from embedded objects. As one use case, imagine you have a zip of JPEGs with lat/longs: this approach lets you index each image individually, with its own metadata.
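
To make the shape of that list concrete, here is a minimal sketch of turning it into one index record per document. It deliberately does not call Tika: plain `Map<String, String>` objects stand in for Tika's Metadata objects so the snippet runs on its own. The class name `EmbeddedDocIndexer`, the `id`, `parent_id` and `text` field names, and the `#<position>` id scheme are all hypothetical, and the `X-TIKA:content` string is my assumption about the value behind RecursiveParserWrapper.TIKA_CONTENT, so check it against your Tika version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch only: plain Maps stand in for Tika's Metadata objects so the
 * example runs without Tika on the classpath.
 */
final class EmbeddedDocIndexer {

    // Assumed string value of RecursiveParserWrapper.TIKA_CONTENT; verify for your version.
    static final String CONTENT_KEY = "X-TIKA:content";

    /**
     * Turns the per-file list (container first, embedded documents after)
     * into one index record per document.
     */
    static List<Map<String, String>> toIndexDocs(String containerId,
                                                 List<Map<String, String>> metadataList) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            // Copy so the caller's metadata is left untouched.
            Map<String, String> doc = new LinkedHashMap<>(metadataList.get(i));

            // Hypothetical id scheme: the container keeps its id, children get "#<position>".
            doc.put("id", i == 0 ? containerId : containerId + "#" + i);

            // Hypothetical field linking each embedded document back to its container.
            if (i > 0) {
                doc.put("parent_id", containerId);
            }

            // Move the extracted text from the content key to a hypothetical "text" field.
            String content = doc.remove(CONTENT_KEY);
            doc.put("text", content == null ? "" : content);

            docs.add(doc);
        }
        return docs;
    }
}
```

For the zip-of-JPEGs case, calling `toIndexDocs("photos.zip", list)` would give one record for the zip itself plus one record per image, each image record carrying its own metadata (for example its lat/long fields) alongside a `parent_id` pointing back to the zip.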

See SOLR-7229 for work to integrate this into Solr's DIH...I haven't gotten around to submitting a PR for that. :(

tballison mentioned this issue Oct 13, 2016

Tika parser may not be parsing embedded documents #358

Closed
