Allow indexing of embedded documents/attachments as individual docs #361

Open
tballison opened this issue Oct 13, 2016 · 0 comments

Tika's legacy behavior was to concatenate the content of embedded documents into one handler and ignore metadata from embedded documents. This was probably driven by the desire to allow Tika to handle reads and writes in a streaming fashion.

If you're willing to forgo streaming and to hold the extracted content in memory, you might consider Jukka Zitting's and Nick Burch's "RecursiveParserWrapper", which returns a list of Metadata objects for each input file. The first Metadata object in the list represents the container document, and each of the rest represents one embedded document. The "text" for each document/embedded document is stored in its Metadata object under the RecursiveParserWrapper.TIKA_CONTENT key.

You can see the output in JSON format via tika-app's -J option or the /rmeta endpoint in tika-server.

See recursiveParserWrapperExample() in this example. You can specify whether you want the content as text, HTML or XHTML via the BasicContentHandlerFactory.HANDLER_TYPE.

This is critical for maintaining metadata from embedded objects. Imagine, as one use case, that you have a zip of JPEGs with lat/longs: this will allow you to index each one individually.
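As a rough sketch of that use case, here is how the JSON list could be split into one index document per entry. It does not call Tika. It assumes the JSON has the shape described above (container first, then the embedded documents), that the content key serializes as "X-TIKA:content", and that the lat/long fields are named "geo:lat" and "geo:long". The sample data, the `split_into_docs` helper and the `id`/`parent_id` scheme are all hypothetical and only there for illustration.

```python
import json

# Assumption: the string behind RecursiveParserWrapper.TIKA_CONTENT.
CONTENT_KEY = "X-TIKA:content"

# Hypothetical sample shaped like the list described above:
# the container comes first, then one entry per embedded document.
SAMPLE_RMETA = json.dumps([
    {"resourceName": "photos.zip", "Content-Type": "application/zip",
     CONTENT_KEY: ""},
    {"resourceName": "a.jpg", "Content-Type": "image/jpeg",
     "geo:lat": "38.9", "geo:long": "-77.0", CONTENT_KEY: ""},
    {"resourceName": "b.jpg", "Content-Type": "image/jpeg",
     "geo:lat": "51.5", "geo:long": "-0.1", CONTENT_KEY: ""},
])


def split_into_docs(rmeta_json, container_id):
    """Turn one container's metadata list into separate index documents."""
    entries = json.loads(rmeta_json)
    docs = []
    for position, metadata in enumerate(entries):
        fields = dict(metadata)
        # Pull the extracted text out so it can be indexed as its own field.
        content = fields.pop(CONTENT_KEY, "")
        is_container = position == 0
        docs.append({
            # Hypothetical id scheme: container id plus position in the list.
            "id": container_id if is_container else f"{container_id}#{position}",
            "parent_id": None if is_container else container_id,
            "content": content,
            "metadata": fields,
        })
    return docs


docs = split_into_docs(SAMPLE_RMETA, "photos.zip")
for doc in docs:
    print(doc["id"], doc["metadata"].get("geo:lat"), doc["metadata"].get("geo:long"))
```

Each JPEG ends up as its own document with its own lat/long, plus a pointer back to the zip it came from, which is the part the concatenating legacy behavior loses.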

See SOLR-7229 for work to integrate this into Solr's DIH...I haven't gotten around to submitting a PR for that. :(
