This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

Tika base64-to-extracted-text char filter (and: some troubles with Git) #170

Open
bjorn-ali-goransson opened this issue Oct 12, 2015 · 4 comments

Comments

@bjorn-ali-goransson
Contributor

I've made a small feature that I was too curious about to leave unimplemented.

(The implementation is simplistic and crude; I haven't done any serious Java development in years. I hope the point gets through nonetheless.)

It's a char filter called attachments_test (open to renaming). It's quite useful for getting acquainted with Tika, as well as for troubleshooting "why isn't query X giving a hit for attachment Y" tickets from clients.

So, a request like the following:

POST /_analyze?tokenizer=keyword&char_filters=attachments_test&text=e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0%3D

Would yield the following result:

{
   "tokens": [
      {
         "token": "Lorem ipsum dolor sit amet\n\n",
         "start_offset": 0,
         "end_offset": 0,
         "type": "word",
         "position": 1
      }
   ]
}
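As an aside, the text parameter in the request above is just URL-encoded base64; undoing the %3D escape and decoding it shows the tiny RTF document being analyzed:

```shell
# Decode the text parameter from the _analyze request above
# (the %3D URL escape has already been replaced with '=' here).
echo 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=' | base64 -d
# {\rtf1\ansi
# Lorem ipsum dolor sit amet
# \par }
```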

Of course, this is not something that should be used in actual analyzers (hence the cautionary _test suffix).

Also, it gives an error for unpadded base64 strings, indicating how many equals signs are missing (that part would also make a good micro-feature in the actual indexing logic).
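The padding check described above boils down to the fact that a base64 string's length must be a multiple of 4. A minimal sketch (the variable names are mine, not from the actual implementation), using the example payload with its trailing '=' stripped:

```shell
# Hypothetical check: base64 length must be a multiple of 4,
# so report how many '=' padding characters are missing.
b64='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0'  # '=' stripped
missing=$(( (4 - ${#b64} % 4) % 4 ))
echo "missing padding characters: $missing"
# missing padding characters: 1
```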

I hope you agree that this little feature can be regarded as quite useful! Without something like it, Tika is a mysterious little black box doing things we don't understand, and people resort to dreaming about copying the extracted text into their own properties, or to using Luke to introspect the Lucene index (there are lots of questions about that on the net).


Also, there's a problem here; I hope you can help me out. My commit was automatically merged into my previous pull request, which still seems to be open. Is that correct, given that it was tagged for inclusion in 3.1.0?

@dadoonet
Member

It sounds like a very good idea to me. Being able to extract content using Tika through the _analyze API is a lovely idea.

About your Git issue: it's because you did not create a new branch.

The first PR you wrote should have been on a new branch, such as doc/readme.
Then the code you wrote in this commit (bjorn-ali-goransson@9f7f551) should be on another branch, for example pr/analyze.

Try to:
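A sketch of the branch-per-PR workflow described above (the "upstream" remote name and the exact steps are my assumptions; 9f7f551 is the commit linked in the comment):

```shell
# Assumes "origin" is your fork and "upstream" is the main repository.
git fetch upstream
git checkout -b pr/analyze master   # dedicated branch for the _analyze work
git cherry-pick 9f7f551             # carry over the commit linked above
git push origin pr/analyze          # open the new PR from this branch
git checkout master
git reset --hard upstream/master    # put master back in sync with upstream
```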

@bjorn-ali-goransson
Contributor Author

(Fixing the Git problems is still on my list; I'm a bit swamped right now with other work...)

Could this (non-)technique be used to save the extracted contents in the source? Or is that not possible? And is the current inability to store the extracted text itself a feature?

@dadoonet
Member

By design, elasticsearch is not supposed to change a source document provided by the user.

@dadoonet
Member

But the extracted text can be stored as-is if you set store to true on the field. See the README for this.
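A mapping sketch of what that might look like (the type and field names here are illustrative guesses; check the plugin's README for the exact syntax of your version):

```json
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": { "type": "string", "store": true }
        }
      }
    }
  }
}
```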
