Using term_vector offsets to get text from the actual text of the PDF #145

apanimesh061 · 2015-07-25T03:39:16Z

I have large PDFs indexed in elasticsearch. I wish to retrieve words using the offsets of tokens that we get from the Termvector API. How can I get the actual text from the Base64? Is it even possible?

dadoonet · 2015-07-25T05:12:09Z

I think that the example provided in "Highlighting attachments" might help.
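As a rough illustration of the offsets idea in the question, here is a minimal Python sketch. It assumes the term vector offsets are character positions in the extracted text of `file.content`, not byte positions in the Base64 payload. `SAMPLE_TERM_VECTORS` is a hand-written stand-in shaped like the `term_vectors` part of a term vectors response, with offsets counted by hand rather than returned by a server, and `surface_forms` is a hypothetical helper name. Decoding the Base64 directly works here only because the sample payload from this thread is plain text; for a real PDF the decoded bytes are PDF binary, so the text would have to come from an extraction step first.

```python
import base64

# Base64 payload that is indexed as "file" later in this thread.
FILE_B64 = "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="

# Hand-written stand-in for the "term_vectors" section of a term vectors
# response. The offsets were counted by hand for illustration only.
SAMPLE_TERM_VECTORS = {
    "file.content": {
        "terms": {
            "queen": {"term_freq": 1, "tokens": [{"start_offset": 14, "end_offset": 19}]},
            "king": {"term_freq": 1, "tokens": [{"start_offset": 50, "end_offset": 54}]},
        }
    }
}


def surface_forms(text, term_vectors, field):
    """Map each indexed term to the original substrings its offsets point at."""
    terms = term_vectors[field]["terms"]
    return {
        term: [text[tok["start_offset"]:tok["end_offset"]] for tok in info["tokens"]]
        for term, info in terms.items()
    }


# Plain-text payload, so decoding yields the text directly (not true for a PDF).
text = base64.b64decode(FILE_B64).decode("utf-8")
forms = surface_forms(text, SAMPLE_TERM_VECTORS, "file.content")
print(forms)
```

The indexed terms are lowercased by the analyzer, while the slices recover the original casing from the text.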

helicalnitin · 2016-01-25T08:47:39Z

I am not getting the actual text even after using "Highlighting attachments". I have followed all the examples provided in your documentation. Please help me find where I am going wrong. I was expecting:

{
   "took": 9,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13561106,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 0.13561106,
            "highlight": {
               "file.content": [
                  "\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
               ]
            }
         }
      ]
   }
}

But I am getting:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "person",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
                }
            }
        ]
    }
}

More information:

I am using Elasticsearch 2.1.1.
I am running a single node.
The mapper attachments plugin is installed properly.

dadoonet · 2016-01-25T08:59:03Z

@helicalnitin You can ask this on discuss.elastic.co which is a better place for questions.

That being said, it sounds like you highlighted file and not file.content.

helicalnitin · 2016-01-25T09:53:34Z

@dadoonet - Thanks for the quick reply. Sure, I will definitely post the same on discuss.elastic.co.
I did highlight file.content and not just file. Please see the JSON:

{
  "fields": [],
  "query": {
    "match": {
      "file.content": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {}
    }
  }
}

dadoonet · 2016-01-25T09:57:51Z

Ah, sorry. I misread your output.

Can you describe all the exact steps?

helicalnitin · 2016-01-25T10:07:25Z

Steps I followed:

Installed Elasticsearch 2.1.1.
Installed the mapper attachments plugin using the following command:
bin/plugin install elasticsearch/elasticsearch-mapper-attachments/3.1.1
Then I ran the following commands:

DELETE /test
PUT /test
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "type": "string",
            "term_vector":"with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}
PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
GET /test/person/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {
      }
    }
  }
}

The output I get is:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "person",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
                }
            }
        ]
    }
}

I was expecting something like this:

{
   "took": 9,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13561106,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 0.13561106,
            "highlight": {
               "file.content": [
                  "\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
               ]
            }
         }
      ]
   }
}

I am using Windows 8.1 and Java 1.8.0_25

dadoonet · 2016-01-25T10:51:30Z

Just to make sure: when did you start the Elasticsearch node?

helicalnitin · 2016-01-25T11:23:53Z

After installing the plugin, I restarted the node.

helicalnitin · 2016-01-30T08:12:07Z

I just found a way to get the actual text. What worked for me was using POST instead of GET for the search and highlight request.

POST /test/person/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {
      }
    }
  }
}
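
One possible reading of why POST helped, offered as an assumption rather than something confirmed in this thread: some HTTP tools drop the request body on GET, in which case the node would see an empty search and return every document unhighlighted, which is consistent with the `max_score` of 1 and the raw `_source` in the output above. The Python sketch below sends the same body explicitly with POST using only the standard library. The `localhost:9200` address and the helper names `build_search_request` and `run_search` are illustrative assumptions.

```python
import json
import urllib.request

# Assumed address of a local node; adjust for your own setup.
SEARCH_URL = "http://localhost:9200/test/person/_search"


def build_search_request(url, terms):
    """Build a POST request carrying the match query and highlight settings."""
    body = {
        "fields": [],
        "query": {"match": {"file.content": terms}},
        "highlight": {"fields": {"file.content": {}}},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and parse the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request(SEARCH_URL, "king queen")
# To execute against a running node:
# for hit in run_search(request)["hits"]["hits"]:
#     print(hit.get("highlight", {}).get("file.content"))
```

Because the method and body are set explicitly, the query cannot be silently dropped by the client the way it might be with a body-less GET.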

dadoonet · 2016-01-30T09:17:25Z

Oh I see. You are not using curl or a client lib but a browser plugin or a 3rd party tool, right?

helicalnitin · 2016-01-30T09:21:23Z

Yes. I was using elastic-head, but the same problem occurs with the Java search API.

yogeshkoli · 2016-10-07T17:21:31Z

Hi @helicalnitin, I am getting the same issue: the search returns the base64 output as is. Did you find the root cause?
