This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

Using term_vector offsets to get text from the actual text of the PDF #145

Open
apanimesh061 opened this issue Jul 25, 2015 · 12 comments

@apanimesh061

I have large PDFs indexed in elasticsearch. I wish to retrieve words using the offsets of tokens that we get from the Termvector API. How can I get the actual text from the Base64? Is it even possible?

@dadoonet
Member

I think that the example provided in "Highlighting attachments" might help.
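
If the goal is the original words behind term-vector offsets rather than highlighted fragments, one option is to slice the extracted text with each token's `start_offset` and `end_offset`. The sketch below is illustrative only: `surface_forms` is a hypothetical helper, and `sample_response` is hand-written in the shape the Term Vectors API returns rather than captured from a real node. It also assumes the text being sliced is the same text that was analyzed. That holds for the plain-text payload used later in this thread, but for a real PDF the offsets refer to the text extracted at index time, not to the Base64 string or the PDF bytes.

```python
import base64


def surface_forms(text, term_vectors, field):
    """Return (term, start, end, original_text) tuples sorted by start offset."""
    terms = term_vectors["term_vectors"][field]["terms"]
    found = []
    for term, info in terms.items():
        # "tokens" is only present when positions/offsets were requested and stored.
        for token in info.get("tokens", []):
            start, end = token["start_offset"], token["end_offset"]
            found.append((term, start, end, text[start:end]))
    return sorted(found, key=lambda item: item[1])


# The payload from this thread is plain text, so decoding it yields the analyzed
# string directly. For a real PDF this shortcut does not apply.
payload = "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
text = base64.b64decode(payload).decode("utf-8")

# Hand-written sample (an assumption), limited to two terms for brevity.
sample_response = {
    "term_vectors": {
        "file.content": {
            "terms": {
                "queen": {"term_freq": 1, "tokens": [
                    {"position": 3, "start_offset": 14, "end_offset": 19}]},
                "king": {"term_freq": 1, "tokens": [
                    {"position": 8, "start_offset": 50, "end_offset": 54}]},
            }
        }
    }
}

for term, start, end, original in surface_forms(text, sample_response, "file.content"):
    print(term, start, end, original)
```

The offsets come from a Java-based analyzer, so text containing characters outside the Basic Multilingual Plane may need extra care when sliced in Python.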

@helicalnitin

I am not getting the actual text even after following "Highlighting attachments". I have followed all the examples in the documentation. Please help me find where I am going wrong. I was expecting a response like this:

{
   "took": 9,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13561106,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 0.13561106,
            "highlight": {
               "file.content": [
                  "\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
               ]
            }
         }
      ]
   }
}

Instead, I am getting:

{

    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "person",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
                }
            }
        ]
    }

}

More information:

  1. I am using Elasticsearch 2.1.1
  2. I am running a single node
  3. The mapper attachments plugin is installed properly

@dadoonet
Member

@helicalnitin You can ask this on discuss.elastic.co which is a better place for questions.

That being said, it sounds like you highlighted file and not file.content.

@helicalnitin

@dadoonet - Thanks for the quick reply. Sure, I will post the same question to discuss.elastic.co.
I did highlight file.content and not just file. Please see the JSON:

{
  "fields": [],
  "query": {
    "match": {
      "file.content": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {}
    }
  }
}

@dadoonet
Member

Ha sorry. I misread your output.

Can you describe all the exact steps?

@helicalnitin

Steps I followed:

  1. Installed Elasticsearch 2.1.1
  2. Installed the mapper attachments plugin using the command below
    bin/plugin install elasticsearch/elasticsearch-mapper-attachments/3.1.1
  3. Then I ran the following commands
DELETE /test
PUT /test
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "type": "string",
            "term_vector":"with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}
PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
GET /test/person/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {
      }
    }
  }
}
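
The request bodies in step 3 can also be produced from code, which avoids hand-encoding the Base64 payload. The following is a minimal sketch, assuming Python; `attachment_document` and `highlight_search` are hypothetical helper names, and nothing is sent to a node here.

```python
import base64
import json


def attachment_document(raw_bytes):
    """Wrap raw file bytes as the Base64 string the attachment field expects."""
    return {"file": base64.b64encode(raw_bytes).decode("ascii")}


def highlight_search(field, query_text):
    """Build a match query on `field` that also asks for highlights on it."""
    return {
        "fields": [],
        "query": {"match": {field: query_text}},
        "highlight": {"fields": {field: {}}},
    }


# Same plain-text content as the document indexed in this thread.
raw = b'"God Save the Queen" (alternatively "God Save the King"'
document = attachment_document(raw)
search_body = highlight_search("file.content", "king queen")

print(json.dumps(document, indent=2))
print(json.dumps(search_body, indent=2))
```

For a real PDF, `raw` would be the file's bytes read in binary mode instead of a literal.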

The output I get is:

{

    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "person",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
                }
            }
        ]
    }

}


I was expecting something like this:

{
   "took": 9,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13561106,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 0.13561106,
            "highlight": {
               "file.content": [
                  "\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
               ]
            }
         }
      ]
   }
}

I am using Windows 8.1 and Java 1.8.0_25
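
The difference between the two responses above can be checked mechanically: the expected hit carries a `highlight` section, while the actual hit only carries `_source`. This small sketch flags hits that came back without highlights; `hits_without_highlight` is a hypothetical helper, and the two dictionaries are trimmed copies of the responses in this comment.

```python
def hits_without_highlight(response):
    """Return the _id of every hit that has no "highlight" section."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit.get("_id") for hit in hits if "highlight" not in hit]


# Trimmed copy of the response actually received.
actual = {
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [{
            "_index": "test", "_type": "person", "_id": "1", "_score": 1,
            "_source": {
                "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
            },
        }],
    }
}

# Trimmed copy of the expected response.
expected = {
    "hits": {
        "total": 1,
        "max_score": 0.13561106,
        "hits": [{
            "_index": "test", "_type": "person", "_id": "1", "_score": 0.13561106,
            "highlight": {"file.content": [
                "\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
            ]},
        }],
    }
}

print(hits_without_highlight(actual))
print(hits_without_highlight(expected))
```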

@dadoonet
Member

Just to make sure: when did you start the Elasticsearch node?

@helicalnitin

After installing the plugin, I restarted the node.

@helicalnitin

I just found a way to get the actual text. What worked for me was using POST instead of GET for the search-and-highlight request.

POST /test/person/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {
      }
    }
  }
}
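
One possible explanation, not confirmed in this thread, is that some HTTP tools drop the body of a GET request, in which case Elasticsearch would receive a search with no query and no highlight section. Sending the search as POST sidesteps that. The sketch below only builds the request with Python's standard library and does not send it; `http://localhost:9200` is an assumed default address, and `build_search_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_search_request(base_url, index, doc_type, body):
    """Build (but do not send) a POST _search request carrying a JSON body."""
    url = "{}/{}/{}/_search".format(base_url.rstrip("/"), index, doc_type)
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = {
    "fields": [],
    "query": {"match": {"file.content": "king queen"}},
    "highlight": {"fields": {"file.content": {}}},
}

# The host and port are assumptions; adjust them to the node being queried.
request = build_search_request("http://localhost:9200", "test", "person", body)
print(request.get_method(), request.full_url)

# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```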

@dadoonet
Member

Oh I see. You are not using curl or a client lib but a browser plugin or a 3rd party tool, right?

@helicalnitin

Yes. I was using elastic-head, but the same problem occurs with the Java search API.

@yogeshkoli

Hi @helicalnitin, I am hitting the same issue: the search returns the Base64 content as is. Did you find the root cause?
