Using term_vector offsets to get text from the actual text of the PDF #145
Comments
I think that the example provided in "Highlighting attachments" might work/help. |
I am not getting the actual text even after following "Highlighting attachments". I have followed all the examples provided in your documentation. Please help me find where I am going wrong. I was expecting: {
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.13561106,
"hits": [
{
"_index": "test",
"_type": "person",
"_id": "1",
"_score": 0.13561106,
"highlight": {
"file.content": [
"\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
]
}
}
]
}
} I am getting {
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "person",
"_id": "1",
"_score": 1,
"_source": {
"file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
}
]
}
} More information :
|
@helicalnitin You can ask this on discuss.elastic.co which is a better place for questions. That being said, it sounds like you highlighted |
@dadoonet - Thanks for the quick reply. Sure, I will definitely post the same question on discuss.elastic.co. {
"fields": [],
"query": {
"match": {
"file.content": "king queen"
}
},
"highlight": {
"fields": {
"file.content": {}
}
}
} |
Ah, sorry. I misread your output. Can you describe all the exact steps? |
Steps I followed
DELETE /test
PUT /test
PUT /test/person/_mapping
{
"person": {
"properties": {
"file": {
"type": "attachment",
"fields": {
"content": {
"type": "string",
"term_vector":"with_positions_offsets",
"store": true
}
}
}
}
}
}
PUT /test/person/1?refresh=true
{
"file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
GET /test/person/_search
{
"fields": [],
"query": {
"match": {
"file.content": "king queen"
}
},
"highlight": {
"fields": {
"file.content": {
}
}
}
}
The output I get is shown below:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "person",
"_id": "1",
"_score": 1,
"_source": {
"file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
}
]
}
}
I was expecting something like this:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.13561106,
"hits": [
{
"_index": "test",
"_type": "person",
"_id": "1",
"_score": 0.13561106,
"highlight": {
"file.content": [
"\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
]
}
}
]
}
}
I am using Windows 8.1 and Java 1.8.0_25 |
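One way to read the two responses above: the actual response has no "highlight" section, returns "_source" even though the request set "fields": [], and scores the hit at exactly 1. That pattern is what a search with no query body (an implicit match_all) would produce, so it suggests the request body never reached Elasticsearch. The check below is only a heuristic sketch; the function name `query_body_was_ignored` is hypothetical and the sample dictionaries are abbreviated from the responses in this thread.

```python
def query_body_was_ignored(response: dict) -> bool:
    """Heuristic: True when every hit lacks "highlight" but carries "_source".

    Highlighting was requested and "fields": [] should suppress "_source",
    so this combination hints that the search body was dropped.
    """
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return False  # nothing to inspect
    return all("highlight" not in hit and "_source" in hit for hit in hits)


# Abbreviated from the response actually received in this thread.
actual = {
    "hits": {"hits": [{"_id": "1", "_score": 1, "_source": {"file": "(base64)"}}]}
}

# Abbreviated from the expected response.
expected = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_score": 0.13561106,
                "highlight": {"file.content": ["God Save the <em>Queen</em>"]},
            }
        ]
    }
}

print(query_body_was_ignored(actual))
print(query_body_was_ignored(expected))
```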
Just to make sure: when did you start the Elasticsearch node? |
After installing the plugin, I restarted the node. |
I just found a way to get the actual text: using POST instead of GET for the search and highlight request worked for me.
POST /test/person/_search
{
"fields": [],
"query": {
"match": {
"file.content": "king queen"
}
},
"highlight": {
"fields": {
"file.content": {
}
}
}
} |
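A likely explanation, though not confirmed in this thread, is that the tool in use does not send a body with GET requests, so the query and highlight sections are silently dropped. Sending the search explicitly as POST avoids that. The sketch below builds such a request with Python's standard library; the address `http://localhost:9200` and the helper name `build_search_request` are assumptions for illustration, and the body is the one used in this thread.

```python
import json
import urllib.request


def build_search_request(base_url: str) -> urllib.request.Request:
    """Build a POST search request that carries the query and highlight body."""
    body = {
        "fields": [],
        "query": {"match": {"file.content": "king queen"}},
        "highlight": {"fields": {"file.content": {}}},
    }
    return urllib.request.Request(
        url=f"{base_url}/test/person/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # explicit POST so the body is always transmitted
    )


# Assumed local node address; adjust to your cluster.
request = build_search_request("http://localhost:9200")

# To actually run the search against a live node:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```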
Oh I see. You are not using curl or a client lib but a browser plugin or a 3rd party tool, right? |
Yes. I was using elastic-head, but the same problem occurs with the Java search API. |
Hi @helicalnitin, I am getting the same issue: the search returns the Base64 content as is. Did you find the root cause of this issue? |
I have large PDFs indexed in elasticsearch. I wish to retrieve words using the offsets of tokens that we get from the Termvector API. How can I get the actual text from the Base64? Is it even possible?
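As a rough illustration of the question, the sketch below decodes the Base64 sample used in this thread and slices it by character offsets. It rests on a strong assumption: the attachment is plain text, so the decoded bytes are the text that was analyzed. For a PDF, the Base64 decodes to the binary PDF, and term vector offsets refer to the text extracted at index time, not to the PDF bytes, so the text would first have to be extracted the same way. The offsets below are hypothetical values of the kind reported as `start_offset` and `end_offset`; they were worked out by hand for this sample, not taken from a real response.

```python
import base64

# Sample document from this thread (a plain-text attachment).
ENCODED = "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="


def decode_text_attachment(encoded: str) -> str:
    """Decode a Base64 payload, assuming it holds UTF-8 plain text."""
    return base64.b64decode(encoded).decode("utf-8")


def slice_by_offsets(text: str, start_offset: int, end_offset: int) -> str:
    """Return the characters in [start_offset, end_offset)."""
    return text[start_offset:end_offset]


text = decode_text_attachment(ENCODED)

# Hypothetical offsets for the tokens "queen" and "king" in this sample.
queen = slice_by_offsets(text, 14, 19)
king = slice_by_offsets(text, 50, 54)

print(text)
print(queen, king)
```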