Sorting by '_id' in Document Table visualization potentially causes large increase in heap usage (field data) #274
From linked discussion -
Hi @fbaligand, the suggestion to sort by _doc sounds very promising. I've done a small bit of testing: I modified the plugin to use _doc instead of _id -
The resulting sort portion of the queries now looks like this (when we exceed 10000 hits) -
I can confirm that queries that previously caused large increases in _id fielddata no longer do so, so that's a step in the right direction, I think. I see that the plugin uses the 'search after' feature, and that Elasticsearch recommends the use of _doc for sorting here: https://www.elastic.co/guide/en/elasticsearch/reference/current/sort-search-results.html
I have a small concern though. It might be nothing, but it's worth testing. I haven't personally seen duplicates or missing records in the tests I've done so far (the count of docs in the table matched the expected count of events, with no duplications), but I can also see that the _doc value differs depending on whether the result is returned from a primary or a replica shard. To test that, use the Kibana Dev Tools console to search for a single doc in an index with one or more replicas, sorting by _doc. Retry the query several times and you can see in the results that the _doc value differs. I have no idea offhand whether this could be an issue when using the search after feature, or whether it's a solved problem; perhaps you would have a better idea.
Cheers
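For readers following along, here is a minimal sketch of how a 'search after' pagination with a `_doc` tiebreaker fits together. It is not the plugin's actual query (that snippet is not reproduced in this thread); the helper names, the `@timestamp` sort field and the response fragment are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def build_page_body(
    sort_field: str,
    order: str,
    page_size: int,
    search_after: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build one page's search body: user sort first, then _doc as tiebreaker."""
    body: Dict[str, Any] = {
        "size": page_size,
        "sort": [{sort_field: {"order": order}}, {"_doc": {"order": "asc"}}],
    }
    if search_after is not None:
        # Sort values of the last hit from the previous page.
        body["search_after"] = search_after
    return body


def next_search_after(response: Dict[str, Any]) -> Optional[List[Any]]:
    """Return the sort values of the last hit, or None when the page is empty."""
    hits = response.get("hits", {}).get("hits", [])
    return hits[-1]["sort"] if hits else None


# Hypothetical response fragment; the sort values are made up.
fake_response = {
    "hits": {
        "hits": [
            {"_id": "a", "sort": [1652791200000, 17]},
            {"_id": "b", "sort": [1652791100000, 42]},
        ]
    }
}

first_body = build_page_body("@timestamp", "desc", 2)
second_body = build_page_body("@timestamp", "desc", 2, next_search_after(fake_response))
```

The second element of each hit's sort array is the `_doc` value, which is why it matters whether that value stays stable between page requests.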
Further to my previous comment: unfortunately, I have seen some duplicate entries in a recent test. I tested loading a visualization from 1 hour ago to 'now' in a cluster where documents are often updated multiple times shortly after creation to enrich them with further information. Subsequent document updates change the _doc number. I suspect that if a doc is updated while a long-running visualization is loading (paginating through hits), it could cause a problem with the sorting. Again, not sure, but worth keeping in mind and testing further.
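The suspected failure mode can be reproduced in a toy, in-memory model. This is only a sketch under simplifying assumptions: a single shard, every document sharing the same timestamp, and an update modelled as assigning the next free doc number. It does not call Elasticsearch, and the function and field names are invented for the example.

```python
from typing import Dict, List, Optional, Tuple


def make_docs() -> Dict[str, dict]:
    """Four docs with identical timestamps, so only the tiebreaker orders them."""
    return {name: {"id": name, "ts": 1, "doc": n} for n, name in enumerate("abcd")}


def paginate(
    docs: Dict[str, dict],
    tiebreaker: str,
    page_size: int,
    update_after_first_page: Optional[str] = None,
) -> List[str]:
    """Page through docs search_after-style, optionally updating one doc mid-way."""
    seen: List[str] = []
    after: Optional[Tuple] = None
    first = True
    while True:
        ordered = sorted(docs.values(), key=lambda d: (d["ts"], d[tiebreaker]))
        page = [
            d for d in ordered
            if after is None or (d["ts"], d[tiebreaker]) > after
        ][:page_size]
        if not page:
            break
        seen.extend(d["id"] for d in page)
        after = (page[-1]["ts"], page[-1][tiebreaker])
        if first and update_after_first_page is not None:
            # Toy model of an update: the doc keeps its id but gets a new doc number.
            docs[update_after_first_page]["doc"] = max(d["doc"] for d in docs.values()) + 1
        first = False
    return seen


with_doc = paginate(make_docs(), "doc", 2, update_after_first_page="a")
with_id = paginate(make_docs(), "id", 2, update_after_first_page="a")
print(with_doc)  # ['a', 'b', 'c', 'd', 'a']
print(with_id)   # ['a', 'b', 'c', 'd']
```

In this model the updated document reappears on a later page when the doc number is the tiebreaker, while a tiebreaker that never changes returns it only once.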
A new field called '_shard_doc' has been added to recent versions of Elasticsearch, which could potentially be used in the sort. However, it may have been added after version 7.10, which might be an issue if you want to remain compatible with OSS Elasticsearch and earlier versions. Perhaps making the sort tiebreaker default to '_id', but configurable, would be useful to allow people to set their own unique sort fields.
Thanks a lot for your tests. Have you tested "_shard_doc"? No memory problem? No duplicates or misses?
Hey, no problem. I didn't test _shard_doc (I'm running an earlier version of Elasticsearch); it seems it was added in version 7.12. I mentioned version 7.10 as that's the last common version before the OpenSearch fork. I'm not sure "_doc" will be 100% reliable for sorting in earlier versions (having seen duplicates in the limited testing I did), but I could be wrong. Perhaps making this sort field configurable (but defaulting to _id) would be a compromise: users could select a different unique field to ensure sort order without loading field data into the heap.
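A rough sketch of what such a configurable tiebreaker could look like follows. The option name `sortTiebreaker` and the field `event.uuid` are hypothetical; nothing here is taken from the plugin's actual settings.

```python
from typing import Any, Dict, List, Mapping

DEFAULT_TIEBREAKER = "_id"  # the default proposed in this thread


def build_sort(
    user_sort_field: str,
    user_sort_order: str,
    options: Mapping[str, Any],
) -> List[Dict[str, Any]]:
    """Return the sort array: the user's sort field, then a unique tiebreaker."""
    # "sortTiebreaker" is a hypothetical option name used only for this sketch.
    tiebreaker = options.get("sortTiebreaker") or DEFAULT_TIEBREAKER
    sort: List[Dict[str, Any]] = [{user_sort_field: {"order": user_sort_order}}]
    if tiebreaker != user_sort_field:
        # Skip the tiebreaker when the user already sorts on that same field.
        sort.append({tiebreaker: {"order": "asc"}})
    return sort


default_sort = build_sort("@timestamp", "desc", {})
custom_sort = build_sort("@timestamp", "desc", {"sortTiebreaker": "event.uuid"})
same_field_sort = build_sort("event.uuid", "asc", {"sortTiebreaker": "event.uuid"})
```

Defaulting to `_id` keeps the existing behaviour for anyone who does not set the option, while clusters with a unique keyword field can opt out of loading `_id` field data.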
Well, I don’t think that sorting by _id protects better against duplicates and misses, especially if the id is not incremental.
The only safe way to avoid duplicates or misses would be to use |
Hey @fbaligand, the _id field would be unique and remain unchanged, whereas the value of the _doc field can change under some conditions. For my own use case, there is a unique field that can be selected in the visualization as the sort field, and the combination of sorting on this unique field and _doc is proving effective: memory consumption is lighter and I'm not seeing duplicates. However, if using the default sort field of _score in conjunction with _doc, there might be some scope to miss or duplicate events in the paginated searches.
Well, a _score sort is rarely a good sort in Kibana, especially if you want to get a lot of hits or, worse, download all the data. Concerning _id, indeed, it will not change for a document. But between 2 search queries, you can have new ids that can generate misses.
Sounds good, thanks. Bear in mind that if the documents are frequently updated (the cluster I was testing can have several document updates shortly after creation), their _doc values will change.
Hi @gplechuck, I have good news! Enhanced Table plugin v1.13.3 has been released, using '_doc' for the sort instead of '_id':
Excellent, thanks!
Discussed in #271
Originally posted by gplechuck May 17, 2022
Hey @fbaligand,
Very nice plugin. I have been using the 'Document Table' visualization for some time and am very happy with it as a lightweight alternative to the 'Data Table' visualization. I have noticed a problem when querying large datasets, however!
It seems that when 'Max Hits' is set higher than 10000, a sort by the '_id' field is introduced into the resulting query. Sorting by '_id' is not recommended by Elasticsearch and can lead to a large amount of cached _id field data occupying the heap (I think use of _id field data will actually be disabled by default in future versions; see elastic/elasticsearch#64511). I recently ran a query over a couple of years' worth of our data, and overall heap usage spiked by 91GB, almost exclusively _id field data.
Use of the '_id' field does not appear to be configurable in the Document Table 'Query Parameters'.
I had a very quick look at the source files and can see where the _id sort is introduced -
Is this behaviour essential for anything, and would there be any adverse effects in removing that snippet of code from the plugin as a workaround, so we could continue to use the table for large datasets?
Cheers