[Lens] Allow time shift with terms #103063
Labels: enhancement, Feature:Lens, 🧊 iceboxed, impact:low, loe:needs-research, Team:Visualizations, triage_needed
Right now terms can’t be used along with time shifts because we can’t guarantee data will be returned from all shards. The problem is the following:
When a terms agg with a sort metric is used, it doesn't actually sort all terms globally. Instead, each shard sorts its local terms and only sends its top shard_size terms to the coordinating node handling the request from Kibana. There, the results from all shards are merged into a single list of length size. By default the shard_size parameter is a little higher than the size parameter to account for differences in local distribution, but it can't guarantee that the final response is correct: it's always possible that a shard didn't send some terms to the coordinating node that would have made it into the final list. This already happens today for regular terms aggregations on high-cardinality fields with multiple shards and local differences in the distribution of values.
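To make this concrete, this is roughly the kind of request involved - a terms aggregation ordered by a metric with an explicit size (the field name and the sizes are illustrative, not the exact request Lens builds):

```typescript
// Terms aggregation ordered by document count. Each shard only forwards its
// local top `shard_size` terms (default roughly 1.5 * size + 10) to the
// coordinating node, which then merges them into the final `size`-long list.
const searchBody = {
  size: 0,
  aggs: {
    top_values: {
      terms: {
        field: 'myField', // illustrative field name
        size: 3,          // length of the final merged list
        order: { _count: 'desc' },
      },
    },
  },
};
```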
However, for time shifts this problem becomes much more obvious in some cases:
Let’s say the user configures the following chart:
Top 3 values of “myField” ordered by count of records
Count of records
Count of records shifted by 1 week
Data for a week ago is in a different index and thus a different shard than data from today.
Both shards (a week ago and today) will locally build a list of 3 myField values and send them to the coordinating node. The coordinating node will order them by the count of records of today's data and send them to Kibana.
This is how the data in today's shard looks - it will send X, Y and Z:
| myField | Count of records |
| --- | --- |
| A | 1 |
| B | 2 |
| C | 3 |
| X | 99 |
| Y | 98 |
| Z | 97 |
This is how the data in the shard from a week ago looks - it will send A, B and C:
| myField | Count of records |
| --- | --- |
| A | 99 |
| B | 98 |
| C | 97 |
| X | 3 |
| Y | 2 |
| Z | 1 |
The coordinating node tries to put together a result and sends the following:
The final table does not include any data from a week ago for X, Y and Z - the data was there, but the shard holding it didn't send those terms because their local values were small and it assumed they wouldn't be relevant.
If a user knows for sure there should be data for X, Y and Z from a week ago, this is very confusing - and the misleading numbers can impact business decisions.
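A minimal sketch that replays the example above outside of Elasticsearch (a simplified model of the shard/coordinator split, not how Elasticsearch is actually implemented):

```typescript
// Each shard only forwards its local top `shardSize` terms, so the
// coordinating node never sees last week's counts for X, Y and Z.
type ShardCounts = Record<string, number>;

const today: ShardCounts = { A: 1, B: 2, C: 3, X: 99, Y: 98, Z: 97 };
const lastWeek: ShardCounts = { A: 99, B: 98, C: 97, X: 3, Y: 2, Z: 1 };

// What a single shard sends: its local top-N terms by its local count.
const localTop = (counts: ShardCounts, shardSize: number): ShardCounts =>
  Object.fromEntries(
    Object.entries(counts)
      .sort(([, a], [, b]) => b - a)
      .slice(0, shardSize)
  );

const shardSize = 3;
const fromToday = localTop(today, shardSize);       // { X: 99, Y: 98, Z: 97 }
const fromLastWeek = localTop(lastWeek, shardSize); // { A: 99, B: 98, C: 97 }

// Coordinating node: final top 3 ordered by today's count.
const finalTerms = Object.keys(fromToday).slice(0, 3); // X, Y, Z

for (const term of finalTerms) {
  // The shifted column stays empty even though last week's data existed.
  console.log(term, fromToday[term], fromLastWeek[term] ?? '(missing)');
}
```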
The error will not always look this straightforward - it's possible that some terms make it from the shards holding the shifted time range while others don't, or that a term makes it from some shards but not all of them, so the cell is filled but the number isn't actually correct. For sum and count of records the number can only be too small, but for average/median/min/max the number can be completely wrong in either direction.
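A tiny worked example of why the direction of the error differs per metric (the numbers are made up):

```typescript
// Why a filled-but-wrong cell can only be too small for sum/count,
// but can be off in either direction for average.
const reported = [10, 2]; // values from shards that did send this term
const dropped = 50;       // value from a shard that dropped the term locally

const partialSum = reported.reduce((a, b) => a + b, 0); // 12
const trueSum = partialSum + dropped;                   // 62 - the partial sum is always an undercount
const partialAvg = partialSum / reported.length;        // 6, true average is ~20.7 - too small here

// But if the dropped value were 1 instead, the true average would be ~4.3,
// so the partial average (6) would be too *large*: the direction of the
// error is unpredictable for average/median/min/max.
```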
Mitigating the case
By default the shard_size parameter is set to 1.5 * size + 10. We can bump up this number when a time shift is used - maybe something like min(1000, 1.5 * size + 10). This decreases the chance of this case happening without fully removing it, at the expense of lower performance and higher memory usage in the Elasticsearch cluster (elastic/elasticsearch#72684). We can work with the Elasticsearch team to do this in a safe manner. How much this helps depends a lot on the data the user is working with.
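A sketch of what the bump could look like. The exact formula is still open - only the idea of capping around 1000 comes from the proposal above, and the 10 * size factor is purely illustrative:

```typescript
// Sketch only: choose a larger shard_size when a time shift is present.
const defaultShardSize = (size: number) => Math.floor(1.5 * size) + 10;

const shardSizeFor = (size: number, hasTimeShift: boolean) =>
  hasTimeShift
    ? Math.min(1000, Math.max(defaultShardSize(size), 10 * size)) // bumped, capped at 1000
    : defaultShardSize(size);

// shardSizeFor(3, false) === 14, shardSizeFor(3, true) === 30
```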
Warning about the case
Elasticsearch already tells us when it's possible this is happening: the doc_count_error_upper_bound in the response is 0 if it's not possible that data is missing. For any other value there's a chance some data is missing, but we can't say for sure. This is a runtime condition - we can't tell in advance whether it will happen, and it can depend on the time range, filters and incoming data. When the case is happening, we can tell the user to either ignore it (because they know it's not relevant to them) or do one of the following:
Narrow down the search scope - if there are fewer documents, the chance of a complete terms list is higher
Increase “size” of top values - more data transferred, but higher chance of getting all of it
I can imagine this as a warning label of some kind overlaying the chart (a rough sketch of the detection logic follows below).
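A sketch of how such a check could look based on the raw terms aggregation response (the response shape is simplified and the function name is made up):

```typescript
// Simplified shape of a terms aggregation in the Elasticsearch response.
interface TermsAggResponse {
  doc_count_error_upper_bound: number;
  sum_other_doc_count: number;
  buckets: Array<{ key: string; doc_count: number }>;
}

function termsMayBeIncomplete(agg: TermsAggResponse): boolean {
  // 0 means Elasticsearch can rule out missing terms; any other value means
  // individual shards might have dropped terms locally.
  return agg.doc_count_error_upper_bound > 0;
}

// If termsMayBeIncomplete(...) is true and a time shift is configured,
// show the warning overlay suggesting to narrow the search or increase "size".
```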
As this is a general issue (most prominent for time shifts, but it can happen in any configuration), we should warn about it in general. See #94918
The warning could include the existing “fix action” to make terms static by turning them into filters - I still think that’s a helpful feature in general
Example bad case
A use-case-specific example where this problem could show up: a security analyst is looking at suspicious user activity. They build a chart: top 20 values of ip ordered by the number of times the “login” endpoint has been accessed (a filtered count of records metric). They wonder whether these IPs were active yesterday as well, so they duplicate the count of records metric and shift it by one day. The second column doesn't show any activity for these IPs yesterday, and the analyst concludes that the attack began today. However, the attacker was actually probing the login endpoint yesterday as well, just a small number of times - so the shard holding yesterday's data didn't return these IPs when asked for the top IPs ordered by login attempts.
Implementation steps
Bump the shard_size parameter in kibana/src/plugins/data/common/search/aggs/buckets/terms.ts (line 189 in 8cb3dbc) as explained above.
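A very rough sketch of the direction this could take - not the actual structure of terms.ts, and hasTimeShift as well as the chosen formula are assumptions for illustration:

```typescript
// Hypothetical sketch: when writing the terms agg DSL, set shard_size
// explicitly if any metric in the request carries a time shift.
function buildTermsDsl(params: { field: string; size: number; hasTimeShift: boolean }) {
  const dsl: Record<string, unknown> = {
    field: params.field,
    size: params.size,
  };
  if (params.hasTimeShift) {
    // see "Mitigating the case" above - the exact value is still to be determined
    dsl.shard_size = Math.min(1000, Math.max(Math.floor(1.5 * params.size) + 10, 10 * params.size));
  }
  return dsl;
}
```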