Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lens] Allow time shift with terms #103063

Closed
flash1293 opened this issue Jun 23, 2021 · 4 comments
Closed

[Lens] Allow time shift with terms #103063

flash1293 opened this issue Jun 23, 2021 · 4 comments
Labels
enhancement New value added to drive a business result Feature:Lens 🧊 iceboxed impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:needs-research This issue requires some research before it can be worked on or estimated Team:Visualizations Visualization editors, elastic-charts and infrastructure triage_needed

Comments

@flash1293
Copy link
Contributor

Right now terms can’t be used along with time shifts because we can’t guarantee data will be returned from all shards. The problem is the following:

When a terms agg with a sort metric is used, it’s not actually sorting all terms globally. Instead, each shard sorts the local terms locally and only sends the top shard_size terms to the coordinating node handling the request from Kibana. Here, the results from all shards are merged into a single consistent list of length size. By default the shard size parameter is a little higher than the size parameter to account for differences in local distribution. However, it can’t guarantee that the final response will be correct because it’s always possible a shard didn’t send some terms to the coordinating node that would have made it into the final list. This is the case today for regular terms aggregations on high cardinality fields with multiple shards and local differences in distribution of these values.

However for time shift, this problem becomes much more obvious in some cases:
Let’s say the user configures the following chart:
Top 3 values of “myField” ordered by count of records
Count of records
Count of records shifted by 1 week

Data for a week ago is in a different index and thus a different shard than data from today.
Both shards (a week ago and today) will locally build a list of 3 myField values and send them to the coordinating node. The coordinating node will order them by the count of records of todays data and send them to kibana.

This is how data in todays shard looks like - it will send X, Y and Z:
A 1
B 2
C 3
X 99
Y 98
Z 97

This is how data in the shard from a week ago looks like - it will send A, B and C:
A 99
B 98
C 97
X 3
Y 2
Z 1

The coordinating node tries to put together a result and sends the following:

myField Count of records Count of records - 1 week
X 99 -
Y 98 -
Z 97 -

The final table does not include any data for X, Y and Z - it was there, but the shard thought it wouldn’t get relevant because it had small local values.
If a user knows for sure there should be data for X, Y and Z from a week ago, they will be very confused in this case (besides misleading information that can impact business decisions).
The error will not always look this straightforward - it’s possible some of the terms make it from the previous shards while some don’t, or some terms make it on some shards, but not on all of them so the cell is filled, but the number isn’t actually correct. For sum and count of records the number can only be too small, but for average/median/min/max the number can be completely wrong as well.

Mitigating the case

By default the shard_size parameter is set to 1.5 * size + 10. We can bump up this number in case a time shift is used - maybe something like min(1000, 1.5 * size + 10). This will decrease the chance of this case happening without fully removing it - at the expense of lower performance and higher memory usage in the Elasticsearch cluster (elastic/elasticsearch#72684) We can work with the Elasticsearch team to do this in a safe manner. It depends a lot on the data the user is working with how much this will help.

Warning about the case

Elasticsearch tells us when it’s possible this is happening right now. The doc_count_error_upper_bound in the response will be 0 if it’s not possible there is missing data. For any other value there’s a chance some data is missing, but we can’t say for sure. This is a runtime error - we can’t say for sure whether it will happen and it can depend on the time range / filter / incoming data. When the case is happening, we can tell the user to either ignore it (because they know it’s not relevant to them), or do the following things:
Narrow down search scope - if there are less documents, the chance for a complete terms list is higher
Increase “size” of top values - more data transferred, but higher chance of getting all of it
I can imagine this as a warning label of some kind overlaying the chart.

As this is an issue in general (prominent for time shifts but can happen in any configuration), we should warn about it in general. See #94918

The warning could include the existing “fix action” to make terms static by turning them into filters - I still think that’s a helpful feature in general

Example bad case

A use case specific example where this problem could show up: A security analyst is looking at suspicious user activity. The build a chart: top 20 values of ips ordered by number of times the “login” endpoint has been accessed (filtered count of records metric). They wonder whether these ips have been active yesterday as well. They duplicate the count of records metric and shift it by one day. The second column doesn’t show any activity for these ips yesterday. The analyst concludes that the attack began today. However, they were actually probing the login endpoint yesterday as well, but only a small number of times. Because of this, the shard holding yesterdays data won’t return these IPs when queried for the top ips ordered by number of login attempts.

Implementation steps

  • Make sure terms agg will sort by the time range of the selected metric first, then by the unshifted time range (two sorting metrics) to get the best possible result (in case the distribution is similar in both shards) - otherwise the sorting of the shard not containing data for the shifted time range will fall back to alphabetical
    orderAgg = aggs.createAggConfig(
  • In this scenario, also set the shard_size parameter as explained above
  • Implement [Lens] Expose Elasticsearch accuracy warnings to the user #94918 in a general case (including a "fix it" action in the warning to turn top values into filters
  • Remove the error in Lens and allow top values together with multiple time shifts
  • Add a validation in Lens to make sure top values is never nested inside a date histogram (this would still be an error case) with a "fix it" action to swap date histogram and top values
@flash1293 flash1293 added enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure Feature:Lens labels Jun 23, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-app (Team:KibanaApp)

@drewdaemon drewdaemon added the impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. label Feb 9, 2023
@stratoula stratoula added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. and removed impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. labels Jun 12, 2023
@markov00 markov00 added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jun 4, 2024
@markov00
Copy link
Member

@stratoula can this be fixed by using ESQL instead?

@stratoula
Copy link
Contributor

We are returning all the terms when grouped by so I think yes but it will be better to test it or ask ES as I am not 100% sure.

@dej611
Copy link
Contributor

dej611 commented Jul 24, 2024

In order to provide better transparency of priorities, issues that will not be prioritized within the next 24 months are being closed.

@dej611 dej611 closed this as not planned Won't fix, can't repro, duplicate, stale Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Lens 🧊 iceboxed impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:needs-research This issue requires some research before it can be worked on or estimated Team:Visualizations Visualization editors, elastic-charts and infrastructure triage_needed
Projects
None yet
Development

No branches or pull requests

6 participants