Use APM metrics to introduce low-fi data layer for space reduction #104
@elastic/apm-ui thoughts?
@roncohen this is pretty exciting! When I was working on my own APM stuff w/ ES I was always struggling with storage vs resolution. One of the options I considered back then was to store everything as a metricset, with a resolution of 1:1. After n days, data would then be rolled up into increasingly lower resolutions. I could then always query the metricset instead of the raw documents. If we could do something similar, that would help a lot, but I'm not sure what it means for storage, agent support etc. If we have to support both transactions and metricsets and then merge them in Kibana, it's feasible, but hairy. What happens when you try to query rollup search with a percentiles agg? Will it error out or just show no data?
If you think about it in two layers, with the hi-fi one being optional, does that help? For example, for the transaction duration graph, the "avg" line comes from the low-fi layer and is based on the metricset documents. A separate query will calculate percentiles based on "transaction" documents. If the percentile queries return data, then we "just" add two lines to the transaction duration graph.
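A minimal sketch of what those two queries might look like, assuming typical APM field names (`transaction.duration.sum.us`, `transaction.duration.count`, `transaction.duration.us`); none of these are confirmed in this thread:

```ts
// Low-fi layer: average duration derived from transaction metricset docs.
// Field names are assumptions based on typical APM mappings.
const lowFiQuery = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { 'processor.event': 'metric' } },
        { term: { 'metricset.name': 'transaction' } },
      ],
    },
  },
  aggs: {
    timeseries: {
      date_histogram: { field: '@timestamp', fixed_interval: '1m' },
      aggs: {
        // avg per bucket = durationSum / transactionCount
        durationSum: { sum: { field: 'transaction.duration.sum.us' } },
        transactionCount: { sum: { field: 'transaction.duration.count' } },
      },
    },
  },
};

// Hi-fi layer: percentiles computed from the raw transaction documents.
const hiFiQuery = {
  size: 0,
  query: {
    bool: { filter: [{ term: { 'processor.event': 'transaction' } }] },
  },
  aggs: {
    timeseries: {
      date_histogram: { field: '@timestamp', fixed_interval: '1m' },
      aggs: {
        durationPercentiles: {
          percentiles: { field: 'transaction.duration.us', percents: [95, 99] },
        },
      },
    },
  },
};
```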
Sounds like a great plan for being able to support different data resolutions. Got a few questions: …
Great questions!
It would probably make sense to change them to be based on metricsets eventually, because the plan is to stop sending up unsampled transactions some day. But in the meantime it shouldn't matter; the numbers should be the same. I think we'd consider the ML data part of the low-fi layer.
As a start, it's probably simplest to wait for both to return before drawing the graphs. If it's not a big difference in complexity, it would be nice to show data as soon as we have something and then add to it when the other query arrives.
It's an interesting idea, but I don't think we should do that for now.
@roncohen thanks for clarifying, makes sense |
We've been talking about this for a while - thanks for finally getting the ball rolling @roncohen! One aspect I didn't see mentioned is the query bar. Currently it is used to filter the UI via ES filters applied to transaction and error documents. Metric docs won't have these dimensions and will therefore render the query bar useless. |
That's a good point. I have two ideas for what we could do: …
For all of those except …
I'm a bit worried about a cardinality increase when including the …
@felixbarny agreed that we need to be vigilant about cardinality increase |
This has been shipped (see …).
We should use the transaction timing data from metricsets to introduce two layers of data fidelity in the APM UI.
We'd have a low-fi layer and a hi-fi layer.
Motivation
Today, most graphs in the APM UI query transaction documents. This works because we're sending up all transactions, even unsampled ones.
As part of #78 we also started sending up transaction timing data as a metricset. Some of the data shown in the APM UI can be calculated using this new timing data instead of the transaction documents.
This would allow users to get rid of the transaction documents early, say after 7 days, but still be able to derive value from the APM UI beyond this timeframe. Setting a separate ILM policy for transactions is already supported through a bit of manual work.
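As a rough illustration of that manual work, here's a sketch using the Elasticsearch JS client; the policy name, rollover settings, and index pattern are all hypothetical and would need to match the actual APM index naming:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function applyTransactionRetention(): Promise<void> {
  // A policy that rolls indices over daily and deletes them after 7 days.
  // Policy name and settings are illustrative only.
  await client.ilm.putLifecycle({
    policy: 'apm-transactions-7d',
    body: {
      policy: {
        phases: {
          hot: { actions: { rollover: { max_age: '1d' } } },
          delete: { min_age: '7d', actions: { delete: {} } },
        },
      },
    },
  });

  // Attach the policy to the transaction indices via a higher-priority
  // template. The index pattern is a guess at the APM naming scheme.
  await client.indices.putTemplate({
    name: 'apm-transaction-ilm-override',
    body: {
      index_patterns: ['apm-*-transaction-*'],
      order: 100,
      settings: { 'index.lifecycle.name': 'apm-transactions-7d' },
    },
  });
}
```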
User experience on low-fi data
The idea would be that the low-fi layer is calculated from the metrics data, while all the data that requires the (unsampled) transactions will be part of the hi-fi layer.
From the new metricset, we can show: …
We'd be unable to show the transaction distribution chart or any samples.
If agents eventually support histograms as a metric, we could encode the transaction duration as a histogram and show the transaction distribution even with only the low-fi data. This shouldn't be a blocker at the moment.
Querying
To keep things simple, the APM UI could always use the new metrics data to draw whatever it can. We'd then fire off separate queries for the hi-fi data (percentiles, distribution chart, actual transaction samples etc.). If hi-fi data is available for the given time range, the percentile lines appear on the graphs etc. If not, we only show the low-fi data.
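A sketch of that querying strategy; all function names here are hypothetical placeholders, not actual APM UI code:

```ts
// Hypothetical placeholders standing in for the real APM UI data layer.
interface TimeRange { start: string; end: string }
interface SeriesResult { hasData: boolean; points: Array<{ x: number; y: number }> }

declare function fetchLowFi(range: TimeRange): Promise<SeriesResult>;  // metricset-based avg
declare function fetchHiFi(range: TimeRange): Promise<SeriesResult>;   // transaction-based percentiles
declare function renderAvgLine(series: SeriesResult): void;
declare function renderPercentileLines(series: SeriesResult): void;

async function loadTransactionDurationChart(range: TimeRange): Promise<void> {
  // Fire both queries up front...
  const lowFi = fetchLowFi(range);
  const hiFi = fetchHiFi(range);

  // ...draw the low-fi layer as soon as it arrives...
  renderAvgLine(await lowFi);

  // ...and overlay the percentile lines only if hi-fi data exists
  // for the selected range.
  const percentiles = await hiFi;
  if (percentiles.hasData) {
    renderPercentileLines(percentiles);
  }
}
```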
That means that if you pick a time range that has both low-fi and hi-fi data for the full range, you'll see exactly what you see today.
If you go back far enough in time, only low-fi data is available, and you won't see percentiles, the distribution chart, etc.
If you select a time range where hi-fi data covers only part of the range, the percentile lines might start in the middle of a graph. The distribution chart is a particular complication: unlike the graphs, the visualization itself doesn't make it obvious that the data is partial. Users will be able to deduce that fact by looking at the other graphs on the same page.
We could try to detect that the data is partial and show a note. Detection could happen by comparing the number of transaction documents in the range with the transaction count reported by the metricsets. Probably not a blocker for the first version.
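A sketch of that detection idea, assuming hypothetical helpers that count raw transaction docs and sum `transaction.duration.count` from the metricsets:

```ts
// Hypothetical helpers: count raw transaction documents in the range, and
// sum transaction.duration.count across the metricsets for the same range.
interface TimeRange { start: string; end: string }

declare function countTransactionDocs(range: TimeRange): Promise<number>;
declare function sumMetricsetTransactionCount(range: TimeRange): Promise<number>;

// If the metricsets report many more transactions than we have documents
// for, part of the range has probably lost its hi-fi layer.
async function hiFiDataIsPartial(range: TimeRange): Promise<boolean> {
  const [docCount, metricCount] = await Promise.all([
    countTransactionDocs(range),
    sumMetricsetTransactionCount(range),
  ]);
  // The 5% slack for in-flight data is an arbitrary assumption.
  return docCount < metricCount * 0.95;
}
```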
Transaction group list
The transaction group list presents a special problem here, as it would require us to merge the low-fi and hi-fi data in the list. I don't think the merge can be done in Elasticsearch.
Due to pagination etc., we'd need to ensure that the low-fi and hi-fi queries return data for the same transaction groups, and then merge them in Kibana. We could potentially do it by sorting both lists by avg. transaction time, calculated on the metricset and transaction data respectively, and then doing the merge in Kibana. I have more thoughts on this, but we should probably do a POC to investigate the feasibility.
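For illustration, a sketch of what the merge in Kibana could look like, keyed on transaction group name; the result shapes are assumptions:

```ts
// Assumed result shapes for the two per-group queries.
interface LowFiGroup { name: string; avgDurationUs: number; transactionCount: number }
interface HiFiGroup { name: string; p95DurationUs: number }

// Merge in Kibana: keep the low-fi list as the source of truth and attach
// percentile data for the groups that still have hi-fi documents.
function mergeTransactionGroups(lowFi: LowFiGroup[], hiFi: HiFiGroup[]) {
  const hiFiByName = new Map(hiFi.map((g) => [g.name, g] as const));
  return lowFi.map((group) => ({
    ...group,
    // Undefined where the hi-fi layer has already been deleted.
    p95DurationUs: hiFiByName.get(group.name)?.p95DurationUs,
  }));
}
```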
Rollups
Introducing the low-fi layer as described above allows users to delete transaction data and still see low-fi data. I expect that will be a significant storage reduction for users who want to keep hi-fi data for, say, one week, and low-fi data for two months. Some users will want to keep low-fi data for much longer. For those users, applying rollups to the low-fi data to decrease time granularity will allow them to further reduce storage costs. Supporting rollups isn't something we'd need to do in the first phase.
The rollup feature includes functionality to transparently rewrite queries to search regular documents and rolled-up data at the same time, so the queries for the low-fi data should mostly just work on rolled-up data. There are some improvements to rollups coming which we should probably wait for before spending more time investigating: elastic/elasticsearch#42720
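For reference, a sketch of querying the low-fi data through rollup search with the JS client; the index name is hypothetical, and the aggregation has to line up with however the rollup job was configured:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// The index name is a placeholder; the date_histogram interval must match
// (or be a multiple of) the interval the rollup job was created with.
async function queryLowFiWithRollups() {
  return client.rollup.rollupSearch({
    index: 'apm-metrics-rollup',
    body: {
      size: 0,
      aggs: {
        timeseries: {
          date_histogram: { field: '@timestamp', fixed_interval: '1h' },
          aggs: {
            durationSum: { sum: { field: 'transaction.duration.sum.us' } },
          },
        },
      },
    },
  });
}
```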
Future
When elastic/elasticsearch#33214 arrives, agents could start sending up transaction duration histograms, and we'd be able to move the percentiles and the distribution chart into the low-fi layer. We'd also be able to stop sending up unsampled transactions. The hi-fi layer would then only contain actual transaction samples.
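A sketch of what the low-fi percentiles query could look like once durations are stored as a histogram field; `transaction.duration.histogram` is an assumed field name:

```ts
// Assumed future field: transaction durations pre-aggregated into an
// Elasticsearch histogram field by the agents.
const futureLowFiPercentilesQuery = {
  size: 0,
  aggs: {
    timeseries: {
      date_histogram: { field: '@timestamp', fixed_interval: '1m' },
      aggs: {
        // Percentiles computed directly from the pre-aggregated histogram,
        // so no raw transaction documents are needed.
        durationPercentiles: {
          percentiles: {
            field: 'transaction.duration.histogram',
            percents: [95, 99],
          },
        },
      },
    },
  },
};
```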