-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate multi data point TimeSeries #54
Comments
@jmacd @bogdandrutu I need your opinion please. |
We are working on a processor that does this, in order to reduce cardinality for the labels (some users need this). I would suggest that we just simply drop the repeated for the Points initially and keep only one Point. This way we still have the chance in the future to add the capability to support multiple points in a backwards compatible way. |
OK. So you are suggesting to do this: message Int64TimeSeries {
repeated string label_values = 1;
Int64Value point = 2;
} |
I think that will keep the possibility in the future to support multiple points if needed. Still removing one allocation. |
It is a smaller CPU saving (2% encoding, 10% decoding) but we can do it. |
I might suggest that we take both options. Use the format Tigran suggests for clients to report a single data point with a relatively short period, where the timestamp is implied by the enclosing message. This will be fast and cheap for the clients to use. Then, support a second format for multi-valued timeseries with timestamped values. This will be useful in situations where longer periods are being aggregated. This will enable flexibility. I'm not sure how this second format will be applied. |
@jmacd I understand what you are saying but am a bit reluctant to add a second parallel representation of the data. That complicates the schema and implementations (especially receivers who need to be able to process both formats). |
@akehlenbeck I'd like your input. |
I've never seen much demand at all for representing multiple points from the same time series in one write request. The main use is I'm aware of actually has nothing to do with realtime collection, but rather with backfilling. Eventually, someone will want to rename or otherwise transform existing historical data. The natural way to do that is with some kind of extract-transform-load job, and in the "load" step you end up wanting to write a lot of points from a single time series in one big batch. But: I would also imagine that kind of operation might be performed directly against a SaaS-specific API, rather than going through the same collectors that are used for realtime collection. So it sort of seems like it's out of scope for this discussion. My inclination would be to go with the format suggested by Tigran above. I don't see compelling benefit from the intermediate approach of keeping the If for some reason in the future someone comes along with a compelling reason for needing multiple points per time series and for very high performance, then revisit this question then with a separate API. (But I would be surprised if such a use case ever arose.) I suspect in that situation you'll want an entirely different representation anyways, for example
So basically: do what Josh said about having a second format, but don't bothering doing it until it's actually needed (if ever). |
Great. I support Tigran's proposal. |
I put a bit more thought into this and given that Bogdan thinks we may need to support multiple data points (and batching) in the future I do not want to pursue this anymore. It will be difficult (or impossible) to add support for multiple data points in a way that does not break existing code that deals with data points in the Collector. Yes, on the wire level |
What's the motivation for having multiple points from the same time series in one batch? I didn't really understand Bogdan's comment from higher up:
How does cardinality reduction translate into multiple points? |
I don't know what exactly Bogdan's use case is but purely technically it is possible to eliminate one (or more) of the labels that you don't care about and combine all data points for all values of that label into a single array (a Timeseries message). Obviously this loses data, not sure if this is really something desirable ever. |
I believe the use-case being imagined is one where we combine say a number of gauge values taken across hosts. In the aggregation, the host label is removed leaving us with many gauge values, still with independent timestamps, that should be preserved as multiple points. I still endorse a plan of supporting both options. There will be a relatively simple adapter in code to visit the single-value timeseries as a length-1 multi-value timeseries. |
Have just chatted with @bogdandrutu and we came to conclusion that there is no good use case for having multiple data points. All possible ways to combine data points that we imagined are either very expensive or make no semantic sense. I am going to submit a PR with a single Int64Value point. The intent is to keep Int64Value (and other value types) as a separate message for usability. If in the future we discover a use case for multiple data points we have 2 possible paths to extend the proto:
There are probably other ways as well. |
Resolves open-telemetry#54 After a discussion we came to conclusion that there is no good use case for having multiple data points. All possible ways to combine data points that we imagined are either very expensive or make no semantic sense. This commit eliminates timeseries that can have multiple data points and replaces them with single data points. If in the future we discover a use case for multiple data points we have 2 possible paths to extend the proto: - Add a corresponding points repeated field in the Int64DataPoint. - Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum. There are probably other ways as well.
Resolves open-telemetry#54 After a discussion we came to conclusion that there is no good use case for having multiple data points. All possible ways to combine data points that we imagined are either very expensive or make no semantic sense. This commit eliminates timeseries that can have multiple data points and replaces them with single data points. If in the future we discover a use case for multiple data points we have 2 possible paths to extend the proto: - Add a corresponding points repeated field in the Int64DataPoint. - Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum. There are probably other ways as well.
Resolves open-telemetry#54 After a discussion we came to conclusion that there is no good use case for having multiple data points. All possible ways to combine data points that we imagined are either very expensive or make no semantic sense. This commit eliminates timeseries that can have multiple data points and replaces them with single data points. If in the future we discover a use case for multiple data points we have 2 possible paths to extend the proto: - Add a corresponding points repeated field in the Int64DataPoint. - Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum. There are probably other ways as well.
Resolves open-telemetry#54 After a discussion we came to conclusion that there is no good use case for having multiple data points. All possible ways to combine data points that we imagined are either very expensive or make no semantic sense. This commit eliminates timeseries that can have multiple data points and replaces them with single data points. If in the future we discover a use case for multiple data points we have 2 possible paths to extend the proto: - Add a corresponding points repeated field in the Int64DataPoint. - Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum. There are probably other ways as well.
Resolves #54 After a discussion we came to conclusion that there is no good use case for having multiple data points. All possible ways to combine data points that we imagined are either very expensive or make no semantic sense. This commit eliminates timeseries that can have multiple data points and replaces them with single data points. If in the future we discover a use case for multiple data points we have 2 possible paths to extend the proto: - Add a corresponding points repeated field in the Int64DataPoint. - Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum. There are probably other ways as well.
Metric proto currently allows to represent multiple data points for each metric, i.e. a timeseries with several values of the same metric, one value per timestamp.
However, reading the metric SDK proposal tells me that normally each collection will emit only a single data point per metric, the last aggregate state of the metric. There may be custom exporters which accumulate data points and emit them all in one batch but it seems to be a possible exceptional case and not a norm. Typically an exporter will simply send the last aggregate value.
Another use case where we could possibly have multiple data points per metric is in Collector, where we could in theory batch points from the same metric into one timeseries. However, in practice I do not think we will do it since it is a very costly operation that requires maintaining maps of metrics.
Given that a single data point per metric is the most common case I suggest that we optimize the proto for that case. Currently the proto definition is the following (only int64 is shown for simplicity):
I propose to change this to the following:
Preliminary benchmarking shows that for the case when we have 1 int64 data point per metric this change reduces CPU and RAM usage by 25% for encoding and decoding.
Note: this schema still allows representing multiple data points per metric if neccessary, however at the cost of repeating label values.
The text was updated successfully, but these errors were encountered: