Service doesn't combine metrics #542
Comments
What you described is an alternative approach we considered previously when designing Agent: have each process (instrumented by OpenCensus libraries) send raw measurements to Agent, and have Agent do the aggregation. We ended up not going with this approach because of the communication cost it implied. Instead, we expect the metrics sent from each process to be already aggregated, i.e., we expect the aggregation of metrics to be done within the process, and Agent will just pass the metrics as-is to APM backends.
Actually there is a way to avoid this. The problem I see here is that Agent doesn't distinguish the metrics from each process, so time series are overwritten incorrectly. @rghetia also mentioned this issue. A little more context: Stackdriver doesn't allow you to write time series to the same metric concurrently if the time series have the same metric labels and resource. Therefore we added a supplementary label called "opencensus_task", which distinguishes the time series from each process (see https://github.com/census-instrumentation/opencensus-java/tree/master/exporters/stats/stackdriver#what-is-opencensus_task-metric-label-). However, this label is only added when the Stackdriver exporter is running in the same process as the main application. When you switch to the ocagent exporter, this information is dropped. To illustrate this more, consider we have a metric ( What should have happened is: What happened currently is:
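A rough sketch of the overwrite problem described above, for readers who want something concrete. This is not Agent or Stackdriver code: a plain dict stands in for a backend that keeps one value per (metric, labels, resource) key, and the metric name, resource name, and "opencensus_task" values are all made up for illustration.

```python
# Illustrative only: a dict stands in for a backend that keeps one value per
# (metric, labels, resource) key. None of these names come from a real API.

def series_key(metric, labels, resource):
    """Build the identity of a time series from its metric, labels and resource."""
    return (metric, tuple(sorted(labels.items())), resource)

def write_all(points):
    """Write points in order; a later point with the same key replaces the earlier one."""
    store = {}
    for metric, labels, resource, value in points:
        store[series_key(metric, labels, resource)] = value
    return store

RESOURCE = "gce_instance/example-vm"  # hypothetical resource name

# Three processes report the same metric with identical labels.
without_task = [
    ("request_count", {"method": "GET"}, RESOURCE, 1),
    ("request_count", {"method": "GET"}, RESOURCE, 1),
    ("request_count", {"method": "GET"}, RESOURCE, 1),
]

# The same points, each tagged with a per-process label value (made up here).
with_task = [
    (m, {**labels, "opencensus_task": f"proc-{i}@example-vm"}, r, v)
    for i, (m, labels, r, v) in enumerate(without_task)
]

collapsed = write_all(without_task)  # one key: the writes overwrite each other
distinct = write_all(with_task)      # three keys: one time series per process

print(len(collapsed), sum(collapsed.values()))
print(len(distinct), sum(distinct.values()))
```

Without the per-process label, the three writes collide on a single key and only one value survives. With it, each process keeps its own time series, and the total can be recovered by summing across the label at query time.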
In conclusion,
Yes this is the expected behavior with our current design.
This is a bug we need to fix, as I described in my comment above.
Not really. These counts should be treated as unique time series and they should not be combined with our current design.
This can also result in an error, because it is essentially updating the same time series less than 60 seconds apart.
Context: We have dozens of processes running on the same machine, all pushing the same metric type to the opencensus receiver to export to stackdriver. Judging from the README.md, this is a supported use case of opencensus service.
Ex. If we have 10 processes pushing the same count view with count = 1, my expectation was that the service should collapse that into one call to stackdriver with count = 10. Instead, the service looks to be just pushing up 10 updates directly to stackdriver. Thus the value in stackdriver is 10 times less than it should be for the given resource (one GCE instance). The source code also lends itself to this conclusion.
Is this analysis accurate? Assuming so, to handle fan-in (as the diagram in the README.md illustrates), I think it needs to sum counts as I've described above. I.e. to get the same behavior as we would if we had 10 machines (one process per machine) push directly to stackdriver, I don't see a way to avoid this?
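To make the expected fan-in concrete, here is a minimal sketch of the summing described above. It only illustrates the expectation in this report and is not how the service behaves or is implemented; `sum_counts`, the metric name, and the resource name are hypothetical.

```python
from collections import defaultdict

def sum_counts(updates):
    """Sum count updates that share the same (metric, resource) key."""
    totals = defaultdict(int)
    for key, count in updates:
        totals[key] += count
    return dict(totals)

# Hypothetical key: one count metric on one GCE instance.
key = ("request_count", "gce_instance/example-vm")

# Ten processes on the same machine each push count = 1.
updates = [(key, 1) for _ in range(10)]

combined = sum_counts(updates)
print(combined[key])
```

With this kind of aggregation, the ten updates would become a single write of 10 for the resource, rather than ten separate writes of 1 to the same time series.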