
Service doesn't combine metrics #542

Closed
prateekr opened this issue May 3, 2019 · 3 comments · Fixed by census-ecosystem/opencensus-go-exporter-stackdriver#148

prateekr commented May 3, 2019

Context: We have dozens of processes running on the same machine, all pushing the same metric type to the OpenCensus receiver for export to Stackdriver. Judging from the README.md, this is a supported use case of the OpenCensus Service.

For example, if we have 10 processes each pushing the same count view with count = 1, my expectation was that the service would collapse those into one call to Stackdriver with count = 10. Instead, the service appears to just push 10 updates directly to Stackdriver, so the value in Stackdriver is 10 times lower than it should be for the given resource (one GCE instance). The source code also lends itself to this conclusion.

Is this analysis accurate? Assuming it is, I think the service needs to sum counts as described above to handle fan-in (as the diagram in the README.md illustrates). That is, to get the same behavior as if we had 10 machines (one process per machine) pushing directly to Stackdriver, I don't see a way to avoid summing.
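For concreteness, here is a minimal sketch of the kind of per-process setup described above, using the standard OpenCensus Go libraries and the ocagent exporter. The measure/view name, service name, and agent address are placeholders, not our actual configuration:

```go
package main

import (
	"context"
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Illustrative measure name; the real metrics in our setup differ.
var processedCount = stats.Int64("job/processed_count", "Items processed", stats.UnitDimensionless)

func main() {
	// Each of the ~10 processes on the machine runs this same code and
	// points at the single OpenCensus Agent/Service on localhost.
	exp, err := ocagent.NewExporter(
		ocagent.WithAddress("localhost:55678"), // default agent port, assumed
		ocagent.WithInsecure(),
		ocagent.WithServiceName("my-job"),
	)
	if err != nil {
		log.Fatalf("failed to create ocagent exporter: %v", err)
	}
	view.RegisterExporter(exp)

	if err := view.Register(&view.View{
		Name:        "job/processed_count",
		Measure:     processedCount,
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatalf("failed to register view: %v", err)
	}
	view.SetReportingPeriod(60 * time.Second)

	// Each process records count = 1; the question is how these fan in
	// at the Agent before reaching Stackdriver.
	stats.Record(context.Background(), processedCount.M(1))

	time.Sleep(2 * time.Minute) // let at least one reporting interval pass
}
```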

songy23 added the bug label May 8, 2019

songy23 commented May 8, 2019

> If we have 10 processes each pushing the same count view with count = 1, my expectation was that the service would collapse those into one call to Stackdriver with count = 10. Instead, the service appears to just push 10 updates directly to Stackdriver, so the value in Stackdriver is 10 times lower than it should be for the given resource (one GCE instance).

What you described is an alternative approach we considered previously when designing the Agent: have each process (instrumented with the OpenCensus libraries) send raw measurements to the Agent, and have the Agent do the aggregation. We ended up not going with this approach because of the communication cost it implied. Instead, we expect the metrics sent from each process to already be aggregated, i.e. we expect the aggregation to be done within the process, and the Agent just passes the metrics as-is to the APM backends.
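To make "already aggregated" concrete: in the Go library, the in-process view worker accumulates measurements and hands each registered exporter one *view.Data per view per reporting interval, and that aggregated data is what the ocagent exporter forwards to the Agent. A toy logging exporter (hypothetical, for illustration only) makes this visible:

```go
package main

import (
	"log"

	"go.opencensus.io/stats/view"
)

// loggingExporter is a toy view.Exporter, used only to show that what leaves
// the process is already-aggregated view data (one *view.Data per view per
// reporting interval), not individual raw measurements.
type loggingExporter struct{}

func (loggingExporter) ExportView(vd *view.Data) {
	for _, row := range vd.Rows {
		if cd, ok := row.Data.(*view.CountData); ok {
			log.Printf("view=%s tags=%v cumulative_count=%d", vd.View.Name, row.Tags, cd.Value)
		}
	}
}

// Registered alongside (or instead of) the ocagent exporter:
//   view.RegisterExporter(loggingExporter{})
```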

> Assuming it is, I think the service needs to sum counts as described above to handle fan-in (as the diagram in the README.md illustrates). That is, to get the same behavior as if we had 10 machines (one process per machine) pushing directly to Stackdriver, I don't see a way to avoid summing.

Actually, there is a way to avoid this. The problem I see here is that the Agent doesn't distinguish the metrics coming from each process, so time series are overwritten incorrectly. @rghetia also mentioned this issue.

A little more context: Stackdriver doesn't allow you to write concurrently to the same metric if the time series have the same metric labels and resource. Therefore we added a supplementary label called "opencensus_task" which distinguishes the time series from each process (see https://github.com/census-instrumentation/opencensus-java/tree/master/exporters/stats/stackdriver#what-is-opencensus_task-metric-label-). However, this label is only added when the Stackdriver exporter is running in the same process as the main application. When you switch to the ocagent exporter, this information is dropped.
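As a rough sketch of what the in-process Stackdriver exporter path provides, the Go exporter can attach a per-process monitoring label along these lines (the exact label value format here is an assumption, mirroring the "<lang>-<pid>@<hostname>" shape used for opencensus_task):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

func main() {
	// When the Stackdriver exporter runs inside the application process, it
	// attaches a per-process label so that time series from different
	// processes on the same resource don't collide.
	hostname, err := os.Hostname()
	if err != nil {
		hostname = "localhost"
	}
	task := fmt.Sprintf("go-%d@%s", os.Getpid(), hostname) // assumed format

	labels := &stackdriver.Labels{}
	labels.Set("opencensus_task", task, "Task identifier distinguishing this process")

	exp, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID:               "my-gcp-project", // placeholder
		DefaultMonitoringLabels: labels,
	})
	if err != nil {
		log.Fatalf("failed to create Stackdriver exporter: %v", err)
	}
	view.RegisterExporter(exp)
}
```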

To illustrate this further, suppose we have a metric (metric1) and two processes (p1, p2) writing to it.

What should have happened:

- p1 sends count 1 with its identifier -> Agent -> Stackdriver
- p2 sends count 1 with its identifier -> Agent -> Stackdriver

And we get metric1 like this:

| pid | count |
| --- | ----- |
| p1  | 1     |
| p2  | 1     |

When you omit the pid dimension, you get a total count of 2.

What currently happens:

- p1 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver
- p2 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver

The two counts are treated as the same time series, so the latter one overwrites the previous one, and we also lose the pid dimension:

| count |
| ----- |
| 1     |
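One hypothetical in-application workaround that mirrors the table above is to make the process identifier an explicit tag on the view, so each process produces its own row regardless of what labels the exporter adds. This is illustration only (the measure and tag key names are placeholders), not the fix referenced by the linked PR:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Placeholder measure, reused from the earlier sketch.
var processedCount = stats.Int64("job/processed_count", "Items processed", stats.UnitDimensionless)

func main() {
	// A "pid" tag key plays the role of the pid column in the table above.
	keyPID, err := tag.NewKey("pid")
	if err != nil {
		log.Fatal(err)
	}

	if err := view.Register(&view.View{
		Name:        "job/processed_count",
		Measure:     processedCount,
		TagKeys:     []tag.Key{keyPID}, // extra dimension keeps per-process rows distinct
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatal(err)
	}

	// Tag this process's recordings with its own identifier (p1, p2, ...).
	ctx, err := tag.New(context.Background(),
		tag.Upsert(keyPID, fmt.Sprintf("p%d", os.Getpid())))
	if err != nil {
		log.Fatal(err)
	}

	// Each process now contributes its own time series; summing over pid
	// at query time yields the true total of 2 in the example above.
	stats.Record(ctx, processedCount.M(1))
}
```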


songy23 commented May 8, 2019

In conclusion,

> Instead, the service appears to just push 10 updates directly to Stackdriver.

Yes, this is the expected behavior with our current design.

> So the value in Stackdriver is 10 times lower than it should be for the given resource (one GCE instance).

This is a bug we need to fix, as I described in my comment above.

> Assuming it is, I think the service needs to sum counts as described above to handle fan-in (as the diagram in the README.md illustrates).

Not really. These counts should be treated as distinct time series, and under our current design they should not be combined.


rghetia commented May 9, 2019

> What currently happens:
>
> - p1 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver
> - p2 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver
>
> The two counts are treated as the same time series, so the latter one overwrites the previous one, and we also lose the pid dimension:
>
> | count |
> | ----- |
> | 1     |

This can also result in an error, because it is essentially updating the same time series less than 60 seconds apart.
