
Service doesn't combine metrics #542

Closed
prateekr opened this issue May 3, 2019 · 3 comments · Fixed by census-ecosystem/opencensus-go-exporter-stackdriver#148

prateekr commented May 3, 2019

Context: We have dozens of processes running on the same machine, all pushing the same metric type to the OpenCensus receiver for export to Stackdriver. Judging from the README.md, this is a supported use case of the OpenCensus Service.

For example, if we have 10 processes each pushing the same count view with count = 1, my expectation was that the service would collapse those into one call to Stackdriver with count = 10. Instead, the service appears to just push 10 updates directly to Stackdriver, so the value in Stackdriver is 10 times lower than it should be for the given resource (one GCE instance). The source code also lends itself to this conclusion.

Is this analysis accurate? Assuming it is, I think the service needs to sum counts as described above to handle fan-in (as the diagram in the README.md illustrates). That is, to get the same behavior as if we had 10 machines (one process per machine) pushing directly to Stackdriver, I don't see a way to avoid summing.
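For concreteness, here is a minimal sketch of the kind of per-process setup described above, using the standard OpenCensus Go libraries and the ocagent exporter. The measure/view name, service name, and agent address are placeholders, not our actual configuration:

```go
package main

import (
	"context"
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Illustrative measure name; the real metrics in our setup differ.
var processedCount = stats.Int64("job/processed_count", "Items processed", stats.UnitDimensionless)

func main() {
	// Each of the ~10 processes on the machine runs this same code and
	// points at the single OpenCensus Agent/Service on localhost.
	exp, err := ocagent.NewExporter(
		ocagent.WithAddress("localhost:55678"), // default agent port, assumed
		ocagent.WithInsecure(),
		ocagent.WithServiceName("my-job"),
	)
	if err != nil {
		log.Fatalf("failed to create ocagent exporter: %v", err)
	}
	view.RegisterExporter(exp)

	if err := view.Register(&view.View{
		Name:        "job/processed_count",
		Measure:     processedCount,
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatalf("failed to register view: %v", err)
	}
	view.SetReportingPeriod(60 * time.Second)

	// Each process records count = 1; the question is how these fan in
	// at the Agent before reaching Stackdriver.
	stats.Record(context.Background(), processedCount.M(1))

	time.Sleep(2 * time.Minute) // let at least one reporting interval pass
}
```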

songy23 added the bug label May 8, 2019

songy23 commented May 8, 2019

> If we have 10 processes each pushing the same count view with count = 1, my expectation was that the service would collapse those into one call to Stackdriver with count = 10. Instead, the service appears to just push 10 updates directly to Stackdriver, so the value in Stackdriver is 10 times lower than it should be for the given resource (one GCE instance).

What you described is an alternative approach we considered previously when designing the Agent: have each process (instrumented with the OpenCensus libraries) send raw measurements to the Agent, and have the Agent do the aggregation. We ended up not going with this approach because of the communication cost it implied. Instead, we expect the metrics sent from each process to already be aggregated, i.e. we expect the aggregation to be done within the process, and the Agent just passes the metrics as-is to the APM backends.
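To make "already aggregated" concrete: in the Go library, the in-process view worker accumulates measurements and hands each registered exporter one *view.Data per view per reporting interval, and that aggregated data is what the ocagent exporter forwards to the Agent. A toy logging exporter (hypothetical, for illustration only) makes this visible:

```go
package main

import (
	"log"

	"go.opencensus.io/stats/view"
)

// loggingExporter is a toy view.Exporter, used only to show that what leaves
// the process is already-aggregated view data (one *view.Data per view per
// reporting interval), not individual raw measurements.
type loggingExporter struct{}

func (loggingExporter) ExportView(vd *view.Data) {
	for _, row := range vd.Rows {
		if cd, ok := row.Data.(*view.CountData); ok {
			log.Printf("view=%s tags=%v cumulative_count=%d", vd.View.Name, row.Tags, cd.Value)
		}
	}
}

// Registered alongside (or instead of) the ocagent exporter:
//   view.RegisterExporter(loggingExporter{})
```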

> Assuming it is, I think the service needs to sum counts as described above to handle fan-in (as the diagram in the README.md illustrates). That is, to get the same behavior as if we had 10 machines (one process per machine) pushing directly to Stackdriver, I don't see a way to avoid summing.

Actually, there is a way to avoid this. The problem I see here is that the Agent doesn't distinguish the metrics coming from each process, so time series are overwritten incorrectly. @rghetia also mentioned this issue.

A little more context: Stackdriver doesn't allow you to write concurrently to the same metric if the time series have the same metric labels and resource. Therefore we added a supplementary label called "opencensus_task" which distinguishes the time series from each process (see https://github.com/census-instrumentation/opencensus-java/tree/master/exporters/stats/stackdriver#what-is-opencensus_task-metric-label-). However, this label is only added when the Stackdriver exporter is running in the same process as the main application. When you switch to the ocagent exporter, this information is dropped.
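As a rough sketch of what the in-process Stackdriver exporter path provides, the Go exporter can attach a per-process monitoring label along these lines (the exact label value format here is an assumption, mirroring the "<lang>-<pid>@<hostname>" shape used for opencensus_task):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

func main() {
	// When the Stackdriver exporter runs inside the application process, it
	// attaches a per-process label so that time series from different
	// processes on the same resource don't collide.
	hostname, err := os.Hostname()
	if err != nil {
		hostname = "localhost"
	}
	task := fmt.Sprintf("go-%d@%s", os.Getpid(), hostname) // assumed format

	labels := &stackdriver.Labels{}
	labels.Set("opencensus_task", task, "Task identifier distinguishing this process")

	exp, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID:               "my-gcp-project", // placeholder
		DefaultMonitoringLabels: labels,
	})
	if err != nil {
		log.Fatalf("failed to create Stackdriver exporter: %v", err)
	}
	view.RegisterExporter(exp)
}
```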

To illustrate this further, suppose we have a metric (metric1) and two processes (p1, p2) writing to it.

What should have happened:

- p1 sends count 1 with its identifier -> Agent -> Stackdriver
- p2 sends count 1 with its identifier -> Agent -> Stackdriver

And we get metric1 like this:

| pid | count |
| --- | ----- |
| p1  | 1     |
| p2  | 1     |

When you omit the pid dimension, you get a total count of 2.

What currently happens:

- p1 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver
- p2 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver

The two counts are treated as the same time series, so the latter one overwrites the previous one, and we also lose the pid dimension:

| count |
| ----- |
| 1     |
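One hypothetical in-application workaround that mirrors the table above is to make the process identifier an explicit tag on the view, so each process produces its own row regardless of what labels the exporter adds. This is illustration only (the measure and tag key names are placeholders), not the fix referenced by the linked PR:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Placeholder measure, reused from the earlier sketch.
var processedCount = stats.Int64("job/processed_count", "Items processed", stats.UnitDimensionless)

func main() {
	// A "pid" tag key plays the role of the pid column in the table above.
	keyPID, err := tag.NewKey("pid")
	if err != nil {
		log.Fatal(err)
	}

	if err := view.Register(&view.View{
		Name:        "job/processed_count",
		Measure:     processedCount,
		TagKeys:     []tag.Key{keyPID}, // extra dimension keeps per-process rows distinct
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatal(err)
	}

	// Tag this process's recordings with its own identifier (p1, p2, ...).
	ctx, err := tag.New(context.Background(),
		tag.Upsert(keyPID, fmt.Sprintf("p%d", os.Getpid())))
	if err != nil {
		log.Fatal(err)
	}

	// Each process now contributes its own time series; summing over pid
	// at query time yields the true total of 2 in the example above.
	stats.Record(ctx, processedCount.M(1))
}
```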


songy23 commented May 8, 2019

In conclusion,

> Instead, the service appears to just push 10 updates directly to Stackdriver.

Yes, this is the expected behavior with our current design.

> So the value in Stackdriver is 10 times lower than it should be for the given resource (one GCE instance).

This is a bug we need to fix, as I described in my comment above.

> Assuming it is, I think the service needs to sum counts as described above to handle fan-in (as the diagram in the README.md illustrates).

Not really. These counts should be treated as distinct time series, and under our current design they should not be combined.


rghetia commented May 9, 2019

> What currently happens:
>
> - p1 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver
> - p2 sends count 1 with its identifier -> Agent (drops the id) -> Stackdriver
>
> The two counts are treated as the same time series, so the latter one overwrites the previous one, and we also lose the pid dimension:
>
> | count |
> | ----- |
> | 1     |

This can also result in an error, because it is essentially updating the same time series less than 60 seconds apart.
