
Spec more granular metrics + data collection for fast verification per orchestrator #2336

Closed
yondonfu opened this issue Mar 23, 2022 · 7 comments

Comments

@yondonfu
Member

  • We should be able to determine the fast verification success/failure rate for each O
  • We should be able to determine which step of fast verification an O failed at
  • We should be able to collect data from failure cases for Os
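A minimal sketch (plain Go, no metrics library; all names like `FastVerifyStats` are hypothetical, not go-livepeer APIs) of counters that would answer the first two questions, keyed by orchestrator address and by the verification step that failed:

```go
package main

import (
	"fmt"
	"sync"
)

// result holds per-orchestrator fast-verification outcomes,
// with failures broken down by the step that failed.
type result struct {
	success        int
	failuresByStep map[string]int
}

// FastVerifyStats is a hypothetical in-memory tracker keyed by
// orchestrator address; cardinality stays bounded by the O set.
type FastVerifyStats struct {
	mu     sync.Mutex
	byOrch map[string]*result
}

func NewFastVerifyStats() *FastVerifyStats {
	return &FastVerifyStats{byOrch: make(map[string]*result)}
}

// Record logs one fast-verification attempt for orchestrator addr.
// failedStep is "" on success, otherwise the step that failed.
func (s *FastVerifyStats) Record(addr, failedStep string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.byOrch[addr]
	if !ok {
		r = &result{failuresByStep: make(map[string]int)}
		s.byOrch[addr] = r
	}
	if failedStep == "" {
		r.success++
	} else {
		r.failuresByStep[failedStep]++
	}
}

// SuccessRate returns successes / total attempts for one orchestrator.
func (s *FastVerifyStats) SuccessRate(addr string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.byOrch[addr]
	if !ok {
		return 0
	}
	total := r.success
	for _, n := range r.failuresByStep {
		total += n
	}
	if total == 0 {
		return 0
	}
	return float64(r.success) / float64(total)
}

func main() {
	st := NewFastVerifyStats()
	st.Record("0xabc", "")
	st.Record("0xabc", "pixel-check")
	st.Record("0xabc", "")
	st.Record("0xabc", "")
	fmt.Println(st.SuccessRate("0xabc")) // 0.75
}
```

The failure-step map doubles as the hook for the third bullet: each failed step is a natural place to persist the trusted/untrusted data for later inspection.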
@yondonfu
Member Author

For the first two: I know we've had trouble with per-orchestrator metrics with Prometheus in the past, because Prometheus does not handle high-cardinality labels well, so we will likely need to take that into account.

@victorges @iameli Any thoughts here on the best way to collect per-orchestrator metrics these days? Could https://github.com/livepeer/livepeer-data help?

@victorges
Member

So now we are actually using Victoria Metrics instead of Prometheus, which does support high-cardinality metrics!

@tqian1 is adding support for per-stream metrics in the go-livepeer code. The only problem with that AFAIK (and correct me if I'm wrong) is the metrics-exporting libraries themselves, like OpenCensus or Prometheus: once we measure one stream, its metrics stay in memory forever and there is no clear way to delete them, so the list of metrics grows without bound. Mist fixes this by having its own implementation of a metrics library, which keeps a metric only while a stream is currently active.
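The Mist-style fix described above could be sketched like this (plain Go, hypothetical names; not Mist's or go-livepeer's actual code): per-stream metrics live in a registry only between stream start and stream end, so memory is bounded by the number of concurrently active streams rather than growing forever.

```go
package main

import (
	"fmt"
	"sync"
)

// StreamMetrics is a stand-in for whatever per-stream counters get exported.
type StreamMetrics struct {
	SegmentsTranscoded int
}

// ActiveStreamRegistry keeps metrics only for currently active streams,
// keyed by manifestID, mirroring the lifecycle-based approach above.
type ActiveStreamRegistry struct {
	mu      sync.Mutex
	streams map[string]*StreamMetrics
}

func NewActiveStreamRegistry() *ActiveStreamRegistry {
	return &ActiveStreamRegistry{streams: make(map[string]*StreamMetrics)}
}

func (r *ActiveStreamRegistry) StreamStarted(manifestID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.streams[manifestID] = &StreamMetrics{}
}

// StreamEnded deregisters the stream's metrics: the deletion step that
// OpenCensus tag sets have no clean equivalent for.
func (r *ActiveStreamRegistry) StreamEnded(manifestID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.streams, manifestID)
}

func (r *ActiveStreamRegistry) IncSegments(manifestID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if m, ok := r.streams[manifestID]; ok {
		m.SegmentsTranscoded++
	}
}

// ActiveCount is what an exporter would scrape over: it stays bounded.
func (r *ActiveStreamRegistry) ActiveCount() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.streams)
}

func main() {
	reg := NewActiveStreamRegistry()
	reg.StreamStarted("manifest-1")
	reg.IncSegments("manifest-1")
	reg.StreamEnded("manifest-1")
	fmt.Println(reg.ActiveCount()) // 0
}
```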

As for exporting metrics per orchestrator, though, not necessarily per stream, we should be fine! Maybe even in Prometheus, since metrics are already broken down by instance anyway with all the pod labels.

It would also be interesting to integrate that data into livepeer-data (stream health/analyzer) somehow, but if the purpose is only having visibility of these metrics in internal dashboards it is not necessary to do so, everything should hopefully 'just work' in our Prometheus+Victoria Metrics pipeline already.

@figintern

figintern commented Mar 25, 2022

Thanks for the mention @victorges

Victor summarizes the problem well: the main limitation for per-stream metrics is the library (OpenCensus) that we use in go-livepeer. It does not have a clean way to register and deregister stream-specific tags like manifestId, so the list of exported metrics grows forever, and very quickly. However, this only affects metrics tagged with stream-specific tags, because manifestId is a tag whose cardinality increases indefinitely.

I was about to create a new issue in the repo to go over an RFC doc I wrote around this, but perhaps I can piggyback on this one if that's alright:

https://www.notion.so/livepeer/Single-Stream-Health-OpenCensus-b402a38f5cf54479a7ad9845c57e5604

If anyone has a chance, please help review this doc. I'm looking for feedback on Options 1 and 2, to move forward with the per-stream metrics and to establish a pattern for exporting high-cardinality or "leaky" metrics which do not clean themselves up.
I'm leaving the doc internal for now because it references some metric dumps which may contain sensitive information.

Also looking for feedback on PR #2313, please. It separates some of the current metrics into the original view and a per-stream view, in order to reduce the overall dimensionality of exported metrics. It doesn't address the "leakiness" problem, but it makes the cardinality more manageable.

@hthillman
Contributor

cc @ArcAster possibly interesting for your work

@oscar-davids
Contributor

We have already added the metrics into OpenCensus per broadcast, here and here.
For the fast verification failure case, we need to save the trusted data and untrusted data somewhere.
cc @leszko @red-0ne

@leszko
Contributor

leszko commented Apr 29, 2022

We have already added the metrics into OpenCensus per broadcast, here and here.

Thanks for the info @oscar-davids. So, if I understand correctly, we added the "per stream" metrics, but we didn't actually solve the issue with OpenCensus and high cardinality. So the current state is that if you enable the flag -metricsPerStream, you experience a leak, because the old metrics are never cleaned up. Is my understanding correct? CC: @tqian1 @victorges

@oscar-davids
Contributor

oscar-davids commented Apr 29, 2022

I mean that we can currently determine the total success rate for fast verification per broadcast, but we don't know the rate per orchestrator, so we need to collect metrics per orchestrator.
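The distinction above can be illustrated with a small sketch (plain Go, hypothetical names): a per-broadcast rate can look healthy while one orchestrator fails consistently, which only becomes visible once each result is tagged with the orchestrator that produced it.

```go
package main

import "fmt"

// segResult tags each fast-verification outcome with the orchestrator
// that transcoded the segment (hypothetical illustration).
type segResult struct {
	orch    string
	success bool
}

// rates computes the overall per-broadcast success rate and the
// per-orchestrator breakdown from the same tagged results.
func rates(results []segResult) (overall float64, perOrch map[string]float64) {
	total, okCount := 0, 0
	attempts := map[string]int{}
	successes := map[string]int{}
	for _, r := range results {
		total++
		attempts[r.orch]++
		if r.success {
			okCount++
			successes[r.orch]++
		}
	}
	perOrch = make(map[string]float64)
	for o, n := range attempts {
		perOrch[o] = float64(successes[o]) / float64(n)
	}
	overall = float64(okCount) / float64(total)
	return
}

func main() {
	res := []segResult{
		{"0xaaa", true}, {"0xaaa", true}, {"0xaaa", true},
		{"0xbbb", false}, // 0xbbb fails every attempt
	}
	overall, per := rates(res)
	fmt.Println(overall)      // 0.75 -- broadcast looks mostly fine
	fmt.Println(per["0xbbb"]) // 0    -- but this O never succeeds
}
```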
