
Spec more granular metrics + data collection for fast verification per orchestrator #2336

Closed
yondonfu opened this issue Mar 23, 2022 · 7 comments

Comments

@yondonfu
Member

  • We should be able to determine the fast verification success/failure rate for each O
  • We should be able to determine which step of fast verification an O failed at
  • We should be able to collect data from failure cases for Os
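A minimal sketch (plain Go, no metrics library; all names like `FastVerifyStats` are hypothetical, not go-livepeer APIs) of counters that would answer the first two questions, keyed by orchestrator address and by the verification step that failed:

```go
package main

import (
	"fmt"
	"sync"
)

// result holds per-orchestrator fast-verification outcomes,
// with failures broken down by the step that failed.
type result struct {
	success        int
	failuresByStep map[string]int
}

// FastVerifyStats is a hypothetical in-memory tracker keyed by
// orchestrator address; cardinality stays bounded by the O set.
type FastVerifyStats struct {
	mu     sync.Mutex
	byOrch map[string]*result
}

func NewFastVerifyStats() *FastVerifyStats {
	return &FastVerifyStats{byOrch: make(map[string]*result)}
}

// Record logs one fast-verification attempt for orchestrator addr.
// failedStep is "" on success, otherwise the step that failed.
func (s *FastVerifyStats) Record(addr, failedStep string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.byOrch[addr]
	if !ok {
		r = &result{failuresByStep: make(map[string]int)}
		s.byOrch[addr] = r
	}
	if failedStep == "" {
		r.success++
	} else {
		r.failuresByStep[failedStep]++
	}
}

// SuccessRate returns successes / total attempts for one orchestrator.
func (s *FastVerifyStats) SuccessRate(addr string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.byOrch[addr]
	if !ok {
		return 0
	}
	total := r.success
	for _, n := range r.failuresByStep {
		total += n
	}
	if total == 0 {
		return 0
	}
	return float64(r.success) / float64(total)
}

func main() {
	st := NewFastVerifyStats()
	st.Record("0xabc", "")
	st.Record("0xabc", "pixel-check")
	st.Record("0xabc", "")
	st.Record("0xabc", "")
	fmt.Println(st.SuccessRate("0xabc")) // 0.75
}
```

The failure-step map doubles as the hook for the third bullet: each failed step is a natural place to persist the trusted/untrusted data for later inspection.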
@yondonfu
Member Author

For the first two: I know we've had trouble with per-orchestrator metrics with Prometheus in the past, because Prometheus does not handle high-cardinality labels well, so we will likely need to take that into account.

@victorges @iameli Any thoughts here on the best way to collect per-orchestrator metrics these days? Could https://github.com/livepeer/livepeer-data help?

@victorges
Member

So now we are actually using Victoria Metrics instead of Prometheus, which does support high-cardinality metrics!

@tqian1 is adding support for per-stream metrics in the go-livepeer code. The only problem with that AFAIK (and correct me if I'm wrong) is the metrics-exporting libraries themselves, like OpenCensus or Prometheus: once we measure one stream, its metrics stay in memory forever and there is no clear way to delete them, so the list of metrics grows without bound. Mist fixes this by having its own implementation of a metrics library, which keeps a metric only while a stream is currently active.
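The Mist-style fix described above could be sketched like this (plain Go, hypothetical names; not Mist's or go-livepeer's actual code): per-stream metrics live in a registry only between stream start and stream end, so memory is bounded by the number of concurrently active streams rather than growing forever.

```go
package main

import (
	"fmt"
	"sync"
)

// StreamMetrics is a stand-in for whatever per-stream counters get exported.
type StreamMetrics struct {
	SegmentsTranscoded int
}

// ActiveStreamRegistry keeps metrics only for currently active streams,
// keyed by manifestID, mirroring the lifecycle-based approach above.
type ActiveStreamRegistry struct {
	mu      sync.Mutex
	streams map[string]*StreamMetrics
}

func NewActiveStreamRegistry() *ActiveStreamRegistry {
	return &ActiveStreamRegistry{streams: make(map[string]*StreamMetrics)}
}

func (r *ActiveStreamRegistry) StreamStarted(manifestID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.streams[manifestID] = &StreamMetrics{}
}

// StreamEnded deregisters the stream's metrics: the deletion step that
// OpenCensus tag sets have no clean equivalent for.
func (r *ActiveStreamRegistry) StreamEnded(manifestID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.streams, manifestID)
}

func (r *ActiveStreamRegistry) IncSegments(manifestID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if m, ok := r.streams[manifestID]; ok {
		m.SegmentsTranscoded++
	}
}

// ActiveCount is what an exporter would scrape over: it stays bounded.
func (r *ActiveStreamRegistry) ActiveCount() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.streams)
}

func main() {
	reg := NewActiveStreamRegistry()
	reg.StreamStarted("manifest-1")
	reg.IncSegments("manifest-1")
	reg.StreamEnded("manifest-1")
	fmt.Println(reg.ActiveCount()) // 0
}
```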

As for exporting metrics per orchestrator, though, not necessarily per stream, we should be fine! Maybe even in Prometheus, since metrics are already broken down by instance anyway with all the pod labels.

It would also be interesting to integrate that data into livepeer-data (stream health/analyzer) somehow, but if the purpose is only having visibility of these metrics in internal dashboards it is not necessary to do so, everything should hopefully 'just work' in our Prometheus+Victoria Metrics pipeline already.

@figintern

figintern commented Mar 25, 2022

Thanks for the mention @victorges

Victor summarizes the problem well: the main limitation for per-stream metrics is the library (OpenCensus) that we use in go-livepeer. It does not have a clean way to register and deregister stream-specific tags like manifestId, so the list of exported metrics grows forever, and very quickly. However, this only affects metrics tagged with stream-specific tags, because manifestId is a tag whose cardinality increases indefinitely.

I was about to create a new issue in the repo to go over an RFC doc I wrote around this, but perhaps I can piggyback on this one if that's alright:

https://www.notion.so/livepeer/Single-Stream-Health-OpenCensus-b402a38f5cf54479a7ad9845c57e5604

If anyone has a chance, please help review this doc. I'm looking for feedback on Options 1 and 2, to move forward with the per-stream metrics and to establish a pattern for exporting high-cardinality or "leaky" metrics which do not clean themselves up.
I'm leaving the doc internal for now because it references some metric dumps which may contain sensitive information.

Also looking for feedback on PR #2313, please. It separates some of the current metrics into the original view and a per-stream view, in order to reduce the overall dimensionality of exported metrics. It doesn't address the "leakiness" problem, but it makes the cardinality more manageable.

@hthillman
Contributor

cc @ArcAster possibly interesting for your work

@oscar-davids
Contributor

We have already added the metrics into OpenCensus per broadcast, here and here.
For the fast verification failure case, we need to save the trusted data and untrusted data somewhere.
cc @leszko @red-0ne

@leszko
Contributor

leszko commented Apr 29, 2022

We have already added the metrics into OpenCensus per broadcast, here and here.

Thanks for the info @oscar-davids. So, if I understand correctly, we added the "per stream" metrics, but we didn't actually solve the issue with OpenCensus and high cardinality. So the current state is that if you enable the flag -metricsPerStream, you experience a leak, because the old metrics are never cleaned up. Is my understanding correct? CC: @tqian1 @victorges

@oscar-davids
Contributor

oscar-davids commented Apr 29, 2022

I mean that we can currently determine the total success rate for fast verification per broadcast, but we don't know the rate per orchestrator, so we need to collect metrics per orchestrator.
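The distinction above can be illustrated with a small sketch (plain Go, hypothetical names): a per-broadcast rate can look healthy while one orchestrator fails consistently, which only becomes visible once each result is tagged with the orchestrator that produced it.

```go
package main

import "fmt"

// segResult tags each fast-verification outcome with the orchestrator
// that transcoded the segment (hypothetical illustration).
type segResult struct {
	orch    string
	success bool
}

// rates computes the overall per-broadcast success rate and the
// per-orchestrator breakdown from the same tagged results.
func rates(results []segResult) (overall float64, perOrch map[string]float64) {
	total, okCount := 0, 0
	attempts := map[string]int{}
	successes := map[string]int{}
	for _, r := range results {
		total++
		attempts[r.orch]++
		if r.success {
			okCount++
			successes[r.orch]++
		}
	}
	perOrch = make(map[string]float64)
	for o, n := range attempts {
		perOrch[o] = float64(successes[o]) / float64(n)
	}
	overall = float64(okCount) / float64(total)
	return
}

func main() {
	res := []segResult{
		{"0xaaa", true}, {"0xaaa", true}, {"0xaaa", true},
		{"0xbbb", false}, // 0xbbb fails every attempt
	}
	overall, per := rates(res)
	fmt.Println(overall)      // 0.75 -- broadcast looks mostly fine
	fmt.Println(per["0xbbb"]) // 0    -- but this O never succeeds
}
```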
