
Per Orchestrator metrics in Broadcaster #2371

Closed
leszko opened this issue Apr 25, 2022 · 13 comments · Fixed by #2382

Comments

leszko (Contributor) commented Apr 25, 2022

Is your feature request related to a problem? Please describe.
Currently, the Livepeer Broadcaster doesn't export any metrics tagged per Orchestrator. Such metrics would be useful, for example, for monitoring test streams.

Describe the solution you'd like
Add per-Orchestrator metrics recorded by the Broadcaster:

  • transcode_overall_latency_seconds_per_orchestrator
  • segment_transcoded_all_appeared_total_per_orchestrator
  • segment_transcoded_unprocessed_total_per_orchestrator


@github-actions github-actions bot added the status: triage this issue has not been evaluated yet label Apr 25, 2022
@leszko leszko added area: metrics and removed status: triage this issue has not been evaluated yet labels Apr 25, 2022
@leszko leszko assigned red-0ne and unassigned figintern Apr 26, 2022
yondonfu (Member) commented Apr 26, 2022

Would it make sense to add labels (i.e. an orchestrator label) for the existing metrics instead of having separate metrics with the _per_orchestrator suffix?

leszko (Contributor, Author) commented Apr 27, 2022

Would it make sense to add labels (i.e. an orchestrator label) for the existing metrics instead of having separate metrics with the _per_orchestrator suffix?

I think we could do it that way. I actually tried to follow the pattern we have for per_stream metrics. We do have both:

  • success_rate
  • success_rate_per_stream

@tqian1 do you see any drawbacks of just adding a label to existing metrics instead of creating new ones? Also do you maybe know why we added new ones for per_stream?

@red-0ne red-0ne linked a pull request Apr 29, 2022 that will close this issue
red-0ne (Contributor) commented Apr 29, 2022

I've added the requested metrics in census.

The approach was to create the metric and insert a tag with the key kOrchestratorURI. The issue now is that we need to get the URI from somewhere, since it's not passed by SegmentFullyTranscoded(). I wasn't sure whether I could get it from ctx or would need to pass the value along the whole chain of function calls.
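
For illustration, roughly what I mean, heavily simplified and with placeholder names rather than the actual census.go code:

```go
package census

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

// Placeholder key/measure for illustration; the real ones live in census.go.
var (
	kOrchestratorURI = tag.MustNewKey("orchestrator_uri")

	mTranscodeOverallLatency = stats.Float64(
		"transcode_overall_latency_seconds",
		"End-to-end transcode latency of a segment",
		"s")
)

// withOrchestratorURI attaches the orchestrator URI to the context so that
// downstream recording calls can pick it up without new function parameters.
func withOrchestratorURI(ctx context.Context, uri string) context.Context {
	ctx, _ = tag.New(ctx, tag.Upsert(kOrchestratorURI, uri))
	return ctx
}

// recordTranscodeLatency records one latency sample; any tags already present
// in ctx (including kOrchestratorURI) are kept by OpenCensus automatically.
func recordTranscodeLatency(ctx context.Context, latencySec float64) {
	stats.Record(ctx, mTranscodeOverallLatency.M(latencySec))
}
```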

red-0ne (Contributor) commented May 3, 2022

According to the implementation of manifestWithTag, it seems possible to get keys from ctx. @tqian1, do you think it's possible to get the orchestrator's URI from there? Also, if I tag that URI in transcode_overall_latency_seconds instead of the ..._per_orchestrator metrics, can I have both functionalities (per-orchestrator and overall metrics) by looking up the kOrchestratorURI key?

figintern commented May 5, 2022

@tqian1 do you see any drawbacks of just adding a label to existing metrics instead of creating new ones? Also do you maybe know why we added new ones for per_stream?

Sorry for the delay here, folks:
Adding a label to the existing metrics is preferable to creating new ones.

For context (and to clear up the confusion): the reason we have independent metrics suffixed with per_stream on our side is that the high cardinality of manifest_ids, combined with the high dimensionality of the pre-existing Distribution view aggregation type, dramatically increases the number of series/metrics exported.

For metrics like http_client_segment_transcoded_realtime_ratio, the per_stream version contains the manifest_id tag and thus uses a single-bucket view.Distribution() instead of a multi-bucket one, i.e. view.Distribution(0.5, 1, 2, 3, 5, 10, 50, 100).

This is not necessary for orchestrator metrics, because the starting cardinality of kOrchestratorURI is not inherently high and does not leak/grow as fast as manifest_id.
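
To make the cardinality trade-off concrete, here is a rough sketch of the two view definitions (the bucket bounds are the ones quoted above; the keys and measure are illustrative, not the exact census.go definitions). The exported series count grows roughly with (distinct tag value combinations) x (number of Distribution buckets), which is why the per_stream variant collapses to a single bucket:

```go
package census

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Illustrative keys and measure; not the exact definitions from census.go.
var (
	kManifestID    = tag.MustNewKey("manifest_id")
	kOrchURI       = tag.MustNewKey("orchestrator_uri")
	mRealtimeRatio = stats.Float64(
		"http_client_segment_transcoded_realtime_ratio",
		"Segment duration divided by transcode time",
		stats.UnitDimensionless)
)

// Orchestrator tag: low, slowly growing cardinality, so a full bucket layout
// is affordable on the overall metric.
var realtimeRatioView = &view.View{
	Name:        "http_client_segment_transcoded_realtime_ratio",
	Measure:     mRealtimeRatio,
	TagKeys:     []tag.Key{kOrchURI},
	Aggregation: view.Distribution(0.5, 1, 2, 3, 5, 10, 50, 100),
}

// Per-stream variant: manifest_id is high cardinality, so the bucket count is
// reduced to one to keep (tag values x buckets) from exploding.
var realtimeRatioPerStreamView = &view.View{
	Name:        "http_client_segment_transcoded_realtime_ratio_per_stream",
	Measure:     mRealtimeRatio,
	TagKeys:     []tag.Key{kManifestID},
	Aggregation: view.Distribution(),
}
```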

figintern commented May 5, 2022

The approach was to create the metric and insert a tag with the key kOrchestratorURI. The issue now is that we need to get the URI from somewhere, since it's not passed by SegmentFullyTranscoded(). I wasn't sure whether I could get it from ctx or would need to pass the value along the whole chain of function calls.

@red-0ne Great approach, and in line with the existing pattern. I just realized we never merged this PR on our side; hope you were able to have a look at the pattern there: #2313

Do you think it's possible to get the orchestrator's URI from there?

AFAICT it is possible to get orchestrator_uri from ctx; were you able to pull it?

Also, if I tag that URI in transcode_overall_latency_seconds instead of the ..._per_orchestrator metrics, can I have both functionalities (per-orchestrator and overall metrics) by looking up the kOrchestratorURI key?

Correct, if you tag the URI in transcode_overall_latency_seconds you would have both functionalities. The tag is optional in Grafana/Prometheus: without it you get the orchestrator-agnostic total sum/distribution/count, and you can specify the tag to get it per orchestrator.
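
A minimal sketch of the ctx lookup side of this (placeholder key name, assuming the URI was attached to ctx upstream with tag.New):

```go
package census

import (
	"context"

	"go.opencensus.io/tag"
)

// Placeholder key name for illustration.
var kOrchestratorURI = tag.MustNewKey("orchestrator_uri")

// orchestratorURIFromContext covers the "can we get it from ctx?" part: if an
// upstream call did tag.New(ctx, tag.Upsert(kOrchestratorURI, uri)), the value
// is retrievable here without threading it through every function signature.
// When a sample is then recorded via stats.Record(ctx, ...), the tag rides
// along, and dashboards can drop the label for overall numbers or group by it
// for per-orchestrator numbers.
func orchestratorURIFromContext(ctx context.Context) (string, bool) {
	return tag.FromContext(ctx).Value(kOrchestratorURI)
}
```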

@figintern figintern assigned figintern and unassigned figintern May 5, 2022
leszko (Contributor, Author) commented May 9, 2022

OK, then my bad in the GH issue description. Let's not use the per_orchestrator suffix; we'll just reuse the existing metrics:

  • transcode_overall_latency_seconds
  • segment_transcoded_all_appeared_total
  • segment_transcoded_unprocessed_total

red-0ne (Contributor) commented May 9, 2022

As discussed with @yondonfu, the orchestrator's address seems to be a better identifier for an O, but he raised the case of orchestrators running in off-chain mode. We could consider some conditional behavior to log with the appropriate tag, but I think we should keep it simple within the scope of this issue and tag either addresses or URIs only, each with its own trade-offs.

I propose keeping orchURI as the tag used for O-specific metrics for now. This means we would lose track of an O's metrics if it changes its URI, but I think that's fine for now. WDYT?

leszko (Contributor, Author) commented May 9, 2022

To me it seems that the Service URI could be "good enough". I know the O's ETH address is less prone to change, but I still believe Os don't change their Service URI very often. @yondonfu, what do you think?

yondonfu (Member) commented

I think using the service URI is okay for now, and we can consider using the ETH address as the tag separately.

yondonfu (Member) commented

Based on #2386 I think tagging these metrics with the O's ETH address as well would be helpful because:

  • The ETH address is a stable identifier for the O
  • While we can try to match the on-chain service URI (fetched from the subgraph) against the URI included in this tag, in certain cases these two values may not match because while the on-chain service URI will always be used for discovery requests from B, the O can redirect the B to a different URI for segment requests. See this comment for an example

If we only tag the metrics with the URI, then we'll be able to group the metrics by URI, but due to the above points we may not have an exact mapping of metrics to stable O identifiers. If we tag the metrics with the ETH address then I think we should be able to group the metrics by ETH address which would provide an exact mapping of metrics to stable O identifiers.

@red-0ne @leszko Would it be possible to also tag metrics for this issue with the ETH address?

leszko (Contributor, Author) commented May 12, 2022

Yeah, it makes sense.

I just wonder: if we tag with the ETH address, then what about off-chain Os? I think we'll also need metrics for off-chain Os. So probably the best option is to attach two tags: serviceURI AND ethAddress.

Wdyt? @yondonfu @red-0ne

red-0ne (Contributor) commented May 12, 2022

I opt for tagging them both too. We could always filter on nullish values to get the off-chain Os.
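
As a rough sketch of what the two-tag approach could look like (key names are placeholders, not the final implementation): both tags get attached where the session is known, and the address tag stays empty for off-chain Os so it can be filtered on.

```go
package census

import (
	"context"

	"go.opencensus.io/tag"
)

// Placeholder tag keys for illustration.
var (
	kOrchestratorURI     = tag.MustNewKey("orchestrator_uri")
	kOrchestratorAddress = tag.MustNewKey("orchestrator_address")
)

// withOrchestratorTags attaches both identifiers to ctx. ethAddr is empty for
// off-chain orchestrators, so per-address queries can filter those out while
// per-URI queries still cover every orchestrator.
func withOrchestratorTags(ctx context.Context, uri, ethAddr string) context.Context {
	ctx, _ = tag.New(ctx,
		tag.Upsert(kOrchestratorURI, uri),
		tag.Upsert(kOrchestratorAddress, ethAddr),
	)
	return ctx
}
```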
