[wsm-bridge] Dashboard with health metrics #9584

easyCZ · 2022-04-27T12:26:34Z

Description

Adds the following:

Labels the y-axis
Replaces the existing single rate-of-events metric with the following:
- Incoming events total
- Successful/error completed events
- Event lag - (incoming - outgoing)
- Event processing latency
All graphs work with dropdown toggles in the dashboard
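
As a rough illustration of the event lag metric listed above, lag can be read as the number of events received minus the number completed. The Python sketch below assumes "outgoing" means successful plus error completions; `EventCounters` and `event_lag` are hypothetical names, not the dashboard query or the bridge's code.

```python
from dataclasses import dataclass


@dataclass
class EventCounters:
    """Hypothetical snapshot of monotonically increasing event counters."""
    incoming_total: int
    completed_ok_total: int
    completed_error_total: int


def event_lag(c: EventCounters) -> int:
    """Events received but not yet completed (incoming - outgoing).

    Assumption: "outgoing" is the sum of successful and error completions.
    """
    outgoing = c.completed_ok_total + c.completed_error_total
    return c.incoming_total - outgoing


snapshot = EventCounters(incoming_total=120, completed_ok_total=110, completed_error_total=4)
print(event_lag(snapshot))  # 6
```

A lag that keeps growing would suggest events arrive faster than they complete; a lag near zero suggests processing keeps up.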

You can see how the dashboard looks before this change here

Related Issue(s)

Relates to [bridge] Ensure we do not override "failed" instance states (affects prebuilds!) #8596

How to test

Load the JSON into the staging Grafana and check that the dashboard works. There may be a better way, but I haven't discovered it yet.

Release Notes

[ws-manager-bridge] Add health metrics to grafana dashboard

Documentation

NONE

geropl · 2022-04-27T15:54:16Z

Nice, I like that! I have two questions:

latency: would it make sense to add an "Inf+" bucket after 2s, to separate the two?
"read/write replicas" is new to me; is that referring to whether we're writing to the DB or not? If yes, I'd say let's re-phrase to "events received by governed vs. non-governed clusters". IMO this is the qualifying distinction.

easyCZ · 2022-04-28T06:48:06Z

@geropl Thanks for the review.

latency: would it make sense to add an "Inf+" bucket after 2s, to separate the two?

The y-axis is dynamic, so when there's actual data that would reach the Inf bucket, it will show up. It doesn't because it hasn't taken more than 2 secs. Once this lands, I'll also adjust the buckets to show slightly better granularity as we're not getting much detail from this view.
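
To illustrate why the Inf bucket stays empty until something takes longer than 2s, here is a minimal Python sketch of Prometheus-style cumulative histogram buckets. The bucket bounds are made up for the example (only the 2s bound comes from this discussion), and `bucket_counts` and `per_bucket` are hypothetical helpers, not the bridge's metric code.

```python
import math
from typing import Dict, List

# Illustrative upper bounds in seconds; only the 2s bound comes from the discussion.
BOUNDS: List[float] = [0.1, 0.5, 1.0, 2.0, math.inf]


def bucket_counts(latencies: List[float]) -> Dict[float, int]:
    """Cumulative counts: each bucket counts observations <= its upper bound."""
    return {le: sum(1 for v in latencies if v <= le) for le in BOUNDS}


def per_bucket(cumulative: Dict[float, int]) -> Dict[float, int]:
    """Convert cumulative counts into counts that fall into each single bucket."""
    result: Dict[float, int] = {}
    previous = 0
    for le in BOUNDS:
        result[le] = cumulative[le] - previous
        previous = cumulative[le]
    return result


fast = per_bucket(bucket_counts([0.05, 0.3, 0.7, 1.5]))
slow = per_bucket(bucket_counts([0.05, 0.3, 3.2]))
print(fast[math.inf])  # 0: nothing took longer than 2s
print(slow[math.inf])  # 1: one observation above 2s
```

With only sub-2s observations, the Inf bucket holds zero, which is consistent with it not appearing on a dynamic axis.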

"read/write replicas" is new to me; is that referring to whether we're writing to the DB or not? If yes, I'd say let's re-phrase to "events received by governed vs. non-governed clusters". IMO this is the qualifying distinction.

Fair. I guess we already have an inconsistency in the code where we don't really use the govern terminology and we use db_write (which suggests read/write). I don't really mind the change, but in practice, whether it's governing or not is a higher-level detail than what WSM Bridge deals with. WSMB is only told whether it should write or not, so governing is a higher-order concept in this case.

geropl · 2022-04-28T06:58:45Z

The y-axis is dynamic, so when there's actual data that would reach the Inf bucket, it will show up. It doesn't because it hasn't taken more than 2 secs.

Ok! So far we used static axes containing all buckets AFAIK. Maybe that would prevent this kind of confusion...?

"read/write replicas" is new to me; is that referring to whether we're writing to the DB or not? If yes, I'd say let's re-phrase to "events received by governed vs. non-governed clusters". IMO this is the qualifying distinction.

Fair. I guess we already have an inconsistency in the code where we don't really use the govern terminology and we use db_write (which suggests read/write). I don't really mind the change, but in practice, whether it's governing or not is a higher-level detail than what WSM Bridge deals with. WSMB is only told whether it should write or not, so governing is a higher-order concept in this case.

That's the mismatch I wanted to get at. ws-manager-bridge knows both concepts, it's just that an individual bridge only cares about "writeToDB".
If we continue to name it "writeToDB", then it should be clear from the graph title that we're talking about "accumulated updates from all bridges (writeToDB: true/false)" (or sth). If the graph is meant to be interpreted as per ws-manager-bridge instance (as the dashboard context suggests without further clarification), it should be more of an "updates from all bridges (govern: true/false)" (or so), and maybe we could rename the field in the metric.

easyCZ · 2022-04-28T07:09:48Z

Ok! So far we used static axes containing all buckets AFAIK. Maybe that would prevent this kind of confusion...?

IMO, this is not desirable as it massively stretches the y-axis to the point where you can't really make out blips in the latency. In this format, in any case where the latency goes to Inf, you'll see a massive spike anyway, but for the normal run case, you get better detail of the units on the y-axis.

How about we try this and if it doesn't work we change it?

If we continue to name it "writeToDB", then this should be clear in the graph title that we're talking about

Fair. This ultimately depends on who the consumer of the dashboard is. Right now, it's used by our team but also outside teams. For this use-case, we should then go with governing as that's the official terminology we're telling the world about.

For a more detailed dashboard intended to be used by our team only (more granular metrics) it'd probs be better to use db_write to be consistent with code (and here we expect a level of familiarity).

I'll use governing in this dashboard. I'll also do a follow-up to the metric to change the label to govern.

geropl

Thx for refining the title! 🙏

easyCZ requested a review from a team April 27, 2022 12:26

roboquat added release-note size/XL labels Apr 27, 2022

github-actions bot added the team: webapp Issue belongs to the WebApp team label Apr 27, 2022

[wsm-bridge] Dashboard with health metrics

a490f79

easyCZ force-pushed the mp/wsmb-health-metrics branch from b447202 to a490f79 Compare April 27, 2022 12:29

geropl self-assigned this Apr 27, 2022

geropl approved these changes Apr 29, 2022

roboquat merged commit 0b2aa0d into main Apr 29, 2022

roboquat deleted the mp/wsmb-health-metrics branch April 29, 2022 13:29

roboquat added deployed: webapp Meta team change is running in production deployed Change is completely running in production labels May 4, 2022
