Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wsm-bridge] Dashboard with health metrics #9584

Merged
merged 1 commit into from
Apr 29, 2022
Merged

Conversation

easyCZ
Copy link
Member

@easyCZ easyCZ commented Apr 27, 2022

Description

Adds the following:

  • Labels y-axis
  • Replaces existing single metric with rate of events with the following:
    • Incoming events total
    • Succesful/error completed events
    • Event lag - (incoming - outgoing)
    • Event processing latency
  • All graphs work with dropdown toggles in the dashboard

Screenshot 2022-04-27 at 14 29 32

You can see how the dashboard looks before this change here

Related Issue(s)

How to test

  • Load the JSON into the staging grafana and check the dashboard works. There may be a better way but I haven't discovered it yet.

Release Notes

[ws-manager-bridge] Add health metrics to grafana dashboard

Documentation

NONE

@easyCZ easyCZ requested a review from a team April 27, 2022 12:26
@github-actions github-actions bot added the team: webapp Issue belongs to the WebApp team label Apr 27, 2022
@easyCZ easyCZ force-pushed the mp/wsmb-health-metrics branch from b447202 to a490f79 Compare April 27, 2022 12:29
@geropl
Copy link
Member

geropl commented Apr 27, 2022

Nice, I like that! I have two questions:

  • latency: would it make sense to add an "Inf+" bucket after 2s, to separate the two?
    image
  • "read/write replicas" is new to me; is that referrring to whether we're writing to the DB or not? If yes, I'd say let's call re-phrase to "events received by by governed vs. non-governed clusters". IMO this is the qualifying distinction.

@geropl geropl self-assigned this Apr 27, 2022
@easyCZ
Copy link
Member Author

easyCZ commented Apr 28, 2022

@geropl Thanks for the review.

latency: would it make sense to add an "Inf+" bucket after 2s, to separate the two?

The y-axis is dynamic, so when there's actual data that would reach the Inf bucket, it will show up. It doesn't because it hasn't taken more than 2 secs. Once this lands, I'll also adjust the buckets to show slightly better granularity as we're not getting much detail from this view.

"read/write replicas" is new to me; is that referrring to whether we're writing to the DB or not? If yes, I'd say let's call re-phrase to "events received by by governed vs. non-governed clusters". IMO this is the qualifying distinction.

Fair. I guess we already have an inconcistency in the code where we don't really use the govern terminology and we use db_write (which suggests read/write). I don't really mind the change but in practice, the fact whether it's governing or not is a higher level detail to what WSM Bridge deals with. WSMB is only told whether it should write or not so the governing concept is a higher order concept in this case.

@geropl
Copy link
Member

geropl commented Apr 28, 2022

The y-axis is dynamic, so when there's actual data that would reach the Inf bucket, it will show up. It doesn't because it hasn't taken more than 2 secs.

Ok! So far we used static axes containing all buckets AFAIK. Maybe that would prevent this kind of confusion...?

"read/write replicas" is new to me; is that referrring to whether we're writing to the DB or not? If yes, I'd say let's call re-phrase to "events received by by governed vs. non-governed clusters". IMO this is the qualifying distinction.

Fair. I guess we already have an inconcistency in the code where we don't really use the govern terminology and we use db_write (which suggests read/write). I don't really mind the change but in practice, the fact whether it's governing or not is a higher level detail to what WSM Bridge deals with. WSMB is only told whether it should write or not so the governing concept is a higher order concept in this case.

That's the mismatch I wanted to get at. ws-manager-bridge knows both concepts, it's just that an individual bridge only cares about "writeToDB".
If we continue to name it "writeToDB", then this should be clear in the graph title that we're talking about "accumulated updates from all bridges (writeToDB: true/false)" (or sth). If the graph is meant to be interpreted as per ws-manager-bridge instance (as the dashboard context suggests without further clarification) it should be more a "updates from all bridges (govern: true/false)" (or so), and maybe we could rename the field in the metric.

@easyCZ
Copy link
Member Author

easyCZ commented Apr 28, 2022

Ok! So far we used static axes containing all buckets AFAIK. Maybe that would prevent this kind of confusion...?

IMO, this is not desirable as it massively stretches the y-axis to the point where you can't really make out blips in the latency. In this format, any case where the latency goes to Inf, you'll see massive spike anyway, but for the normal run case, you get better detail of the units on the y-axis.

How about we try this and if it doesn't work we change it?

If we continue to name it "writeToDB", then this should be clear in the graph title that we're talking about

Fair. This ultimately depends on who the consumer of the dashboard is. Right now, it's used by our team but also outside teams. For this use-case, we should then go with governing as that's the official terminology we're telling the world about.

For a more detailed dashboard intended to be used by our team only (more granual metrics) it'd probs be better to use db_write to be consistent with code (and here we expect a level of familiarity).

I'll use governing in this dashboard. I'll also do a follow-up to the metric to change the label to govern.

Copy link
Member

@geropl geropl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for refining the title! 🙏

@roboquat roboquat merged commit 0b2aa0d into main Apr 29, 2022
@roboquat roboquat deleted the mp/wsmb-health-metrics branch April 29, 2022 13:29
@roboquat roboquat added deployed: webapp Meta team change is running in production deployed Change is completely running in production labels May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: webapp Meta team change is running in production deployed Change is completely running in production release-note size/XL team: webapp Issue belongs to the WebApp team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants