Cloud usage monitoring and alerting infrastructure and process #328
Comments
We could have a centralized organizational Grafana board that can pull in data from different sources. Since GCP exports billing data to BigQuery, we can use a BigQuery datasource in this Grafana to display these graphs.
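For context, the billing export lands in an ordinary BigQuery table, so the same data Grafana would graph can also be pulled with a short script. A minimal sketch, assuming a hypothetical project and export table name (real billing export tables are named after the billing account):

```python
# Minimal sketch: summarise daily cost per project from the GCP billing export.
# The project and table names below are placeholders, not real deployment values.
from google.cloud import bigquery

client = bigquery.Client(project="example-gcp-project")

query = """
SELECT
  project.id AS project_id,
  DATE(usage_start_time) AS usage_date,
  SUM(cost) AS daily_cost
FROM `example-gcp-project.billing.gcp_billing_export_v1_XXXXXX`
GROUP BY project_id, usage_date
ORDER BY usage_date DESC
"""

for row in client.query(query):  # iterating the query job waits for and streams results
    print(row.project_id, row.usage_date, row.daily_cost)
```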
We had an incident related to this issue: #908. I think we should prioritize this one at least at an MVP level, so that we can give communities some assurance that they won't incur huge cloud costs.
Update
@GeorgianaElena and I discussed this one a bit today, and she'd be interested in giving it a shot to build out an MVP. We discussed two options:
We agreed that option 2 would be preferred, as long as there wasn't too much complexity in aggregating the Prometheus feeds from each cluster.
Plan
@GeorgianaElena would like to spend a week answering these questions:
In a week, we can re-convene and decide whether to take approach 1 or approach 2 for now.
I'm not ready yet to provide a super clear path forward for this, but I'll leave here a few ideas that I'm planning to revisit on Monday, with pros and cons:
Decide between:
I noticed that [...]. However, I don't think it works, as I don't see any data in the mybinder dashboard for the gesis cluster 😕
@GeorgianaElena, what would be the pros and cons of those two options? I can guess, but you surely have more context to perform that comparison.
I had a quick conversation with @GeorgianaElena today about this. I think her plan is to share a short write-up about these options and the research she's done, with the goal of discussing as a team tomorrow what a good step forward would be. Some major things to include:
I think tomorrow we should decide if we have enough information to just move forward and try implementing something.
More info about the reading I did here ➡️ https://hackmd.io/HqE3RgjtTBq1MuofvAiLlQ?view
Wow, thank you so much for doing this research, @GeorgianaElena. I love idea 1, which would be to use a central Grafana that can talk to all the Prometheus instances. Prometheus supports basic auth (https://prometheus.io/docs/guides/basic-auth/) and Grafana supports using that in its datasources (https://grafana.com/docs/grafana/latest/datasources/prometheus/). So perhaps in the Prometheus config of our support chart, we can set up an ingress (to allow traffic in) as well as basic authentication, and use a central Grafana to access each cluster via an individual Prometheus data source. All alerts could also live in this central Grafana. Thank you for doing all this research! TIL about Thanos and Cortex :)
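To make that shape concrete, here is a minimal sketch of querying one such basic-auth-protected Prometheus over HTTPS; the hostname, credentials, and example query are placeholders, and the central Grafana would do the equivalent through its Prometheus datasource settings rather than a script:

```python
# Minimal sketch: query a per-cluster Prometheus that sits behind an ingress
# with basic auth. Hostname, credentials, and query are illustrative placeholders.
import requests

PROMETHEUS_URL = "https://prometheus.example-cluster.example.org"
AUTH = ("example-cluster-user", "example-generated-password")

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "sum(kube_pod_status_phase{phase='Running'})"},
    auth=AUTH,  # HTTP basic auth, the same mechanism a Grafana datasource would use
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```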
Planning Meeting / Next Steps
Next Steps
I've updated the top comment to track our latest conversations and planning. I think we have two things missing from the above:
Thanks a lot @choldgraf!
Let's shoot for what's left of this sprint and revisit during our next meeting?
I started with the CILogon one and opened #941 to address that.
A quick update on this one, as we discussed it a bit in the sustainability team meeting. Another useful aspect of reporting and monitoring is being able to quickly answer questions that demonstrate usage and impact for our hubs. Things like:
I think most of these should be doable with the Prometheus data that each hub is generating; just listing them here so we don't lose track of it!
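As a rough illustration of answering such questions from a hub's Prometheus, here is a sketch that pulls a month of running-server counts; the URL, credentials, and the choice of the `jupyterhub_running_servers` gauge are assumptions for illustration, not a finished report:

```python
# Minimal sketch: how many user servers were running each day over the last 30 days,
# using Prometheus's range-query API. URL and credentials are placeholders.
from datetime import datetime, timedelta, timezone

import requests

PROMETHEUS_URL = "https://prometheus.example-cluster.example.org"
AUTH = ("example-cluster-user", "example-generated-password")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": "sum(jupyterhub_running_servers)",
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "1d",
    },
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(datetime.fromtimestamp(timestamp, timezone.utc).date(), value)
```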
A quick note here - I believe @yuvipanda is planning to work on #730 soon, and we thought it'd be good for him and @GeorgianaElena to coordinate a bit, since it's related to this one too. For example, we might want to use the same centralized Grafana dashboard to do reporting both for "usage alerts" and "cost reports".
This is the beginning of implementing idea 1 from the list @GeorgianaElena made in 2i2c-org#328 (comment). We have one Prometheus running per cluster, but manage many clusters. A single Grafana that can connect to all of those Prometheus instances will help with monitoring as well as reporting. So we need to expose each Prometheus as securely as possible to the external world, as it can contain private information. In this case, we're using https + basic auth provided by nginx-ingress (https://kubernetes.github.io/ingress-nginx/examples/auth/basic/) to safely expose Prometheus to the outside world. We can then use a Grafana that knows these usernames / passwords to access the Prometheus instances. Each cluster needs its own username / password (generated with pwgen 64 1), so users in one cluster cannot access Prometheus for another cluster. Ref 2i2c-org#328
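For reference, a roughly equivalent way to generate such a per-cluster password, if `pwgen` isn't at hand, purely as an illustration of the `pwgen 64 1` step above:

```python
# Illustrative stand-in for `pwgen 64 1`: a 64-character alphanumeric secret.
import secrets
import string

alphabet = string.ascii_letters + string.digits
password = "".join(secrets.choice(alphabet) for _ in range(64))
print(password)
```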
@yuvipanda, now that #1091 was deployed, the next step would be to list those Prometheus instances as datasources for a central Grafana, right? A few questions/thoughts about this:
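One hedged sketch of that step (registering each cluster's Prometheus as a datasource on the central Grafana via its HTTP API) could look like the following; the Grafana URL, token, and credentials are placeholders, and in practice the same thing could be done through datasource provisioning in the helm chart instead:

```python
# Minimal sketch: add a basic-auth-protected Prometheus as a Grafana datasource
# via Grafana's HTTP API. All URLs, names, and secrets are illustrative only.
import requests

GRAFANA_URL = "https://grafana.central.example.org"
GRAFANA_TOKEN = "example-admin-api-token"

datasource = {
    "name": "prometheus-example-cluster",
    "type": "prometheus",
    "url": "https://prometheus.example-cluster.example.org",
    "access": "proxy",
    "basicAuth": True,
    "basicAuthUser": "example-cluster-user",
    "secureJsonData": {"basicAuthPassword": "example-generated-password"},
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=datasource,
    headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```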
Update: we'd like to prioritize this!
We discussed this topic in our team meeting today, and there was general agreement that improving our reporting and alerting infrastructure would be a good investment of our time. Essentially, the argument boiled down to this:
@GeorgianaElena yeah designating the existing grafana as a 'central grafana' seems the way to go. I think next steps here are:
@yuvipanda thanks a lot for the details 🚀! I think I have bandwidth to start working on this, using the steps you provided. But I will probably need some help/input from time to time. Do you think you have bandwidth to help out with this one or split the work somehow? cc @damianavila
@GeorgianaElena absolutely have the bandwidth to help out :)
This issue has become quite big, so I'm going to close it now, since the monitoring infra is mostly in place, and track the alerting part in separate issues.
Get context and track progress
Thank you for all the hard work you have done on this one, @GeorgianaElena!
Description of problem and opportunity to address it
Problem description
In #908 we ran into a case where a user was abusing the JupyterHub for crypto mining. This resulted in a lot of stress and high costs for the hub's community. Part of the problem was that we did not detect the mining activity for several weeks. This activity was basically:
Proposed solution
We should create a mechanism for automatically monitoring statistics around hub usage and triggering notifications when something nefarious seems to be happening. Ideally, this would be a single process covering all of our clusters, rather than one process per cluster.
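A minimal sketch of what such a mechanism could boil down to, assuming placeholder cluster URLs, credentials, metric, threshold, and mail setup (none of these are decided yet, and the support address below is a stand-in for the one referenced later):

```python
# Minimal sketch: poll each cluster's Prometheus, compare a usage metric against
# a threshold, and email the support address if it looks abnormal.
# All names, thresholds, and the SMTP relay are illustrative assumptions.
import smtplib
from email.message import EmailMessage

import requests

CLUSTERS = {
    "example-cluster": ("https://prometheus.example-cluster.example.org", ("user", "password")),
}
CPU_CORES_THRESHOLD = 50.0  # total user-pod CPU cores considered "suspiciously high" (assumed)
SUPPORT_EMAIL = "support@example.org"  # placeholder for the real support address


def user_pod_cpu_cores(prometheus_url, auth):
    """Sum of CPU cores currently used by pods in the (assumed) hub namespace."""
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": "sum(rate(container_cpu_usage_seconds_total{namespace='example-hub'}[5m]))"},
        auth=auth,
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


def send_alert(cluster, cores):
    msg = EmailMessage()
    msg["Subject"] = f"[usage alert] {cluster}: {cores:.1f} CPU cores in use"
    msg["From"] = "alerts@example.org"
    msg["To"] = SUPPORT_EMAIL
    msg.set_content(f"Cluster {cluster} is using {cores:.1f} CPU cores in user pods.")
    with smtplib.SMTP("localhost") as smtp:  # assumed local mail relay
        smtp.send_message(msg)


for cluster, (url, auth) in CLUSTERS.items():
    cores = user_pod_cpu_cores(url, auth)
    if cores > CPU_CORES_THRESHOLD:
        send_alert(cluster, cores)
```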
We need a quick way to:
What's the value and who would benefit
This would allow us to minimize the risk of abuse if somebody did try to use a hub for the wrong purposes. It would give our team more confidence that nothing is happening without us knowing about it, and would give communities more confidence that they won't have an unexpected spike in their cloud bill.
Implementation guide and constraints
A rough idea of what to try:
Set up a Grafana dashboard that aggregates activity across all of our clusters (this will be tricky because the Prometheus instances are private for our clusters, not public like the Binder ones).
Define a few metrics that are particularly useful for identifying abuse and problematic abnormal behavior. For example, two images from the openscapes grafana were particularly useful; in general, 5xx errors from user pods are a good indication that something is wrong.
Define some thresholds for these metrics, and create a reporting mechanism to ping [email protected] when it thinks something problematic is going on.
Issues where we have been bitten by this
Updates and ongoing work
2022-01-06
@GeorgianaElena is going to work on these things for one week:
See #328 (comment) for more details!
2022-01-19
Some meeting notes around here: #328 (comment)
We agreed that the best way forward is to start by implementing option 1 from the HackMD above, which is to follow the mybinder.org model of one Grafana with multiple data sources.
Our next steps here are to:
2022-03-30
From #328 (comment)