Cloud usage monitoring and alerting infrastructure and process #328

Closed · 9 of 10 tasks · Tracked by #919
yuvipanda opened this issue Mar 27, 2021 · 20 comments
Labels: Enhancement (an improvement to something or creating something new)

@yuvipanda (Member) commented Mar 27, 2021

Description of problem and opportunity to address it

Problem description
In #908 we ran into a case where a user was abusing the JupyterHub for crypto mining. This resulted in a lot of stress and high costs for the hub's community. Part of the problem was that we did not detect the mining activity for several weeks. This activity was basically:

  • The steady creation of new users on the hub
  • Each user maxing out their CPU and never shutting down their session

Proposed solution
We should create a mechanism for automatically monitoring statistics around hub usage, and triggering notifications that suggest something nefarious is happening. Ideally, this would be a single process for all of our clusters, not one process for each cluster.

We need a quick way to:

  1. Keep an eye on all these projects in one place
  2. Have automated alerts for abnormal costs
  3. Do rounds of cost optimizations

What's the value and who would benefit
This would allow us to minimize the risk of abuse if somebody did try to use a hub for the wrong purposes. It would give our team more confidence that nothing is happening without our knowledge, and would give communities more confidence that they won't have an unexpected spike in their cloud bill.

Implementation guide and constraints

A rough idea of what to try:

  • Set up a Grafana dashboard that aggregates activity across all of our clusters (this will be tricky because the Prometheus instances are private for our clusters, not public like the Binder ones).

  • Define a few metrics that are particularly useful for identifying abuse and other problematic, abnormal behavior. For example, here are two panels from the openscapes Grafana that were particularly useful:

    • Users over time
    • CPU usage histogram over time
    • In general, 5xx errors from user pods are a good indication that something is wrong.
  • Define some thresholds for these metrics, and create a reporting mechanism to ping [email protected] when it thinks something problematic is going on (a sketch of what such an alert could look like follows this list).
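
To make the thresholds concrete, here is a rough sketch of what such an alerting rule could look like, assuming a standard z2jh setup where user pods are named `jupyter-<username>` and cAdvisor's `container_cpu_usage_seconds_total` metric is available. The metric labels, thresholds, and the routing to [email protected] are all assumptions to be tuned, not a finished design:

```yaml
# Hypothetical Prometheus alerting rule: fire when many user pods sit near a
# full CPU for hours, which is roughly what the cryptomining incident in #908
# looked like. Metric names, label matchers and thresholds are assumptions.
groups:
  - name: abuse-detection
    rules:
      - alert: SustainedHighUserCPU
        expr: |
          count(
            sum by (pod) (
              rate(container_cpu_usage_seconds_total{pod=~"jupyter-.*"}[30m])
            ) > 0.9
          ) > 10
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "More than 10 user pods have been pegged near a full CPU for over 2 hours"
```

The alert itself could then be routed to the support address via Alertmanager's email receiver or via Grafana alerting, whichever we end up centralizing on.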

Issues where we have been bitten by this

Updates and ongoing work

2022-01-06

@GeorgianaElena is going to work on these things for one week:

See #328 (comment) for more details!

2022-01-19

Some meeting notes around here: #328 (comment)

We agreed that the best way forward is to start by implementing option 1 from the HackMD above, which is to follow the mybinder.org model of one Grafana with multiple data sources.

Our next steps here are to:

2022-03-30

From #328 (comment)

@yuvipanda added the goal label Mar 27, 2021
@yuvipanda (Member, Author)

We could have a centralized organizational Grafana board that can pull in data from different sources. Since GCP exports billing data to BigQuery, we can use a BigQuery datasource in this Grafana to display these graphs.
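
As a rough illustration, a BigQuery datasource in Grafana could back a cost panel with a query along these lines against the standard billing export table. The project, dataset and table names are placeholders, and the column names follow GCP's documented billing export schema, so treat this as a sketch to verify:

```sql
-- Hypothetical per-project daily cost query for a Grafana panel.
-- `my-billing-project.billing_export.gcp_billing_export_v1_XXXXXX` is a placeholder.
SELECT
  DATE(usage_start_time) AS day,
  project.id AS project_id,
  SUM(cost) AS daily_cost
FROM `my-billing-project.billing_export.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day, project_id
ORDER BY day
```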

@choldgraf (Member) commented Jan 3, 2022

We had an incident that is related to this issue: #908

I think that we should prioritize this one at least at an MVP level, so that we can give communities some assurance that they won't incur huge cloud costs.

@choldgraf changed the title from "Establish process for keeping an eye on cloud costs" to "Cloud cost monitoring infrastructure and process" on Jan 3, 2022
@choldgraf changed the title from "Cloud cost monitoring infrastructure and process" to "Cloud cost monitoring and notification infrastructure and process" on Jan 4, 2022
@choldgraf (Member)

Update

@GeorgianaElena and I discussed this one a bit today, and she'd be interested in giving it a shot to build out an MVP. We discussed two options:

  1. Build a simple reporting mechanism for each of our cluster grafana boards (e.g. send an email to support@ when a specific metric hits some threshold)
  2. Aggregate the prometheus feeds into a single Grafana, and use one or two graphs there to do the emailing from a single place, rather than from each cluster-specific Grafana.

We agreed that number 2 would be preferred, as long as there wasn't too much complexity in aggregating the prometheus feeds from each cluster.

Plan

@GeorgianaElena would like to spend a week answering these questions:

  1. How complex will it be to aggregate feeds from each cluster's Prometheus?
  2. What are 1 or 2 graphs / metrics to use for our reporting?

In a week, we can re-convene and decide whether to take approach 1 or approach 2 for now.

@GeorgianaElena (Member)

How complex will it be to aggregate feeds from each cluster's Prometheus?

I'm not ready yet to provide a super clear path forward for this, but I'll leave a few ideas here that I'm planning to revisit on Monday with pros and cons:

Decide between:

  1. Using a Prometheus federation setup, with an aggregator Prometheus instance + central Grafana deployment in a new cluster, or using something like Thanos (a federation scrape config is sketched after this comment).

  2. Keeping the private Prometheus instances and authenticating access to them, e.g. with basic auth, so a central Grafana can query them directly.

I noticed that the mybinder Grafana pulls data from a private GESIS Prometheus instance. Or so I think. This line of code, https://github.com/jupyterhub/mybinder.org-deploy/blob/master/grafana-data/datasources.json#L4, implies that the Prometheus instance uses basic auth.

However, I don't think it works, as I don't see any data in the mybinder dashboard for the GESIS cluster 😕
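
For reference, option 1 (a federation/aggregator setup) would boil down to scrape configs roughly like the sketch below on the aggregator Prometheus. The job name, hostname, credentials path and match[] selector are all placeholders:

```yaml
# Hypothetical federation job on an "aggregator" Prometheus that pulls all
# series from one cluster's (basic-auth protected) Prometheus.
scrape_configs:
  - job_name: "federate-openscapes"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'
    scheme: https
    basic_auth:
      username: openscapes
      password_file: /etc/prometheus/secrets/openscapes-password
    static_configs:
      - targets: ["prometheus.openscapes.example.org"]
```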

@damianavila (Contributor) commented Jan 18, 2022

@GeorgianaElena, what would be the pros and cons for those 2 options?

I can guess but you surely have more context to perform that comparison.

@choldgraf (Member)

I had a quick conversation with @GeorgianaElena today about this. I think her plan is to share a short write-up about these options and the research she's done, with the goal of discussing as a team tomorrow what a good step forward would be.

Some major things to include:

  • Pros / cons (as best we understand it) of the options
  • Any unanswered questions we think are important and should discuss

I think tomorrow we should decide if we have enough information to just move forward and try implementing something.

@GeorgianaElena (Member)

More info about the reading I did here ➡️ https://hackmd.io/HqE3RgjtTBq1MuofvAiLlQ?view

@yuvipanda (Member, Author)

Wow, thank you so much for doing this research, @GeorgianaElena.

I love idea 1, which would be to use a central grafana that can talk to all the prometheuses. Prometheus supports basic auth (https://prometheus.io/docs/guides/basic-auth/) and grafana supports using that (https://grafana.com/docs/grafana/latest/datasources/prometheus/). So perhaps in our prometheus helm chart config in our support chart, we can set up an ingress (to allow traffic in) as well as basic authentication, and use a central grafana to access that via individual prometheus data sources. All alerts could also live in this central grafana.

Thank you for doing all this research! TIL about Thanos and Cortex :)
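
A minimal sketch of what the setup described above could look like in a cluster's support chart values, assuming the community prometheus chart's `server.ingress` options and nginx-ingress's basic-auth annotations. The hostname, secret names and chart nesting are placeholders to adapt to our actual chart layout:

```yaml
# Hypothetical support-chart values: expose each cluster's Prometheus through
# an nginx ingress protected by basic auth, so a central Grafana can add it as
# a data source over HTTPS.
prometheus:
  server:
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
        nginx.ingress.kubernetes.io/auth-type: basic
        # Secret with an htpasswd-formatted "auth" key (see nginx-ingress docs)
        nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
        nginx.ingress.kubernetes.io/auth-realm: "Authentication required"
      hosts:
        - prometheus.<cluster>.example.org
      tls:
        - secretName: prometheus-tls
          hosts:
            - prometheus.<cluster>.example.org
```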

@choldgraf (Member)

Planning Meeting / Next Steps

  • Noticed that there's no Grafana reporting for GESIS; did this used to work? GESIS was authenticated, so maybe we could follow that pattern.
  • The "aims of the project" for Thanos seem to support our general use case of having many distributed prometheus instances.
  • We should keep the per-cluster Grafana reporting for communities, so that they have the ability to re-use the same infrastructure if they wanted to move away from 2i2c (from a right to replicate perspective)

Next Steps

  • Try implementing Georgiana's proposal number 1 in the hackmd (centralized grafana that pulls in many prometheus sources)
  • Decide if this is the right approach when we've implemented it and can understand its complexity a bit better.

@choldgraf (Member)

I've updated the top comment to track our latest conversations and planning. I think we have two things missing from the above:

@GeorgianaElena (Member)

I've updated the top comment to track our latest conversations and planning. I think we have two things missing from the above:

Thanks a lot @choldgraf!

  • A time box - how much time would we like to invest in giving this a shot?

Let's shoot for what's left of this sprint and revisit during our next meeting?

I started with the CILogon one and opened #941 to address that.

@choldgraf (Member)

A quick update on this one as we discussed this a bit in the sustainability team meeting. Another useful aspect of reporting and monitoring is if we can quickly answer questions that demonstrate usage and impact for our hubs. So things like:

  • How many unique users logged in to (all, or a specific) 2i2c hub today/this week/this month?
  • How much total time was spent in interactive sessions for all users?
  • How long is the average user session? How many times a week do they log in?
  • Any kind of usage of specific services like Dask Gateway

I think most of these should be doable with the Prometheus metrics that each hub is generating; I'm just listing them here so we don't lose track of it! (A couple of example queries are sketched below.)
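
Some of these questions might map fairly directly onto JupyterHub's built-in Prometheus metrics. A couple of hedged examples, since the exact metric names and labels depend on the JupyterHub version we run:

```promql
# Unique active users over the past day, if the hub exports the
# jupyterhub_active_users gauge with a "period" label (assumption to verify).
jupyterhub_active_users{period="24h"}

# Rough total interactive-session time over the past week, approximated by
# integrating the jupyterhub_running_servers gauge sampled every 5 minutes
# (result is in server-seconds).
sum_over_time(jupyterhub_running_servers[7d:5m]) * 300
```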

@choldgraf (Member)

A quick note here - I believe @yuvipanda is planning to work on #730 soon, and we thought it'd be good for him and @GeorgianaElena to coordinate a bit, since it's related to this one too. For example, we might want to use the same centralized Grafana dashboard to do reporting both for "usage alerts" and "cost reports".

@choldgraf changed the title from "Cloud cost monitoring and notification infrastructure and process" to "Cloud usage monitoring and alerting infrastructure and process" on Mar 12, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 12, 2022
This is the beginning of implementing idea 1 from the
list @GeorgianaElena made in
2i2c-org#328 (comment).

We have one prometheus running per cluster, but manage many clusters.
A single grafana that can connect to all such prometheus clusters
will help with monitoring as well as reporting. So we need to expose
it as securely as possible to the external world, as it can contain
private information.

In this case, we're using https + basic auth provided by
nginx-ingress
(https://kubernetes.github.io/ingress-nginx/examples/auth/basic/)
to safely expose prometheus to the outside world. We can then
use a grafana that knows these username / passwords to access this
prometheus instance. Each cluster needs its own username / password
(generated with pwgen 64 1), so users in one cluster can not access
prometheus for another cluster.

Ref 2i2c-org#328
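
For context, the nginx-ingress basic-auth setup referenced in that commit expects a Kubernetes secret with an htpasswd-formatted `auth` key; creating it per cluster would look roughly like this (namespace, username and secret name are placeholders):

```bash
# Hypothetical per-cluster setup of the basic-auth secret used by the ingress.
PASSWORD=$(pwgen 64 1)
htpasswd -cb auth prometheus "${PASSWORD}"
kubectl --namespace support create secret generic prometheus-basic-auth --from-file=auth
# The same username/password pair is what the central Grafana (or a federating
# Prometheus) would use in its data source / basic_auth config.
```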
@GeorgianaElena (Member)

@yuvipanda, now that #1091 has been deployed, the next step would be to list those prometheus instances as datasources for a central grafana, right? A few questions/thoughts about this:

  • Should we reuse an existing Grafana, for example the 2i2c one, as @consideRatio proposed/assumed in Expose prometheus with basic auth #1091 (review)?
  • Setting up prometheus datasources would require:
    • Using the Grafana UI to list each cluster's prometheus instances as datasources
    • Exporting that config and storing it in our repo
    • Making the config reproducible and persistent between deploys by creating a script that allows exporting/importing that config into that central grafana (maybe similar to the mybinder grafana-export), either here or upstream in jupyterhub/grafana-dashboards (a sketch of such a script is below)
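
A minimal sketch of what such an export/import script could look like against Grafana's HTTP API (`/api/datasources`). The Grafana URL and token handling are placeholders, and note that Grafana never returns secrets like basic-auth passwords, so those would still need to come from our encrypted config:

```python
# Hypothetical export/import of the central Grafana's datasource config, so it
# can be stored in the repo and re-applied after a redeploy.
import json
import os

import requests

GRAFANA_URL = "https://grafana.example.org"  # placeholder for the central Grafana
HEADERS = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}


def export_datasources(path="grafana-datasources.json"):
    """Dump all datasources (minus secrets, which the API never returns)."""
    resp = requests.get(f"{GRAFANA_URL}/api/datasources", headers=HEADERS)
    resp.raise_for_status()
    with open(path, "w") as f:
        json.dump(resp.json(), f, indent=2)


def import_datasources(path="grafana-datasources.json"):
    """Re-create datasources from the exported file."""
    with open(path) as f:
        for ds in json.load(f):
            ds.pop("id", None)  # ids are assigned by the target Grafana
            requests.post(
                f"{GRAFANA_URL}/api/datasources", headers=HEADERS, json=ds
            ).raise_for_status()
```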

@choldgraf (Member)

Update: we'd like to prioritize this!

We discussed this topic in our team meeting today, and there was general agreement that improving our reporting and alerting infrastructure would be a good investment of our time. Essentially the argument boiled down to this:

  • The most stressful time for our team is when there are major incidents that require immediate action.
  • The best way to deal with this is to prevent incidents from happening in general
  • A "problem" only becomes an "incident" when a user is actually affected by it.
  • We can potentially resolve "problems" before they become "incidents" by catching them ahead of time.
  • If we improve our reporting infrastructure, we can be alerted to "problems" before they become "incidents"
  • This would hopefully significantly reduce the stress associated with support/operations and major incidents.

@yuvipanda (Member, Author)

@GeorgianaElena yeah, designating the existing grafana as a 'central grafana' seems like the way to go.

I think next steps here are:

  1. Write a script that'll read all the encrypted grafana secrets, and put them in the centralized grafana as data sources via the grafana API (a sketch of the datasource payload such a script would send follows this list)
  2. Update the upstream jupyterhub/grafana-dashboard repo to support multiple datasources, via a datasource template variable. I removed that as part of commit 763c28acad89c9d7e95a860c54f004b6bf738240 in that repo - it just needs to be put back.
  3. Deploy support charts in the few clusters where we don't currently have them deployed! I think that's meom-ige and farallon? We will need to tune their resource requests to match the smaller clusters.
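
For step 1, the per-cluster payload such a script would POST to the central Grafana's `/api/datasources` endpoint would look roughly like this. The cluster name, URL and credentials are placeholders to be filled in from each cluster's decrypted support secrets:

```python
# Hypothetical datasource definition for one cluster's Prometheus, using the
# basic-auth credentials set up by the support chart.
datasource = {
    "name": "openscapes",                                  # one datasource per cluster
    "type": "prometheus",
    "access": "proxy",
    "url": "https://prometheus.openscapes.example.org",    # placeholder host
    "basicAuth": True,
    "basicAuthUser": "openscapes",
    "secureJsonData": {"basicAuthPassword": "<from the encrypted secret>"},
}
```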

@GeorgianaElena (Member)

@yuvipanda thanks a lot for the details 🚀 ! I think I have bandwidth to start working on this, using the steps you provided. But I will probably need some help/input from time to time. Do you think you have bandwidth to help out with this one or split the work somehow?

cc @damianavila

@yuvipanda (Member, Author)

@GeorgianaElena absolutely have the bandwidth to help out :)

@GeorgianaElena (Member)

This issue has become quite big, so I'm going to close it now, since the monitoring infra is mostly in place, and track the alerting part in separate issues.

Get context and track progress

The "Updates and ongoing work" section in the initial comment has info about what has been achieved and links to issues that are still open. There's also a project board for this project.

@damianavila (Contributor)

Thank you for all the hard work you have done on this one, @GeorgianaElena!
