
Re-architect our Prometheus monitoring #1749

Closed
digitalronin opened this issue Mar 24, 2020 · 3 comments

@digitalronin
Contributor

Prometheus is consuming a lot of resources - so much that we have had to set up
special, more powerful nodes to run it. The current setup will not scale as we
grow the platform, and it also won't work in a multi-cluster setup (even if we
only ever have two clusters, while we are replacing an old cluster with a new
one). So, we need to figure out a solution.

Short term

As a short-term workaround, we will have two large nodes, in a single
availability zone (so they can both access the EBS volume that stores metrics
data). This gives us enough monitoring capacity for our current needs, and
relatively quick failover (~2 minutes) in the event that one node fails.
This is not an ideal solution, for the reasons mentioned above.

TODO: Create a ticket
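
For illustration only, the kind of scheduling constraint this implies for the
Prometheus pod could look something like the following (the node label and the
availability zone below are placeholders, not our actual configuration):

```yaml
# Sketch: pin Prometheus onto the dedicated monitoring nodes in a single
# availability zone, so that a replacement pod on the second node can
# re-attach the EBS-backed volume that stores the metrics data.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: monitoring-node             # hypothetical label on the two large nodes
              operator: In
              values: ["true"]
            - key: topology.kubernetes.io/zone # may be failure-domain.beta.kubernetes.io/zone on older clusters
              operator: In
              values: ["eu-west-2a"]           # single AZ, so the EBS volume stays reachable
```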

Test clusters: dedicated S3 buckets

There is a problem with test clusters and the production cluster using the same
S3 bucket to store Thanos metrics data. For live-1, processing the metrics data
from test clusters creates a (small) additional load on Thanos. For the test
clusters, there are major problems running Thanos on smaller,
test-cluster-sized nodes while trying to process the S3 bucket which holds all
our production metrics data.

A solution is to create a separate S3 metrics bucket for each test cluster,
deleting the bucket when the cluster is destroyed. This means the test cluster
Thanos instance only has to worry about its own metrics data. We will have to
create all the additional resources (IAM policies, etc.) and we need to be
careful how we associate the S3 bucket with the test cluster. If we link them
via Terraform, then we have to delete the S3 bucket when we delete the
cluster, which may not always be what we want.

TODO: Create a ticket
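
To make this concrete, each test cluster's Thanos components (sidecar, store,
compactor) would get their own object storage configuration, using the standard
Thanos S3 `objstore.yml` format. The bucket naming pattern below is only a
sketch:

```yaml
# Per-test-cluster Thanos object storage config. Each test cluster points
# at its own bucket, rather than the shared production metrics bucket.
type: S3
config:
  bucket: "cloud-platform-thanos-<test-cluster-name>"  # hypothetical naming pattern
  endpoint: "s3.eu-west-2.amazonaws.com"
  # Access would come from a per-cluster IAM policy scoped to this bucket,
  # which is one of the extra resources we would need to create.
```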

Test clusters: Make Prometheus/Thanos optional

We don't always need Prometheus/Thanos in test clusters, so we could make it an
option, or possibly only create the extra-large nodes and install
Prometheus/Thanos when the "large" test cluster size is selected (for cases
where we are testing something which does need a production-like test cluster).

TODO: Create a ticket

Explore duplicated Prometheus

Instead of having one Prometheus instance running on one of two nodes in the
same availability zone, another option is to have two separate, duplicate
Prometheus instances, in different availability zones. Both of these would
scrape the same metrics data, and rely on Thanos to deduplicate the data. In
the event that one node died, the other would already be doing everything
necessary, so there should be no downtime.

This setup would probably have unexpected side-effects, and there may be
considerations we haven't thought of yet, but it is worth exploring.

TODO: Create a ticket
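
As a rough sketch of what this could look like with the prometheus-operator
`Prometheus` resource: two identical replicas forced into different
availability zones, each carrying a replica external label which Thanos Query
can deduplicate on (e.g. `--query.replica-label=prometheus_replica`). The
names, labels and versions below are illustrative, not a worked-out design:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-operator-prometheus
  namespace: monitoring
spec:
  replicas: 2                                    # two duplicate Prometheus instances
  replicaExternalLabelName: prometheus_replica   # label Thanos Query uses to deduplicate
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: topology.kubernetes.io/zone  # force the replicas into different AZs
          labelSelector:
            matchLabels:
              prometheus: prometheus-operator-prometheus  # label the operator puts on the pods
  thanos:
    baseImage: quay.io/thanos/thanos             # sidecar so both replicas ship blocks to object storage
    version: v0.11.0
```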

A Prometheus instance per namespace

We should explore this as a potential longer-term strategy. Instead of having a
single, large Prometheus instance, and dealing with its vertical scaling
issues, we could scale horizontally, giving each namespace its own Prometheus,
Grafana, etc. Since each instance would be scraping comparatively little
metrics data, this could be a more manageable solution in the longer term.

TODO: Create a ticket
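
As a sketch, the prometheus-operator already supports this shape: a small
`Prometheus` resource in a team's namespace which, with
`serviceMonitorNamespaceSelector` left unset, only discovers ServiceMonitors
from its own namespace. The namespace and resource figures below are
illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: my-team-namespace        # hypothetical team namespace
spec:
  replicas: 1
  retention: 24h                      # illustrative short retention for a small instance
  serviceMonitorSelector: {}          # all ServiceMonitors, but only from this namespace
  resources:
    requests:
      cpu: 100m                       # tiny compared with the platform-wide instance
      memory: 400Mi
```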

@mogaal
Contributor

mogaal commented Mar 25, 2020

The short-term solution was successfully implemented yesterday.

Regarding the last point (a Prometheus instance per namespace), IMHO I'd say a Prometheus instance per team instead of per namespace. Which namespace should it go in? We can either standardize it or just leave it up to them. The crucial requirement is to have all the technical details in the user guide, so teams know how to do it.

Regarding Prometheus duplication: until we resolve all the problems we have with it (resources, Thanos, etc.), duplication sounds like duplicating problems too. Duplication would solve resilience issues, but at the moment we aren't facing resilience problems (right?). Still, it is a good option to explore in the future.

Instead of test clusters using dedicated S3 buckets, I would also give MinIO a try, since it is compatible with the Amazon S3 API. In theory, it should work.
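
For reference, Thanos can be pointed at a MinIO endpoint using the same S3
`objstore.yml` format, so a test cluster sketch could look something like this
(the service name and credentials are placeholders):

```yaml
# Thanos object storage config pointing at an in-cluster MinIO deployment
# instead of a real S3 bucket. Untested; in theory it should work because
# MinIO speaks the S3 API.
type: S3
config:
  bucket: "thanos-test-cluster"                        # bucket created inside MinIO
  endpoint: "minio.monitoring.svc.cluster.local:9000"  # hypothetical in-cluster MinIO service
  access_key: "<minio-access-key>"
  secret_key: "<minio-secret-key>"
  insecure: true                                       # plain HTTP inside the cluster
```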

@AntonyBishop
Contributor

Closing. The intention is to investigate Managed Prometheus.
