ROX-14270: Deploy cloudwatch exporter to data plane #801

Merged: stehessel merged 4 commits into main from ROX-14270/cloudwatch-exporter on Feb 10, 2023

@stehessel stehessel commented Feb 8, 2023

Description

Add a dp-terraform sub-chart to deploy a cloudwatch exporter. The exporter
scrapes RDS database metrics from AWS and exports them as Prometheus metrics.

The exporter is configured via a config map and requires an AWS IAM user
with a cloudwatch permission set.

Based on these metrics, the exporter performs about 480 cloudwatch requests per minute. That equals roughly $200 in monthly AWS costs.
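
As a rough sanity check of that estimate, here is a minimal sketch of the arithmetic. It assumes a flat price of $0.01 per 1,000 CloudWatch API requests and a 30-day month. Neither figure is stated in this PR, so treat both as assumptions and check current AWS pricing. The function name is hypothetical.

```python
# Back-of-the-envelope check of the monthly cost estimate.
# ASSUMPTION: a flat price of $0.01 per 1,000 CloudWatch API requests and a
# 30-day month. Neither value comes from this PR.
ASSUMED_USD_PER_1000_REQUESTS = 0.01
MINUTES_PER_MONTH = 60 * 24 * 30


def monthly_cost_usd(requests_per_minute: float,
                     usd_per_1000: float = ASSUMED_USD_PER_1000_REQUESTS) -> float:
    """Estimate the monthly cost of a constant CloudWatch request rate."""
    requests_per_month = requests_per_minute * MINUTES_PER_MONTH
    return requests_per_month / 1000 * usd_per_1000


# 480 requests per minute is the rate quoted in the description.
print(f"${monthly_cost_usd(480):.2f} per month")
```

Under these assumptions the result is about $207 per month, which is in line with the roughly $200 quoted above.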

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately. Will move any security and business related topics that arise to private communication channel.

Test manual

Tested the exporter on a dev cluster.

openshift-ci bot commented Feb 8, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@stehessel stehessel temporarily deployed to development February 8, 2023 02:25 — with GitHub Actions Inactive
stehessel added a commit to stackrox/rhacs-observability-resources that referenced this pull request Feb 8, 2023
Add a pod monitor to scrape the cloudwatch exporter. The exporter is
used to scrape RDS database metrics from AWS.

See stackrox/acs-fleet-manager#801 for the
cloudwatch exporter deployment.
@stehessel stehessel force-pushed the ROX-14270/cloudwatch-exporter branch from 2957475 to 0dbdefb Compare February 8, 2023 02:30
@stehessel stehessel temporarily deployed to development February 8, 2023 02:31 — with GitHub Actions Inactive
@stehessel stehessel force-pushed the ROX-14270/cloudwatch-exporter branch from 0dbdefb to 3727260 Compare February 8, 2023 02:34
@stehessel stehessel temporarily deployed to development February 8, 2023 02:34 — with GitHub Actions Inactive
@stehessel stehessel marked this pull request as ready for review February 8, 2023 02:35
@stehessel stehessel temporarily deployed to development February 8, 2023 02:35 — with GitHub Actions Inactive
@stehessel stehessel requested a review from a team February 8, 2023 02:35
@stehessel stehessel temporarily deployed to development February 8, 2023 15:23 — with GitHub Actions Inactive
@stehessel stehessel requested a review from erthalion February 8, 2023 16:48
data:
  config.yaml: |-
    region: us-east-1
    metrics:

Are there any limits on how many metrics we can collect? I would like to suggest at least:

AuroraReplicaLag -- to make sure the replica is doing any good.
MaximumUsedTransactionIDs -- not sure if there is any danger of transaction wraparound on Aurora, but this metric suggests there is.
TransactionLogsDiskUsage -- to know how much WAL there is.
Deadlocks -- they're just nasty.
BufferCacheHitRatio -- to know what kind of workload we are dealing with.

Contributor Author

The limitation is $$$. Right now the cost for the current metrics+statistics in this PR is about $200/month on stage, probably ~$1k/month on prod. Roughly speaking the costs scale with the number of metrics and tenants.

In general I don't want to just replicate all the AWS dashboards and metrics in our Grafana. I would rather keep it minimal; if we need more, we can always log in to the AWS console and get all the metrics there. Let's say:

  • a metric we need for alerts: definitely add it.
  • a metric we need at least once a week: let's add it for convenience.
  • a metric we need once a year for root cause analysis: maybe not worth it, go to AWS if need be.

@erthalion erthalion Feb 10, 2023

Sure, but at least the replica lag, transaction ID and deadlock metrics are pretty essential for application correctness and availability. I'm also a bit concerned that there are no out-of-the-box metrics describing autovacuum; we would probably need to figure such things out on the fly.

@erthalion erthalion left a comment

Thanks, a good first step. A few notes above.

@openshift-ci openshift-ci bot removed the lgtm label Feb 9, 2023
@stehessel stehessel temporarily deployed to development February 9, 2023 19:04 — with GitHub Actions Inactive
@stehessel
Contributor Author

It seems that switching the exporter to the CloudWatch GetMetricData API (use_get_metric_data: true) reduced the overall cost compared to my initial estimate, so we should be able to add all the suggested metrics without going over that estimate.

@stehessel stehessel temporarily deployed to development February 10, 2023 13:32 — with GitHub Actions Inactive
openshift-ci bot commented Feb 10, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erthalion, janisz, stehessel
Once this PR has been reviewed and has the lgtm label, please assign ebensh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stehessel stehessel merged commit a0a74a5 into main Feb 10, 2023
@stehessel stehessel deleted the ROX-14270/cloudwatch-exporter branch February 10, 2023 14:43
stehessel added a commit to stackrox/rhacs-observability-resources that referenced this pull request Feb 13, 2023