ROX-14270: Deploy cloudwatch exporter to data plane #801

stehessel · 2023-02-08T02:25:41Z

Description

Add a dp-terraform sub-chart to deploy a cloudwatch exporter. The exporter
scrapes RDS database metrics from AWS and exports them as Prometheus metrics.

The exporter is configured via a config map and requires an AWS IAM user
with a cloudwatch permission set.

With the currently configured metrics, the exporter performs about 480 cloudwatch requests per minute. That amounts to roughly $200 in monthly AWS costs.
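
As a rough sanity check, the monthly figure can be reproduced from the request rate. This is a minimal sketch, assuming a CloudWatch API price of $0.01 per 1,000 requests and a 30-day month. Neither value is stated in this PR, and the names `PRICE_PER_1000_REQUESTS_USD` and `monthly_cost_usd` are made up for illustration, so check current AWS pricing before relying on it.

```python
# Assumed CloudWatch API price in USD per 1,000 requests.
# This value is an assumption for illustration, not taken from the PR.
PRICE_PER_1000_REQUESTS_USD = 0.01

# Minutes in an assumed 30-day month.
MINUTES_PER_MONTH = 60 * 24 * 30


def monthly_cost_usd(requests_per_minute: float) -> float:
    """Estimate the monthly API cost for a steady request rate."""
    requests_per_month = requests_per_minute * MINUTES_PER_MONTH
    return requests_per_month / 1000 * PRICE_PER_1000_REQUESTS_USD


if __name__ == "__main__":
    # 480 requests per minute is the rate quoted in the description.
    print(f"${monthly_cost_usd(480):.2f} per month")
```

Under these assumptions, 480 requests per minute comes to about 20.7 million requests per month, which is in line with the roughly $200 quoted above.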

Checklist (Definition of Done)

~~Unit and integration tests added~~
Added test description under Test manual
~~Documentation added if necessary (i.e. changes to dev setup, test execution, ...)~~
CI and all relevant tests are passing
Add the ticket number to the PR title if available, i.e. ROX-12345: ...
~~Discussed security and business related topics privately. Will move any security and business related topics that arise to private communication channel.~~

Test manual

Tested the exporter on a dev cluster.

openshift-ci · 2023-02-08T02:25:45Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Add a pod monitor to scrape the cloudwatch exporter. The exporter is used to scrape RDS database metrics from AWS. See stackrox/acs-fleet-manager#801 for the cloudwatch exporter deployment.

erthalion · 2023-02-09T17:39:12Z

dp-terraform/helm/rhacs-terraform/charts/cloudwatch/templates/01-operator-03-config-map.yaml

+data:
+  config.yaml: |-
+    region: us-east-1
+    metrics:


Are there any limitations on how many metrics we can collect? I would like to suggest at least:

AuroraReplicaLag -- to make sure the replica is doing any good.
MaximumUsedTransactionIDs -- not sure if there is any danger of transaction wraparound on Aurora, but this metric suggests there is.
TransactionLogsDiskUsage -- to know how much WAL there is.
Deadlocks -- they're just nasty.
BufferCacheHitRatio -- to know what kind of workload we are dealing with.

The limitation is $$$. Right now the cost for the current metrics+statistics in this PR is about $200/month on stage, probably ~$1k/month on prod. Roughly speaking the costs scale with the number of metrics and tenants.
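
To make the scaling argument concrete, here is a minimal sketch that treats the cost as linear in both the number of metrics and the number of tenants. The linear model is a simplification, the function name `scale_monthly_cost` is hypothetical, and the metric and tenant counts in the example are placeholders, not the real stage or prod numbers.

```python
def scale_monthly_cost(
    base_cost_usd: float,
    base_metrics: int,
    base_tenants: int,
    metrics: int,
    tenants: int,
) -> float:
    """Scale a known monthly cost linearly in metrics and tenants."""
    if base_metrics <= 0 or base_tenants <= 0:
        raise ValueError("base_metrics and base_tenants must be positive")
    return base_cost_usd * (metrics / base_metrics) * (tenants / base_tenants)


if __name__ == "__main__":
    # Placeholder counts: 10 metrics and 20 tenants at $200/month.
    base = dict(base_cost_usd=200.0, base_metrics=10, base_tenants=20)

    # Five times as many tenants with the same metrics.
    print(scale_monthly_cost(**base, metrics=10, tenants=100))

    # Five extra metrics with the same tenants.
    print(scale_monthly_cost(**base, metrics=15, tenants=20))
```

In this model, five times as many tenants turns $200 into $1,000, and each additional metric adds a fixed share of the base cost.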

In general I don't want to just replicate all the AWS dashboards and metrics in our Grafana. I would rather keep it minimal; if we need more, we can always log in to the AWS console and get all the metrics there. Let's say:

a metric we need for alerts: definitely add it.

a metric we need at least once a week: let's add it for convenience.

a metric we need once a year for root cause analysis: maybe not worth it, go to AWS if need be.

Sure, but at least the replica lag, transaction ID and deadlock metrics are pretty essential for application correctness and availability. I'm also a bit concerned that there are no out-of-the-box metrics describing autovacuum; we would probably need to figure such things out on the fly.

erthalion

Thanks, a good first step. A few notes above.

stehessel · 2023-02-09T19:14:49Z

It seems that switching the exporter to the GetMetricData API (use_get_metric_data: true) reduced the overall cost compared to my initial estimate, so we should be able to add all the suggested metrics without going over that estimate.

openshift-ci · 2023-02-10T13:38:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erthalion, janisz, stehessel
Once this PR has been reviewed and has the lgtm label, please assign ebensh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Feb 8, 2023

stehessel temporarily deployed to development February 8, 2023 02:25 — with GitHub Actions Inactive

stehessel mentioned this pull request Feb 8, 2023

ROX-14270: Add cloudwatch exporter pod monitor stackrox/rhacs-observability-resources#53

Merged

stehessel force-pushed the ROX-14270/cloudwatch-exporter branch from 2957475 to 0dbdefb February 8, 2023 02:30

stehessel temporarily deployed to development February 8, 2023 02:31 — with GitHub Actions Inactive

stehessel force-pushed the ROX-14270/cloudwatch-exporter branch from 0dbdefb to 3727260 February 8, 2023 02:34

stehessel temporarily deployed to development February 8, 2023 02:34 — with GitHub Actions Inactive

stehessel marked this pull request as ready for review February 8, 2023 02:35

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 8, 2023

stehessel temporarily deployed to development February 8, 2023 02:35 — with GitHub Actions Inactive

stehessel requested a review from a team February 8, 2023 02:35

use GetMetricData

8574940

stehessel temporarily deployed to development February 8, 2023 15:23 — with GitHub Actions Inactive

stehessel requested a review from erthalion February 8, 2023 16:48

janisz approved these changes Feb 9, 2023

openshift-ci bot assigned janisz Feb 9, 2023

openshift-ci bot added the lgtm label Feb 9, 2023