ROX-14270: Deploy cloudwatch exporter to data plane #801
Add a pod monitor to scrape the cloudwatch exporter. The exporter is used to scrape RDS database metrics from AWS. See stackrox/acs-fleet-manager#801 for the cloudwatch exporter deployment.
Add a dp-terraform sub-chart to deploy a cloudwatch exporter. The exporter scrapes RDS database metrics from AWS and exports them as Prometheus metrics. The exporter is configured via a config map and requires an AWS IAM user with a cloudwatch permission set.
dp-terraform/helm/rhacs-terraform/charts/cloudwatch/templates/01-operator-03-config-map.yaml
data:
  config.yaml: |-
    region: us-east-1
    metrics:
Are there any limitations on how many metrics we can collect? I would like to suggest at least:
AuroraReplicaLag -- to make sure the replica is doing any good.
MaximumUsedTransactionIDs -- not sure if there is any danger of transaction wraparound on Aurora, but this metric suggests there is.
TransactionLogsDiskUsage -- to know how much WAL there is.
Deadlocks -- they're just nasty.
BufferCacheHitRatio -- to know what kind of workload we are dealing with.
The limitation is $$$. Right now the cost for the current metrics+statistics in this PR is about $200/month on stage, probably ~$1k/month on prod. Roughly speaking the costs scale with the number of metrics and tenants.
In general I don't want to just replicate all the AWS dashboards and metrics in our Grafana. I would rather keep it minimal; if we need more, we can always log in to the AWS console and get all the metrics there. Let's say:
- a metric we need for alerts: definitely add it.
- a metric we need at least once a week: let's add it for convenience.
- a metric we need once a year for root cause analysis: maybe not worth it, go to AWS if need be.
Sure, but at least the replica lag, transaction ID, and deadlock metrics are pretty essential for application correctness and availability. I'm also a bit concerned that there are no out-of-the-box metrics describing autovacuum; we would probably need to figure such things out on the fly.
Thanks, a good first step. A few notes above.
It seems that switching to the cloudwatch API to |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: erthalion, janisz, stehessel.
Description
Add a dp-terraform sub-chart to deploy a cloudwatch exporter. The exporter
scrapes RDS database metrics from AWS and exports them as Prometheus metrics.
The exporter is configured via a config map and requires an AWS IAM user
with a cloudwatch permission set.
Based on these metrics, the exporter performs about 480 cloudwatch requests per minute. That equals roughly $200 in monthly AWS costs.
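The arithmetic behind that estimate can be sketched as follows. This is a rough back-of-the-envelope check, not the author's calculation: the price of $0.01 per 1,000 CloudWatch API requests and the 30-day month are assumptions that do not appear in this PR, and the function name is hypothetical.

```python
# Rough monthly cost estimate for CloudWatch API requests.
# ASSUMPTION: 0.01 USD per 1,000 requests (not stated in the PR).
# ASSUMPTION: a 30-day month.

MINUTES_PER_MONTH = 60 * 24 * 30


def monthly_cost_usd(requests_per_minute: float,
                     usd_per_1000_requests: float = 0.01) -> float:
    """Convert a steady request rate into an approximate monthly bill."""
    requests_per_month = requests_per_minute * MINUTES_PER_MONTH
    return requests_per_month / 1000 * usd_per_1000_requests


if __name__ == "__main__":
    # 480 requests per minute is the figure quoted in the description.
    print(f"~${monthly_cost_usd(480):.0f} per month")
```

Under these assumptions, 480 requests per minute works out to about 20.7 million requests per month, which lands close to the roughly $200 quoted above. Because the bill is linear in the request rate, doubling the number of scraped metrics or tenants would roughly double it.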
Checklist (Definition of Done)
- Unit and integration tests added
- Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
- Discussed security and business related topics privately. Will move any security and business related topics that arise to a private communication channel.
Test manual
Tested the exporter on a dev cluster.