Start without Slack #3

Open
sureshoss opened this issue Mar 26, 2024 · 14 comments

Comments

@sureshoss

This is impressive. Is there an option to start the application without the Slack integration? We don't have a way to connect to Slack in our org.

@AndreZiviani
Owner

Hi @sureshoss
I can make the Slack integration optional, but as things stand that would not make much sense because it is the only integration available. I ended up not implementing the Prometheus metrics (Prometheus is only used as a timer to check AWS Health for new events), but I can look into it again. What is your use case?

@sureshsubramaniam

Thanks for your response. My use case is to get the AWS Health data across regions exported and stored in Prometheus, so it can be visualized in a Grafana map panel with traffic lights. I was also looking to see if we can scrape the account-level and resource-level stats in the same way, so we can build drill-down dashboards from the region to the accounts and on to the resources.

@AndreZiviani
Owner

@sureshsubramaniam @sureshoss Do you think a metric like this would solve your needs?

aws_health_exporter_event{accountid=<>, region=<>, service=<>}

I'm not sure adding the affected resources as labels is a good idea due to cardinality issues, but maybe I can create a flag to enable it. The value of the metric could be the number of updates on that event, going back to zero when it is closed/resolved.
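
To make the cardinality concern concrete, here is a rough sketch of how an opt-in flag could change the number of time series per event. The `Event` type, the `labelSets` helper and the `resource` label name are all hypothetical and only illustrate the idea; they are not taken from the exporter's code.

```go
package main

import "fmt"

// Event is a hypothetical, simplified view of an AWS Health event.
type Event struct {
	AccountID string
	Region    string
	Service   string
	Resources []string // affected resource identifiers
}

// labelSets returns one label set per time series the event would produce.
// With includeResources false the event maps to a single series; with it
// true every affected resource adds a series, which is where cardinality grows.
func labelSets(ev Event, includeResources bool) []map[string]string {
	base := func() map[string]string {
		return map[string]string{
			"accountid": ev.AccountID,
			"region":    ev.Region,
			"service":   ev.Service,
		}
	}
	if !includeResources || len(ev.Resources) == 0 {
		return []map[string]string{base()}
	}
	sets := make([]map[string]string, 0, len(ev.Resources))
	for _, r := range ev.Resources {
		l := base()
		l["resource"] = r // hypothetical label name
		sets = append(sets, l)
	}
	return sets
}

func main() {
	ev := Event{
		AccountID: "111111111111",
		Region:    "us-east-1",
		Service:   "EC2",
		Resources: []string{"i-aaa", "i-bbb", "i-ccc"},
	}
	// Number of series without and with the resource label.
	fmt.Println(len(labelSets(ev, false)), len(labelSets(ev, true)))
}
```

With the flag off, the series count depends only on account, region and service; with it on, it scales with the number of affected resources.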

@AndreZiviani
Owner

I tried implementing metrics support but found a few issues with the AWS API:

  • The API does not return each update individually, only the current state and the last update timestamp, so I can't count how many updates were made
  • Most of the events don't actually close, e.g. I have one for a DMS version update but the affected resources don't exist anymore, so the metric would remain active (value of 1) forever
  • For now the logic is based on timestamps: the exporter checks whether any update was made to any issue since the last scrape time (actually the AWS API does this, I only filter by time). This makes the exporter very responsive, since it only checks each event once. I could change the logic to look at the last X hours, but there is no guarantee that the last update of a health event will fall in that time range. Another option is to filter by status (open or closed), but then there is the issue I mentioned above

The official AWS AHA implementation also does not have this concept of state, where it does something when an event is opened or closed. It only notifies that something changed, so I assume it is not possible (or practical) to implement something like that.

These are some example metrics of what I managed to implement. I think the best route is a plain counter that increments on each update and resets on exporter restart. Any suggestions?

aws_health_event{category="issue",code="AWS_EC2_OPERATIONAL_ISSUE",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="PUBLIC",service="EC2"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_ELASTICACHE_UPDATE_AVAILABLE",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="ELASTICACHE"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_RDS_OPERATIONAL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="ACCOUNT_SPECIFIC",service="RDS"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1

@AndreZiviani
Owner

AndreZiviani commented Apr 4, 2024

Here is a release if you want to give it a shot, but keep in mind this is untested:
https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.0

@sureshoss
Author

Thanks @AndreZiviani, I will test it and update you with some more comments.

@sureshoss
Author

Initial testing: Dependency on the GLIBC from the compiled binary
#-> ./aws-health-exporter --help
./aws-health-exporter: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./aws-health-exporter)
./aws-health-exporter: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./aws-health-exporter)

I will compile it against the GLIBC version I have on my system and update you.

@sureshoss
Author

I compiled it and started it on my Linux machine. The exporter starts without issue, but I am unable to see any of the health metrics for the account or for the org. I am running it on an EC2 instance with Red Hat Linux.

#-> ./aws-health-exporter --log-level debug --log-events true
DEBU[0000] Set log level to debug
INFO[0000] Starting AWS Health Exporter. [log-level=debug,log-events=true]
INFO[0000] Starting metric http endpoint [address=:8080, path=/metrics, regions=all-regions]

No debug logs are printed that would help identify the issue.
I only see metrics such as aws_health_process_runtime_go_gc_pause_ns_bucket and aws_health_process_runtime_go_mem_live_objects.
Most of them relate to the exporter itself, not the actual health metrics like the ones you showed.

@AndreZiviani
Owner

however the exporter starts without issue but i am unable to see any of the health metrics for the account or for the org.
That is expected: because the exporter is stateless, it only checks for updates made since the last time it was scraped (or started).

I've added a hidden option to shift the start time of the first scrape. Try running with --time-shift -240h to force it to look at all events from the last 10 days.

@AndreZiviani
Owner

AndreZiviani commented Apr 5, 2024

Initial testing: Dependency on the GLIBC from the compiled binary

I forgot to disable CGO on the release binaries. The latest version should work for you:
https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.1

@sureshsubramaniam

Awesome let me give it a try today and update you

@sureshoss
Author

@AndreZiviani I tried running the latest build and it seems there is a panic in the code.
However, I checked using the AWS CLI and was able to get the events without being throttled.

#-> ./aws-health-exporter -v debug -r us-east-1 --time-shift -240h
DEBU[0000] Set log level to debug
INFO[0000] Starting AWS Health Exporter. [log-level=debug,log-events=false]
INFO[0017] Starting metric http endpoint [address=:8080, path=/metrics, regions=us-east-1]
panic: operation error Health: DescribeAffectedAccountsForOrganization, exceeded maximum number of attempts, 3, https response error StatusCode: 429, RequestID: xxx-xxx-xxxx-xxxx-xxxxxx, api error ThrottlingException: Rate exceeded

goroutine 68 [running]:
github.com/AndreZiviani/aws-health-exporter/exporter.Metrics.getAffectedAccountsForOrg({0xc000112680, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x13ed300, {0xc17aad47cc2b13c3, 0xfffcee3255623100, 0x13f8b40}, ...}, ...)
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:67 +0x208
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).EnrichOrgEvents(0xc0002da200, {0xedaf58, 0x1427ce0}, {0xc0004a9c00, 0x0, {0xc00033afb0, 0x10}, {0xc000360618, 0x13}, 0xc0004a9c10, ...})
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:50 +0x146
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).GetOrgEvents(0xc0002da200)
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:36 +0x36e
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).GetHealthEvents(0xc0002da200)
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/health.go:29 +0x33
github.com/AndreZiviani/aws-health-exporter/exporter.NewMetrics.func1({0xcfb660?, 0x1427ce0?}, {0xed9dc0, 0xc0004ae060})
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/metrics.go:27 +0x48
go.opentelemetry.io/otel/sdk/metric.(*meter).RegisterCallback.func1({0xedaf58, 0x1427ce0})
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/meter.go:445 +0x55
go.opentelemetry.io/otel/sdk/metric.(*pipeline).produce(0xc0000fe510, {0xedaf58, 0x1427ce0?}, 0xc000352060)
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/pipeline.go:134 +0x314
go.opentelemetry.io/otel/sdk/metric.(*ManualReader).Collect(0xc0000a3860, {0xedaf58, 0x1427ce0}, 0xc000352060)
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/manual_reader.go:123 +0xe2
go.opentelemetry.io/otel/exporters/prometheus.(*collector).Collect(0xc0002ea000, 0xc000069f60?)
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/exporters/[email protected]/exporter.go:158 +0x72
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:457 +0xe7
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 15
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:547 +0xbab

@AndreZiviani
Owner

@sureshoss That's odd. It looks like you have a lot of accounts/events and the API is throttling you, but the SDK should handle retries and rate limiting. I will try to look into it.

@AndreZiviani
Owner

Hey @sureshoss, I wasn't able to reproduce your issue, probably because I don't have enough events/resources, but I've changed the retryer logic. Please let me know if this fixes your issue. If it does not, I can be more explicit and increase some other parameters.
https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.2

3 participants