Start without Slack #3

sureshoss · 2024-03-26T07:27:15Z

This is impressive. Is there an option to start the application without the Slack integration? We don't have the option to connect to Slack in our org.

AndreZiviani · 2024-03-26T12:34:56Z

Hi @sureshoss
I can make the Slack integration optional, but as it stands that would not make much sense, because Slack is the only integration available. I ended up not implementing the Prometheus metrics (Prometheus scraping is only used as a timer to check AWS Health for new events), but I can look into it again. What is your use case?
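To illustrate what "optional" could mean here, a minimal sketch, assuming the exporter depended on a small notifier interface and fell back to a no-op implementation when no Slack webhook is configured. Notifier, newNotifier and the SLACK_WEBHOOK_URL variable are hypothetical names, not the project's API, and the Slack type below only prints instead of sending anything:

```go
package main

import (
	"fmt"
	"os"
)

// Notifier is a hypothetical abstraction over notification backends.
type Notifier interface {
	Notify(msg string) error
}

// slackNotifier is a placeholder Slack backend; it does not perform any HTTP call.
type slackNotifier struct{ webhookURL string }

func (s slackNotifier) Notify(msg string) error {
	// A real implementation would POST msg to s.webhookURL.
	fmt.Printf("slack <- %s\n", msg)
	return nil
}

// noopNotifier discards notifications, letting the exporter run without Slack.
type noopNotifier struct{}

func (noopNotifier) Notify(string) error { return nil }

// newNotifier selects Slack only when a webhook URL is configured.
func newNotifier(webhookURL string) Notifier {
	if webhookURL == "" {
		return noopNotifier{}
	}
	return slackNotifier{webhookURL: webhookURL}
}

func main() {
	// SLACK_WEBHOOK_URL is a hypothetical variable name used for illustration.
	n := newNotifier(os.Getenv("SLACK_WEBHOOK_URL"))
	_ = n.Notify("AWS_EC2_OPERATIONAL_ISSUE in us-east-1")
}
```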

sureshsubramaniam · 2024-04-04T03:19:11Z

Thanks for your response. My use case is to export the AWS Health data across regions and store it in Prometheus, to visualize it in a Grafana map panel with traffic lights. I was also looking to see whether we can scrape account- and resource-level stats in the same way, so we can build drill-down dashboards from region to account to resource.

AndreZiviani · 2024-04-04T13:00:54Z

@sureshsubramaniam @sureshoss Do you think a metric like this would solve your needs?

aws_health_exporter_event{accountid=<>, region=<>, service=<>}

I'm not sure adding the affected resources as labels is a good idea due to cardinality issues, but maybe I can create a flag to enable it. The value of the metric could be the number of updates on that event, going back to zero when the event is closed/resolved.
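A minimal sketch of the proposed series, assuming resource labels sit behind an opt-in switch (includeResources is a hypothetical name, not an existing flag of the exporter). It builds the label set and renders one sample line in the Prometheus text exposition format:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper applies the escapes the Prometheus text format defines for
// label values: backslash, double quote and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// eventLabels builds the label set for one hypothetical aws_health_exporter_event
// series. The resource label is added only when includeResources is set,
// because one series per affected resource can multiply cardinality.
func eventLabels(account, region, service, resource string, includeResources bool) map[string]string {
	labels := map[string]string{"accountid": account, "region": region, "service": service}
	if includeResources && resource != "" {
		labels["resource"] = resource
	}
	return labels
}

// formatSample renders one sample line, sorting label names so the output is deterministic.
func formatSample(name string, labels map[string]string, value float64) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	pairs := make([]string, len(names))
	for i, n := range names {
		pairs[i] = n + `="` + labelEscaper.Replace(labels[n]) + `"`
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	labels := eventLabels("123456789012", "us-east-1", "EC2", "i-0123456789abcdef0", false)
	fmt.Println(formatSample("aws_health_exporter_event", labels, 3))
	// aws_health_exporter_event{accountid="123456789012",region="us-east-1",service="EC2"} 3
}
```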

AndreZiviani · 2024-04-04T17:53:16Z

I tried implementing metrics support but found a few issues with the AWS API:

- The API does not return each update individually, only the current state and the last update timestamp, so I can't count how many updates were made.
- Most of the events don't actually close. For example, I have one for a DMS version update, but the affected resources don't exist anymore, so the metric would remain active (value of 1) forever.

For now the logic is based on timestamps: the exporter checks whether any update was made to any issue since the last scrape time (actually the AWS API does this; I only filter by time). This makes the exporter very responsive, since it only checks each event once. I could change the logic to look at the last X hours, but there is no guarantee that the last update of a health event will fall within that time range. Another option is to filter based on the status (open or closed), but that runs into the issue I mentioned above.

The official AWS AHA implementation also does not have this concept of state, where it does something when an event is opened or closed; it only notifies that something changed. So I assume it is not possible (or practical) to implement something like that.

These are some example metrics from what I managed to implement. I think the best route will be a counter that increments on each update and resets when the exporter restarts. Any suggestions?

aws_health_event{category="issue",code="AWS_EC2_OPERATIONAL_ISSUE",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="PUBLIC",service="EC2"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_ELASTICACHE_UPDATE_AVAILABLE",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="ELASTICACHE"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_RDS_OPERATIONAL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="ACCOUNT_SPECIFIC",service="RDS"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
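The counter approach suggested above could look roughly like this in-memory sketch. seriesKey and updateCounter are hypothetical names; a real exporter would register an OpenTelemetry or Prometheus counter instead of keeping its own map, and the otel_scope_* labels in the samples are added by the OpenTelemetry Prometheus exporter, so they are omitted here:

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey mirrors the event labels seen in the sample output.
type seriesKey struct {
	Account, Category, Code, Region, Scope, Service string
}

// updateCounter counts observed event updates per label set. State is kept
// in memory only, so all counts restart from zero when the process restarts.
type updateCounter struct {
	mu     sync.Mutex
	counts map[seriesKey]uint64
}

func newUpdateCounter() *updateCounter {
	return &updateCounter{counts: make(map[seriesKey]uint64)}
}

// Inc records one more update for the series; the count never decreases.
func (c *updateCounter) Inc(k seriesKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[k]++
}

// Value returns the current count for the series (0 if never seen).
func (c *updateCounter) Value(k seriesKey) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[k]
}

func main() {
	c := newUpdateCounter()
	vpn := seriesKey{Category: "accountNotification", Code: "AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",
		Region: "us-east-2", Scope: "ACCOUNT_SPECIFIC", Service: "VPN"}
	c.Inc(vpn)
	c.Inc(vpn)
	fmt.Println(c.Value(vpn)) // 2
}
```

In PromQL, functions such as increase() and rate() treat a drop in a counter's value as a reset, so restarting the exporter does not produce negative deltas on a dashboard.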

AndreZiviani · 2024-04-04T19:08:28Z

If you want to give it a shot, here is a release, but keep in mind it is untested:
https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.0

sureshoss · 2024-04-05T09:27:18Z

Thanks @AndreZiviani, I will test it and update you with some more comments.

sureshoss · 2024-04-05T09:37:21Z

Initial testing: the compiled binary depends on a newer GLIBC than my system provides:
#-> ./aws-health-exporter --help
./aws-health-exporter: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./aws-health-exporter)
./aws-health-exporter: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./aws-health-exporter)

I will compile it against the GLIBC version on my system and report back.

sureshoss · 2024-04-05T10:19:48Z

I compiled it and started it on my Linux machine. The exporter starts without issue, but I am unable to see any of the health metrics for the account or for the org. I am running it on an EC2 instance with Red Hat Linux.

#-> ./aws-health-exporter --log-level debug --log-events true
DEBU[0000] Set log level to debug
INFO[0000] Starting AWS Health Exporter. [log-level=debug,log-events=true]
INFO[0000] Starting metric http endpoint [address=:8080, path=/metrics, regions=all-regions]

No debug logs are printed that would help identify the issue.
I only see metrics such as aws_health_process_runtime_go_gc_pause_ns_bucket and aws_health_process_runtime_go_mem_live_objects, which mostly relate to the exporter process itself rather than health events like the ones in your example.

AndreZiviani · 2024-04-05T11:46:10Z

however the exporter starts without issue but i am unable to see any of the health metrics for the account or for the org.
That is expected: the exporter is stateless, so it only checks for new updates since the last time it was scraped (or started).

I've added a hidden flag to shift the time window back on the first scrape. Try running with --time-shift -240h to force it to look at all events from the last 10 days.

AndreZiviani · 2024-04-05T11:56:14Z

Initial testing: Dependency on the GLIBC from the compiled binary

I forgot to disable CGO on the release binaries; the latest version should work for you:
https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.1

sureshsubramaniam · 2024-04-08T00:28:44Z

Awesome, let me give it a try today and update you.

sureshoss · 2024-04-11T04:56:31Z

@AndreZiviani I tried running the latest build and there seems to be a panic in the code.
However, I checked using the AWS CLI and was able to get the events without being throttled.

#-> ./aws-health-exporter -v debug -r us-east-1 --time-shift -240h
DEBU[0000] Set log level to debug
INFO[0000] Starting AWS Health Exporter. [log-level=debug,log-events=false]
INFO[0017] Starting metric http endpoint [address=:8080, path=/metrics, regions=us-east-1]
panic: operation error Health: DescribeAffectedAccountsForOrganization, exceeded maximum number of attempts, 3, https response error StatusCode: 429, RequestID: xxx-xxx-xxxx-xxxx-xxxxxx, api error ThrottlingException: Rate exceeded

goroutine 68 [running]:
github.com/AndreZiviani/aws-health-exporter/exporter.Metrics.getAffectedAccountsForOrg({0xc000112680, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x13ed300, {0xc17aad47cc2b13c3, 0xfffcee3255623100, 0x13f8b40}, ...}, ...)
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:67 +0x208
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).EnrichOrgEvents(0xc0002da200, {0xedaf58, 0x1427ce0}, {0xc0004a9c00, 0x0, {0xc00033afb0, 0x10}, {0xc000360618, 0x13}, 0xc0004a9c10, ...})
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:50 +0x146
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).GetOrgEvents(0xc0002da200)
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:36 +0x36e
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).GetHealthEvents(0xc0002da200)
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/health.go:29 +0x33
github.com/AndreZiviani/aws-health-exporter/exporter.NewMetrics.func1({0xcfb660?, 0x1427ce0?}, {0xed9dc0, 0xc0004ae060})
/home/runner/work/aws-health-exporter/aws-health-exporter/exporter/metrics.go:27 +0x48
go.opentelemetry.io/otel/sdk/metric.(*meter).RegisterCallback.func1({0xedaf58, 0x1427ce0})
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/meter.go:445 +0x55
go.opentelemetry.io/otel/sdk/metric.(*pipeline).produce(0xc0000fe510, {0xedaf58, 0x1427ce0?}, 0xc000352060)
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/pipeline.go:134 +0x314
go.opentelemetry.io/otel/sdk/metric.(*ManualReader).Collect(0xc0000a3860, {0xedaf58, 0x1427ce0}, 0xc000352060)
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/manual_reader.go:123 +0xe2
go.opentelemetry.io/otel/exporters/prometheus.(*collector).Collect(0xc0002ea000, 0xc000069f60?)
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/exporters/[email protected]/exporter.go:158 +0x72
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:457 +0xe7
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 15
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:547 +0xbab

AndreZiviani · 2024-04-11T13:32:35Z

@sureshoss That's odd. It looks like you have a lot of accounts/events and the API is throttling you, but the SDK should handle retries and rate limiting. I will look into it.

AndreZiviani · 2024-04-19T12:08:24Z

Hey @sureshoss, I wasn't able to reproduce your issue, probably because I don't have enough events/resources, but I've changed the retryer logic. Please let me know if this fixes your issue. If it does not, I can be more explicit and increase some other parameters:
https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.2
