Queue Processor: Wrong AWS Health Events #550
Comments
I have encountered the same issue during a us-east-1 outage. From the code, it looks like filters for "AWS Health Event" should be more precise. Only
This filter does the trick for me.
Thanks for reporting this issue! We can update the rule change in the README. I think we should also look into making some errors in the monitors non-fatal. In this case, for example, skipping events shouldn't cause a crash loop.
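A rough sketch of the non-fatal behavior suggested above. This is Python for illustration only (NTH itself is written in Go), and `UnsupportedEventError`, `SUPPORTED_SERVICES`, `handle_event`, and `drain` are hypothetical names, not NTH's actual API:

```python
import logging

logger = logging.getLogger("queue-processor-sketch")


class UnsupportedEventError(Exception):
    """Hypothetical error for events the handler cannot process."""


# Assumption for this sketch: only EC2 health events are processable.
SUPPORTED_SERVICES = {"EC2"}


def handle_event(event):
    """Handle one event, raising if its service is not supported."""
    service = event.get("detail", {}).get("service")
    if service not in SUPPORTED_SERVICES:
        raise UnsupportedEventError(f"unsupported service: {service!r}")
    return f"handled {service} event"


def drain(events):
    """Process every event; log and skip unsupported ones instead of crashing."""
    handled, skipped = [], []
    for event in events:
        try:
            handled.append(handle_event(event))
        except UnsupportedEventError as err:
            # Non-fatal: record the skip and move on to the next message.
            logger.warning("skipping event: %s", err)
            skipped.append(event)
    return handled, skipped
```

In a real queue consumer, a skipped message would presumably also need to be deleted or dead-lettered so that it is not redelivered; that part is left out here.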
This hit us, too, over the weekend due to AWS Sumerian being retired:
Seems like a place where good default behavior (as @bwagner5 mentioned on Jan 3) would be a win. Convention over configuration, yeah?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
What is the workaround for this, or should we just ignore it?
Thanks for your patience! I’ve updated the README rules to be more precise, and skipping events should no longer be fatal.
Fix released as part of v1.16.3, closing the issue.
The docs for the infrastructure required for Queue Processor mode say to create an EventBridge rule for catching aws.health messages. The issue with this rule, however, is that it catches all AWS Health events, even those that NTH cannot process.
We recently found that this rule was catching notifications for ElasticSearch/OpenSearch relating to the Log4Shell CVE. This caused NTH to try to process events for the "ES" service, and it then got stuck in a crash loop trying to process this message on the queue.
To avoid this in future, an extra event pattern such as "service": "ec2" would be needed for the ScheduledChange rule.
Steps to reproduce
Expected outcome
NTH should either ignore messages for services it is not able to process OR event notifications not relevant to NTH should not end up in its SQS queue.
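The narrower rule suggested in the description can be sketched as an event pattern plus a small matcher. This is an illustration in Python, not NTH code. The field names follow the AWS Health event shape, but the exact values are assumptions: in particular, the sketch assumes the service appears as "EC2" in the event detail and that matching is case-sensitive. `matches` implements only exact-value matching, a small subset of what EventBridge supports, and the two sample events are made up.

```python
import json

# Assumed pattern: narrows the aws.health rule to EC2 scheduled changes.
EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2"],
        "eventTypeCategory": ["scheduledChange"],
    },
}


def matches(pattern, event):
    """Return True if every pattern field is present in event with an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values, compared exactly.
            return False
    return True


# Made-up sample events, trimmed to the fields the pattern inspects.
ec2_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "EC2", "eventTypeCategory": "scheduledChange"},
}
es_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "ES", "eventTypeCategory": "scheduledChange"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, ec2_event))
print(matches(EVENT_PATTERN, es_event))
```

With a pattern like this on the rule, the "ES" notification would not reach the SQS queue in the first place. The same check could also run inside the consumer as a second line of defense.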
Application Logs
NTH Logs:
Problem Event Notification:
Environment