Queue Processor: Wrong AWS Health Events #550

joshua-giumelli-deltatre · 2021-12-17T03:47:53Z

The docs for the infrastructure required for Queue Processor mode say to create an EventBridge rule that catches aws.health messages. The problem is that this rule catches all AWS Health events, even those that NTH cannot process.

We recently found that this rule was catching notifications for ElasticSearch/OpenSearch relating to the Log4Shell CVE. As a result, NTH tried to process events for the "ES" service and got stuck in a crash loop on that queue message.

To avoid this in future, an extra event pattern such as "service": "ec2" would be needed for the ScheduledChange rule.

Steps to reproduce

Create the EventBridge rules from docs for Infrastructure Setup when using Queue Processor
Receive a non-EC2 AWS Health event notification, e.g. for the ElasticSearch/OpenSearch Log4j CVE

Expected outcome
NTH should either ignore messages for services it is not able to process OR event notifications not relevant to NTH should not end up in its SQS queue.

Application Logs
NTH Logs:

2021/12/17 03:28:02 ERR ignoring interruption event due to error error="events from Amazon EventBridge for service (ES) are not supported"
2021/12/17 03:28:02 ERR error processing interruption events error="some interruption events for message Id 1100000000000000001000000001100000101000 could not be processed"
2021/12/17 03:28:02 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2021/12/17 03:28:02 WRN Stopping NTH - Duplicate Error Threshold hit.
panic: none of the waiting queue events could be processed

goroutine 97 [running]:
main.main.func3(0x1, 0x24ef720, 0xc00007f4a0, 0x21be2d2, 0x1c, 0x0, 0x0, 0x24ff9c8, 0xc000714fc0, 0x24ff9c8, ...)
	/node-termination-handler/cmd/node-termination-handler.go:211 +0x649
created by main.main
	/node-termination-handler/cmd/node-termination-handler.go:193 +0xc91

Problem Event Notification:

{
    "version": "0",
    "id": "9ca40099-e97d-510f-e9df-6f92042f34e1",
    "detail-type": "AWS Health Event",
    "source": "aws.health",
    "account": "XXXXXX",
    "time": "2021-12-14T04:05:00Z",
    "region": "ap-south-1",
    "resources": [
        "XXXXXX"
    ],
    "detail": {
        "eventArn": "arn:aws:health:ap-south-1::event/ES/AWS_ES_SECURITY_NOTIFICATION/AWS_ES_SECURITY_NOTIFICATION_XXXXXXXXXXXXXXXXXXXXX",
        "service": "ES",
        "eventTypeCode": "AWS_ES_SECURITY_NOTIFICATION",
        "eventTypeCategory": "accountNotification",
        "startTime": "Tue, 14 Dec 2021 04:05:00 GMT",
        "endTime": "Tue, 14 Dec 2021 04:15:00 GMT",
        "eventDescription": [
            {
                "language": "en_US",
                "latestDescription": "AWS is aware of the recently disclosed security issue affecting the open source Apache “Log4j2” utility. This utility is used by Amazon OpenSearch Service. We have released a service software update R20211203-P2 that contains the updated “Log4j2” utility that addresses the issue for your domain(s) in the AP-SOUTH-1 Region. We strongly recommend that you apply this software update immediately to mitigate this issue for your OpenSearch domains.\n\nYou can update the service software using the Amazon OpenSearch Service console [1]. If you have already updated your domain(s), or you see that the current Service Software Release in the Domain overview page shows ‘R20211203-P2’, no further action is required. \n\nIf you have any questions or concerns, please contact AWS Support [2]. \n\n[1] https://docs.aws.amazon.com/opensearch-service/latest/developerguide/service-software.html\n[2] https://aws.amazon.com/support"
            }
        ],
        "affectedEntities": [
            {
                "entityValue": "XXXXXX"
            }
        ]
    }
}

Environment

NTH App Version: v1.14.0
NTH Mode (IMDS/Queue processor): Queue Processor
OS/Arch: Bottlerocket OS 1.4.0/amd64
Kubernetes version: 1.21
Installation method: Helm chart version 0.16.0


szymonpk · 2021-12-22T13:55:34Z

I encountered the same issue during a us-east-1 outage. From the code, it looks like the filters for "AWS Health Event" should be more precise: only the scheduledChange event type category is supported. I will try to correct it tomorrow in my environment.

szymonpk · 2021-12-23T06:26:17Z

{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "service": ["EC2"],
    "eventTypeCategory": ["scheduledChange"]
  }
}

This filter does the trick for me.
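
For anyone who wants to sanity-check the pattern locally before updating the rule, here is a rough Python sketch. The `matches` helper is hypothetical and only approximates EventBridge semantics for exact-value matching (a list of allowed values per leaf field, nested objects matched recursively); it does not implement prefix, numeric, or other operators. The first sample event is a trimmed copy of the notification from the original report, and the second is an assumed EC2 event made up for illustration.

```python
def matches(pattern, event):
    """Return True if `event` satisfies `pattern` (exact-value matching only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True


PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2"],
        "eventTypeCategory": ["scheduledChange"],
    },
}

# Trimmed copy of the OpenSearch notification from the original report.
es_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "ES", "eventTypeCategory": "accountNotification"},
}

# Hypothetical EC2 scheduled-change event (field values assumed for illustration).
ec2_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "EC2", "eventTypeCategory": "scheduledChange"},
}

print(matches(PATTERN, es_event))   # False
print(matches(PATTERN, ec2_event))  # True
```

With the narrower pattern, the ES notification would be dropped by the rule and should never reach the SQS queue.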

bwagner5 · 2022-01-03T17:49:59Z

Thanks for reporting this issue! We can update the rule in the README. I think we should also look into making some errors in the monitors non-fatal. In this case, skipping unsupported events shouldn't cause a crash loop.
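
To illustrate the non-fatal idea, here is a rough sketch in Python. NTH itself is written in Go and this is not its actual code; `SUPPORTED_SERVICES`, `UnsupportedEventError`, `parse_health_event`, and `process_batch` are made-up names. Unsupported events are logged and set aside instead of being counted as processing failures.

```python
import logging

logger = logging.getLogger("queue-processor-sketch")

# Assumption: only EC2 health events are actionable by the handler.
SUPPORTED_SERVICES = {"EC2"}


class UnsupportedEventError(Exception):
    """Raised for events the handler does not know how to act on."""


def parse_health_event(event):
    """Extract the fields needed downstream, rejecting unsupported services."""
    service = event.get("detail", {}).get("service")
    if service not in SUPPORTED_SERVICES:
        raise UnsupportedEventError(
            f"events from Amazon EventBridge for service ({service}) are not supported"
        )
    return {"service": service, "id": event.get("id")}


def process_batch(events):
    """Return (handled, skipped); unsupported events are skipped, never fatal."""
    handled, skipped = [], []
    for event in events:
        try:
            handled.append(parse_health_event(event))
        except UnsupportedEventError as err:
            # Log and move on rather than failing the whole batch.
            logger.warning("skipping event: %s", err)
            skipped.append(event)
    return handled, skipped
```

In a real queue consumer the skipped messages would presumably also need to be deleted from the queue, since SQS redelivers messages that are not deleted once the visibility timeout expires.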

ismith · 2022-02-22T17:17:23Z

This hit us, too, over the weekend due to AWS Sumerian being retired:

{"level":"error","error":"events from Amazon EventBridge for service (SUMERIAN) are not supported","time":"2022-02-21T19:58:30Z","message":"ignoring interruption event due to error"}

Seems like a place where good default behavior (as @bwagner5 mentioned on Jan 3) would be a win. Convention over configuration, yeah?

github-actions · 2022-03-25T17:10:06Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

vumdao · 2022-04-11T03:51:57Z

What is the workaround for this, or should we ignore aws.health events?

LikithaVemulapalli · 2022-04-28T18:30:46Z

Thanks for your patience! I’ve updated the README rules to be more precise, and skipping events should no longer be fatal.

AustinSiu · 2022-05-11T22:15:36Z

Fix released as part of v1.16.3, closing the issue.

bwagner5 added the Type: Bug, Type: Enhancement, and docs labels Jan 3, 2022

jillmon added the Priority: Medium label Jan 26, 2022

github-actions bot added the stale label Mar 25, 2022

AustinSiu removed the stale label Mar 30, 2022

jillmon assigned LikithaVemulapalli Apr 13, 2022

LikithaVemulapalli mentioned this issue Apr 26, 2022: Fix AWS Health Event Bridge Rule #633 (Merged)

snay2 added the Pending-Release label Apr 28, 2022

AustinSiu closed this as completed May 11, 2022

snay2 removed the Pending-Release label Aug 18, 2022
