Queue Processor: Wrong AWS Health Events #550

Closed
joshua-giumelli-deltatre opened this issue Dec 17, 2021 · 8 comments
Labels: docs, Priority: Medium, Type: Bug, Type: Enhancement

Comments

@joshua-giumelli-deltatre

joshua-giumelli-deltatre commented Dec 17, 2021

The docs for the infrastructure required for Queue Processor mode say to create an EventBridge rule for catching aws.health messages. The problem with this rule is that it catches all AWS Health events, even those that NTH cannot process.

We recently found that this rule was catching notifications for ElasticSearch/OpenSearch relating to the Log4Shell CVE. NTH tried to process events for the "ES" service and then got stuck in a crash loop trying to process this message on the queue.

To avoid this in future, an extra event pattern such as "service": "ec2" would be needed for the ScheduledChange rule.

Steps to reproduce

  1. Create the EventBridge rules from docs for Infrastructure Setup when using Queue Processor
  2. Receive a non-EC2 AWS Health event notification, e.g. for the ElasticSearch/OpenSearch Log4J CVE

Expected outcome
NTH should either ignore messages for services it is not able to process OR event notifications not relevant to NTH should not end up in its SQS queue.

Application Logs
NTH Logs:

2021/12/17 03:28:02 ERR ignoring interruption event due to error error="events from Amazon EventBridge for service (ES) are not supported"
2021/12/17 03:28:02 ERR error processing interruption events error="some interruption events for message Id 1100000000000000001000000001100000101000 could not be processed"
2021/12/17 03:28:02 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2021/12/17 03:28:02 WRN Stopping NTH - Duplicate Error Threshold hit.
panic: none of the waiting queue events could be processed

goroutine 97 [running]:
main.main.func3(0x1, 0x24ef720, 0xc00007f4a0, 0x21be2d2, 0x1c, 0x0, 0x0, 0x24ff9c8, 0xc000714fc0, 0x24ff9c8, ...)
	/node-termination-handler/cmd/node-termination-handler.go:211 +0x649
created by main.main
	/node-termination-handler/cmd/node-termination-handler.go:193 +0xc91

Problem Event Notification:

{
    "version": "0",
    "id": "9ca40099-e97d-510f-e9df-6f92042f34e1",
    "detail-type": "AWS Health Event",
    "source": "aws.health",
    "account": "XXXXXX",
    "time": "2021-12-14T04:05:00Z",
    "region": "ap-south-1",
    "resources": [
        "XXXXXX"
    ],
    "detail": {
        "eventArn": "arn:aws:health:ap-south-1::event/ES/AWS_ES_SECURITY_NOTIFICATION/AWS_ES_SECURITY_NOTIFICATION_XXXXXXXXXXXXXXXXXXXXX",
        "service": "ES",
        "eventTypeCode": "AWS_ES_SECURITY_NOTIFICATION",
        "eventTypeCategory": "accountNotification",
        "startTime": "Tue, 14 Dec 2021 04:05:00 GMT",
        "endTime": "Tue, 14 Dec 2021 04:15:00 GMT",
        "eventDescription": [
            {
                "language": "en_US",
                "latestDescription": "AWS is aware of the recently disclosed security issue affecting the open source Apache “Log4j2” utility. This utility is used by Amazon OpenSearch Service. We have released a service software update R20211203-P2 that contains the updated “Log4j2” utility that addresses the issue for your domain(s) in the AP-SOUTH-1 Region. We strongly recommend that you apply this software update immediately to mitigate this issue for your OpenSearch domains.\n\nYou can update the service software using the Amazon OpenSearch Service console [1]. If you have already updated your domain(s), or you see that the current Service Software Release in the Domain overview page shows ‘R20211203-P2’, no further action is required. \n\nIf you have any questions or concerns, please contact AWS Support [2]. \n\n[1] https://docs.aws.amazon.com/opensearch-service/latest/developerguide/service-software.html\n[2] https://aws.amazon.com/support"
            }
        ],
        "affectedEntities": [
            {
                "entityValue": "XXXXXX"
            }
        ]
    }
}

Environment

  • NTH App Version: v1.14.0
  • NTH Mode (IMDS/Queue processor): Queue Processor
  • OS/Arch: Bottlerocket OS 1.4.0/amd64
  • Kubernetes version: 1.21
  • Installation method: Helm chart version 0.16.0
@szymonpk

I encountered the same issue during a us-east-1 outage. From the code, it looks like the filters for "AWS Health Event" should be more precise: only the scheduledChange event type category is supported. I will try to correct it tomorrow in my environment.

@szymonpk

{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "service": ["EC2"],
    "eventTypeCategory": ["scheduledChange"]
  }
}

This filter does the trick for me.
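
For anyone who wants to sanity-check a filter like this locally before changing the rule, here is a minimal sketch. It is a simplified approximation of EventBridge matching, not the real matcher: it only models "value must equal one of the listed values" and nested objects. The `matches` helper and the trimmed-down `EC2_EVENT` are hypothetical; `ES_EVENT` keeps only the relevant fields of the problem notification from the original report.

```python
import json


def matches(pattern, event):
    """Return True if `event` satisfies `pattern` (simplified semantics).

    A list in the pattern means "the event value must equal one of these";
    a dict means "recurse into the nested object". Comparison is exact,
    so "ec2" and "EC2" are different values here.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# The filter from the comment above.
PATTERN = json.loads("""
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "service": ["EC2"],
    "eventTypeCategory": ["scheduledChange"]
  }
}
""")

# Relevant fields of the OpenSearch notification from the report.
ES_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "ES", "eventTypeCategory": "accountNotification"},
}

# Hypothetical EC2 scheduled-change event, reduced to the same fields.
EC2_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "EC2", "eventTypeCategory": "scheduledChange"},
}

print(matches(PATTERN, ES_EVENT))   # False
print(matches(PATTERN, EC2_EVENT))  # True
```

With the narrower pattern, the "ES" notification never reaches the queue, while an EC2 scheduledChange event still does.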

@bwagner5 added the Type: Bug, Type: Enhancement, and docs labels Jan 3, 2022
@bwagner5
Contributor

bwagner5 commented Jan 3, 2022

Thanks for reporting this issue! We can update the rule in the README. I think we should also look into making some errors in the monitors non-fatal. In a case like this, skipping events shouldn't cause a crash loop.
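
To illustrate the "skip instead of crash" idea, here is a minimal sketch in Python. It is not NTH's actual implementation (NTH is written in Go), and every name in it (`SUPPORTED_SERVICES`, `UnsupportedEventError`, `parse_event`, `process_batch`) is hypothetical. The error text is taken from the log in the report.

```python
# Assumption for illustration: only EC2 health events can be handled.
SUPPORTED_SERVICES = {"EC2"}


class UnsupportedEventError(Exception):
    """Raised for events the handler does not know how to process."""


def parse_event(message):
    """Turn a queue message into an interruption event, or raise."""
    detail = message.get("detail", {})
    service = detail.get("service")
    if service not in SUPPORTED_SERVICES:
        raise UnsupportedEventError(
            f"events from Amazon EventBridge for service ({service}) "
            "are not supported"
        )
    return {"id": message.get("id"), "service": service}


def process_batch(messages):
    """Process what can be processed; record and skip the rest.

    Unsupported events are collected in `skipped` instead of being
    treated as a fatal error for the whole batch. A real handler would
    presumably also acknowledge the skipped message so that it is not
    redelivered forever; that part is left out here.
    """
    handled, skipped = [], []
    for message in messages:
        try:
            handled.append(parse_event(message))
        except UnsupportedEventError as err:
            skipped.append((message.get("id"), str(err)))
    return handled, skipped


batch = [
    {"id": "msg-1", "detail": {"service": "ES"}},
    {"id": "msg-2", "detail": {"service": "EC2"}},
]
handled, skipped = process_batch(batch)
print(len(handled), len(skipped))  # 1 1
```

The design choice is that one unprocessable message produces a log entry rather than stopping the whole monitor.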

@jillmon added the Priority: Medium label Jan 26, 2022
@ismith

ismith commented Feb 22, 2022

This hit us, too, over the weekend due to AWS Sumerian being retired:

{"level":"error","error":"events from Amazon EventBridge for service (SUMERIAN) are not supported","time":"2022-02-21T19:58:30Z","message":"ignoring interruption event due to error"}

Seems like a place where good default behavior (as @bwagner5 mentioned on Jan 3) would be a win. Convention over configuration, yeah?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions bot added the stale label Mar 25, 2022
@AustinSiu removed the stale label Mar 30, 2022
@vumdao

vumdao commented Apr 11, 2022

What is the workaround for this, or should we ignore aws.health events?

@LikithaVemulapalli
Contributor

Thanks for your patience! I’ve updated the README rules to be more precise, and skipping events should no longer be fatal.

@snay2 added the Pending-Release label Apr 28, 2022
@AustinSiu
Contributor

Fix released as part of v1.16.3, closing the issue.

@snay2 removed the Pending-Release label Aug 18, 2022
9 participants