change 5XX Alarm to ELB (rather than backend) and codify NoHealthyInstancesAlarm #425
We had an incident recently.
First off, neither the 'High 5XX' alarm nor the 'No Healthy Instances' alarm was in the CloudFormation (manually created in the AWS Console, by the looks of it). So this PR adds those, with the same `Threshold`, `EvaluationPeriods`, `ComparisonOperator` and `Statistic` as the existing ones.

Crucially though, the 5XX alarm now uses the SUM of `HTTPCode_ELB_5XX` and `HTTPCode_Backend_5XX`. During the incident the backend was unresponsive (OutOfMemory), so the ELB was serving `504 Gateway Timeout` to consumers/clients, but no alarm fired because it was based only on the `HTTPCode_Backend_5XX` count. See...

There will be subsequent PRs for the other alarm improvements; I'm just doing this one in isolation to keep the PR manageable. One of the subsequent PRs will be to adopt the naming convention of most of our other alarms, see https://docs.google.com/document/d/1_3El3cly9d7u_jPgTcRjLxmdG2e919zCLvmcFCLOYAk
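
For reviewers unfamiliar with combining two ELB metrics in one alarm, here is a minimal sketch of one way to express it: a Python function that builds an `AWS::CloudWatch::Alarm` resource using metric math. It is not the template from this PR. The function names, the metric query ids, the `GreaterThanOrEqualToThreshold` operator, the 60-second period and the threshold values are all placeholders, and wrapping each metric in `FILL(..., 0)` is an assumption made here so that a metric reporting no data (as an unresponsive backend would) does not blank out the sum.

```python
import json

ELB_NAMESPACE = "AWS/ELB"  # CloudWatch namespace for Classic Load Balancer metrics


def metric_query(query_id, metric_name, load_balancer_name, period=60):
    """Build one input metric for the alarm (hypothetical helper).

    ReturnData is False because only the combined expression is alarmed on.
    """
    return {
        "Id": query_id,
        "ReturnData": False,
        "MetricStat": {
            "Metric": {
                "Namespace": ELB_NAMESPACE,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "LoadBalancerName", "Value": load_balancer_name}
                ],
            },
            "Period": period,  # placeholder, not taken from the PR
            "Stat": "Sum",
        },
    }


def build_5xx_alarm(load_balancer_name, threshold, evaluation_periods):
    """Return a CloudFormation alarm resource on ELB + backend 5XX counts."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": "Sum of ELB-generated and backend-generated 5XX",
            # Placeholder operator: the PR reuses whatever the existing alarm had.
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "EvaluationPeriods": evaluation_periods,
            "Threshold": threshold,
            "Metrics": [
                metric_query("elb5xx", "HTTPCode_ELB_5XX", load_balancer_name),
                metric_query("backend5xx", "HTTPCode_Backend_5XX", load_balancer_name),
                {
                    "Id": "total5xx",
                    # Assumption: treat missing datapoints as zero before adding.
                    "Expression": "FILL(elb5xx, 0) + FILL(backend5xx, 0)",
                    "Label": "Total 5XX",
                    "ReturnData": True,
                },
            ],
        },
    }


if __name__ == "__main__":
    # "my-elb", 10 and 1 are illustrative values only.
    print(json.dumps(build_5xx_alarm("my-elb", 10, 1), indent=2))
```

With this shape, a `504 Gateway Timeout` generated by the ELB itself counts towards the alarm even when the backend emits no `HTTPCode_Backend_5XX` datapoints at all.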