Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change 5XX Alarm to ELB (rather than backend) and codify NoHealthyInstancesAlarm #425

Merged
merged 1 commit into from
Feb 24, 2020

Conversation

twrichards
Copy link
Contributor

@twrichards twrichards commented Feb 21, 2020

We had an incident recently.

First off neither the 'High 5XX' alarm or the 'No Healthy Instances' alarm were in the CloudFormation (manually created in AWS Console by the looks of it). So this PR adds those, with the same Threshold, EvaluationPeriods, ComparisonOperator and Statistic as the existing.

Crucially though, the 5XX alarm now uses the SUM of HTTPCode_ELB_5XX and HTTPCode_Backend_5XX because during the incident the backend was unresponsive so the ELB was serving 504 Gateway Timeout to consumers/clients, but no alarm fired because it was based on HTTPCode_Backend_5XX count and the backend was unresponsive (OutOfMemory). See...
image

There will be subsequent PRs for the other alarms improvements, just doing this one in isolation to keep the PR manageable. One of the subsequent PRs will be to adopt the naming convention of most of our other alarms, see https://docs.google.com/document/d/1_3El3cly9d7u_jPgTcRjLxmdG2e919zCLvmcFCLOYAk

@mariogalic
Copy link
Contributor

mariogalic commented Feb 21, 2020

What is the difference between HTTPCode_Target_5XX_Count and HTTPCode_Backend_5XX? It seems they are the same metric but for different load balancer versions (Application vs Classic).

HTTPCode_Target_5XX_Count

The number of HTTP response codes generated by the targets. This does not include any response codes generated by the load balancer.

HTTPCode_Backend_5XX

The number of HTTP response codes generated by registered instances. This count does not include any response codes generated by the load balancer.

@twrichards twrichards force-pushed the change-5XX-alarm-to-be-on-ELB branch from 08ccc8d to 8fa2ef2 Compare February 21, 2020 11:51
@twrichards
Copy link
Contributor Author

What is the difference between HTTPCode_Target_5XX_Count and HTTPCode_Backend_5XX? It seems they are the same metric but for different load balancer versions (Application vs Classic).

HTTPCode_Target_5XX_Count

The number of HTTP response codes generated by the targets. This does not include any response codes generated by the load balancer.

HTTPCode_Backend_5XX

The number of HTTP response codes generated by registered instances. This count does not include any response codes generated by the load balancer.

@mario-galic great catch, I had intended to change it to HTTPCode_ELB_5XX but bad copy/pasta 🙈. Updated now.

…ormation. Also changing the 5XX alarm to be both ELB AND backend to catch timeouts etc (unresponsive backend)
@twrichards twrichards force-pushed the change-5XX-alarm-to-be-on-ELB branch from 8fa2ef2 to 45f3cfa Compare February 21, 2020 13:40
@mariogalic
Copy link
Contributor

AFAIK, HTTPCode_ELB_5XX means

  • ELB cannot forward request to instance, or
  • instance returned un-parsable nonsense back to ELB

however it does NOT mean instance responded with 5xx. @adamnfish @sihil @jacobwinch can you confirm this? In other words, instance can return 5xx but ELB metric will not count it. Thus alarming on HTTPCode_ELB_5XX is insufficient, and we should alarm on the SUM of HTTPCode_Backend_5XX + HTTPCode_ELB_5XX as Tom has done in this PR?

@twrichards twrichards merged commit de44113 into master Feb 24, 2020
@twrichards twrichards deleted the change-5XX-alarm-to-be-on-ELB branch February 24, 2020 12:35
@prout-bot
Copy link

Seen on PROD (merged by @twrichards 8 minutes and 40 seconds ago) Please check your changes!

Sentry Release: members-data-api

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants