change 5XX Alarm to ELB (rather than backend) and codify NoHealthyInstancesAlarm #425

twrichards · 2020-02-21T11:28:51Z

First off, neither the 'High 5XX' alarm nor the 'No Healthy Instances' alarm was in the CloudFormation (they were manually created in the AWS Console, by the looks of it). So this PR adds them, with the same Threshold, EvaluationPeriods, ComparisonOperator and Statistic as the existing alarms.

Crucially, though, the 5XX alarm now uses the SUM of HTTPCode_ELB_5XX and HTTPCode_Backend_5XX. During the incident the backend was unresponsive (OutOfMemory), so the ELB was serving 504 Gateway Timeout to consumers/clients, but no alarm fired because it was based only on the HTTPCode_Backend_5XX count, which does not include responses generated by the ELB itself. See...
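To illustrate the failure mode described above, here is a minimal sketch in Python. It is not the alarm definition from this PR: the per-minute counts, the threshold of 10, the 3 evaluation periods and the `alarm_fires` helper are all hypothetical, and the comparison is assumed to be GreaterThanThreshold.

```python
from typing import Sequence


def alarm_fires(counts: Sequence[int], threshold: int, evaluation_periods: int) -> bool:
    """Return True if each of the last `evaluation_periods` datapoints exceeds `threshold`.

    Simplified model of a GreaterThanThreshold alarm (assumption, not the real evaluator).
    """
    recent = counts[-evaluation_periods:]
    return len(recent) == evaluation_periods and all(c > threshold for c in recent)


# Hypothetical per-minute counts while the backend is unresponsive:
# the instances return nothing, so the ELB generates the 504s itself.
backend_5xx = [0, 0, 0, 0, 0]
elb_5xx = [0, 12, 30, 41, 38]

# The summed series is what the new alarm would evaluate.
combined = [b + e for b, e in zip(backend_5xx, elb_5xx)]

backend_only = alarm_fires(backend_5xx, threshold=10, evaluation_periods=3)
summed = alarm_fires(combined, threshold=10, evaluation_periods=3)

print(f"backend-only alarm fires: {backend_only}")
print(f"summed alarm fires: {summed}")
```

With these made-up numbers the backend-only alarm stays silent while the summed alarm fires, which is the gap this PR is closing.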

There will be subsequent PRs for the other alarm improvements; I'm doing this one in isolation to keep the PR manageable. One of the subsequent PRs will adopt the naming convention of most of our other alarms, see https://docs.google.com/document/d/1_3El3cly9d7u_jPgTcRjLxmdG2e919zCLvmcFCLOYAk

mariogalic · 2020-02-21T11:48:14Z

What is the difference between HTTPCode_Target_5XX_Count and HTTPCode_Backend_5XX? It seems they are the same metric but for different load balancer versions (Application vs Classic).

HTTPCode_Target_5XX_Count

The number of HTTP response codes generated by the targets. This does not include any response codes generated by the load balancer.

HTTPCode_Backend_5XX

The number of HTTP response codes generated by registered instances. This count does not include any response codes generated by the load balancer.

twrichards · 2020-02-21T11:56:30Z

@mario-galic great catch, I had intended to change it to HTTPCode_ELB_5XX but bad copy/pasta 🙈. Updated now.

mariogalic · 2020-02-21T13:54:54Z

AFAIK, HTTPCode_ELB_5XX means either:

the ELB cannot forward the request to an instance, or
the instance returned unparsable nonsense back to the ELB

However, it does NOT mean the instance responded with a 5xx. @adamnfish @sihil @jacobwinch can you confirm this? In other words, an instance can return a 5xx and the ELB metric will not count it. Thus alarming on HTTPCode_ELB_5XX alone is insufficient, and we should alarm on the SUM of HTTPCode_Backend_5XX + HTTPCode_ELB_5XX, as Tom has done in this PR?
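For reference, one way to express "alarm on the SUM of both metrics" is a CloudWatch metric math alarm. The sketch below only builds the parameters as a plain Python dict in the shape accepted by boto3's `put_metric_alarm`; it does not call AWS and is not the CloudFormation from this PR. The function name, alarm name, load balancer name, period, threshold and the use of `FILL(..., 0)` are all assumptions for illustration.

```python
from typing import Any, Dict


def five_xx_alarm_kwargs(load_balancer_name: str, threshold: float,
                         evaluation_periods: int) -> Dict[str, Any]:
    """Build put_metric_alarm-style parameters summing backend and ELB 5XX counts."""

    def metric_query(query_id: str, metric_name: str) -> Dict[str, Any]:
        # One input series per metric; ReturnData=False so only the sum is alarmed on.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ELB",  # Classic Load Balancer namespace
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "LoadBalancerName", "Value": load_balancer_name}
                    ],
                },
                "Period": 60,  # hypothetical period in seconds
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"High 5XX ({load_balancer_name})",  # hypothetical name
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            metric_query("backend5xx", "HTTPCode_Backend_5XX"),
            metric_query("elb5xx", "HTTPCode_ELB_5XX"),
            {
                "Id": "total5xx",
                # Assumption: FILL(..., 0) treats a missing datapoint as zero,
                # so a quiet period in one metric does not hide the other.
                "Expression": "FILL(backend5xx, 0) + FILL(elb5xx, 0)",
                "Label": "Backend + ELB 5XX",
                "ReturnData": True,
            },
        ],
    }


kwargs = five_xx_alarm_kwargs("example-elb", threshold=10, evaluation_periods=3)
print(kwargs["Metrics"][-1]["Expression"])
```

Only the `total5xx` expression returns data, so the alarm evaluates the combined count rather than either metric on its own.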

prout-bot · 2020-02-24T12:43:31Z

Seen on PROD (merged by @twrichards 8 minutes and 40 seconds ago) Please check your changes!

Sentry Release: members-data-api

twrichards requested review from arausuy, JoemG, mariogalic, paulbrown1982 and tomrf1 February 21, 2020 11:30

twrichards force-pushed the change-5XX-alarm-to-be-on-ELB branch from 08ccc8d to 8fa2ef2 Compare February 21, 2020 11:51

Commit 45f3cfa: 5XX & NoHealthy alarms are in AWS Console, but not codified in cloudformation. Also changing the 5XX alarm to be both ELB AND backend to catch timeouts etc (unresponsive backend)

twrichards force-pushed the change-5XX-alarm-to-be-on-ELB branch from 8fa2ef2 to 45f3cfa Compare February 21, 2020 13:40

twrichards mentioned this pull request Feb 21, 2020

Improve Alarm names PLUS link alarms to reader revenue alerts run book #426

Merged

mariogalic approved these changes Feb 24, 2020

twrichards merged commit de44113 into master Feb 24, 2020

prout-bot added the Pending-on-PROD label Feb 24, 2020

twrichards deleted the change-5XX-alarm-to-be-on-ELB branch February 24, 2020 12:35

prout-bot added Seen-on-PROD and removed Pending-on-PROD labels Feb 24, 2020

mkuzdowicz mentioned this pull request Mar 18, 2020

improve manage-frontend 5XX alarm (backend + ELB) guardian/manage-frontend#377

Merged
