Auto Scaling Lessons Learned
In the requests-per-second (RPS) example, the load test showed that queueing increased once RPS reached 25. To avoid excessive queueing, the auto scaling policy was set up to add capacity when RPS exceeded 20. The RPS headroom (20 versus 25) serves two purposes. First, the group stays provisioned for unexpected RPS bursts that are too small or too brief to trigger an auto scaling event. Second, the buffer provides a safety net against a "capacity spiral", in which the group falls behind demand and must constantly add capacity to catch up.
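As a minimal sketch of such a scale-up trigger, shown with the current AWS CLI rather than the legacy mon-*/as-* tools used later on this page; the RequestsPerSecond metric name, alarm name, and policy ARN placeholder are assumptions, not the production configuration:

aws cloudwatch put-metric-alarm --alarm-name api-rps-high --namespace NFLX \
  --metric-name RequestsPerSecond --dimensions Name=AutoScalingGroupName,Value=api \
  --statistic Average --period 300 --evaluation-periods 1 \
  --threshold 20 --comparison-operator GreaterThanThreshold \
  --alarm-actions <scale-up-policy-arn>

The threshold of 20 preserves the headroom below the 25 RPS point where queueing was observed, so the group scales before queueing builds.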
The time required for a metric to meet its scale-down threshold should be greater than the time required to scale up. In the RPS example, the scale-up alarm fires if RPS exceeds 20 for 5 minutes, while the scale-down alarm fires only if RPS drops below 10 for 20 minutes; note the 4x difference. Scaling down slowly reduces the chance of removing capacity on a false-positive event. For example, a middle tier service may begin to scale down because an edge service has a full or partial production outage; when the edge service comes back online, the middle tier may no longer have the capacity to meet current demand.
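Continuing the sketch above, the matching scale-down alarm evaluates over a much longer window (300 seconds x 4 evaluation periods = 20 minutes below 10 RPS) before it fires, versus 5 minutes for the scale-up alarm; the metric name, alarm name, and ARN placeholder remain assumptions:

aws cloudwatch put-metric-alarm --alarm-name api-rps-low --namespace NFLX \
  --metric-name RequestsPerSecond --dimensions Name=AutoScalingGroupName,Value=api \
  --statistic Average --period 300 --evaluation-periods 4 \
  --threshold 10 --comparison-operator LessThanThreshold \
  --alarm-actions <scale-down-policy-arn>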
For smaller auto scaling groups, percentage-based auto scaling may leave a particular availability zone (AZ) under-provisioned. If a scale-up adds fewer instances than there are AZs (10% of a group smaller than 30 instances is fewer than 3 new instances, with 3 AZs), one or more AZs receives no new capacity and may not be able to handle its share of the load. If an AZ is severely under-provisioned, the result can be reduced throughput and/or increased latency. Note that this can also happen to a larger farm during periods of low traffic (for example, early morning hours), when the ASG has shrunk to a small size.
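To make the arithmetic concrete, and as one possible guard that this page itself does not describe, the current Auto Scaling API lets a percentage-based policy enforce a minimum adjustment; the group size, policy name, and numbers below are hypothetical:

# A 10% scale-up on a 20-instance group adds only 2 instances; with 3 AZs,
# at least one AZ receives no new capacity.
# --min-adjustment-magnitude forces the policy to add at least 3 instances per scale-up.
aws autoscaling put-scaling-policy --auto-scaling-group-name api \
  --policy-name api-pct-scale-up --adjustment-type PercentChangeInCapacity \
  --scaling-adjustment 10 --min-adjustment-magnitude 3 --cooldown 300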
Alarms whose thresholds are close together relative to the normal variance of the metric may result in "capacity thrashing". For example, scaling on load average with alarm thresholds of 2 and 3 may produce unexpected scaling. The problem is exacerbated when the alarm period and evaluation time are small: a brief CPU spike, such as one caused by log rotation, can be enough to trigger an alarm and cause a false positive.
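As an illustration of the anti-pattern (current AWS CLI, hypothetical alarm names and policy ARNs, reusing the _SystemLoadAverage metric shown below), a pair of alarms this close together, each evaluated over a single short period, will flap on routine noise:

aws cloudwatch put-metric-alarm --alarm-name api-load-high --namespace NFLX \
  --metric-name _SystemLoadAverage --dimensions Name=AutoScalingGroupName,Value=api \
  --statistic Average --period 60 --evaluation-periods 1 \
  --threshold 3 --comparison-operator GreaterThanThreshold \
  --alarm-actions <scale-up-policy-arn>

aws cloudwatch put-metric-alarm --alarm-name api-load-low --namespace NFLX \
  --metric-name _SystemLoadAverage --dimensions Name=AutoScalingGroupName,Value=api \
  --statistic Average --period 60 --evaluation-periods 1 \
  --threshold 2 --comparison-operator LessThanThreshold \
  --alarm-actions <scale-down-policy-arn>

Widening the gap between the two thresholds, or raising the period and evaluation periods so a transient spike must persist before either alarm fires, reduces the thrashing.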
To avoid "capacity thrashing", create auto scaling policies with symmetric percentages. For example, the RPS example scaled up by 10% and also scaled down by 10%. If the two percentages are unequal, too much capacity may be added, then quickly removed, causing "capacity thrashing".
CloudWatch aggregates data points over the alarm period, so two different periods, 300 and 600 seconds, can yield different aggregate values. This may result in unexpected scaling behavior, especially when combined with the tightly spaced alarm thresholds discussed above. An example of different numbers reported for the same metric at different periods:
[awsprod@awsprod100] mon-get-stats _SystemLoadAverage --period 300 --dimensions "AutoScalingGroupName=api" --headers --namespace "NFLX" --statistics "Average"
Time Average Unit
2011-12-14 22:25:00 10.513188405797104 None
2011-12-14 22:30:00 14.119027777777777 None
2011-12-14 22:35:00 17.88585365853659 None
2011-12-14 22:40:00 12.57720930232558 None
2011-12-14 22:45:00 10.395000000000005 None
2011-12-14 22:50:00 14.785624999999996 None
2011-12-14 22:55:00 12.755767195767197 None
2011-12-14 23:00:00 2.8151906158357756 None
2011-12-14 23:05:00 2.398559999999998 None
2011-12-14 23:10:00 2.053230403800475 None
2011-12-14 23:15:00 2.7254260089686104 None
2011-12-14 23:20:00 3.3501363636363615 None
[prod@us-east-1]/apps/aws/scripts
[awsprod@awsprod100] mon-get-stats _SystemLoadAverage --period 600 --dimensions "AutoScalingGroupName=api" --headers --namespace "NFLX" --statistics "Average"
Time Average Unit
2011-12-14 22:25:00 12.35446808510638 None
2011-12-14 22:35:00 15.168333333333345 None
2011-12-14 22:45:00 12.26004424778762 None
2011-12-14 22:55:00 6.3600377358490565 None
2011-12-14 23:05:00 2.2159170854271353 None
2011-12-14 23:15:00 3.0356659142212195 None