This AWS Step Function is used to rotate the nodes in an Elasticsearch cluster, without interrupting the availability of data.
It is designed so that node rotations can be scheduled frequently, in order to regularly update the underlying AMI (Amazon Machine Image) of each EC2 instance. We use this tool in combination with AMIgo (for baking AMIs) and Riff-Raff (for updating the AMI associated with an AutoScaling Group).
The project assumes that each Elasticsearch node is running on a dedicated EC2 instance, which is part of an ASG (AutoScaling Group).
- Ensure that all EC2 instances are running the AWS SSM agent.
- Create an S3 bucket (or choose an existing one), to store SSM command output. This is only required temporarily, so you may wish to configure object expiration for this bucket.
- Ensure that all EC2 instances have the required permissions.
- Create a new CloudFormation stack, using the template in this project. The frequency of node rotations is passed into the template as a parameter.
- Add the tag RotateWithElasticsearchNodeRotation: true to the AutoScaling Groups containing the instances that will be rotated.
- Update the AMI associated with your AutoScaling Groups on a regular basis (using Riff-Raff's scheduled deploy feature).
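As an illustration of how the opt-in tag could be used to pick eligible groups, here is a minimal sketch. The `AsgSummary` and `AsgTag` shapes and the `eligibleForRotation` helper are hypothetical names made up for this example (loosely modelled on what the Auto Scaling API returns), and the exact string match on `"true"` is an assumption; the lambdas in this project may do this differently.

```typescript
// Hypothetical shapes for this sketch, not the project's real types.
interface AsgTag {
  Key: string;
  Value: string;
}

interface AsgSummary {
  AutoScalingGroupName: string;
  Tags: AsgTag[];
}

const ROTATION_TAG_KEY = "RotateWithElasticsearchNodeRotation";

// Keep only the groups that carry the opt-in tag with the value "true".
// Assumption: the value is compared as an exact, case-sensitive string.
function eligibleForRotation(groups: AsgSummary[]): AsgSummary[] {
  return groups.filter((group) =>
    group.Tags.some(
      (tag) => tag.Key === ROTATION_TAG_KEY && tag.Value === "true"
    )
  );
}
```

Groups without the tag, or with any other value, are ignored by this filter, so removing the tag would be enough to opt a group out.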
Sometimes it's useful to rotate an ES node manually (e.g. during an ES upgrade). To do this, you can optionally pass a targetInstanceId in the Step Function input object. It's usually easiest to open an existing execution, click New Execution, and then edit the input object.
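The manual edit described above amounts to adding one field to an existing input object. A minimal sketch follows; only the `targetInstanceId` field comes from this README, while `withTargetInstance`, `ExecutionInput` and the placeholder instance ID are hypothetical names used for illustration. Any other fields in your real input should be copied from a previous execution rather than guessed.

```typescript
// Hypothetical helper: every field other than targetInstanceId is passed through untouched.
type ExecutionInput = Record<string, unknown>;

// EC2 instance IDs are "i-" followed by 8 or 17 hexadecimal characters.
const INSTANCE_ID_PATTERN = /^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$/;

function withTargetInstance(
  existingInput: ExecutionInput,
  targetInstanceId: string
): ExecutionInput {
  // Fail early on typos instead of starting an execution that cannot find its node.
  if (!INSTANCE_ID_PATTERN.test(targetInstanceId)) {
    throw new Error(`Not a valid EC2 instance ID: ${targetInstanceId}`);
  }
  return { ...existingInput, targetInstanceId };
}

// Placeholder: paste the input from a previous execution here.
const previousInput: ExecutionInput = {};

// JSON to paste into the New Execution dialog.
const manualInput = JSON.stringify(
  withTargetInstance(previousInput, "i-0123456789abcdef0"),
  null,
  2
);
console.log(manualInput);
```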
This Step Function triggers a number of TypeScript lambdas, which coordinate the process of replacing a node by:
- Performing various sanity checks and identifying a node to rotate
- Adding a new node into the cluster
- Migrating all data from the target node onto the new node (if data is present)
- Shutting down Elasticsearch on the empty node
- Terminating the unused EC2 instance
In order to ensure that the new EC2 instance is brought up in the same Availability Zone as the target EC2 instance, the target instance is detached from its ASG during the node rotation process.
In order to move all data off the target EC2 instance, the node is excluded from shard allocation. Shard rebalancing is temporarily disabled during the rotation process, to prevent Elasticsearch from moving shards unnecessarily.
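For readers unfamiliar with the Elasticsearch settings involved, the sketch below builds request bodies for the cluster settings API (`PUT /_cluster/settings`). The setting names `cluster.routing.allocation.exclude._ip` and `cluster.routing.rebalance.enable` are standard Elasticsearch settings, but the helper names, the choice of `persistent` over `transient`, and excluding by IP rather than by node name are assumptions for this example and may not match what the lambdas actually send.

```typescript
// Body shape for PUT /_cluster/settings. A null value resets a setting to its default.
interface ClusterSettingsBody {
  persistent: Record<string, string | null>;
}

// Ask Elasticsearch to move all shards away from the node with this IP.
// Assumption: the node is identified by IP; _name or _host are alternatives.
function excludeNodeFromAllocation(nodeIp: string): ClusterSettingsBody {
  return {
    persistent: { "cluster.routing.allocation.exclude._ip": nodeIp },
  };
}

// Remove the exclusion once the old node has been shut down.
function clearAllocationExclusion(): ClusterSettingsBody {
  return {
    persistent: { "cluster.routing.allocation.exclude._ip": null },
  };
}

// "none" stops shard rebalancing; "all" restores the default behaviour.
function setRebalancing(enabled: boolean): ClusterSettingsBody {
  return {
    persistent: { "cluster.routing.rebalance.enable": enabled ? "all" : "none" },
  };
}

// Hypothetical order of calls during a rotation, using a made-up node IP.
const rotationSettings: ClusterSettingsBody[] = [
  setRebalancing(false),
  excludeNodeFromAllocation("10.0.0.12"),
  clearAllocationExclusion(),
  setRebalancing(true),
];
console.log(JSON.stringify(rotationSettings, null, 2));
```

Disabling rebalancing first means the only shard movement during the rotation is the drain of the excluded node, which keeps the migration quicker and easier to reason about.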
This Step Function requires a number of IAM permissions in order to control the number of running EC2 instances and run commands against Elasticsearch nodes (which is achieved via SSM's EC2 Run Command). Full details of the permissions required can be found in this project's CloudFormation template.
The EC2 instances (which are subject to rotation) will require the following IAM permissions in order to handle incoming SSM commands:
Statement:
  - Effect: Allow
    Action:
      - ec2messages:AcknowledgeMessage
      - ec2messages:DeleteMessage
      - ec2messages:FailMessage
      - ec2messages:GetEndpoint
      - ec2messages:GetMessages
      - ec2messages:SendReply
      - ssm:UpdateInstanceInformation
      - ssm:ListInstanceAssociations
      - ssm:DescribeInstanceProperties
      - ssm:DescribeDocumentParameters
    Resource: "*"
  - Effect: Allow
    Action:
      - s3:PutObject
    Resource:
      - arn:aws:s3:::<your_bucket_name_here>/*
This project is deployed to multiple AWS accounts. Consequently, you should ensure that a node rotation can still be performed successfully before merging any changes to main.
Unfortunately, there is no pre-production environment available for testing changes. Consequently, we recommend using Riff-Raff to deploy your branch to an individual account in order to validate your changes in production. If in doubt, testing changes against the ELK stack (in the Deploy Tools account) is a good place to start.
In order to do this, select Preview from the deployment page (instead of Deploy Now). Next, Deselect all and then manually select all deployment tasks for a specific account. Once you’ve done this you can Preview with selections, check the list of tasks and then Deploy.
Once you have confirmed that the change works as expected, the PR can be merged. This will automatically roll the change out across several AWS accounts via Riff-Raff. Please also inform the Investigations & Reporting team, as they use a different deployment mechanism and will need to pick up the change manually.
If the change adds or removes a feature, significantly alters AWS resources or is considered to be especially risky, you might also want to inform the teams who own the affected AWS accounts via Chat/email.