From 1df001694076a388432b2a63ebe5c969ba81c0b1 Mon Sep 17 00:00:00 2001
From: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
Date: Mon, 18 Sep 2023 10:50:47 +1000
Subject: [PATCH 1/2] Add Elasticsearch cluster maintenance documentation

---
 documentation/meta/index.md                   |   1 +
 .../meta/maintenance/elasticsearch_cluster.md | 349 ++++++++++++++++++
 documentation/meta/maintenance/index.md       |  14 +
 3 files changed, 364 insertions(+)
 create mode 100644 documentation/meta/maintenance/elasticsearch_cluster.md
 create mode 100644 documentation/meta/maintenance/index.md

diff --git a/documentation/meta/index.md b/documentation/meta/index.md
index cff028528ce..79f9104e177 100644
--- a/documentation/meta/index.md
+++ b/documentation/meta/index.md
@@ -10,6 +10,7 @@ ci_cd/index
 decision_making/index
 documentation/index
 monitoring/index
+maintenance/index
 communication_aliases
 codespell
 ```

diff --git a/documentation/meta/maintenance/elasticsearch_cluster.md b/documentation/meta/maintenance/elasticsearch_cluster.md
new file mode 100644
index 00000000000..d1b3bff79b7
--- /dev/null
+++ b/documentation/meta/maintenance/elasticsearch_cluster.md
@@ -0,0 +1,349 @@

# Elasticsearch cluster maintenance

This document covers upgrading and making configuration changes to our
Elasticsearch clusters. The audience of this document is Openverse maintainers
with access to Openverse's infrastructure.

## External resources

### Documentation

```{warning}
The following links are pinned to the "current" version. You can switch to a
specific Elasticsearch version via the version dropdown at the top of the table
of contents on the right side of the page.
```

- [Migration guides](https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes.html)
- [REST API](https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html)
- [Cluster configuration](https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html)

### Dependencies

- [`elasticsearch-py`](https://github.com/elastic/elasticsearch-py)
- [`elasticsearch-dsl`](https://github.com/elastic/elasticsearch-dsl-py/)

## Version upgrades

Updating the Elasticsearch (ES) cluster involves three basic steps:

- Upgrade the Elasticsearch version used in local development
- Deploy staging and production clusters with the new version
- Upgrade the Elasticsearch client versions to match the new version
  - The API, ingestion server, and catalog all communicate with Elasticsearch
    and need attention during this step

The order of these steps, in particular when the client versions should be
updated, depends on the breaking changes in Elasticsearch and its clients, and
on whether the clients are backwards or forwards compatible. The order above
assumes a _forwards_ compatible client, one that works with the current and new
version of ES at the same time. Depending on the changes to the ES API and the
client, it may be necessary to upgrade the client first, in which case the
client must be _backwards_ compatible with the previous version.

For example, the ES 7 client was _forwards_ compatible with ES 8 through an
environment variable that enabled a compatibility mode. This meant we could
update the local ES instance and the live clusters to ES 8 before updating the
local clients, without introducing a compatibility issue.
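As a rough illustration of what that looked like from the application side, the
sketch below enables the compatibility mode via an environment variable. The
variable name (taken from the 7.x Python client documentation) and the local
node address are assumptions here; confirm both against the client's own
release notes before relying on them.

```python
# Sketch of the ES 7 -> 8 forwards-compatibility setup: with compatibility mode
# enabled, the 7.x client sends "compatible-with=7" content-type headers that
# an ES 8 cluster accepts, so the cluster can be upgraded before the client.
import os

# Assumed variable name; it must be set before the client is constructed.
os.environ.setdefault("ELASTIC_CLIENT_APIVERSIONING", "true")

from elasticsearch import Elasticsearch  # the 7.x client in this scenario

es = Elasticsearch("http://localhost:9200")
print(es.info()["version"]["number"])  # works against both ES 7 and ES 8 nodes
```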
This guide primarily covers the process of upgrading the Elasticsearch cluster,
because it is both the most complicated and the most stable of these processes.
While there are several improvements that could be made to the current
Elasticsearch Terraform module to ease this process, the current process works,
and the current pace of ES updates is such that it remains sufficient for our
purposes.

### Steps

#### 0. Read the Elasticsearch release notes

Before doing anything, read the release notes for each minor or major version
we are upgrading through. Pay special attention, of course, to major version
release notes and sections on breaking changes. Openverse's Elasticsearch usage
is relatively simple, and we often avoid issues with breaking changes on the
cluster side, but it is paramount to understand these breaking changes before
embarking on a cluster upgrade. In particular, note any configuration changes
that we need to make to the cluster via environment variables, and whether
changes need to be made to the deployment process. For example, pay attention
to any changes related to node registration, inter-cluster communication, and
transport layer security (i.e., HTTPS, client keys, or other such changes).

#### 1. Determine when to update the client version

The API and ingestion server use
[`elasticsearch-py` and `elasticsearch-dsl`](#dependencies) to build queries and
communicate with Elasticsearch. Both libraries support pinning the client
version to the Elasticsearch major version by
[appending the major version number to the end of the package name](https://github.com/elastic/elasticsearch-py#compatibility),
e.g., `elasticsearch7` for the ES 7 client and `elasticsearch-dsl7` for the DSL
library for ES 7. This can be useful if Pipenv presents difficulties with
pinning a particular version.

Read the release notes for all ES client versions after the current client
version. **Check `Pipfile.lock` to confirm the current version, as it may not
necessarily be the version in the `Pipfile`, particularly if the dependency is
only minor-version constrained (uses `~=X.X`)**. Patch versions shouldn't
include breaking changes, and neither should minor versions, but it's better to
be safe than sorry. If this is a major version upgrade, spend extra time reading
the release notes for the first release of the new major version.
[Elastic publishes comprehensive and straightforward client migration documentation for major version releases](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/migration.html):
these notes are required reading when upgrading between major versions.

If the current client version is forwards compatible with the new ES version,
then the ES client should be upgraded last. For major version releases, it is
often only the _last_ minor version release of the client library that has a
forwards compatibility option, so it might first be necessary to upgrade the
clients to that last minor version of the library. In this case, the cluster
will be at an earlier _minor_ version than the client expects. The ES client
should handle this fine, even though it is designed to match the cluster
version. Luckily, our unit and integration tests in the API and ingestion server
are comprehensive enough to give us confidence in Elasticsearch client upgrade
compatibility, because our tests run against a real Elasticsearch node rather
than through mocks. Still, better safe than sorry, so read the release notes!
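When confirming the current versions, a quick script can be less error-prone
than reading the lockfile by hand. The following is a minimal sketch; it assumes
the standard PyPI distribution names and the local single-node setup from the
root `docker-compose.yml`, so adjust both if the version-suffixed packages are
what the `Pipfile` actually pins.

```python
# Print the installed client versions and the version of the running cluster,
# which together determine whether the client is forwards compatible.
from importlib.metadata import version

from elasticsearch import Elasticsearch

print("elasticsearch-py:", version("elasticsearch"))
print("elasticsearch-dsl:", version("elasticsearch-dsl"))

es = Elasticsearch("http://localhost:9200")  # the local development node
print("cluster:", es.info()["version"]["number"])
```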
If the current client version is _not_ forwards compatible (perhaps because of a
security patch or a change in cluster authentication) but is backwards
compatible, you must upgrade the client version before upgrading the cluster.

#### 2. Upgrade the client version if necessary

If in the previous step we determined that the current client version is not
forwards compatible, whether because it needs to be updated to a more recent
forwards compatible minor version or to a backwards compatible major version,
upgrade the client version now and deploy through to production. Confirm that
services are stable.

#### 3. Upgrade the local Elasticsearch version

Update the local Elasticsearch version in the root `docker-compose.yml`. In step
0 we should have noted any configuration changes that will be necessary. The
Elasticsearch setup used locally is a single-node setup, meaning it is generally
significantly simpler than the live clusters. **Do not take the ease with which
the local version is updated as an indication that the live cluster
configuration changes will be easy!** Elasticsearch configuration becomes
significantly more complicated when there are multiple nodes.

Make changes to
[`docker/es/env.docker`](https://github.com/WordPress/openverse/blob/9afc32ab703c692980ba7326d5b771e6bb754c97/docker/es/env.docker)
to update the configuration as necessary.

At this point CI and all integration tests should pass. If they do not, check
whether the issue is with the ES node configuration or with the client. Read
through the release notes again to see if there is anything relevant, and check
the GitHub repositories. Make changes as necessary. This is the first time the
integration between the application and the new ES version is tested, so it is
critical that it passes. If changes to index settings or mappings are necessary,
they must be made in a way that is backwards compatible. The cluster info API
returns `cluster_name`, which can be used at runtime to swap between mappings or
index settings if necessary.
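As an illustration, such a runtime switch could look something like the sketch
below. The cluster name and the settings shown are placeholders rather than
Openverse's actual values.

```python
# Illustrative sketch: choose index settings based on which cluster the
# application is connected to. `es.info()` also returns the server version
# under `version.number`, which is an alternative switch if the clusters share
# a name.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
cluster_name = es.info()["cluster_name"]

if cluster_name == "openverse-elasticsearch-8-9-0":  # hypothetical new cluster
    index_settings = {"index": {"number_of_replicas": 1}}
else:  # placeholder settings only needed on the previous cluster version
    index_settings = {"index": {"number_of_replicas": 1, "refresh_interval": "30s"}}

print(cluster_name, index_settings)
```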
Add new CI steps to test the previous Elasticsearch version as well as the new
version. Do this by using an environment variable to configure the Elasticsearch
version used,
[as we did in this PR for the ES 7 to 8 upgrade](https://github.com/WordPress/openverse/pull/2832).
If the new configuration changes do not work with the previous version, it will
be necessary to have distinct `docker/es/env.docker` files for each version,
selected via the same environment variable that determines the ES version.

Once everything passes for the current and new Elasticsearch versions and the
changes are merged and deployed, this step is complete.

#### 4. Create the new Terraform module

Copy the current Elasticsearch Terraform module into an entirely new module
named with the version in the directory name. For example, if upgrading to
version 8.9.0, duplicate the existing Terraform module and name the new
directory `elasticsearch-8.9.0`[^past-foibles]. Make changes to the
configuration based on the release note reading done in previous steps
(especially step 0), and deploy the new cluster. It may take several attempts to
successfully deploy the new cluster. The critical things to ensure are:

1. All nodes successfully register with the cluster, including the expected
   number of data and master nodes based on the configuration.
1. All nodes have the correct resources allocated to the Elasticsearch process.
1. All configuration changes are reflected in the new cluster.

[Elasticvue is especially useful for this](https://elasticvue.com/), in
particular the "Nodes" tab, which gives a good overview of the cluster's nodes.
It also makes it easy to send REST API calls to the cluster to confirm
configurations if needed.
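A scripted spot check can complement Elasticvue here. The sketch below is only
an example of the kind of verification that helps; the endpoint and the expected
node count are placeholders for the new cluster's actual values.

```python
# Verify that a freshly deployed cluster has the expected number of registered
# nodes and inspect per-node roles, similar to Elasticvue's "Nodes" tab.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://staging-elasticsearch-8-9-0.example.internal:9200")

health = es.cluster.health()
assert health["number_of_nodes"] == 6, health  # expected data + master nodes
assert health["status"] in ("green", "yellow"), health

# Per-node name, roles, and resource usage.
print(es.cat.nodes(v=True, h="name,node.role,master,heap.percent,ram.percent"))
```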
Tearing down and redeploying the new cluster over and over for each
configuration change during the initial deployment attempts can be cumbersome
and slow. Sometimes it is easier to follow the
[instructions for in-place configuration changes](#in-place-changes-to-live-nodes)
on the nodes in the new cluster until we have determined the correct final
configuration. The final successful deployment _must_ be from scratch: i.e., do
not treat an in-place configuration change as a success until the configuration
is verified to work for a wholly new cluster. Clusters must be able to spin up
in a fully configured state, without further intervention, for the module to be
considered usable.

[^past-foibles]:
    In the past we reused the single module and made changes in place. This
    happened to work out, but if something catastrophic had happened and we had
    needed to deploy the current cluster from scratch again, we would have had
    our work cut out for us, juggling reverts and likely Terraform state issues.
    This comes down to the fact that our Elasticsearch module is inflexible,
    with configuration created inside the module rather than defined at the root
    module level. Because it is simpler to duplicate the module than to add
    conditionals inside it to handle multiple versions, the recommendation is to
    duplicate it. This ensures that we can redeploy the current cluster, whether
    entirely or just a failed node, without needing to juggle changes to the
    code configuring that cluster.

Deploy a staging cluster first. Production will come later. The cluster boxes
are expensive, so minimising the length of time that we have multiple cluster
versions deployed per environment saves us some money and makes the list of EC2
boxes a little less confusing to navigate.

```{warning}
The existing clusters will be deprovisioned at the end of the process and
should be left alone in the meantime. They are still being used by staging
and production until the final steps!
```

#### 5. Run a reindex with the new staging cluster

Update the staging ingestion server to use the new cluster (this requires a
staging deployment) and manually trigger a `REINDEX` job, followed by
`CREATE_AND_POPULATE_FILTERED_INDEX`. Confirm that index settings and mappings
are as expected and that the new documents are retrievable.

#### 6. Point the staging API to the new cluster

Update the staging API configuration to use the new cluster. Confirm the API is
stable and search works as expected.

#### 7. Deploy the new production cluster

Deploy the new production cluster. This should only require adding a reference
to the new Elasticsearch module in the production root module.

#### 8. Run a data refresh with the new production cluster

Point Airflow and the production ingestion server to the new cluster and run a
full data refresh, including any related DAGs like filtered index creation.
Confirm this works. Do not proceed until we are 100% confident the new cluster
works as expected. If in doubt, re-run the data refresh and make changes until
we are confident everything works perfectly.

#### 9. Point the production API to the new cluster

Update the production API configuration to use the new cluster. Confirm the API
is stable and search works, etc.

#### 10. Upgrade client versions, if necessary

If the client versions were not upgraded in
[step 2](#2-upgrade-the-client-version-if-necessary), upgrade them now and
deploy all updated services through to production. Confirm the data refresh
works before proceeding to the next step.

#### 11. Celebrate, breathe a sigh of relief, and deprovision the old clusters

Congrats! We should now be on the new Elasticsearch version. We can tear down
the old clusters and say goodbye to that Elasticsearch version. Be sure to
update CI to remove tests against the previous version as well. Reflect on this
process and identify potential improvements. Please update this documentation to
clarify any pain points or pitfalls identified during the upgrade so that we are
better prepared next time.

## Configuration changes

```{tip} How to choose which process to use
If the configuration change is needed to handle a security vulnerability
(e.g., `log4j`), use the in-place approach. If the configuration change is
relatively minor, also use the in-place approach.

Otherwise, prefer the new cluster approach, as it is more robust and offers
more opportunities to recover cleanly if something goes wrong.
```

### In-place changes to live nodes

Elasticsearch nodes can be restarted one by one without issue. They will drop
from the cluster, and once they are back up they will re-register themselves
with the cluster. This can be done at any time and should be okay. However, if
it is possible to time this during a period when indexing is not happening, that
is ideal. For severe security issues, of course, just go ahead and do it.

**Follow this process for the staging cluster first. After confirming that
cluster is okay, repeat it for production.** If the changes only apply to
production then ignore this, but **_go slowly_**!

```{warning}
Be aware that this process requires modifying files via text-based editors
like `vim` or `nano`. If you are not comfortable with that process and
confident in how to use these editors, **do not attempt this process
yourself**. Find someone to buddy with you who can drive during the file
editing process.
```

1. Describe the specific configuration changes we need to make to each node. If
   we are modifying an existing configuration variable, document the existing
   setting and the new setting. If we are adding or removing a setting, document
   it! We need to make sure the existing configuration is as easy as possible to
   return to if the new configuration does not work as expected.
1. Log into AWS, go to the EC2 dashboard, and filter the instances to list only
   the Elasticsearch nodes for the environment being modified.
1. Connect Elasticvue to the cluster and open the "Nodes" tab.
1. Open the CloudWatch API dashboard for the environment you are modifying, and
   the Elasticsearch dashboard when making changes to the production cluster.
1. For each node, one by one, without concern for the order, do the following:
   1. Note the instance identifier in a document accessible to other Openverse
      maintainers. Mark the node as started in the document.
   1. Use the
      [EC2 Instance Connect feature](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-methods.html#ec2-instance-connect-connecting-console)
      to connect to the node.
   1. `cd /` and `ls` to confirm you can see the `docker-compose.yml`. This is
      the configuration file for the node.
   1. Run `vim` or `nano` with `sudo` to edit the `docker-compose.yml` and make
      the configuration changes identified in the first step of the overall
      process.
   1. Save the changes.
   1. Run `docker-compose up -d` to restart the node with the new settings.
   1. Mark the node as restarted in the shared document.
   1. Monitor the cluster (a rough scripted version of this check is sketched
      after this list):
      - Confirm in Elasticvue that the node dropped from the cluster and wait
        for it to reappear and become stable[^health-during-refresh].
      - Check the CloudWatch dashboards and confirm everything remains stable
        or, if there are slight changes corresponding to the node dropping out,
        that metrics stabilise once the node has reinitialised.
   1. Mark the node as completed in the shared document.
1. Do one last stability check of the environment's API and cluster.
1. Update the Terraform configuration to reflect the changes made to the live
   nodes.
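The following sketch is one way to script the monitoring step while a node
restarts; the endpoint and node count are placeholders for the environment being
modified, and "yellow" is accepted because a concurrent reindex can keep the
cluster out of "green".

```python
# Poll the cluster until the restarted node re-registers and health settles.
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://production-elasticsearch.example.internal:9200")
expected_nodes = 6  # adjust to the environment's configured node count

while True:
    health = es.cluster.health()
    nodes, status = health["number_of_nodes"], health["status"]
    if nodes == expected_nodes and status in ("green", "yellow"):
        print(f"Node re-registered; cluster health is {status}")
        break
    print(f"Waiting: {nodes}/{expected_nodes} nodes, status {status}")
    time.sleep(10)
```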
That's it! Repeat the process for each relevant environment and open a PR for
the Terraform changes.

[^health-during-refresh]:
    If a data reindex is running then the cluster will probably report its
    health as "yellow". You should check that the node has reconnected and that
    its shards and replicas look good.

### Changes requiring a new cluster deployment

This follows the same process outlined in [version upgrades](#version-upgrades).
That is, we can most likely treat the new configuration changes identically to a
version upgrade, even if the actual cluster version is not changing. If the
cluster version is staying the same and only configuration changes are being
made, you can generally skip reading release notes, confirming whether the
client version is compatible, and so on. The main focus should be on
understanding the new configuration and determining whether it is possible to
test it locally. Otherwise, follow the steps for deploying a new cluster and
switching the live services over to it.

## Potential improvements

- Pass node configuration in from the root module rather than generating it
  inside the Elasticsearch module. This would enable reusing the same module
  rather than needing to create a new one for each version.
- Use an auto-scaling group to deploy the nodes rather than creating each node
  individually.
- Use a single "current" endpoint to represent the "current" ES version and
  point all services to it. Rather than redeploying services with a new
  endpoint, use a load balancer to change the target ASG for the "current"
  endpoint. If services become unstable, switch the target back to the old
  cluster rather than needing to redeploy again.
diff --git a/documentation/meta/maintenance/index.md b/documentation/meta/maintenance/index.md
new file mode 100644
index 00000000000..eaa4f259b6f
--- /dev/null
+++ b/documentation/meta/maintenance/index.md
@@ -0,0 +1,14 @@

# Openverse Infrastructure Maintenance

```{note}
Most Openverse infrastructure documentation can be found in the infrastructure
repository itself: [`WordPress/openverse-infrastructure` (private repo)](
  https://github.com/WordPress/openverse-infrastructure
)
```

```{toctree}
:titlesonly:

elasticsearch_cluster.md
```

From ca56eee2d8fa5b7f0a5f7df891e46f9777c3cc16 Mon Sep 17 00:00:00 2001
From: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
Date: Thu, 21 Sep 2023 14:00:41 +1000
Subject: [PATCH 2/2] Remove unnecessary extension

Co-authored-by: Dhruv Bhanushali
---
 documentation/meta/maintenance/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/documentation/meta/maintenance/index.md b/documentation/meta/maintenance/index.md
index eaa4f259b6f..bb00b822b96 100644
--- a/documentation/meta/maintenance/index.md
+++ b/documentation/meta/maintenance/index.md
@@ -10,5 +10,5 @@ repository itself: [`WordPress/openverse-infrastructure` (private repo)](
 ```{toctree}
 :titlesonly:

-elasticsearch_cluster.md
+elasticsearch_cluster
 ```