From 797b32c8e76a72a123a1338e6c8768d57f6ed885 Mon Sep 17 00:00:00 2001
From: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Date: Tue, 16 Apr 2024 18:09:57 -0700
Subject: [PATCH] Implementation plan: Ingestion server removal (#4026)

* Add initial plan

* Correct wording

Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>

* Add clarifications and formatting fixes

* Adjust approach using ASG

* Add more detail about cost analysis and link to breakdown

* Fix link

* Add prior art

* Fix footnote

* Update link and trigger rule

* Accept aetherunbound approval

Co-authored-by: Madison Swain-Bowden

* Lint approval

* Remove unnecessary step

* Clarify use of tags for ASGs, add -pool suffix

* Reference opportunity for future iteration

* Record sarayourfriend approval

---------

Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
Co-authored-by: Madison Swain-Bowden
---
 ...mentation_plan_ingestion_server_removal.md | 611 ++++++++++++++++++
 1 file changed, 611 insertions(+)
 create mode 100644 documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md

diff --git a/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
new file mode 100644
index 00000000000..cf3aa8553b9
--- /dev/null
+++ b/documentation/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
@@ -0,0 +1,611 @@
+# 2024-03-28 Implementation Plan: Ingestion Server Removal
+
+**Author**: @stacimc
+
+## Reviewers
+
+- [x] @sarayourfriend
+- [x] @AetherUnbound
+
+## Project links
+
+- [Project Thread](https://github.com/WordPress/openverse/issues/3925)
+- [Project Proposal](/projects/proposals/ingestion_server_removal/20240319-project_proposal.md)
+
+## Overview
+
+The critical data refresh process is orchestrated by Airflow via the
+`_data_refresh` DAGs, but the bulk of the work is merely triggered by Airflow
+on a remote ingestion server. For the reasons laid out in the Project
+Proposal, we will be moving all operational pieces of the data refresh into
+Airflow itself, and removing the ingestion server entirely.
+
+The work which must be moved into Airflow can be split into these abstract
+steps, which will be discussed separately in this IP:
+
+- **Copy Data**: An
+  [FDW extension](https://www.postgresql.org/docs/current/postgres-fdw.html)
+  is used to connect the API database to the upstream (catalog) database. The
+  entire contents of the upstream media table are copied into a new temp
+  table in the API database. This temp table will later replace the main
+  media table in the API.
+- **Clean Data**: Data cleanup, which includes removing denylisted tags and
+  cleaning URLs. This step will become unnecessary when the
+  [Catalog Data Cleaning project](https://github.com/WordPress/openverse/issues/430)
+  is completed, which moves all data cleaning steps upstream into the initial
+  provider ingestion process.
+- **Create Index**: Create a new Elasticsearch index, matching the
+  configuration of the existing media index.
+- **Distributed Reindex**: Convert each record from the new temp table to the
+  format required by an Elasticsearch document, and then reindex the
+  resulting documents into the newly created index.
+  Because this step is by far the most time consuming, the reindex is
+  distributed across multiple **indexer workers**. In the current
+  implementation, the indexer workers are themselves running on EC2 instances
+  (6 in production, 2 in staging). Each indexer worker serves a small API
+  which can be used to trigger a reindexing task for an equal portion of the
+  work. The workers are started by the ingestion server when needed, and
+  automatically spun down when complete.
+- **Create and Populate Filtered Index**: Create a new Elasticsearch index
+  matching the configuration of the existing _filtered_ index, and then
+  reindex documents into it from the new media index, applying appropriate
+  filtering for sensitive terms.
+- **Reapply Constraints**: Recreate indices and constraints from the original
+  API table on the new temp table.
+- **Promote Table**: Drop the old media table in the API and rename the temp
+  table and its indices, effectively promoting them to replace the old table.
+- **Promote Index**: Promote the new Elasticsearch index by unlinking the
+  given alias from the existing index and moving it to the new one. (Used for
+  both the main and filtered indices.)
+- **Delete Index**: Delete the old Elasticsearch index. (Used for both the
+  main and filtered indices.)
+
+In addition to the data refresh DAGs, some combination of these steps on the
+ingestion server is also used by:
+
+- `recreate_full_staging_index` DAG
+- `create_filtered__index` DAGs
+- Our `load_sample_data` scripts, which use the ingestion server to load
+  sample data from the catalog into the API and Elasticsearch
+
+This IP includes details for moving all steps into Airflow. The bulk of the
+plan deals with the `Distributed Reindex` step, which is the only step that
+is not straightforward to move.
+
+## Expected Outcomes
+
+When this work is completed we will be able to:
+
+- Manage data refreshes entirely in Airflow, without the use of the ingestion
+  server
+- Completely remove the staging and production ingestion servers, as well as
+  all related code in the monorepo and infrastructure repos
+- Remove the `recreate_full_staging_index` DAG
+
+## Approach to the Distributed Reindex
+
+The majority of the data refresh process can be implemented in Airflow with
+only slight refactoring, using our existing
+[Postgres hooks/operators](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/sql.py)
+and reusable
+[Elasticsearch tasks](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/elasticsearch.py).
+The exception is the _distributed reindex_.
+
+The simplest approach would be to remove the indexer workers entirely in
+favor of doing the reindexing in parallelized dynamic tasks in Airflow.
+However, the indexer workers and catalog EC2 instances are currently the same
+size (m5.xlarge), and each of the six production workers was observed to use
+up to ~25% CPU and ~52% memory utilization during reindexing. The expense of
+permanently increasing resources on the catalog instance to support
+reindexing (which may happen concurrently with popularity refreshes,
+ingestion, and all other processes that occur on the catalog) would
+significantly exceed the cost of maintaining remote workers that can be run
+only when needed.
+
+Two other options were evaluated for this IP:
+
+- Keep the EC2 instances as they are now, but connect to them directly from
+  Airflow.
+  Rather than managing the 8 total EC2 instances directly, we will instead
+  set up an EC2 Auto Scaling group for each environment, with an initial
+  desired capacity of 0. The data refresh DAGs will spin up instances by
+  increasing this desired capacity. (More details follow later in this
+  document.)
+- Remove the EC2 instances entirely in favor of a new ECS task definition
+  using the `indexer-worker` image, which would remove all API code and
+  contain only the reindexing script. Airflow would spin up the appropriate
+  number of ECS tasks when needed.
+
+### Comparison of EC2 and ECS
+
+The following areas of comparison were considered when choosing between the
+two approaches:
+
+#### Cost
+
+Maintaining the
+[EC2 worker instances](https://calculator.aws/#/createCalculator/ec2-enhancement),
+provided we continue to automatically stop the instances when not in use,
+would be cheaper than using
+[ECS](https://calculator.aws/#/createCalculator/Fargate), given equivalent
+resources.
+
+Since our existing indexer workers do not use all of the resources available,
+we can also consider reducing resources for the workers. For EC2, we are
+already using the cheapest possible offering that will comfortably
+accommodate our memory consumption. Since ECS is more flexible and allows
+scaling vCPU and RAM independently, it is possible that ECS could be cheaper
+than EC2 with careful adjustment of resources[^1].
+
+Notably, both solutions are _significantly_ cheaper than the current
+implementation, because the bulk of the cost comes from constantly running
+the two m5.xlarge EC2 instances for the ingestion servers.
+
+#### Ease of infrastructure development
+
+EC2 is also likely the easier solution to implement, because it requires so
+few changes to the indexer workers as they are currently configured. For ECS,
+less total infrastructure scaffolding is required, but it is a greater
+departure from the current setup.
+
+#### Code maintainability
+
+The ECS approach requires slightly less code, because it would allow us to
+completely remove all of the server logic needed in the EC2 instances. The
+Docker image need only contain the script for reindexing.
+
+For the EC2 instances we would continue to serve a minimal API with the
+following endpoints:
+
+- `healthcheck`
+- `reindexing_task` - Trigger the reindexing task for the given parameters
+  (i.e., run the script)
+- `task/{task_id}` - Get the status of the running task
+
+That being said, the API needed for the indexer worker is simple and small.
+The vast majority of the complexity in the data refresh is in the _ingestion
+server's_ API, all of which would still be removed. (A rough sketch of this
+minimal API follows the Conclusion below.)
+
+#### Supporting local testing & development
+
+This is the most significant area in which EC2 is preferable to the ECS
+approach. With the EC2 approach, assigning work to the workers requires
+starting the worker instances and then triggering the reindexing task on each
+worker by POSTing to its `reindexing_task` endpoint. Locally, we would simply
+POST to our local Docker container rather than an EC2 instance. Minimal work
+is needed to adapt the process to work in a local environment, and the DAG
+when run locally is as close as possible to its production counterpart.
+
+For the ECS approach, in production we must spin up ECS tasks using Airflow's
+[EcsRunTaskOperator](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/ecs.html#run-a-task-definition).
+Simulating this in a local development environment without actually
+connecting to AWS is less straightforward, and requires supporting a separate
+workflow for local environments. For example, we could have a branch operator
+that checks the environment and proceeds to the EcsRunTaskOperator in
+production to spin up ECS tasks, or a
+[DockerOperator](https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html)
+in local development to spin up a worker as a local Docker container.
+
+This is still relatively complex, however, because the worker Docker
+containers must be launched from Airflow, which is itself dockerized. This
+requires either using a Docker-in-Docker approach, which should typically be
+avoided, or modifying the configuration to allow Airflow to launch sibling
+containers[^2]. Requirements differ across operating systems, yet this
+**must** be tested and maintained across platforms because the data refresh
+is used as a critical step in setting up our local development environment.
+
+#### Support for progress tracking
+
+With the EC2 approach it is trivial to calculate the `percentage_complete`
+for the reindexing task, and report it to Airflow via the task status
+endpoint. Airflow's ECS Operators do not have out-of-the-box support for
+streaming the CloudWatch logs from the task while it is running, so we would
+have less insight into the status of the task from Airflow.
+
+#### Impact on Deployments
+
+This is a great strength of both approaches, each of which would eliminate
+the need for deployments when changes are made to the data refresh code
+(including changes to the indexer workers).
+
+Each time an ECS task is spun up on Fargate, it will pull the Docker image
+configured in the task definition. By using the `latest` tag in this
+configuration we can ensure that the latest indexer-worker image is pulled
+each time, which means that any changes to the reindexing code will become
+live in production as soon as the new Docker image is published after merging
+on `main`.
+
+The EC2 approach uses an ASG to achieve a similar result. Because the ASG
+will actually terminate (rather than stop) the instances when their work is
+complete and start _new_ instances for each data refresh, we can pull the
+Docker image with the `latest` tag in the `user_data` script so that the
+latest Docker image is pulled each time a new data refresh starts. By using
+AWS Systems Manager parameters instead of AMI IDs in the launch template, we
+can even use the ASG to automatically use new AMI IDs without needing to
+deploy a new launch template each time a system dependency is updated.
+
+### Conclusion
+
+Both approaches allow us to eliminate the need for deployments in almost all
+cases, except for when changes are made to the task definition or launch
+template, respectively.
+
+The main advantage of the ECS approach is that it allows us to remove 100% of
+the server management code. However, it is more difficult to implement,
+likely more expensive, and requires a special secondary implementation for
+running locally that may be prone to cross-platform issues. The EC2 approach,
+on the other hand, is cheaper, quicker to implement, and gets us the vast
+majority of what we want in terms of removing complexity. Consequently I
+argue here that the EC2 approach is the one we should pursue.
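+
+To make the retained surface area concrete, the sketch below outlines the
+minimal worker API described under "Code maintainability" above, including
+the `percentage_complete` reporting used for progress tracking. Flask is used
+purely for illustration, and the task bookkeeping, threading model, and
+payload shapes are all assumptions; only the endpoint names come from this
+plan.
+
+```python
+# Minimal sketch of the retained indexer worker API; all implementation
+# details here are illustrative assumptions, not the final worker.
+import uuid
+from threading import Thread
+
+from flask import Flask, jsonify, request
+
+app = Flask(__name__)
+# task_id -> {"percentage_complete": float, "finished": bool}
+tasks: dict[str, dict] = {}
+
+
+def reindex(task_id: str, start_index: int, end_index: int) -> None:
+    # Placeholder for the reindexing loop refactored out of the ingestion
+    # server's indexer.py.
+    total = max(end_index - start_index, 1)
+    for offset in range(total):
+        # ...convert a batch of records and bulk-index it into Elasticsearch...
+        tasks[task_id]["percentage_complete"] = (offset + 1) / total * 100
+    tasks[task_id]["finished"] = True
+
+
+@app.route("/healthcheck")
+def healthcheck():
+    return jsonify({"status": "OK"})
+
+
+@app.route("/reindexing_task", methods=["POST"])
+def reindexing_task():
+    # Start a reindex over the given range of records and return a task id
+    # that Airflow can poll.
+    body = request.get_json()
+    task_id = str(uuid.uuid4())
+    tasks[task_id] = {"percentage_complete": 0.0, "finished": False}
+    Thread(
+        target=reindex, args=(task_id, body["start_index"], body["end_index"])
+    ).start()
+    return jsonify({"task_id": task_id}), 202
+
+
+@app.route("/task/<task_id>")
+def task_status(task_id: str):
+    # Airflow polls this endpoint to track the progress of the task.
+    return jsonify(tasks[task_id])
+```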
+
+## Step-by-step plan
+
+Because the data refresh process is critical, we will not remove the
+ingestion server or the existing data refresh DAGs until the new process has
+been performed successfully in staging and production. The new data refresh
+will be developed as separate DAGs alongside the current ones.
+
+1. Create a new data refresh factory to generate new data refresh DAGs for
+   each media type for staging and production, with the
+   `__data_refresh` DAG id. For simplicity of code review, these initial
+   DAGs should only perform the `Copy Data` and `Create Index` steps, which
+   we will perform entirely in Airflow.
+1. Create a new `catalog-indexer-worker` in the Catalog, and build the new
+   indexer worker image locally.
+1. Add the distributed reindexing step to the DAG (excluding infrastructure
+   work).
+1. Set up the necessary resources for the ASGs for staging and production in
+   the catalog Terraform configuration.
+1. Add all remaining steps to the data refresh DAGs:
+   `Create and Populate Filtered Index`, `Reapply Constraints`,
+   `Promote Table`, `Promote Index`, `Delete Index`.
+1. Deploy the catalog and the new indexer worker configuration.
+1. Run the staging audio data refresh, followed by the staging image data
+   refresh.
+1. Once successful, run the production audio data refresh and image data
+   refresh.
+1. Update the `create_filtered__index` DAGs to remove use of the ingestion
+   server.
+1. Drop the `recreate_full_staging_index` DAG, which can be removed entirely
+   in favor of the staging data refresh.
+1. Update the `load_sample_data.sh` script to run the DAG instead of using
+   the ingestion server.
+1. Fully remove the ingestion server and related infrastructure.
+
+There are a few opportunities for multiple streams of work. Updating the
+`create_filtered__index` DAGs can happen at any time.
+
+## Step details
+
+While this may look daunting, it should be noted that with very few
+exceptions the work described below is refactoring of existing logic. Links
+to the source files are included for convenience.
+
+### Create the new data refresh DAG factory and move initial steps into Airflow
+
+In this step, we'll create a new data refresh DAG factory to generate data
+refresh DAGs for each existing media_type and environment. Currently these
+four will be generated:
+
+- staging_audio_data_refresh
+- staging_image_data_refresh
+- production_audio_data_refresh
+- production_image_data_refresh
+
+Because the `environment` is added as a prefix, there will be no collision
+with the existing DAG ids. In this initial step, we will add only a small
+portion of the logic in order to make the PR easier to review. The first
+steps are already implemented in the current data refresh and can simply be
+copied:
+
+- Get the current record count from the target API table; this must be
+  modified to take the `environment` as an argument
+- Perform concurrency checks on the other data refreshes and conflicting
+  DAGs; this must be modified to include the now larger list of data refresh
+  DAG ids
+- Get the name of the Elasticsearch index currently mapped to the
+  `target_alias`
+- Generate the new index suffix
+
+We will include new tasks to perform the initial few steps of the ingestion
+server's work:
+
+- Copy Data: this should be a TaskGroup that will have multiple tasks for
+  creating the FDW from the upstream DB to the downstream DB, running the
+  copy_data query, and so on.
+  It should fully replace the implementation of
+  [`refresh_api_table`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/ingest.py#L248)
+  in the ingestion server. All steps in this section are SQL queries that can
+  be implemented using the existing
+  [PostgresHook and PGExecuteQueryOperator](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/sql.py).
+- Create Index: we can use our existing
+  [Elasticsearch tasks](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/catalog/dags/common/elasticsearch.py#L82)
+  to create the new Elasticsearch index with the index suffix generated in
+  the previous task.
+
+### Implement new catalog indexer worker
+
+In this step we will create the new catalog-indexer-worker Docker image. This
+step does not include adding the orchestration steps to the DAG, or the
+infrastructure work to actually create the ASGs.
+
+First, we will create a new `indexer-worker` directory under
+`catalog/dags/data_refresh`, which will contain the contents of the new
+indexer worker. **This implementation already exists in the ingestion server.
+The relevant pieces can be pulled out and refactored slightly to fit the new,
+much smaller image.** Broadly, this is the mapping of existing files to new
+files needed:
+
+- `api.py` will define the API for the worker, refactored from the existing
+  [`indexer_worker.py`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/indexer_worker.py).
+  It must be extended to add task state and a `task_status` endpoint, which
+  takes a `task_id` and returns the status and progress of the given task.
+- `indexer.py` will contain the logic for the actual indexing task. It will
+  be refactored from the existing
+  [`indexer.py`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/indexer.py);
+  specifically, all we need is the
+  [`replicate`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/indexer.py#L157)
+  function.
+- `elasticsearch_models.py`, pulled from the file of the
+  [same name](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/ingestion_server/ingestion_server/elasticsearch_models.py)
+  in the ingestion server. Defines a mapping from a database record to an
+  Elasticsearch document.
+- Utility files for helper functions for connecting to Elasticsearch and
+  Postgres (e.g.
+  [`es_helpers.py`](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/es_helpers.py))
+
+The
+[`Dockerfile`](https://github.com/WordPress/openverse/blob/main/ingestion_server/Dockerfile)
+can be copied from the existing ingestion server. It should be updated to
+reference the new file structure, and to expose only a single port, which
+should be distinguished from the ports currently in use by the ingestion
+server (8001 and 8002). Other necessary files, including `env.docker`,
+`.dockerignore`, `Pipfile`, and `gunicorn.conf.py`, can all be copied in from
+the existing ingestion server as well.
+
+Finally, we will update the monorepo's root
+[`docker-compose.yml`](https://github.com/WordPress/openverse/blob/05ff48d05f2163104151c5589cf352a156bc6a97/docker-compose.yml)
+to add a new `catalog-indexer-worker` service.
+Its build context should point to the nested `data_refresh/indexer_worker`
+directory, and it should map the exposed port to enable the API to be reached
+by the catalog.
+
+When this work is complete, it should be possible to run `just catalog/shell`
+and curl the new indexer worker. The existing ingestion-server and
+indexer-worker services are unaffected (it is still possible to run legacy
+data refreshes locally and in production).
+
+### Implement distributed reindexing locally
+
+In this step we will add tasks to the data refresh DAGs to orchestrate the
+distributed reindex. At the end of this step, it will be possible to run a
+distributed reindex _locally_, but because the infrastructure work to create
+the ASGs is not complete, it cannot be run on production yet. The following
+code can all be refactored from
+[`distributed_reindex_scheduler.py`](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/distributed_reindex_scheduler.py);
+a rough sketch of the orchestration follows the list below.
+
+- Use
+  [`describe_auto_scaling_groups`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html)
+  and filter by tags to select the appropriate ASG for the desired
+  environment. (Skips in local env.)
+- Use
+  [`set_desired_capacity`](https://boto3.amazonaws.com/v1/documentation/api/1.26.86/reference/services/autoscaling/client/set_desired_capacity.html)
+  to increase the desired capacity of the ASG to the desired number of
+  workers, depending on the environment. This will cause the ASG to begin
+  spinning up instances. (Skips in local env.)
+- Use
+  [`describe_auto_scaling_groups`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html)
+  to poll the ASG until all instances have been started, and get the EC2
+  instance IDs. (Skips in local env.)
+- Use dynamic task mapping to distribute reindexing across the indexer
+  workers by first calculating `start` and `end` indices that will split the
+  records in the media table into even portions, depending on the number of
+  workers available in the given environment. Then:
+  - POST the `start_index` and `end_index` each worker should handle to its
+    `reindexing_task` endpoint
+  - Use a Sensor to ping the worker's `task/{task_id}` endpoint until the
+    task is complete, logging the progress as it goes
+- Use
+  [`terminate_instance_in_auto_scaling_group`](https://boto3.amazonaws.com/v1/documentation/api/1.26.86/reference/services/autoscaling/client/terminate_instance_in_auto_scaling_group.html)
+  to terminate each instance once its work is complete. Make sure to set
+  `ShouldDecrementDesiredCapacity` to `True` to ensure that the ASG does not
+  try to replace the instance. This task should use the
+  [`NONE_SKIPPED` TriggerRule](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules)
+  to ensure that the instances are terminated, even if there are upstream
+  failures. (Skips in local env.)
+- Finally, after all tasks have finished (regardless of success/failure), we
+  should have a cleanup task that calls `set_desired_capacity` with a
+  capacity of 0. Generally this should be a no-op, but if an instance crashes
+  during reindexing (rather than simply failing) the ASG will spin up a
+  replacement and Airflow will not automatically clean it up. This task
+  ensures that any dangling instances are terminated.
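+
+As referenced above, here is a rough sketch of the ASG interactions and work
+partitioning. The boto3 calls are the ones linked in the list, but the tag
+key, pool-name values, and helper structure are assumptions rather than the
+final implementation:
+
+```python
+# Sketch of the ASG orchestration for the distributed reindex; the "Pool"
+# tag key and pool names are assumptions, not the final values.
+import boto3
+
+autoscaling = boto3.client("autoscaling")
+
+
+def get_asg_name(environment: str) -> str:
+    # Select the ASG for the target environment by its -pool tag.
+    paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
+    for page in paginator.paginate():
+        for asg in page["AutoScalingGroups"]:
+            tags = {tag["Key"]: tag["Value"] for tag in asg["Tags"]}
+            if tags.get("Pool") == f"{environment}-indexer-worker-pool":
+                return asg["AutoScalingGroupName"]
+    raise ValueError(f"No indexer worker ASG found for {environment}")
+
+
+def calculate_worker_ranges(record_count: int, worker_count: int) -> list[dict]:
+    # Split the media table into even, contiguous portions of work. These
+    # ranges are POSTed to each worker's `reindexing_task` endpoint via
+    # dynamic task mapping.
+    portion = record_count // worker_count
+    return [
+        {
+            "start_index": i * portion,
+            # The last worker picks up any remainder.
+            "end_index": record_count
+            if i == worker_count - 1
+            else (i + 1) * portion,
+        }
+        for i in range(worker_count)
+    ]
+
+
+def scale_up(asg_name: str, worker_count: int) -> None:
+    # Causes the ASG to begin spinning up worker instances.
+    autoscaling.set_desired_capacity(
+        AutoScalingGroupName=asg_name, DesiredCapacity=worker_count
+    )
+
+
+def terminate_worker(instance_id: str) -> None:
+    # Decrement desired capacity so the ASG does not replace the instance.
+    autoscaling.terminate_instance_in_auto_scaling_group(
+        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
+    )
+```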
+
+```{note}
+It is not possible to retry a single indexer worker with this setup, because
+once a worker fails, the instance is actually terminated (rather than simply
+stopped). If a task that triggers a reindex is cleared after an instance has
+been terminated, it will simply fail. The entire reindex must be restarted
+from the first step in this task group.
+
+However, there is a valuable tradeoff to this approach: it ensures that all
+of the indexer workers in a data refresh are identical, while still allowing
+us to avoid manual deployments every time the indexer logic changes. For
+example, imagine some changes to the reindexing logic are merged to `main`
+while a data refresh is actively underway, and a new Docker image is
+published. If one indexer worker failed, and it were possible to retry
+**just** that indexer worker, it would use the new Docker image -- leading to
+inconsistency in the behavior of different workers within a single data
+refresh.
+
+In future iterations this may also be solved by using AMIs with the Docker
+image baked in, and then preventing launch template version bumps while a
+data refresh is running.
+```
+
+### Create the Terraform and Ansible resources needed to deploy the new indexer workers
+
+In this step we will add the resources needed to actually configure the ASGs.
+
+Some important notes:
+
+- Currently, the staging and production workers are split into separate
+  environments (i.e., a staging deploy is used to deploy the 2 staging
+  workers separately from the 6 production workers). It is more accurate to
+  view all 8 workers as production instances (i.e., part of the production
+  catalog deployment), which merely _operate_ on different environments. As
+  such, all 8 should be part of the production deployment, split into two
+  separate ASGs which are given a "staging-indexer-worker-pool" or
+  "production-indexer-worker-pool" tag, respectively, to indicate their
+  intended environment.
+- The playbooks must also be updated to check whether any of the four _new_
+  data refresh DAGs are running before deploying.
+- The `user_data` script should be updated to pull the Docker image with the
+  `latest` tag.
+
+### Add remaining steps to the Data Refresh DAGs
+
+The final steps can now be added to the DAGs:
+
+- `Create and Populate Filtered Index`: this should be implemented as a
+  reusable TaskGroup. This work can already be implemented using our existing
+  Elasticsearch tasks and replaces
+  [this function](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/indexer.py#L437).
+- `Reapply Constraints` and `Promote Table`: these SQL queries can be
+  performed with the PostgresHook, and replace
+  [this function](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/ingest.py#L340).
+  (A rough sketch follows below.)
+- `Promote Index` and `Delete Index`: these can be implemented using our
+  existing Elasticsearch tasks, and replace
+  [these](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/indexer.py#L311)
+  [functions](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/ingestion_server/ingestion_server/indexer.py#L377).
+
+### Deploy the catalog and indexer workers
+
+In this step we'll actually deploy the new workers using the Ansible
+playbooks. The existing ingestion server and indexer workers remain
+untouched, so the legacy data refresh can continue to run.
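+
+As referenced above, here is a minimal sketch of how the `Promote Table`
+queries might be issued from an Airflow task. The connection id and table
+names are hypothetical; the real queries (including the renaming of indices
+and constraints) are the ones refactored from the ingestion server's
+`ingest.py`.
+
+```python
+# Sketch only: the connection id and table naming are assumptions.
+from airflow.decorators import task
+from airflow.providers.postgres.hooks.postgres import PostgresHook
+
+API_CONN_ID = "postgres_api"  # hypothetical connection id
+
+
+@task
+def promote_table(media_type: str):
+    # Replace the live media table with the refreshed temp table; the hook
+    # runs both statements in a single session.
+    hook = PostgresHook(postgres_conn_id=API_CONN_ID)
+    temp_table = f"temp_import_{media_type}"
+    hook.run(
+        [
+            f"DROP TABLE IF EXISTS {media_type} CASCADE;",
+            f"ALTER TABLE {temp_table} RENAME TO {media_type};",
+        ]
+    )
+```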
+
+### Run the data refreshes
+
+The next step is to run the `staging_` data refreshes, which act on the
+staging API DB and Elasticsearch cluster and are therefore lower risk. This
+will be the first test of the new process. Once successful, we can run the
+`production_` refreshes.
+
+### Update `create_filtered__index` DAGs
+
+These DAGs currently use the ingestion server to perform the
+`create_filtered_index` steps in isolation. We can update these to use the
+reusable TaskGroup implemented in an earlier step.
+
+### Remove the `recreate_full_staging_index` DAG
+
+This DAG performs the second half of a data refresh (the distributed reindex
+and promotion) in staging. Once the data refresh has been moved to Airflow,
+we can add a
+[DAG param](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/params.html)
+to the data refresh that allows skipping the initial `Copy Data` steps
+(sketched below). This DAG will no longer be necessary and can then be
+deleted.
+
+### Update the load_sample_data scripts
+
+Our `load_sample_data.sh` scripts currently use `just` commands to
+[run parts of the data refresh](https://github.com/WordPress/openverse/blob/226aed0890a19ace1e0b54c1e784c86e9f26b4cb/load_sample_data.sh#L98)
+on the local ingestion server, as part of setting up the local development
+environment. We should update the scripts to instead use the Airflow CLI to
+unpause the two data refresh DAGs and await their completion.
+
+### Remove the ingestion server
+
+Once the new data refresh has been run successfully in production, we can
+finally remove all ingestion server code from the monorepo and the
+infrastructure repo, and completely remove the associated EC2 instances.
+
+### Infrastructure
+
+Infrastructure is a large part of this project. We will be adding 8 new EC2
+instances but deprovisioning 10 (the 6 production and 2 staging indexer
+workers, plus the 2 ingestion servers), as described above. There will be
+significant cost benefits to removing the two EC2 instances that are
+constantly running.
+
+### Tools & packages
+
+No new tools or packages are required.
+
+### Other projects or work
+
+- [Deploying Airflow with Ansible](https://github.com/WordPress/openverse-infrastructure/pull/829),
+  part of the
+  [Move Airflow to openverse.org](https://github.com/WordPress/openverse-infrastructure/milestone/6)
+  Project.
+- [Catalog Data Cleaning project](https://github.com/WordPress/openverse/issues/430),
+  which removes the Clean Data steps from the data refresh
+- [Switch Python Package Management Away from Pipenv](https://github.com/WordPress/openverse/issues/286)
+
+## Alternatives
+
+The alternative options of using an ECS approach or performing the reindex
+entirely in Airflow are discussed at length in the
+[Approach to the Distributed Reindex](#approach-to-the-distributed-reindex)
+section.
+
+It is also possible to use EC2 instances but manage them directly in Airflow,
+using EC2 operators to start and stop the instances as needed. However, this
+approach requires more infrastructure work, and we would require deployments
+whenever there are code changes in the indexer workers.
+
+## Blockers
+
+Development can start immediately, but the infrastructure components of this
+project are blocked by completion of the effort to move Airflow to
+openverse.org, specifically
+[this issue](https://github.com/WordPress/openverse-infrastructure/issues/776)
+to run a (legacy) data refresh on the new instance.
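+
+Relatedly, here is a minimal sketch of the DAG param mentioned above for
+replacing the `recreate_full_staging_index` DAG. The param name, task ids,
+and branching structure are all hypothetical:
+
+```python
+# Sketch only: the param name and task ids are assumptions.
+from airflow.decorators import task
+from airflow.models.param import Param
+
+# Declared in the data refresh DAG's `params`.
+params = {
+    "skip_copy_data": Param(
+        default=False,
+        type="boolean",
+        description="Skip the Copy Data steps and reindex the existing table.",
+    ),
+}
+
+
+@task.branch
+def skip_copy_data_if_requested(params: dict = None):
+    # Route around the Copy Data TaskGroup when the param is set, jumping
+    # straight to index creation.
+    return "create_index" if params["skip_copy_data"] else "copy_data"
+```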
+
+## Rollback
+
+We will not remove the ingestion server or the existing data refresh DAGs
+until the new DAGs have been run successfully against both staging and
+production. We can roll back at any time by simply removing the new instances
+and DAGs.
+
+## Risks
+
+There is minimal risk, as we will be able to test against production data in
+staging before running against production. However, the data refresh does
+completely replace the production media tables and Elasticsearch indices, so
+there is always inherent risk that production data could be lost or malformed
+if something goes wrong. We will ensure production backups are available and
+monitor the first production data refreshes closely.
+
+## Prior art
+
+- [Search relevancy sandbox](https://github.com/WordPress/openverse/issues/392)
+
+[^1]:
+    See
+    [comment](https://github.com/WordPress/openverse/pull/4026#pullrequestreview-1978477921)
+    for a specific cost assessment, courtesy of @sarayourfriend
+
+[^2]:
+    https://towardsdatascience.com/using-apache-airflow-dockeroperator-with-docker-compose-57d0217c8219