A reindex operation runs the source data from the adapters through the pipeline, causing it to be re-transformed, matched and merged as appropriate.
To run a reindex, follow these steps:
Copy one of the per-pipeline folders in pipeline/terraform (these are labelled with the date of the pipeline).
Rename the new folder with the date of your pipeline (usually the current date).
Update the reindexing_state variables in main.tf. Set them all to true if you're about to do a complete reindex, as this adds extra capacity and scaling to the pipeline.
NOTE: once the reindexing of the new pipeline has completed, change true to false, then terraform apply the changes to scale the ES clusters/services back down.
Remember to create a pull request with this change.
You can now run terraform inside the folder you've created:
# Use the run_terraform.sh script to get Elastic Cloud credentials
> ./run_terraform.sh plan
...
Plan: 389 to add, 0 to change, 0 to destroy.
# Before applying review the plan operation and verify it makes sense
> ./run_terraform.sh apply
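The folder-copy step described above can be sketched in Python. This is a minimal illustration, not part of the repository: the helper names latest_pipeline_dir and copy_pipeline_dir are hypothetical, and it assumes the per-pipeline folders are named with ISO dates (YYYY-MM-DD) and that any local .terraform working directory should not be carried over.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def latest_pipeline_dir(terraform_root: Path) -> Path:
    """Return the most recent folder whose name parses as an ISO date."""
    dated = []
    for child in terraform_root.iterdir():
        if not child.is_dir():
            continue
        try:
            dated.append((date.fromisoformat(child.name), child))
        except ValueError:
            continue  # not a dated pipeline folder, so ignore it
    if not dated:
        raise FileNotFoundError(f"no dated pipeline folders in {terraform_root}")
    return max(dated, key=lambda pair: pair[0])[1]


def copy_pipeline_dir(terraform_root: Path, new_date: date) -> Path:
    """Copy the latest dated folder to one named after new_date."""
    source = latest_pipeline_dir(terraform_root)
    target = terraform_root / new_date.isoformat()
    # Assumption: the local .terraform working directory is not wanted in the copy.
    # copytree raises FileExistsError if the target folder already exists.
    shutil.copytree(source, target, ignore=shutil.ignore_patterns(".terraform"))
    return target


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "2021-06-15").mkdir()
        (root / "2021-06-15" / "main.tf").write_text("# pipeline config\n")
        print(copy_pipeline_dir(root, date.today()).name)
```

After copying, you would still edit main.tf by hand as described in the steps above.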
Now that our pipeline is connected for reindexing and running our chosen version of the pipeline code, we can start a reindex operation.
The reindex script can be found at ./reindexer/start_reindex.py. Run it from inside the reindexer folder:
> python3 ./start_reindex.py
Which source do you want to reindex? (all, miro, sierra, mets, calm): all
Which pipeline are you sending this to? (catalogue, catalogue_miro_updates, reporting): catalogue
Every record (complete), just a few (partial), or specific records (specific)? (complete, partial, specific): partial
A partial reindex sends only a few records (10 by default), so you can verify that a pipeline is functioning as expected without incurring the cost of a full reindex.
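To make the three modes concrete, here is a small sketch of how a selection step could behave. It is illustrative only and does not reproduce how start_reindex.py is implemented; select_records, DEFAULT_PARTIAL_SIZE and the record IDs are hypothetical names chosen for this example.

```python
from itertools import islice
from typing import Iterable, Iterator

# Mirrors the "10 by default" described above; the constant name is an assumption.
DEFAULT_PARTIAL_SIZE = 10


def select_records(
    record_ids: Iterable[str],
    mode: str,
    max_records: int = DEFAULT_PARTIAL_SIZE,
    specific_ids: Iterable[str] = (),
) -> Iterator[str]:
    """Return the records a complete, partial or specific reindex would send."""
    if mode == "complete":
        return iter(record_ids)
    if mode == "partial":
        # Only the first max_records are sent, which keeps the run cheap.
        return islice(record_ids, max_records)
    if mode == "specific":
        wanted = set(specific_ids)
        return (record_id for record_id in record_ids if record_id in wanted)
    raise ValueError(f"unknown reindex mode: {mode!r}")


if __name__ == "__main__":
    ids = [f"b{n}" for n in range(25)]
    print(len(list(select_records(ids, "partial"))))
```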
You can monitor a reindex in progress by looking at CloudWatch metrics in the platform AWS account.
Non-empty DLQs will be reported in the Wellcome #wc-platform-alerts Slack channel.
You can also monitor a reindex using the ./reindexer/get_reindex_status.py script, specifying the ID of the pipeline you wish to check (its date):
> python3 get_reindex_status.py YYYY-MM-DD
*** Source tables ***
...
sierra 2,168,470
TOTAL 3,243,477
*** Work index stats ***
source records 3,243,477
...
works-indexed 3,243,477
API 3,242,763 ▼ 714
*** Image index stats ***
...
images-indexed 144,981
API 144,981
Approximately 99% of records have been reindexed successfully (counts may not be exact)
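The arithmetic behind the summary figures can be checked by hand. The sketch below is an assumption about how such numbers might be derived, not the script's actual logic; parse_count and reindex_progress are hypothetical helpers. It compares the source record total with the count visible in the API and reports the shortfall and a whole-number percentage.

```python
from typing import Tuple


def parse_count(text: str) -> int:
    """Convert a comma-grouped count such as '3,243,477' to an int."""
    return int(text.replace(",", ""))


def reindex_progress(source_records: int, in_api: int) -> Tuple[int, int]:
    """Return (percentage rounded down, number of records still missing)."""
    if source_records <= 0:
        raise ValueError("source_records must be positive")
    shortfall = source_records - in_api
    percent = (100 * in_api) // source_records
    return percent, shortfall


if __name__ == "__main__":
    source = parse_count("3,243,477")
    api = parse_count("3,242,763")
    percent, shortfall = reindex_progress(source, api)
    print(f"{percent}% reindexed, {shortfall} records behind")
```

With the work counts shown above, the shortfall is 714 records, matching the ▼ 714 in the output, and the percentage rounds down to 99.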
A reindex should take a few hours to complete.
When you have a complete, successful reindex, you will want to present it via the Catalogue API.
A new index can be referenced by updating the ElasticConfig object. The indexDate should be the one used to reference the deployment and terraformed pipeline:
object ElasticConfig {
  // We use this to share config across API applications,
  // i.e. the API and the snapshot generator.
  val indexDate = "YYYY-MM-DD"

  def apply(): ElasticConfig =
    ElasticConfig(
      worksIndex = Index(s"works-indexed-$indexDate"),
      imagesIndex = Index(s"images-indexed-$indexDate")
    )
}
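Before editing the config, it can help to confirm which index names a given indexDate will resolve to. The following sketch mirrors the naming pattern in the Scala object above; the index_names helper is hypothetical, and the strict YYYY-MM-DD check is an assumption about the expected date format.

```python
import re
from datetime import date
from typing import Dict


def index_names(index_date: str) -> Dict[str, str]:
    """Derive the works and images index names for a pipeline date."""
    # Require the exact YYYY-MM-DD shape, then confirm it is a real calendar date.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", index_date):
        raise ValueError(f"expected YYYY-MM-DD, got {index_date!r}")
    date.fromisoformat(index_date)  # raises ValueError for e.g. month 13
    return {
        "worksIndex": f"works-indexed-{index_date}",
        "imagesIndex": f"images-indexed-{index_date}",
    }


if __name__ == "__main__":
    print(index_names("2021-06-15"))
```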
Open a PR for this change, deploy it through the API stage environment, and allow CI to perform the usual API checks.
Be sure to check the diff_tool output in CI before deploying to production (you can also run this manually).
After your PR is merged to main, visit the Buildkite job for catalogue-api to view the diff_tool output and access the "Deploy to prod" button.