From cab6cb9b93538905208fc786817174478d0f20f5 Mon Sep 17 00:00:00 2001
From: eecavanna <eecavanna@users.noreply.github.com>
Date: Mon, 11 Nov 2024 16:25:56 -0800
Subject: [PATCH 1/2] Implement the `11.0.3`-to-`11.1.0` migration notebook

---
 .../notebooks/.notebook.env.example           |   4 +
 .../notebooks/migrate_11_0_3_to_11_1_0.ipynb  | 828 ++++++++++++++++++
 2 files changed, 832 insertions(+)
 create mode 100644 demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb

diff --git a/demo/metadata_migration/notebooks/.notebook.env.example b/demo/metadata_migration/notebooks/.notebook.env.example
index 066392b6..91113f1c 100644
--- a/demo/metadata_migration/notebooks/.notebook.env.example
+++ b/demo/metadata_migration/notebooks/.notebook.env.example
@@ -8,6 +8,8 @@ PATH_TO_MONGORESTORE_BINARY = "__REPLACE_ME__"  # e.g. "/Users/Alice/Downloads/m
 PATH_TO_MONGOSH_BINARY = "__REPLACE_ME__"  # e.g. "/Users/Alice/Downloads/mongosh-1.10.6-darwin-x64/bin/mongosh"
 
 # Connection parameters for the Origin Mongo server (typically a remote serve).
+# Note: The Origin Mongo server is the one that contains the database you want
+#       to migrate. It is where the "E" and "L" in "ETL" will take place.
 ORIGIN_MONGO_HOST="__REPLACE_ME__"
 ORIGIN_MONGO_PORT="__REPLACE_ME__"
 ORIGIN_MONGO_USERNAME="__REPLACE_ME__"
@@ -15,6 +17,8 @@ ORIGIN_MONGO_PASSWORD="__REPLACE_ME__"
 ORIGIN_MONGO_DATABASE_NAME="__REPLACE_ME__"  # e.g. "nmdc"
 
 # Connection parameters for the Transformer Mongo server (typically a local server).
+# Note: The Transformer Mongo server is the one you want to use to perform the
+#       database transformations. It is where the "T" in "ETL" will take place.
 TRANSFORMER_MONGO_HOST="__REPLACE_ME__"
 TRANSFORMER_MONGO_PORT="__REPLACE_ME__"
 TRANSFORMER_MONGO_USERNAME="__REPLACE_ME__"
diff --git a/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb b/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb
new file mode 100644
index 00000000..5916ad47
--- /dev/null
+++ b/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb
@@ -0,0 +1,828 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "initial_id",
+   "metadata": {
+    "collapsed": true,
+    "jupyter": {
+     "outputs_hidden": true
+    }
+   },
+   "source": [
+    "# Migrate MongoDB database from `nmdc-schema` `v11.0.3` to `v11.1.0`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3c31d85d",
+   "metadata": {},
+   "source": [
+    "## Introduction\n",
+    "\n",
+    "This notebook will be used to migrate the database from `nmdc-schema` `v11.0.3` ([released](https://github.com/microbiomedata/nmdc-schema/releases/tag/v11.0.3) October 17, 2024) to `v11.1.0`.\n",
+    "\n",
+    "### Heads up\n",
+    "\n",
+    "Unlike some previous migrators, this one does not \"pick and choose\" which collections it will dump. There are two reasons for this: (1) migrators no longer have a dedicated `self.agenda` dictionary that indicates all the collections involved in the migration; and (2) migrators can now create, rename, and drop collections; none of which are things that the old `self.agenda`-based system was designed to handle. So, instead of picking and choosing collections, this migrator **dumps them all.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f65ad4ab",
+   "metadata": {},
+   "source": [
+    "## Prerequisites"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "17f351e8",
+   "metadata": {},
+   "source": [
+    "### 1. Coordinate with stakeholders.\n",
+    "\n",
+    "We will be enacting full Runtime and Database downtime for this migration. Ensure stakeholders are aware of that."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "233a35c3",
+   "metadata": {},
+   "source": [
+    "### 2. Set up notebook environment.\n",
+    "\n",
+    "Here, you'll prepare an environment for running this notebook.\n",
+    "\n",
+    "1. Start a **MongoDB server** on your local machine (and ensure it does **not** already contain a database having the name specified in the notebook configuration file).\n",
+    "    1. You can start a [Docker](https://hub.docker.com/_/mongo)-based MongoDB server at `localhost:27055` by running the following command. A MongoDB server started this way will be accessible without a username or password.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8aee55e3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!docker run --rm --detach --name mongo-migration-transformer -p 27055:27017 mongo:6.0.4"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6cd05ccb",
+   "metadata": {},
+   "source": [
+    "2. Create and populate a **notebook configuration file** named `.notebook.env`.\n",
+    "   > You can use `.notebook.env.example` as a template."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69937b18",
+   "metadata": {},
+   "source": [
+    "## Procedure"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe81196a",
+   "metadata": {},
+   "source": [
+    "### Install Python packages\n",
+    "\n",
+    "In this step, you'll [install](https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/) the Python packages upon which this notebook depends.\n",
+    "\n",
+    "> Note: If the output of this cell says \"Note: you may need to restart the kernel to use updated packages\", restart the kernel (not the notebook cells), then proceed to the next cell.\n",
+    "\n",
+    "##### References\n",
+    "\n",
+    "| Description                                                                     | Link                                                   |\n",
+    "|---------------------------------------------------------------------------------|--------------------------------------------------------|\n",
+    "| NMDC Schema PyPI package | https://pypi.org/project/nmdc-schema                   |\n",
+    "| How to `pip install` from a Git branch<br>instead of PyPI                       | https://stackoverflow.com/a/20101940                   |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e25a0af308c3185b",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "%pip install --upgrade pip\n",
+    "%pip install -r requirements.txt\n",
+    "%pip install nmdc-schema==11.1.0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a407c354",
+   "metadata": {},
+   "source": [
+    "### Import Python dependencies\n",
+    "\n",
+    "Import the Python objects upon which this notebook depends.\n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description                            | Link                                                                                                  |\n",
+    "|----------------------------------------|-------------------------------------------------------------------------------------------------------|\n",
+    "| Dynamically importing a Python module  | [`importlib.import_module`](https://docs.python.org/3/library/importlib.html#importlib.import_module) |\n",
+    "| Confirming something is a Python class | [`inspect.isclass`](https://docs.python.org/3/library/inspect.html#inspect.isclass)                   |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e8a3ceb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MIGRATOR_MODULE_NAME = \"migrator_from_11_0_3_to_11_1_0\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbecd561",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Standard library packages:\n",
+    "import subprocess\n",
+    "from typing import List\n",
+    "import importlib\n",
+    "from inspect import isclass\n",
+    "\n",
+    "# Third-party packages:\n",
+    "import pymongo\n",
+    "from linkml.validator import Validator, ValidationReport\n",
+    "from linkml.validator.plugins import JsonschemaValidationPlugin\n",
+    "from nmdc_schema.nmdc_data import get_nmdc_schema_definition\n",
+    "from nmdc_schema.migrators.adapters.mongo_adapter import MongoAdapter\n",
+    "from linkml_runtime import SchemaView\n",
+    "\n",
+    "# First-party packages:\n",
+    "from helpers import Config, setup_logger, get_collection_names_from_schema, derive_schema_class_name_from_document\n",
+    "from bookkeeper import Bookkeeper, MigrationEvent\n",
+    "\n",
+    "# Dynamic imports:\n",
+    "migrator_module = importlib.import_module(MIGRATOR_MODULE_NAME, package=\"nmdc_schema.migrators\")\n",
+    "Migrator = getattr(migrator_module, \"Migrator\")  # gets the class\n",
+    "assert isclass(Migrator), \"Failed to import Migrator class.\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "99b20ff4",
+   "metadata": {},
+   "source": [
+    "### Parse configuration files\n",
+    "\n",
+    "Parse the notebook and Mongo configuration files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1eac645a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cfg = Config()\n",
+    "\n",
+    "# Define some aliases we can use to make the shell commands in this notebook easier to read.\n",
+    "mongodump = cfg.mongodump_path\n",
+    "mongorestore = cfg.mongorestore_path\n",
+    "mongosh = cfg.mongosh_path\n",
+    "\n",
+    "# Make the base CLI options for Mongo shell commands.\n",
+    "origin_mongo_cli_base_options = Config.make_mongo_cli_base_options(\n",
+    "    mongo_host=cfg.origin_mongo_host,\n",
+    "    mongo_port=cfg.origin_mongo_port,\n",
+    "    mongo_username=cfg.origin_mongo_username,\n",
+    "    mongo_password=cfg.origin_mongo_password,\n",
+    ")\n",
+    "transformer_mongo_cli_base_options = Config.make_mongo_cli_base_options(\n",
+    "    mongo_host=cfg.transformer_mongo_host,\n",
+    "    mongo_port=cfg.transformer_mongo_port,\n",
+    "    mongo_username=cfg.transformer_mongo_username,\n",
+    "    mongo_password=cfg.transformer_mongo_password,\n",
+    ")\n",
+    "\n",
+    "# Perform a sanity test of the application paths.\n",
+    "!{mongodump} --version\n",
+    "!{mongorestore} --version\n",
+    "!{mongosh} --version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "68245d2b",
+   "metadata": {},
+   "source": [
+    "### Create MongoDB clients\n",
+    "\n",
+    "Create MongoDB clients you can use to access the \"origin\" and \"transformer\" MongoDB servers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8e95f559",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Mongo client for \"origin\" MongoDB server.\n",
+    "origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_host, \n",
+    "                                          port=int(cfg.origin_mongo_port),\n",
+    "                                          username=cfg.origin_mongo_username,\n",
+    "                                          password=cfg.origin_mongo_password,\n",
+    "                                          directConnection=True)\n",
+    "\n",
+    "# Mongo client for \"transformer\" MongoDB server.\n",
+    "transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_host, \n",
+    "                                               port=int(cfg.transformer_mongo_port),\n",
+    "                                               username=cfg.transformer_mongo_username,\n",
+    "                                               password=cfg.transformer_mongo_password,\n",
+    "                                               directConnection=True)\n",
+    "\n",
+    "# Perform sanity tests of those MongoDB clients' abilities to access their respective MongoDB servers.\n",
+    "with pymongo.timeout(3):\n",
+    "    # Display the MongoDB server version (running on the \"origin\" Mongo server).\n",
+    "    print(\"Origin Mongo server version:      \" + origin_mongo_client.server_info()[\"version\"])\n",
+    "\n",
+    "    # Sanity test: Ensure the origin database exists.\n",
+    "    assert cfg.origin_mongo_database_name in origin_mongo_client.list_database_names(), \"Origin database does not exist.\"\n",
+    "\n",
+    "    # Display the MongoDB server version (running on the \"transformer\" Mongo server).\n",
+    "    print(\"Transformer Mongo server version: \" + transformer_mongo_client.server_info()[\"version\"])\n",
+    "\n",
+    "    # Sanity test: Ensure the transformation database does not exist.\n",
+    "    assert cfg.transformer_mongo_database_name not in transformer_mongo_client.list_database_names(), \"Transformation database already exists.\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e195db1",
+   "metadata": {},
+   "source": [
+    "Delete the transformer database from the transformer MongoDB server if that database already exists there (e.g. if it was left over from an experiment).\n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description                  | Link                                                          |\n",
+    "|------------------------------|---------------------------------------------------------------|\n",
+    "| Python's `subprocess` module | https://docs.python.org/3/library/subprocess.html             |\n",
+    "| `mongosh` CLI options        | https://www.mongodb.com/docs/mongodb-shell/reference/options/ |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8939a2ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Note: I run this command via Python's `subprocess` module instead of via an IPython magic `!` command\n",
+    "#       because I expect to eventually use regular Python scripts—not Python notebooks—for migrations.\n",
+    "shell_command = f\"\"\"\n",
+    "  {cfg.mongosh_path} {transformer_mongo_cli_base_options} \\\n",
+    "      --eval 'use {cfg.transformer_mongo_database_name}' \\\n",
+    "      --eval 'db.dropDatabase()' \\\n",
+    "      --quiet\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc387abc62686091",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "source": [
+    "### Create validator\n",
+    "\n",
+    "In this step, you'll create a validator that can be used to check whether data conforms to the NMDC Schema. You'll use it later, to do that.\n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description                  | Link                                                                         |\n",
+    "|------------------------------|------------------------------------------------------------------------------|\n",
+    "| LinkML's `Validator` class   | https://linkml.io/linkml/code/validator.html#linkml.validator.Validator      |\n",
+    "| Validating data using LinkML | https://linkml.io/linkml/data/validating-data.html#validation-in-python-code |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5c982eb0c04e606d",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [],
+   "source": [
+    "schema_definition = get_nmdc_schema_definition()\n",
+    "validator = Validator(\n",
+    "    schema=schema_definition,\n",
+    "    validation_plugins=[JsonschemaValidationPlugin(closed=True)],\n",
+    ")\n",
+    "\n",
+    "# Perform a sanity test of the validator.\n",
+    "assert callable(validator.validate), \"Failed to instantiate a validator\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e7e8befb362a1670",
+   "metadata": {},
+   "source": [
+    "### Create SchemaView\n",
+    "\n",
+    "In this step, you'll instantiate a `SchemaView` that is bound to the destination schema. \n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description                 | Link                                                |\n",
+    "|-----------------------------|-----------------------------------------------------|\n",
+    "| LinkML's `SchemaView` class | https://linkml.io/linkml/developers/schemaview.html |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "625a6e7df5016677",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "schema_view = SchemaView(get_nmdc_schema_definition())\n",
+    "\n",
+    "# As a sanity test, confirm we can use the `SchemaView` instance to access a schema class.\n",
+    "schema_view.get_class(class_name=\"Database\")[\"name\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3975ac24",
+   "metadata": {},
+   "source": [
+    "### Revoke access from the \"origin\" MongoDB server\n",
+    "\n",
+    "We revoke both \"write\" and \"read\" access to the server.\n",
+    "\n",
+    "#### Rationale\n",
+    "\n",
+    "We revoke \"write\" access so people don't make changes to the original data while the migration is happening, given that the migration ends with an overwriting of the original data (which would wipe out any changes made in the meantime).\n",
+    "\n",
+    "We also revoke \"read\" access. The revocation of \"read\" access is technically optional, but (a) the JavaScript mongosh script will be easier for me to maintain if it revokes everything and (b) this prevents people from reading data during the restore step, during which the database may not be self-consistent.\n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description                    | Link                                                      |\n",
+    "|--------------------------------|-----------------------------------------------------------|\n",
+    "| Running a script via `mongosh` | https://www.mongodb.com/docs/mongodb-shell/write-scripts/ |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f761caad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "shell_command = f\"\"\"\n",
+    "  {cfg.mongosh_path} {origin_mongo_cli_base_options} \\\n",
+    "      --file='mongosh-scripts/revoke-privileges.mongo.js' \\\n",
+    "      --quiet\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7f9c87de6fb8530c",
+   "metadata": {},
+   "source": [
+    "### Delete obsolete dumps from previous migrations\n",
+    "\n",
+    "Delete any existing dumps before we create new ones in this notebook. This is so the dumps you generate with this notebook do not get merged with any unrelated ones."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6a949d0fcb4b6fa0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!rm -rf {cfg.origin_dump_folder_path}\n",
+    "!rm -rf {cfg.transformer_dump_folder_path}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7799910b6b0715d",
+   "metadata": {},
+   "source": [
+    "### Dump collections from the \"origin\" MongoDB server\n",
+    "\n",
+    "Use `mongodump` to dump all the collections **from** the \"origin\" MongoDB server **into** a local directory.\n",
+    "\n",
+    "- TODO: Consider only dumping collections represented by the initial schema."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da530d6754c4f6fe",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Dump all collections from the \"origin\" database.\n",
+    "shell_command = f\"\"\"\n",
+    "  {mongodump} {origin_mongo_cli_base_options} \\\n",
+    "      --db='{cfg.origin_mongo_database_name}' \\\n",
+    "      --out='{cfg.origin_dump_folder_path}' \\\n",
+    "      --gzip\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "932ebde8abdd70ec",
+   "metadata": {},
+   "source": [
+    "### Load the dumped collections into the \"transformer\" MongoDB server\n",
+    "\n",
+    "Use `mongorestore` to load the dumped collections **from** the local directory **into** the \"transformer\" MongoDB server.\n",
+    "\n",
+    "References:\n",
+    "- https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore\n",
+    "- https://www.mongodb.com/docs/database-tools/mongorestore/mongorestore-examples/#copy-clone-a-database"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "79bd888e82d52a93",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Restore the dumped collections to the \"transformer\" MongoDB server.\n",
+    "shell_command = f\"\"\"\n",
+    "  {mongorestore} {transformer_mongo_cli_base_options} \\\n",
+    "      --nsFrom='{cfg.origin_mongo_database_name}.*' \\\n",
+    "      --nsTo='{cfg.transformer_mongo_database_name}.*' \\\n",
+    "      --dir='{cfg.origin_dump_folder_path}' \\\n",
+    "      --stopOnError \\\n",
+    "      --drop \\\n",
+    "      --gzip\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3e3c9c4",
+   "metadata": {},
+   "source": [
+    "### Transform the collections within the \"transformer\" MongoDB server\n",
+    "\n",
+    "Use the migrator to transform the collections in the \"transformer\" database.\n",
+    "\n",
+    "> Reminder: The database transformation functions are defined in the `nmdc-schema` Python package installed earlier.\n",
+    "\n",
+    "> Reminder: The \"origin\" database is **not** affected by this step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c89c9dd3afe64e2",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Instantiate a MongoAdapter bound to the \"transformer\" database.\n",
+    "adapter = MongoAdapter(\n",
+    "    database=transformer_mongo_client[cfg.transformer_mongo_database_name],\n",
+    "    on_collection_created=lambda name: print(f'Created collection \"{name}\"'),\n",
+    "    on_collection_renamed=lambda old_name, name: print(f'Renamed collection \"{old_name}\" to \"{name}\"'),\n",
+    "    on_collection_deleted=lambda name: print(f'Deleted collection \"{name}\"'),\n",
+    ")\n",
+    "\n",
+    "# Instantiate a Migrator bound to that adapter.\n",
+    "logger = setup_logger()\n",
+    "migrator = Migrator(adapter=adapter, logger=logger)\n",
+    "\n",
+    "# Execute the Migrator's `upgrade` method to perform the migration.\n",
+    "migrator.upgrade()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4c090068",
+   "metadata": {},
+   "source": [
+    "### Validate the transformed documents\n",
+    "\n",
+    "Now that we have transformed the database, validate each document in each collection in the \"transformer\" MongoDB server."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e1c50b9911e02e70",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get the names of all collections.\n",
+    "collection_names: List[str] = get_collection_names_from_schema(schema_view)\n",
+    "\n",
+    "# Ensure that, if the (large) \"functional_annotation_agg\" collection is present in `collection_names`,\n",
+    "# it goes at the end of the list we process. That way, we can find out about validation errors in\n",
+    "# other collections without having to wait for that (large) collection to be validated.\n",
+    "ordered_collection_names = sorted(collection_names.copy())\n",
+    "large_collection_name = \"functional_annotation_agg\"\n",
+    "if large_collection_name in ordered_collection_names:\n",
+    "    ordered_collection_names = list(filter(lambda n: n != large_collection_name, ordered_collection_names))\n",
+    "    ordered_collection_names.append(large_collection_name)  # puts it last\n",
+    "\n",
+    "# Process each collection.\n",
+    "for collection_name in ordered_collection_names:\n",
+    "    collection = transformer_mongo_client[cfg.transformer_mongo_database_name][collection_name]\n",
+    "    num_documents_in_collection = collection.count_documents({})\n",
+    "    print(f\"Validating collection {collection_name} ({num_documents_in_collection} documents) [\", end=\"\")  # no newline\n",
+    "\n",
+    "    # Calculate how often we'll display a tick mark (i.e. a sign of life).\n",
+    "    num_documents_per_tick = num_documents_in_collection * 0.10  # one tenth of the total\n",
+    "    num_documents_since_last_tick = 0\n",
+    "    \n",
+    "    for document in collection.find():\n",
+    "        # Validate the transformed document.\n",
+    "        #\n",
+    "        # Reference: https://github.com/microbiomedata/nmdc-schema/blob/main/src/docs/schema-validation.md\n",
+    "        #\n",
+    "        # Note: Dictionaries originating as Mongo documents include a Mongo-generated key named `_id`. However,\n",
+    "        #       the NMDC Schema does not describe that key and, indeed, data validators consider dictionaries\n",
+    "        #       containing that key to be invalid with respect to the NMDC Schema. So, here, we validate a\n",
+    "        #       copy (i.e. a shallow copy) of the document that lacks that specific key.\n",
+    "        #\n",
+    "        # Note: The reason we don't use a progress bar library such as `rich[jupyter]`, `tqdm`, or `ipywidgets`\n",
+    "        #       is that _PyCharm's_ Jupyter Notebook integration doesn't fully work with any of them. :(\n",
+    "        #\n",
+    "        schema_class_name = derive_schema_class_name_from_document(schema_view=schema_view, document=document)\n",
+    "        document_without_underscore_id_key = {key: value for key, value in document.items() if key != \"_id\"}\n",
+    "        validation_report: ValidationReport = validator.validate(document_without_underscore_id_key, schema_class_name)\n",
+    "        if len(validation_report.results) > 0:\n",
+    "            result_messages = [result.message for result in validation_report.results]\n",
+    "            raise TypeError(f\"Document is invalid.\\n{result_messages=}\\n{document_without_underscore_id_key=}\")\n",
+    "\n",
+    "        # Display a tick mark if we have validated enough documents since we last displayed one.\n",
+    "        num_documents_since_last_tick += 1\n",
+    "        if num_documents_since_last_tick >= num_documents_per_tick:\n",
+    "            num_documents_since_last_tick = 0\n",
+    "            print(\".\", end=\"\")  # no newline\n",
+    "            \n",
+    "    print(\"]\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3edf77c7",
+   "metadata": {},
+   "source": [
+    "### Dump the collections from the \"transformer\" MongoDB server\n",
+    "\n",
+    "Now that the collections have been transformed and validated, dump them **from** the \"transformer\" MongoDB server **into** a local directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db6e432d",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Dump the database from the \"transformer\" MongoDB server.\n",
+    "shell_command = f\"\"\"\n",
+    "  {mongodump} {transformer_mongo_cli_base_options} \\\n",
+    "      --db='{cfg.transformer_mongo_database_name}' \\\n",
+    "      --out='{cfg.transformer_dump_folder_path}' \\\n",
+    "      --gzip\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")  "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "997fcb281d9d3222",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "source": [
+    "### Create a bookkeeper\n",
+    "\n",
+    "Create a `Bookkeeper` that can be used to document migration events in the \"origin\" server."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbbe706d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bookkeeper = Bookkeeper(mongo_client=origin_mongo_client)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e0c8891",
+   "metadata": {},
+   "source": [
+    "### Indicate — on the \"origin\" server — that the migration is underway\n",
+    "\n",
+    "Add an entry to the migration log collection to indicate that this migration has started."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca49f61a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_STARTED)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c253e6f",
+   "metadata": {},
+   "source": [
+    "### Drop the original collections from the \"origin\" MongoDB server\n",
+    "\n",
+    "This is necessary for situations where collections were renamed or deleted. (The `--drop` option of `mongorestore` would only drop collections that exist in the dump being restored, which would not include renamed or deleted collections.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b26e434",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "shell_command = f\"\"\"\n",
+    "  {cfg.mongosh_path} {origin_mongo_cli_base_options} \\\n",
+    "      --eval 'use {cfg.origin_mongo_database_name}' \\\n",
+    "      --eval 'db.dropDatabase()'\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d84bdc11",
+   "metadata": {},
+   "source": [
+    "### Load the collections into the \"origin\" MongoDB server\n",
+    "\n",
+    "Load the transformed collections into the \"origin\" MongoDB server."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1dfbcf0a",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Load the transformed collections into the origin server, replacing any same-named ones that are there.\n",
+    "shell_command = f\"\"\"\n",
+    "  {mongorestore} {origin_mongo_cli_base_options} \\\n",
+    "      --nsFrom='{cfg.transformer_mongo_database_name}.*' \\\n",
+    "      --nsTo='{cfg.origin_mongo_database_name}.*' \\\n",
+    "      --dir='{cfg.transformer_dump_folder_path}' \\\n",
+    "      --stopOnError \\\n",
+    "      --verbose \\\n",
+    "      --drop \\\n",
+    "      --gzip\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")  "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca5ee89a79148499",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "source": [
+    "### Indicate that the migration is complete\n",
+    "\n",
+    "Add an entry to the migration log collection to indicate that this migration is complete."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1eaa6c92789c4f3",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [],
+   "source": [
+    "bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_COMPLETED)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "04c856a8",
+   "metadata": {},
+   "source": [
+    "### Restore access to the \"origin\" MongoDB server\n",
+    "\n",
+    "This effectively un-does the access revocation that we did earlier."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9aab3c7e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "shell_command = f\"\"\"\n",
+    "  {cfg.mongosh_path} {origin_mongo_cli_base_options} \\\n",
+    "      --file='mongosh-scripts/restore-privileges.mongo.js' \\\n",
+    "      --quiet\n",
+    "\"\"\"\n",
+    "completed_process = subprocess.run(shell_command, shell=True)\n",
+    "print(f\"\\nReturn code: {completed_process.returncode}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From 4ca46b1010f39e426aaad118fd0c31a810beab01 Mon Sep 17 00:00:00 2001
From: eecavanna <eecavanna@users.noreply.github.com>
Date: Wed, 13 Nov 2024 14:08:05 -0800
Subject: [PATCH 2/2] Drop/create Mongo index before/after transformation, and
 fix `import` bug

---
 .../notebooks/migrate_11_0_3_to_11_1_0.ipynb  | 54 ++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb b/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb
index 5916ad47..e1f2105a 100644
--- a/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb
+++ b/demo/metadata_migration/notebooks/migrate_11_0_3_to_11_1_0.ipynb
@@ -175,7 +175,7 @@
     "from bookkeeper import Bookkeeper, MigrationEvent\n",
     "\n",
     "# Dynamic imports:\n",
-    "migrator_module = importlib.import_module(MIGRATOR_MODULE_NAME, package=\"nmdc_schema.migrators\")\n",
+    "migrator_module = importlib.import_module(f\".{MIGRATOR_MODULE_NAME}\", package=\"nmdc_schema.migrators\")\n",
     "Migrator = getattr(migrator_module, \"Migrator\")  # gets the class\n",
     "assert isclass(Migrator), \"Failed to import Migrator class.\""
    ]
@@ -505,6 +505,32 @@
     "print(f\"\\nReturn code: {completed_process.returncode}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "84a5d514",
+   "metadata": {},
+   "source": [
+    "### Do pre-transformation steps\n",
+    "\n",
+    "In this section, we'll delete an index that is not consistent with the new schema—it is a compound index that includes a field that the migrators will be removing from all documents in the collection.\n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description         | Link                                                                                                          |\n",
+    "|---------------------|---------------------------------------------------------------------------------------------------------------|\n",
+    "| `drop_index` method | https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.drop_index |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21bb1ba2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "transformer_mongo_client[cfg.transformer_mongo_database_name].get_collection(\"functional_annotation_agg\").drop_index(\"gene_function_id_1_metagenome_annotation_id_1\")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "c3e3c9c4",
@@ -544,6 +570,32 @@
     "migrator.upgrade()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d568ed23",
+   "metadata": {},
+   "source": [
+    "### Do post-transformation steps\n",
+    "\n",
+    "In this section, we'll create an index that is just like the one we deleted earlier, but uses the new field name instead of the old one.\n",
+    "\n",
+    "#### References\n",
+    "\n",
+    "| Description           | Link                                                                                                            |\n",
+    "|-----------------------|-----------------------------------------------------------------------------------------------------------------|\n",
+    "| `create_index` method | https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.create_index |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce460536",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "transformer_mongo_client[cfg.transformer_mongo_database_name].get_collection(\"functional_annotation_agg\").create_index([\"gene_function_id\", \"was_generated_by\"], unique=True)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "4c090068",