Implement notebook that migrates Mongo database from "legacy" to "Berkeley" schema #553

@eecavanna eecavanna commented Jun 13, 2024

Description

Created a new migration notebook

In this branch, I implemented a Python notebook that can be used to migrate the NMDC Mongo database from conforming to nmdc-schema version 10 (a.k.a. the latest version of the "legacy" schema), to conforming to nmdc-schema version 11 (a.k.a. the "Berkeley" schema).

Here's the rendered notebook (it may be easier to review in this rendered format than as source code): migrate_10_8_0_to_11_0_0.ipynb

Introduced a new dependency

I introduced the program mongosh as a dependency of the migration notebooks. That program allows the notebook to perform arbitrary Mongo commands, instead of just dumping and restoring collections. For example, it allows the notebook to change people's Mongo roles (to temporarily revoke their access during the migration process).
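
As an illustration of the role-revocation use case, here is a minimal sketch of how a notebook cell might build a mongosh invocation. The function name `build_revoke_roles_command`, the default `auth_db` value, and the example user are hypothetical and not taken from the notebook; the sketch relies only on mongosh's `--eval` and `--quiet` options and the shell method `db.revokeRolesFromUser()`.

```python
import json


def build_revoke_roles_command(mongo_uri, username, roles, auth_db="admin"):
    """Return an argv list that asks mongosh to revoke `roles` from `username`.

    Hypothetical helper: the notebook may structure this differently.
    """
    # json.dumps produces valid JavaScript literals for strings and arrays.
    script = (
        f"db.getSiblingDB({json.dumps(auth_db)})"
        f".revokeRolesFromUser({json.dumps(username)}, {json.dumps(roles)})"
    )
    return ["mongosh", mongo_uri, "--quiet", "--eval", script]


if __name__ == "__main__":
    cmd = build_revoke_roles_command(
        "mongodb://localhost:27017", "alice", ["readWrite"]
    )
    print(cmd)
    # To actually run it (requires mongosh on the PATH and a reachable server):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the command as a list, rather than as a shell string, avoids quoting problems when the script contains double quotes.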

Changed configuration file

mongosh does not support the same configuration options that mongodump and mongorestore do. So that all three programs can share one configuration, I changed the notebook configuration file format to accommodate all of them.
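
To make the idea concrete, here is one possible way to derive settings for all three programs from a single set of connection values. This is a sketch under assumptions, not the format this branch actually uses: the function names and the choice to centre everything on a connection URI are hypothetical. It relies on mongodump and mongorestore accepting a YAML file via `--config` that may contain a `uri` field, and on mongosh accepting a connection string as a positional argument.

```python
from urllib.parse import quote


def build_mongo_uri(host, port, username, password):
    """Build a connection URI, percent-encoding the credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"mongodb://{user}:{pwd}@{host}:{port}/"


def render_tools_config_yaml(uri):
    """Return YAML text for a file passed to mongodump/mongorestore via --config."""
    return f'uri: "{uri}"\n'


def build_mongosh_argv(uri):
    """Return the start of a mongosh argv that uses the same URI."""
    return ["mongosh", uri, "--quiet"]


if __name__ == "__main__":
    # Example values only; none of these come from the notebook.
    uri = build_mongo_uri("localhost", 27017, "admin", "p@ss word")
    print(render_tools_config_yaml(uri), end="")
    print(build_mongosh_argv(uri))
```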

Writing migrator log to a file

I made it so the log messages generated by the migrators themselves get written to a log file. Previously, those messages were ignored (not shown). This was particularly useful for this migration notebook, since it involves multiple partial migrators. This is our most complex migration so far.
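
For readers unfamiliar with the mechanism, the following sketch shows one common way to route log messages to a file with Python's standard `logging` module. It assumes the migrators emit messages through a standard `logging.Logger`; the logger name, the function name `get_migrator_logger`, and the message format are hypothetical.

```python
import logging


def get_migrator_logger(log_file_path, name="migrator"):
    """Return a logger whose messages are appended to `log_file_path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Avoid attaching a second file handler if the notebook cell is re-run.
    if not any(isinstance(h, logging.FileHandler) for h in logger.handlers):
        handler = logging.FileHandler(log_file_path)
        handler.setFormatter(
            logging.Formatter("[%(asctime)s %(name)s %(levelname)s] %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    logger = get_migrator_logger("migrator.log")
    logger.info("Starting partial migrator 1")
```

The handler check matters in a notebook, where re-running a cell would otherwise cause each message to be written several times.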

Fixes #519

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • ⚠️ Warning: The configuration file changes in this branch will break old migration notebooks. That consequence will be documented in an upcoming GitHub Issue, which I will create once this PR branch gets merged into main. There is already a workaround for it (which is to check out an older version of this repository), should anyone want to use the old migration notebooks before that upcoming Issue gets resolved.

As a reminder, although the migration notebooks live in this repository, they have no impact on the Runtime (and vice versa). So, the above breaking change does not affect the Runtime.

How Has This Been Tested?

  • Manually tested individual cells while implementing them
  • Used an earlier version of the notebook to set up the Berkeley environment on Spin
  • The Python unittest tests that target the Config class all pass (as confirmed by a GitHub Actions workflow in this PR)

Checklist:

@eecavanna eecavanna self-assigned this Jun 13, 2024
@eecavanna eecavanna changed the title from 'Implement notebook that migrates Mongo database to "Berkeley schema' to 'Implement notebook that migrates Mongo database to "Berkeley" schema' Jun 13, 2024
@eecavanna eecavanna marked this pull request as ready for review August 18, 2024 19:17
@eecavanna eecavanna requested review from turbomam and brynnz22 August 18, 2024 19:17
@eecavanna eecavanna changed the title from 'Implement notebook that migrates Mongo database to "Berkeley" schema' to 'Implement notebook that migrates Mongo database from "legacy" to "Berkeley" schema' Aug 18, 2024
@eecavanna eecavanna marked this pull request as draft August 20, 2024 05:23

eecavanna commented Aug 20, 2024

I converted this back to a draft. I still want to implement the following things:

  1. Only dump/restore collections described by the NMDC Schema—and not the other "application-specific" collections that happen to reside in the same database and are never transformed by the migration framework. Note: I want those non-schema-described collections to eventually reside in a different database from the schema-described ones, but that's out of the scope of this notebook (and squad).
    • I was previously hesitant to do this because there's a chance the initial schema changes between now and show time; but it has been frozen at 10.7.0 for a month now, so I don't think it will change often, if at all.
  2. Use a SchemaView in a place where the notebook currently accesses the JSON Schema directly (e.g. json_schema["$defs"]["Database"])
    • This will involve making the LinkML runtime be a dependency of the notebook.
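
For context on item 2, here is a minimal sketch of the kind of direct JSON Schema access being described, using a toy schema dictionary rather than the real NMDC schema. It assumes the properties of the `Database` definition correspond to collection names; the function name `get_collection_names` and the toy contents are hypothetical. The SchemaView-based equivalent is not shown, since it would require the LinkML runtime.

```python
def get_collection_names(json_schema):
    """Return the property names of the Database definition, sorted."""
    database_def = json_schema["$defs"]["Database"]
    return sorted(database_def.get("properties", {}).keys())


if __name__ == "__main__":
    # Toy stand-in for the real JSON Schema.
    toy_schema = {
        "$defs": {
            "Database": {
                "properties": {
                    "study_set": {"type": "array"},
                    "biosample_set": {"type": "array"},
                }
            }
        }
    }
    print(get_collection_names(toy_schema))
```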

@eecavanna

I will mark this "Ready for review" since it works as-is. I'll create a separate ticket about "optimizing" it to do the following:

  1. Download the old schema from GitHub via $ curl (or Python requests, etc.)
    • Note: This would make the notebook dependent upon an Internet connection (no longer self-contained). As a reminder, it's already dependent upon Python packages existing or being downloadable.
  2. Instantiate a SchemaView bound to that schema (call this sv_old)
  3. Instantiate a SchemaView bound to the new schema, which is import-ed (call this sv_new)
  4. Use sv_old to get collection names in old schema (call this clns_old)
  5. Use sv_new to get collection names in new schema (call this clns_new)
  6. When running mongodump on the origin, dump only clns_old
  7. When running mongorestore on the transformer, restore only clns_old
  8. (Use the migrator to) transform the contents of the transformer database
  9. Dump clns_new from the transformer
  10. Drop clns_old from the origin
  11. Restore clns_new into the origin
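
The dump, restore, and drop steps above could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's implementation: the function names are hypothetical, and connection options are omitted for brevity. It assumes mongodump is limited to the wanted collections by excluding all others with repeated `--excludeCollection` options (which requires `--db`), and that mongorestore is limited with repeated `--nsInclude` options.

```python
import json


def build_dump_argv(db_name, all_collections, wanted, out_dir):
    """mongodump argv that dumps only `wanted` by excluding everything else."""
    excluded = sorted(set(all_collections) - set(wanted))
    argv = ["mongodump", f"--db={db_name}", f"--out={out_dir}"]
    argv += [f"--excludeCollection={name}" for name in excluded]
    return argv


def build_restore_argv(db_name, wanted, dump_dir):
    """mongorestore argv that restores only the `wanted` collections."""
    argv = ["mongorestore", f"--dir={dump_dir}"]
    argv += [f"--nsInclude={db_name}.{name}" for name in sorted(wanted)]
    return argv


def build_drop_script(collection_names):
    """JavaScript for mongosh --eval that drops each named collection."""
    return " ".join(
        f"db.getCollection({json.dumps(name)}).drop();"
        for name in sorted(collection_names)
    )


if __name__ == "__main__":
    # Toy collection names; "jobs" stands in for a non-schema collection.
    clns_old = ["biosample_set", "study_set"]
    all_clns = clns_old + ["jobs"]
    print(build_dump_argv("nmdc", all_clns, clns_old, "/tmp/dump"))
    print(build_restore_argv("nmdc", clns_old, "/tmp/dump"))
    print(build_drop_script(clns_old))
```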

@eecavanna eecavanna marked this pull request as ready for review August 27, 2024 05:32
@eecavanna eecavanna requested review from turbomam and brynnz22 August 27, 2024 05:34

eecavanna commented Aug 28, 2024

Merging without formal review. No impact on Runtime application (just shares a repo). Migration squad members did a brief pair review session today, during which I presented some of the major changes.

@eecavanna eecavanna merged commit 7955875 into main Aug 28, 2024
2 checks passed
@eecavanna

> I'll create a separate ticket about "optimizing" it

Instead of creating a new ticket, I added the following comment to an existing ticket, which was already about the same thing: #449 (comment)

@eecavanna eecavanna deleted the 519-migrations-implement-notebook-that-runs-all-berkeley-schema-migrators branch September 2, 2024 21:33
Development

Successfully merging this pull request may close these issues.

Migrations: Implement notebook that runs all Berkeley schema migrators
1 participant