
Data Portal Management

Michael Nagler edited this page Dec 23, 2024 · 8 revisions

NMDC Data Portal

The data portal is not a system of record for NMDC data. It is a copy of the system of record in MongoDB, transformed into a relational schema optimized for the queries that the data portal client needs to make.

The NMDC Data Portal is built with these technologies:

  • Python and FastAPI
  • PostgreSQL and SQLAlchemy
  • Celery and Redis Queue
  • Vue JS and Vuetify

Software versions

General architecture

[Figure: nmdc-diagram]

API Documentation

Information about how to use the search portal REST API can be found in the wiki.

Development documentation

Data Ingest

The ingestion procedure is a two-step process:

  1. Data is ingested into a staging database
  2. The staging and production databases are swapped in Spin

The steps to perform an ingest are as follows:

Step 1. Ingest into the staging database

Prerequisites:

  1. You must be logged into the data portal using your ORCID
  2. Your account must be flagged as an administrator. Administrator access can be granted on the user list page https://data.microbiomedata.org/users

Execute the POST /api/jobs/ingest endpoint through the Swagger docs.

  • You can choose to do a "fast" ingest by setting skip_annotations or setting the function_limit. This ingest takes ~30 minutes.
  • A full ingest pulls all gene functions from Mongo and takes ~24 hours.

Before moving on, verify that the ingest job completed successfully by checking the worker's logs.
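As an alternative to clicking through the Swagger docs, the same request could be assembled from a script. The following is a minimal sketch, not the portal's documented client: it assumes that skip_annotations and function_limit are accepted as query parameters and that the administrator session can be presented as a bearer token. Both are assumptions to confirm against the Swagger docs. The helper name build_ingest_request is hypothetical, and the request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_ingest_request(base_url, token, skip_annotations=False, function_limit=None):
    """Build (but do not send) a POST request for the ingest endpoint."""
    # Assumption: the ingest options are passed as query parameters.
    params = {"skip_annotations": str(skip_annotations).lower()}
    if function_limit is not None:
        params["function_limit"] = str(function_limit)
    url = f"{base_url.rstrip('/')}/api/jobs/ingest?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        method="POST",
        # Assumption: a bearer token identifies the administrator account.
        headers={"Authorization": f"Bearer {token}"},
    )


# Build a "fast" ingest request; <TOKEN> is a placeholder, not a real credential.
request = build_ingest_request(
    "https://data.microbiomedata.org", "<TOKEN>", skip_annotations=True
)
print(request.get_method(), request.full_url)
# To actually start the job: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the URL before triggering an ingest that may run for many hours.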

Step 2. Modify environment variables to swap prod/staging

In the rancher2 UI, select Resources -> Secrets from the toolbar and click on the postgres secret group. Click the button with three vertical dots in the upper right and select Edit. Swap the values under INGEST_URI and POSTGRES_URI, then click Save.
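The effect of the swap can be pictured with a small sketch. The dictionary below stands in for the postgres secret group; the connection strings are hypothetical values for illustration only, and swap_database_uris is a hypothetical helper, not part of the portal code.

```python
def swap_database_uris(secret):
    """Return a copy of the secret with INGEST_URI and POSTGRES_URI exchanged."""
    swapped = dict(secret)
    swapped["INGEST_URI"] = secret["POSTGRES_URI"]
    swapped["POSTGRES_URI"] = secret["INGEST_URI"]
    return swapped


# Hypothetical values; the real URIs live in the rancher2 secret.
before = {
    "POSTGRES_URI": "postgresql://user:pw@db/nmdc_a",
    "INGEST_URI": "postgresql://user:pw@db/nmdc_b",
}
after = swap_database_uris(before)

# The freshly ingested database now serves production traffic.
print(after["POSTGRES_URI"])
# Swapping a second time restores the original values.
print(swap_database_uris(after) == before)
```

Because the swap is symmetric, applying it again undoes it, which is why the change can be reverted as long as no new ingest has overwritten the old production database.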

Step 3. Restart the containers

Go back to the workloads page and redeploy both the backend and worker services. If the site doesn't work with the newest data, you can always revert the changes to the secrets provided you haven't started a new ingest.

Ingest via cron job

Ingest is automated via a cron job in rancher2. In the nmdc-dev namespace, it runs nightly. In the nmdc namespace, it runs weekly on Sundays. An ad-hoc ingest can be performed by clicking "Run Now" in the three-dot drop-down menu. When ingest is done this way, API calls are made to rancher to automatically swap the secrets and restart the containers. Logs from these ingest runs are not stored in Redis, but can be accessed by viewing a recent run and clicking "View Logs."

Troubleshooting

There is a simple locking mechanism to prevent multiple ingests from running concurrently. Occasionally, a task is shut down ungracefully and the lock has to be cleared manually. To do that, truncate the ingest_lock table on the production (not ingest) database. Open a shell in the "db" container and start a psql session by running psql with no arguments. You may need to explicitly connect to the production database with a psql command like \connect <PRODUCTION_DB_NAME>. See your .env file to determine the production database name. Then run:

TRUNCATE TABLE ingest_lock;
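To illustrate why a stale lock blocks later ingests and why emptying the table fixes it, here is a self-contained sketch using an in-memory SQLite database. It is not the portal's implementation: the single-column schema for ingest_lock and both helper functions are hypothetical, and SQLite's DELETE stands in for PostgreSQL's TRUNCATE.

```python
import sqlite3


def acquire_lock(conn):
    """Try to take the ingest lock; return True on success."""
    try:
        # The primary key on id allows at most one lock row to exist.
        conn.execute("INSERT INTO ingest_lock (id) VALUES (1)")
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False


def clear_lock(conn):
    """Remove a stale lock (SQLite has no TRUNCATE, so DELETE is used)."""
    conn.execute("DELETE FROM ingest_lock")
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row means an ingest is in progress.
conn.execute("CREATE TABLE ingest_lock (id INTEGER PRIMARY KEY)")
conn.commit()

first = acquire_lock(conn)    # a task takes the lock, then dies without releasing it
blocked = acquire_lock(conn)  # the next ingest is refused
clear_lock(conn)              # manual cleanup, analogous to TRUNCATE TABLE ingest_lock
retry = acquire_lock(conn)    # ingest can start again
print(first, blocked, retry)
```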

Sometimes, data schema changes cause ingest to fail. These schema serialization failures are typically logged by the ingest worker and may require correction in schemas.py.