# Data Portal Management
The data portal is not a system of record for NMDC data. It is a copy of the system of record (MongoDB), transformed into a relational schema optimized for the kinds of queries the data portal client needs to make.
The NMDC Data Portal is built with these technologies:
- Python and FastAPI
- PostgreSQL and SQLAlchemy
- Celery and Redis Queue
- Vue JS and Vuetify
Information about how to use the search portal REST API can be found in the wiki.
The ingestion procedure is a two-step process:
- Data is ingested into a staging database.
- The staging and production databases are swapped in Spin.
The steps to perform an ingest are as follows:
Prerequisites:
- You must be logged into the data portal using your ORCID.
- Your account must be flagged as an administrator. Administrator access can be granted on the user list page: https://data.microbiomedata.org/users
Execute the `POST /api/jobs/ingest` endpoint through the Swagger docs (or from a script, as sketched below).
- You can choose to do a "fast" ingest by setting `skip_annotations` or setting the `function_limit`. This ingest takes ~30 minutes.
- A full ingest pulls all gene functions from Mongo and takes ~24 hours.
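For example, a minimal sketch of triggering a fast ingest from a script. The token and the way parameters are passed (query string vs. JSON body) are assumptions; confirm both against the Swagger docs:

```python
import requests

BASE_URL = "https://data.microbiomedata.org"
API_TOKEN = "..."  # hypothetical token from your ORCID-authenticated session

# Trigger a "fast" ingest by skipping gene function annotations (~30 minutes).
# Drop the parameter for a full ingest (~24 hours).
response = requests.post(
    f"{BASE_URL}/api/jobs/ingest",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"skip_annotations": True},  # may be a JSON body instead; check Swagger
)
response.raise_for_status()
```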
You should verify that the ingest job completed successfully by looking at the logs in the worker before moving on.
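If you'd rather check programmatically than tail the worker logs, Celery's inspection API can show whether the ingest task is still active. This is a sketch under assumptions: the broker URL is a placeholder, and the real value should come from your `.env`:

```python
from celery import Celery

# Placeholder broker URL; use the Redis URL from your .env.
app = Celery(broker="redis://localhost:6379/0")

# active() returns {worker_name: [task_info, ...]}, or None if no workers reply.
active = app.control.inspect().active() or {}
for worker, tasks in active.items():
    for task in tasks:
        print(worker, task["name"], task["id"])
```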
In the Rancher2 UI, select `Resources -> Secrets` from the toolbar and click on the `postgres` secret group. Click the button with three vertical dots in the upper right and select `Edit`, then swap the values under `INGEST_URI` and `POSTGRES_URI` and click Save.
Go back to the workloads page and redeploy both the `backend` and `worker` services. If the site doesn't work with the newest data, you can always revert the changes to the secrets, provided you haven't started a new ingest.
Ingest is automated via a cron job in Rancher2. In the `nmdc-dev` namespace, it runs nightly. In the `nmdc` namespace, it runs weekly on Sundays. Ad-hoc ingest can be performed by clicking "Run Now" from the 3-dot menu. When ingest is run this way, API calls are made to Rancher to automatically swap the secrets and restart the containers (roughly as sketched below). Logs from these ingest runs are not stored in Redis, but can be accessed by viewing a recent run and clicking "View Logs."
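The automated swap amounts to something like the following sketch. The Rancher URL, API token, secret ID, and endpoint path are all placeholders, and the exact payload shape (including whether values are base64-encoded) should be checked against your Rancher version's API:

```python
import requests

RANCHER_URL = "https://rancher.example.org/v3"  # placeholder
TOKEN = "token-xxxxx:secret"                    # placeholder Rancher API key
SECRET_ID = "project-id:postgres"               # placeholder secret ID

headers = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current secret, swap the two database URIs, and write it back.
# Note: Rancher may store secret values base64-encoded.
secret = requests.get(f"{RANCHER_URL}/secrets/{SECRET_ID}", headers=headers).json()
data = secret["data"]
data["INGEST_URI"], data["POSTGRES_URI"] = data["POSTGRES_URI"], data["INGEST_URI"]
requests.put(f"{RANCHER_URL}/secrets/{SECRET_ID}", headers=headers, json=secret).raise_for_status()
```

After the swap, the same API is used to restart the containers, mirroring the manual redeploy of the `backend` and `worker` services above.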
There is a simple locking mechanism to prevent multiple ingests from running concurrently. Occasionally, a task will get shut down ungracefully and you have to clear the lock manually. To do that, truncate the `ingest_lock` table on the production (not ingest) database. This can be done by gaining access to the "db" container's shell and starting a psql session with an unqualified `psql` command. You may need to explicitly connect to the production database with `\connect <PRODUCTION_DB_NAME>`. See your `.env` file to determine the production database name.
```sql
TRUNCATE TABLE ingest_lock;
```
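Equivalently, a minimal sketch of clearing the lock with SQLAlchemy, assuming the production connection URI is the `POSTGRES_URI` value from your `.env` (the URI below is a placeholder):

```python
from sqlalchemy import create_engine, text

# Placeholder URI; use the production (not ingest) POSTGRES_URI from your .env.
engine = create_engine("postgresql://user:password@db:5432/<PRODUCTION_DB_NAME>")

with engine.begin() as conn:  # begin() wraps the statement in a committed transaction
    conn.execute(text("TRUNCATE TABLE ingest_lock;"))
```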
Sometimes, data schema changes cause ingest to fail. These schema serialization failures are typically logged by the ingest worker and may require correction in `schemas.py`.