feat(IPVC-2283): mitochondrial transcript workflow (#20)
nvta1209 authored Apr 9, 2024
1 parent a072cb1 commit d65d764
Showing 12 changed files with 280 additions and 303 deletions.
95 changes: 56 additions & 39 deletions README.md
@@ -289,60 +289,77 @@ To develop UTA, follow these steps.
4. Testing

$ docker build --target uta-test -t uta-test .
$ docker run -it --rm uta-test python -m unittest
$ docker run --rm uta-test python -m unittest

## UTA update procedure

### 1. Download files from NCBI
Requires docker.

Run `sbin/ncbi-download-docker`. Requires bash and docker.
### 0. Setup

Example:
Make directories:
```
sbin/ncbi-download-docker $(pwd)/ncbi-data
mkdir -p $(pwd)/ncbi-data
mkdir -p $(pwd)/output/artifacts
mkdir -p $(pwd)/output/logs
```

The specified directory will have the following structure:

    ├── gene
    │   └── DATA
    │       ├── GENE_INFO
    │       │   └── Mammalia
    │       │       └── Homo_sapiens.gene_info.gz
    │       └── gene2accession.gz
    ├── genomes
    │   └── refseq
    │       └── vertebrate_mammalian
    │           └── Homo_sapiens
    │               └── all_assembly_versions
    │                   └── GCF_000001405.25_GRCh37.p13
    │                       ├── GCF_000001405.25_GRCh37.p13_genomic.fna.gz
    │                       └── GCF_000001405.25_GRCh37.p13_genomic.gff.gz
    └── refseq
        └── H_sapiens
            └── mRNA_Prot
                ├── human.1.protein.faa.gz
                ├── human.1.rna.fna.gz
                └── human.1.rna.gbff.gz

### 2. Download SeqRepo data

Run `sbin/seqrepo-download`. Requires bash and docker.

Example:
Set variables:
```
sbin/seqrepo-download 2024-02-20 $(pwd)/seqrepo-data
export UTA_ETL_OLD_SEQREPO_VERSION=2024-02-20
export UTA_ETL_OLD_UTA_VERSION=uta_20210129b
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_SEQREPO_DIR=./seqrepo-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
```
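
Every compose service in this change mounts its directories from these variables, so an unset variable tends to surface only once a container starts. A minimal sketch of a pre-flight check follows; the function name `check_uta_etl_env` is hypothetical and not part of the repository, and the variable list is simply the six exports above.

```shell
# Hypothetical helper (not part of UTA): report any unset or empty UTA_ETL_* variable.
check_uta_etl_env() {
    missing=0
    for name in UTA_ETL_OLD_SEQREPO_VERSION UTA_ETL_OLD_UTA_VERSION \
                UTA_ETL_NCBI_DIR UTA_ETL_SEQREPO_DIR \
                UTA_ETL_WORK_DIR UTA_ETL_LOG_DIR; do
        # Look up the variable whose name is stored in $name.
        # eval is acceptable here because the names are fixed literals.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is set, 1 otherwise.
    return "$missing"
}
```

One possible use is to gate the first step on it, for example `check_uta_etl_env && docker compose run ncbi-download`.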

### 3. Update UTA and SeqRepo
Build the UTA image:
```
docker build --target uta -t uta-update .
```

### 1. Download SeqRepo data
```
docker pull biocommons/seqrepo:$UTA_ETL_OLD_SEQREPO_VERSION
# download seqrepo. can skip if container already exists.
docker run --name seqrepo biocommons/seqrepo:$UTA_ETL_OLD_SEQREPO_VERSION
# copy seqrepo data into a local directory
docker run -v $UTA_ETL_SEQREPO_DIR:/output-dir --volumes-from seqrepo ubuntu bash -c 'cp -R /usr/local/share/seqrepo/* /output-dir'
# allow seqrepo to be modified
docker run -it -v $UTA_ETL_SEQREPO_DIR:/output-dir ubuntu bash -c 'chmod -R +w /output-dir'
```

Run `sbin/uta-update`. Requires bash and docker.
Note: pulling data takes ~30 minutes and requires ~13 GB.
Note: a container called seqrepo will be left behind.
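
The comment in the block above says the `docker run --name seqrepo` step can be skipped if the container already exists. A sketch of how that check might be automated is shown here; `container_in_list` is a hypothetical helper, and it assumes the container names are supplied one per line on stdin.

```shell
# Hypothetical helper: succeed if NAME appears as a whole line on stdin.
# -F matches a fixed string, -x requires a full-line match, -q suppresses output.
container_in_list() {
    grep -Fxq -- "$1"
}

# Example wiring, left commented out because it needs a Docker daemon
# and the pull is large:
# docker ps -a --format '{{.Names}}' | container_in_list seqrepo \
#     || docker run --name seqrepo "biocommons/seqrepo:$UTA_ETL_OLD_SEQREPO_VERSION"
```

Matching the full line avoids a false positive on a container whose name merely contains `seqrepo`.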

Example:
### 2. Extract and transform data from NCBI

Download files from NCBI, extract into intermediate files, and load into UTA and SeqRepo.

See 2A for nuclear transcripts and 2B for mitochondrial transcripts.

#### 2A. Nuclear transcripts
```
docker compose run ncbi-download
docker compose run uta-extract
docker compose run seqrepo-load
UTA_ETL_SKIP_GENE_LOAD=false docker compose run uta-load
```
sbin/uta-update $(pwd)/ncbi-data $(pwd)/seqrepo-data $(pwd)/uta-build uta_20210129b 2024-02-20

#### 2B. Mitochondrial transcripts
```
docker compose run mito-extract
docker compose run seqrepo-load
UTA_ETL_SKIP_GENE_LOAD=true docker compose run uta-load
```

UTA has now been updated, and the database has been dumped to a pgd file in `UTA_ETL_WORK_DIR`. SeqRepo has been updated in place.
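
The two workflows differ only in their extraction step and in the value of `UTA_ETL_SKIP_GENE_LOAD`. The sketch below prints the command sequence for either one, which can help when scripting the update; `etl_steps` and its `nuclear`/`mito` arguments are hypothetical names, and the commands are copied from 2A and 2B.

```shell
# Hypothetical helper: print the compose commands for one transcript type.
etl_steps() {
    case "$1" in
        nuclear)
            echo "docker compose run ncbi-download"
            echo "docker compose run uta-extract"
            echo "docker compose run seqrepo-load"
            echo "UTA_ETL_SKIP_GENE_LOAD=false docker compose run uta-load"
            ;;
        mito)
            echo "docker compose run mito-extract"
            echo "docker compose run seqrepo-load"
            echo "UTA_ETL_SKIP_GENE_LOAD=true docker compose run uta-load"
            ;;
        *)
            echo "usage: etl_steps nuclear|mito" >&2
            return 2
            ;;
    esac
}
```

Printing rather than executing keeps the helper safe to run anywhere; the output can be reviewed first and then piped to `sh -e` if desired.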


## Migrations
UTA uses alembic to manage database migrations. To auto-generate a migration:
```
@@ -353,7 +370,7 @@ Adjust the upgrade and downgrade function definitions. To apply the migration:
```
alembic -c etc/alembic.ini upgrade head
```
To reverse a migration, use `downgrade` with the number of steps to reverse. For example, to reverse the last:
To reverse a migration, use `downgrade` with the number of steps to reverse. For example, to reverse the last:
```
alembic -c etc/alembic.ini downgrade -1
```
50 changes: 43 additions & 7 deletions docker-compose.yml
@@ -3,24 +3,60 @@
version: '3'

services:
  ncbi-download:
    image: uta-update
    command: sbin/ncbi-download /ncbi-dir
    volumes:
      - .:/opt/repos/uta
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
    working_dir: /opt/repos/uta
    network_mode: host
  uta-extract:
    image: uta-update
    command: sbin/uta-extract /ncbi-dir /uta-extract/work /uta-extract/logs
    volumes:
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-extract/work
      - ${UTA_ETL_LOG_DIR}:/uta-extract/logs
    working_dir: /opt/repos/uta
    network_mode: host
  seqrepo-load:
    image: uta-update
    command: sbin/seqrepo-load /usr/local/share/seqrepo 2024-02-20 /seqrepo-load/work /seqrepo-load/logs
    volumes:
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/seqrepo-load/work
      - ${UTA_ETL_LOG_DIR}:/seqrepo-load/logs
    working_dir: /opt/repos/uta
    network_mode: host
  uta:
    container_name: uta
    image: biocommons/uta:${UTA_VERSION}
    image: biocommons/uta:${UTA_ETL_OLD_UTA_VERSION}
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
    healthcheck:
      test: psql -h localhost -U anonymous -d uta -c "select * from ${UTA_VERSION}.meta"
      test: psql -h localhost -U anonymous -d uta -c "select * from ${UTA_ETL_OLD_UTA_VERSION}.meta"
      interval: 10s
      retries: 60
    network_mode: host
  uta-update:
  uta-load:
    image: uta-update
    command: etc/scripts/run-uta-build.sh ${UTA_VERSION} ${SEQREPO_VERSION} /ncbi-dir /workdir
    command: sbin/uta-load ${UTA_ETL_OLD_UTA_VERSION} /ncbi-dir /uta-load/work /uta-load/logs ${UTA_ETL_SKIP_GENE_LOAD}
    depends_on:
      uta:
        condition: service_healthy
    volumes:
      - ${NCBI_DIR}:/ncbi-dir
      - ${SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${WORKING_DIR}:/workdir
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-load/work
      - ${UTA_ETL_LOG_DIR}:/uta-load/logs
    network_mode: host
  mito-extract:
    image: uta-update
    command: sbin/ncbi_process_mito.py NC_012920.1 --output-dir /mito-extract/work | tee /mito-extract/logs/mito.log
    volumes:
      - ${UTA_ETL_WORK_DIR}:/mito-extract/work
      - ${UTA_ETL_LOG_DIR}:/mito-extract/logs
    working_dir: /opt/repos/uta
    network_mode: host
98 changes: 0 additions & 98 deletions etc/scripts/run-uta-build.sh

This file was deleted.

24 changes: 24 additions & 0 deletions sbin/ncbi-download
@@ -1,6 +1,29 @@
#!/usr/bin/env bash

# This script downloads the files needed for a UTA+SeqRepo update into the given directory.
#
# DOWNLOAD_DIR will have the following structure:
#
# ├── gene
# │   └── DATA
# │       ├── GENE_INFO
# │       │   └── Mammalia
# │       │       └── Homo_sapiens.gene_info.gz
# │       └── gene2accession.gz
# ├── genomes
# │   └── refseq
# │       └── vertebrate_mammalian
# │           └── Homo_sapiens
# │               └── all_assembly_versions
# │                   └── GCF_000001405.25_GRCh37.p13
# │                       ├── GCF_000001405.25_GRCh37.p13_genomic.fna.gz
# │                       └── GCF_000001405.25_GRCh37.p13_genomic.gff.gz
# └── refseq
#     └── H_sapiens
#         └── mRNA_Prot
#             ├── human.1.protein.faa.gz
#             ├── human.1.rna.fna.gz
#             └── human.1.rna.gbff.gz

set -e

@@ -26,6 +49,7 @@ do
DOWNLOAD_MODULE="${DOWNLOAD_PATH%%/*}"
DOWNLOAD_SRC="ftp.ncbi.nlm.nih.gov::$DOWNLOAD_PATH"
DOWNLOAD_DST="$DOWNLOAD_DIR/$DOWNLOAD_MODULE"
mkdir -p "$DOWNLOAD_DST"
echo "Downloading $DOWNLOAD_SRC to $DOWNLOAD_DST"
rsync --no-motd -DHPRprtv "$DOWNLOAD_SRC" "$DOWNLOAD_DST"
done
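
The loop body derives the rsync module and the local destination from each download path using shell parameter expansion. The standalone sketch below traces that derivation for a single path. The values of `DOWNLOAD_DIR` and `DOWNLOAD_PATH` are assumptions chosen to match the directory tree in the header comment; the script's real path list is not visible in this hunk.

```shell
# Assumed example values; the real list is defined earlier in sbin/ncbi-download.
DOWNLOAD_DIR="/ncbi-dir"
DOWNLOAD_PATH="gene/DATA/gene2accession.gz"

# %%/* removes the longest suffix matching "/*", leaving the first path component.
DOWNLOAD_MODULE="${DOWNLOAD_PATH%%/*}"
# host::module/path is rsync's daemon-mode source syntax.
DOWNLOAD_SRC="ftp.ncbi.nlm.nih.gov::$DOWNLOAD_PATH"
DOWNLOAD_DST="$DOWNLOAD_DIR/$DOWNLOAD_MODULE"

echo "$DOWNLOAD_MODULE"   # gene
echo "$DOWNLOAD_SRC"      # ftp.ncbi.nlm.nih.gov::gene/DATA/gene2accession.gz
echo "$DOWNLOAD_DST"      # /ncbi-dir/gene
```

Because the destination is a per-module directory that may not exist yet, the `mkdir -p` added in this hunk runs before `rsync` is invoked.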
24 changes: 0 additions & 24 deletions sbin/ncbi-download-docker

This file was deleted.
