feat(IPVC-2283): mitochondrial transcript workflow #20

Merged
merged 38 commits into from
Apr 9, 2024
Changes from 36 commits
33638c4
add shebang and comment to mito script, and make it executable
nvta1209 Apr 4, 2024
59323c7
move ncbi parsing scripts into uta-extract
nvta1209 Apr 5, 2024
6297bac
produce gzip files from mito script
nvta1209 Apr 5, 2024
aa97fe6
move seqrepo load into its own script
nvta1209 Apr 5, 2024
46d8ac3
copy fasta files into the loading dir
nvta1209 Apr 5, 2024
17168b9
remove unneeded seqrepo version input
nvta1209 Apr 5, 2024
894ef4d
rename uta-update uta-load, in line with seqrepo-load and extract-tra…
nvta1209 Apr 5, 2024
89f8b3b
simplify readme
nvta1209 Apr 5, 2024
6bd387a
allow seqrepo to be modified
nvta1209 Apr 5, 2024
ede145e
restructure readme
nvta1209 Apr 5, 2024
44054fa
change dirs in readme
nvta1209 Apr 5, 2024
cc32915
be explicit about all dirs
nvta1209 Apr 5, 2024
df62d46
mkdir needs to happen in both nuclear and mito paths
nvta1209 Apr 5, 2024
f6e5032
consistent dir name
nvta1209 Apr 5, 2024
2361726
mito: strand should be an int
nvta1209 Apr 5, 2024
cf46cc0
reanme uta loading script
nvta1209 Apr 5, 2024
329fdc4
remove docker wrapper script for download
nvta1209 Apr 5, 2024
bcfbf38
create compose service for uta-extract
nvta1209 Apr 5, 2024
a3dcbcf
create compose service for mito-extract
nvta1209 Apr 5, 2024
d327c0a
create compose service for seqrepo-load
nvta1209 Apr 5, 2024
540b4dd
clean up uta-load command
nvta1209 Apr 5, 2024
edd2a09
delete docker wrapper script for uta-load
nvta1209 Apr 5, 2024
844b127
remove docker wraper script for sr download
nvta1209 Apr 5, 2024
d63f20f
remove current dir mount
nvta1209 Apr 5, 2024
f096f42
clean up readme
nvta1209 Apr 5, 2024
0300a0e
Merge branch 'main' into IPVC-2283-mito-workflow
nvta1209 Apr 5, 2024
66ac44f
skip gene load for mito
nvta1209 Apr 5, 2024
49a2ce8
set -e on uta-load
nvta1209 Apr 5, 2024
f6e9d9d
fix naming
nvta1209 Apr 5, 2024
3780859
move step that requires seqrepo out of extract script, so that seqrep…
nvta1209 Apr 5, 2024
e4bebf5
move uta-load script into sbin
nvta1209 Apr 5, 2024
50dbeb8
always set skip_load_genes
nvta1209 Apr 5, 2024
b7e0f42
restore missed changes from alembic pr
nvta1209 Apr 5, 2024
608deb9
consistent naming of log_dir var
nvta1209 Apr 5, 2024
e32add9
invert condition
nvta1209 Apr 5, 2024
10f4e0e
fix tests for mito strand change
nvta1209 Apr 5, 2024
858b143
hard links
nvta1209 Apr 9, 2024
a5f1fd2
Merge branch 'main' into IPVC-2283-mito-workflow
nvta1209 Apr 9, 2024
92 changes: 54 additions & 38 deletions README.md
@@ -289,56 +289,72 @@ To develop UTA, follow these steps.
4. Testing

$ docker build --target uta-test -t uta-test .
$ docker run -it --rm uta-test python -m unittest
$ docker run --rm uta-test python -m unittest

## UTA update procedure

### 1. Download files from NCBI
Requires docker.

Run `sbin/ncbi-download-docker`. Requires bash and docker.
### 0. Setup

Example:
Make directories:
```
sbin/ncbi-download-docker $(pwd)/ncbi-data
mkdir -p $(pwd)/ncbi-data
mkdir -p $(pwd)/output/artifacts
mkdir -p $(pwd)/output/logs
```

The specified directory will have the following structure:

├── gene
│   └── DATA
│       ├── GENE_INFO
│       │   └── Mammalia
│       │       └── Homo_sapiens.gene_info.gz
│       └── gene2accession.gz
├── genomes
│   └── refseq
│       └── vertebrate_mammalian
│           └── Homo_sapiens
│               └── all_assembly_versions
│                   └── GCF_000001405.25_GRCh37.p13
│                       ├── GCF_000001405.25_GRCh37.p13_genomic.fna.gz
│                       └── GCF_000001405.25_GRCh37.p13_genomic.gff.gz
└── refseq
    └── H_sapiens
        └── mRNA_Prot
            ├── human.1.protein.faa.gz
            ├── human.1.rna.fna.gz
            └── human.1.rna.gbff.gz

### 2. Download SeqRepo data

Run `sbin/seqrepo-download`. Requires bash and docker.

Example:
Set variables:
```
sbin/seqrepo-download 2024-02-20 $(pwd)/seqrepo-data
export UTA_ETL_OLD_SEQREPO_VERSION=2024-02-20
export UTA_ETL_OLD_UTA_VERSION=uta_20210129b
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_SEQREPO_DIR=./seqrepo-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
```
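
The compose services in this PR read these `UTA_ETL_*` variables for their volume mappings, so an unset variable would leave a mapping empty. A minimal pre-flight sketch, assuming a POSIX shell; the function name `check_uta_etl_env` is hypothetical and not part of this PR:

```shell
# Hypothetical helper (not in the PR): report which UTA_ETL_* variables
# listed in the README are unset or empty.
check_uta_etl_env() {
    missing=0
    for name in UTA_ETL_OLD_SEQREPO_VERSION UTA_ETL_OLD_UTA_VERSION \
                UTA_ETL_NCBI_DIR UTA_ETL_SEQREPO_DIR \
                UTA_ETL_WORK_DIR UTA_ETL_LOG_DIR; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be chained in front of a step, for example `check_uta_etl_env && docker compose run uta-extract`.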

### 3. Update UTA and SeqRepo
Build the UTA image:
```
docker build --target uta -t uta-update .
```

### 1. Download SeqRepo data
```
docker pull biocommons/seqrepo:$UTA_ETL_OLD_SEQREPO_VERSION

Run `sbin/uta-update`. Requires bash and docker.
# download seqrepo. can skip if container already exists.
docker run --name seqrepo biocommons/seqrepo:$UTA_ETL_OLD_SEQREPO_VERSION

Example:
# copy seqrepo data into a local directory
docker run -v $UTA_ETL_SEQREPO_DIR:/output-dir --volumes-from seqrepo ubuntu bash -c 'cp -R /usr/local/share/seqrepo/* /output-dir'

# allow seqrepo to be modified
docker run -it -v $UTA_ETL_SEQREPO_DIR:/output-dir ubuntu bash -c 'chmod -R +w /output-dir'
```
sbin/uta-update $(pwd)/ncbi-data $(pwd)/seqrepo-data $(pwd)/uta-build uta_20210129b 2024-02-20

Note: pulling data takes ~30 minutes and requires ~13 GB.
Note: a container called seqrepo will be left behind.

### 2. Extract and transform data from NCBI

Download files from NCBI, extract into intermediate files, and load into UTA and SeqRepo.

See 2A for nuclear transcripts and 2B for mitochondrial transcripts.

#### 2A. Nuclear transcripts
```
docker compose run ncbi-download
docker compose run uta-extract
docker compose run seqrepo-load
UTA_ETL_SKIP_GENE_LOAD=false docker compose run uta-load
```

#### 2B. Mitochondrial transcripts
```
docker compose run mito-extract
docker compose run seqrepo-load
UTA_ETL_SKIP_GENE_LOAD=true docker compose run uta-load
```

UTA has been updated and the database has been dumped into a pgd file in `UTA_ETL_WORK_DIR`. SeqRepo has been updated in place.
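
The two paths in step 2 differ only in their extract step(s) and in the value of `UTA_ETL_SKIP_GENE_LOAD`. The following dry-run sketch prints the commands for each path in README order; the function name `uta_etl_steps` is hypothetical and the function only echoes the commands, it does not run Docker:

```shell
# Hypothetical dry-run helper (not in the PR): print the compose commands
# for one workflow, in the order given in the README.
uta_etl_steps() {
    case "$1" in
        nuclear)
            echo "docker compose run ncbi-download"
            echo "docker compose run uta-extract"
            echo "docker compose run seqrepo-load"
            echo "UTA_ETL_SKIP_GENE_LOAD=false docker compose run uta-load"
            ;;
        mito)
            echo "docker compose run mito-extract"
            echo "docker compose run seqrepo-load"
            echo "UTA_ETL_SKIP_GENE_LOAD=true docker compose run uta-load"
            ;;
        *)
            echo "usage: uta_etl_steps nuclear|mito" >&2
            return 2
            ;;
    esac
}

uta_etl_steps mito
```

Printing rather than executing keeps the sketch safe to run on a machine without Docker.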
50 changes: 43 additions & 7 deletions docker-compose.yml
@@ -3,24 +3,60 @@
version: '3'

services:
  ncbi-download:
    image: uta-update
    command: sbin/ncbi-download /ncbi-dir
    volumes:
      - .:/opt/repos/uta
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
    working_dir: /opt/repos/uta
    network_mode: host
  uta-extract:
    image: uta-update
    command: sbin/uta-extract /ncbi-dir /uta-extract/work /uta-extract/logs
    volumes:
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-extract/work
      - ${UTA_ETL_LOG_DIR}:/uta-extract/logs
    working_dir: /opt/repos/uta
    network_mode: host
  seqrepo-load:
    image: uta-update
    command: sbin/seqrepo-load /usr/local/share/seqrepo 2024-02-20 /seqrepo-load/work /seqrepo-load/logs
    volumes:
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/seqrepo-load/work
      - ${UTA_ETL_LOG_DIR}:/seqrepo-load/logs
    working_dir: /opt/repos/uta
    network_mode: host
  uta:
    container_name: uta
    image: biocommons/uta:${UTA_VERSION}
    image: biocommons/uta:${UTA_ETL_OLD_UTA_VERSION}
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
    healthcheck:
      test: psql -h localhost -U anonymous -d uta -c "select * from ${UTA_VERSION}.meta"
      test: psql -h localhost -U anonymous -d uta -c "select * from ${UTA_ETL_OLD_UTA_VERSION}.meta"
      interval: 10s
      retries: 60
    network_mode: host
  uta-update:
  uta-load:
    image: uta-update
    command: etc/scripts/run-uta-build.sh ${UTA_VERSION} ${SEQREPO_VERSION} /ncbi-dir /workdir
    command: sbin/uta-load ${UTA_ETL_OLD_UTA_VERSION} /ncbi-dir /uta-load/work /uta-load/logs ${UTA_ETL_SKIP_GENE_LOAD}
    depends_on:
      uta:
        condition: service_healthy
    volumes:
      - ${NCBI_DIR}:/ncbi-dir
      - ${SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${WORKING_DIR}:/workdir
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-load/work
      - ${UTA_ETL_LOG_DIR}:/uta-load/logs
    network_mode: host
  mito-extract:
    image: uta-update
    command: sbin/ncbi_process_mito.py NC_012920.1 --output-dir /mito-extract/work | tee /mito-extract/logs/mito.log
    volumes:
      - ${UTA_ETL_WORK_DIR}:/mito-extract/work
      - ${UTA_ETL_LOG_DIR}:/mito-extract/logs
    working_dir: /opt/repos/uta
    network_mode: host
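
The `mito-extract` command contains a `|`. Compose hands a string `command` to the container as an argument list rather than through a shell, so unless the `uta-update` image's entrypoint is itself a shell (something this diff does not show), the pipe and `tee` would likely reach the script as literal arguments. A small sketch of the difference, where `print_args` is a hypothetical stand-in for the started program:

```shell
# Hypothetical stand-in for a program started without a shell:
# it prints each argument it receives on its own line, in brackets.
print_args() { printf '[%s]\n' "$@"; }

# Argument-list form: the pipe character arrives as a plain argument.
exec_form_output=$(print_args NC_012920.1 '|' tee mito.log)

# Shell form: sh parses the pipe, so tee really receives the output.
shell_form_output=$(sh -c 'echo NC_012920.1 | tee /dev/null')

echo "$exec_form_output"
echo "$shell_form_output"
```

If that assumption holds for this image, wrapping the command in `bash -c '...'` is one common way to get the pipe interpreted.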
98 changes: 0 additions & 98 deletions etc/scripts/run-uta-build.sh

This file was deleted.

24 changes: 24 additions & 0 deletions sbin/ncbi-download
@@ -1,6 +1,29 @@
#!/usr/bin/env bash

# This script downloads the files needed for a UTA+SeqRepo update into the given directory.
#
# DOWNLOAD_DIR will have the following structure:
#
# ├── gene
# │   └── DATA
# │       ├── GENE_INFO
# │       │   └── Mammalia
# │       │       └── Homo_sapiens.gene_info.gz
# │       └── gene2accession.gz
# ├── genomes
# │   └── refseq
# │       └── vertebrate_mammalian
# │           └── Homo_sapiens
# │               └── all_assembly_versions
# │                   └── GCF_000001405.25_GRCh37.p13
# │                       ├── GCF_000001405.25_GRCh37.p13_genomic.fna.gz
# │                       └── GCF_000001405.25_GRCh37.p13_genomic.gff.gz
# └── refseq
#     └── H_sapiens
#         └── mRNA_Prot
#             ├── human.1.protein.faa.gz
#             ├── human.1.rna.fna.gz
#             └── human.1.rna.gbff.gz

set -e

@@ -26,6 +49,7 @@ do
DOWNLOAD_MODULE="${DOWNLOAD_PATH%%/*}"
DOWNLOAD_SRC="ftp.ncbi.nlm.nih.gov::$DOWNLOAD_PATH"
DOWNLOAD_DST="$DOWNLOAD_DIR/$DOWNLOAD_MODULE"
mkdir -p "$DOWNLOAD_DST"
echo "Downloading $DOWNLOAD_SRC to $DOWNLOAD_DST"
rsync --no-motd -DHPRprtv "$DOWNLOAD_SRC" "$DOWNLOAD_DST"
done
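
The loop body derives the destination directory from the first component of each download path. The sketch below walks through that derivation for a single path. The value of `DOWNLOAD_PATH` is an illustrative assumption taken from the directory tree in the header comment, since the script's actual path list is not visible in this diff, and `/ncbi-dir` is the mount point used by the compose service:

```shell
# Illustrative inputs (assumptions, see the note above).
DOWNLOAD_DIR=/ncbi-dir
DOWNLOAD_PATH="gene/DATA/gene2accession.gz"

# %%/* removes the longest suffix matching "/*", leaving the first path component.
DOWNLOAD_MODULE="${DOWNLOAD_PATH%%/*}"
DOWNLOAD_SRC="ftp.ncbi.nlm.nih.gov::$DOWNLOAD_PATH"
DOWNLOAD_DST="$DOWNLOAD_DIR/$DOWNLOAD_MODULE"

echo "$DOWNLOAD_MODULE"
echo "$DOWNLOAD_SRC"
echo "$DOWNLOAD_DST"
```

In rsync's `host::module/path` syntax the first path component names the daemon module, which is why the script uses it as the top-level directory under `DOWNLOAD_DIR`.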
24 changes: 0 additions & 24 deletions sbin/ncbi-download-docker

This file was deleted.
