feat(IPVC-2440/IPVC-2442): specify input and output schema names, move one time workflows #38

Merged
merged 4 commits into from
May 15, 2024


@bsgiles73 bsgiles73 commented May 14, 2024

IPVC-2440 and IPVC-2442 are grouped together because they were easier to test as a single change.

  • IPVC-2440: allows users to run multiple workflows in sequence without building a Docker image and then rebuilding a database from it when they already have that database. The idea is that you pull the latest UTA version from biocommons; then, depending on the build, you end up with multiple database versions, each one being the output of one workflow and the input of the next. For instance:

uta_20210129b -> gene-update -> uta_20210129c -> mito-extract -> uta_20240514
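
The chain above can be sketched as a dry run that prints the input and output schema for each step. This is only an illustration: `plan_chain` is a hypothetical helper, not part of this PR, and it prints the pairs without invoking docker compose. In the test commands further down, the input and output correspond to the values passed as UTA_ETL_OLD_UTA_VERSION and UTA_ETL_NEW_UTA_VERSION.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): print the input/output schema
# pair for each step in a chain of workflows.
# Arguments alternate: version workflow version workflow ... version
plan_chain() {
    old="$1"
    shift
    while [ "$#" -ge 2 ]; do
        workflow="$1"
        new="$2"
        shift 2
        echo "${workflow}: ${old} -> ${new}"
        # The output of this step becomes the input of the next one.
        old="$new"
    done
}

plan_chain uta_20210129b gene-update uta_20210129c mito-extract uta_20240514
```

Each printed line names one workflow together with the schema it starts from and the schema it is expected to leave behind.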

  • IPVC-2442: reorganizes the docker-compose services. The top-level docker compose file keeps the workflows expected to be used for every UTA build (ncbi-download, uta-extract, seqrepo-load, uta-load, and splign-manual). The others are moved out and can be called using a docker compose override file.
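
As a rough sketch of the override pattern, the snippet below builds the command line for a workflow that lives in an override file. `override_cmd` is a hypothetical helper, not part of this PR; it only prints the command. The two override paths are the ones used in the test commands in this description.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): build the docker compose command
# for a one-time workflow defined in an override file.
# $1 = override file path, $2 = service name
override_cmd() {
    # Passing -f twice layers the override file on top of the base file.
    echo "docker compose -f docker-compose.yml -f $1 run $2"
}

override_cmd misc/gene-update/docker-compose-gene-update.yml uta-gene-update
override_cmd misc/mito-transcripts/docker-compose-mito-extract.yml mito-extract
```

Workflows kept in the top-level file need no extra `-f` flags, as in `docker compose run uta-extract`.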

Included in this PR:

  • Introduce two new runtime environment variables.
    -- UTA_ETL_OLD_UTA_IMAGE_TAG: used to pull and run the UTA Postgres image from the latest UTA release
    -- UTA_ETL_NEW_UTA_VERSION: used to rename and dump the new UTA database schema after the workflow
  • Update docker-compose and the readme to add the new variables.
  • Move one-time workflows to docker-compose override files (uta-gene-update, mito-extract, and uta-extract-historical).
  • Add a pg_dump command that runs after the database update (upgrade-uta-schema.sh and uta-load).
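
The rename-and-dump step can be sketched as a dry run that prints the commands for a given schema version. `dump_steps` is a hypothetical helper, not part of this PR. The psql and pg_dump invocations mirror the trace at the end of the uta-load test output, but the dump destination is an assumption because the trace does not show where the gzip output is written.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of this PR): print the rename-and-dump
# steps for a new schema version instead of executing them.
# $1 = new schema version
# $2 = dump file path (assumed; the destination is not shown in the trace)
dump_steps() {
    new="$1"
    out="$2"
    echo "psql -h localhost -U uta_admin -d uta -c 'DROP SCHEMA IF EXISTS ${new} CASCADE;'"
    echo "psql -h localhost -U uta_admin -d uta -c 'ALTER SCHEMA uta RENAME TO ${new}'"
    echo "pg_dump -h localhost -U uta_admin -d uta -n ${new} | gzip -c > ${out}"
}

# The output path below is a placeholder for illustration only.
dump_steps uta_20240514 /tmp/example-dump.pgd.gz
```

Dropping any existing schema with the target name first lets the rename succeed when the same version is rebuilt.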

To test:

I used the chr22 test data set to run the following commands.

Gene Update:

sgiles-MD6M:uta shane.giles$ UTA_ETL_OLD_UTA_VERSION=uta_20210129b UTA_ETL_NEW_UTA_VERSION=uta_20210129c docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update
[+] Creating 1/0
 ✔ Container uta  Running  0.0s
+ source_uta_v=uta_20210129b
+ working_uta_v=uta
+ dest_uta_v=uta_20210129c
+ tmp_dumps_dir=/tmp/dumps
+ mkdir -p /tmp/dumps
...

Mito Extract:

sgiles-MD6M:uta shane.giles$ docker compose -f docker-compose.yml -f misc/mito-transcripts/docker-compose-mito-extract.yml run mito-extract
2024-05-14 20:17:19 INFO     [__main__] downloading files for NC_012920.1
2024-05-14 20:17:19 INFO     [__main__] downloading gb file to /mito-extract/work/NC_012920.1.gbff
2024-05-14 20:17:22 INFO     [__main__] downloading fasta file to /mito-extract/work/NC_012920.1.fna
2024-05-14 20:17:24 INFO     [__main__] processing NCBI GBFF file from /mito-extract/work/NC_012920.1.gbff
2024-05-14 20:17:24 INFO     [__main__] processing NCBI GBFF file from /mito-extract/work/NC_012920.1.fna
2024-05-14 20:17:24 INFO     [__main__] found 37 genes from parsing /mito-extract/work/NC_012920.1.gbff

chr22 Test: uta-extract and uta-load

sgiles-MD6M:uta shane.giles$ docker compose run uta-extract
2024-05-14 20:22:03 INFO     [__main__] opened /ncbi-dir/refseq/H_sapiens/mRNA_Prot/human.test.rna.gbff.gz
/usr/local/lib/python3.10/dist-packages/Bio/GenBank/Scanner.py:1217: BiopythonParserWarning: Premature end of file in sequence data
  warnings.warn(
2024-05-14 20:24:14 INFO     [__main__] 642 genes in /ncbi-dir/refseq/H_sapiens/mRNA_Prot/human.test.rna.gbff.gz (Counter({'NM': 1384, 'NR': 380}))
2024-05-14 20:24:14 INFO     [__main__] 642 genes in 1 files (Counter({'NM': 1384, 'NR': 380}))
2024-05-14 20:24:15 INFO     [__main__] read 1778 transcript alignments from file(s): /ncbi-dir/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
2024-05-14 20:24:16 INFO     [__main__] Filtered out exon sets for 14 transcript(s)
sgiles-MD6M:uta shane.giles$ UTA_ETL_OLD_UTA_VERSION=uta_20210129c UTA_ETL_NEW_UTA_VERSION=uta_20240514 docker compose run uta-load
[+] Creating 1/0
 ✔ Container uta  Running  0.0s
+ source_uta_v=uta_20210129c
+ dest_uta_v=uta_20240514
+ ncbi_dir=/ncbi-dir
+ working_dir=/uta-load/work
+ log_dir=/uta-load/logs
...
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   | nu2  |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
| associated_accessions | 6.9  |  265035 |  265195 |  0  |  265035 | 160  |              tx_ac,pro_ac,origin               |
|          exon         | 41.7 | 8310936 | 8313485 |  0  | 8310936 | 2549 |                       *                        |
|        exon_aln       | 34.5 | 5604190 | 5605456 |  0  | 5604190 | 1266 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 5.7  |  894082 |  894385 |  0  |  894082 | 303  |                       *                        |
|          gene         | 0.4  |  64055  |  64063  |  0  |  64055  |  8   |                    gene_id                     |
|          meta         | 0.0  |    4    |    4    |  0  |    4    |  0   |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |  0   |                       *                        |
|          seq          | 21.7 |  340384 |  340539 |  0  |  340384 | 155  |                       *                        |
|        seq_anno       | 2.2  |  360063 |  360220 |  0  |  360063 | 157  |     seq_anno_id,seq_id,origin_id,ac,added      |
|       transcript      | 11.2 |  314227 |  314380 |  0  |  314227 | 153  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
+ psql -h localhost -U uta_admin -d uta -c 'DROP SCHEMA IF EXISTS uta_20240514 CASCADE;'
NOTICE:  schema "uta_20240514" does not exist, skipping
DROP SCHEMA
+ psql -h localhost -U uta_admin -d uta -c 'ALTER SCHEMA uta RENAME TO uta_20240514'
ALTER SCHEMA
+ pg_dump -h localhost -U uta_admin -d uta -n uta_20240514
+ gzip -c

@bsgiles73 bsgiles73 changed the title feat(IPVC-2440): add in separate variables for specifying input and o… feat(IPVC-2440/IPVC-2442): add in separate variables for specifying input and o… May 14, 2024
@bsgiles73 bsgiles73 changed the title feat(IPVC-2440/IPVC-2442): add in separate variables for specifying input and o… feat(IPVC-2440/IPVC-2442): specify input and output schema, move one time workflows May 14, 2024
@bsgiles73 bsgiles73 marked this pull request as ready for review May 14, 2024 20:35
@bsgiles73 bsgiles73 requested review from sptaylor and nvta1209 May 14, 2024 20:36
@bsgiles73 bsgiles73 changed the title feat(IPVC-2440/IPVC-2442): specify input and output schema, move one time workflows feat(IPVC-2440/IPVC-2442): specify input and output schema names, move one time workflows May 14, 2024
@bsgiles73 bsgiles73 merged commit 9e2a927 into main May 15, 2024
1 check passed
@bsgiles73 bsgiles73 deleted the IPVC-2440-specify-input-and-output-schema branch May 15, 2024 15:01