feat(IPVC-2264) add gene_id to UTA models #24

Merged
merged 11 commits into from
Apr 18, 2024

@bsgiles73 bsgiles73 commented Apr 15, 2024

This PR:

  • Updates the SQLAlchemy models
    -- Add gene_id, gene_symbol, type, and xrefs to the UTA gene model
    -- Add gene_id to the transcript table
    -- Backfill gene_id and gene symbol values from the output of IPVC-2266
    -- Set the primary key of gene to gene_id
    -- Add a foreign key from transcript.gene_id to gene
  • Supplies schema migrations and a plan for the update
  • Updates UTA views as needed
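
The relationship described in the list above (gene keyed by gene_id, transcript referencing it) can be illustrated with a small stand-alone sketch. It uses Python's built-in sqlite3 rather than the PostgreSQL schema and SQLAlchemy models this PR changes, and the table definitions are trimmed to a few columns, so treat it as an illustration of the constraint only, not the actual model code.

```python
import sqlite3

# Simplified stand-ins for the UTA tables. The real schema lives in PostgreSQL
# and has more columns; this DDL is an assumption made for illustration.
SCHEMA = """
CREATE TABLE gene (
    gene_id TEXT PRIMARY KEY,
    hgnc    TEXT NOT NULL,
    symbol  TEXT,
    type    TEXT,
    xrefs   TEXT
);
CREATE TABLE transcript (
    ac      TEXT PRIMARY KEY,
    gene_id TEXT NOT NULL REFERENCES gene (gene_id)
);
"""


def make_db():
    """Create an in-memory database with foreign-key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def add_transcript(conn, ac, gene_id):
    """Insert a transcript; raises sqlite3.IntegrityError for an unknown gene_id."""
    conn.execute(
        "INSERT INTO transcript (ac, gene_id) VALUES (?, ?)", (ac, gene_id)
    )
```

With the foreign key in place, a transcript row can only be inserted once its gene row exists, which is why the backfill has to finish before the constraint is added.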

The gene_id update uses two Alembic migrations with a backfill step between them. The first migration adds the new columns as nullable, a script then backfills the gene_id values, and the second migration updates nullability, primary keys, foreign keys, and the affected views. These steps are run by misc/gene-update/upgrade-uta-schema.sh.
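
The ordering described above (add nullable column, backfill, then tighten constraints) can be sketched as follows. This is not the Alembic migration from the PR: it uses sqlite3 so it runs anywhere, the starting table layout and the `gene_id_by_hgnc` mapping are hypothetical, and because SQLite cannot change a primary key in place the second stage rebuilds the table instead.

```python
import sqlite3


def upgrade_gene_table(conn, gene_id_by_hgnc):
    """Move a gene table keyed by hgnc to one keyed by gene_id, in stages."""
    # Stage 1 (first migration): add the column as nullable so existing rows stay valid.
    conn.execute("ALTER TABLE gene ADD COLUMN gene_id TEXT")

    # Backfill step: populate gene_id from an external mapping (hypothetical input).
    conn.executemany(
        "UPDATE gene SET gene_id = ? WHERE hgnc = ?",
        [(gene_id, hgnc) for hgnc, gene_id in gene_id_by_hgnc.items()],
    )

    # Refuse to tighten constraints while any row is still unfilled.
    missing = conn.execute(
        "SELECT COUNT(*) FROM gene WHERE gene_id IS NULL"
    ).fetchone()[0]
    if missing:
        conn.rollback()  # undoes the partial backfill; the nullable column remains
        raise RuntimeError(f"{missing} gene rows have no gene_id; backfill incomplete")

    # Stage 2 (second migration): make gene_id the primary key.
    # SQLite cannot alter a primary key in place, so the table is rebuilt here.
    conn.execute("CREATE TABLE gene_new (gene_id TEXT PRIMARY KEY, hgnc TEXT NOT NULL)")
    conn.execute("INSERT INTO gene_new (gene_id, hgnc) SELECT gene_id, hgnc FROM gene")
    conn.execute("DROP TABLE gene")
    conn.execute("ALTER TABLE gene_new RENAME TO gene")
    conn.commit()
```

Splitting the work this way means the constraint change only happens after the check for unfilled rows has passed.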

This update cannot use the existing docker-compose setup without a modification, because we need to start from an updated UTA DB schema. The current YAML requires a running UTA database with the base version uta_20210129b. Commenting out a few lines in the YAML and using the updated schema name makes it work.

The following steps start from the current UTA version and run the schema upgrade script.

## in shell #1
docker run --rm -e POSTGRES_PASSWORD=postgres -v /tmp:/tmp -v uta_vol:/var/lib/postgresql/data --name uta --network=host biocommons/uta:uta_20210129b

Once the database is ready you can run the following in a separate shell.

docker build --target uta -t uta-update .
docker run -it --rm --name uta-backfill --volume $(pwd)/misc:/opt/repos/uta/misc --volume $(pwd)/output:/workdir --network=host uta-update:latest
bash misc/gene-update/upgrade-uta-schema.sh uta_20210129c
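
The "once the database is ready" step above can be checked from a script rather than by watching the container log. The helper below is a minimal sketch: `wait_for_port` is a hypothetical name, and it assumes the container listens on PostgreSQL's default port 5432 on localhost. An open port only shows that the server accepts TCP connections; it does not prove the initial data load has finished.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 5432, timeout=300)` could gate the upgrade script in the second shell.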

Lines 46-48 of docker-compose.yml (the depends_on block shown commented out below) need to be commented out.

  uta-load:
    image: uta-update
    command: sbin/uta-load ${UTA_ETL_OLD_UTA_VERSION} /ncbi-dir /uta-load/work /uta-load/logs ${UTA_ETL_SKIP_GENE_LOAD}
#    depends_on:
#      uta:
#        condition: service_healthy
    volumes:
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-load/work
      - ${UTA_ETL_LOG_DIR}:/uta-load/logs
    network_mode: host
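
If the edit above needs to be scripted, for example in a throwaway test environment, a small helper can comment out a line range. This is only a sketch: `comment_out_lines` is a hypothetical helper, not part of the repository, and it trusts the caller to pass the right line numbers for their copy of the file.

```python
def comment_out_lines(text, first, last):
    """Prefix 1-based lines first..last (inclusive) with '#'.

    Lines that are already comments are left alone, so the call is idempotent.
    """
    lines = text.splitlines(keepends=True)
    for i in range(first - 1, min(last, len(lines))):
        if not lines[i].lstrip().startswith("#"):
            lines[i] = "#" + lines[i]
    return "".join(lines)
```

Placing the `#` in the first column keeps the original indentation visible, which matches the commented lines shown above.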

The uta-load workflow outlined in the README will now work.

docker compose run ncbi-download   ## or use existing test files for chr 22.
docker compose run uta-extract
docker compose run seqrepo-load
UTA_ETL_SKIP_GENE_LOAD=false docker compose run uta-load

The diff output below shows that the same numbers of sequences, genes, transcripts, exons, and alignments were added as in a run without the gene_id update.

+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   | nu2  |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
| associated_accessions | 6.8  |  265035 |  265195 |  0  |  265035 | 160  |              tx_ac,pro_ac,origin               |
|          exon         | 39.5 | 8310936 | 8313646 |  0  | 8310936 | 2710 |                       *                        |
|        exon_aln       | 29.9 | 5604190 | 5605587 |  0  | 5604190 | 1397 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 5.7  |  894082 |  894408 |  9  |  894073 | 335  |                       *                        |
|          gene         | 0.4  |  64055  |  64063  |  0  |  64055  |  8   |                    gene_id                     |
|          meta         | 0.0  |    4    |    4    |  0  |    4    |  0   |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |  0   |                       *                        |
|          seq          | 19.0 |  340384 |  340535 |  0  |  340384 | 151  |                       *                        |
|        seq_anno       | 2.3  |  360063 |  360216 |  0  |  360063 | 153  |                       *                        |
|       transcript      | 9.9  |  314227 |  314384 |  0  |  314227 | 157  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
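
The count columns in the table above are consistent with a set comparison of the two schemas on the listed columns: in every row n1 = nu1 + nc and n2 = nc + nu2. That reading is inferred from the numbers, not stated in the PR. Under that assumption, the counts for one table can be reproduced with a sketch like this, where `table_diff_counts` is a hypothetical helper and each row is a tuple of the compared columns (the `t` column is not modelled).

```python
def table_diff_counts(rows1, rows2):
    """Compare two collections of row tuples and return the diff counts.

    n1/n2: distinct rows in each input, nc: rows common to both,
    nu1/nu2: rows found only in the first or only in the second input.
    """
    s1, s2 = set(rows1), set(rows2)
    common = s1 & s2
    return {
        "n1": len(s1),
        "n2": len(s2),
        "nu1": len(s1 - common),
        "nc": len(common),
        "nu2": len(s2 - common),
    }
```

A load that only adds rows should give nu1 = 0, as most rows of the table do; the exon_set row with nu1 = 9 is the one place where rows from the first schema are missing from the second.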

The new columns can be seen in the gene and transcript tables:

select gene_id, symbol, hgnc, aliases, type, summary, descr, xrefs, added
from uta.gene as g
where g.gene_id in ('410','421','427');

+-------+------+-----+-------------------------------+--------------+----------------------------------+----------------------------------+--------------------------------------------------------------------------+--------------------------+
|gene_id|symbol|hgnc |aliases                        |type          |summary                           |descr                             |xrefs                                                                     |added                     |
+-------+------+-----+-------------------------------+--------------+----------------------------------+----------------------------------+--------------------------------------------------------------------------+--------------------------+
|410    |ARSA  |ARSA |{ASA,MLD}                      |protein-coding|arylsulfatase A                   |arylsulfatase A                   |{MIM:607574,HGNC:HGNC:713,Ensembl:ENSG00000100299,AllianceGenome:HGNC:713}|2014-02-10 22:59:21.153414|
|421    |ARVCF |ARVCF|{-}                            |protein-coding|ARVCF delta catenin family member |ARVCF delta catenin family member |{MIM:602269,HGNC:HGNC:728,Ensembl:ENSG00000099889,AllianceGenome:HGNC:728}|2014-02-10 22:59:21.153414|
|427    |ASAH1 |ASAH1|AC,ACDase,ASAH,PHP,PHP32,SMAPME|null          |N-acylsphingosine amidohydrolase 1|N-acylsphingosine amidohydrolase 1|null                                                                      |2014-02-10 22:59:21.153414|
+-------+------+-----+-------------------------------+--------------+----------------------------------+----------------------------------+--------------------------------------------------------------------------+--------------------------+
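
The xrefs values in the output above are lists of `source:identifier` entries, where the identifier can itself contain a colon (for example `HGNC:HGNC:713`). If the column arrives as text in the form shown, it could be split with a sketch like the one below. `parse_xrefs` is a hypothetical helper; it assumes simple unquoted entries and keeps only the last entry per source. A database driver may already return the column as a list, in which case only the per-entry split applies.

```python
def parse_xrefs(value):
    """Turn '{MIM:607574,HGNC:HGNC:713}' into {'MIM': '607574', 'HGNC': 'HGNC:713'}."""
    if value is None:
        return {}
    inner = value.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    xrefs = {}
    for item in inner.split(","):
        item = item.strip()
        if not item or item == "-":  # '-' is used as an empty placeholder
            continue
        # Split on the first colon only, so 'HGNC:HGNC:713' keeps its full identifier.
        source, _, identifier = item.partition(":")
        xrefs[source] = identifier
    return xrefs
```

Rows whose xrefs are still null, such as gene_id 427 above, come back as an empty dict.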


@bsgiles73 bsgiles73 left a comment


@bsgiles73 Please make changes

@@ -336,40 +336,17 @@ def load_geneinfo(session, opts, cf):
     for i_gi, gi in enumerate(gir):
         session.merge(
             usam.Gene(
-                hgnc=gi.hgnc,
+                gene_id=gi.gene_id,
+                hgnc=gi.gene_symbol,

Fix this, it should pull HGNC from the gene parser.
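
One way to read this comment is that each model field should come from the matching field of the parsed record, instead of reusing gene_symbol for hgnc. The sketch below illustrates that mapping only; it is not the fix that was committed. `GeneInfo` is a hypothetical stand-in for the parser's record type, and the `symbol` keyword is assumed from the gene table columns shown earlier in this PR.

```python
from dataclasses import dataclass


@dataclass
class GeneInfo:
    """Hypothetical stand-in for one record yielded by the gene parser."""
    gene_id: str
    gene_symbol: str
    hgnc: str


def gene_kwargs(gi):
    """Map a parsed record to gene model fields, keeping hgnc and symbol separate."""
    return {
        "gene_id": gi.gene_id,
        "hgnc": gi.hgnc,           # taken from the parser's hgnc field
        "symbol": gi.gene_symbol,  # assumed field name on the gene model
    }
```

Keeping the mapping in one small function makes a mix-up between the two symbol fields easy to catch in a unit test.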

… value in intermediate file, and transcript to gene id changes should raise exception

@andreasprlic andreasprlic left a comment


LGTM - we discussed during design review.

@bsgiles73 bsgiles73 merged commit 5432818 into main Apr 18, 2024
1 check passed
@bsgiles73 bsgiles73 deleted the IPVC-2264-add-geneid branch April 18, 2024 21:14