feat(IPVC-2264) add gene_id to UTA models #24

bsgiles73 · 2024-04-15T16:54:12Z

This PR:

Updates to SQLAlchemy models
-- Add gene_id, gene_symbol, type and xrefs to UTA gene model
-- Add gene_id to transcript table
-- Backfill gene id and gene symbol values from output of IPVC-2266
-- Set primary key of gene to “gene_id”
-- Set gene_id from transcript table as having a foreign key relationship to gene
Supply schema migrations and plan for update
Update UTA views as needed

The gene_id update happens in two migration stages with a backfill in between. First, an Alembic migration adds the new columns as nullable. A script then backfills the gene_id values. Finally, a second migration updates nullability, primary keys, foreign keys, and the affected views. These steps are captured in misc/gene-update/upgrade-uta-schema.sh.
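The staged approach can be sketched in miniature as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 instead of PostgreSQL and Alembic, the two-table schema is a toy stand-in for the UTA gene and transcript tables, and the function names are hypothetical rather than taken from this PR.

```python
import sqlite3


def build_demo_db():
    # Toy "before" schema (an assumption): the gene symbol column hgnc is the key.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE gene (hgnc TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE transcript (ac TEXT PRIMARY KEY, hgnc TEXT)")
    conn.execute("INSERT INTO gene (hgnc) VALUES ('ARSA')")
    # Illustrative transcript row, not taken from the PR output.
    conn.execute("INSERT INTO transcript (ac, hgnc) VALUES ('NM_000487.6', 'ARSA')")
    return conn


def stage1_add_nullable_columns(conn):
    # First migration: add gene_id as nullable so existing rows stay valid.
    conn.execute("ALTER TABLE gene ADD COLUMN gene_id TEXT")
    conn.execute("ALTER TABLE transcript ADD COLUMN gene_id TEXT")


def backfill_gene_ids(conn, symbol_to_gene_id):
    # Backfill: fill gene_id from a symbol -> gene_id mapping.
    for symbol, gene_id in symbol_to_gene_id.items():
        conn.execute("UPDATE gene SET gene_id = ? WHERE hgnc = ?", (gene_id, symbol))
        conn.execute("UPDATE transcript SET gene_id = ? WHERE hgnc = ?", (gene_id, symbol))


def stage2_enforce_constraints(conn):
    # Second migration: refuse to tighten constraints while rows are unfilled.
    for table in ("gene", "transcript"):
        missing = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE gene_id IS NULL"
        ).fetchone()[0]
        if missing:
            raise ValueError(f"{table}: {missing} rows still lack gene_id")
    # SQLite cannot change keys in place, so the tables are rebuilt here.
    # PostgreSQL would alter the existing tables instead.
    conn.executescript(
        """
        ALTER TABLE gene RENAME TO gene_old;
        CREATE TABLE gene (gene_id TEXT PRIMARY KEY, hgnc TEXT NOT NULL);
        INSERT INTO gene (gene_id, hgnc) SELECT gene_id, hgnc FROM gene_old;
        DROP TABLE gene_old;
        ALTER TABLE transcript RENAME TO transcript_old;
        CREATE TABLE transcript (
            ac TEXT PRIMARY KEY,
            hgnc TEXT,
            gene_id TEXT NOT NULL REFERENCES gene (gene_id)
        );
        INSERT INTO transcript (ac, hgnc, gene_id)
            SELECT ac, hgnc, gene_id FROM transcript_old;
        DROP TABLE transcript_old;
        """
    )
```

Adding the columns as nullable first means the backfill can run against live rows, and the second stage only tightens constraints once every row has a value.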

These updates cannot use the existing docker-compose setup without a modification, because the load needs to start from an updated UTA DB schema. The current yaml requires a running UTA database at the base version uta_20210129b. Commenting out a few lines in the yaml and using the updated schema name makes it work.

The following steps start from the current UTA version and run the schema upgrade script:

## in shell #1
docker run --rm -e POSTGRES_PASSWORD=postgres -v /tmp:/tmp -v uta_vol:/var/lib/postgresql/data --name uta --network=host biocommons/uta:uta_20210129b

Once the database is ready you can run the following in a separate shell.

docker build --target uta -t uta-update .
docker run -it --rm --name uta-backfill --volume $(pwd)/misc:/opt/repos/uta/misc --volume $(pwd)/output:/workdir --network=host uta-update:latest
bash misc/gene-update/upgrade-uta-schema.sh uta_20210129c

Lines 46-48 of docker-compose.yml (the depends_on block shown below) need to be commented out.

  uta-load:
    image: uta-update
    command: sbin/uta-load ${UTA_ETL_OLD_UTA_VERSION} /ncbi-dir /uta-load/work /uta-load/logs ${UTA_ETL_SKIP_GENE_LOAD}
#    depends_on:
#      uta:
#        condition: service_healthy
    volumes:
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_SEQREPO_DIR}:/usr/local/share/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-load/work
      - ${UTA_ETL_LOG_DIR}:/uta-load/logs
    network_mode: host

The uta-load workflow outlined in the README will now work.

docker compose run ncbi-download   ## or use existing test files for chr 22.
docker compose run uta-extract
docker compose run seqrepo-load
UTA_ETL_SKIP_GENE_LOAD=false docker compose run uta-load

The diff of the output below shows the same number of sequences, genes, transcripts, exons, and alignments added as a run without the gene_id update.

+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   | nu2  |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
| associated_accessions | 6.8  |  265035 |  265195 |  0  |  265035 | 160  |              tx_ac,pro_ac,origin               |
|          exon         | 39.5 | 8310936 | 8313646 |  0  | 8310936 | 2710 |                       *                        |
|        exon_aln       | 29.9 | 5604190 | 5605587 |  0  | 5604190 | 1397 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 5.7  |  894082 |  894408 |  9  |  894073 | 335  |                       *                        |
|          gene         | 0.4  |  64055  |  64063  |  0  |  64055  |  8   |                    gene_id                     |
|          meta         | 0.0  |    4    |    4    |  0  |    4    |  0   |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |  0   |                       *                        |
|          seq          | 19.0 |  340384 |  340535 |  0  |  340384 | 151  |                       *                        |
|        seq_anno       | 2.3  |  360063 |  360216 |  0  |  360063 | 153  |                       *                        |
|       transcript      | 9.9  |  314227 |  314384 |  0  |  314227 | 157  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+------+------------------------------------------------+
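The counts in the table can be sanity-checked with a few lines of Python. This sketch assumes a reading of the columns that the PR does not spell out: n1 and n2 are row counts in the two schemas, nc is the number of rows common to both, and nu1 and nu2 are rows unique to each side. Under that reading, n1 = nc + nu1 and n2 = nc + nu2 should hold for every table. The function name is hypothetical.

```python
# (table, n1, n2, nu1, nc, nu2), copied from the comparison table above.
ROWS = [
    ("associated_accessions", 265035, 265195, 0, 265035, 160),
    ("exon", 8310936, 8313646, 0, 8310936, 2710),
    ("exon_aln", 5604190, 5605587, 0, 5604190, 1397),
    ("exon_set", 894082, 894408, 9, 894073, 335),
    ("gene", 64055, 64063, 0, 64055, 8),
    ("meta", 4, 4, 0, 4, 0),
    ("origin", 6, 6, 0, 6, 0),
    ("seq", 340384, 340535, 0, 340384, 151),
    ("seq_anno", 360063, 360216, 0, 360063, 153),
    ("transcript", 314227, 314384, 0, 314227, 157),
]


def inconsistent_tables(rows):
    # Return the tables whose counts do not add up under the assumed column meanings.
    return [
        table
        for table, n1, n2, nu1, nc, nu2 in rows
        if n1 != nc + nu1 or n2 != nc + nu2
    ]


if __name__ == "__main__":
    print(inconsistent_tables(ROWS))
```

Under these assumptions every row is consistent, and exon_set is the only table with rows unique to the first schema (nu1 = 9).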

The new columns can be seen in the gene and transcript tables:

select gene_id, symbol, hgnc, aliases, type, summary, descr, xrefs, added
from uta.gene as g
where g.gene_id in ('410','421','427');

+-------+------+-----+-------------------------------+--------------+----------------------------------+----------------------------------+--------------------------------------------------------------------------+--------------------------+
|gene_id|symbol|hgnc |aliases                        |type          |summary                           |descr                             |xrefs                                                                     |added                     |
+-------+------+-----+-------------------------------+--------------+----------------------------------+----------------------------------+--------------------------------------------------------------------------+--------------------------+
|410    |ARSA  |ARSA |{ASA,MLD}                      |protein-coding|arylsulfatase A                   |arylsulfatase A                   |{MIM:607574,HGNC:HGNC:713,Ensembl:ENSG00000100299,AllianceGenome:HGNC:713}|2014-02-10 22:59:21.153414|
|421    |ARVCF |ARVCF|{-}                            |protein-coding|ARVCF delta catenin family member |ARVCF delta catenin family member |{MIM:602269,HGNC:HGNC:728,Ensembl:ENSG00000099889,AllianceGenome:HGNC:728}|2014-02-10 22:59:21.153414|
|427    |ASAH1 |ASAH1|AC,ACDase,ASAH,PHP,PHP32,SMAPME|null          |N-acylsphingosine amidohydrolase 1|N-acylsphingosine amidohydrolase 1|null                                                                      |2014-02-10 22:59:21.153414|
+-------+------+-----+-------------------------------+--------------+----------------------------------+----------------------------------+--------------------------------------------------------------------------+--------------------------+
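For readers who consume the xrefs column as text, a minimal parsing sketch follows. It assumes the value is rendered as in the output above, a brace-wrapped, comma-separated list of source:identifier entries with no quoting; a database driver may already return the column as a list, in which case no parsing is needed. The function name parse_xrefs is hypothetical.

```python
def parse_xrefs(value):
    """Turn '{MIM:607574,HGNC:HGNC:713}' into {'MIM': '607574', 'HGNC': 'HGNC:713'}."""
    # Unpopulated rows appear as null in the output above.
    if value is None or value == "null":
        return {}
    inner = value.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    xrefs = {}
    for item in inner.split(","):
        if not item:
            continue
        # Split on the first colon only, so 'HGNC:HGNC:713' keeps its full identifier.
        source, _, identifier = item.partition(":")
        xrefs[source] = identifier
    return xrefs
```

For example, `parse_xrefs("{MIM:607574,HGNC:HGNC:713}")["HGNC"]` gives `"HGNC:713"`.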

.gitignore

etc/scripts/create-new-schema.sh

sbin/ncbi-parse-geneinfo

sbin/uta-extract

sbin/uta-load

bsgiles73

@bsgiles73 Please make changes

bsgiles73 · 2024-04-18T17:49:25Z

src/uta/loading.py

@@ -336,40 +336,17 @@ def load_geneinfo(session, opts, cf):
    for i_gi, gi in enumerate(gir):
        session.merge(
            usam.Gene(
-                hgnc=gi.hgnc,
+                gene_id=gi.gene_id,
+                hgnc=gi.gene_symbol,


Fix this, it should pull HGNC from the gene parser.

src/uta/loading.py

… value in intermediate file, and transcript to gene id changes should raise exception

andreasprlic

LGTM - we discussed during design review.

bsgiles73 added 7 commits April 11, 2024 15:59

feat(IPVC-2264): model changes to add gene_id, new Alembic migrations, and a backfill script

ccdd431

feat(IPVC-2264): add in database update script, update loading methods

0d4d308

feat(IPVC-2264): rename schema rather than export and re-import

d5833c4

Merge branch 'main' into IPVC-2264-add-geneid

8e70e11

feat(IPVC-2264): address hgnc to gene_id updates

6c4b378

feat(IPVC-2264): update shell script, don't drop hgnc, add type and xrefs to gene

4ed5101

feat(IPVC-2264): update .gitignore

807b8e6