feat(IPVC-2471): add codon_table to txinfo file #42

Merged
merged 4 commits into from
May 23, 2024


bsgiles73

Analysis of previous results showed that we were not setting the codon_table column of the transcript table correctly for mitochondrial genes or non-coding genes. For mitochondrial coding genes the value should be "2"; for non-coding genes it should be NULL, like cds_start_i, cds_end_i and cds_md5.

This PR makes several updates to accomplish this.

  • Updates the alembic migration file to make the new column nullable, make it a text value, and remove the default.
  • Backfills the existing transcript table based on the presence of cds values.
  • Adds codon_table to the TxInfo uta format.
  • Adds a default of "1" if a RefSeq sequence record has a CDS feature.
  • Adds logic to the loading module that sets the value to NULL if the transcript has no coding information.
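
The rules in the list above can be sketched as a small helper. This is only an illustration under stated assumptions, not the code from this PR: the function name resolve_codon_table, its parameters, and the two constants are hypothetical. The values "1" and "2" are the NCBI translation table ids for the standard code and the vertebrate mitochondrial code.

```python
from typing import Optional

# NCBI translation table ids, stored as text to match the text column.
STANDARD_CODON_TABLE = "1"          # standard genetic code
VERTEBRATE_MITO_CODON_TABLE = "2"   # vertebrate mitochondrial code


def resolve_codon_table(
    cds_se_i: Optional[str], codon_table: Optional[str] = None
) -> Optional[str]:
    """Return the codon_table value to store for one transcript.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch and do not come from the UTA code base.
    """
    # No CDS coordinates means a non-coding transcript: store NULL (None),
    # just like cds_start_i, cds_end_i and cds_md5.
    if not cds_se_i:
        return None
    # A coding transcript keeps an explicit table if one was supplied,
    # otherwise it falls back to the standard code.
    return codon_table or STANDARD_CODON_TABLE
```

With this sketch, a mitochondrial coding transcript that carries an explicit "2" keeps it, a coding transcript without an explicit table gets "1", and a transcript without CDS coordinates gets None regardless of any table passed in.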

To test these changes I ran the mito workflow on a database that had the gene_id backfill performed.

Run the mito-extract workflow:
=======================
sgiles-MD6M:uta shane.giles$ docker compose -f docker-compose.yml -f misc/mito-transcripts/docker-compose-mito-extract.yml run mito-extract
2024-05-23 01:58:58 INFO     [__main__] downloading files for NC_012920.1
2024-05-23 01:58:58 INFO     [__main__] downloading gb file to /mito-extract/work/NC_012920.1.gbff
2024-05-23 01:59:00 INFO     [__main__] downloading fasta file to /mito-extract/work/NC_012920.1.fna
2024-05-23 01:59:02 INFO     [__main__] processing NCBI GBFF file from /mito-extract/work/NC_012920.1.gbff
2024-05-23 01:59:02 INFO     [__main__] processing NCBI GBFF file from /mito-extract/work/NC_012920.1.fna
2024-05-23 01:59:02 INFO     [__main__] found 37 genes from parsing /mito-extract/work/NC_012920.1.gbff

Verify the codon_table column was in the txinfo file:
=======================================
origin  ac      gene_id gene_symbol     cds_se_i        exons_se_i      codon_table     transl_except
NCBI    NC_012920.1_08526_09207 4508    MT-ATP6 0,681   0,681   2
NCBI    NC_012920.1_09206_09990 4514    MT-CO3  0,784   0,784   2       (pos:9990,aa:TERM)
NCBI    NC_012920.1_09990_10058 4563    MT-TG           0,68
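
The manual check above could also be scripted. The sketch below assumes the txinfo file is tab-delimited with the header shown above; the function name check_txinfo and the embedded sample (copied from the three rows above) are illustrative only. It confirms that the codon_table column exists and that it is populated exactly when cds_se_i is populated.

```python
import csv
import io

# Three rows copied from the output above, assuming tab-delimited fields.
SAMPLE_TXINFO = (
    "origin\tac\tgene_id\tgene_symbol\tcds_se_i\texons_se_i\tcodon_table\ttransl_except\n"
    "NCBI\tNC_012920.1_08526_09207\t4508\tMT-ATP6\t0,681\t0,681\t2\t\n"
    "NCBI\tNC_012920.1_09206_09990\t4514\tMT-CO3\t0,784\t0,784\t2\t(pos:9990,aa:TERM)\n"
    "NCBI\tNC_012920.1_09990_10058\t4563\tMT-TG\t\t0,68\t\t\n"
)


def check_txinfo(handle):
    """Return accessions whose codon_table disagrees with cds_se_i.

    Hypothetical helper for this sketch; raises ValueError if the
    codon_table column is missing from the header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if "codon_table" not in (reader.fieldnames or []):
        raise ValueError("txinfo is missing the codon_table column")
    problems = []
    for row in reader:
        has_cds = bool(row["cds_se_i"])
        has_table = bool(row["codon_table"])
        # Coding rows need a codon table; non-coding rows must leave it empty.
        if has_cds != has_table:
            problems.append(row["ac"])
    return problems


problems = check_txinfo(io.StringIO(SAMPLE_TXINFO))
```

For a real file, the io.StringIO wrapper would be replaced by an open file handle (or gzip.open in text mode if the file is compressed).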

Run the uta-load workflow:
====================
sgiles-MD6M:uta shane.giles$ UTA_ETL_OLD_UTA_VERSION=uta_20210129c UTA_ETL_NEW_UTA_VERSION=uta_20240521 docker compose run uta-load
[+] Creating 1/0
 ✔ Container uta  Running                                                                                                                                                                                                                          0.0s
+ source_uta_v=uta_20210129c
+ dest_uta_v=uta_20240521
...

In the source database:
==================
uta=> SELECT t.codon_table, COUNT(*) row_count FROM uta_20210129c.transcript AS t GROUP BY t.codon_table;
 codon_table | row_count
-------------+-----------
 1           |    202275
             |    111952
(2 rows)

In the destination:
==============
uta=> SELECT t.codon_table, COUNT(*) row_count FROM uta_20240521.transcript AS t GROUP BY t.codon_table;
 codon_table | row_count
-------------+-----------
 1           |    202275
 2           |        13
             |    111976
(3 rows)
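
As a cross-check, the two result sets can be compared arithmetically. The sketch below uses only the counts shown above and assumes the mitochondrial transcripts are the only rows added between the two schemas; the function name count_delta is illustrative.

```python
# Row counts copied from the two queries above; None stands for SQL NULL.
source_counts = {"1": 202275, None: 111952}
dest_counts = {"1": 202275, "2": 13, None: 111976}


def count_delta(source, dest):
    """Return the per-value change in row count from source to dest."""
    delta = {}
    for key, dest_count in dest.items():
        change = dest_count - source.get(key, 0)
        if change != 0:
            delta[key] = change
    return delta


delta = count_delta(source_counts, dest_counts)
total_added = sum(delta.values())
```

Under that assumption, the destination gains 13 rows with codon_table "2" and 24 rows with NULL, 37 in total, which is consistent with the 37 genes reported by the mito-extract run.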

@bsgiles73 bsgiles73 marked this pull request as ready for review May 23, 2024 04:11
@bsgiles73 bsgiles73 requested review from sptaylor and nvta1209 May 23, 2024 04:11
@bsgiles73 bsgiles73 merged commit e8c811c into main May 23, 2024
1 check passed
@bsgiles73 bsgiles73 deleted the IPVC-2471-fix-cond-table-mito branch May 23, 2024 18:47
2 participants