feat: port over access to the UTA data structures (#10) #11

Merged
merged 25 commits into from
Feb 17, 2023
1 change: 1 addition & 0 deletions .gitattributes
@@ -1 +1,2 @@
src/static_data/**/*.json* filter=lfs diff=lfs merge=lfs -text
tests/data/**.gz
28 changes: 28 additions & 0 deletions .github/workflows/rust.yml
@@ -47,12 +47,37 @@ jobs:
  Testing:
    needs: Formatting
    runs-on: ubuntu-latest

    services:
      # The tests need a postgres server; the data will be loaded later
      # after checkout.
      postgres:
        image: postgres
        env:
          POSTGRES_DB: uta
          POSTGRES_USER: uta_admin
          POSTGRES_PASSWORD: uta_admin
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
        with:
          lfs: true

      - name: Import test database.
        run: |
          zcat tests/data/data/uta_20210129-subset.pgd.gz \
            | psql -v ON_ERROR_STOP=1 -U uta_admin -h 0.0.0.0 -d uta
        env:
          PGPASSWORD: uta_admin

      - name: Install stable toolchain
        uses: actions-rs/toolchain@v1
        with:
@@ -66,6 +91,9 @@ jobs:
        with:
          version: 0.16.0
          args: "-- --test-threads 1"
        env:
          TEST_UTA_DATABASE_URL: postgres://uta_admin:[email protected]/uta
          TEST_UTA_DATABASE_SCHEMA: uta_20210129

      - name: Codecov
        uses: codecov/codecov-action@v3
3 changes: 3 additions & 0 deletions Cargo.toml
@@ -5,10 +5,13 @@ edition = "2021"

[dependencies]
anyhow = "1.0.69"
chrono = "0.4.23"
enum-map = "2.4.2"
flate2 = "1.0.25"
lazy_static = "1.4.0"
linked-hash-map = "0.5.6"
nom = "7.1.3"
postgres = { version = "0.19.4", features = ["with-chrono-0_4"] }
pretty_assertions = "1.3.0"
serde = { version = "1.0.152", features = ["derive"] }
serde_json = "1.0.93"
13 changes: 13 additions & 0 deletions README.md
@@ -4,3 +4,16 @@
# hgvs-rs

This is a port of [biocommons/hgvs](https://github.com/biocommons/hgvs) to the Rust programming language.

## Running Tests

The tests need a running instance of UTA.
You can either set up a local copy (a minimal dataset is provided in `tests/data/data/*.pgd.gz`) or use the public instance.
In both cases, set the environment variables `TEST_UTA_DATABASE_URL` and `TEST_UTA_DATABASE_SCHEMA` accordingly.
To use the public database:

```
export TEST_UTA_DATABASE_URL=postgres://anonymous:[email protected]:/uta
export TEST_UTA_DATABASE_SCHEMA=uta_20210129
cargo test
```
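
For orientation, a test helper could pick up these two variables roughly as follows. This is a minimal sketch and not the code from this PR: `UtaTestConfig`, `config_from`, and `config_from_env` are hypothetical names, and the lookup is passed in as a closure so that the logic can be exercised without touching the process environment.

```rust
use std::env;

/// Hypothetical holder for the two settings named in the README.
#[derive(Debug, PartialEq)]
pub struct UtaTestConfig {
    pub db_url: String,
    pub db_schema: String,
}

/// Build the configuration from any key lookup (hypothetical helper).
///
/// Returns an error naming the first variable that is missing.
pub fn config_from<F>(lookup: F) -> Result<UtaTestConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let get = |key: &str| {
        lookup(key).ok_or_else(|| format!("environment variable {key} is not set"))
    };
    Ok(UtaTestConfig {
        db_url: get("TEST_UTA_DATABASE_URL")?,
        db_schema: get("TEST_UTA_DATABASE_SCHEMA")?,
    })
}

/// Read the configuration from the real process environment.
pub fn config_from_env() -> Result<UtaTestConfig, String> {
    config_from(|key| env::var(key).ok())
}
```

Failing early with the name of the missing variable tends to be easier to diagnose than a connection error later in the test run.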
314 changes: 314 additions & 0 deletions src/data/interface.rs
@@ -0,0 +1,314 @@
//! Definition of the interface for accessing the transcript database.

use chrono::NaiveDateTime;
use linked_hash_map::LinkedHashMap;

use crate::static_data::Assembly;

/// Information about a gene.
///
/// ```text
/// hgnc | ATM
/// maploc | 11q22-q23
/// descr | ataxia telangiectasia mutated
/// summary | The protein encoded by this gene belongs to the PI3/PI4-kinase family. This...
/// aliases | AT1,ATA,ATC,ATD,ATE,ATDC,TEL1,TELO1
/// added | 2014-02-04 21:39:32.57125
/// ```
#[derive(Debug, PartialEq)]
pub struct GeneInfoRecord {
pub hgnc: String,
pub maploc: String,
pub descr: String,
pub summary: String,
pub aliases: Vec<String>,
pub added: NaiveDateTime,
}

/// Information about similar transcripts.
///
/// ```text
/// tx_ac1 | NM_001285829.1
/// tx_ac2 | ENST00000341255
/// hgnc_eq | f
/// cds_eq | f
/// es_fp_eq | f
/// cds_es_fp_eq | f
/// cds_exon_lengths_fp_eq | t
/// ```
///
/// Hint: "es" = "exon set", "fp" = "fingerprint", "eq" = "equal"
///
/// "Exon structure" refers to the start and end coordinates on a
/// specified reference sequence. Thus, having the same exon
/// structure means that the transcripts are defined on the same
/// reference sequence and have the same exon spans on that
/// sequence.
#[derive(Debug, PartialEq)]
pub struct TxSimilarityRecord {
/// Accession of first transcript.
pub tx_ac1: String,
/// Accession of second transcript.
pub tx_ac2: String,
/// Whether both transcripts have the same HGNC gene symbol.
pub hgnc_eq: bool,
/// Whether CDS sequences are identical.
pub cds_eq: bool,
/// Whether the full exon structures are identical (i.e., incl. UTR).
pub es_fp_eq: bool,
/// Whether the CDS-clipped portions of the exon structures are identical
/// (i.e., excluding UTR).
pub cds_es_fp_eq: bool,
/// Whether the fingerprints of the CDS exon lengths are equal.
pub cds_exon_lengths_fp_eq: bool,
}

///```text
/// hgnc | TGDS
/// tx_ac | NM_001304430.1
/// alt_ac | NC_000013.10
/// alt_aln_method | blat
/// alt_strand | -1
/// ord | 0
/// tx_start_i | 0
/// tx_end_i | 301
/// alt_start_i | 95248228
/// alt_end_i | 95248529
/// cigar | 301=
/// tx_aseq |
/// alt_aseq |
/// tx_exon_set_id | 348239
/// alt_exon_set_id | 722624
/// tx_exon_id | 3518579
/// alt_exon_id | 6063334
/// exon_aln_id | 3461425
///```
#[derive(Debug, PartialEq)]
pub struct TxExonsRecord {
pub hgnc: String,
pub tx_ac: String,
pub alt_ac: String,
pub alt_aln_method: String,
pub alt_strand: i16,
pub ord: i32,
pub tx_start_i: i32,
pub tx_end_i: i32,
pub alt_start_i: i32,
pub alt_end_i: i32,
pub cigar: String,
pub tx_aseq: Option<String>,
pub alt_aseq: Option<String>,
pub tx_exon_set_id: i32,
pub alt_exon_set_id: i32,
pub tx_exon_id: i32,
pub alt_exon_id: i32,
pub exon_aln_id: i32,
}

/// ```text
/// tx_ac | NM_001304430.2
/// alt_ac | NC_000013.10
/// alt_strand | -1
/// alt_aln_method | splign
/// start_i | 95226307
/// end_i | 95248406
/// ```
#[derive(Debug, PartialEq)]
pub struct TxForRegionRecord {
pub tx_ac: String,
pub alt_ac: String,
pub alt_strand: i16,
pub alt_aln_method: String,
pub start_i: i32,
pub end_i: i32,
}

/// ```text
/// tx_ac | NM_199425.2
/// alt_ac | NM_199425.2
/// alt_aln_method | transcript
/// cds_start_i | 283
/// cds_end_i | 1003
/// lengths | {707,79,410}
/// hgnc | VSX1
/// ```
#[derive(Debug, PartialEq)]
pub struct TxIdentityInfo {
pub tx_ac: String,
pub alt_ac: String,
pub alt_aln_method: String,
pub cds_start_i: i32,
pub cds_end_i: i32,
pub lengths: Vec<i32>,
pub hgnc: String,
}

/// ```text
/// hgnc | ATM
/// cds_start_i | 385
/// cds_end_i | 9556
/// tx_ac | NM_000051.3
/// alt_ac | AC_000143.1
/// alt_aln_method | splign
/// ```
#[derive(Debug, PartialEq)]
pub struct TxInfoRecord {
pub hgnc: String,
pub cds_start_i: Option<i32>,
pub cds_end_i: Option<i32>,
pub tx_ac: String,
pub alt_ac: String,
pub alt_aln_method: String,
}

/// ```text
/// -[ RECORD 1 ]--+----------------
/// tx_ac | ENST00000000233
/// alt_ac | NC_000007.13
/// alt_aln_method | genebuild
/// -[ RECORD 2 ]--+----------------
/// tx_ac | ENST00000000412
/// alt_ac | NC_000012.11
/// alt_aln_method | genebuild
/// ```
#[derive(Debug, PartialEq)]
pub struct TxMappingOptionsRecord {
pub tx_ac: String,
pub alt_ac: String,
pub alt_aln_method: String,
}

pub trait Interface {
/// Return the data version, e.g., `uta_20180821`.
fn data_version(&self) -> &str;

/// Return the schema version, e.g., `"1.1"`.
fn schema_version(&self) -> &str;

/// Return a map from accession to chromosome name for the given assembly.
///
/// For example, for the `GRCh38.p5` assembly, the value for `"NC_000001.11"`
/// would be `"1"`.
///
/// # Arguments
///
/// * `assembly` - The assembly to build the map for.
fn get_assembly_map(&self, assembly: Assembly) -> LinkedHashMap<String, String>;

/// Returns the basic information about the gene.
///
/// # Arguments
///
/// * `hgnc` - HGNC gene name
fn get_gene_info(&mut self, hgnc: &str) -> Result<GeneInfoRecord, anyhow::Error>;

/// Return the (single) associated protein accession for a given transcript accession,
/// or None if not found.
///
/// # Arguments
///
/// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3')
fn get_pro_ac_for_tx_ac(&mut self, tx_ac: &str) -> Result<Option<String>, anyhow::Error>;

/// Return full sequence for the given accession.
///
/// # Arguments
///
/// * `ac` -- accession
fn get_seq(&mut self, ac: &str) -> Result<String, anyhow::Error>;

/// Return sequence part for the given accession.
///
/// # Arguments
///
/// * `ac` -- accession
/// * `begin` -- start position (0-based, start of sequence if missing)
/// * `end` -- end position (0-based, end of sequence if missing)
fn get_seq_part(
&mut self,
ac: &str,
begin: Option<usize>,
end: Option<usize>,
) -> Result<String, anyhow::Error>;

/// Return a list of transcripts that are similar to the given transcript, with relevant
/// similarity criteria.
///
/// # Arguments
///
/// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3')
fn get_similar_transcripts(
&mut self,
tx_ac: &str,
) -> Result<Vec<TxSimilarityRecord>, anyhow::Error>;

/// Return transcript exon info for supplied accession (tx_ac, alt_ac, alt_aln_method),
/// or empty `Vec` if not found.
///
/// # Arguments
///
/// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3')
/// * `alt_ac` -- specific genomic sequence (e.g., NC_000011.4)
/// * `alt_aln_method` -- sequence alignment method (e.g., splign, blat)
fn get_tx_exons(
&mut self,
tx_ac: &str,
alt_ac: &str,
alt_aln_method: &str,
) -> Result<Vec<TxExonsRecord>, anyhow::Error>;

/// Return transcript info records for supplied gene, in order of decreasing length.
///
/// # Arguments
///
/// * `gene` - HGNC gene name
fn get_tx_for_gene(&mut self, gene: &str) -> Result<Vec<TxInfoRecord>, anyhow::Error>;

/// Return transcripts that overlap given region.
///
/// # Arguments
///
/// * `alt_ac` -- reference sequence (e.g., NC_000007.13)
/// * `alt_aln_method` -- alignment method (e.g., splign)
/// * `start_i` -- 5' bound of region
/// * `end_i` -- 3' bound of region
fn get_tx_for_region(
&mut self,
alt_ac: &str,
alt_aln_method: &str,
start_i: i32,
end_i: i32,
) -> Result<Vec<TxForRegionRecord>, anyhow::Error>;

/// Return features associated with a single transcript.
///
/// # Arguments
///
/// * `tx_ac` -- transcript accession with version (e.g., 'NM_199425.2')
fn get_tx_identity_info(&mut self, tx_ac: &str) -> Result<TxIdentityInfo, anyhow::Error>;

/// Return a single transcript info for supplied accession (tx_ac, alt_ac, alt_aln_method), or None if not found.
///
/// # Arguments
///
/// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3')
/// * `alt_ac` -- specific genomic sequence (e.g., NC_000011.4)
/// * `alt_aln_method` -- sequence alignment method (e.g., splign, blat)
fn get_tx_info(
&mut self,
tx_ac: &str,
alt_ac: &str,
alt_aln_method: &str,
) -> Result<TxInfoRecord, anyhow::Error>;

/// Return all transcript alignment sets for a given transcript accession (tx_ac).
///
/// Returns an empty list if the transcript does not exist. Use this method to discover
/// the mapping options supported in the database.
///
/// # Arguments
///
/// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3')
fn get_tx_mapping_options(
&mut self,
tx_ac: &str,
) -> Result<Vec<TxMappingOptionsRecord>, anyhow::Error>;
}
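
To make the `begin`/`end` convention of `get_seq_part` concrete, here is a minimal in-memory sketch. It is not the implementation from this PR: `InMemorySeqs` is a hypothetical type, errors are plain `String`s instead of `anyhow::Error` so that the example has no dependencies, and the half-open interval `[begin, end)` is an assumption, since the trait documentation only states that both positions are 0-based.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory sequence store keyed by accession.
pub struct InMemorySeqs {
    seqs: HashMap<String, String>,
}

impl InMemorySeqs {
    /// Build the store from `(accession, sequence)` pairs.
    pub fn new(entries: &[(&str, &str)]) -> Self {
        let seqs = entries
            .iter()
            .map(|(ac, seq)| (ac.to_string(), seq.to_string()))
            .collect();
        Self { seqs }
    }

    /// Return the full sequence for `ac`, mirroring `get_seq`.
    pub fn get_seq(&self, ac: &str) -> Result<String, String> {
        self.get_seq_part(ac, None, None)
    }

    /// Return the part of the sequence between `begin` and `end`,
    /// mirroring `get_seq_part`.
    ///
    /// Assumption: 0-based and half-open; `None` means start or end
    /// of the sequence, respectively.
    pub fn get_seq_part(
        &self,
        ac: &str,
        begin: Option<usize>,
        end: Option<usize>,
    ) -> Result<String, String> {
        let seq = self
            .seqs
            .get(ac)
            .ok_or_else(|| format!("unknown accession: {ac}"))?;
        let begin = begin.unwrap_or(0);
        let end = end.unwrap_or(seq.len());
        // `str::get` returns `None` for out-of-range or inverted bounds.
        seq.get(begin..end)
            .map(str::to_string)
            .ok_or_else(|| format!("invalid range {begin}..{end} for {ac}"))
    }
}
```

Expressing `get_seq` through `get_seq_part` with both bounds missing keeps the two methods consistent by construction.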