
feat: refactor db subset CLI, add subset by VCF and subset by TxId options #641

Merged — 1 commit merged into main from db-subset-from-vcf on Dec 3, 2024

Conversation

tedil (Contributor) commented Dec 3, 2024

This adds a little bit of code duplication (get_tx_for_region, _chrom_to_acc and _acc_to_chrom) but makes it way easier to create test-cases (as you will now get the exact subset of the txdb you need to test the input VCF).

Summary by CodeRabbit

  • New Features

    • Introduced a method to retrieve transcript records for specified genomic regions, enhancing transcript querying capabilities.
    • Added a new selection structure for filtering transcripts based on various criteria, improving the transcript database subsetting feature.
  • Bug Fixes

    • Enhanced error handling for overlapping transcripts in the subsetting process.
  • Refactor

    • Restructured the subsetting logic for better clarity and maintainability.

coderabbitai bot (Contributor) commented Dec 3, 2024

Walkthrough

The changes introduce new functionality for querying transcript records based on genomic regions in the TxIntervalTrees and Provider structs. A new public method, get_tx_for_region, is added to both structs, allowing retrieval of transcripts using parameters such as a transcript sequence database and genomic indices. Additionally, the Args struct in the subset module is updated to include a Selection struct for more flexible transcript filtering options, enhancing the subsetting feature with new helper functions and improved error handling.
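As a rough illustration of what such a selection type could look like — hypothetical: the variant and field names below are illustrative, not the actual definitions in src/db/subset/mod.rs:

```rust
use std::path::PathBuf;

// Hypothetical sketch of a selection type for subsetting the transcript
// database; the real `Selection` in src/db/subset/mod.rs may differ in shape.
#[derive(Debug)]
enum Selection {
    HgncIds(Vec<String>),       // subset by gene symbol / HGNC ID
    TranscriptIds(Vec<String>), // subset by transcript accession
    Vcf(PathBuf),               // subset to transcripts overlapping VCF variants
}

fn describe(sel: &Selection) -> String {
    match sel {
        Selection::HgncIds(ids) => format!("{} gene id(s)", ids.len()),
        Selection::TranscriptIds(ids) => format!("{} transcript id(s)", ids.len()),
        Selection::Vcf(path) => format!("variants from {}", path.display()),
    }
}

fn main() {
    let sel = Selection::TranscriptIds(vec!["NM_000059.4".to_string()]);
    assert_eq!(describe(&sel), "1 transcript id(s)");
    println!("{}", describe(&sel));
}
```

An enum like this makes the three subsetting modes mutually exclusive at the type level, which is one plausible reading of "a new selection structure for filtering transcripts based on various criteria".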

Changes

File Path Change Summary
src/annotate/seqvars/provider.rs - Added pub fn get_tx_for_region in TxIntervalTrees for retrieving transcripts by genomic region.
- Added fn get_tx_for_region in Provider mirroring the above method.
src/db/subset/mod.rs - Updated Args struct to include Selection for flexible transcript filtering.
- Restructured run function to utilize Selection.
- Added helper functions for transcript extraction based on various criteria.
- Enhanced error handling in _extract_transcripts_by_region.
- Updated subset_tx_db function to accept Selection.

Possibly related PRs

  • fix: refactor csq analysis code #599: The changes in this PR involve modifications to the ConsequencePredictor class in src/annotate/seqvars/csq.rs, which is related to the handling of sequence variants, similar to the new methods added in the TxIntervalTrees and Provider structs for retrieving transcript records based on genomic regions. Both PRs enhance functionality related to genomic data processing.

Suggested reviewers

  • holtgrewe

Poem

In the fields of genes so bright,
New methods spring forth, a wondrous sight.
With transcripts gathered, region by region,
A rabbit's joy in each new legion.
Selection now dances, flexible and free,
In the world of data, hop along with glee! 🐇✨



coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (8)
src/db/subset/mod.rs (4)

65-85: Refactor extension handling for clarity and correctness

The current handling of file extensions results in a nested Option<Option<&str>>, which makes the comparison verbose and less readable. Consider using and_then to flatten the options and simplify the logic.

Apply this diff to refactor the extension handling:

 let ext = args.path_out.extension().and_then(|s| s.to_str());
 let mut writer: Box<dyn Write> = if ext == Some("gz") {
     Box::new(flate2::write::GzEncoder::new(
         file,
         flate2::Compression::default(),
     ))
 } else if ext == Some("zst") {
     Box::new(
         zstd::Encoder::new(file, 0)
             .map_err(|e| {
                 anyhow::anyhow!(
                     "failed to open zstd encoder for {}: {}",
                     args.path_out.display(),
                     e
                 )
             })?
             .auto_finish(),
     )
 } else {
     Box::new(file)
 };

248-258: Simplify conditional logic in transcript filtering

In the _extract_transcripts_by_hgnc_id function, the conditional check for matching HGNC IDs is complex and could be simplified for readability.

Simplify the condition using a set of normalized HGNC IDs:

let normalized_hgnc_ids: IndexSet<String> = hgnc_ids
    .iter()
    .map(|id| id.trim_start_matches("HGNC:").to_string())
    .collect();

if normalized_hgnc_ids.contains(&tx.gene_symbol)
    || normalized_hgnc_ids.contains(&tx.gene_id.trim_start_matches("HGNC:").to_string())
{
    // ...
}

This approach avoids repetitive string replacement and enhances readability.


302-309: Simplify chromosome-prefix handling in _chrom_to_acc

In the _chrom_to_acc function, the if/else around the "chr" prefix is more verbose than it needs to be; strip_prefix with unwrap_or expresses the same logic in a single line, and both key variants can then be inserted directly.

Simplify the loop:

-for (acc, chrom) in &acc_to_chrom {
-    let chrom = if chrom.starts_with("chr") {
-        chrom.strip_prefix("chr").unwrap()
-    } else {
-        chrom
-    };
-    chrom_to_acc.insert(chrom.to_string(), acc.clone());
-    chrom_to_acc.insert(format!("chr{}", chrom), acc.clone());
+for (acc, chrom) in &acc_to_chrom {
+    let chrom = chrom.strip_prefix("chr").unwrap_or(chrom);
+    chrom_to_acc.insert(chrom.to_string(), acc.clone());
+    chrom_to_acc.insert(format!("chr{}", chrom), acc.clone());
 }

329-332: Ensure comprehensive test coverage for new selection options

The test test_subset_tx_db only checks subsetting by hgnc_id. Consider adding test cases for transcript_id and vcf selection options to ensure all code paths are tested.

Would you like assistance in creating additional test cases for the new selection options?

src/annotate/seqvars/provider.rs (4)

109-112: Handle missing contigs more gracefully

In the get_tx_for_region method, if the alt_ac is not found in contig_to_idx, the error returned is NoTranscriptFound. Since the issue is a missing contig rather than a missing transcript, consider using a more descriptive error.

Modify the error to reflect the missing contig:

-.ok_or(Error::NoTranscriptFound(alt_ac.to_string()))?;
+.ok_or_else(|| Error::InvalidInput(format!("Contig '{}' not found in index", alt_ac)))?;

This provides clearer feedback to the user about the nature of the error.


Line range hint 290-310: Combine _acc_to_chrom and _chrom_to_acc functions for efficiency

The functions _acc_to_chrom and _chrom_to_acc perform similar mappings in opposite directions. Consider combining them to build both mappings in a single iteration, reducing redundancy.

Refactor the code as follows:

fn build_chrom_acc_mappings(assembly: Assembly) -> (IndexMap<String, String>, IndexMap<String, String>) {
    let mut acc_to_chrom = IndexMap::new();
    let mut chrom_to_acc = IndexMap::new();

    for record in &ASSEMBLY_INFOS[assembly].sequences {
        let acc = record.refseq_ac.clone();
        let chrom = record.name.clone();

        acc_to_chrom.insert(acc.clone(), chrom.clone());

        let chrom_variants = if chrom.starts_with("chr") {
            vec![chrom[3..].to_string(), chrom]
        } else {
            vec![chrom.clone(), format!("chr{}", chrom)]
        };
        for chrom_name in chrom_variants {
            chrom_to_acc.insert(chrom_name, acc.clone());
        }
    }
    (acc_to_chrom, chrom_to_acc)
}

Then, adjust your code to use this combined function.


Line range hint 227-237: Use &[String] instead of &Vec<String> in function parameters

In the _extract_transcripts_by_txid function, accepting a slice (&[String]) instead of a reference to a Vec provides more flexibility and aligns with Rust best practices.

Update the function signature:

-fn _extract_transcripts_by_txid(
-    container_tx_db: &TranscriptDb,
-    transcript_ids: &Vec<String>,
-) -> (IndexSet<usize>, IndexSet<String>, IndexSet<String>) {
+fn _extract_transcripts_by_txid(
+    container_tx_db: &TranscriptDb,
+    transcript_ids: &[String],
+) -> (IndexSet<usize>, IndexSet<String>, IndexSet<String>) {

This allows the function to accept any slice of strings, not just Vec<String>.
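To illustrate the point with a standalone sketch (not code from this PR — the function and identifiers below are invented for demonstration): a &[String] parameter accepts a Vec, an array, or a sub-slice without conversion.

```rust
// Standalone illustration: a slice parameter is more flexible than &Vec<String>,
// since &Vec<String> auto-derefs to &[String] at the call site.
fn count_refseq(ids: &[String]) -> usize {
    ids.iter().filter(|id| id.starts_with("NM_")).count()
}

fn main() {
    let v = vec!["NM_000059.4".to_string(), "ENST00000380152".to_string()];
    assert_eq!(count_refseq(&v), 1);      // &Vec<String> coerces to &[String]
    assert_eq!(count_refseq(&v[1..]), 0); // sub-slices work too
    println!("ok");
}
```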


Line range hint 248-258: Avoid multiple string allocations in loop

In the for-loop when filtering transcripts, multiple string allocations occur due to tx.gene_id.replace("HGNC:", ""). This can be optimized.

Instead of calling replace in each iteration, normalize the gene_id once:

let gene_id_normalized = tx.gene_id.trim_start_matches("HGNC:").to_string();

if hgnc_ids.contains(&tx.gene_symbol)
    || hgnc_ids.contains(&tx.gene_id)
    || hgnc_ids.contains(&gene_id_normalized)
{
    // ...
}

This reduces unnecessary allocations and improves performance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1ec23df and 38ff7eb.

📒 Files selected for processing (2)
  • src/annotate/seqvars/provider.rs (1 hunks)
  • src/db/subset/mod.rs (4 hunks)

Comment on lines +199 to +212
if let (Some(Ok(start)), Ok(end)) = (record.variant_start(), record.variant_end(&header)) {
let chrom = record.reference_sequence_name().to_string();
let accession = chrom_to_acc.get(&chrom).unwrap_or(&chrom);

tracing::info!("... and writing ...");
// Open file and if necessary, wrap in a decompressor.
let file = std::fs::File::create(&args.path_out)
.map_err(|e| anyhow::anyhow!("failed to create file {}: {}", args.path_out.display(), e))?;
let ext = args.path_out.extension().map(|s| s.to_str());
let mut writer: Box<dyn Write> = if ext == Some(Some("gz")) {
Box::new(flate2::write::GzEncoder::new(
file,
flate2::Compression::default(),
))
} else if ext == Some(Some("zst")) {
Box::new(
zstd::Encoder::new(file, 0)
.map_err(|e| {
anyhow::anyhow!(
"failed to open zstd encoder for {}: {}",
args.path_out.display(),
e
)
})?
.auto_finish(),
)
} else {
Box::new(file)
};
writer
.write_all(&buf)
.map_err(|e| anyhow::anyhow!("failed to write to {}: {}", args.path_out.display(), e))?;
let txs = trees.get_tx_for_region(
container,
accession,
"splign",
usize::from(start).try_into()?,
usize::from(end).try_into()?,
)?;
transcript_ids.extend(txs.into_iter().map(|tx| tx.tx_ac));
}
}
coderabbitai bot commented Dec 3, 2024

⚠️ Potential issue

Correct type conversions for start and end variables

There is a potential issue with type conversions for start and end. The variables start and end are i32, but you're attempting to convert them using usize::from(start), which expects a usize input. This may lead to runtime errors or incorrect behavior when dealing with negative values.

Apply this diff to correct the type conversions:

- usize::from(start).try_into()?,
- usize::from(end).try_into()?,
+ start.try_into()?,
+ end.try_into()?,

Ensure that start and end are valid non-negative integers before performing the conversion to usize:

let start_usize = if start >= 0 {
    start as usize
} else {
    return Err(anyhow::anyhow!("Start index is negative: {}", start));
};
let end_usize = if end >= 0 {
    end as usize
} else {
    return Err(anyhow::anyhow!("End index is negative: {}", end));
};

Then use start_usize and end_usize in the function call.

tedil (Contributor, Author) replied:

no, start and end are of type Position from the noodles core crate, which in turn implements conversions to and from usize; so we first obtain a usize from a Position, then convert that usize into an i32
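The conversion chain tedil describes can be sketched in std-only Rust. This is an illustration, not the PR's code: `Position` here is a stand-in newtype, whereas the real type comes from the noodles core crate.

```rust
use std::convert::TryInto;

// Stand-in for noodles' Position: a 1-based coordinate, non-negative by
// construction, convertible to usize.
#[derive(Copy, Clone)]
struct Position(usize);

impl From<Position> for usize {
    fn from(p: Position) -> usize {
        p.0
    }
}

// First Position -> usize, then usize -> i32. try_into can only fail if the
// coordinate exceeds i32::MAX; negative values are impossible by construction.
fn to_i32(p: Position) -> Result<i32, std::num::TryFromIntError> {
    usize::from(p).try_into()
}

fn main() {
    assert_eq!(to_i32(Position(199)).unwrap(), 199_i32);
    assert!(to_i32(Position(usize::MAX)).is_err());
    println!("ok");
}
```

This is why the bot's concern about negative values does not apply: the usize step already rules them out, and the only remaining failure mode (overflow past i32::MAX) is surfaced by the `?` on try_into.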

coderabbitai bot replied:

@tedil Thank you for the clarification! I apologize for the misunderstanding. Since start and end are of type Position from the noodles core crate, and they implement conversions to and from usize, your approach makes sense. I'll make a note of this for future reviews.

(_/)
(•‿•)
/ 🧐


✏️ Learnings added
Learnt from: tedil
PR: varfish-org/mehari#641
File: src/db/subset/mod.rs:199-212
Timestamp: 2024-12-03T10:07:58.852Z
Learning: In `src/db/subset/mod.rs`, the variables `start` and `end` are of type `Position` from the noodles core crate, which implements conversions to and from `usize`. Thus, converting them from `Position` to `usize`, and then to `i32`, is appropriate.


Comment on lines +185 to +224
fn _extract_transcripts_by_region(
container: &TxSeqDatabase,
vcf: &PathBuf,
) -> Result<(IndexSet<usize>, IndexSet<String>, IndexSet<String>)> {
let container_tx_db = container.tx_db.as_ref().expect("no tx_db");
let trees = TxIntervalTrees::new(container);

tracing::info!("Subsetting ...");
let tx_db = subset_tx_db(&tx_db, &args.gene_symbols)?;
let chrom_to_acc = _chrom_to_acc(container.assembly());

tracing::info!("... and encoding ...");
let mut buf = Vec::with_capacity(tx_db.encoded_len());
tx_db
.encode(&mut buf)
.map_err(|e| anyhow::anyhow!("failed to encode: {}", e))?;
let mut transcript_ids = IndexSet::<String>::new();
let mut reader = noodles::vcf::io::reader::Builder::default().build_from_path(vcf)?;
let header = reader.read_header()?;
for result in reader.records() {
let record = result?;
if let (Some(Ok(start)), Ok(end)) = (record.variant_start(), record.variant_end(&header)) {
let chrom = record.reference_sequence_name().to_string();
let accession = chrom_to_acc.get(&chrom).unwrap_or(&chrom);

tracing::info!("... and writing ...");
// Open file and if necessary, wrap in a decompressor.
let file = std::fs::File::create(&args.path_out)
.map_err(|e| anyhow::anyhow!("failed to create file {}: {}", args.path_out.display(), e))?;
let ext = args.path_out.extension().map(|s| s.to_str());
let mut writer: Box<dyn Write> = if ext == Some(Some("gz")) {
Box::new(flate2::write::GzEncoder::new(
file,
flate2::Compression::default(),
))
} else if ext == Some(Some("zst")) {
Box::new(
zstd::Encoder::new(file, 0)
.map_err(|e| {
anyhow::anyhow!(
"failed to open zstd encoder for {}: {}",
args.path_out.display(),
e
)
})?
.auto_finish(),
)
} else {
Box::new(file)
};
writer
.write_all(&buf)
.map_err(|e| anyhow::anyhow!("failed to write to {}: {}", args.path_out.display(), e))?;
let txs = trees.get_tx_for_region(
container,
accession,
"splign",
usize::from(start).try_into()?,
usize::from(end).try_into()?,
)?;
transcript_ids.extend(txs.into_iter().map(|tx| tx.tx_ac));
}
}
if transcript_ids.is_empty() {
return Err(anyhow::anyhow!(
"no transcripts overlapping VCF variants found in transcript database"
));
}
let (tx_idxs, gene_ids) = __extract_transcripts_from_db(container_tx_db, &transcript_ids);
Ok((
IndexSet::<usize>::from_iter(tx_idxs),
transcript_ids,
gene_ids,
))
}
coderabbitai bot commented:

⚠️ Potential issue

Handle missing accession mappings gracefully

In the _extract_transcripts_by_region function, the code assumes that every chromosome name in the VCF will have a corresponding accession in the chrom_to_acc mapping. If a chromosome is not found, the unwrap_or fallback uses the chromosome name itself. This may cause issues if the chromosome names in the VCF do not match the expected format.

Consider adding error handling for unmapped chromosomes:

-let accession = chrom_to_acc.get(&chrom).unwrap_or(&chrom);
+let accession = chrom_to_acc.get(&chrom).ok_or_else(|| {
+    anyhow::anyhow!("Chromosome '{}' not found in accession mapping", chrom)
+})?;

This will ensure that the code fails fast with a clear error message if an unmapped chromosome is encountered.

📝 Committable suggestion


Suggested change
fn _extract_transcripts_by_region(
    container: &TxSeqDatabase,
    vcf: &PathBuf,
) -> Result<(IndexSet<usize>, IndexSet<String>, IndexSet<String>)> {
    let container_tx_db = container.tx_db.as_ref().expect("no tx_db");
    let trees = TxIntervalTrees::new(container);
    let chrom_to_acc = _chrom_to_acc(container.assembly());
    let mut transcript_ids = IndexSet::<String>::new();
    let mut reader = noodles::vcf::io::reader::Builder::default().build_from_path(vcf)?;
    let header = reader.read_header()?;
    for result in reader.records() {
        let record = result?;
        if let (Some(Ok(start)), Ok(end)) = (record.variant_start(), record.variant_end(&header)) {
            let chrom = record.reference_sequence_name().to_string();
            let accession = chrom_to_acc.get(&chrom).ok_or_else(|| {
                anyhow::anyhow!("Chromosome '{}' not found in accession mapping", chrom)
            })?;
            let txs = trees.get_tx_for_region(
                container,
                accession,
                "splign",
                usize::from(start).try_into()?,
                usize::from(end).try_into()?,
            )?;
            transcript_ids.extend(txs.into_iter().map(|tx| tx.tx_ac));
        }
    }
    if transcript_ids.is_empty() {
        return Err(anyhow::anyhow!(
            "no transcripts overlapping VCF variants found in transcript database"
        ));
    }
    let (tx_idxs, gene_ids) = __extract_transcripts_from_db(container_tx_db, &transcript_ids);
    Ok((
        IndexSet::<usize>::from_iter(tx_idxs),
        transcript_ids,
        gene_ids,
    ))
}

Comment on lines +101 to +141
pub fn get_tx_for_region(
    &self,
    tx_seq_db: &TxSeqDatabase,
    alt_ac: &str,
    _alt_aln_method: &str,
    start_i: i32,
    end_i: i32,
) -> Result<Vec<TxForRegionRecord>, Error> {
    let contig_idx = *self
        .contig_to_idx
        .get(alt_ac)
        .ok_or(Error::NoTranscriptFound(alt_ac.to_string()))?;
    let query = start_i..end_i;
    let tx_idxs = self.trees[contig_idx].find(query);

    Ok(tx_idxs
        .iter()
        .map(|entry| {
            let tx = &tx_seq_db.tx_db.as_ref().expect("no tx_db?").transcripts
                [*entry.data() as usize];
            assert_eq!(
                tx.genome_alignments.len(),
                1,
                "Can only have one alignment in Mehari"
            );
            let alt_strand = tx.genome_alignments.first().unwrap().strand;
            TxForRegionRecord {
                tx_ac: tx.id.clone(),
                alt_ac: alt_ac.to_string(),
                alt_strand: match Strand::try_from(alt_strand).expect("invalid strand") {
                    Strand::Plus => 1,
                    Strand::Minus => -1,
                    _ => unreachable!("invalid strand {}", alt_strand),
                },
                alt_aln_method: ALT_ALN_METHOD.to_string(),
                start_i,
                end_i,
            }
        })
        .collect())
}
coderabbitai bot commented:

🛠️ Refactor suggestion

Duplicate implementation of get_tx_for_region

The get_tx_for_region method added to TxIntervalTrees appears to duplicate logic already present in the Provider implementation. This duplication can lead to maintenance difficulties and inconsistencies.

Consider refactoring to reuse the get_tx_for_region logic from TxIntervalTrees within Provider. For example, modify the Provider's method to delegate to tx_trees.get_tx_for_region:

fn get_tx_for_region(
    &self,
    alt_ac: &str,
    _alt_aln_method: &str,
    start_i: i32,
    end_i: i32,
) -> Result<Vec<TxForRegionRecord>, Error> {
    self.tx_trees.get_tx_for_region(
        &self.tx_seq_db,
        alt_ac,
        _alt_aln_method,
        start_i,
        end_i,
    )
}

This approach reduces code duplication and centralizes the logic.

Comment on lines +126 to +139
let alt_strand = tx.genome_alignments.first().unwrap().strand;
TxForRegionRecord {
    tx_ac: tx.id.clone(),
    alt_ac: alt_ac.to_string(),
    alt_strand: match Strand::try_from(alt_strand).expect("invalid strand") {
        Strand::Plus => 1,
        Strand::Minus => -1,
        _ => unreachable!("invalid strand {}", alt_strand),
    },
    alt_aln_method: ALT_ALN_METHOD.to_string(),
    start_i,
    end_i,
}
})
coderabbitai bot commented:

⚠️ Potential issue

Avoid panics with assert and provide meaningful errors

Using assert_eq! and unwrap() can cause the program to panic if conditions are not met. Instead, handle these cases gracefully by returning errors.

Replace assert_eq! and unwrap() with proper error handling:

-let tx = &tx_seq_db.tx_db.as_ref().expect("no tx_db?").transcripts
-    [*entry.data() as usize];
-assert_eq!(
-    tx.genome_alignments.len(),
-    1,
-    "Can only have one alignment in Mehari"
-);
-let alt_strand = tx.genome_alignments.first().unwrap().strand;
+let tx_db = tx_seq_db.tx_db.as_ref().ok_or_else(|| {
+    Error::InvalidInput("Transcript database is missing".to_string())
+})?;
+let tx = tx_db.transcripts.get(*entry.data() as usize).ok_or_else(|| {
+    Error::InvalidInput("Transcript index out of bounds".to_string())
+})?;
+if tx.genome_alignments.len() != 1 {
+    return Err(Error::InvalidInput(format!(
+        "Expected one genome alignment, found {}",
+        tx.genome_alignments.len()
+    )));
+}
+let alt_strand = tx.genome_alignments.first().ok_or_else(|| {
+    Error::InvalidInput("Genome alignment is missing".to_string())
+})?.strand;

This change prevents unexpected panics and surfaces useful error messages to the caller. Note that using `?` inside the `map` closure also requires collecting into `Result<Vec<_>, Error>` (e.g. via `.collect::<Result<Vec<_>, _>>()`) instead of wrapping the whole expression in `Ok(...)`.

Committable suggestion skipped: line range outside the PR's diff.

@tedil merged commit 9286286 into main on Dec 3, 2024 — 10 checks passed.
@tedil deleted the db-subset-from-vcf branch on December 3, 2024 at 10:45.