feat: refactor db subset CLI, add subset by VCF and subset by TxId options #641
Conversation
Walkthrough: The changes introduce new functionality for querying transcript records based on genomic regions.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (8)
src/db/subset/mod.rs (4)
65-85: Refactor extension handling for clarity and correctness
The current handling of file extensions results in a nested Option<Option<&str>>, which makes the comparison verbose and less readable. Consider using and_then to flatten the options and simplify the logic.
Apply this diff to refactor the extension handling:

let ext = args.path_out.extension().and_then(|s| s.to_str());
let mut writer: Box<dyn Write> = if ext == Some("gz") {
    Box::new(flate2::write::GzEncoder::new(
        file,
        flate2::Compression::default(),
    ))
} else if ext == Some("zst") {
    Box::new(
        zstd::Encoder::new(file, 0)
            .map_err(|e| {
                anyhow::anyhow!(
                    "failed to open zstd encoder for {}: {}",
                    args.path_out.display(),
                    e
                )
            })?
            .auto_finish(),
    )
} else {
    Box::new(file)
};
248-258: Simplify conditional logic in transcript filtering
In the _extract_transcripts_by_hgnc_id function, the conditional check for matching HGNC IDs is complex and could be simplified for readability.
Simplify the condition using a set of normalized HGNC IDs:

let normalized_hgnc_ids: IndexSet<String> = hgnc_ids
    .iter()
    .map(|id| id.trim_start_matches("HGNC:").to_string())
    .collect();

if normalized_hgnc_ids.contains(&tx.gene_symbol)
    || normalized_hgnc_ids.contains(&tx.gene_id.trim_start_matches("HGNC:").to_string())
{
    // ...
}

This approach avoids repetitive string replacement and enhances readability.
302-309: Avoid unnecessary string allocations in chromosome mapping
In the _chrom_to_acc function, the insertion logic into chrom_to_acc is duplicated and harder to follow than necessary.
Optimize the code by collecting the chromosome name variants once and inserting them in a single loop:

-for (acc, chrom) in &acc_to_chrom {
-    let chrom = if chrom.starts_with("chr") {
-        chrom.strip_prefix("chr").unwrap()
-    } else {
-        chrom
-    };
-    chrom_to_acc.insert(chrom.to_string(), acc.clone());
-    chrom_to_acc.insert(format!("chr{}", chrom), acc.clone());
-}
+for (acc, chrom_name) in &acc_to_chrom {
+    let chrom_variants = if let Some(stripped) = chrom_name.strip_prefix("chr") {
+        vec![stripped.to_string(), chrom_name.clone()]
+    } else {
+        vec![chrom_name.clone(), format!("chr{}", chrom_name)]
+    };
+    for chrom in chrom_variants {
+        chrom_to_acc.insert(chrom, acc.clone());
+    }
+}
329-332: Ensure comprehensive test coverage for new selection options
The test test_subset_tx_db only checks subsetting by hgnc_id. Consider adding test cases for the transcript_id and vcf selection options to ensure all code paths are tested.
Would you like assistance in creating additional test cases for the new selection options?
src/annotate/seqvars/provider.rs (4)
109-112: Handle missing contigs more gracefully
In the get_tx_for_region method, if the alt_ac is not found in contig_to_idx, the error returned is NoTranscriptFound. Since the issue is a missing contig rather than a missing transcript, consider using a more descriptive error.
Modify the error to reflect the missing contig:

-.ok_or(Error::NoTranscriptFound(alt_ac.to_string()))?;
+.ok_or_else(|| Error::InvalidInput(format!("Contig '{}' not found in index", alt_ac)))?;

This provides clearer feedback to the user about the nature of the error.
Line range hint 290-310: Combine _acc_to_chrom and _chrom_to_acc functions for efficiency
The functions _acc_to_chrom and _chrom_to_acc perform similar mappings in opposite directions. Consider combining them to build both mappings in a single iteration, reducing redundancy.
Refactor the code as follows:

fn build_chrom_acc_mappings(
    assembly: Assembly,
) -> (IndexMap<String, String>, IndexMap<String, String>) {
    let mut acc_to_chrom = IndexMap::new();
    let mut chrom_to_acc = IndexMap::new();
    for record in &ASSEMBLY_INFOS[assembly].sequences {
        let acc = record.refseq_ac.clone();
        let chrom = record.name.clone();
        acc_to_chrom.insert(acc.clone(), chrom.clone());
        let chrom_variants = if chrom.starts_with("chr") {
            vec![chrom[3..].to_string(), chrom]
        } else {
            vec![chrom.clone(), format!("chr{}", chrom)]
        };
        for chrom_name in chrom_variants {
            chrom_to_acc.insert(chrom_name, acc.clone());
        }
    }
    (acc_to_chrom, chrom_to_acc)
}

Then, adjust your code to use this combined function.
Line range hint 227-237: Use &[String] instead of &Vec<String> in function parameters
In the _extract_transcripts_by_txid function, accepting a slice (&[String]) instead of a reference to a Vec provides more flexibility and aligns with Rust best practices.
Update the function signature:

-fn _extract_transcripts_by_txid(
-    container_tx_db: &TranscriptDb,
-    transcript_ids: &Vec<String>,
-) -> (IndexSet<usize>, IndexSet<String>, IndexSet<String>) {
+fn _extract_transcripts_by_txid(
+    container_tx_db: &TranscriptDb,
+    transcript_ids: &[String],
+) -> (IndexSet<usize>, IndexSet<String>, IndexSet<String>) {

This allows the function to accept any slice of strings, not just Vec<String>.
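As a small standalone illustration of the flexibility gained (not code from this PR; the accessions are purely illustrative), both a Vec and a fixed-size array coerce to the same slice parameter:

// A `&[String]` parameter accepts both Vec-backed and array-backed data.
fn count_ids(ids: &[String]) -> usize {
    ids.len()
}

fn main() {
    let from_vec: Vec<String> = vec!["NM_000546.6".to_string()]; // illustrative accession
    let from_array: [String; 1] = ["NM_007294.4".to_string()];   // illustrative accession
    assert_eq!(count_ids(&from_vec), 1);   // &Vec<String> derefs to &[String]
    assert_eq!(count_ids(&from_array), 1); // arrays coerce to slices as well
}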
Line range hint 248-258: Avoid multiple string allocations in loop
In the for-loop when filtering transcripts, multiple string allocations occur due to tx.gene_id.replace("HGNC:", ""). This can be optimized.
Instead of calling replace in each iteration, normalize the gene_id once:

let gene_id_normalized = tx.gene_id.trim_start_matches("HGNC:").to_string();
if hgnc_ids.contains(&tx.gene_symbol)
    || hgnc_ids.contains(&tx.gene_id)
    || hgnc_ids.contains(&gene_id_normalized)
{
    // ...
}

This reduces unnecessary allocations and improves performance.
        if let (Some(Ok(start)), Ok(end)) = (record.variant_start(), record.variant_end(&header)) {
            let chrom = record.reference_sequence_name().to_string();
            let accession = chrom_to_acc.get(&chrom).unwrap_or(&chrom);
⋮
    tracing::info!("... and writing ...");
    // Open file and if necessary, wrap in a decompressor.
    let file = std::fs::File::create(&args.path_out)
        .map_err(|e| anyhow::anyhow!("failed to create file {}: {}", args.path_out.display(), e))?;
    let ext = args.path_out.extension().map(|s| s.to_str());
    let mut writer: Box<dyn Write> = if ext == Some(Some("gz")) {
        Box::new(flate2::write::GzEncoder::new(
            file,
            flate2::Compression::default(),
        ))
    } else if ext == Some(Some("zst")) {
        Box::new(
            zstd::Encoder::new(file, 0)
                .map_err(|e| {
                    anyhow::anyhow!(
                        "failed to open zstd encoder for {}: {}",
                        args.path_out.display(),
                        e
                    )
                })?
                .auto_finish(),
        )
    } else {
        Box::new(file)
    };
    writer
        .write_all(&buf)
        .map_err(|e| anyhow::anyhow!("failed to write to {}: {}", args.path_out.display(), e))?;
            let txs = trees.get_tx_for_region(
                container,
                accession,
                "splign",
                usize::from(start).try_into()?,
                usize::from(end).try_into()?,
            )?;
            transcript_ids.extend(txs.into_iter().map(|tx| tx.tx_ac));
        }
    }
Correct type conversions for start and end variables
There is a potential issue with type conversions for start and end. The variables start and end are i32, but you're attempting to convert them using usize::from(start), which expects a usize input. This may lead to runtime errors or incorrect behavior when dealing with negative values.
Apply this diff to correct the type conversions:
- usize::from(start).try_into()?,
- usize::from(end).try_into()?,
+ start.try_into()?,
+ end.try_into()?,
Ensure that start and end are valid non-negative integers before performing the conversion to usize:
let start_usize = if start >= 0 {
start as usize
} else {
return Err(anyhow::anyhow!("Start index is negative: {}", start));
};
let end_usize = if end >= 0 {
end as usize
} else {
return Err(anyhow::anyhow!("End index is negative: {}", end));
};
Then use start_usize and end_usize in the function call.
no, start and end are of type Position from the noodles core crate, which in turn implements conversions to and from usize; so we first obtain a usize from a Position, then convert that usize into an i32
@tedil Thank you for the clarification! I apologize for the misunderstanding. Since start and end are of type Position from the noodles core crate, and they implement conversions to and from usize, your approach makes sense. I'll make a note of this for future reviews.
✏️ Learnings added
Learnt from: tedil
PR: varfish-org/mehari#641
File: src/db/subset/mod.rs:199-212
Timestamp: 2024-12-03T10:07:58.852Z
Learning: In `src/db/subset/mod.rs`, the variables `start` and `end` are of type `Position` from the noodles core crate, which implements conversions to and from `usize`. Thus, converting them from `Position` to `usize`, and then to `i32`, is appropriate.
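For reference, a minimal sketch of the conversion chain described in this thread, assuming the noodles umbrella crate re-exports the core crate's Position as noodles::core::Position; the helper and example value are illustrative, not taken from the PR:

use noodles::core::Position;

/// Convert a 1-based noodles `Position` into an `i32` coordinate.
/// `Position` -> `usize` is infallible; `usize` -> `i32` is checked.
fn position_to_i32(pos: Position) -> anyhow::Result<i32> {
    let as_usize = usize::from(pos);
    Ok(i32::try_from(as_usize)?)
}

fn main() -> anyhow::Result<()> {
    // Positions are 1-based and non-zero, hence the Option returned by `new`.
    let pos = Position::new(12345).expect("non-zero position");
    assert_eq!(position_to_i32(pos)?, 12345);
    Ok(())
}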
fn _extract_transcripts_by_region(
    container: &TxSeqDatabase,
    vcf: &PathBuf,
) -> Result<(IndexSet<usize>, IndexSet<String>, IndexSet<String>)> {
    let container_tx_db = container.tx_db.as_ref().expect("no tx_db");
    let trees = TxIntervalTrees::new(container);
⋮
    tracing::info!("Subsetting ...");
    let tx_db = subset_tx_db(&tx_db, &args.gene_symbols)?;
    let chrom_to_acc = _chrom_to_acc(container.assembly());
⋮
    tracing::info!("... and encoding ...");
    let mut buf = Vec::with_capacity(tx_db.encoded_len());
    tx_db
        .encode(&mut buf)
        .map_err(|e| anyhow::anyhow!("failed to encode: {}", e))?;
    let mut transcript_ids = IndexSet::<String>::new();
    let mut reader = noodles::vcf::io::reader::Builder::default().build_from_path(vcf)?;
    let header = reader.read_header()?;
    for result in reader.records() {
        let record = result?;
        if let (Some(Ok(start)), Ok(end)) = (record.variant_start(), record.variant_end(&header)) {
            let chrom = record.reference_sequence_name().to_string();
            let accession = chrom_to_acc.get(&chrom).unwrap_or(&chrom);
⋮
    tracing::info!("... and writing ...");
    // Open file and if necessary, wrap in a decompressor.
    let file = std::fs::File::create(&args.path_out)
        .map_err(|e| anyhow::anyhow!("failed to create file {}: {}", args.path_out.display(), e))?;
    let ext = args.path_out.extension().map(|s| s.to_str());
    let mut writer: Box<dyn Write> = if ext == Some(Some("gz")) {
        Box::new(flate2::write::GzEncoder::new(
            file,
            flate2::Compression::default(),
        ))
    } else if ext == Some(Some("zst")) {
        Box::new(
            zstd::Encoder::new(file, 0)
                .map_err(|e| {
                    anyhow::anyhow!(
                        "failed to open zstd encoder for {}: {}",
                        args.path_out.display(),
                        e
                    )
                })?
                .auto_finish(),
        )
    } else {
        Box::new(file)
    };
    writer
        .write_all(&buf)
        .map_err(|e| anyhow::anyhow!("failed to write to {}: {}", args.path_out.display(), e))?;
            let txs = trees.get_tx_for_region(
                container,
                accession,
                "splign",
                usize::from(start).try_into()?,
                usize::from(end).try_into()?,
            )?;
            transcript_ids.extend(txs.into_iter().map(|tx| tx.tx_ac));
        }
    }
    if transcript_ids.is_empty() {
        return Err(anyhow::anyhow!(
            "no transcripts overlapping VCF variants found in transcript database"
        ));
    }
    let (tx_idxs, gene_ids) = __extract_transcripts_from_db(container_tx_db, &transcript_ids);
    Ok((
        IndexSet::<usize>::from_iter(tx_idxs),
        transcript_ids,
        gene_ids,
    ))
}
Handle missing accession mappings gracefully
In the _extract_transcripts_by_region function, the code assumes that every chromosome name in the VCF will have a corresponding accession in the chrom_to_acc mapping. If a chromosome is not found, the unwrap_or fallback uses the chromosome name itself. This may cause issues if the chromosome names in the VCF do not match the expected format.
Consider adding error handling for unmapped chromosomes:
-let accession = chrom_to_acc.get(&chrom).unwrap_or(&chrom);
+let accession = chrom_to_acc.get(&chrom).ok_or_else(|| {
+ anyhow::anyhow!("Chromosome '{}' not found in accession mapping", chrom)
+})?;
This will ensure that the code fails fast with a clear error message if an unmapped chromosome is encountered.
📝 Committable suggestion
fn _extract_transcripts_by_region(
    container: &TxSeqDatabase,
    vcf: &PathBuf,
) -> Result<(IndexSet<usize>, IndexSet<String>, IndexSet<String>)> {
    let container_tx_db = container.tx_db.as_ref().expect("no tx_db");
    let trees = TxIntervalTrees::new(container);
    let chrom_to_acc = _chrom_to_acc(container.assembly());
    let mut transcript_ids = IndexSet::<String>::new();
    let mut reader = noodles::vcf::io::reader::Builder::default().build_from_path(vcf)?;
    let header = reader.read_header()?;
    for result in reader.records() {
        let record = result?;
        if let (Some(Ok(start)), Ok(end)) = (record.variant_start(), record.variant_end(&header)) {
            let chrom = record.reference_sequence_name().to_string();
            let accession = chrom_to_acc.get(&chrom).ok_or_else(|| {
                anyhow::anyhow!("Chromosome '{}' not found in accession mapping", chrom)
            })?;
            let txs = trees.get_tx_for_region(
                container,
                accession,
                "splign",
                usize::from(start).try_into()?,
                usize::from(end).try_into()?,
            )?;
            transcript_ids.extend(txs.into_iter().map(|tx| tx.tx_ac));
        }
    }
    if transcript_ids.is_empty() {
        return Err(anyhow::anyhow!(
            "no transcripts overlapping VCF variants found in transcript database"
        ));
    }
    let (tx_idxs, gene_ids) = __extract_transcripts_from_db(container_tx_db, &transcript_ids);
    Ok((
        IndexSet::<usize>::from_iter(tx_idxs),
        transcript_ids,
        gene_ids,
    ))
}
    pub fn get_tx_for_region(
        &self,
        tx_seq_db: &TxSeqDatabase,
        alt_ac: &str,
        _alt_aln_method: &str,
        start_i: i32,
        end_i: i32,
    ) -> Result<Vec<TxForRegionRecord>, Error> {
        let contig_idx = *self
            .contig_to_idx
            .get(alt_ac)
            .ok_or(Error::NoTranscriptFound(alt_ac.to_string()))?;
        let query = start_i..end_i;
        let tx_idxs = self.trees[contig_idx].find(query);

        Ok(tx_idxs
            .iter()
            .map(|entry| {
                let tx = &tx_seq_db.tx_db.as_ref().expect("no tx_db?").transcripts
                    [*entry.data() as usize];
                assert_eq!(
                    tx.genome_alignments.len(),
                    1,
                    "Can only have one alignment in Mehari"
                );
                let alt_strand = tx.genome_alignments.first().unwrap().strand;
                TxForRegionRecord {
                    tx_ac: tx.id.clone(),
                    alt_ac: alt_ac.to_string(),
                    alt_strand: match Strand::try_from(alt_strand).expect("invalid strand") {
                        Strand::Plus => 1,
                        Strand::Minus => -1,
                        _ => unreachable!("invalid strand {}", alt_strand),
                    },
                    alt_aln_method: ALT_ALN_METHOD.to_string(),
                    start_i,
                    end_i,
                }
            })
            .collect())
    }
🛠️ Refactor suggestion
Duplicate implementation of get_tx_for_region
The get_tx_for_region method added to TxIntervalTrees appears to duplicate logic already present in the Provider implementation. This duplication can lead to maintenance difficulties and inconsistencies.
Consider refactoring to reuse the get_tx_for_region logic from TxIntervalTrees within Provider. For example, modify the Provider's method to delegate to tx_trees.get_tx_for_region:
fn get_tx_for_region(
&self,
alt_ac: &str,
_alt_aln_method: &str,
start_i: i32,
end_i: i32,
) -> Result<Vec<TxForRegionRecord>, Error> {
self.tx_trees.get_tx_for_region(
&self.tx_seq_db,
alt_ac,
_alt_aln_method,
start_i,
end_i,
)
}
This approach reduces code duplication and centralizes the logic.
                let alt_strand = tx.genome_alignments.first().unwrap().strand;
                TxForRegionRecord {
                    tx_ac: tx.id.clone(),
                    alt_ac: alt_ac.to_string(),
                    alt_strand: match Strand::try_from(alt_strand).expect("invalid strand") {
                        Strand::Plus => 1,
                        Strand::Minus => -1,
                        _ => unreachable!("invalid strand {}", alt_strand),
                    },
                    alt_aln_method: ALT_ALN_METHOD.to_string(),
                    start_i,
                    end_i,
                }
            })
Avoid panics with assert and provide meaningful errors
Using assert_eq! and unwrap() can cause the program to panic if conditions are not met. Instead, handle these cases gracefully by returning errors.
Replace assert_eq! and unwrap() with proper error handling:
-let tx = &tx_seq_db.tx_db.as_ref().expect("no tx_db?").transcripts
- [*entry.data() as usize];
-assert_eq!(
- tx.genome_alignments.len(),
- 1,
- "Can only have one alignment in Mehari"
-);
-let alt_strand = tx.genome_alignments.first().unwrap().strand;
+let tx_db = tx_seq_db.tx_db.as_ref().ok_or_else(|| {
+ Error::InvalidInput("Transcript database is missing".to_string())
+})?;
+let tx = tx_db.transcripts.get(*entry.data() as usize).ok_or_else(|| {
+ Error::InvalidInput("Transcript index out of bounds".to_string())
+})?;
+if tx.genome_alignments.len() != 1 {
+ return Err(Error::InvalidInput(format!(
+ "Expected one genome alignment, found {}",
+ tx.genome_alignments.len()
+ )));
+}
+let alt_strand = tx.genome_alignments.first().ok_or_else(|| {
+ Error::InvalidInput("Genome alignment is missing".to_string())
+})?.strand;
This change prevents unexpected panics and provides useful error messages to the caller.
Committable suggestion skipped: line range outside the PR's diff.
This adds a little bit of code duplication (get_tx_for_region, _chrom_to_acc and _acc_to_chrom) but makes it way easier to create test-cases (as you will now get the exact subset of the txdb you need to test the input VCF).

Summary by CodeRabbit
New Features
Bug Fixes
Refactor