fix: suppress unknown consequence errors #491

Merged
holtgrewe merged 2 commits into main from fix-suppress-unknown-consequences on Oct 9, 2024

Conversation

Contributor

@holtgrewe holtgrewe commented Oct 9, 2024

Summary by CodeRabbit

  • New Features

    • Introduced the seqvars ingest subcommand for improved processing of VCF files and variant data.
    • Enhanced consequence handling by filtering only valid mappings in output headers and gene-related annotations.
    • Added the "strucvars query" subcommand with improved result handling and structural variant processing.
  • Bug Fixes

    • Improved robustness by ensuring only valid consequences are included, preventing potential errors during conversion.
  • Documentation

    • Updated comments and documentation for clarity on recent changes.
    • Enhanced logging details to improve traceability during execution.

Contributor

coderabbitai bot commented Oct 9, 2024

Walkthrough

The changes in this pull request involve modifications across three primary files: src/seqvars/ingest/mod.rs, src/seqvars/query/mod.rs, and src/strucvars/query/mod.rs. The first file simplifies the load_tx_db function call by removing the format! macro, enhancing code clarity without altering functionality. The second file refines the handling of variant consequences, employing filter_map for better filtering in the output header and gene-related annotations. The third file introduces new structures and functions for processing structural variants, enhancing the overall functionality of the "strucvars query" subcommand.
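The effect of switching to filter_map can be shown with a small sketch. The enums below are hypothetical stand-ins, not the worker's actual types, and the strict variant is only one assumed way in which a map-and-collect pipeline can fail on an unmapped value.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the protobuf consequence enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consequence {
    MissenseVariant = 1,
    StopGained = 2,
}

/// Hypothetical stand-in for the annotator's larger set of consequences.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AnnotatedConsequence {
    MissenseVariant,
    StopGained,
    /// Has no counterpart in `Consequence`.
    Unmapped,
}

impl TryFrom<AnnotatedConsequence> for Consequence {
    type Error = String;

    fn try_from(value: AnnotatedConsequence) -> Result<Self, Self::Error> {
        match value {
            AnnotatedConsequence::MissenseVariant => Ok(Consequence::MissenseVariant),
            AnnotatedConsequence::StopGained => Ok(Consequence::StopGained),
            other => Err(format!("no mapping for {:?}", other)),
        }
    }
}

/// Strict conversion: a single unmapped value fails the whole batch.
pub fn convert_strict(input: &[AnnotatedConsequence]) -> Result<Vec<i32>, String> {
    input
        .iter()
        .map(|csq| Consequence::try_from(*csq).map(|c| c as i32))
        .collect()
}

/// Lenient conversion: unmapped values are skipped via `filter_map`.
pub fn convert_lenient(input: &[AnnotatedConsequence]) -> Vec<i32> {
    input
        .iter()
        .filter_map(|csq| Consequence::try_from(*csq).ok())
        .map(|c| c as i32)
        .collect()
}
```

For an input containing an unmapped value, convert_strict returns an error, while convert_lenient returns only the values that have a mapping.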

Changes

File Path: Change Summary

src/seqvars/ingest/mod.rs
  - Simplified load_tx_db function call by removing the format! macro and reference operator (&).
  - Various comments and documentation updates without functional changes.
src/seqvars/query/mod.rs
  - Updated consequence handling in the output header to filter valid consequences using filter_map instead of map and collect.
  - Adjusted write_header function to reflect new filtering logic.
  - Similar updates in gene_related_annotation module.
  - Method signature updates for write_header and consequences functions.
src/strucvars/query/mod.rs
  - Added new structures: ResultPayload, Gene, GeneTranscriptEffects, QueryStats.
  - Enhanced run_query function for processing structural variants.
  - Changed logging level from debug to trace.
  - Added utility functions and defined load_databases.
  - Updated method signature for run function.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Ingest
    participant Query
    participant Strucvars

    User->>Ingest: seqvars ingest command
    Ingest->>Ingest: Load transcript database
    Ingest-->>User: Processed VCF files

    User->>Query: Request variant consequences
    Query->>Query: Filter consequences
    Query-->>User: Return valid consequences

    User->>Strucvars: strucvars query command
    Strucvars->>Strucvars: Load databases
    Strucvars->>Strucvars: Process structural variants
    Strucvars-->>User: Return structural variant results

🐰 In the code we hop and play,
🐇 Simplified paths light the way.
🌟 Consequences now shine so bright,
🎉 Filtering wrongs, making it right!
🥕 With each change, we dance and cheer,
🐇 Code so clear, let’s spread the cheer!



codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 22.22222% with 7 lines in your changes missing coverage. Please review.

Project coverage is 74%. Comparing base (501e6e0) to head (3197b96).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
src/seqvars/query/mod.rs | 0% | 7 Missing ⚠️
Additional details and impacted files
@@         Coverage Diff         @@
##           main   #491   +/-   ##
===================================
- Coverage    74%    74%   -1%     
===================================
  Files        42     42           
  Lines      7365   7366    +1     
===================================
  Hits       5470   5470           
- Misses     1895   1896    +1     
Files with missing lines | Coverage Δ
src/seqvars/ingest/mod.rs | 88% <100%> (ø)
src/strucvars/query/mod.rs | 54% <100%> (ø)
src/seqvars/query/mod.rs | 15% <0%> (-1%) ⬇️

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
src/seqvars/ingest/mod.rs (1)

326-326: LGTM! Consider using a const for the file name.

Removing the & operator is correct: the call now passes the owned String produced by format! to load_tx_db instead of a reference to a temporary. This aligns with the updated function signature.

For improved clarity and maintainability, consider extracting the "txs.bin.zst" file name into a constant:

+const TRANSCRIPT_DB_FILENAME: &str = "txs.bin.zst";

 let tx_db = mehari::annotate::seqvars::load_tx_db(format!(
-    "{}/{}/txs.bin.zst",
+    "{}/{}/{}",
     &args.path_mehari_db,
-    path_component(args.genomebuild)
+    path_component(args.genomebuild),
+    TRANSCRIPT_DB_FILENAME
 ))?;

This change would make it easier to update the file name in the future if needed.
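As an alternative to string formatting, the same path could be assembled from components. This is a minimal sketch only: transcript_db_path and its parameters are hypothetical, genome_build stands in for the result of path_component(args.genomebuild), and whether load_tx_db accepts a path type rather than a String is not confirmed here.

```rust
use std::path::{Path, PathBuf};

/// File name of the transcript database inside a genome build directory.
pub const TRANSCRIPT_DB_FILENAME: &str = "txs.bin.zst";

/// Builds `<db_root>/<genome_build>/txs.bin.zst`.
///
/// `genome_build` is a hypothetical stand-in for the value returned by
/// `path_component(args.genomebuild)` in the reviewed code.
pub fn transcript_db_path(db_root: &str, genome_build: &str) -> PathBuf {
    Path::new(db_root)
        .join(genome_build)
        .join(TRANSCRIPT_DB_FILENAME)
}
```

Joining components avoids hand-written separators, and the file name still lives in a single constant.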

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 501e6e0 and 971319b.

📒 Files selected for processing (2)
  • src/seqvars/ingest/mod.rs (1 hunks)
  • src/seqvars/query/mod.rs (2 hunks)
🧰 Additional context used

src/seqvars/query/mod.rs (resolved review thread)
@holtgrewe holtgrewe enabled auto-merge (squash) October 9, 2024 07:44
@holtgrewe holtgrewe merged commit c95ed79 into main Oct 9, 2024
10 of 12 checks passed
@holtgrewe holtgrewe deleted the fix-suppress-unknown-consequences branch October 9, 2024 07:49
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
src/strucvars/query/mod.rs (5)

Line range hint 12-12: Inconsistent Use of Logging Frameworks

The code imports and uses both log and tracing crates for logging:

  • Line 12: use log::warn;
  • Line 571: tracing::info!("args_common = {:?}", &args_common);
  • Line 623: tracing::info!("Loading mehari tx database...");

Mixing different logging frameworks can lead to inconsistencies in log formatting and output destinations. It's recommended to standardize on a single logging framework throughout the codebase.

Suggested Action: Refactor the code to consistently use the tracing crate for all logging purposes, replacing any usage of log macros with their tracing equivalents.

Also applies to: 623-623, 571-571


Line range hint 12-12: Inconsistent Error Handling and Lack of Warnings

  • Line 12: Imports log::warn but does not use it consistently throughout the code.
  • Lines 158-185 (resolve_hgvs_id function): When an hgvs_id is not found, the function returns a Gene with mostly None fields without logging a warning or error.
  • Lines 371-378 (compute_tx_effects_for_breakpoint function): If the chromosome is not found in chrom_to_acc, the function silently returns a default value without logging.

Suggested Action:

  • For resolve_hgvs_id: Add a warning log when an hgvs_id cannot be resolved to inform users of potential data issues.

    if let Some(record_idxs) = record_idxs {
        // Existing code...
    } else {
        warn!("HGVS ID '{}' could not be resolved", hgvs_id);
        // Existing code...
    }
  • For compute_tx_effects_for_breakpoint: Log a warning when the chromosome is not found.

    let chrom = chrom_to_acc.get(&annonars::common::cli::canonicalize(&sv.chrom));
    if chrom.is_none() {
        warn!("Chromosome '{}' not found in chrom_to_acc", sv.chrom);
        return Default::default();
    }
  • Ensure that the log::warn is replaced with tracing::warn if standardizing on the tracing crate.

Also applies to: 158-185, 371-378


Line range hint 465-481: Possible Panic Due to Unchecked Chromosome Index

In the overlapping_hgnc_ids function:

let tree = &tx_idx.trees[chrom_idx];

Accessing tx_idx.trees without checking if chrom_idx is within bounds may lead to a panic if an invalid chrom_idx is provided.

Suggested Action: Add validation to ensure chrom_idx is within the bounds of tx_idx.trees.

if chrom_idx >= tx_idx.trees.len() {
    return Vec::new(); // or handle the error appropriately
}
let tree = &tx_idx.trees[chrom_idx];

Alternatively, match on tx_idx.trees.get(chrom_idx) to safely access the tree.
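The get-based alternative might look like the following sketch. TxIndex here is a hypothetical stand-in that stores plain interval lists instead of the real interval trees, and the function signature is assumed for illustration.

```rust
/// Hypothetical stand-in for the transcript index: one list of
/// (start, end, hgnc_id) intervals per chromosome.
pub struct TxIndex {
    pub trees: Vec<Vec<(i32, i32, String)>>,
}

/// Returns the HGNC IDs of intervals overlapping the half-open range
/// `[start, end)`. An out-of-range `chrom_idx` yields an empty result
/// instead of a panic.
pub fn overlapping_hgnc_ids(
    tx_idx: &TxIndex,
    chrom_idx: usize,
    start: i32,
    end: i32,
) -> Vec<String> {
    // `get` returns `None` for an invalid index, where `[]` would panic.
    let tree = match tx_idx.trees.get(chrom_idx) {
        Some(tree) => tree,
        None => return Vec::new(),
    };
    tree.iter()
        .filter(|(s, e, _)| *s < end && start < *e)
        .map(|(_, _, id)| id.clone())
        .collect()
}
```

Whether an empty result or an explicit error is the better response to an invalid index depends on how callers treat unknown chromosomes.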


Line range hint 548-562: Error Context Missing in load_databases Function

The load_databases function propagates errors from multiple database loading functions using the ? operator:

Ok(InMemoryDbs {
    bg_dbs: load_bg_dbs(path_worker_db, genome_release)?,
    patho_dbs: load_patho_dbs(path_worker_db, genome_release)?,
    tad_sets: load_tads(path_worker_db, genome_release, max_tad_distance)?,
    masked: load_masked_dbs(path_worker_db, genome_release)?,
    genes: load_gene_db(path_worker_db, genome_release)?,
    clinvar_sv: load_clinvar_sv(path_worker_db, genome_release)?,
})

If an error occurs, it might be unclear which database failed to load.

Suggested Action: Add context to each error using the with_context method from the anyhow crate to indicate which database failed.

Example:

Ok(InMemoryDbs {
    bg_dbs: load_bg_dbs(path_worker_db, genome_release)
        .with_context(|| "Failed to load background databases")?,
    patho_dbs: load_patho_dbs(path_worker_db, genome_release)
        .with_context(|| "Failed to load pathogenic databases")?,
    // ... similarly for other databases
})

This will provide more informative error messages, aiding in debugging.
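The same idea can be sketched with the standard library alone, which makes the mechanism visible. Everything below is hypothetical: LoadError, with_db_context, and the two loaders are illustrative stand-ins, and the real function takes more parameters and loads more databases.

```rust
use std::fmt;

/// Error that records which database failed to load.
#[derive(Debug)]
pub struct LoadError {
    pub database: &'static str,
    pub cause: String,
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "failed to load {} database: {}", self.database, self.cause)
    }
}

impl std::error::Error for LoadError {}

/// Attaches the database name to any error returned by a loader.
pub fn with_db_context<T, E: fmt::Display>(
    database: &'static str,
    result: Result<T, E>,
) -> Result<T, LoadError> {
    result.map_err(|e| LoadError {
        database,
        cause: e.to_string(),
    })
}

/// Hypothetical loader that succeeds.
pub fn load_bg_dbs(path: &str) -> Result<Vec<String>, String> {
    Ok(vec![format!("{}/bg", path)])
}

/// Hypothetical loader that always fails, to demonstrate the context.
pub fn load_patho_dbs(path: &str) -> Result<Vec<String>, String> {
    Err(format!("{}/patho: file not found", path))
}

pub struct InMemoryDbs {
    pub bg_dbs: Vec<String>,
    pub patho_dbs: Vec<String>,
}

pub fn load_databases(path: &str) -> Result<InMemoryDbs, LoadError> {
    Ok(InMemoryDbs {
        bg_dbs: with_db_context("background", load_bg_dbs(path))?,
        patho_dbs: with_db_context("pathogenic", load_patho_dbs(path))?,
    })
}
```

The resulting error names the failing database in both its message and its database field, which is the information the bare ? operator would otherwise lose.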


Line range hint 725-753: Missing Assertions in the Test Module

In the smoke_test function:

insta::assert_snapshot!(std::fs::read_to_string(args.path_output.as_str())?);

There's an assertion to compare the output file to a snapshot. However, there are no additional assertions to verify the test results or to clean up the temporary directory after the test runs.

Suggested Action: Consider adding assertions to check for expected outcomes and ensure that the temporary directory is properly managed.

// Additional assertions can be added here
assert!(std::path::Path::new(&args.path_output).exists(), "Output file was not created");

// Clean up the temporary directory at the end of the test
drop(tmpdir);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 971319b and 3197b96.

📒 Files selected for processing (2)
  • src/seqvars/query/mod.rs (5 hunks)
  • src/strucvars/query/mod.rs (1 hunks)
🧰 Additional context used
🔇 Additional comments (5)
src/seqvars/query/mod.rs (4)

299-299: Improved tracing for better debugging

The addition of this trace log is a good improvement. It provides more detailed information during the processing of each record, which can be valuable for debugging and monitoring the application's behavior.


427-427: Enhanced logging for file operations

This addition of a debug log is beneficial. It clearly indicates when the program starts writing to the noheader file, which can be useful for tracking the progress of the operation and for debugging purposes.


481-481: Improved logging for output file writing

This addition of a debug log is a good improvement. It clearly indicates when the program starts writing to the output file, which enhances the ability to track the progress of the operation and aids in debugging.


534-545: 🛠️ Refactor suggestion

Consider refactoring consequence filtering logic

While the formatting has been improved, the underlying logic remains the same. As suggested in a previous review, consider refactoring this consequence filtering logic into a helper function. This would enhance maintainability and reduce code duplication, especially since similar logic appears elsewhere in the file (around lines 743-751).

Here's a reminder of the suggested refactoring:

  1. Create a helper function:
fn convert_consequence<T>(csq: T) -> Option<i32>
where
    T: TryInto<pbs_query::Consequence>,
{
    csq.try_into().ok().map(|csq| csq as i32)
}
  2. Update this code segment:
-                .filter_map(|(csq, count)| -> Option<pbs_output::ConsequenceCount> {
-                    // We ignore consequences that don't have a mapping into the protobuf.
-                    if let Ok(csq) = TryInto::<pbs_query::Consequence>::try_into(*csq) {
-                        Some(pbs_output::ConsequenceCount {
-                            consequence: csq as i32,
-                            count: *count as u32,
-                        })
-                    } else {
-                        None
-                    }
-                })
+                .filter_map(|(csq, count)| {
+                    convert_consequence(*csq).map(|csq| pbs_output::ConsequenceCount {
+                        consequence: csq,
+                        count: *count as u32,
+                    })
+                })

This refactoring would also apply to the similar code around lines 743-751.
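To show how one helper could serve both call sites, here is a self-contained sketch. Consequence and ConsequenceCount are hypothetical stand-ins for pbs_query::Consequence and pbs_output::ConsequenceCount, raw u8 values stand in for the annotator's consequence type, and consequence_counts and consequence_list are illustrative names.

```rust
use std::collections::BTreeMap;
use std::convert::{TryFrom, TryInto};

/// Hypothetical stand-in for `pbs_query::Consequence`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consequence {
    MissenseVariant = 1,
    StopGained = 2,
}

impl TryFrom<u8> for Consequence {
    type Error = ();

    fn try_from(raw: u8) -> Result<Self, ()> {
        match raw {
            1 => Ok(Consequence::MissenseVariant),
            2 => Ok(Consequence::StopGained),
            _ => Err(()),
        }
    }
}

/// Hypothetical stand-in for `pbs_output::ConsequenceCount`.
#[derive(Debug, PartialEq, Eq)]
pub struct ConsequenceCount {
    pub consequence: i32,
    pub count: u32,
}

/// Converts a value to its protobuf integer, or `None` if it has no mapping.
pub fn convert_consequence<T>(csq: T) -> Option<i32>
where
    T: TryInto<Consequence>,
{
    let converted: Result<Consequence, _> = csq.try_into();
    converted.ok().map(|csq| csq as i32)
}

/// First call site: per-consequence counts, as in the output header.
pub fn consequence_counts(counts: &BTreeMap<u8, usize>) -> Vec<ConsequenceCount> {
    counts
        .iter()
        .filter_map(|(csq, count)| {
            convert_consequence(*csq).map(|consequence| ConsequenceCount {
                consequence,
                count: *count as u32,
            })
        })
        .collect()
}

/// Second call site: a plain list of consequences, as in gene annotations.
pub fn consequence_list(raw: &[u8]) -> Vec<i32> {
    raw.iter().copied().filter_map(convert_consequence).collect()
}
```

Both functions drop values without a mapping, so the "ignore unknown consequences" rule is defined in exactly one place.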

src/strucvars/query/mod.rs (1)

255-255: Potential Performance Impact Due to Trace-Level Logging

Line 255 introduces a trace-level logging statement:

tracing::trace!("processing record {:?}", record_sv);

Using trace-level logging can generate a large volume of log data, which may impact performance and clutter log files, especially when processing many records.

Suggested Action: Verify if this granular level of logging is necessary in a production environment. If detailed logging is required for debugging purposes, consider:

  • Making the logging level configurable via command-line arguments or configuration files.
  • Using the debug level instead if it provides sufficient detail.

@holtgrewe holtgrewe restored the fix-suppress-unknown-consequences branch October 9, 2024 11:04
@holtgrewe holtgrewe deleted the fix-suppress-unknown-consequences branch October 9, 2024 11:04