-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: only consider MAPPING/ALIGNING SUMMARY entries from dragen qc summary file #134
Conversation
Caution Review failedThe pull request is closed. WalkthroughThe changes in this pull request involve modifications to the Changes
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
varfish_cli/cli/tools/__init__.py (2)
27-27
: Consider separating fieldnames for different file types.The same fieldnames are used for parsing both QC coverage and mapping metrics files. This coupling could make the code brittle if the file formats evolve differently.
Consider defining separate fieldnames for each file type:
- fieldnames = ["_summary", "_empty", "label", "value"] + qc_coverage_fieldnames = ["_summary", "_empty", "label", "value"] + mapping_metrics_fieldnames = ["_summary", "_empty", "label", "value"]Also, consider adding validation for the CSV structure before parsing to fail fast if the file format changes.
Line range hint
77-91
: Improve handling of NA values in metrics.The PR objectives mention handling NA values, but the current error handling doesn't distinguish between NA values and other invalid values. This could make debugging harder when NA values are encountered.
Consider adding specific handling for NA values:
key_insert_length_mean = "Insert length: mean" try: + value = mapping_metrics[key_insert_length_mean] + if value.upper() == "NA": + logger.warning(f"NA value found for {key_insert_length_mean}, using 0.0") + insert_length_mean = 0.0 + else: + insert_length_mean = float(value) - insert_length_mean = float(mapping_metrics[key_insert_length_mean]) except (KeyError, ValueError) as e: - logger.error(f"'{key_insert_length_mean}' not in {mapping_metrics_path} or no int") + logger.error( + f"'{key_insert_length_mean}' not in {mapping_metrics_path} or invalid value: {str(e)}" + ) raise typer.Exit() from eApply similar changes to other metric parsing blocks to handle NA values consistently.
mapping_metrics = { | ||
row["label"]: row["value"] | ||
for row in reader | ||
if row["_summary"] == "MAPPING/ALIGNING SUMMARY" and row["_empty"] == "" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Handle NA values in mapping metrics dictionary construction
The verification reveals that NA values are present in several important metrics, but the current implementation doesn't explicitly handle them. This could lead to potential issues when these values are used downstream.
- The code should explicitly handle 'NA' string values when constructing the mapping_metrics dictionary
- Multiple metrics contain NA values, including:
- "Number of duplicate marked and mate reads removed"
- "Reads with MAPQ NA"
- "Bases in target bed [% of genome]"
- "Provided sex chromosome ploidy"
🔗 Analysis chain
Verify handling of Dragen output variations.
Let's verify that the implementation handles different variations of Dragen output files correctly.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for variations in Dragen output files to ensure robust parsing
# Test 1: Check for different summary types in mapping metrics files
echo "Checking for different summary types in mapping metrics files..."
rg -A 1 "^_summary" --no-filename | sort -u
# Test 2: Check for multiple occurrences of insert length fields
echo "Checking for multiple insert length entries..."
rg -B 1 "Insert length: mean" --no-filename
# Test 3: Check for NA values in metrics
echo "Checking for NA values in metrics..."
rg -i "\bna\b" --no-filename
Length of output: 2768
🛠️ Refactor suggestion
Add logging and validation for mapping metrics filtering.
While the filtering correctly implements the requirement to only consider "MAPPING/ALIGNING SUMMARY" entries, there are several improvements that could make the code more robust and maintainable:
- The purpose of the
row["_empty"] == ""
condition is not documented - Silent filtering could make debugging difficult if no rows match the criteria
- No validation ensures that required metrics are present after filtering
Consider implementing these improvements:
mapping_metrics: Dict[str, str] = {}
+ required_metrics = {
+ "Total input reads",
+ "Number of duplicate marked reads",
+ "Insert length: mean",
+ "Insert length: standard deviation"
+ }
+ filtered_count = 0
with open(mapping_metrics_path, "rt") as coverage_metrics_file:
reader = csv.DictReader(coverage_metrics_file, fieldnames=fieldnames, delimiter=",")
mapping_metrics = {
row["label"]: row["value"]
for row in reader
if row["_summary"] == "MAPPING/ALIGNING SUMMARY" and row["_empty"] == ""
}
+ logger.debug(
+ f"Filtered {filtered_count} non-MAPPING/ALIGNING SUMMARY entries from {mapping_metrics_path}"
+ )
+ missing_metrics = required_metrics - mapping_metrics.keys()
+ if missing_metrics:
+ logger.error(
+ f"Required metrics {missing_metrics} not found in {mapping_metrics_path}"
+ )
+ raise typer.Exit()
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I've also confirmed with LB that we should use the values from the 'Summary' section, not the 'per RG' ones.
It seems that some test snapshots are no longer valid then ( https://github.com/varfish-org/varfish-cli/actions/runs/11668684344/job/32488937744?pr=134 ); in other words, we might not always have used the correct entries? |
Fixes #131
However, I assume that the
MAPPING/ALIGNING SUMMARY
entries were supposed to be used, when in fact theMAPPING/ALIGNING PER RG
entries were the ones actually ending up being used. Not sure if these were actually different in the old files, but in the new files, they certainly are, since they are partitioned into the WGS, EvidenceRead_Normal and EvidenceHaplotype cases.Summary by CodeRabbit
New Features
Bug Fixes
Tests
test_dragen_to_bam_qc
andtest_load_bam_qc
, ensuring accuracy in metrics.