feat: inhouse freqs useable in seqvars query (#232) #522
Conversation
Caution: Review failed. The pull request is closed.
Walkthrough
The changes in this pull request involve modifications to several files, primarily focusing on enhancing the handling of large files through Git LFS, updating JSON configurations for genomic data, and refining the functionality of query processing.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (3)
tests/seqvars/query/dragen.query.json (1)
76-77: Consider documenting empty sections.
The locus and clinvar sections are empty objects. Consider adding comments or documentation to explain their purpose and expected future usage.
src/seqvars/query/interpreter/frequency.rs (2)
48-64: LGTM! Consider enhancing the documentation.
The implementation for in-house frequency filtering in mtDNA is correct and follows the established pattern. The comment about allele frequency calculation limitation is helpful.
Consider expanding the comment to explain why differentiating between "no coverage" and "homozygous reference" affects allele frequency calculation:
-// Note that allele frequency cannot be calculated from inhouse data as of yet
-// as we cannot differentiate between "no coverage" and "hom. ref." yet.
+// Note: Allele frequency cannot be calculated from in-house data yet because we cannot
+// differentiate between "no coverage" and "homozygous reference" cases. This limitation
+// affects the denominator in frequency calculations, as we cannot accurately determine
+// the total number of alleles observed at each position.
Missing test coverage for in-house frequency filtering functionality
Looking at the code, I can confirm that while there are comprehensive test cases for gnomad_exomes, gnomad_genomes, helix_chrmt, and gnomad_mtdna frequency filtering, there are no test cases for the in-house frequency filtering functionality. The implementation in the passes function handles in-house frequency filtering for both mtDNA and nuclear DNA cases, but this functionality remains untested.
Key areas needing test coverage:
- In-house frequency filtering for mtDNA variants (het and hom counts)
- In-house frequency filtering for nuclear DNA variants (het, hom, and hemi counts)
- Edge cases like disabled filters and boundary conditions
🔗 Analysis chain
Line range hint 1-1: Add test coverage for in-house frequency filtering.
While the existing frequency filters have comprehensive test coverage, the newly added in-house frequency filtering functionality lacks test cases. Consider adding test cases similar to the existing ones for both mtDNA and nuclear DNA scenarios.
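A rough sketch of such a test is shown below. This is only an illustration: the helper constructors (make_query_with_inhouse, make_record_with_inhouse) and the exact signature of passes are hypothetical and would need to mirror the fixtures already used by the gnomAD tests.

```rust
// Hedged sketch, not verified against the crate: the helper constructors are
// hypothetical; thresholds assumed here are max_het=1, max_hom=0, max_hemi=0.
#[rstest::rstest]
#[case(0, 0, 0, true)]  // no in-house carriers -> variant passes
#[case(2, 0, 0, false)] // het count above max_het -> variant filtered
#[case(0, 1, 0, false)] // hom count above max_hom -> variant filtered
fn passes_frequency_inhouse_nuclear(
    #[case] het: i32,
    #[case] hom: i32,
    #[case] hemi: i32,
    #[case] expected: bool,
) -> Result<(), anyhow::Error> {
    // Hypothetical fixtures: a query with the in-house filter enabled and a
    // nuclear variant record carrying the given in-house counts.
    let query = make_query_with_inhouse(1, 0, 0);
    let record = make_record_with_inhouse(het, hom, hemi);
    assert_eq!(passes(&query, &record)?, expected);
    Ok(())
}
```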
Let's verify the current test coverage:
Would you like me to help generate the test cases for in-house frequency filtering?
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for test cases related to in-house frequency filtering
# Test: Search for test functions covering in-house frequency. Expect: Should find test cases.
rg -t rust "fn passes_frequency_inhouse" .
Length of output: 44
Script:
#!/bin/bash
# Let's first check for any test files related to frequency filtering
rg -t rust "mod tests" src/seqvars/query/interpreter/frequency.rs -A 10
# Also check for any test functions containing "frequency" in their name
rg -t rust "fn.*frequency.*\(" src/seqvars/query/interpreter/frequency.rs
# Let's see the actual implementation to understand what we're testing
rg -t rust "passes_frequency_inhouse" src/seqvars/query/interpreter/frequency.rs -A 10
Length of output: 414
Script:
#!/bin/bash
# Let's look at the entire frequency.rs file to understand the test module structure
# and the implementation of in-house frequency filtering
cat src/seqvars/query/interpreter/frequency.rs
Length of output: 28531
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (5)
src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@Case_1.ingested.vcf-no_inhouse.snap is excluded by !**/*.snap
src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@Case_1.ingested.vcf-with_inhouse-2.snap is excluded by !**/*.snap
src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@Case_1.ingested.vcf-with_inhouse.snap is excluded by !**/*.snap
src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@dragen.ingested.vcf-no_inhouse-2.snap is excluded by !**/*.snap
src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@dragen.ingested.vcf-no_inhouse.snap is excluded by !**/*.snap
📒 Files selected for processing (18)
- .gitattributes (1 hunks)
- src/seqvars/aggregate/mod.rs (1 hunks)
- src/seqvars/query/interpreter/frequency.rs (2 hunks)
- src/seqvars/query/mod.rs (8 hunks)
- tests/seqvars/query/Case_1.query.json (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/paths.txt (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000017.sst (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000018.sst (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000019.sst (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/CURRENT (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/IDENTITY (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG.old.1730295216382820 (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/MANIFEST-000014 (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000009 (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000016 (1 hunks)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/active (1 hunks)
- tests/seqvars/query/dragen.query.json (1 hunks)
✅ Files skipped from review due to trivial changes (12)
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/paths.txt
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000017.sst
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000018.sst
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000019.sst
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/CURRENT
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/IDENTITY
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG.old.1730295216382820
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/MANIFEST-000014
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000009
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000016
- tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/active
🔇 Additional comments (13)
.gitattributes (2)
5-5: LGTM! Appropriate LFS configuration for seqvars test data.
The Git LFS configuration for tests/seqvars/query/db-dynamic/** follows the established pattern and will properly handle the binary test files.
10-10: LGTM! Verify binary files are properly tracked.
The Git LFS configuration for tests/strucvars/query/db-dynamic/** is correct. Let's verify that all binary files are being tracked properly.
✅ Verification successful
The db-dynamic directory does not exist yet, but LFS configuration is properly set
The .gitattributes file correctly includes the LFS configuration for tests/strucvars/query/db-dynamic/**, but the directory doesn't exist yet. This is expected, as the configuration is proactively set up to handle binary files that will be added to this directory in the future.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that binary files in the new directories are properly tracked by Git LFS
# Test: Check if any binary files exist in the new directories
echo "Checking for binary files in new directories..."
find tests/seqvars/query/db-dynamic tests/strucvars/query/db-dynamic -type f -exec file {} \; | grep -i "binary"
# Test: Verify these files are tracked by Git LFS
echo "Verifying LFS tracking..."
git lfs ls-files | grep -E "tests/(seq|struc)vars/query/db-dynamic/"
Length of output: 985
Script:
#!/bin/bash
# Let's first check if the directories exist and what files they contain
echo "Checking directory structure..."
fd . tests/strucvars/query/db-dynamic
# Check if the .gitattributes file contains the LFS configuration
echo "Checking .gitattributes content..."
cat .gitattributes
# List all files in the repository to see the actual structure
echo "Listing repository files..."
fd . tests/strucvars
Length of output: 16181
tests/seqvars/query/dragen.query.json (3)
2-11: LGTM: Well-structured genotype configuration.
The genotype section provides a clear and complete configuration for sample genotypes with proper enum values.
12-24: Verify quality thresholds against best practices.
The quality thresholds appear reasonable, but please verify:
- minDpHet: 10 (minimum depth for heterozygous variants)
- minDpHom: 5 (minimum depth for homozygous variants)
- minGq: 10 (minimum genotype quality)
- minAb: 0.2 (minimum allele balance)
- minAd: 3 (minimum allele depth)
These values might need adjustment based on your sequencing protocol and analysis requirements.
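To make the effect of these thresholds concrete, here is a hedged sketch of a per-genotype quality check. The struct and field names, and the rule that allele balance is only checked for heterozygous calls, are illustrative assumptions rather than the worker's actual implementation.

```rust
// Illustrative only: assumed field names and rule combination; the real filter
// may apply these checks differently (e.g. depending on filter_active mode).
struct QualityThresholds {
    min_dp_het: i32, // minimum depth for heterozygous calls (10)
    min_dp_hom: i32, // minimum depth for homozygous calls (5)
    min_gq: i32,     // minimum genotype quality (10)
    min_ab: f64,     // minimum allele balance (0.2)
    min_ad: i32,     // minimum alternate allele depth (3)
}

fn genotype_passes(t: &QualityThresholds, is_het: bool, dp: i32, gq: i32, ab: f64, ad: i32) -> bool {
    let min_dp = if is_het { t.min_dp_het } else { t.min_dp_hom };
    // Allele balance is only meaningful for heterozygous calls in this sketch.
    dp >= min_dp && gq >= t.min_gq && ad >= t.min_ad && (!is_het || ab >= t.min_ab)
}
```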
45-74: LGTM: Comprehensive consequence configuration.
The consequence section provides a complete and well-organized configuration covering:
- All relevant variant types
- Appropriate transcript type filtering
- Extensive list of consequence types
tests/seqvars/query/Case_1.query.json (4)
22-52: LGTM! Quality thresholds are well-defined.
The quality configuration uses appropriate thresholds for WGS data:
- Minimum depth for heterozygous (10x) and homozygous (5x) variants
- Minimum genotype quality of 10
- Minimum allele balance of 0.2
- Minimum allele depth of 3
73-103: LGTM! Comprehensive consequence configuration.
The consequence configuration is well-structured with:
- Appropriate variant types (SNV, INDEL, MNV, complex)
- Focus on coding transcripts
- Comprehensive list of consequences ordered by severity
2-21: LGTM! Verify if identical genotype configuration is intended.
The genotype section is well-structured with clear sample identification. All three samples (index, father, mother) have identical configuration with GENOTYPE_CHOICE_ANY.
Please verify if this is the intended test scenario:
✅ Verification successful
Identical genotype configuration is appropriate for this test case
Looking at other query configurations in the codebase:
- dragen.query.json uses a simpler single-sample configuration
- Case_3.query.json in strucvars has a different schema for structural variants
- The identical GENOTYPE_CHOICE_ANY configuration for all three samples in Case_1.query.json appears to be a valid test case scenario, likely testing the basic functionality with a trio of samples.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are other test cases with different genotype configurations
rg -g '*.query.json' '"genotype":\s*{' -A 20
Length of output: 140
Script:
#!/bin/bash
# Check if there are other test cases with different genotype configurations
rg -g '*.query.json' '"genotype": \{' -A 20
Length of output: 3962
104-105: Verify empty locus and clinvar sections.
Both locus and clinvar sections are empty objects. Please clarify if:
- These sections are optional
- Default values will be used
- Configuration should be added
✅ Verification successful
Empty locus and clinvar sections are valid default configurations
Based on the schema definitions and test snapshots, both sections are optional with well-defined default values:
- locus: Optional section for filtering by genes and genomic regions. Default: empty arrays for genes and genome_regions.
- clinvar: Optional section for ClinVar-based filtering. Default: presence_required: false, empty germline_descriptions, and allow_conflicting_interpretations: false.
The empty objects in your query file represent these default configurations, which is a valid state indicating no specific filtering requirements for these sections.
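For illustration, optional sections with defaults like these are commonly modeled with serde defaults, so that an empty object {} deserializes to the documented default values. The struct and field names below follow the summary above but are a hedged sketch, not the schema definitions in query.rs.

```rust
// Sketch only: names mirror the defaults summarized above, not the actual
// schema structs. With #[serde(default)], an empty JSON object {} yields
// exactly these default values.
use serde::Deserialize;

#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct LocusSettings {
    genes: Vec<String>,
    genome_regions: Vec<String>,
}

#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct ClinvarSettings {
    presence_required: bool,
    germline_descriptions: Vec<String>,
    allow_conflicting_interpretations: bool,
}
```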
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if these sections are configured in other test cases
rg -g '*.query.json' -A 10 '"locus":|"clinvar":'
Length of output: 441
Script:
#!/bin/bash
# Search for any non-empty locus or clinvar configurations in JSON files
rg -g '*.json' '"locus":\s*\{[^{}]+\}|"clinvar":\s*\{[^{}]+\}'
# Search for any documentation or schema files that might describe these fields
fd -e md -e schema.json -e yaml -e yml | xargs rg -l 'locus|clinvar'
Length of output: 158
Script:
#!/bin/bash
# Check README.md and CHANGELOG.md for any documentation about these fields
cat README.md CHANGELOG.md | grep -A 5 -B 5 -i "locus\|clinvar"
# Look for any test files that might contain assertions or validations for these fields
fd -e test.js -e test.ts -e spec.js -e spec.ts | xargs rg -l 'locus|clinvar'
# Look for any schema validation files
fd schema | xargs rg -l 'locus|clinvar'
Length of output: 13305
Script:
#!/bin/bash
# Check the schema files for locus and clinvar definitions
cat src/seqvars/query/schema/query.rs | grep -A 10 -B 10 'locus\|clinvar'
# Check test snapshots for examples of non-empty configurations
cat src/seqvars/query/schema/snapshots/varfish_server_worker__seqvars__query__schema__query__tests__smoke_test_load@empty.json.snap
Length of output: 13502
src/seqvars/aggregate/mod.rs (1)
399-403: Consider adding documentation for metadata schema and migration path.
While the implementation is correct, consider:
- Documenting the metadata schema and expected values
- Adding a migration path for existing databases
Let's verify if this metadata is used by the query module:
✅ Verification successful
The metadata key "genome-release" is properly used and documented
The verification shows that:
- The metadata is read and validated in the query module with proper error handling
- The genome release is logged during database opening
- The functionality is documented in README.md with CLI examples
- The metadata is consistently used across aggregate and query operations
The original review comment's concerns about documentation and verification are addressed by the existing implementation.
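As a rough sketch of the round trip described above, the aggregate side writes the "genome-release" key into a "meta" column family and the query side reads and validates it. The helper functions below are illustrative wrappers under those assumptions, not the module's actual code.

```rust
// Hedged sketch: the "meta" column family and "genome-release" key match the
// reviewed code, but these helpers are illustrative, not the real API.
fn write_genome_release(
    db: &rocksdb::DBWithThreadMode<rocksdb::MultiThreaded>,
    genome_release: &str,
) -> Result<(), anyhow::Error> {
    let cf_meta = db
        .cf_handle("meta")
        .ok_or_else(|| anyhow::anyhow!("missing column family 'meta'"))?;
    db.put_cf(&cf_meta, "genome-release", genome_release)?;
    Ok(())
}

fn read_genome_release(
    db: &rocksdb::DBWithThreadMode<rocksdb::MultiThreaded>,
) -> Result<String, anyhow::Error> {
    let cf_meta = db
        .cf_handle("meta")
        .ok_or_else(|| anyhow::anyhow!("missing column family 'meta'"))?;
    let raw = db
        .get_cf(&cf_meta, "genome-release")?
        .ok_or_else(|| anyhow::anyhow!("missing value meta:genome-release"))?;
    Ok(String::from_utf8(raw)?)
}
```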
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usage of "genome-release" metadata key
rg -l "genome-release"
# Search for any genome release related queries
ast-grep --pattern 'db.get_cf($_, "genome-release")'
Length of output: 226
Script:
#!/bin/bash
# Let's check how the genome-release metadata is used in the query module
rg -A 5 -B 5 "genome-release" src/seqvars/query/mod.rs
# Also check README.md for any documentation about this metadata
rg -A 5 -B 5 "genome-release" README.md
Length of output: 1564
src/seqvars/query/interpreter/frequency.rs (1)
108-127: LGTM! Implementation is consistent and complete.
The in-house frequency filtering for nuclear DNA correctly implements all necessary checks (heterozygous, homozygous, and hemizygous) while maintaining consistency with other frequency filters.
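For readers skimming the review, the check pattern being referred to is roughly the following. The option and field names are assumptions mirroring the other frequency filters, not the exact code of the interpreter.

```rust
// Sketch of the nuclear-DNA in-house check pattern (assumed names): a variant
// is filtered out as soon as any configured count threshold is exceeded.
fn passes_inhouse_nuclear(
    enabled: bool,
    max_het: Option<i32>,
    max_hom: Option<i32>,
    max_hemi: Option<i32>,
    het: i32,
    hom: i32,
    hemi: i32,
) -> bool {
    if !enabled {
        return true; // filter disabled: never removes a variant
    }
    if max_het.map_or(false, |max| het > max) {
        return false;
    }
    if max_hom.map_or(false, |max| hom > max) {
        return false;
    }
    if max_hemi.map_or(false, |max| hemi > max) {
        return false;
    }
    true
}
```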
src/seqvars/query/mod.rs (2)
52-54: Addition of optional path_inhouse_db argument is appropriate
The new path_inhouse_db optional argument is correctly added to the Args struct with proper annotations for command-line parsing.
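For context, an optional path argument of this kind is typically declared along these lines with clap's derive API (clap v4 assumed). The attribute choices and doc comment are illustrative, not copied from the actual Args struct.

```rust
// Hedged sketch of an optional CLI argument; not the worker's real Args struct.
use clap::Parser;

#[derive(Debug, Parser)]
struct Args {
    /// Optional path to the in-house frequency RocksDB; in-house annotation is
    /// skipped when the argument is not given.
    #[arg(long)]
    path_inhouse_db: Option<std::path::PathBuf>,
}
```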
306-313: Proper handling of optional in-house database during annotation
The code correctly checks for the presence of the in-house database before attempting to annotate record_seqvar. The use of map_err provides clear error context if the annotation fails.
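The "annotate only if the optional database was opened" pattern looks roughly as follows; record_seqvar and annotate_seqvar come from the reviewed code, while the inhouse_dbs variable name and error message are assumptions for illustration.

```rust
// Sketch of the optional-annotation pattern described in the review.
let record_seqvar = if let Some(inhouse_dbs) = inhouse_dbs.as_ref() {
    inhouse_dbs
        .annotate_seqvar(record_seqvar)
        .map_err(|e| anyhow::anyhow!("problem annotating in-house frequencies: {}", e))?
} else {
    record_seqvar
};
```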
"frequency": { | ||
"gnomadExomes": { | ||
"enabled": true, | ||
"maxHet": 1, | ||
"maxHom": 0, | ||
"maxAf": 0.002 | ||
}, | ||
"gnomadGenomes": { | ||
"enabled": true, | ||
"maxHet": 1, | ||
"maxHom": 0, | ||
"maxAf": 0.002 | ||
}, | ||
"gnomadMtdna": {}, | ||
"helixmtdb": {}, | ||
"inhouse": { | ||
"enabled": true, | ||
"maxCarriers": 20 | ||
} | ||
}, |
🛠️ Refactor suggestion
Consider enhancing inhouse frequency configuration.
While the basic inhouse frequency configuration is present, it could benefit from additional parameters to match the granularity of gnomAD settings:
- maxAf (maximum allele frequency)
- maxHet (maximum heterozygous count)
- maxHom (maximum homozygous count)
This would provide consistent filtering capabilities across all frequency sources.
"inhouse": {
"enabled": true,
- "maxCarriers": 20
+ "maxCarriers": 20,
+ "maxAf": 0.002,
+ "maxHet": 1,
+ "maxHom": 0
}
"frequency": { | ||
"gnomadExomes": { | ||
"enabled": true, | ||
"maxHet": 1, | ||
"maxHom": 0, | ||
"maxAf": 0.002 | ||
}, | ||
"Case_1_father-N1-DNA1-WGS1": { | ||
"dp_het": 10, | ||
"dp_hom": 5, | ||
"gq": 10, | ||
"ab": 0.2, | ||
"ad": 3, | ||
"ad_max": null, | ||
"filter_active": "drop-variant" | ||
"gnomadGenomes": { | ||
"enabled": true, | ||
"maxHet": 1, | ||
"maxHom": 0, | ||
"maxAf": 0.002 | ||
}, | ||
"Case_1_mother-N1-DNA1-WGS1": { | ||
"dp_het": 10, | ||
"dp_hom": 5, | ||
"gq": 10, | ||
"ab": 0.2, | ||
"ad": 3, | ||
"ad_max": null, | ||
"filter_active": "drop-variant" | ||
"gnomadMtdna": {}, | ||
"helixmtdb": {}, | ||
"inhouse": { | ||
"enabled": true, | ||
"maxCarriers": 20 | ||
} | ||
}, |
💡 Codebase verification
Empty configurations for gnomadMtdna and helixmtdb need to be populated
Based on the test files in the repository, these databases should have proper configurations:
- helixmtdb should include fields for enabled, frequency, heteroplasmic, and homoplasmic, as seen in full-pb.json
- Both databases have configurations in other test files (empty.json) with fields for enabled, heteroplasmic, homoplasmic, and frequency
The empty configurations in Case_1.query.json should be updated to match the schema used in other test files, either enabling or explicitly disabling these databases.
🔗 Analysis chain
Verify empty database configurations.
The frequency thresholds for gnomAD are well-configured for rare variant analysis. The new inhouse frequency configuration with maxCarriers: 20 is a good addition. However, there are empty configurations for:
- gnomadMtdna
- helixmtdb
Please verify if these databases should be configured:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if these databases are configured in other test cases
rg -g '*.query.json' -A 5 '"gnomadMtdna"|"helixmtdb"'
Length of output: 441
Script:
#!/bin/bash
# Let's search for any configuration patterns of these databases in query files
rg -g '*.query.json' '"gnomadMtdna":\s*{[^}]*}|"helixmtdb":\s*{[^}]*}'
# Also check for any documentation or schema files that might define these configurations
fd -e json -e md -e yaml | xargs rg -l 'gnomadMtdna|helixmtdb'
Length of output: 491
Script:
#!/bin/bash
# Let's check the content of these files to see how gnomadMtdna and helixmtdb are configured
for file in tests/seqvars/query/with_extra.json tests/seqvars/query/full-pb.json tests/seqvars/query/full.json tests/seqvars/query/empty.json tests/seqvars/query/dragen.query.json; do
echo "=== $file ==="
rg -A 5 -B 5 '"gnomadMtdna"|"helixmtdb"' "$file"
done
Length of output: 2448
#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
Duplicate test cases detected
The test cases for case_1_ingested_vcf_with_inhouse and dragen_ingested_vcf_no_inhouse are duplicated. Removing the duplicates will streamline the tests and prevent redundancy.
Apply this diff to remove the duplicate test cases:
#[rstest::rstest]
#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
-#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
-#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
#[tokio::test]
Committable suggestion was skipped due to low confidence.
            let (inhouse_db, inhouse_meta) =
                open_rocksdb(&path, "counts", "meta").map_err(|e| {
                    anyhow::anyhow!(
                        "problem opening {} metadata at {}: {}",
                        "counts",
                        path.as_ref().as_os_str().to_string_lossy(),
                        e
                    )
                })?;

            if inhouse_meta.genome_release != genome_release {
                anyhow::bail!(
                    "genome release mismatch: expected {} but got {}",
                    genome_release,
                    inhouse_meta.genome_release
                );
            }

            Ok(Self {
                inhouse_db,
                inhouse_meta,
            })
        }

        /// Query for in-house data frequencies.
        fn query(
            &self,
            key: &annonars::common::keys::Var,
        ) -> Result<Option<Counts>, anyhow::Error> {
            let cf_counts = self
                .inhouse_db
                .cf_handle("counts")
                .expect("cannot find column family 'counts'");
            let key: Vec<u8> = key.clone().into();
            let res = self
                .inhouse_db
                .get_cf(&cf_counts, key)
                .map_err(|e| anyhow::anyhow!("querying in-house RocksDB failed: {}", e))?;
            Ok(res.as_ref().map(|res| Counts::from_vec(res)))
        }

        /// Annotate the given seqvars record.
        pub fn annotate_seqvar(
            &self,
            record: VariantRecord,
        ) -> Result<VariantRecord, anyhow::Error> {
            let VariantRecord {
                vcf_variant,
                call_infos,
                ann_fields,
                population_frequencies,
            } = record;
            let PopulationFrequencies {
                gnomad_exomes,
                gnomad_genomes,
                gnomad_mtdna,
                helixmtdb,
                inhouse,
            } = population_frequencies;

            let inhouse = if let Some(inhouse_counts) =
                self.query(&annonars::common::keys::Var::from(
                    &vcf_variant.chrom,
                    vcf_variant.pos,
                    &vcf_variant.ref_allele,
                    &vcf_variant.alt_allele,
                ))? {
                let Counts {
                    count_homref,
                    count_hemiref,
                    count_het,
                    count_homalt,
                    count_hemialt,
                } = inhouse_counts;
                InHouseFrequencies {
                    an: (count_homref + count_hemiref + count_het + count_homalt + count_hemialt)
                        as i32,
                    hom: count_homalt as i32,
                    het: count_het as i32,
                    hemi: count_hemialt as i32,
                }
            } else {
                inhouse
            };

            Ok(VariantRecord {
                vcf_variant,
                call_infos,
                ann_fields,
                population_frequencies: PopulationFrequencies {
                    gnomad_exomes,
                    gnomad_genomes,
                    gnomad_mtdna,
                    helixmtdb,
                    inhouse,
                },
            })
        }
    }
}
Potential error handling improvement in inhouse::Dbs::query method
In the query method, when querying the in-house database, consider handling the case where the column family 'counts' might not exist. This would prevent a potential panic if the column family is missing.
Apply this diff to handle the missing column family gracefully:
- let cf_counts = self
- .inhouse_db
- .cf_handle("counts")
- .expect("cannot find column family 'counts'");
+ let cf_counts = match self.inhouse_db.cf_handle("counts") {
+ Some(cf) => cf,
+ None => return Err(anyhow::anyhow!("Column family 'counts' not found in in-house database")),
+ };
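An equivalent, slightly more compact phrasing of the same fix (a sketch; either form avoids the panic from expect) would be:

```rust
// Same effect as the match above: return an error instead of panicking when
// the column family is missing.
let cf_counts = self
    .inhouse_db
    .cf_handle("counts")
    .ok_or_else(|| anyhow::anyhow!("column family 'counts' not found in in-house database"))?;
```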
📝 Committable suggestion
/// Code for accessing the in-house frequencies.
pub(crate) mod inhouse {
    use std::{path::Path, sync::Arc};

    use crate::seqvars::aggregate::ds::Counts;

    use super::schema::data::{InHouseFrequencies, PopulationFrequencies, VariantRecord};

    /// Meta information of the database.
    pub struct Meta {
        /// Genome release.
        pub genome_release: String,
    }

    /// Container for the database structures.
    pub struct Dbs {
        /// Inhouse frequency RocksDB handle.
        pub inhouse_db: Arc<rocksdb::DBWithThreadMode<rocksdb::MultiThreaded>>,
        /// Meta information.
        #[allow(dead_code)]
        pub inhouse_meta: Meta,
    }

    /// Open RocksDb given path and column family name for data and metadata.
    pub fn open_rocksdb<P: AsRef<std::path::Path>>(
        path_rocksdb: P,
        cf_data: &str,
        cf_meta: &str,
    ) -> Result<(Arc<rocksdb::DBWithThreadMode<rocksdb::MultiThreaded>>, Meta), anyhow::Error> {
        tracing::info!("Opening RocksDB database ...");
        let before_open = std::time::Instant::now();
        let cf_names = &[cf_meta, cf_data];
        let resolved_path_rocksdb = annonars::common::readlink_f(&path_rocksdb)?;
        let db = Arc::new(rocksdb::DB::open_cf_for_read_only(
            &rocksdb::Options::default(),
            resolved_path_rocksdb,
            cf_names,
            true,
        )?);
        tracing::info!(" reading meta information");
        let meta = {
            let cf_meta = db.cf_handle(cf_meta).unwrap();
            let meta_genome_release = String::from_utf8(
                db.get_cf(&cf_meta, "genome-release")?
                    .ok_or_else(|| anyhow::anyhow!("missing value meta:genome-release"))?,
            )?;
            Meta {
                genome_release: meta_genome_release,
            }
        };
        tracing::info!(" meta:genome-release = {}", &meta.genome_release);
        tracing::info!(
            "... opening RocksDB database took {:?}",
            before_open.elapsed()
        );
        Ok((db, meta))
    }

    impl Dbs {
        /// Initialize from path that contains the annonars databases.
        pub fn with_path<P: AsRef<Path>>(
            path: P,
            genome_release: &str,
        ) -> Result<Self, anyhow::Error> {
            let (inhouse_db, inhouse_meta) =
                open_rocksdb(&path, "counts", "meta").map_err(|e| {
                    anyhow::anyhow!(
                        "problem opening {} metadata at {}: {}",
                        "counts",
                        path.as_ref().as_os_str().to_string_lossy(),
                        e
                    )
                })?;
            if inhouse_meta.genome_release != genome_release {
                anyhow::bail!(
                    "genome release mismatch: expected {} but got {}",
                    genome_release,
                    inhouse_meta.genome_release
                );
            }
            Ok(Self {
                inhouse_db,
                inhouse_meta,
            })
        }

        /// Query for in-house data frequencies.
        fn query(
            &self,
            key: &annonars::common::keys::Var,
        ) -> Result<Option<Counts>, anyhow::Error> {
            let cf_counts = match self.inhouse_db.cf_handle("counts") {
                Some(cf) => cf,
                None => {
                    return Err(anyhow::anyhow!(
                        "Column family 'counts' not found in in-house database"
                    ))
                }
            };
            let key: Vec<u8> = key.clone().into();
            let res = self
                .inhouse_db
                .get_cf(&cf_counts, key)
                .map_err(|e| anyhow::anyhow!("querying in-house RocksDB failed: {}", e))?;
            Ok(res.as_ref().map(|res| Counts::from_vec(res)))
        }

        /// Annotate the given seqvars record.
        pub fn annotate_seqvar(
            &self,
            record: VariantRecord,
        ) -> Result<VariantRecord, anyhow::Error> {
            let VariantRecord {
                vcf_variant,
                call_infos,
                ann_fields,
                population_frequencies,
            } = record;
            let PopulationFrequencies {
                gnomad_exomes,
                gnomad_genomes,
                gnomad_mtdna,
                helixmtdb,
                inhouse,
            } = population_frequencies;
            let inhouse = if let Some(inhouse_counts) =
                self.query(&annonars::common::keys::Var::from(
                    &vcf_variant.chrom,
                    vcf_variant.pos,
                    &vcf_variant.ref_allele,
                    &vcf_variant.alt_allele,
                ))? {
                let Counts {
                    count_homref,
                    count_hemiref,
                    count_het,
                    count_homalt,
                    count_hemialt,
                } = inhouse_counts;
                InHouseFrequencies {
                    an: (count_homref + count_hemiref + count_het + count_homalt + count_hemialt)
                        as i32,
                    hom: count_homalt as i32,
                    het: count_het as i32,
                    hemi: count_hemialt as i32,
                }
            } else {
                inhouse
            };
            Ok(VariantRecord {
                vcf_variant,
                call_infos,
                ann_fields,
                population_frequencies: PopulationFrequencies {
                    gnomad_exomes,
                    gnomad_genomes,
                    gnomad_mtdna,
                    helixmtdb,
                    inhouse,
                },
            })
        }
    }
}
Codecov Report
Attention: Patch coverage is ...
Additional details and impacted files

@@ Coverage Diff @@
##             main    #522    +/- ##
=====================================
+ Coverage      71%     79%      +7%
=====================================
  Files          43      43
  Lines       11830   12012    +182
=====================================
+ Hits         8510    9492    +982
+ Misses       3320    2520    -800
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores