
feat: inhouse freqs useable in seqvars query (#232) #522

Merged

holtgrewe merged 2 commits into main from 232-use-in-house-frequencies-in-seqvars-query on Oct 30, 2024

Conversation

@holtgrewe holtgrewe (Contributor) commented Oct 30, 2024

Summary by CodeRabbit

  • New Features

    • Introduced in-house frequency data integration for enhanced querying capabilities.
    • Added support for tracking large files using Git LFS in specific directories.
    • Restructured JSON configuration files for improved organization and clarity.
  • Bug Fixes

    • Enhanced checks for frequency filters in the variant evaluation process.
  • Documentation

    • Updated test cases to validate new in-house database integration.
  • Chores

    • Added several new files for managing RocksDB and large file storage, including metadata and versioning files.

@holtgrewe holtgrewe linked issue #232 (Use in-house frequencies in "seqvars query") on Oct 30, 2024 that may be closed by this pull request
coderabbitai bot (Contributor) commented Oct 30, 2024

Caution

Review failed

The pull request is closed.

Walkthrough

The changes in this pull request involve modifications to several files, primarily focusing on enhancing the handling of large files through Git LFS, updating JSON configurations for genomic data, and refining the functionality of query processing in the seqvars module. New entries for LFS tracking have been added, along with structural changes to JSON files to improve organization and clarity. Additionally, the run function and query handling have been updated to incorporate in-house frequency data, enriching the querying capabilities of the application.

Changes

  • .gitattributes: Added entries for LFS tracking of tests/seqvars/query/db-dynamic/** and tests/strucvars/query/db-dynamic/**.
  • src/seqvars/aggregate/mod.rs: Modified the run function to include the genome release name in the RocksDB metadata.
  • src/seqvars/query/interpreter/frequency.rs: Enhanced the passes function with additional checks for the in-house frequency filter.
  • src/seqvars/query/mod.rs: Added an optional path_inhouse_db argument, updated run_query to accept an inhouse parameter, and introduced the inhouse module.
  • tests/seqvars/query/Case_1.query.json: Restructured the JSON format for improved organization, including new objects for genotype and frequency.
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/paths.txt: New file with metadata, tracked by Git LFS.
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/*.sst: Multiple new binary files, tracked by Git LFS.
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/*: New files for RocksDB management, including CURRENT, IDENTITY, LOG, MANIFEST, and OPTIONS.
  • tests/seqvars/query/dragen.query.json: Modified the JSON structure to enhance organization, including new sections for genotype, quality, and frequency.

Poem

In the burrow deep, we make our play,
With files and data, we hop all day.
Large files tracked with care and grace,
New paths and queries in our space.
A world of genes we now explore,
With every change, we leap for more! 🐇✨



@holtgrewe holtgrewe enabled auto-merge (squash) October 30, 2024 13:48
@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (3)
tests/seqvars/query/dragen.query.json (1)

76-77: Consider documenting empty sections.

The locus and clinvar sections are empty objects. Consider adding comments or documentation to explain their purpose and expected future usage.

src/seqvars/query/interpreter/frequency.rs (2)

48-64: LGTM! Consider enhancing the documentation.

The implementation for in-house frequency filtering in mtDNA is correct and follows the established pattern. The comment about allele frequency calculation limitation is helpful.

Consider expanding the comment to explain why differentiating between "no coverage" and "homozygous reference" affects allele frequency calculation:

-// Note that allele frequency cannot be calculated from inhouse data as of yet
-// as we cannot differentiate between "no coverage" and "hom. ref." yet.
+// Note: Allele frequency cannot be calculated from in-house data yet because we cannot
+// differentiate between "no coverage" and "homozygous reference" cases. This limitation
+// affects the denominator in frequency calculations, as we cannot accurately determine
+// the total number of alleles observed at each position.
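
For reference, and hedged rather than taken from the code: on autosomes the allele frequency at a site would be

    AF = AC / AN,   with AN ≈ 2 × (hom. ref. + het. + hom. alt. samples)

Because "no coverage" and "hom. ref." cannot be told apart in the in-house data, the hom. ref. term, and with it AN, is unknown, which is why the filter operates on carrier counts rather than on an allele frequency threshold.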

Missing test coverage for in-house frequency filtering functionality

Looking at the code, I can confirm that while there are comprehensive test cases for gnomad_exomes, gnomad_genomes, helix_chrmt, and gnomad_mtdna frequency filtering, there are no test cases for the in-house frequency filtering functionality. The implementation in the passes function handles in-house frequency filtering for both mtDNA and nuclear DNA cases, but this functionality remains untested.

Key areas needing test coverage:

  • In-house frequency filtering for mtDNA variants (het and hom counts)
  • In-house frequency filtering for nuclear DNA variants (het, hom, and hemi counts)
  • Edge cases like disabled filters and boundary conditions
🔗 Analysis chain

Line range hint 1-1: Add test coverage for in-house frequency filtering.

While the existing frequency filters have comprehensive test coverage, the newly added in-house frequency filtering functionality lacks test cases. Consider adding test cases similar to the existing ones for both mtDNA and nuclear DNA scenarios.

Let's verify the current test coverage:

Would you like me to help generate the test cases for in-house frequency filtering?
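
For illustration, a minimal sketch of what such parameterized tests could look like. The passes_inhouse helper below is a local stand-in for the carrier-count logic, not the crate's actual passes function, and the fixture values are invented; real tests would construct query and record fixtures the same way the existing gnomAD frequency tests do.

#[cfg(test)]
mod inhouse_filter_sketch {
    /// Local stand-in for the in-house carrier-count check (not the crate's
    /// real `passes` implementation).
    fn passes_inhouse(enabled: bool, max_carriers: u32, het: u32, hom: u32, hemi: u32) -> bool {
        if !enabled {
            // A disabled filter never removes a variant.
            return true;
        }
        het + hom + hemi <= max_carriers
    }

    #[rstest::rstest]
    // Disabled filter: passes regardless of counts.
    #[case::disabled(false, 0, 100, 100, 100, true)]
    // Boundary condition: exactly at the carrier limit still passes.
    #[case::at_limit(true, 20, 10, 5, 5, true)]
    // Above the limit: variant is filtered out.
    #[case::above_limit(true, 20, 19, 1, 1, false)]
    fn smoke(
        #[case] enabled: bool,
        #[case] max_carriers: u32,
        #[case] het: u32,
        #[case] hom: u32,
        #[case] hemi: u32,
        #[case] expected: bool,
    ) {
        assert_eq!(passes_inhouse(enabled, max_carriers, het, hom, hemi), expected);
    }
}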

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for test cases related to in-house frequency filtering

# Test: Search for test functions covering in-house frequency. Expect: Should find test cases.
rg -t rust "fn passes_frequency_inhouse" .

Length of output: 44


Script:

#!/bin/bash
# Let's first check for any test files related to frequency filtering
rg -t rust "mod tests" src/seqvars/query/interpreter/frequency.rs -A 10

# Also check for any test functions containing "frequency" in their name
rg -t rust "fn.*frequency.*\(" src/seqvars/query/interpreter/frequency.rs

# Let's see the actual implementation to understand what we're testing
rg -t rust "passes_frequency_inhouse" src/seqvars/query/interpreter/frequency.rs -A 10

Length of output: 414


Script:

#!/bin/bash
# Let's look at the entire frequency.rs file to understand the test module structure
# and the implementation of in-house frequency filtering
cat src/seqvars/query/interpreter/frequency.rs

Length of output: 28531

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 0ce7f13 and 8eba31c.

⛔ Files ignored due to path filters (5)
  • src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@Case_1.ingested.vcf-no_inhouse.snap is excluded by !**/*.snap
  • src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@Case_1.ingested.vcf-with_inhouse-2.snap is excluded by !**/*.snap
  • src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@Case_1.ingested.vcf-with_inhouse.snap is excluded by !**/*.snap
  • src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@dragen.ingested.vcf-no_inhouse-2.snap is excluded by !**/*.snap
  • src/seqvars/query/snapshots/varfish_server_worker__seqvars__query__test__smoke_test@dragen.ingested.vcf-no_inhouse.snap is excluded by !**/*.snap
📒 Files selected for processing (18)
  • .gitattributes (1 hunks)
  • src/seqvars/aggregate/mod.rs (1 hunks)
  • src/seqvars/query/interpreter/frequency.rs (2 hunks)
  • src/seqvars/query/mod.rs (8 hunks)
  • tests/seqvars/query/Case_1.query.json (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/paths.txt (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000017.sst (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000018.sst (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000019.sst (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/CURRENT (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/IDENTITY (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG.old.1730295216382820 (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/MANIFEST-000014 (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000009 (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000016 (1 hunks)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/active (1 hunks)
  • tests/seqvars/query/dragen.query.json (1 hunks)
✅ Files skipped from review due to trivial changes (12)
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/paths.txt
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000017.sst
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000018.sst
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/000019.sst
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/CURRENT
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/IDENTITY
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/LOG.old.1730295216382820
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/MANIFEST-000014
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000009
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/2024-10-30_13-33-30/rocksdb/OPTIONS-000016
  • tests/seqvars/query/db-dynamic/worker/seqvars/inhouse/grch37/active
🔇 Additional comments (13)
.gitattributes (2)

5-5: LGTM! Appropriate LFS configuration for seqvars test data.

The Git LFS configuration for tests/seqvars/query/db-dynamic/** follows the established pattern and will properly handle the binary test files.


10-10: LGTM! Verify binary files are properly tracked.

The Git LFS configuration for tests/strucvars/query/db-dynamic/** is correct. Let's verify that all binary files are being tracked properly.

✅ Verification successful

The db-dynamic directory does not exist yet, but LFS configuration is properly set

The .gitattributes file correctly includes the LFS configuration for tests/strucvars/query/db-dynamic/**, but the directory doesn't exist yet. This is expected as the configuration is proactively set up to handle binary files that will be added to this directory in the future.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that binary files in the new directories are properly tracked by Git LFS

# Test: Check if any binary files exist in the new directories
echo "Checking for binary files in new directories..."
find tests/seqvars/query/db-dynamic tests/strucvars/query/db-dynamic -type f -exec file {} \; | grep -i "binary"

# Test: Verify these files are tracked by Git LFS
echo "Verifying LFS tracking..."
git lfs ls-files | grep -E "tests/(seq|struc)vars/query/db-dynamic/"

Length of output: 985


Script:

#!/bin/bash
# Let's first check if the directories exist and what files they contain
echo "Checking directory structure..."
fd . tests/strucvars/query/db-dynamic

# Check if the .gitattributes file contains the LFS configuration
echo "Checking .gitattributes content..."
cat .gitattributes

# List all files in the repository to see the actual structure
echo "Listing repository files..."
fd . tests/strucvars

Length of output: 16181

tests/seqvars/query/dragen.query.json (3)

2-11: LGTM: Well-structured genotype configuration.

The genotype section provides a clear and complete configuration for sample genotypes with proper enum values.


12-24: Verify quality thresholds against best practices.

The quality thresholds appear reasonable, but please verify:

  • minDpHet: 10 (minimum depth for heterozygous variants)
  • minDpHom: 5 (minimum depth for homozygous variants)
  • minGq: 10 (minimum genotype quality)
  • minAb: 0.2 (minimum allele balance)
  • minAd: 3 (minimum allele depth)

These values might need adjustment based on your sequencing protocol and analysis requirements.
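
For illustration, a minimal sketch of applying thresholds like these to a single genotype call. All type and field names are hypothetical, allele balance is computed here as AD / DP, and the balance check is only applied to heterozygous calls; the repository's actual quality filter may be wired differently.

/// Hypothetical per-call metrics (illustration only).
struct CallMetrics {
    depth: u32,
    genotype_quality: u32,
    alt_allele_depth: u32,
    is_het: bool,
}

/// Thresholds mirroring the values listed above.
struct QualityThresholds {
    min_dp_het: u32,
    min_dp_hom: u32,
    min_gq: u32,
    min_ab: f64,
    min_ad: u32,
}

/// Returns true if the call passes all thresholds.
fn passes_quality(t: &QualityThresholds, m: &CallMetrics) -> bool {
    let min_dp = if m.is_het { t.min_dp_het } else { t.min_dp_hom };
    let allele_balance = m.alt_allele_depth as f64 / m.depth as f64;
    m.depth >= min_dp
        && m.genotype_quality >= t.min_gq
        && m.alt_allele_depth >= t.min_ad
        && (!m.is_het || allele_balance >= t.min_ab)
}

fn main() {
    let t = QualityThresholds { min_dp_het: 10, min_dp_hom: 5, min_gq: 10, min_ab: 0.2, min_ad: 3 };
    let call = CallMetrics { depth: 30, genotype_quality: 50, alt_allele_depth: 14, is_het: true };
    assert!(passes_quality(&t, &call));
}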


45-74: LGTM: Comprehensive consequence configuration.

The consequence section provides a complete and well-organized configuration covering:

  • All relevant variant types
  • Appropriate transcript type filtering
  • Extensive list of consequence types
tests/seqvars/query/Case_1.query.json (4)

22-52: LGTM! Quality thresholds are well-defined.

The quality configuration uses appropriate thresholds for WGS data:

  • Minimum depth for heterozygous (10x) and homozygous (5x) variants
  • Minimum genotype quality of 10
  • Minimum allele balance of 0.2
  • Minimum allele depth of 3

73-103: LGTM! Comprehensive consequence configuration.

The consequence configuration is well-structured with:

  • Appropriate variant types (SNV, INDEL, MNV, complex)
  • Focus on coding transcripts
  • Comprehensive list of consequences ordered by severity

2-21: LGTM! Verify if identical genotype configuration is intended.

The genotype section is well-structured with clear sample identification. All three samples (index, father, mother) have identical configuration with GENOTYPE_CHOICE_ANY.

Please verify if this is the intended test scenario:

✅ Verification successful

Identical genotype configuration is appropriate for this test case

Looking at other query configurations in the codebase:

  • dragen.query.json uses a simpler single-sample configuration
  • Case_3.query.json in strucvars has a different schema for structural variants
  • The identical GENOTYPE_CHOICE_ANY configuration for all three samples in Case_1.query.json appears to be a valid test case scenario, likely testing the basic functionality with a trio of samples.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if there are other test cases with different genotype configurations
rg -g '*.query.json' '"genotype":\s*{' -A 20

Length of output: 140


Script:

#!/bin/bash
# Check if there are other test cases with different genotype configurations
rg -g '*.query.json' '"genotype": \{' -A 20

Length of output: 3962


104-105: Verify empty locus and clinvar sections.

Both locus and clinvar sections are empty objects. Please clarify if:

  1. These sections are optional
  2. Default values will be used
  3. Configuration should be added
✅ Verification successful

Empty locus and clinvar sections are valid default configurations

Based on the schema definitions and test snapshots, both sections are optional with well-defined default values:

  • locus: Optional section for filtering by genes and genomic regions
    • Default: empty arrays for genes and genome_regions
  • clinvar: Optional section for ClinVar-based filtering
    • Default: presence_required: false, empty germline_descriptions, and allow_conflicting_interpretations: false

The empty objects in your query file represent these default configurations, which is a valid state indicating no specific filtering requirements for these sections.
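
For illustration, a sketch of how optional sections with defaults like these are commonly modeled with serde. The struct and field names mirror the defaults described above but are not the repository's actual schema types.

use serde::Deserialize;

/// Hypothetical mirror of the optional `locus` section.
#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct Locus {
    genes: Vec<String>,
    genome_regions: Vec<String>,
}

/// Hypothetical mirror of the optional `clinvar` section.
#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct Clinvar {
    presence_required: bool,
    germline_descriptions: Vec<String>,
    allow_conflicting_interpretations: bool,
}

fn main() -> Result<(), serde_json::Error> {
    // Empty objects deserialize into the defaults, matching the behaviour
    // described for `"locus": {}` and `"clinvar": {}` in the query file.
    let locus: Locus = serde_json::from_str("{}")?;
    let clinvar: Clinvar = serde_json::from_str("{}")?;
    println!("{:?} / {:?}", locus, clinvar);
    Ok(())
}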

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if these sections are configured in other test cases
rg -g '*.query.json' -A 10 '"locus":|"clinvar":'

Length of output: 441


Script:

#!/bin/bash
# Search for any non-empty locus or clinvar configurations in JSON files
rg -g '*.json' '"locus":\s*\{[^{}]+\}|"clinvar":\s*\{[^{}]+\}'

# Search for any documentation or schema files that might describe these fields
fd -e md -e schema.json -e yaml -e yml | xargs rg -l 'locus|clinvar'

Length of output: 158


Script:

#!/bin/bash
# Check README.md and CHANGELOG.md for any documentation about these fields
cat README.md CHANGELOG.md | grep -A 5 -B 5 -i "locus\|clinvar"

# Look for any test files that might contain assertions or validations for these fields
fd -e test.js -e test.ts -e spec.js -e spec.ts | xargs rg -l 'locus|clinvar'

# Look for any schema validation files
fd schema | xargs rg -l 'locus|clinvar'

Length of output: 13305


Script:

#!/bin/bash
# Check the schema files for locus and clinvar definitions
cat src/seqvars/query/schema/query.rs | grep -A 10 -B 10 'locus\|clinvar'

# Check test snapshots for examples of non-empty configurations
cat src/seqvars/query/schema/snapshots/varfish_server_worker__seqvars__query__schema__query__tests__smoke_test_load@empty.json.snap

Length of output: 13502

src/seqvars/aggregate/mod.rs (1)

399-403: Consider adding documentation for metadata schema and migration path.

While the implementation is correct, consider:

  1. Documenting the metadata schema and expected values
  2. Adding a migration path for existing databases

Let's verify if this metadata is used by the query module:

✅ Verification successful

The metadata key "genome-release" is properly used and documented

The verification shows that:

  • The metadata is read and validated in the query module with proper error handling
  • The genome release is logged during database opening
  • The functionality is documented in README.md with CLI examples
  • The metadata is consistently used across aggregate and query operations

The original review comment's concerns about documentation and verification are addressed by the existing implementation.
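
For illustration, a self-contained sketch of the write/read round trip for this metadata key using the rocksdb crate directly. The database path is a placeholder and the repository's own open_rocksdb/aggregate helpers are deliberately not used here.

use rocksdb::{Options, DB};

/// Write the "genome-release" key as the aggregate side would and read it
/// back as the query side does, failing loudly if it is absent.
fn roundtrip_genome_release(path: &str, genome_release: &str) -> Result<String, anyhow::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    let db = DB::open_cf(&opts, path, ["meta", "counts"])?;

    let cf_meta = db
        .cf_handle("meta")
        .ok_or_else(|| anyhow::anyhow!("missing column family 'meta'"))?;

    // Aggregate side: record which genome release the counts were built for.
    db.put_cf(&cf_meta, "genome-release", genome_release)?;

    // Query side: read the key back for validation against the expected release.
    let raw = db
        .get_cf(&cf_meta, "genome-release")?
        .ok_or_else(|| anyhow::anyhow!("missing value meta:genome-release"))?;
    Ok(String::from_utf8(raw)?)
}

fn main() -> Result<(), anyhow::Error> {
    let release = roundtrip_genome_release("/tmp/inhouse-meta-demo", "grch37")?;
    assert_eq!(release, "grch37");
    Ok(())
}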

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of "genome-release" metadata key
rg -l "genome-release"

# Search for any genome release related queries
ast-grep --pattern 'db.get_cf($_, "genome-release")'

Length of output: 226


Script:

#!/bin/bash
# Let's check how the genome-release metadata is used in the query module
rg -A 5 -B 5 "genome-release" src/seqvars/query/mod.rs

# Also check README.md for any documentation about this metadata
rg -A 5 -B 5 "genome-release" README.md

Length of output: 1564

src/seqvars/query/interpreter/frequency.rs (1)

108-127: LGTM! Implementation is consistent and complete.

The in-house frequency filtering for nuclear DNA correctly implements all necessary checks (heterozygous, homozygous, and hemizygous) while maintaining consistency with other frequency filters.

src/seqvars/query/mod.rs (2)

52-54: Addition of optional path_inhouse_db argument is appropriate

The new path_inhouse_db optional argument is correctly added to the Args struct with proper annotations for command-line parsing.


306-313: Proper handling of optional in-house database during annotation

The code correctly checks for the presence of the in-house database before attempting to annotate record_seqvar. The use of map_err provides clear error context if the annotation fails.
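
For illustration, a compact, self-contained sketch of the two patterns noted here: an optional clap argument and annotation that only runs when a database path was supplied, with map_err adding context. The Args subset, the flag, and the annotate_seqvar stand-in are hypothetical; the real code threads this through run_query and inhouse::Dbs.

use clap::Parser;

/// Illustrative subset of the query arguments (the real Args struct has
/// many more fields and its own attribute values).
#[derive(Debug, Parser)]
struct Args {
    /// Optional path to the in-house frequency RocksDB.
    #[arg(long)]
    path_inhouse_db: Option<String>,
}

/// Stand-in for annotating a record with in-house frequencies.
fn annotate_seqvar(_db_path: &str) -> Result<(), anyhow::Error> {
    Ok(())
}

fn main() -> Result<(), anyhow::Error> {
    let args = Args::parse();

    // Annotate only when the in-house database path is present, attaching
    // context to any error.
    if let Some(path) = args.path_inhouse_db.as_deref() {
        annotate_seqvar(path).map_err(|e| {
            anyhow::anyhow!("problem annotating seqvar with in-house frequencies: {}", e)
        })?;
    }
    Ok(())
}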

Comment on lines +25 to 44
"frequency": {
"gnomadExomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"gnomadGenomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"gnomadMtdna": {},
"helixmtdb": {},
"inhouse": {
"enabled": true,
"maxCarriers": 20
}
},

🛠️ Refactor suggestion

Consider enhancing inhouse frequency configuration.

While the basic inhouse frequency configuration is present, it could benefit from additional parameters to match the granularity of gnomAD settings:

  • maxAf (maximum allele frequency)
  • maxHet (maximum heterozygous count)
  • maxHom (maximum homozygous count)

This would provide consistent filtering capabilities across all frequency sources.

   "inhouse": {
     "enabled": true,
-    "maxCarriers": 20
+    "maxCarriers": 20,
+    "maxAf": 0.002,
+    "maxHet": 1,
+    "maxHom": 0
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"frequency": {
"gnomadExomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"gnomadGenomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"gnomadMtdna": {},
"helixmtdb": {},
"inhouse": {
"enabled": true,
"maxCarriers": 20
}
},
"frequency": {
"gnomadExomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"gnomadGenomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"gnomadMtdna": {},
"helixmtdb": {},
"inhouse": {
"enabled": true,
"maxCarriers": 20,
"maxAf": 0.002,
"maxHet": 1,
"maxHom": 0
}
},

Comment on lines +53 to 72
"frequency": {
"gnomadExomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"Case_1_father-N1-DNA1-WGS1": {
"dp_het": 10,
"dp_hom": 5,
"gq": 10,
"ab": 0.2,
"ad": 3,
"ad_max": null,
"filter_active": "drop-variant"
"gnomadGenomes": {
"enabled": true,
"maxHet": 1,
"maxHom": 0,
"maxAf": 0.002
},
"Case_1_mother-N1-DNA1-WGS1": {
"dp_het": 10,
"dp_hom": 5,
"gq": 10,
"ab": 0.2,
"ad": 3,
"ad_max": null,
"filter_active": "drop-variant"
"gnomadMtdna": {},
"helixmtdb": {},
"inhouse": {
"enabled": true,
"maxCarriers": 20
}
},

💡 Codebase verification

Empty configurations for gnomadMtdna and helixmtdb need to be populated

Based on the test files in the repository, these databases should have proper configurations:

  • helixmtdb should include fields for enabled, frequency, heteroplasmic, and homoplasmic as seen in full-pb.json
  • Both databases have configurations in other test files (empty.json) with fields for enabled, heteroplasmic, homoplasmic, and frequency

The empty configurations in Case_1.query.json should be updated to match the schema used in other test files, either enabling or explicitly disabling these databases.

🔗 Analysis chain

Verify empty database configurations.

The frequency thresholds for gnomAD are well-configured for rare variant analysis. The new inhouse frequency configuration with maxCarriers: 20 is a good addition. However, there are empty configurations for:

  • gnomadMtdna
  • helixmtdb

Please verify if these databases should be configured:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if these databases are configured in other test cases
rg -g '*.query.json' -A 5 '"gnomadMtdna"|"helixmtdb"'

Length of output: 441


Script:

#!/bin/bash
# Let's search for any configuration patterns of these databases in query files
rg -g '*.query.json' '"gnomadMtdna":\s*{[^}]*}|"helixmtdb":\s*{[^}]*}'

# Also check for any documentation or schema files that might define these configurations
fd -e json -e md -e yaml | xargs rg -l 'gnomadMtdna|helixmtdb'

Length of output: 491


Script:

#!/bin/bash
# Let's check the content of these files to see how gnomadMtdna and helixmtdb are configured
for file in tests/seqvars/query/with_extra.json tests/seqvars/query/full-pb.json tests/seqvars/query/full.json tests/seqvars/query/empty.json tests/seqvars/query/dragen.query.json; do
    echo "=== $file ==="
    rg -A 5 -B 5 '"gnomadMtdna"|"helixmtdb"' "$file"
done

Length of output: 2448

Comment on lines +1958 to +1961
#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]

⚠️ Potential issue

Duplicate test cases detected

The test cases for case_1_ingested_vcf_with_inhouse and dragen_ingested_vcf_no_inhouse are duplicated. Removing the duplicates will streamline the tests and prevent redundancy.

Apply this diff to remove the duplicate test cases:

 #[rstest::rstest]
 #[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
 #[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
-#[case::case_1_ingested_vcf_with_inhouse("tests/seqvars/query/Case_1.ingested.vcf", true)]
-#[case::dragen_ingested_vcf_no_inhouse("tests/seqvars/query/dragen.ingested.vcf", false)]
 #[tokio::test]

Committable suggestion was skipped due to low confidence.

Comment on lines +1608 to +1773
let (inhouse_db, inhouse_meta) =
open_rocksdb(&path, "counts", "meta").map_err(|e| {
anyhow::anyhow!(
"problem opening {} metadata at {}: {}",
"counts",
path.as_ref().as_os_str().to_string_lossy(),
e
)
})?;

if inhouse_meta.genome_release != genome_release {
anyhow::bail!(
"genome release mismatch: expected {} but got {}",
genome_release,
inhouse_meta.genome_release
);
}

Ok(Self {
inhouse_db,
inhouse_meta,
})
}

/// Query for in-house data frequencies.
fn query(
&self,
key: &annonars::common::keys::Var,
) -> Result<Option<Counts>, anyhow::Error> {
let cf_counts = self
.inhouse_db
.cf_handle("counts")
.expect("cannot find column family 'counts'");
let key: Vec<u8> = key.clone().into();
let res = self
.inhouse_db
.get_cf(&cf_counts, key)
.map_err(|e| anyhow::anyhow!("querying in-house RocksDB failed: {}", e))?;
Ok(res.as_ref().map(|res| Counts::from_vec(res)))
}

/// Annotate the given seqvars record.
pub fn annotate_seqvar(
&self,
record: VariantRecord,
) -> Result<VariantRecord, anyhow::Error> {
let VariantRecord {
vcf_variant,
call_infos,
ann_fields,
population_frequencies,
} = record;
let PopulationFrequencies {
gnomad_exomes,
gnomad_genomes,
gnomad_mtdna,
helixmtdb,
inhouse,
} = population_frequencies;

let inhouse = if let Some(inhouse_counts) =
self.query(&annonars::common::keys::Var::from(
&vcf_variant.chrom,
vcf_variant.pos,
&vcf_variant.ref_allele,
&vcf_variant.alt_allele,
))? {
let Counts {
count_homref,
count_hemiref,
count_het,
count_homalt,
count_hemialt,
} = inhouse_counts;
InHouseFrequencies {
an: (count_homref + count_hemiref + count_het + count_homalt + count_hemialt)
as i32,
hom: count_homalt as i32,
het: count_het as i32,
hemi: count_hemialt as i32,
}
} else {
inhouse
};

Ok(VariantRecord {
vcf_variant,
call_infos,
ann_fields,
population_frequencies: PopulationFrequencies {
gnomad_exomes,
gnomad_genomes,
gnomad_mtdna,
helixmtdb,
inhouse,
},
})
}
}
}

⚠️ Potential issue

Potential error handling improvement in inhouse::Dbs::query method

In the query method, when querying the in-house database, consider handling the case where the column family 'counts' might not exist. This would prevent a potential panic if the column family is missing.

Apply this diff to handle the missing column family gracefully:

-            let cf_counts = self
-                .inhouse_db
-                .cf_handle("counts")
-                .expect("cannot find column family 'counts'");
+            let cf_counts = match self.inhouse_db.cf_handle("counts") {
+                Some(cf) => cf,
+                None => return Err(anyhow::anyhow!("Column family 'counts' not found in in-house database")),
+            };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Code for accessing the in-house frequencies.
pub(crate) mod inhouse {
use std::{path::Path, sync::Arc};
use crate::seqvars::aggregate::ds::Counts;
use super::schema::data::{InHouseFrequencies, PopulationFrequencies, VariantRecord};
/// Meta information of the database.
pub struct Meta {
/// Genome release.
pub genome_release: String,
}
/// Container for the database structures.
pub struct Dbs {
/// Inhouse frequency RocksDB handle.
pub inhouse_db: Arc<rocksdb::DBWithThreadMode<rocksdb::MultiThreaded>>,
/// Meta information.
#[allow(dead_code)]
pub inhouse_meta: Meta,
}
/// Open RocksDb given path and column family name for data and metadata.
pub fn open_rocksdb<P: AsRef<std::path::Path>>(
path_rocksdb: P,
cf_data: &str,
cf_meta: &str,
) -> Result<(Arc<rocksdb::DBWithThreadMode<rocksdb::MultiThreaded>>, Meta), anyhow::Error> {
tracing::info!("Opening RocksDB database ...");
let before_open = std::time::Instant::now();
let cf_names = &[cf_meta, cf_data];
let resolved_path_rocksdb = annonars::common::readlink_f(&path_rocksdb)?;
let db = Arc::new(rocksdb::DB::open_cf_for_read_only(
&rocksdb::Options::default(),
resolved_path_rocksdb,
cf_names,
true,
)?);
tracing::info!(" reading meta information");
let meta = {
let cf_meta = db.cf_handle(cf_meta).unwrap();
let meta_genome_release = String::from_utf8(
db.get_cf(&cf_meta, "genome-release")?
.ok_or_else(|| anyhow::anyhow!("missing value meta:genome-release"))?,
)?;
Meta {
genome_release: meta_genome_release,
}
};
tracing::info!(" meta:genome-release = {}", &meta.genome_release);
tracing::info!(
"... opening RocksDB database took {:?}",
before_open.elapsed()
);
Ok((db, meta))
}
impl Dbs {
/// Initialize from path that contains the annonars databases.
pub fn with_path<P: AsRef<Path>>(
path: P,
genome_release: &str,
) -> Result<Self, anyhow::Error> {
let (inhouse_db, inhouse_meta) =
open_rocksdb(&path, "counts", "meta").map_err(|e| {
anyhow::anyhow!(
"problem opening {} metadata at {}: {}",
"counts",
path.as_ref().as_os_str().to_string_lossy(),
e
)
})?;
if inhouse_meta.genome_release != genome_release {
anyhow::bail!(
"genome release mismatch: expected {} but got {}",
genome_release,
inhouse_meta.genome_release
);
}
Ok(Self {
inhouse_db,
inhouse_meta,
})
}
/// Query for in-house data frequencies.
fn query(
&self,
key: &annonars::common::keys::Var,
) -> Result<Option<Counts>, anyhow::Error> {
let cf_counts = match self.inhouse_db.cf_handle("counts") {
Some(cf) => cf,
None => return Err(anyhow::anyhow!("Column family 'counts' not found in in-house database")),
};
let key: Vec<u8> = key.clone().into();
let res = self
.inhouse_db
.get_cf(&cf_counts, key)
.map_err(|e| anyhow::anyhow!("querying in-house RocksDB failed: {}", e))?;
Ok(res.as_ref().map(|res| Counts::from_vec(res)))
}
/// Annotate the given seqvars record.
pub fn annotate_seqvar(
&self,
record: VariantRecord,
) -> Result<VariantRecord, anyhow::Error> {
let VariantRecord {
vcf_variant,
call_infos,
ann_fields,
population_frequencies,
} = record;
let PopulationFrequencies {
gnomad_exomes,
gnomad_genomes,
gnomad_mtdna,
helixmtdb,
inhouse,
} = population_frequencies;
let inhouse = if let Some(inhouse_counts) =
self.query(&annonars::common::keys::Var::from(
&vcf_variant.chrom,
vcf_variant.pos,
&vcf_variant.ref_allele,
&vcf_variant.alt_allele,
))? {
let Counts {
count_homref,
count_hemiref,
count_het,
count_homalt,
count_hemialt,
} = inhouse_counts;
InHouseFrequencies {
an: (count_homref + count_hemiref + count_het + count_homalt + count_hemialt)
as i32,
hom: count_homalt as i32,
het: count_het as i32,
hemi: count_hemialt as i32,
}
} else {
inhouse
};
Ok(VariantRecord {
vcf_variant,
call_infos,
ann_fields,
population_frequencies: PopulationFrequencies {
gnomad_exomes,
gnomad_genomes,
gnomad_mtdna,
helixmtdb,
inhouse,
},
})
}
}
}

@holtgrewe holtgrewe merged commit 45bec0c into main Oct 30, 2024
8 of 9 checks passed
@holtgrewe holtgrewe deleted the 232-use-in-house-frequencies-in-seqvars-query branch October 30, 2024 14:02

codecov bot commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 44 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (0ce7f13) to head (1b1a37f).
Report is 1 commits behind head on main.

Files with missing lines                     Patch %   Lines
src/seqvars/query/interpreter/frequency.rs   39%       20 Missing ⚠️
src/seqvars/query/mod.rs                     87%       19 Missing ⚠️
src/seqvars/aggregate/mod.rs                 0%         5 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##            main    #522    +/-   ##
======================================
+ Coverage     71%     79%    +7%     
======================================
  Files         43      43            
  Lines      11830   12012   +182     
======================================
+ Hits        8510    9492   +982     
+ Misses      3320    2520   -800     
Files with missing lines                     Coverage Δ
src/seqvars/aggregate/mod.rs                 39% <0%> (-1%) ⬇️
src/seqvars/query/mod.rs                     59% <87%> (+49%) ⬆️
src/seqvars/query/interpreter/frequency.rs   71% <39%> (-14%) ⬇️

... and 14 files with indirect coverage changes

Labels: None yet
Projects: None yet

Development

Successfully merging this pull request may close these issues:

  • Use in-house frequencies in "seqvars query"

1 participant