Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: include source information in transcript database #615

Merged
merged 24 commits into from
Nov 11, 2024

Conversation

tedil
Copy link
Contributor

@tedil tedil commented Nov 11, 2024

Release-as: 0.30.0
#601

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new enumerations for transcript database assembly and source.
    • Added a SourceVersion message type for encapsulating version information.
    • Enhanced command line arguments for specifying genome assembly details and transcript sources.
  • Bug Fixes

    • Improved assembly consistency checks during database merging.
  • Refactor

    • Streamlined constructors by removing unnecessary parameters.
    • Centralized assembly management within the TxSeqDatabase.
  • Documentation

    • Updated comments and documentation to reflect changes in argument structures and functionality.

Copy link
Contributor

coderabbitai bot commented Nov 11, 2024

Warning

Rate limit exceeded

@tedil has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 24 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff02c9 and dbf7431.

Walkthrough

The pull request introduces significant changes to the handling of assembly and source information in various components of the codebase. New enumerations and message structures are added to the protocol buffer definitions, enhancing the representation of transcript databases. Additionally, several constructors and method signatures are modified to remove the explicit assembly parameter, streamlining the initialization process across multiple modules. The changes also include updates to command-line argument structures and improvements in version handling, ensuring a more cohesive and maintainable codebase.

Changes

File Path Change Summary
protos/mehari/txs.proto Added enums Assembly and Source, new message SourceVersion, and updated TxSeqDatabase structure.
src/annotate/seqvars/csq.rs Removed Assembly parameter from ConsequencePredictor::new, adjusted internal logic and tests.
src/annotate/seqvars/mod.rs Updated command line arguments and removed assembly parameter from fill_coords and setup_annotator.
src/annotate/seqvars/provider.rs Removed assembly parameter from TxIntervalTrees::new and build_indices, adjusted Provider methods.
src/db/create/mod.rs Updated Args struct to replace genome_release with assembly and added new fields; modified TranscriptLoader.
src/db/merge.rs Enhanced merge_transcript_databases to check assembly consistency using itertools.
src/db/mod.rs Introduced TranscriptDatabase trait with assembly method for TxSeqDatabase.
src/db/subset/mod.rs Renamed genome_release field to source_version in TxSeqDatabase.
src/server/run/actix_server/versions.rs Added version_cdot field to DataVersionEntry, updated version handling logic.
src/server/run/mod.rs Removed assembly parameter from MehariProvider and SeqvarConsequencePredictor constructors.
src/verify/seqvars.rs Removed assembly parameter from MehariProvider and ConsequencePredictor constructors.

Possibly related PRs

🐇 In the meadow where data flows,
New enums and messages, how it grows!
Assembly and sources, now clearer to see,
A hop and a skip, as smooth as can be!
With each little change, we refine and we play,
In the world of transcripts, we leap every day! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

holtgrewe
holtgrewe previously approved these changes Nov 11, 2024
Copy link
Contributor

@holtgrewe holtgrewe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, see comments

protos/mehari/txs.proto Outdated Show resolved Hide resolved
protos/mehari/txs.proto Outdated Show resolved Hide resolved
protos/mehari/txs.proto Outdated Show resolved Hide resolved
protos/mehari/txs.proto Outdated Show resolved Hide resolved
src/db/create/mod.rs Outdated Show resolved Hide resolved
src/db/create/mod.rs Outdated Show resolved Hide resolved
src/db/merge.rs Outdated Show resolved Hide resolved
@tedil tedil requested a review from holtgrewe November 11, 2024 11:47
@tedil tedil marked this pull request as ready for review November 11, 2024 11:47
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (13)
src/db/mod.rs (2)

13-15: Add documentation for the new trait and method.

Consider adding documentation comments to:

  • Explain the purpose of the TranscriptDatabase trait
  • Document the assembly() method's behavior and return value
  • Include usage examples if applicable

Example:

/// Represents a database containing transcript information with assembly metadata.
pub trait TranscriptDatabase {
    /// Returns the genome assembly version used in this transcript database.
    fn assembly(&self) -> Assembly;
}

17-31: Consider centralizing assembly type mapping.

The current implementation directly maps between pbs::txs::Assembly and biocommons_bioutils::assemblies::Assembly. Consider:

  1. Creating a central location for assembly type conversions
  2. Documenting supported assemblies and their relationships
  3. Adding tests for assembly conversions

This would make it easier to maintain and extend assembly support in the future.

src/db/merge.rs (2)

31-39: Consider using Result instead of assertions for better error handling.

While the assembly verification logic is correct, using assert! for business logic validation can be improved. Consider returning Result with descriptive error messages instead, which would provide better error handling for end users.

-        assert!(first.source_version.iter().map(|v| v.assembly).all_equal());
+        if !first.source_version.iter().map(|v| v.assembly).all_equal() {
+            return Err(anyhow::anyhow!("Inconsistent assembly versions in first database"));
+        }
         let assembly = first
             .source_version
             .first()
             .expect("At least one source_version entry expected")
             .assembly;
-        assert!(others
+        if !others
             .iter()
             .all(|db| db.source_version.iter().all(|v| v.assembly == assembly)));
+        {
+            return Err(anyhow::anyhow!("All databases must use the same assembly version"));
+        }

Line range hint 74-77: Consider using with_capacity for HashMap initialization.

Since we know the size of the other database's gene-to-transcript mappings, we can optimize memory allocation by pre-allocating the HashMap.

-            let mut other_gene_to_tx: HashMap<_, _> = other_tx_db
+            let mut other_gene_to_tx: HashMap<_, _> = HashMap::with_capacity(other_tx_db.gene_to_tx.len());
+            other_gene_to_tx.extend(other_tx_db
                 .gene_to_tx
                 .into_iter()
-                .map(|g2t| (g2t.gene_id.clone(), g2t))
-                .collect();
+                .map(|g2t| (g2t.gene_id.clone(), g2t)));
src/server/run/mod.rs (1)

138-143: LGTM: Clean architectural improvement.

The removal of explicit assembly configuration in favor of provider-based handling is a good architectural improvement that:

  1. Reduces parameter coupling
  2. Centralizes assembly information handling
  3. Maintains backward compatibility through the existing genome_release mapping

The changes align well with the PR's objective of improving source information handling in the transcript database.

Consider documenting this architectural change in the project's architecture documentation to help future maintainers understand the design decisions.

src/server/run/actix_server/versions.rs (1)

106-113: Consider optimizing the filter_map operation.

The current implementation uses filter_map with bool::then, which can be more efficiently written using separate filter and map operations.

-            .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
+            .filter(|&v| v.source_name == source_name)
+            .map(|v| v.source_version.clone())
🧰 Tools
🪛 GitHub Check: clippy

[warning] 109-109: usage of bool::then in filter_map
warning: usage of bool::then in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter then map instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)] on by default

protos/mehari/txs.proto (2)

30-38: Consider future transcript sources.

The enum covers the main transcript sources (RefSeq and Ensembl). Consider if we should document a plan for handling additional sources that might be needed in the future (e.g., GENCODE, UCSC).


40-54: Rename source_name field for consistency.

The field name source_name (line 49) is inconsistent with the enum type Source. Consider renaming it to source for better alignment with the type name and other field naming patterns.

    // Source, either RefSeq or Ensembl (or Unknown).
-    Source source_name = 4;
+    Source source = 4;
src/db/subset/mod.rs (1)

Line range hint 28-120: Consider enhancing error handling and modularity.

The subset_tx_db function handles complex data structure manipulation. Consider these improvements:

  1. Extract the transcript filtering logic into a separate function for better maintainability
  2. Add more descriptive error messages when tx_db or seq_db are missing
  3. Consider adding tests for edge cases (empty gene symbols list, non-existent genes)

Example improvement for error handling:

-    let container_tx_db = container.tx_db.as_ref().expect("no tx_db");
+    let container_tx_db = container.tx_db.as_ref()
+        .ok_or_else(|| anyhow::anyhow!("Missing transcript database in container"))?;
src/annotate/seqvars/csq.rs (1)

1240-1241: Consider consolidating test setup code.

Multiple test functions have similar provider and predictor initialization code. Consider extracting this common setup into a helper function to reduce code duplication.

Example helper function:

fn create_test_predictor(config: Option<Config>) -> Result<ConsequencePredictor, anyhow::Error> {
    let tx_path = "tests/data/annotate/db/grch37/txs.bin.zst";
    let tx_db = load_tx_db(tx_path)?;
    let provider = Arc::new(MehariProvider::new(tx_db, Default::default()));
    Ok(ConsequencePredictor::new(provider, config.unwrap_or_default()))
}

Also applies to: 1298-1299, 1552-1553

src/annotate/seqvars/mod.rs (1)

Line range hint 1944-1971: Consider enhancing error handling for annotator setup.

The function correctly handles the assembly changes and properly initializes annotators based on provided sources. However, consider adding more specific error context for each annotator initialization failure.

Consider wrapping errors with additional context:

 if let Some(rocksdb_path) = &args.sources.frequencies {
-    annotators.push(FrequencyAnnotator::from_path(rocksdb_path).map(AnnotatorEnum::Frequency)?)
+    annotators.push(
+        FrequencyAnnotator::from_path(rocksdb_path)
+            .map(AnnotatorEnum::Frequency)
+            .with_context(|| format!("Failed to initialize frequency annotator from {}", rocksdb_path))?
+    )
 }
src/annotate/seqvars/provider.rs (2)

338-345: Improve formatting of data_version for clarity

The current data_version concatenates tx_seq_db.version, formatted source_version, and config without clear separators and uses debug formatting, which might reduce readability. Consider adding separators and using consistent formatting to enhance clarity.

Apply this diff to improve the data_version formatting:

 let data_version = format!(
-    "{}{}{:?}",
+    "{}_{}_{}",
     tx_seq_db.version.as_ref().unwrap_or(&"".to_string()),
-    tx_seq_db
-        .source_version
-        .iter()
-        .map(|v| format!("{:#?}", v))
-        .join(","),
+    tx_seq_db.source_version.join(","),
+    format!("{:?}", config)
 );

This change adds underscores as separators and formats source_version without debug syntax for better readability.


604-606: Simplify assembly formatting in error messages

Using "{:?}" in format!("{:?}", self.assembly()) includes debug information, which might not be necessary for user-facing error messages. Consider using format!("{}", self.assembly()) to present a cleaner message.

Apply this diff to adjust the formatting:

 Err(Error::NoAlignmentFound(
     tx_ac.to_string(),
-    format!("{:?}", self.assembly()),
+    format!("{}", self.assembly()),
 ))

This change uses the Display trait for Assembly, producing a more user-friendly error message.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cce8a6d and 8ff02c9.

⛔ Files ignored due to path filters (6)
  • src/db/create/snapshots/mehari__db__create__test__run_smoke_brca1_opa1.snap is excluded by !**/*.snap
  • src/db/create/snapshots/mehari__db__create__test__run_smoke_mitochondrial.snap is excluded by !**/*.snap
  • src/db/create/snapshots/mehari__db__create__test__run_smoke_selenoproteins.snap is excluded by !**/*.snap
  • src/db/subset/snapshots/mehari__db__subset__tests__subset_tx_db.snap is excluded by !**/*.snap
  • tests/data/annotate/db/grch37/txs.bin.zst is excluded by !**/*.zst
  • tests/data/annotate/db/grch38/txs.bin.zst is excluded by !**/*.zst
📒 Files selected for processing (11)
  • protos/mehari/txs.proto (2 hunks)
  • src/annotate/seqvars/csq.rs (5 hunks)
  • src/annotate/seqvars/mod.rs (7 hunks)
  • src/annotate/seqvars/provider.rs (6 hunks)
  • src/db/create/mod.rs (13 hunks)
  • src/db/merge.rs (2 hunks)
  • src/db/mod.rs (1 hunks)
  • src/db/subset/mod.rs (1 hunks)
  • src/server/run/actix_server/versions.rs (2 hunks)
  • src/server/run/mod.rs (1 hunks)
  • src/verify/seqvars.rs (0 hunks)
💤 Files with no reviewable changes (1)
  • src/verify/seqvars.rs
🧰 Additional context used
🪛 GitHub Check: clippy
src/server/run/actix_server/versions.rs

[warning] 109-109: usage of bool::then in filter_map
warning: usage of bool::then in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter then map instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)] on by default

🔇 Additional comments (22)
src/db/mod.rs (2)

3-5: LGTM: Imports are well-organized and necessary.

The new imports are properly scoped and support the new transcript database functionality.


17-31: ⚠️ Potential issue

Replace unwrap() and panic! with proper error handling.

The current implementation has several reliability concerns:

  1. Multiple unwrap() calls could cause runtime panics
  2. Explicit panic! for unsupported assemblies is not user-friendly
  3. Assumes source_version is non-empty
  4. Only uses the first source version when there could be multiple

Consider this safer implementation:

impl TranscriptDatabase for TxSeqDatabase {
    fn assembly(&self) -> Assembly {
        let assembly = self
            .source_version
            .first()
            .ok_or_else(|| anyhow::anyhow!("No source version found"))?
            .assembly
            .try_into()
            .map_err(|_| anyhow::anyhow!("Invalid assembly enum value"))?;

        match assembly {
            pbs::txs::Assembly::Grch37 => Ok(Assembly::Grch37p10),
            pbs::txs::Assembly::Grch38 => Ok(Assembly::Grch38),
            other => Err(anyhow::anyhow!("Unsupported assembly: {:?}", other)),
        }
    }
}

Let's check if there are other places in the codebase that might be affected by this change:

src/db/merge.rs (1)

6-6: LGTM: Good choice using itertools!

The itertools crate is a good choice for the all_equal() functionality needed in assembly verification.

src/server/run/mod.rs (2)

143-143: Document the configuration changes in SeqvarConsequencePredictor.

The removal of explicit assembly configuration in favor of Default::default() represents a significant architectural change. While the code change is sound, it would benefit from documentation updates.

Consider adding a code comment explaining:

  1. How assembly information is now handled
  2. What the default configuration includes
  3. Any implications for existing integrations
+    // Assembly information is now handled through the provider's internal configuration
+    // Default configuration includes: ...
     SeqvarConsequencePredictor::new(provider.clone(), Default::default()),
#!/bin/bash
# Description: Verify the impact of removing assembly configuration

# Check for any tests that might need updates
rg -A 5 "SeqvarConsequencePredictor::new\(" --type rust

# Look for any documentation that might need updates
fd . -e md -e rst --exec rg -l "SeqvarConsequencePredictor"

138-138: Verify the default configuration for MehariProvider.

The removal of the assembly parameter and use of Default::default() needs verification to ensure it provides appropriate defaults for all use cases.

✅ Verification successful

Default configuration for MehariProvider is safe to use

Based on the codebase analysis, the change to use Default::default() is safe and maintains backward compatibility:

  • The code shows consistent usage of Default::default() in test cases and production code
  • MehariProviderConfigBuilder::default() is used throughout the codebase for configuration
  • When specific configuration is needed, the builder pattern is used to customize settings (e.g., transcript picking)
  • The default configuration is used successfully in test scenarios without any assembly parameter
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any custom configurations that might have been using the assembly parameter

# Look for any previous custom configurations of MehariProvider
rg -A 5 "MehariProvider::new\(" --type rust

# Check for any potential configuration structs that might need updates
ast-grep --pattern 'struct $name {
  $$$
  assembly: $_,
  $$$
}'

Length of output: 689


Script:

#!/bin/bash
# Let's check for MehariProvider usage and configuration patterns

# Look for MehariProvider configurations and usages
rg -A 5 "MehariProvider" --type rust

# Check for any configuration structs related to Mehari
ast-grep --pattern 'struct MehariConfig {
  $$$
}'

# Check for Default implementation
ast-grep --pattern 'impl Default for MehariConfig {
  $$$
}'

Length of output: 10936

src/server/run/actix_server/versions.rs (5)

1-9: LGTM! Well-organized imports.

The addition of itertools::Itertools is appropriate for the string joining operations used in version handling.


97-98: LGTM! Well-documented new field.

The addition of version_cdot aligns with the PR objective of including source information in the transcript database.


105-116: LGTM! Clean version handling implementation.

The version extraction and joining logic is well-structured and handles all cases appropriately.

🧰 Tools
🪛 GitHub Check: clippy

[warning] 109-109: usage of bool::then in filter_map
warning: usage of bool::then in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter then map instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)] on by default


120-122: LGTM! Proper struct initialization.

The initialization correctly includes all fields in the same order as the struct definition.


105-116: Verify consistent version handling across the codebase.

Let's ensure that the version information is consistently handled across different components.

✅ Verification successful

Let me gather more specific information about the version handling implementation.


Based on the gathered information, I can now provide the final response:

Version handling implementation is consistent across the codebase

The version handling implementation in versions.rs aligns well with the rest of the codebase:

  • Version information is properly structured through dedicated types (SoftwareVersions, DataVersionEntry, VersionsInfoResponse)
  • The cdot version handling is consistently implemented across database operations and server responses
  • Version information is appropriately exposed through the /api/v1/version endpoint
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent version handling patterns across the codebase

# Look for other version-related structures and their handling
echo "Checking for version-related structures:"
rg -l "version|Version" --type rust

# Check for potential places where cdot version might need to be handled
echo -e "\nChecking for cdot version handling:"
rg "cdot.*version|version.*cdot" --type rust

# Look for similar version joining patterns that might need updating
echo -e "\nChecking for version joining patterns:"
ast-grep --pattern 'join(",")'

Length of output: 2083


Script:

#!/bin/bash
# Check the implementation details of version handling

# Look for version-related structs and types
echo "Checking version-related structures:"
rg -A 5 "struct.*Version|type.*Version" --type rust

# Check for version handling in server responses
echo -e "\nChecking version handling in responses:"
rg -A 5 "version.*Response|Response.*version" --type rust

# Look for version-related database operations
echo -e "\nChecking version handling in database operations:"
ast-grep --pattern 'fn $_ ($$$) -> $_ {
  $$$
  version
  $$$
}'

Length of output: 2600

🧰 Tools
🪛 GitHub Check: clippy

[warning] 109-109: usage of bool::then in filter_map
warning: usage of bool::then in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter then map instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)] on by default

protos/mehari/txs.proto (2)

20-28: LGTM: Assembly enum is well-defined and documented.

The enum follows proto3 best practices with UNKNOWN = 0 and includes comprehensive values for current needs.


194-195: LGTM: Well-documented field addition.

The repeated source_version field is appropriately documented, explaining its purpose in tracking information during database merges.

src/db/subset/mod.rs (1)

119-119: ⚠️ Potential issue

Breaking change: Field renamed from genome_release to source_version.

This change aligns with the PR objective to enhance source information handling in the transcript database. However, this is a breaking change that will require updates in dependent code.

Let's verify the impact and test coverage:

src/annotate/seqvars/csq.rs (3)

104-106: LGTM: Constructor refactoring improves code organization.

The removal of the explicit assembly parameter and using provider.assembly() instead centralizes the assembly management and reduces parameter coupling.


1049-1051: LGTM: Test setup correctly updated.

The test initialization has been properly updated to reflect the constructor changes.


Line range hint 1-1800: Overall code quality is excellent.

The code demonstrates:

  • Comprehensive documentation
  • Robust error handling
  • Extensive test coverage
  • Consistent implementation of changes
src/annotate/seqvars/mod.rs (3)

Line range hint 1365-1369: LGTM: Consistent with assembly handling changes.

The method correctly uses the updated fill_coords signature and maintains proper control flow with early return for non-canonical chromosomes.


Line range hint 1731-1744: LGTM: Well-structured configuration setup.

The method effectively uses the builder pattern to configure the annotator, properly handling all configuration options from the arguments.


Line range hint 647-657: LGTM: Assembly handling refactored correctly.

The assembly is now properly retrieved from the struct's state with a clear error message. The release string and chromosome naming logic correctly handles both GRCh37 and GRCh38 assemblies.

Let's verify the assembly handling across the codebase:

✅ Verification successful

Let me gather more context about the assembly usage to ensure the refactoring is consistent.


Assembly handling is correctly implemented and consistent across the codebase

The verification shows that:

  • Assembly is properly handled as an optional field in structs
  • The expect error handling in the reviewed code matches the pattern used elsewhere
  • Assembly state is accessed consistently through struct fields
  • The assembly parameter is only used in constructor/setup functions (new, set_assembly) where it's appropriate
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify assembly handling consistency
# Test: Check for any direct assembly parameter usage that should use struct state
rg -A 3 "fn.*assembly: Assembly"

Length of output: 1996


Script:

#!/bin/bash
# Check for assembly field declarations and their usage
ast-grep --pattern 'assembly: Option<Assembly>'

# Check for assembly.expect usage to ensure consistent error handling
rg -A 2 'assembly\.expect'

# Check for any direct assembly usage without expect
rg -A 2 'self\.assembly\s+[^=.]'

Length of output: 646

src/annotate/seqvars/provider.rs (3)

44-45: Simplified TxIntervalTrees::new by removing unnecessary assembly parameter

The assembly parameter has been removed from TxIntervalTrees::new, and the assembly is now obtained directly from db. This change simplifies the function signature and reduces redundancy.


52-53: Streamlined assembly retrieval in build_indices

In build_indices, the assembly is now retrieved using db.assembly(). This aligns with the changes in TxIntervalTrees::new and ensures that the assembly information is consistently sourced from the TxSeqDatabase.


164-165: Simplified Provider::new by removing assembly parameter

The assembly parameter has been removed from Provider::new. The assembly is now accessed through tx_seq_db, which simplifies the constructor and ensures that assembly information is consistently obtained from the database.

src/db/create/mod.rs Outdated Show resolved Hide resolved
tedil and others added 7 commits November 11, 2024 12:56
@holtgrewe holtgrewe enabled auto-merge (squash) November 11, 2024 12:40
@holtgrewe holtgrewe merged commit 111ffcc into main Nov 11, 2024
9 checks passed
@holtgrewe holtgrewe deleted the 601-include-database-source-information branch November 11, 2024 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants