-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: include source information in transcript database #615
Conversation
…source-information
Warning Rate limit exceeded@tedil has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 24 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThe pull request introduces significant changes to the handling of assembly and source information in various components of the codebase. New enumerations and message structures are added to the protocol buffer definitions, enhancing the representation of transcript databases. Additionally, several constructors and method signatures are modified to remove the explicit Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, see comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (13)
src/db/mod.rs (2)
13-15
: Add documentation for the new trait and method.Consider adding documentation comments to:
- Explain the purpose of the
TranscriptDatabase
trait- Document the
assembly()
method's behavior and return value- Include usage examples if applicable
Example:
/// Represents a database containing transcript information with assembly metadata. pub trait TranscriptDatabase { /// Returns the genome assembly version used in this transcript database. fn assembly(&self) -> Assembly; }
17-31
: Consider centralizing assembly type mapping.The current implementation directly maps between
pbs::txs::Assembly
andbiocommons_bioutils::assemblies::Assembly
. Consider:
- Creating a central location for assembly type conversions
- Documenting supported assemblies and their relationships
- Adding tests for assembly conversions
This would make it easier to maintain and extend assembly support in the future.
src/db/merge.rs (2)
31-39
: Consider using Result instead of assertions for better error handling.While the assembly verification logic is correct, using
assert!
for business logic validation can be improved. Consider returningResult
with descriptive error messages instead, which would provide better error handling for end users.- assert!(first.source_version.iter().map(|v| v.assembly).all_equal()); + if !first.source_version.iter().map(|v| v.assembly).all_equal() { + return Err(anyhow::anyhow!("Inconsistent assembly versions in first database")); + } let assembly = first .source_version .first() .expect("At least one source_version entry expected") .assembly; - assert!(others + if !others .iter() .all(|db| db.source_version.iter().all(|v| v.assembly == assembly))); + { + return Err(anyhow::anyhow!("All databases must use the same assembly version")); + }
Line range hint
74-77
: Consider using with_capacity for HashMap initialization.Since we know the size of the other database's gene-to-transcript mappings, we can optimize memory allocation by pre-allocating the HashMap.
- let mut other_gene_to_tx: HashMap<_, _> = other_tx_db + let mut other_gene_to_tx: HashMap<_, _> = HashMap::with_capacity(other_tx_db.gene_to_tx.len()); + other_gene_to_tx.extend(other_tx_db .gene_to_tx .into_iter() - .map(|g2t| (g2t.gene_id.clone(), g2t)) - .collect(); + .map(|g2t| (g2t.gene_id.clone(), g2t)));src/server/run/mod.rs (1)
138-143
: LGTM: Clean architectural improvement.The removal of explicit assembly configuration in favor of provider-based handling is a good architectural improvement that:
- Reduces parameter coupling
- Centralizes assembly information handling
- Maintains backward compatibility through the existing genome_release mapping
The changes align well with the PR's objective of improving source information handling in the transcript database.
Consider documenting this architectural change in the project's architecture documentation to help future maintainers understand the design decisions.
src/server/run/actix_server/versions.rs (1)
106-113
: Consider optimizing the filter_map operation.The current implementation uses
filter_map
withbool::then
, which can be more efficiently written using separatefilter
andmap
operations.- .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone())) + .filter(|&v| v.source_name == source_name) + .map(|v| v.source_version.clone())🧰 Tools
🪛 GitHub Check: clippy
[warning] 109-109: usage of
bool::then
infilter_map
warning: usage ofbool::then
infilter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: usefilter
thenmap
instead:filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note:#[warn(clippy::filter_map_bool_then)]
on by defaultprotos/mehari/txs.proto (2)
30-38
: Consider future transcript sources.The enum covers the main transcript sources (RefSeq and Ensembl). Consider if we should document a plan for handling additional sources that might be needed in the future (e.g., GENCODE, UCSC).
40-54
: Renamesource_name
field for consistency.The field name
source_name
(line 49) is inconsistent with the enum typeSource
. Consider renaming it tosource
for better alignment with the type name and other field naming patterns.// Source, either RefSeq or Ensembl (or Unknown). - Source source_name = 4; + Source source = 4;src/db/subset/mod.rs (1)
Line range hint
28-120
: Consider enhancing error handling and modularity.The
subset_tx_db
function handles complex data structure manipulation. Consider these improvements:
- Extract the transcript filtering logic into a separate function for better maintainability
- Add more descriptive error messages when
tx_db
orseq_db
are missing- Consider adding tests for edge cases (empty gene symbols list, non-existent genes)
Example improvement for error handling:
- let container_tx_db = container.tx_db.as_ref().expect("no tx_db"); + let container_tx_db = container.tx_db.as_ref() + .ok_or_else(|| anyhow::anyhow!("Missing transcript database in container"))?;src/annotate/seqvars/csq.rs (1)
1240-1241
: Consider consolidating test setup code.Multiple test functions have similar provider and predictor initialization code. Consider extracting this common setup into a helper function to reduce code duplication.
Example helper function:
fn create_test_predictor(config: Option<Config>) -> Result<ConsequencePredictor, anyhow::Error> { let tx_path = "tests/data/annotate/db/grch37/txs.bin.zst"; let tx_db = load_tx_db(tx_path)?; let provider = Arc::new(MehariProvider::new(tx_db, Default::default())); Ok(ConsequencePredictor::new(provider, config.unwrap_or_default())) }Also applies to: 1298-1299, 1552-1553
src/annotate/seqvars/mod.rs (1)
Line range hint
1944-1971
: Consider enhancing error handling for annotator setup.The function correctly handles the assembly changes and properly initializes annotators based on provided sources. However, consider adding more specific error context for each annotator initialization failure.
Consider wrapping errors with additional context:
if let Some(rocksdb_path) = &args.sources.frequencies { - annotators.push(FrequencyAnnotator::from_path(rocksdb_path).map(AnnotatorEnum::Frequency)?) + annotators.push( + FrequencyAnnotator::from_path(rocksdb_path) + .map(AnnotatorEnum::Frequency) + .with_context(|| format!("Failed to initialize frequency annotator from {}", rocksdb_path))? + ) }src/annotate/seqvars/provider.rs (2)
338-345
: Improve formatting ofdata_version
for clarityThe current
data_version
concatenatestx_seq_db.version
, formattedsource_version
, andconfig
without clear separators and uses debug formatting, which might reduce readability. Consider adding separators and using consistent formatting to enhance clarity.Apply this diff to improve the
data_version
formatting:let data_version = format!( - "{}{}{:?}", + "{}_{}_{}", tx_seq_db.version.as_ref().unwrap_or(&"".to_string()), - tx_seq_db - .source_version - .iter() - .map(|v| format!("{:#?}", v)) - .join(","), + tx_seq_db.source_version.join(","), + format!("{:?}", config) );This change adds underscores as separators and formats
source_version
without debug syntax for better readability.
604-606
: Simplify assembly formatting in error messagesUsing
"{:?}"
informat!("{:?}", self.assembly())
includes debug information, which might not be necessary for user-facing error messages. Consider usingformat!("{}", self.assembly())
to present a cleaner message.Apply this diff to adjust the formatting:
Err(Error::NoAlignmentFound( tx_ac.to_string(), - format!("{:?}", self.assembly()), + format!("{}", self.assembly()), ))This change uses the
Display
trait forAssembly
, producing a more user-friendly error message.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (6)
src/db/create/snapshots/mehari__db__create__test__run_smoke_brca1_opa1.snap
is excluded by!**/*.snap
src/db/create/snapshots/mehari__db__create__test__run_smoke_mitochondrial.snap
is excluded by!**/*.snap
src/db/create/snapshots/mehari__db__create__test__run_smoke_selenoproteins.snap
is excluded by!**/*.snap
src/db/subset/snapshots/mehari__db__subset__tests__subset_tx_db.snap
is excluded by!**/*.snap
tests/data/annotate/db/grch37/txs.bin.zst
is excluded by!**/*.zst
tests/data/annotate/db/grch38/txs.bin.zst
is excluded by!**/*.zst
📒 Files selected for processing (11)
protos/mehari/txs.proto
(2 hunks)src/annotate/seqvars/csq.rs
(5 hunks)src/annotate/seqvars/mod.rs
(7 hunks)src/annotate/seqvars/provider.rs
(6 hunks)src/db/create/mod.rs
(13 hunks)src/db/merge.rs
(2 hunks)src/db/mod.rs
(1 hunks)src/db/subset/mod.rs
(1 hunks)src/server/run/actix_server/versions.rs
(2 hunks)src/server/run/mod.rs
(1 hunks)src/verify/seqvars.rs
(0 hunks)
💤 Files with no reviewable changes (1)
- src/verify/seqvars.rs
🧰 Additional context used
🪛 GitHub Check: clippy
src/server/run/actix_server/versions.rs
[warning] 109-109: usage of bool::then
in filter_map
warning: usage of bool::then
in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter
then map
instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)]
on by default
🔇 Additional comments (22)
src/db/mod.rs (2)
3-5
: LGTM: Imports are well-organized and necessary.
The new imports are properly scoped and support the new transcript database functionality.
17-31
:
Replace unwrap() and panic! with proper error handling.
The current implementation has several reliability concerns:
- Multiple
unwrap()
calls could cause runtime panics - Explicit
panic!
for unsupported assemblies is not user-friendly - Assumes
source_version
is non-empty - Only uses the first source version when there could be multiple
Consider this safer implementation:
impl TranscriptDatabase for TxSeqDatabase {
fn assembly(&self) -> Assembly {
let assembly = self
.source_version
.first()
.ok_or_else(|| anyhow::anyhow!("No source version found"))?
.assembly
.try_into()
.map_err(|_| anyhow::anyhow!("Invalid assembly enum value"))?;
match assembly {
pbs::txs::Assembly::Grch37 => Ok(Assembly::Grch37p10),
pbs::txs::Assembly::Grch38 => Ok(Assembly::Grch38),
other => Err(anyhow::anyhow!("Unsupported assembly: {:?}", other)),
}
}
}
Let's check if there are other places in the codebase that might be affected by this change:
src/db/merge.rs (1)
6-6
: LGTM: Good choice using itertools!
The itertools
crate is a good choice for the all_equal()
functionality needed in assembly verification.
src/server/run/mod.rs (2)
143-143
: Document the configuration changes in SeqvarConsequencePredictor.
The removal of explicit assembly configuration in favor of Default::default()
represents a significant architectural change. While the code change is sound, it would benefit from documentation updates.
Consider adding a code comment explaining:
- How assembly information is now handled
- What the default configuration includes
- Any implications for existing integrations
+ // Assembly information is now handled through the provider's internal configuration
+ // Default configuration includes: ...
SeqvarConsequencePredictor::new(provider.clone(), Default::default()),
#!/bin/bash
# Description: Verify the impact of removing assembly configuration
# Check for any tests that might need updates
rg -A 5 "SeqvarConsequencePredictor::new\(" --type rust
# Look for any documentation that might need updates
fd . -e md -e rst --exec rg -l "SeqvarConsequencePredictor"
138-138
: Verify the default configuration for MehariProvider.
The removal of the assembly
parameter and use of Default::default()
needs verification to ensure it provides appropriate defaults for all use cases.
✅ Verification successful
Default configuration for MehariProvider is safe to use
Based on the codebase analysis, the change to use Default::default()
is safe and maintains backward compatibility:
- The code shows consistent usage of
Default::default()
in test cases and production code MehariProviderConfigBuilder::default()
is used throughout the codebase for configuration- When specific configuration is needed, the builder pattern is used to customize settings (e.g., transcript picking)
- The default configuration is used successfully in test scenarios without any assembly parameter
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any custom configurations that might have been using the assembly parameter
# Look for any previous custom configurations of MehariProvider
rg -A 5 "MehariProvider::new\(" --type rust
# Check for any potential configuration structs that might need updates
ast-grep --pattern 'struct $name {
$$$
assembly: $_,
$$$
}'
Length of output: 689
Script:
#!/bin/bash
# Let's check for MehariProvider usage and configuration patterns
# Look for MehariProvider configurations and usages
rg -A 5 "MehariProvider" --type rust
# Check for any configuration structs related to Mehari
ast-grep --pattern 'struct MehariConfig {
$$$
}'
# Check for Default implementation
ast-grep --pattern 'impl Default for MehariConfig {
$$$
}'
Length of output: 10936
src/server/run/actix_server/versions.rs (5)
1-9
: LGTM! Well-organized imports.
The addition of itertools::Itertools
is appropriate for the string joining operations used in version handling.
97-98
: LGTM! Well-documented new field.
The addition of version_cdot
aligns with the PR objective of including source information in the transcript database.
105-116
: LGTM! Clean version handling implementation.
The version extraction and joining logic is well-structured and handles all cases appropriately.
🧰 Tools
🪛 GitHub Check: clippy
[warning] 109-109: usage of bool::then
in filter_map
warning: usage of bool::then
in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter
then map
instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)]
on by default
120-122
: LGTM! Proper struct initialization.
The initialization correctly includes all fields in the same order as the struct definition.
105-116
: Verify consistent version handling across the codebase.
Let's ensure that the version information is consistently handled across different components.
✅ Verification successful
Let me gather more specific information about the version handling implementation.
Based on the gathered information, I can now provide the final response:
Version handling implementation is consistent across the codebase
The version handling implementation in versions.rs
aligns well with the rest of the codebase:
- Version information is properly structured through dedicated types (
SoftwareVersions
,DataVersionEntry
,VersionsInfoResponse
) - The cdot version handling is consistently implemented across database operations and server responses
- Version information is appropriately exposed through the
/api/v1/version
endpoint
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistent version handling patterns across the codebase
# Look for other version-related structures and their handling
echo "Checking for version-related structures:"
rg -l "version|Version" --type rust
# Check for potential places where cdot version might need to be handled
echo -e "\nChecking for cdot version handling:"
rg "cdot.*version|version.*cdot" --type rust
# Look for similar version joining patterns that might need updating
echo -e "\nChecking for version joining patterns:"
ast-grep --pattern 'join(",")'
Length of output: 2083
Script:
#!/bin/bash
# Check the implementation details of version handling
# Look for version-related structs and types
echo "Checking version-related structures:"
rg -A 5 "struct.*Version|type.*Version" --type rust
# Check for version handling in server responses
echo -e "\nChecking version handling in responses:"
rg -A 5 "version.*Response|Response.*version" --type rust
# Look for version-related database operations
echo -e "\nChecking version handling in database operations:"
ast-grep --pattern 'fn $_ ($$$) -> $_ {
$$$
version
$$$
}'
Length of output: 2600
🧰 Tools
🪛 GitHub Check: clippy
[warning] 109-109: usage of bool::then
in filter_map
warning: usage of bool::then
in filter_map
--> src/server/run/actix_server/versions.rs:109:18
|
109 | .filter_map(|v| (v.source_name == source_name).then(|| v.source_version.clone()))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use filter
then map
instead: filter(|&v| (v.source_name == source_name)).map(|v| v.source_version.clone())
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#filter_map_bool_then
= note: #[warn(clippy::filter_map_bool_then)]
on by default
protos/mehari/txs.proto (2)
20-28
: LGTM: Assembly enum is well-defined and documented.
The enum follows proto3 best practices with UNKNOWN = 0 and includes comprehensive values for current needs.
194-195
: LGTM: Well-documented field addition.
The repeated source_version
field is appropriately documented, explaining its purpose in tracking information during database merges.
src/db/subset/mod.rs (1)
119-119
:
Breaking change: Field renamed from genome_release
to source_version
.
This change aligns with the PR objective to enhance source information handling in the transcript database. However, this is a breaking change that will require updates in dependent code.
Let's verify the impact and test coverage:
src/annotate/seqvars/csq.rs (3)
104-106
: LGTM: Constructor refactoring improves code organization.
The removal of the explicit assembly
parameter and using provider.assembly()
instead centralizes the assembly management and reduces parameter coupling.
1049-1051
: LGTM: Test setup correctly updated.
The test initialization has been properly updated to reflect the constructor changes.
Line range hint 1-1800
: Overall code quality is excellent.
The code demonstrates:
- Comprehensive documentation
- Robust error handling
- Extensive test coverage
- Consistent implementation of changes
src/annotate/seqvars/mod.rs (3)
Line range hint 1365-1369
: LGTM: Consistent with assembly handling changes.
The method correctly uses the updated fill_coords
signature and maintains proper control flow with early return for non-canonical chromosomes.
Line range hint 1731-1744
: LGTM: Well-structured configuration setup.
The method effectively uses the builder pattern to configure the annotator, properly handling all configuration options from the arguments.
Line range hint 647-657
: LGTM: Assembly handling refactored correctly.
The assembly is now properly retrieved from the struct's state with a clear error message. The release string and chromosome naming logic correctly handles both GRCh37 and GRCh38 assemblies.
Let's verify the assembly handling across the codebase:
✅ Verification successful
Let me gather more context about the assembly usage to ensure the refactoring is consistent.
Assembly handling is correctly implemented and consistent across the codebase
The verification shows that:
- Assembly is properly handled as an optional field in structs
- The
expect
error handling in the reviewed code matches the pattern used elsewhere - Assembly state is accessed consistently through struct fields
- The assembly parameter is only used in constructor/setup functions (
new
,set_assembly
) where it's appropriate
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify assembly handling consistency
# Test: Check for any direct assembly parameter usage that should use struct state
rg -A 3 "fn.*assembly: Assembly"
Length of output: 1996
Script:
#!/bin/bash
# Check for assembly field declarations and their usage
ast-grep --pattern 'assembly: Option<Assembly>'
# Check for assembly.expect usage to ensure consistent error handling
rg -A 2 'assembly\.expect'
# Check for any direct assembly usage without expect
rg -A 2 'self\.assembly\s+[^=.]'
Length of output: 646
src/annotate/seqvars/provider.rs (3)
44-45
: Simplified TxIntervalTrees::new
by removing unnecessary assembly
parameter
The assembly
parameter has been removed from TxIntervalTrees::new
, and the assembly is now obtained directly from db
. This change simplifies the function signature and reduces redundancy.
52-53
: Streamlined assembly retrieval in build_indices
In build_indices
, the assembly is now retrieved using db.assembly()
. This aligns with the changes in TxIntervalTrees::new
and ensures that the assembly information is consistently sourced from the TxSeqDatabase
.
164-165
: Simplified Provider::new
by removing assembly
parameter
The assembly
parameter has been removed from Provider::new
. The assembly is now accessed through tx_seq_db
, which simplifies the constructor and ensures that assembly information is consistently obtained from the database.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…varfish-org/mehari into 601-include-database-source-information
Release-as: 0.30.0
#601
Summary by CodeRabbit
Release Notes
New Features
SourceVersion
message type for encapsulating version information.Bug Fixes
Refactor
TxSeqDatabase
.Documentation