Misc enhancements #40
Conversation
Walkthrough

The pull request introduces several changes across multiple files.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (8)
src/snipe/cli/cli_ops.py (1)

1053-1057: Consider enhancing file existence validation.

While the code correctly handles non-existent files, it could be improved by validating file permissions and readability.

```diff
-if not os.path.isfile(sig_path):
+try:
+    if not os.path.isfile(sig_path):
+        logger.error(f"Line {line_num}: Signature file does not exist: {sig_path}. Skipping.")
+        invalid_files.append(sig_path)
+        total_invalid += 1
+        continue
+    # Verify file is readable
+    with open(sig_path, 'r') as f:
+        pass
+except PermissionError:
+    logger.error(f"Line {line_num}: Permission denied for signature file: {sig_path}. Skipping.")
+    invalid_files.append(sig_path)
+    total_invalid += 1
+    continue
+except Exception as e:
+    logger.error(f"Line {line_num}: Error accessing signature file {sig_path}: {e}. Skipping.")
+    invalid_files.append(sig_path)
+    total_invalid += 1
+    continue
```

src/snipe/api/snipe_sig.py (2)
305-318: Add validation and type hints to the bases property.

The bases property implementation looks good, but could be improved with:

- Input validation to ensure a non-negative bases count
- Type hints for the setter parameter

```diff
 @property
 def bases(self) -> int:
     r"""Get the bases count of the signature."""
     return self._bases_count

-#setter the bases count
 @bases.setter
-def bases(self, new_bases_count: int):
+def bases(self, new_bases_count: int) -> None:
     r"""
     Set the bases count of the signature.
+
+    Args:
+        new_bases_count (int): The new bases count value (must be non-negative).
+
+    Raises:
+        ValueError: If new_bases_count is negative.
     """
+    if not isinstance(new_bases_count, int) or new_bases_count < 0:
+        raise ValueError("Bases count must be a non-negative integer.")
     self._bases_count = new_bases_count
     self._flag_bases_updated = True
```
319-329: Optimize regex pattern usage.

The implementation looks good but could be optimized by:

- Moving the regex pattern to a class constant
- Pre-compiling the pattern for better performance

```diff
+# Add at class level
+SNIPE_BASES_PATTERN = re.compile(r";snipe_bases=([0-9]+)")

 def _create_signame_for_export(self):
     _name = self._name
     # make sure we remove all the snipe suffixes
     _name = _name.replace("-snipesample", "")
     _name = _name.replace("-snipeamplicon", "")
     _name = _name.replace("-snipegenome", "")
-    _name = re.sub(r";snipe_bases=([0-9]+)", "", _name)
+    _name = self.SNIPE_BASES_PATTERN.sub("", _name)
     if self._flag_bases_updated:
         _name += f"-snipesample;snipe_bases={self._bases_count}"
     return _name
```

src/snipe/cli/cli_qc.py (5)
110-112: Refactor process_sample_task to Use Keyword Arguments

Unpacking a tuple of arguments can reduce readability and increase the risk of errors due to mismatched argument order. Consider refactoring process_sample_task to accept named keyword arguments for clarity and maintainability.

Example:

```python
def process_sample_task(
    sample_path: str,
    sample_name: str,
    ref_sig_data,
    amplicon_sig_data,
    ychr_sig_data,
    varsigs_data,
    export_var: bool,
    roi: bool,
    predict_extra_folds: List[int],
    debug: bool
) -> Tuple[str, Dict[str, Any], Optional[str]]:
    # Function body remains the same
    ...
```

And when preparing tasks:

```python
tasks.append({
    "sample_path": sample_path,
    "sample_name": sample_name,
    "ref_sig_data": reference_sig,
    "amplicon_sig_data": amplicon_sig,
    "ychr_sig_data": ychr_sig,
    "varsigs_data": vars_snipesigs,
    "export_var": export_var,
    "roi": roi,
    "predict_extra_folds": predict_extra_folds,
    "debug": debug
})
```
121-130: Review Logging Configuration in Worker Processes

Configuring logging within each worker process can lead to complex logging setups and potential issues such as duplicate log entries or log handler conflicts. Consider configuring logging in the main process and using a thread-safe logging handler to collect logs from worker processes.
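A minimal sketch of that main-process approach (function names like `worker_configurer` are illustrative, not part of snipe): each worker replaces its handlers with a `logging.handlers.QueueHandler`, and a single `QueueListener` in the main process owns the real handler, so records are formatted and written in exactly one place.

```python
import logging
import logging.handlers

def worker_configurer(log_queue):
    # Called once inside each worker: route every record through the
    # shared queue instead of attaching real handlers per process.
    root = logging.getLogger()
    root.handlers = [logging.handlers.QueueHandler(log_queue)]
    root.setLevel(logging.DEBUG)

def start_listener(log_queue):
    # Called once in the main process: a single listener drains the
    # queue into one handler, avoiding duplicate or interleaved entries.
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(processName)s %(levelname)s %(message)s")
    )
    listener = logging.handlers.QueueListener(log_queue, handler)
    listener.start()
    return listener  # call listener.stop() after the pool finishes
```

With `multiprocessing.Pool(initializer=worker_configurer, initargs=(log_queue,))` the queue (e.g. from `multiprocessing.Manager().Queue()`) is handed to every worker at startup.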
350-364: Improve Progress Reporting in Parallel Processing

Printing progress updates from within worker processes can lead to jumbled output due to concurrent writes to stdout. Consider updating the progress from the main process or using a thread-safe progress bar like tqdm with multiprocessing support.
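A stdlib-only sketch of the main-process option (the `_double` task is a stand-in for the real per-sample work): `Pool.imap_unordered` yields results as workers finish, so only the parent process ever writes the progress line.

```python
import sys
from multiprocessing import Pool

def _double(task):
    # Stand-in for the real per-sample work done in each worker.
    return task * 2

def run_with_progress(tasks, workers=2):
    results = []
    with Pool(workers) as pool:
        # Results arrive as they complete; the progress line is written
        # only here, in the main process, so output never interleaves.
        for done, result in enumerate(pool.imap_unordered(_double, tasks), 1):
            sys.stdout.write(f"\rProcessed {done}/{len(tasks)} samples")
            sys.stdout.flush()
            results.append(result)
    sys.stdout.write("\n")
    return results
```

With tqdm the loop becomes `for result in tqdm(pool.imap_unordered(_double, tasks), total=len(tasks)):` and the bar handles the carriage-return bookkeeping itself.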
491-494: Reevaluate Filling NaNs with Zeros in DataFrame

Filling NaN values with zeros before converting to integers might misrepresent missing data as zeros. If zero is a meaningful value, this could cause incorrect data interpretation. Consider handling missing values explicitly or retaining NaNs.
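One way to keep that distinction, sketched on a hypothetical column modeled after the QC table: pandas' nullable `Int64` dtype stores integers while preserving missing values as `<NA>` instead of coercing them to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical QC column: NaN means "not computed", which is not the
# same thing as a genuine zero k-mer count.
df = pd.DataFrame({"Total unique k-mers": [10.0, np.nan, 0.0]})

# Lossy: fillna(0) silently turns the missing value into a real zero.
lossy = df["Total unique k-mers"].fillna(0).astype(int)

# Lossless: the nullable Int64 dtype keeps the gap visible as <NA>.
lossless = df["Total unique k-mers"].astype("Int64")
```

Downstream code can then decide explicitly whether a missing count should be dropped, imputed, or reported as missing.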
499-501: Fix Typographical Error in Variable Name

The variable df_zero_uniqe_hashes contains a typo ("uniqe" instead of "unique"). Correcting this will improve code readability and prevent potential confusion.

Apply this diff to correct the typo:

```diff
-df_zero_uniqe_hashes = df[df["Total unique k-mers"] == 0]
+df_zero_unique_hashes = df[df["Total unique k-mers"] == 0]
```

Ensure all references to this variable are updated accordingly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)

- src/snipe/__init__.py (1 hunks)
- src/snipe/api/multisig_reference_QC.py (1 hunks)
- src/snipe/api/snipe_sig.py (5 hunks)
- src/snipe/cli/cli_ops.py (2 hunks)
- src/snipe/cli/cli_qc.py (5 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/snipe/__init__.py
🧰 Additional context used
🪛 Ruff
src/snipe/api/multisig_reference_QC.py
774-774: Local variable predicted_delta_unique_hashes is assigned to but never used

Remove assignment to unused variable predicted_delta_unique_hashes (F841)
🔇 Additional comments (5)
src/snipe/cli/cli_ops.py (4)
1041-1042: LGTM: Proper error handling for empty lines.

The error handling for empty lines is well-implemented, with appropriate logging and skipping of the problematic line.
1043-1046: LGTM: Proper validation of row format.

The code correctly validates the number of columns and logs an error if the format is invalid.
1048-1051: LGTM: Proper validation of required fields.

The code appropriately checks for missing signature path or experiment name, with clear error logging.
1041-1057: Consider memory optimization opportunities.
While the error handling improvements are good, consider these memory-related enhancements:
- Release memory for invalid entries immediately
- Use generators for large file processing
- Implement batch processing for large datasets
This aligns with the PR's objective of addressing memory leak issues (Issue #30).
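A sketch of the generator idea above (the two-column TSV layout of signature path and experiment name is assumed from the surrounding code, and `iter_valid_rows` is a hypothetical helper): rows stream through one at a time, and invalid entries are dropped without ever being accumulated in memory.

```python
def iter_valid_rows(path, expected_cols=2):
    """Lazily yield (line_num, fields) for valid rows of a TSV file."""
    with open(path, encoding="utf-8") as handle:
        for line_num, line in enumerate(handle, 1):
            line = line.strip()
            if not line:
                continue  # empty line: skipped, nothing retained
            fields = line.split("\t")
            if len(fields) != expected_cols:
                continue  # malformed row: dropped immediately
            yield line_num, fields
```

Because the function yields rather than building a list, batching falls out naturally, e.g. consuming `itertools.islice(iter_valid_rows(path), 1000)` per batch for very large inputs.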
src/snipe/cli/cli_qc.py (1)
20-21: Consider the Implications of Suppressing Warnings

Suppressing UserWarning from joblib might hide important warnings that could affect the application's performance or correctness. It's advisable to review the warnings being suppressed to ensure that no critical information is missed.
Please verify that the suppressed warnings are benign and do not indicate underlying issues.
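A sketch of a narrower alternative (the "noisy" message pattern is invented for illustration): `warnings.filterwarnings` can scope the ignore to a message regex, and optionally a `module=` pattern, so unrelated `UserWarning`s still surface.

```python
import warnings

def filtered_warnings_demo():
    """Ignore only warnings matching one pattern; let the rest through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Scope the ignore to one message pattern rather than every
        # UserWarning; a module="joblib" argument would narrow it further.
        warnings.filterwarnings("ignore", message=r"noisy", category=UserWarning)
        warnings.warn("noisy warning", UserWarning)      # suppressed
        warnings.warn("important warning", UserWarning)  # still recorded
    return [str(w.message) for w in caught]
```

The message argument is matched as a regex against the start of the warning text, so only the known-benign warning is silenced.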
```python
total_bases = first_sig._bases_count

for sig in signatures[1:]:
    total_bases += sig._bases_count
```
🛠️ Refactor suggestion
Handle potential overflow in bases count accumulation.
While the implementation correctly accumulates bases count, it should handle potential integer overflow for large numbers.
```diff
-total_bases = first_sig._bases_count
+total_bases = 0  # Initialize to handle None values
 for sig in signatures[1:]:
-    total_bases += sig._bases_count
+    # Handle potential overflow
+    try:
+        sig_bases = sig._bases_count or 0
+        new_total = total_bases + sig_bases
+        if new_total < total_bases:  # Overflow occurred
+            raise OverflowError("Bases count exceeded maximum value")
+        total_bases = new_total
+    except OverflowError as e:
+        logger.warning("Bases count overflow: %s", str(e))
+        total_bases = sys.maxsize  # Use maximum possible value
     if sig.ksize != ksize or sig.scale != scale:
         raise ValueError("All signatures must have the same ksize and scale.")
```
Also applies to: 1088-1092
```python
reference_sig = ref_sig_data
amplicon_sig = amplicon_sig_data
ychr_sig = ychr_sig_data
vars_snipesigs = varsigs_data
```
🛠️ Refactor suggestion
Avoid Passing Large Objects Between Processes
Passing large objects like signature data through process arguments can lead to significant overhead due to the need to serialize (pickle) these objects. This can impact performance negatively. Consider loading these objects within each worker process or using shared memory mechanisms to reduce overhead.
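One common pattern for this, sketched with hypothetical names (`load_reference` stands in for whatever actually deserializes a signature file): load the shared data once per worker via the pool's `initializer`, so the large object is built inside each worker rather than pickled with every task.

```python
from multiprocessing import Pool

_worker_state = {}  # per-process cache, populated once per worker

def load_reference(path):
    # Stand-in loader; real code would parse the signature at `path`.
    return {"path": path, "hashes": list(range(100))}

def _init_worker(ref_path):
    # Runs once in each worker process, not once per task.
    _worker_state["ref"] = load_reference(ref_path)

def _process_sample(sample_name):
    ref = _worker_state["ref"]  # no serialization cost per task
    return sample_name, len(ref["hashes"])

def process_all(samples, ref_path, workers=2):
    with Pool(workers, initializer=_init_worker, initargs=(ref_path,)) as pool:
        return pool.map(_process_sample, samples)
```

Each task then carries only the small sample name; the reference data crosses the process boundary once per worker instead of once per sample.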
I am aware, but I will leave it like that for now; it's not a significant concern.
```python
except Exception as e:
    error_msg = f"Failed to process sample {sample_name}: {e}"
    return sample_name, {}, error_msg
```
🛠️ Refactor suggestion
Handle Specific Exceptions Instead of Catching All Exceptions
Using a broad except Exception clause can mask unexpected errors and make debugging difficult. It's better to catch specific exceptions that you anticipate might occur.

Update the exception handling to catch specific exceptions. For example:

```python
except SomeSpecificException as e:
    # Handle specific exception
    ...
```

If catching all exceptions is necessary, consider logging the traceback for better debugging.
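A sketch of that hybrid (the exception types and the `process_sample_safely` wrapper are illustrative): anticipated failures return the short error tuple the caller already expects, while anything unexpected is logged with its full traceback via `logger.exception` before propagating.

```python
import logging

logger = logging.getLogger("qc")

def process_sample_safely(sample_name, work):
    """Run `work()`; report expected failures, re-raise unexpected ones."""
    try:
        return sample_name, work(), None
    except (FileNotFoundError, ValueError) as e:
        # Anticipated failure modes: keep the concise error-tuple contract.
        return sample_name, {}, f"Failed to process sample {sample_name}: {e}"
    except Exception:
        # Unexpected: record the full traceback, then let it propagate.
        logger.exception("Unexpected error processing sample %s", sample_name)
        raise
```

`logger.exception` attaches the active traceback automatically, so even the catch-all path leaves enough evidence to debug.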
```python
try:
    with warnings.catch_warnings():
        warnings.simplefilter("error", OptimizeWarning)
        params_unique, covariance_unique = curve_fit(
            saturation_model,
            x_unique,
            y_unique,
            p0=initial_guess_unique,
            bounds=(0, np.inf),
            maxfev=10000
        )
except (RuntimeError, OptimizeWarning) as exc:
    raise RuntimeError("Saturation model fitting for Genomic Unique k-mers failed.") from exc

if np.isinf(covariance_unique).any() or np.isnan(covariance_unique).any():
    raise RuntimeError("Saturation model fitting for Genomic Unique k-mers failed.")
```
🛠️ Refactor suggestion
Refactor duplicated saturation model fitting code into a helper function
The code for fitting the saturation model using curve_fit is duplicated in both the genomic unique k-mers and coverage calculations. Refactoring this code into a helper function will improve maintainability and reduce code duplication.

Here's a suggested refactor:
```python
def fit_saturation_model(x, y, initial_guess, model_name):
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("error", OptimizeWarning)
            params, covariance = curve_fit(
                saturation_model,
                x,
                y,
                p0=initial_guess,
                bounds=(0, np.inf),
                maxfev=10000
            )
    except (RuntimeError, OptimizeWarning) as exc:
        raise RuntimeError(f"Saturation model fitting for {model_name} failed.") from exc
    if np.isinf(covariance).any() or np.isnan(covariance).any():
        raise RuntimeError(f"Saturation model fitting for {model_name} failed.")
    return params

# Replace the duplicated code with calls to the helper function
# For Genomic Unique k-mers:
params_unique = fit_saturation_model(x_unique, y_unique, initial_guess_unique, "Genomic Unique k-mers")
a_unique, b_unique = params_unique

# For coverage:
params_coverage = fit_saturation_model(x_total_abundance, y_coverage, initial_guess_coverage, "coverage")
a_coverage, b_coverage = params_coverage
```
Also applies to: 858-873
```python
else:
    roi_reference_sig = self.reference_sig
    self.logger.debug("Using reference genome signature as ROI reference.")
    roi_reference_sig = self.reference_sig
```
Remove unused variable roi_reference_sig

The variable roi_reference_sig is assigned but never utilized in the subsequent code. Consider removing the duplicate assignment to eliminate unused code.

Apply the following diff to remove the unused variable:

```diff
-    roi_reference_sig = self.reference_sig
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
roi_reference_sig = self.reference_sig
```
```diff
@@ -790,152 +770,137 @@ def sort_chromosomes(chrom_dict):
 if predict_extra_folds and genome_stats["Genome coverage index"] > 0.01:
     predicted_fold_coverage = {}
     predicted_fold_delta_coverage = {}
     predicted_unique_hashes = {}
     predicted_delta_unique_hashes = {}
```
Remove unused variable predicted_delta_unique_hashes

The variable predicted_delta_unique_hashes is assigned but never used. Removing this line will clean up the code.

Apply the following diff to remove the unused variable:

```diff
-    predicted_delta_unique_hashes = {}
```
🧰 Tools

🪛 Ruff

774-774: Local variable predicted_delta_unique_hashes is assigned to but never used. Remove assignment to unused variable predicted_delta_unique_hashes (F841)
snipe ops guided-merge #38