Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

major improvements of mzml statistics #37

Merged
merged 24 commits into from
Nov 30, 2024
Merged

major improvements of mzml statistics #37

merged 24 commits into from
Nov 30, 2024

Conversation

ypriverol
Copy link
Member

@ypriverol ypriverol commented Nov 29, 2024

This PR, instead of loading everything in the panda's dataframe all the data; it writes the intensity values using parquet batches.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new batch processing option for mass spectrometry data, allowing users to specify batch sizes for improved efficiency.
    • Added a new class for enhanced reading and processing of Peptide Spectrum Matches (PSMs).
  • Improvements

    • Updated version number of the quantms-utils package to 0.0.16.
    • Improved error handling and logic for extracting global q-values in PSM conversion.
    • Enhanced mass spectrometry file handling with new functions for better path resolution and data extraction.
  • Chores

    • Updated .gitignore to exclude specific test data directories.

Copy link
Contributor

coderabbitai bot commented Nov 29, 2024

Walkthrough

The pull request introduces updates to the quantms-utils project, primarily focusing on version increments across multiple files. The version number is updated from 0.0.15 to 0.0.16 in pyproject.toml, __init__.py, and recipe/meta.yaml. Significant enhancements are made to the mzml_statistics.py file, including new parameters and functions for batch processing mzML files. Other modifications include improvements to the convert_psm function and the addition of a new class in ms2rescore.py for enhanced PSM processing.

Changes

File Path Change Summary
pyproject.toml Version updated from 0.0.15 to 0.0.16.
quantmsutils/__init__.py Version updated from "0.0.15" to "0.0.16".
quantmsutils/mzml/mzml_statistics.py Updated mzml_statistics function signature to include batch_size. New function batch_write_mzml_streaming added. Replaced parse_mzml with batch_write_mzml_streaming. New function _resolve_ms_path added. Enhanced batch processing logic.
quantmsutils/psm/psm_conversion.py Modified convert_psm function to include conditional logic for global_qvalue.
quantmsutils/rescoring/ms2rescore.py Added IDXMLReaderPatch class with methods for PSM handling. Modified rescore_idxml to use new class.
recipe/meta.yaml Version updated from 0.0.15 to 0.0.16.
.gitignore New entry added to ignore test data directory: /tests/test_data/hMICAL1_coiPAnP-N2-200_3Murea-1Mthiourea-200mMtcep_14733.d/

Possibly related PRs

  • upgrade the sdrf-pipelines 0.0.29 -> 0.0.30 #29: The changes in this PR involve updating the version of the sdrf-pipelines dependency, which is related to the overall versioning updates reflected in the main PR for the quantms-utils package.

🐰 In the meadow, a change did bloom,
Version numbers rise, dispelling the gloom.
From fifteen to sixteen, we hop with delight,
New functions and features, oh what a sight!
Batch processing magic, PSMs in a row,
With every small change, our project will grow! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)
quantmsutils/psm/psm_conversion.py (1)

144-145: Consider documenting the q-value precedence.

The logic for extracting global_qvalue from multiple sources is sound, but would benefit from a comment explaining the precedence order of these sources.

Add a comment like:

            if hit.metaValueExists("MS:1001491"):
                global_qvalue = hit.getMetaValue("MS:1001491")
+           # Fall back to generic q-value if MS:1001491 is not available
            elif hit.metaValueExists("q-value"):
                global_qvalue = hit.getMetaValue("q-value")
quantmsutils/mzml/mzml_statistics.py (3)

65-65: Update the docstring to reflect the function's behavior

The docstring for batch_write_mzml mentions returning a pandas DataFrame, but the function actually writes data to Parquet files and returns an Optional[str]. Updating the docstring will help maintain accurate documentation.


243-245: Ensure consistent case-insensitive file extension handling

In line 243, the code compares Path(ms_path).suffix directly to ".d". This may lead to issues on case-sensitive filesystems if the file extension is uppercase (e.g., .D). To ensure consistent handling regardless of case, convert the suffix to lowercase before comparison:

-if Path(ms_path).suffix == ".d":
+if Path(ms_path).suffix.lower() == ".d":

259-264: Simplify and standardize file extension handling in '_resolve_ms_path'

The valid_extensions set includes both lowercase and mixed-case extensions. Since the file suffixes are converted to lowercase in the list comprehension, you can streamline the valid_extensions set:

-valid_extensions = {'.d', '.mzml', '.mzML'}
+valid_extensions = {'.d', '.mzml'}

This ensures that all comparisons are case-insensitive and simplifies the code.

quantmsutils/rescoring/ms2rescore.py (2)

102-107: Optimize feature extraction and error handling

Currently, you are checking for the existence of each rescoring feature using metaValueExists and returning None if any feature is missing. This may not provide detailed information about which feature is missing. Consider collecting all missing features and reporting them for better debugging.

Apply this diff to collect and report all missing features:

 rescoring_features = {}
+missing_features = []
 for key in self.rescoring_features:
     feature_exists = peptide_hit.metaValueExists(key)
     if not feature_exists:
-        return None
+        missing_features.append(key)
     else:
         rescoring_features[key] = float(peptide_hit.getMetaValue(key))
+if missing_features:
+    logging.warning(f"Missing features {missing_features} for peptide hit {peptide_hit.getSequence().toString()}")
+    return None

235-242: Adjust log level for skipped PSMs message

The message about removed PSMs without search engine features is currently logged as a warning. If skipping PSMs without features is an expected part of the workflow, consider changing the log level to INFO to reduce unnecessary warning messages.

Apply this diff to change the log level:

 if reader.skip_invalid_psm != 0:
-    logging.warning(
+    logging.info(
         f"Removed {reader.skip_invalid_psm} PSMs without search engine features!"
     )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d1b7d62 and 8d3e0ad.

📒 Files selected for processing (6)
  • pyproject.toml (1 hunks)
  • quantmsutils/__init__.py (1 hunks)
  • quantmsutils/mzml/mzml_statistics.py (2 hunks)
  • quantmsutils/psm/psm_conversion.py (1 hunks)
  • quantmsutils/rescoring/ms2rescore.py (2 hunks)
  • recipe/meta.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • pyproject.toml
  • quantmsutils/init.py
  • recipe/meta.yaml
🔇 Additional comments (1)
quantmsutils/psm/psm_conversion.py (1)

144-145: Verify q-value data type consistency.

Let's ensure the q-value meta value maintains consistent data type with other global_qvalue sources.

✅ Verification successful

Let me gather more information about the data type handling in the codebase.


Let me check one more aspect to understand the type handling in the OpenMS interface.


Based on the gathered information, I can now provide the final response:

Q-value data type is consistently handled

The codebase shows consistent handling of q-values across different sources:

  • Q-values from getScore() are numeric (float) by default in pyOpenMS
  • Q-values from getMetaValue() for both "q-value" and "MS:1001491" are stored in global_qvalue which is initialized as np.nan (numeric)
  • All score-related values in the codebase are handled as floating-point numbers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the data types of q-values across different idXML files

# Search for q-value related code to understand type handling
rg -A 3 "q-value|MS:1001491" --type py

# Look for any type conversion or validation
ast-grep --pattern 'getMetaValue("q-value")'
ast-grep --pattern 'getMetaValue("MS:1001491")'

Length of output: 2712


Script:

#!/bin/bash
# Look for global_qvalue assignments and type handling
rg -A 2 "global_qvalue =" --type py

# Look for any type validation or conversion of q-values
ast-grep --pattern 'float($_)'

Length of output: 2835


Script:

#!/bin/bash
# Look for OpenMS type handling and initialization
rg -A 3 "import pyopenms|from pyopenms" --type py
rg -A 3 "getScore|setScore" --type py

Length of output: 3484

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
Comment on lines +118 to +119
accession.decode() for accession in peptide_hit.extractProteinAccessionsSet()
],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Verify the need for decoding protein accessions

In the protein_list comprehension, you are using accession.decode() on each accession obtained from peptide_hit.extractProteinAccessionsSet(). If accession is already a string, calling decode() will raise an AttributeError. Confirm whether accession is a bytes object that requires decoding. If accession is a string, you should remove the decode() call.

Apply this diff if accession is already a string:

 protein_list=[
-    accession.decode() for accession in peptide_hit.extractProteinAccessionsSet()
+    accession for accession in peptide_hit.extractProteinAccessionsSet()
 ],

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +35 to +37
self.protein_ids, self.peptide_ids = self._parse_idxml()
self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle potential IndexError when accessing peptide IDs and hits

In the __init__ method of IDXMLReaderPatch, you are accessing self.peptide_ids[0].getHits()[0] to initialize self.user_params_metadata and self.rescoring_features. If self.peptide_ids is empty or if self.peptide_ids[0].getHits() is empty, this will raise an IndexError. Consider adding a check to ensure that peptide IDs and hits are available before accessing them.

Apply this diff to handle the potential empty lists:

 def __init__(self, filename: Union[Path, str], *args, **kwargs) -> None:
     super().__init__(filename, *args, **kwargs)
     self.protein_ids, self.peptide_ids = self._parse_idxml()
+    if not self.peptide_ids or not self.peptide_ids[0].getHits():
+        raise ValueError("No peptide IDs or peptide hits found in the idXML file.")
     self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
     self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])
     self.skip_invalid_psm = 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self.protein_ids, self.peptide_ids = self._parse_idxml()
self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])
self.protein_ids, self.peptide_ids = self._parse_idxml()
if not self.peptide_ids or not self.peptide_ids[0].getHits():
raise ValueError("No peptide IDs or peptide hits found in the idXML file.")
self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

278-281: Enhance error message with available candidates

When multiple or no candidates are found, it would be helpful to include the list of found candidates in the error message.

     if len(candidates) == 1:
         return candidates[0]
 
-    raise FileNotFoundError(f"No unique file found for {ms_path}")
+    if not candidates:
+        raise FileNotFoundError(f"No files found with extensions {valid_extensions} for {ms_path}")
+    raise FileNotFoundError(f"Multiple matching files found for {ms_path}: {candidates}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8d3e0ad and 5154daf.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (2 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)

184-186: LGTM! Efficient streaming implementation

The implementation correctly uses PyOpenMS's streaming mode through MzMLFile().transform() for memory-efficient processing of mzML files.


24-24: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern r"[scan|spectrum]=(\d+)" will not correctly match "scan=" or "spectrum=". Square brackets [] define a character class that matches any single character within them.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

52-56: ⚠️ Potential issue

Add null check for precursors

Direct access to spectrum.getPrecursors()[0] can cause an IndexError if no precursors are available.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

57-63: 🛠️ Refactor suggestion

Consider streaming spectrum data instead of accumulating in memory

When id_only is True, all spectrum data is accumulated in self.psm_parts before being written at once. For large files, this could lead to memory issues.

Consider writing spectrum data in batches similar to how batch_data is handled:

             if self.id_only:
                 scan_id = self.scan_pattern.findall(spectrum.getNativeID())[0]
-                self.psm_parts.append([
-                    scan_id, ms_level,
-                    mz_array.tolist(),
-                    intensity_array.tolist()
-                ])
+                if len(self.psm_parts) >= self.batch_size:
+                    self._write_spectrum_batch()
+                self.psm_parts.append({
+                    "scan": scan_id,
+                    "ms_level": ms_level,
+                    "mz": mz_array.tolist(),
+                    "intensity": intensity_array.tolist()
+                })

Likely invalid or redundant comment.


209-246: ⚠️ Potential issue

Improve error handling for parquet writer

The parquet_writer is created outside the try block but closed in the finally block. This could raise an UnboundLocalError if writer creation fails.

-            # Set up parquet writer
-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                parquet_schema,
-                compression='gzip'
-            )
-
+            parquet_writer = None
             try:
+                # Set up parquet writer
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    parquet_schema,
+                    compression='gzip'
+                )
+
                 # Stream data in batches
                 for chunk in pd.read_sql_query(
                     """
                     SELECT 
                         Id, 
                         CASE 
                             WHEN MsMsType IN (8, 9) THEN 2 
                             WHEN MsMsType = 0 THEN 1 
                             ELSE NULL 
                         END as MsMsType,
                         NumPeaks, 
                         MaxIntensity, 
                         SummedIntensities, 
                         Time,
                         Charge,
                         MonoisotopicMz
                     FROM frames
                     """,
                     conn,
                     chunksize=batch_size
                 ):
                     # Prepare batch
                     chunk['AcquisitionDateTime'] = acquisition_date_time

                     # Convert to pyarrow table and write
                     batch_table = pa.Table.from_pandas(chunk, schema=parquet_schema)
                     parquet_writer.write_table(batch_table)

             finally:
-                parquet_writer.close()
+                if parquet_writer:
+                    parquet_writer.close()

Likely invalid or redundant comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
quantmsutils/mzml/mzml_statistics.py (2)

139-143: Add validation for batch_size parameter

The batch_size parameter should be validated to ensure it's a positive integer.

 @click.option(
     "--batch_size",
     type=int,
     default=10000,
+    callback=lambda ctx, param, value: value if value > 0 else 
+        ctx.fail('batch_size must be positive'),
     help="Number of rows to write in each batch"
 )

218-249: Enhance error handling in batch_write_bruker_d

The function should handle potential database errors and provide more specific error messages.

             try:
                 # Stream data in batches
                 for chunk in pd.read_sql_query(
                     """
                     SELECT 
                         Id, 
                         CASE 
                             WHEN MsMsType IN (8, 9) THEN 2 
                             WHEN MsMsType = 0 THEN 1 
                             ELSE NULL 
                         END as MsMsType,
                         NumPeaks, 
                         MaxIntensity, 
                         SummedIntensities, 
                         Time,
                         Charge,
                         MonoisotopicMz
                     FROM frames
                     """,
                     conn,
                     chunksize=batch_size
                 ):
                     # Prepare batch
                     chunk['AcquisitionDateTime'] = acquisition_date_time
 
                     # Convert to pyarrow table and write
                     batch_table = pa.Table.from_pandas(chunk, schema=parquet_schema)
                     parquet_writer.write_table(batch_table)
 
+            except sqlite3.Error as e:
+                print(f"Database error: {e}")
+                raise
+            except Exception as e:
+                print(f"Error processing Bruker data: {e}")
+                raise
             finally:
                 parquet_writer.close()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5154daf and db2f3b9.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (2 hunks)
🔇 Additional comments (4)
quantmsutils/mzml/mzml_statistics.py (4)

4-11: LGTM: Required imports added for batch processing

The new imports (Optional, numpy, pyarrow) are appropriate for implementing batch processing and parquet file handling.


264-283: LGTM: Robust file path resolution implementation

The _resolve_ms_path function provides comprehensive file path resolution with proper validation of extensions and clear error messages.


24-24: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern r"[scan|spectrum]=(\d+)" is incorrect. Square brackets [] define a character class that matches any single character within them, not the words "scan" or "spectrum".

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

52-56: ⚠️ Potential issue

Add null check for precursors

Direct access to spectrum.getPrecursors()[0] could raise an IndexError if no precursors are available.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                charge_state = None
+                exp_mz = None
+            else:
-            precursor = spectrum.getPrecursors()[0]
+                precursor = precursors[0]
                charge_state = precursor.getCharge()
                exp_mz = precursor.getMZ()

Likely invalid or redundant comment.

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)
quantmsutils/mzml/mzml_statistics.py (3)

15-22: Add type hints to improve code maintainability

Consider adding type hints to the class attributes for better code maintainability and IDE support.

-    def __init__(self, parquet_schema, output_path, batch_size=10000, id_only=False):
+    def __init__(self, parquet_schema: pa.Schema, output_path: str, batch_size: int = 10000, id_only: bool = False):
         self.parquet_schema = parquet_schema
         self.output_path = output_path
         self.batch_size = batch_size
         self.id_only = id_only
-        self.batch_data = []
-        self.psm_parts = []
-        self.parquet_writer = None
+        self.batch_data: list[dict] = []
+        self.psm_parts: list[list] = []
+        self.parquet_writer: Optional[pq.ParquetWriter] = None

264-265: Improve error message for unsupported file types

The error message could be more helpful by listing the supported file types.

-        raise RuntimeError(f"Unsupported file type: {ms_path}")
+        supported_types = [".mzml", ".d"]
+        raise RuntimeError(
+            f"Unsupported file type: {ms_path}. "
+            f"Supported file types are: {', '.join(supported_types)}"
+        )

208-212: Use parameterized query to prevent SQL injection

Even though the SQL query uses a fixed string, it's a good practice to use parameterized queries consistently.

-            acquisition_date_time = conn.execute(
-                "SELECT Value FROM GlobalMetadata WHERE key='AcquisitionDateTime'"
-            ).fetchone()[0]
+            acquisition_date_time = conn.execute(
+                "SELECT Value FROM GlobalMetadata WHERE key = ?",
+                ("AcquisitionDateTime",)
+            ).fetchone()[0]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between db2f3b9 and 9f9c82f.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (2 hunks)
🔇 Additional comments (2)
quantmsutils/mzml/mzml_statistics.py (2)

24-24: ⚠️ Potential issue

Fix the regular expression pattern for scan ID extraction

The regex pattern r"[scan|spectrum]=(\d+)" is incorrect as it uses square brackets which create a character class. This means it matches any single character that is 's', 'c', 'a', 'n', '|', 'p', 'e', 't', 'r', 'u', 'm', leading to incorrect scan ID extraction.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

52-56: ⚠️ Potential issue

Add null check for precursors to prevent IndexError

Direct access to spectrum.getPrecursors()[0] can raise an IndexError if the spectrum has no precursors.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

Likely invalid or redundant comment.

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

4-4: Remove unused import

The Dict type hint is imported but never used in the code.

-from typing import Optional, List, Dict
+from typing import Optional, List
🧰 Tools
🪛 Ruff (0.8.0)

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9c82f and d64503d.

📒 Files selected for processing (2)
  • .gitignore (1 hunks)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsutils/mzml/mzml_statistics.py

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)

🔇 Additional comments (3)
quantmsutils/mzml/mzml_statistics.py (3)

299-318: Well-implemented path resolution logic

The _resolve_ms_path function is robustly implemented with:

  • Proper path existence checking
  • Case-insensitive extension handling
  • Clear error messages
  • Support for multiple file formats

198-210: Efficient streaming implementation

The implementation correctly uses pyopenms streaming mode with BatchWritingConsumer, addressing the memory efficiency concerns raised in previous reviews.


29-29: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current pattern r"[scan|spectrum]=(\d+)" uses square brackets which creates a character class matching any single character in "scan|spectrum". This will not correctly match the intended patterns.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

Likely invalid or redundant comment.

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
quantmsutils/mzml/mzml_statistics.py (2)

4-4: Remove unused import

The Dict type hint is imported but never used in the code.

-from typing import Optional, List, Dict
+from typing import Optional, List
🧰 Tools
🪛 Ruff (0.8.0)

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)


289-301: Enhance error handling for database operations

Add specific error handling for database operations to provide more informative error messages.

             try:
                 # Stream data in batches
                 for chunk in pd.read_sql_query(query, conn, chunksize=batch_size):
                     chunk['AcquisitionDateTime'] = acquisition_date_time
                     for col in schema.names:
                         if col not in chunk.columns:
                             chunk[col] = None
                     batch_table = pa.Table.from_pandas(chunk, schema=schema)
                     parquet_writer.write_table(batch_table)
+            except sqlite3.Error as e:
+                print(f"Database error: {e}")
+                raise
+            except Exception as e:
+                print(f"Error processing data: {e}")
+                raise
             finally:
                 parquet_writer.close()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d64503d and ddda856.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsutils/mzml/mzml_statistics.py

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)

🔇 Additional comments (4)
quantmsutils/mzml/mzml_statistics.py (4)

68-69: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of large arrays. Keep the data in numpy arrays throughout the process.

-                        "mz": mz_array.tolist(),
-                        "intensity": intensity_array.tolist()
+                        "mz": mz_array,
+                        "intensity": intensity_array

Likely invalid or redundant comment.


31-31: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current pattern [scan|spectrum]=(\d+) uses square brackets which creates a character class matching any single character in "scan|spectrum". This will not correctly match the intended patterns.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

Likely invalid or redundant comment.


57-61: ⚠️ Potential issue

Add null check for precursors

Accessing spectrum.getPrecursors()[0] without checking if precursors exist could raise an IndexError.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                return
-            precursor = spectrum.getPrecursors()[0]
+            precursor = precursors[0]
             charge_state = precursor.getCharge()
             exp_mz = precursor.getMZ()

Likely invalid or redundant comment.


261-266: 🛠️ Refactor suggestion

Improve resource management for parquet writer

The parquet writer should be initialized within a try block to ensure proper cleanup in case of exceptions.

-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                schema=schema,
-                compression='gzip'
-            )
+            parquet_writer = None
+            try:
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    schema=schema,
+                    compression='gzip'
+                )

Likely invalid or redundant comment.

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

216-229: Consider implementing memory-efficient streaming for large files

While the current implementation uses streaming for mzML files, it still accumulates data in batches before writing. For extremely large files, consider implementing a more memory-efficient approach:

  1. Use a smaller default batch size for very large files
  2. Add memory monitoring
  3. Consider implementing backpressure mechanisms

Would you like me to provide an example implementation of these optimizations?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ddda856 and 5999bfb.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)

31-31: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern [scan|spectrum]=(\d+) is incorrect as it creates a character class matching any single character in scan|pectrum, not the words "scan" or "spectrum".

Apply this fix:

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

68-69: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary copies

Converting numpy arrays to Python lists creates unnecessary copies of large arrays.

Keep the data in numpy arrays:

-                        "mz": mz_array.tolist(),
-                        "intensity": intensity_array.tolist()
+                        "mz": mz_array,
+                        "intensity": intensity_array

129-129: ⚠️ Potential issue

Fix incorrect parquet writer usage

Using self.parquet_writer instead of self.id_parquet_writer for spectrum data is incorrect.

-                self.parquet_writer.write_table(spectrum_table)
+                self.id_parquet_writer.write_table(spectrum_table)

261-266: 🛠️ Refactor suggestion

Improve resource management for parquet writer

The parquet writer should be initialized within a try block to ensure proper cleanup in case of exceptions.

-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                schema=schema,
-                compression='gzip'
-            )
+            parquet_writer = None
+            try:
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    schema=schema,
+                    compression='gzip'
+                )
 
                 # ... existing code ...
 
             finally:
-                parquet_writer.close()
+                if parquet_writer is not None:
+                    parquet_writer.close()

57-61: ⚠️ Potential issue

Add validation for precursor existence

Accessing spectrum.getPrecursors()[0] without checking if precursors exist could raise an IndexError.

Add validation:

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                return
-            precursor = spectrum.getPrecursors()[0]
+            precursor = precursors[0]
             charge_state = precursor.getCharge()
             exp_mz = precursor.getMZ()

@ypriverol ypriverol requested a review from jpfeuffer November 30, 2024 14:25
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5999bfb and 3e4613c.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)

68-69: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of potentially large arrays. Consider keeping the data in numpy arrays.

-                        "mz": mz_array.tolist(),
-                        "intensity": intensity_array.tolist()
+                        "mz": mz_array,
+                        "intensity": intensity_array

Likely invalid or redundant comment.


31-31: ⚠️ Potential issue

Fix the regular expression pattern for scan/spectrum matching

The current pattern r"[scan|spectrum]=(\d+)" uses square brackets which create a character class matching any single character in "scan|spectrum". This will not correctly match the intended patterns.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

Likely invalid or redundant comment.


261-266: 🛠️ Refactor suggestion

Improve resource management for parquet writer

The parquet writer should be initialized within a context manager or try block to ensure proper cleanup in case of exceptions.

-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                schema=schema,
-                compression='gzip'
-            )
+            parquet_writer = None
+            try:
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    schema=schema,
+                    compression='gzip'
+                )
+            except Exception as e:
+                print(f"Error initializing parquet writer: {e}")
+                raise

Likely invalid or redundant comment.


57-61: ⚠️ Potential issue

Add precursor check to prevent IndexError

The code directly accesses the first precursor without checking if any precursors exist, which could raise an IndexError.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                return
-            precursor = spectrum.getPrecursors()[0]
+            precursor = precursors[0]
             charge_state = precursor.getCharge()
             exp_mz = precursor.getMZ()

Likely invalid or redundant comment.


136-152: 🛠️ Refactor suggestion

Improve error handling in finalize method

The finalize method should ensure proper cleanup of resources even if an exception occurs.

     def finalize(self):
+        try:
             if self.batch_data:
                 self._write_batch()

             if self.id_only and self.psm_parts:
                 self._write_batch()

+        except Exception as e:
+            print(f"Error during finalization: {e}")
+            raise
+        finally:
             if self.parquet_writer:
                 self.parquet_writer.close()
+                self.parquet_writer = None

             if self.id_parquet_writer:
                 self.id_parquet_writer.close()
+                self.id_parquet_writer = None

Likely invalid or redundant comment.

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

293-294: Improve SQL query construction

The current approach to sanitizing column names could be more robust.

-            safe_columns = [col for col in base_columns if col.replace(" ", "").isalnum()] # Remove spaces
-            query = f"""SELECT {', '.join(safe_columns)} FROM frames """
+            # Whitelist of allowed SQL expressions
+            allowed_expressions = {
+                "CASE WHEN MsMsType IN (8, 9) THEN 2 WHEN MsMsType = 0 THEN 1 ELSE NULL END as MsMsType",
+                "Id", "NumPeaks", "MaxIntensity", "SummedIntensities", "Time", "Charge", "MonoisotopicMz"
+            }
+            safe_columns = [col for col in base_columns if col in allowed_expressions]
+            query = f"""SELECT {', '.join(safe_columns)} FROM frames"""
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between dbfc89d and fe1739e.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)

76-77: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of large arrays.

-                            "mz": mz_array.tolist(),
-                            "intensity": intensity_array.tolist(),
+                            "mz": mz_array,
+                            "intensity": intensity_array,

Likely invalid or redundant comment.


38-38: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current regex pattern r"[scan|spectrum]=(\d+)" uses square brackets [] which defines a character class matching any single character in the set. This will not correctly match "scan=" or "spectrum=".

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

Likely invalid or redundant comment.


238-238: ⚠️ Potential issue

Fix missing parameter in BatchWritingConsumer constructor

The constructor call is missing the required id_parquet_schema parameter.

-        consumer = BatchWritingConsumer(parquet_schema, output_path, batch_size, id_only)
+        consumer = BatchWritingConsumer(parquet_schema, id_parquet_schema, output_path, batch_size, id_only)

Likely invalid or redundant comment.


64-68: ⚠️ Potential issue

Add check for empty precursors list

Accessing spectrum.getPrecursors()[0] without checking if the list is non-empty can raise an IndexError.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                charge_state = None
+                exp_mz = None
+            else:
-            precursor = spectrum.getPrecursors()[0]
+                precursor = precursors[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()

Likely invalid or redundant comment.


153-169: 🛠️ Refactor suggestion

Improve resource management in finalize method

The finalize method should ensure proper cleanup of resources even if an exception occurs.

     def finalize(self):
+        try:
             if self.batch_data:
                 self._write_batch()

             if self.id_only and self.psm_parts:
                 self._write_batch()
+        except Exception as e:
+            print(f"Error during finalization: {e}")
+            raise
+        finally:
             if self.parquet_writer:
-                self.parquet_writer.close()
+                try:
+                    self.parquet_writer.close()
+                finally:
+                    self.parquet_writer = None

             if self.id_parquet_writer:
-                self.id_parquet_writer.close()
+                try:
+                    self.id_parquet_writer.close()
+                finally:
+                    self.id_parquet_writer = None

Likely invalid or redundant comment.

Copy link
Contributor

@jpfeuffer jpfeuffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my remaining change request

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
quantmsutils/mzml/mzml_statistics.py (2)

299-302: Strengthen SQL column name validation

While the current validation removes spaces, it could be more robust by using a stricter pattern for valid SQL column names.

             safe_columns = [
-                col for col in base_columns if col.replace(" ", "").isalnum()
+                col for col in base_columns 
+                if all(part.isalnum() or part in ['_', ' '] for part in col)
+                and not col.startswith(('_', ' '))
             ]

348-349: Ensure case-insensitive file extension handling

The extension comparison could be made more robust by normalizing extensions to lowercase before comparison.

-    valid_extensions = {".d", ".mzml", ".mzML"}
-    candidates = [str(c.resolve()) for c in candidates if c.suffix.lower() in valid_extensions]
+    valid_extensions = {ext.lower() for ext in {".d", ".mzml"}}
+    candidates = [str(c.resolve()) for c in candidates if c.suffix.lower() in valid_extensions]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fe1739e and 2ca002c.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (4)
quantmsutils/mzml/mzml_statistics.py (4)

76-77: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of large arrays, impacting performance.

-                            "mz": mz_array.tolist(),
-                            "intensity": intensity_array.tolist(),
+                            "mz": mz_array,
+                            "intensity": intensity_array,

Likely invalid or redundant comment.


38-38: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current pattern r"[scan|spectrum]=(\d+)" is incorrect. Square brackets [] define a character class that matches any single character within the brackets, not alternation between words. This will fail to correctly extract scan IDs.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

Likely invalid or redundant comment.


64-68: ⚠️ Potential issue

Add precursor existence check to prevent IndexError

Accessing spectrum.getPrecursors()[0] without checking if precursors exist can raise an IndexError.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

Likely invalid or redundant comment.


153-169: 🛠️ Refactor suggestion

Improve resource management in finalize method

The finalize method should ensure proper cleanup of resources even if an exception occurs.

     def finalize(self):
+        try:
             if self.batch_data:
                 self._write_batch()

             if self.id_only and self.psm_parts:
                 self._write_batch()
+        finally:
             if self.parquet_writer:
                 self.parquet_writer.close()
+                self.parquet_writer = None

             if self.id_parquet_writer:
                 self.id_parquet_writer.close()
+                self.id_parquet_writer = None

Likely invalid or redundant comment.

@jpfeuffer jpfeuffer self-requested a review November 30, 2024 16:43
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

307-307: Use parameterized query construction for better security

While the columns are validated, constructing the SQL query using string concatenation could still be risky.

-            query = f"""SELECT {', '.join(safe_columns)} FROM frames"""
+            query = f"""SELECT {', '.join('?' for _ in safe_columns)} FROM frames"""
+            safe_columns_tuple = tuple(safe_columns)

Then use the tuple with the parameterized query:

for chunk in pd.read_sql_query(query, conn, params=safe_columns_tuple, chunksize=batch_size):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2b5cc7f and 4d3351f.

📒 Files selected for processing (1)
  • quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (2)
quantmsutils/mzml/mzml_statistics.py (2)

38-38: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern r"[scan|spectrum]=(\d+)" is incorrect as square brackets [] define a character class that matches any single character within them. This will match any of the characters 's', 'c', 'a', 'n', '|', 'p', 'e', 't', 'r', 'u', 'm' followed by '='.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

Likely invalid or redundant comment.


64-68: ⚠️ Potential issue

Add precursor existence check to prevent IndexError

Direct access to spectrum.getPrecursors()[0] can raise an IndexError if no precursors are available.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

Likely invalid or redundant comment.

quantmsutils/mzml/mzml_statistics.py Show resolved Hide resolved
@ypriverol ypriverol merged commit 5620ead into main Nov 30, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants