mz_parquet

REQUEST FOR COMMENT - If you have feedback or suggetions for this file format, it would be greatly appreciated (use the issue tracker!).

This repository houses a proof-of-concept for a modern, data lake-ready storage format for mass spectrometry data (mzparquet). A rust application is also provided for converting existing mzML files to mzparquet

Current standards like mzML, mzMLb, or mz5 have good characteristics for long-term data archival, but all require custom parsers to access data.

The Apache Parquet file format has emerged as the default file format for data science/big data workflows. Parquet allows storing nested data (such as lists of m/z and intensities, or key-value pairs of controlled vocabulary terms) in a column-oriented format, allowing fast queries with predicate pushdown. Parquet also supports various encoding and compression schemes natively, leading to dramatic reductions in file size.

However, the most critical part is that Parquet is widely supported in the data science ecosystem - supported out of the box by pandas, polars, and provided bindings for most programming languages. Parquet is also supported by a variety of databases and query engines (Athena, datafusion, Spark, BigQuery, etc), enabling direct SQL queries over petabyte-scale mass spectrometry data.

The format specified in this repository is a minimal set of columns needed to successfully process data - peak lists, isolation windows, and RT/ion mobility values. Each individual ion in an acquisition has it's own row in the mzparquet file (long format).

Example of querying mzparquet files

import polars as pl

df = pl.read_parquet("test.mzparquet")

# Find all MS2 ions with an m/z between (267.0, 267.1)
df.filter(
    pl.col("mz").is_between(267.0, 267.1)
  & pl.col("ms_level").eq(2)
)

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
bench.py		bench.py
demo.py		demo.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mz_parquet

Example of querying mzparquet files

About

Releases 4

Packages

Languages

License

lazear/mz_parquet

Folders and files

Latest commit

History

Repository files navigation

mz_parquet

Example of querying mzparquet files

About

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

Packages