SimMS

Calculate similarities between large numbers of mass spectra on a GPU. SimMS aims to provide very fast replacements for commonly used similarity functions in matchms.

How SimMS works, in a nutshell

Comparing large sets of mass spectra parallelizes well, because each score is calculated independently of all the others. By spreading the work across the many threads of a GPU, our GPU program (kernel) calculates a 4096 x 4096 similarity matrix in a fraction of a second. By iteratively calculating similarities for batches of spectra, SimMS can quickly process datasets much larger than the GPU memory. For details, see the preprint.
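
The batching idea can be sketched in plain Python. This is only an illustration of tiling a score matrix, not the SimMS kernel: the names batch_ranges, score_in_batches and toy_score, and the toy "spectra" (plain numbers), are assumptions made for this example.

```python
from itertools import product


def batch_ranges(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


def score_in_batches(references, queries, score_fn, batch_size):
    """Fill a len(references) x len(queries) score matrix one tile at a time."""
    scores = [[0.0] * len(queries) for _ in references]
    ref_tiles = list(batch_ranges(len(references), batch_size))
    query_tiles = list(batch_ranges(len(queries), batch_size))
    for (r0, r1), (q0, q1) in product(ref_tiles, query_tiles):
        # Each (i, j) pair in a tile is independent of every other pair,
        # which is what lets a GPU score them in parallel.
        # Here the tile is simply a nested loop.
        for i in range(r0, r1):
            for j in range(q0, q1):
                scores[i][j] = score_fn(references[i], queries[j])
    return scores


def toy_score(a, b):
    """Stand-in similarity function for the toy data below."""
    return 1.0 / (1.0 + abs(a - b))


# Toy "spectra": plain numbers instead of real peak lists.
toy_refs = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_queries = [1.0, 3.0, 5.0]
tiled = score_in_batches(toy_refs, toy_queries, toy_score, batch_size=2)
```

Because only one tile needs to be resident at a time, the full matrix can be much larger than what fits in memory at once.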

Quickstart

Hardware

Any GPU supported by numba can be used. We tested a number of GPUs:

RTX 2070, used on a local machine
T4 GPU, offered for free on Colab
RTX 4090 GPU, offered on vast.ai
Any A100 GPU, offered on vast.ai

The pytorch/pytorch:2.2.1-cuda12.1-cudnn8-devel docker image was used for development and testing.
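
Before installing, it can help to check whether numba sees a GPU at all. A minimal sketch, assuming numba may or may not be installed; the helper name cuda_gpu_available is made up for this example, while cuda.is_available() is part of numba.

```python
def cuda_gpu_available() -> bool:
    """Return True if numba can see a CUDA-capable GPU, False otherwise."""
    try:
        # Imported lazily so the check also works when numba is not installed.
        from numba import cuda
    except ImportError:
        return False
    return bool(cuda.is_available())


gpu_ok = cuda_gpu_available()
print("CUDA GPU visible to numba:", gpu_ok)
```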

Install

pip install git+https://github.com/PangeAI/simms

Use with MatchMS

from matchms import calculate_scores
from matchms.importing import load_from_mgf
from simms.utils import download
from simms.similarity import (
    CudaCosineGreedy,
    CudaModifiedCosine,
    CudaFingerprintSimilarity,
)

sample_file = download('pesticides.mgf')
references = list(load_from_mgf(sample_file))
queries = list(load_from_mgf(sample_file))

similarity_function = CudaCosineGreedy()

scores = calculate_scores(
    references=references,
    queries=queries,
    similarity_function=similarity_function,
)

scores.scores_by_query(queries[42], 'CudaCosineGreedy_score', sort=True)

Use as a CLI

pangea-simms --references library.mgf --queries queries.mgf --output_file scores.pickle \
                    --tolerance 0.01 \
                    --mz_power 1 \
                    --intensity_power 1 \
                    --batch_size 512 \
                    --n_max_peaks 512 \
                    --match_limit 1024 \
                    --array_type numpy \
                    --sparse_threshold 0.5 \
                    --method CudaCosineGreedy
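
When scripting many runs, the same command line can be assembled from Python. This is a sketch that only builds and prints the command; the helper build_simms_command is hypothetical, and it uses only flags that appear in the command above.

```python
import shlex


def build_simms_command(references, queries, output_file, **options):
    """Return the pangea-simms argument list for the given files and options."""
    command = [
        "pangea-simms",
        "--references", references,
        "--queries", queries,
        "--output_file", output_file,
    ]
    for name, value in options.items():
        # Each keyword argument becomes a "--name value" pair.
        command += [f"--{name}", str(value)]
    return command


command = build_simms_command(
    "library.mgf",
    "queries.mgf",
    "scores.pickle",
    tolerance=0.01,
    batch_size=512,
    method="CudaCosineGreedy",
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where SimMS is installed.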

Supported similarity functions

CudaModifiedCosine, equivalent to ModifiedCosine
CudaCosineGreedy, equivalent to CosineGreedy
CudaFingerprintSimilarity, equivalent to FingerprintSimilarity (jaccard, cosine, dice)
More coming soon - requests are welcome!
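
To make concrete what a greedy cosine score computes, here is a simplified pure-Python sketch of the general technique. It is neither the SimMS kernel nor the matchms implementation: the function name cosine_greedy, its default values and the toy peak lists are assumptions for illustration.

```python
import math


def cosine_greedy(peaks_a, peaks_b, tolerance=0.1, mz_power=0.0, intensity_power=1.0):
    """Greedy cosine score between two peak lists of (mz, intensity) tuples.

    Returns (score, number_of_matched_peak_pairs).
    """
    def weight(mz, intensity):
        return (mz ** mz_power) * (intensity ** intensity_power)

    # 1. Collect every candidate pair whose m/z values differ by at most `tolerance`.
    candidates = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((weight(mz_a, int_a) * weight(mz_b, int_b), i, j))

    # 2. Greedily accept the highest-weighted pairs, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    total = 0.0
    for pair_product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += pair_product

    # 3. Normalise by the weighted norms of both spectra.
    norm_a = math.sqrt(sum(weight(mz, it) ** 2 for mz, it in peaks_a))
    norm_b = math.sqrt(sum(weight(mz, it) ** 2 for mz, it in peaks_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0, 0
    return total / (norm_a * norm_b), len(used_a)


# Toy spectra, invented for this example.
spectrum_a = [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)]
spectrum_b = [(300.0, 1.0), (400.0, 0.5)]
self_score, self_matches = cosine_greedy(spectrum_a, spectrum_a)
other_score, other_matches = cosine_greedy(spectrum_a, spectrum_b)
```

A spectrum compared with itself scores 1 with every peak matched, while two spectra with no peaks within the tolerance score 0.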

Installation

The easiest way to get started is to use the Colab notebook, which has everything ready for you.

For local installations, we recommend using micromamba because it is much faster.

The total install size in a fresh conda environment is around 7-8 GB (the heaviest packages are pytorch and cudatoolkit).

# Install cudatoolkit
conda install nvidia::cuda-toolkit -y

# Install torch (follow the official guide https://pytorch.org/get-started/locally/#start-locally)
conda install pytorch -c pytorch -c nvidia -y

# Install numba (follow the official guide: https://numba.pydata.org/numba-doc/latest/user/installing.html#installing-using-conda-on-x86-x86-64-power-platforms)
conda install numba -y

# Install this repository
pip install git+https://github.com/PangeAI/simms

Run in docker

The pytorch/pytorch:2.2.1-cuda12.1-cudnn8-devel image has nearly everything you need. Once inside, run:

pip install git+https://github.com/PangeAI/simms

Run on vast.ai

Use this template as a starting point. Once inside, run:

pip install git+https://github.com/PangeAI/simms

Frequently asked questions

I want to get `reference_id`, `query_id` and `score` as 1D arrays, separately. How do I do this?

Use the "sparse" mode, which returns the columns directly. Scores below sparse_threshold are discarded, so setting sparse_threshold to 0 returns all the scores.

from simms.similarity import CudaCosineGreedy

scores_cu = CudaCosineGreedy(
    sparse_threshold=0.75, # anything with a lower score gets discarded
).matrix(references, queries, array_type='sparse')

# Unpack sparse results as 1D arrays
ref_id, query_id, scores = scores_cu.data['sparse_score']
ref_id, query_id, matches = scores_cu.data['sparse_matches']
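
Once the results are available as parallel 1D arrays, they are easy to post-process. A minimal sketch that picks the best-scoring reference for each query; the toy lists below are hypothetical stand-ins for the arrays unpacked above, and best_hit_per_query is a helper made up for this example.

```python
# Hypothetical stand-ins for the unpacked 1D arrays (same length, one entry per kept score).
ref_id = [0, 2, 1, 2, 0]
query_id = [0, 0, 1, 1, 2]
scores = [0.91, 0.80, 0.99, 0.77, 0.85]


def best_hit_per_query(ref_id, query_id, scores):
    """Map each query index to (reference index, score) of its highest score."""
    best = {}
    for r, q, s in zip(ref_id, query_id, scores):
        if q not in best or s > best[q][1]:
            best[q] = (r, s)
    return best


best = best_hit_per_query(ref_id, query_id, scores)
for q, (r, s) in sorted(best.items()):
    print(f"query {q}: best reference {r} (score {s:.2f})")
```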

Citing SimMS

If you want to cite SimMS in your research, you can use the following BibTeX entry:

@article{Onoprishvili2024,
	title = {SimMS: A GPU-Accelerated Cosine Similarity implementation for Tandem Mass Spectrometry},
	author = {Onoprishvili, Tornike and Yuan, Jui-Hung and Petrov, Kamen and Ingalalli, Vijay and Khederlarian, Lila and Leuchtenmuller, Niklas and Chandra, Sona and Duarte, Aurelien and Bender, Andreas and Gloaguen, Yoann},
	journal = {bioRxiv},
	elocation-id = {2024.07.24.605006},
	URL = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.24.605006},
	year = {2024},
	doi = {10.1101/2024.07.24.605006}
}
