Merge pull request #243 from iomega/make_ms2deepscore_compatible
Update to testing version 3.9 - 3.11
niekdejonge authored Jul 2, 2024
2 parents 17acbe4 + f211a27 commit 84ecb57
Showing 22 changed files with 167 additions and 195 deletions.
20 changes: 10 additions & 10 deletions .github/workflows/CI_build.yml
@@ -8,14 +8,14 @@ on:
jobs:

thorough_check:
name: Thorough code check / python-3.8 / ubuntu-latest
name: Thorough code check / python-3.9 / ubuntu-latest
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v1
uses: actions/setup-python@v5
with:
python-version: 3.8
python-version: 3.9
- name: Python info
run: |
which python
@@ -30,7 +30,7 @@ jobs:
- name: Run test with coverage
run: pytest --cov --cov-report term --cov-report xml -m "not integration"
- name: Check style against standards using prospector
run: prospector -o grouped -o pylint:pylint-report.txt
run: prospector -o grouped -o pylint:pylint-report.txt --ignore-paths notebooks
- name: Check whether import statements are used consistently
run: isort --check-only --diff .
- name: SonarCloud Scan
@@ -46,15 +46,15 @@ jobs:
fail-fast: false
matrix:
os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
python-version: ['3.8', '3.9']
python-version: ['3.9', '3.10']
exclude:
# already tested in first_check job
- python-version: 3.8
- python-version: 3.9
os: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Python info
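The matrix in this hunk combines three operating systems with two Python versions and then drops one combination through `exclude`. As a rough illustration only, the sketch below models that pruning in Python. The helper `expand_matrix` is hypothetical and is a simplified model of the behaviour, not how GitHub Actions implements it.

```python
from itertools import product

# Values copied from the new side of the matrix in the hunk above.
operating_systems = ["ubuntu-latest", "macos-latest", "windows-latest"]
python_versions = ["3.9", "3.10"]
# The combination that the workflow excludes from this job.
excluded = [{"os": "ubuntu-latest", "python-version": "3.9"}]


def expand_matrix(os_names, versions, exclude):
    """Hypothetical helper: return every os/version pair that is not excluded."""
    jobs = []
    for os_name, version in product(os_names, versions):
        combination = {"os": os_name, "python-version": version}
        if combination not in exclude:
            jobs.append(combination)
    return jobs


jobs = expand_matrix(operating_systems, python_versions, excluded)
for job in jobs:
    print(job["os"], job["python-version"])
```

Under this model the job runs five of the six possible combinations. The version strings are kept quoted because an unquoted `3.10` in YAML is parsed as the number 3.1.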
@@ -79,11 +79,11 @@ jobs:
fail-fast: false
matrix:
os: ['ubuntu-latest']
python-version: ['3.8']
python-version: ['3.9']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Python info
@@ -113,7 +113,7 @@ jobs:
with:
activate-environment: ms2query
environment-file: ./environment.yml
python-version: 3.8
python-version: 3.9
- name: activate conda environment
run: |
conda activate ms2query
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 1.5.0
### Changed
- MS2Query is now tested on python 3.9 and 3.10 instead of 3.8 and 3.9
- MS2Query is using MS2Deepscore 2.0. This is a breaking change, making MS2Query not work with old models anymore
- Updated model to use MS2Deepscore 2.0 and used newly available training data for all models.

## 1.4.0
### Changed
- Made compatible with MS2Deepscore 0.5.0
6 changes: 3 additions & 3 deletions README.md
@@ -44,7 +44,7 @@ For questions regarding MS2Query please make an issue on github or contact niek.
We recommend to create an Anaconda environment with

```
conda create --name ms2query python=3.8
conda create --name ms2query python=3.9
conda activate ms2query
```
### Pip install MS2Query
@@ -54,7 +54,7 @@ pip install ms2query
```
All dependencies are automatically installed, the dependencies can be found in setup.py.
The installation is expected to take about 2 minutes.
MS2Query is tested by continous integration on MacOS, Windows and Ubuntu for python version 3.7 and 3.8.
MS2Query is tested by continous integration on MacOS, Windows and Ubuntu for python version 3.9 and 3.10.

## Run MS2Query from command line

@@ -281,7 +281,7 @@ After running you can run MS2Query on your newly created models and library. See
We recommend to create an Anaconda environment with

```
conda create --name ms2query python=3.8
conda create --name ms2query python=3.9
conda activate ms2query
```
### Clone repository
23 changes: 11 additions & 12 deletions environment.yml
@@ -4,19 +4,18 @@ channels:
- bioconda
- defaults
dependencies:
- python=3.8.18
- matchms=0.24.1
- python=3.9.18
- matchms=0.26.4
- numpy=1.24.4
- spec2vec=0.8.0
- h5py=3.9.0
- pyarrow=12.0.1
- tensorflow=2.12.1
- scikit-learn=1.3.2
- ms2deepscore=0.5.0
- pandas=2.0.3
- matplotlib=3.7.3
- h5py=3.11.0
- pyarrow=16.1.0
- scikit-learn=1.5.0
- ms2deepscore=2.0.0
- pandas=2.2.2
- matplotlib=3.7.2
- skl2onnx=1.16.0
- onnxruntime=1.16.3
- pytest=7.4.0
- pytest-cov=4.1.0
- onnxruntime=1.17.0
- pytest=8.2.2
- pytest-cov=5.0.0
- zip
10 changes: 4 additions & 6 deletions ms2query/benchmarking/collect_test_data_results.py
@@ -13,8 +13,7 @@
from matchms.calculate_scores import calculate_scores
from matchms.similarity.CosineGreedy import CosineGreedy
from matchms.similarity.ModifiedCosine import ModifiedCosine
from ms2deepscore import MS2DeepScore
from ms2deepscore.models import SiameseModel
from ms2deepscore.models import SiameseSpectralModel, compute_embedding_array
from spec2vec.vector_operations import cosine_similarity_matrix
from tqdm import tqdm
from ms2query.create_new_library.calculate_tanimoto_scores import (
@@ -51,7 +50,7 @@ def generate_test_results_ms2query(ms2library: MS2Library,
return test_results_ms2query


def get_all_ms2ds_scores(ms2ds_model: SiameseModel,
def get_all_ms2ds_scores(ms2ds_model: SiameseSpectralModel,
ms2ds_embeddings,
test_spectra
) -> pd.DataFrame:
@@ -64,9 +63,8 @@ def get_all_ms2ds_scores(ms2ds_model: SiameseModel,
Spectra for which similarity scores should be calculated for all
spectra in the ms2ds embeddings file.
"""
# ms2ds_model = load_ms2ds_model(ms2ds_model_file_name)
ms2ds = MS2DeepScore(ms2ds_model, progress_bar=False)
query_embeddings = ms2ds.calculate_vectors(test_spectra)
query_embeddings = compute_embedding_array(ms2ds_model, test_spectra)

library_ms2ds_embeddings_numpy = ms2ds_embeddings.to_numpy()

ms2ds_scores = cosine_similarity_matrix(library_ms2ds_embeddings_numpy,
1 change: 1 addition & 0 deletions ms2query/create_new_library/calculate_tanimoto_scores.py
@@ -23,6 +23,7 @@ def get_fingerprint(smiles: str):
def calculate_tanimoto_scores_from_smiles(list_of_smiles_1: List[str],
list_of_smiles_2: List[str]) -> np.ndarray:
"""Returns a 2d ndarray containing the tanimoto scores between the smiles"""
assert len(list_of_smiles_1) > 0 and len(list_of_smiles_2) > 0
fingerprints_1 = np.array([get_fingerprint(spectrum) for spectrum in tqdm(list_of_smiles_1,
desc="Calculating fingerprints")])
fingerprints_2 = np.array([get_fingerprint(spectrum) for spectrum in tqdm(list_of_smiles_2,
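The hunk above adds a guard so that the Tanimoto calculation fails early on empty input. As a minimal sketch of the same idea, the example below computes Tanimoto scores on toy fingerprints represented as sets of on-bits. The functions `tanimoto` and `tanimoto_matrix` are hypothetical stand-ins and do not reproduce the fingerprint code used in the repository.

```python
from typing import List, Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) score between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints share nothing; report 0.0 rather than divide by zero.
        return 0.0
    return len(bits_a & bits_b) / union


def tanimoto_matrix(fingerprints_1: List[Set[int]],
                    fingerprints_2: List[Set[int]]) -> List[List[float]]:
    """Return a 2d list with one row per fingerprint in fingerprints_1."""
    # Same guard as the assert added in the diff: both inputs must be non-empty.
    assert len(fingerprints_1) > 0 and len(fingerprints_2) > 0
    return [[tanimoto(a, b) for b in fingerprints_2] for a in fingerprints_1]


scores = tanimoto_matrix([{1, 2, 3}, {4}], [{2, 3, 4}])
print(scores)
```

Without the guard, an empty list would yield an empty result that only causes an error further downstream, so failing at the entry point is arguably easier to debug.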
6 changes: 2 additions & 4 deletions ms2query/create_new_library/library_files_creator.py
@@ -11,8 +11,8 @@
import pandas as pd
from gensim.models import Word2Vec
from matchms.Spectrum import Spectrum
from ms2deepscore import MS2DeepScore
from ms2deepscore.models import load_model as load_ms2ds_model
from ms2deepscore.models.SiameseSpectralModel import compute_embedding_array
from spec2vec.vector_operations import calc_vector
from tqdm import tqdm
from ms2query.clean_and_filter_spectra import create_spectrum_documents
@@ -141,11 +141,9 @@ def store_ms2ds_embeddings(self):
assert not os.path.exists(self.ms2ds_embeddings_file_name), \
"Given ms2ds_embeddings_file_name already exists"
assert self.ms2ds_model is not None, "No MS2deepscore model was provided"
ms2ds = MS2DeepScore(self.ms2ds_model,
progress_bar=self.progress_bars)

# Compute spectral embeddings
embeddings = ms2ds.calculate_vectors(self.list_of_spectra)
embeddings = compute_embedding_array(self.ms2ds_model, self.list_of_spectra)
spectrum_ids = np.arange(0, len(self.list_of_spectra))
all_embeddings_df = pd.DataFrame(embeddings, index=spectrum_ids)
save_df_as_parquet_file(all_embeddings_df, self.ms2ds_embeddings_file_name)
41 changes: 25 additions & 16 deletions ms2query/create_new_library/train_models.py
@@ -4,13 +4,15 @@
"""

import os
from ms2deepscore import SettingsMS2Deepscore
from ms2deepscore.train_new_model.train_ms2deepscore import train_ms2ds_model
from spec2vec.model_building import train_new_word2vec_model
from ms2query.clean_and_filter_spectra import (
clean_normalize_and_split_annotated_spectra, create_spectrum_documents)
from ms2query.create_new_library.library_files_creator import \
LibraryFilesCreator
from ms2query.create_new_library.train_ms2deepscore import \
train_ms2deepscore_wrapper
from ms2query.create_new_library.split_data_for_training import \
split_spectra_on_inchikeys
from ms2query.create_new_library.train_ms2query_model import (
convert_to_onnx_model, train_ms2query_model)
from ms2query.utils import load_matchms_spectrum_objects_from_file
@@ -20,17 +22,27 @@ class SettingsTrainingModels:
def __init__(self,
settings: dict = None):
default_settings = {"ms2ds_fraction_validation_spectra": 30,
"ms2ds_epochs": 150,
"spec2vec_iterations": 30,
"ms2query_fraction_for_making_pairs": 40,
"add_compound_classes": True}
"add_compound_classes": True,
"ms2ds_training_settings": SettingsMS2Deepscore(
history_plot_file_name="ms2deepscore_training_history.svg",
model_file_name="ms2deepscore_model.pt",
epochs=150,
embedding_dim=400,
base_dims=(500, 500),
min_mz=10,
max_mz=1000,
mz_bin_width=0.1,
intensity_scaling=0.5
)}
if settings:
for setting in settings:
assert setting in default_settings, \
f"Available settings are {default_settings.keys()}"
default_settings[setting] = settings[setting]
self.ms2ds_fraction_validation_spectra: float = default_settings["ms2ds_fraction_validation_spectra"]
self.ms2ds_epochs: int = default_settings["ms2ds_epochs"]
self.ms2ds_training_settings: SettingsMS2Deepscore = default_settings["ms2ds_training_settings"]
self.ms2query_fraction_for_making_pairs: int = default_settings["ms2query_fraction_for_making_pairs"]
self.spec2vec_iterations = default_settings["spec2vec_iterations"]
self.add_compound_classes: bool = default_settings["add_compound_classes"]
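`SettingsTrainingModels` in this hunk starts from a dictionary of defaults and overwrites entries with user-supplied values after checking that each key is known. The sketch below reproduces that pattern with only two settings. The class `TrainingSettingsSketch` is a hypothetical reduction for illustration and leaves out the `SettingsMS2Deepscore` object.

```python
class TrainingSettingsSketch:
    """Hypothetical two-setting reduction of the defaults-plus-overrides pattern."""

    def __init__(self, settings: dict = None):
        defaults = {"spec2vec_iterations": 30,
                    "add_compound_classes": True}
        if settings:
            for name, value in settings.items():
                # Reject unknown keys so that typos do not pass silently.
                assert name in defaults, \
                    f"Available settings are {list(defaults)}"
                defaults[name] = value
        self.spec2vec_iterations: int = defaults["spec2vec_iterations"]
        self.add_compound_classes: bool = defaults["add_compound_classes"]


custom = TrainingSettingsSketch({"spec2vec_iterations": 5})
print(custom.spec2vec_iterations, custom.add_compound_classes)
```

Because each override replaces the whole value, passing `ms2ds_training_settings` in the real class would, as far as this hunk shows, swap in a complete `SettingsMS2Deepscore` object rather than merge individual fields.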
@@ -43,18 +55,15 @@ def train_all_models(annotated_training_spectra,
if not os.path.isdir(output_folder):
os.mkdir(output_folder)
# set file names of new generated files
ms2deepscore_model_file_name = os.path.join(output_folder, "ms2deepscore_model.hdf5")
spec2vec_model_file_name = os.path.join(output_folder, "spec2vec_model.model")
ms2query_model_file_name = os.path.join(output_folder, "ms2query_model.onnx")
ms2ds_history_figure_file_name = os.path.join(output_folder, "ms2deepscore_training_history.svg")

# Train MS2Deepscore model
train_ms2deepscore_wrapper(annotated_training_spectra,
ms2deepscore_model_file_name,
settings.ms2ds_fraction_validation_spectra,
settings.ms2ds_epochs,
ms2ds_history_figure_file_name
)
training_spectra, validation_spectra = split_spectra_on_inchikeys(annotated_training_spectra,
settings.ms2ds_fraction_validation_spectra,
)
train_ms2ds_model(training_spectra, validation_spectra, output_folder,
settings.ms2ds_training_settings)

# Train Spec2Vec model
spectrum_documents = create_spectrum_documents(annotated_training_spectra + unannotated_training_spectra)
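Training now splits the annotated spectra into training and validation sets by InChIKey before calling `train_ms2ds_model`. The body of `split_spectra_on_inchikeys` is not part of this diff, so the sketch below is only a guess at the general technique. It assumes that a setting of `k` sends roughly 1 in `k` unique compounds to validation, and it uses plain dictionaries as a hypothetical stand-in for matchms `Spectrum` objects.

```python
import random
from typing import Dict, List, Tuple

Spectrum = Dict[str, str]  # hypothetical stand-in for a matchms Spectrum


def split_by_inchikey(spectra: List[Spectrum], k: int, seed: int = 0
                      ) -> Tuple[List[Spectrum], List[Spectrum]]:
    """Send roughly 1/k of the unique compounds, with all their spectra, to validation."""
    # Group on the first 14 characters of the InChIKey (the 2D structure part).
    keys = sorted({s["inchikey"][:14] for s in spectra})
    random.Random(seed).shuffle(keys)
    n_validation = max(1, len(keys) // k)
    validation_keys = set(keys[:n_validation])
    training = [s for s in spectra if s["inchikey"][:14] not in validation_keys]
    validation = [s for s in spectra if s["inchikey"][:14] in validation_keys]
    return training, validation


toy_spectra = [{"inchikey": key, "id": str(i)}
               for i, key in enumerate(["AAAAAAAAAAAAAA-X", "AAAAAAAAAAAAAA-Y",
                                        "BBBBBBBBBBBBBB-X", "BBBBBBBBBBBBBB-Y",
                                        "CCCCCCCCCCCCCC-X", "CCCCCCCCCCCCCC-Y"])]
train, val = split_by_inchikey(toy_spectra, k=3)
print(len(train), len(val))
```

Splitting on compounds rather than on individual spectra keeps the same structure from appearing on both sides, which would otherwise inflate validation scores.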
@@ -68,7 +77,7 @@ def train_all_models(annotated_training_spectra,
# Train MS2Query model
ms2query_model = train_ms2query_model(annotated_training_spectra,
os.path.join(output_folder, "library_for_training_ms2query"),
ms2deepscore_model_file_name,
os.path.join(output_folder, "ms2deepscore_model.pt"),
spec2vec_model_file_name,
fraction_for_training=settings.ms2query_fraction_for_making_pairs)
convert_to_onnx_model(ms2query_model, ms2query_model_file_name)
@@ -77,15 +86,15 @@ def train_all_models(annotated_training_spectra,
library_files_creator = LibraryFilesCreator(annotated_training_spectra,
output_folder,
spec2vec_model_file_name,
ms2deepscore_model_file_name,
os.path.join(output_folder, "ms2deepscore_model.pt"),
add_compound_classes=settings.add_compound_classes)
library_files_creator.create_all_library_files()


def clean_and_train_models(spectrum_file: str,
ion_mode: str,
output_folder,
model_train_settings = None,
model_train_settings=None,
do_pubchem_lookup = True):
"""Trains a new MS2Deepscore, Spec2Vec and MS2Query model and creates all needed library files
46 changes: 0 additions & 46 deletions ms2query/create_new_library/train_ms2deepscore.py

This file was deleted.

10 changes: 5 additions & 5 deletions ms2query/ms2library.py
@@ -4,7 +4,7 @@
import pandas as pd
from gensim.models import Word2Vec
from matchms.Spectrum import Spectrum
from ms2deepscore import MS2DeepScore
from ms2deepscore.models import compute_embedding_array
from ms2deepscore.models import load_model as load_ms2ds_model
from onnxruntime import InferenceSession
from spec2vec.vector_operations import calc_vector, cosine_similarity_matrix
@@ -82,7 +82,7 @@ def __init__(self,
self.s2v_embeddings: pd.DataFrame = load_df_from_parquet_file(s2v_embeddings_file_name)
self.ms2ds_embeddings: pd.DataFrame = load_df_from_parquet_file(ms2ds_embeddings_file_name)

assert self.ms2ds_model.base.output_shape[1] == self.ms2ds_embeddings.shape[1], \
assert self.ms2ds_model.model_settings.embedding_dim == self.ms2ds_embeddings.shape[1], \
"Dimension of pre-computed MS2DeepScore embeddings does not fit given model."

# load precursor mz's
@@ -250,8 +250,8 @@ def _get_all_ms2ds_scores(self, query_spectrum: Spectrum
Spectrum for which similarity scores should be calculated for all
spectra in the ms2ds embeddings file.
"""
ms2ds = MS2DeepScore(self.ms2ds_model, progress_bar=False)
query_embeddings = ms2ds.calculate_vectors([query_spectrum])
query_embeddings = compute_embedding_array(self.ms2ds_model, [query_spectrum])

library_ms2ds_embeddings_numpy = self.ms2ds_embeddings.to_numpy()
ms2ds_scores = cosine_similarity_matrix(library_ms2ds_embeddings_numpy,
query_embeddings)
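The two hunks above check that the model's embedding dimension matches the stored library embeddings and then score a query embedding against every library embedding by cosine similarity. The sketch below shows both steps in plain Python on toy vectors. The function `cosine_similarity_matrix_sketch` is hypothetical and is not the spec2vec `cosine_similarity_matrix` implementation.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity_matrix_sketch(library: List[Vector],
                                    queries: List[Vector]) -> List[List[float]]:
    """Return one row per library embedding and one column per query embedding."""
    assert library and queries, "Both inputs must contain at least one embedding."
    dim = len(queries[0])
    # Counterpart of the embedding_dim check in the diff: all vectors must agree.
    assert all(len(v) == dim for v in library + queries), \
        "Dimension of pre-computed embeddings does not fit the query embeddings."

    def norm(vector: Vector) -> float:
        return math.sqrt(sum(x * x for x in vector))

    scores = []
    for lib_vector in library:
        row = []
        for query_vector in queries:
            denominator = norm(lib_vector) * norm(query_vector)
            dot = sum(a * b for a, b in zip(lib_vector, query_vector))
            # Zero vectors have no direction, so report 0.0 for them.
            row.append(dot / denominator if denominator else 0.0)
        scores.append(row)
    return scores


library_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embeddings = [[1.0, 0.0]]
ms2ds_scores = cosine_similarity_matrix_sketch(library_embeddings, query_embeddings)
print(ms2ds_scores)
```

Checking the dimension once at load time, as the diff does, catches a mismatch between an old embeddings file and a new model before any query is scored.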
@@ -397,7 +397,7 @@ def get_ms2query_model_prediction_single_spectrum(
def select_files_for_ms2query(file_names: List[str], files_to_select=None):
"""Selects the files needed for MS2Library based on their file extensions. """
dict_with_file_extensions = \
{"sqlite": ".sqlite", "s2v_model": ".model", "ms2ds_model": ".hdf5",
{"sqlite": ".sqlite", "s2v_model": ".model", "ms2ds_model": ".pt",
"ms2query_model": ".onnx", "s2v_embeddings": "s2v_embeddings.parquet",
"ms2ds_embeddings": "ms2ds_embeddings.parquet"}
if files_to_select is not None:
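With this change `select_files_for_ms2query` looks for an MS2Deepscore model ending in `.pt` instead of `.hdf5`. The rest of the function is not shown in the diff, so the sketch below only illustrates suffix-based selection in general. The function `select_files_sketch` and its handling of duplicate matches are assumptions, not the repository's behaviour.

```python
from typing import Dict, List

# Suffixes copied from the new side of the hunk above.
FILE_SUFFIXES = {
    "sqlite": ".sqlite",
    "s2v_model": ".model",
    "ms2ds_model": ".pt",
    "ms2query_model": ".onnx",
    "s2v_embeddings": "s2v_embeddings.parquet",
    "ms2ds_embeddings": "ms2ds_embeddings.parquet",
}


def select_files_sketch(file_names: List[str]) -> Dict[str, str]:
    """Map each file type to the single file name that ends with its suffix."""
    selected = {}
    for file_type, suffix in FILE_SUFFIXES.items():
        matches = [name for name in file_names if name.endswith(suffix)]
        # Assumption: more than one match per type is treated as an error.
        assert len(matches) <= 1, f"Multiple files found for {file_type}: {matches}"
        if matches:
            selected[file_type] = matches[0]
    return selected


old_library = ["library.sqlite", "spec2vec_model.model",
               "ms2deepscore_model.hdf5", "ms2query_model.onnx"]
selected = select_files_sketch(old_library)
print("ms2ds_model" in selected)
```

In this sketch a folder that still holds an `.hdf5` model yields no `ms2ds_model` entry, which matches the changelog note that old models no longer work.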