MS2 Autoencoder is built on Keras for Python. The purpose of MS2 Autoencoder is to create a generalized model of MS2 spectra so that any low-quality spectrum can be upscaled to a high-quality spectrum (with quality based on precursor intensity). The direct general application of this tool is denoising spectra.
- sklearn
- pyteomics
- h5py
- keras autoencoder tutorial
- tensorflow (tensorflow-gpu or tensorflow*)
- *tensorflow-gpu worked on version 1.14 with CUDA version 10.0
- Extract MS2 data from mzXML/mzML files
- Stitch all extracted data files (.npz) into an HDF5 file (.hdf5)
- Train models: autoencoder, deep autoencoder, convolutional neural network, variational autoencoder, LSTM
- Evaluate and predict test data on the models
- Achieve spectra upscaling/denoising
- In MS2-Autoencoder/bin/main.py, import extract_mzxml as em
- The else statement in main.py contains the entire top-to-bottom flow of mzXML data extraction
- This step should be run on the cluster with nohup and NextFlow to gather all of the data
- The Makefile includes instructions for NextFlow to run main.py on all QExactive data on GNPS (Nov 2019)
- This step outputs several files per input mzXML/mzML, including ready_array.npz, which holds metadata about each spectra pair, and ready_array2.npz, which holds the actual vectorized data
- Use scp to transfer the extracted outdirs from the cluster to local (it is advised to remove the .json files from each outdir first)
- Only ready_array2.npz (or another .npz file) is needed for stitching
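As a sketch of what the transferred outputs look like, the arrays in one outdir can be inspected with NumPy. Only the file name ready_array2.npz is documented above; the internal array names are not, so this helper (its name is ours) returns everything in the archive:

```python
import os
import numpy as np

def load_pair_arrays(npz_path):
    """Load every named array from one extraction output (.npz).

    The archive's internal array names are undocumented here,
    so the contents are returned as a dict for inspection.
    """
    with np.load(npz_path, allow_pickle=True) as archive:
        return {name: archive[name] for name in archive.files}

# Hypothetical usage against one transferred outdir
path = os.path.join("outdir", "ready_array2.npz")
if os.path.exists(path):
    for name, arr in load_pair_arrays(path).items():
        print(name, arr.shape)
```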
- In MS2-Autoencoder/bin/processing.py
- Specify the path to the parent directory of all outdirs and the name of the data file (e.g. 'ready_array2.npz' to merge all of the actual paired spectra vectors)
- processing.py will concatenate all .npz files and output two .hdf5 files:
  - an autoencoder-structured dataset
  - a 1D convolutional neural network-structured dataset
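A minimal sketch of the stitching step with h5py, assuming each outdir holds one copy of the named .npz file; the HDF5 dataset name ("spectra") and the fallback to the archive's first array are our assumptions, not the repo's layout:

```python
import glob
import os

import h5py
import numpy as np

def stitch_npz_to_hdf5(parent_dir, data_filename, out_path, key=None):
    """Concatenate one named array from every outdir into a single HDF5 dataset."""
    paths = sorted(glob.glob(os.path.join(parent_dir, "*", data_filename)))
    with h5py.File(out_path, "w") as h5:
        dset = None
        for p in paths:
            with np.load(p, allow_pickle=True) as archive:
                # if no key is given, take the first array in the archive
                block = archive[key or archive.files[0]]
            if dset is None:
                # resizable along axis 0 so later files can be appended
                dset = h5.create_dataset(
                    "spectra", data=block,
                    maxshape=(None,) + block.shape[1:], chunks=True)
            else:
                dset.resize(dset.shape[0] + block.shape[0], axis=0)
                dset[-block.shape[0]:] = block
    return out_path
```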
- Model architectures are outlined in ms2-autoencoder.py, ms2-conv1d.py, and ms2-deepautoencoder.py
- Generators, training, evaluation, prediction, and all model architectures are in ms2_model.py
- In train_models.py, import ms2_model.py
- Trained models are saved as .h5 files with both architecture and weights
- The model training function is built on tensorflow-gpu, with GPU memory allocation and session declaration
- Model training can be done on a local or cluster machine
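The training step might look like the following minimal sketch using tf.keras; the layer sizes, loss, input dimension, and output file name are illustrative assumptions, not the actual architectures in ms2-autoencoder.py:

```python
import numpy as np
from tensorflow import keras

def build_autoencoder(input_dim=2000, latent_dim=64):
    """Toy dense autoencoder; the real architectures live in ms2-autoencoder.py etc."""
    inputs = keras.Input(shape=(input_dim,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    latent = keras.layers.Dense(latent_dim, activation="relu")(x)
    x = keras.layers.Dense(256, activation="relu")(latent)
    outputs = keras.layers.Dense(input_dim, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

if __name__ == "__main__":
    model = build_autoencoder(input_dim=100)
    # Illustrative stand-ins for the low/high quality spectra pairs
    low = np.random.rand(64, 100).astype("float32")
    high = np.random.rand(64, 100).astype("float32")
    model.fit(low, high, epochs=1, batch_size=16, verbose=0)
    model.save("autoencoder.h5")  # saves architecture and weights together
```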
- Jupyter/keras load validate.ipynb is the Jupyter notebook for loading models and visualizing predictions
- The model prediction function is built on tensorflow-gpu, with GPU memory allocation and session declaration
- Ideally, cosine proximity is closer to 1.0 than to 0.0
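The cosine proximity mentioned above can be checked directly on a model's predictions; here is a minimal NumPy sketch (the helper name is ours, not the notebook's):

```python
import numpy as np

def cosine_proximity(a, b):
    """Row-wise cosine similarity between predicted and target spectra.

    Returns values in [-1, 1]; 1.0 means a predicted spectrum points in
    exactly the same direction as its high-quality target.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    num = np.sum(a * b, axis=-1)
    denom = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    # guard against all-zero spectra to avoid division by zero
    return num / np.maximum(denom, 1e-12)

# A spectrum compared with itself scores ~1.0
print(cosine_proximity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))
```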