NASQQ: Nextflow Automatization and Standardization for Qualitative and Quantitative 1H NMR Metabolomics
NASQQ is a comprehensive pipeline designed to automate the preparation and analysis of 1H NMR metabolomics data. It streamlines the process from raw Bruker FIDs through spectral preprocessing and metabolite identification to data analysis and pathway enrichment. This approach accelerates the interpretation of metabolomic data from the analyzed subjects and reduces the need for specialized domain knowledge.
- Automated Workflow: NASQQ automates the entire metabolomic analysis process, reducing manual intervention and ensuring reproducibility.
- Comprehensive Analysis: The pipeline covers spectral preprocessing, metabolite identification, data analysis, and pathway enrichment, providing a holistic view of the metabolomic data.
- Machine Learning Integration: NASQQ incorporates machine learning methods to bridge the gap between raw spectral information and biological insights.
- Load FIDs: Retrieve raw FIDs from a specified location, extract sample names, and filter by pulse program.
- Raw FIDs Visualisation: Plot figures of the raw FIDs.
- Group Delay Correction: Eliminate Bruker Group Delay from the FIDs.
- Solvent Suppression: Estimate and eliminate residual solvent signals from the FIDs.
- Apodization: Enhance the Signal-to-Noise ratio in the spectra.
- Zero Filling: Enhance the visual clarity of spectra by inserting zeros.
- Fourier Transformation: Convert FIDs from the time domain to frequency domain spectra using Fourier Transformation.
- Zero Order Phase Correction: Adjust spectra phase to ensure pure absorptive mode in the real part.
- Internal Referencing: Align spectra with an internal reference compound.
- Baseline Correction: Estimate and remove spectral baseline from the spectral profiles.
- Negative Values Zeroing: Set all negative values in spectra to zero.
- (Optional) Warping: Apply Semi-Parametric Time Warping technique to warp and realign spectra.
- Window Selection: Choose the informative segment of spectra.
- (Optional) Bucketing: Reduce the density of spectral peaks.
- Normalization: Normalize the spectra (a sketch of this preprocessing chain is shown after this list).
- Metabolites Quantification: Identify and quantify metabolites based on normalized spectra.
- Add Metadata: Merge metadata with quantified metabolites' relative abundances.
- (Optional) Combine Dataset Batches: Merge batches from the dataset for streamlined analysis.
- (Optional) Batch Correction: Remove batch effect from the data.
- Features Processing: Load data and perform sanity checks.
- Exploratory Data Analysis: Conduct Principal Component Analysis and generate exploratory analysis visualizations.
- Univariate Analysis: Identify outliers, assess data normality, and conduct univariate statistical tests.
- Multivariate Analysis: Utilize machine learning models to analyze metabolite data.
- Pathway Analysis: Perform pathway enrichment analysis using KEGG database entries.
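The spectral preprocessing step names above mirror the functions of the PepsNMR R package, which NASQQ appears to build on. The following is a minimal sketch of such a chain under that assumption; the paths and parameter values are illustrative placeholders, not the pipeline's actual defaults.

```r
## Minimal sketch of the preprocessing chain, assuming the PepsNMR R package
## (whose function names the steps above mirror); paths and parameter values
## are illustrative, not the pipeline's defaults.
library(PepsNMR)

fids <- ReadFids("data/dataset")            # load raw Bruker FIDs
Fid_data <- fids[["Fid_data"]]
Fid_info <- fids[["Fid_info"]]

Fid_data <- GroupDelayCorrection(Fid_data, Fid_info)    # remove Bruker group delay
Fid_data <- SolventSuppression(Fid_data)                # subtract residual solvent signal
Fid_data <- Apodization(Fid_data, Fid_info)             # improve signal-to-noise ratio
Fid_data <- ZeroFilling(Fid_data, fn = ncol(Fid_data))  # pad with zeros

Spectrum_data <- FourierTransform(Fid_data, Fid_info)       # time -> frequency domain
Spectrum_data <- ZeroOrderPhaseCorrection(Spectrum_data)    # pure absorptive real part
Spectrum_data <- InternalReferencing(Spectrum_data, Fid_info)  # align to reference signal
Spectrum_data <- BaselineCorrection(Spectrum_data, lambda.bc = 1e7, p.bc = 0.05)
Spectrum_data <- NegativeValuesZeroing(Spectrum_data)
# Spectrum_data <- Warping(Spectrum_data)                   # optional re-alignment
Spectrum_data <- WindowSelection(Spectrum_data, from.ws = 10, to.ws = 0)  # keep 0-10 ppm
# Spectrum_data <- Bucketing(Spectrum_data, intmeth = "r", mb = 500)      # optional
Spectrum_data <- Normalization(Spectrum_data, type.norm = "mean")
```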
For detailed information on each stage of the analysis and its scripts, refer to the docs folder, where separate README.md files are provided.
Note: NASQQ is an extension of existing solutions, aimed at enhancing the accessibility and efficiency of metabolomic data analysis. The workflow is designed to be system agnostic; however, it was tested on macOS (M1 chip) and Linux (Ubuntu 22.04). To use the pipeline on a Windows system, please refer to WSL.
To begin using the pipeline, it's essential to ensure that certain prerequisites are met and the project is properly set up. Please review the following sections:
- Clone the project's GitHub repository to your local machine:

git clone https://github.com/ardigen/nasqq
Note: Grant appropriate permissions to the workflow directory, e.g.

chmod -R 777 <location>/nasqq
Processes executed in the NASQQ pipeline run in a containerized environment. The project repository includes prebuilt Docker images, r_utils and python_utils, for the execution of all modules, available here. Nextflow manages all dependencies, as links to the appropriate Docker containers are included in base.config. However, if there are issues with the repository, the necessary Dockerfiles, compatible with Linux and macOS (M1 chip) systems, are provided, and the Docker images can be built locally for both the R and Python environments.
For Linux users, execute:
cd nasqq/docker/Python
./build_docker_linux.sh
cd nasqq/docker/R
./build_docker_linux.sh
For macOS (M1) users, execute:
cd nasqq/docker/Python
./build_docker_macos.sh
cd nasqq/docker/R
./build_docker_macos.sh
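After the builds complete, you can confirm that the images are available locally; the image name filter below is an assumption based on the module names r_utils and python_utils.

```bash
# List locally built NASQQ images; the name filter is assumed from the module names.
docker images | grep -E "r_utils|python_utils"
```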
- After setting up the project, create a comma-separated manifest.csv file with the following structure and headers:
dataset,batch,input_path,metadata_file,selected_sample_names,target_value,referencing_range,window_selection_range
test1,test1,./testthat/data/dataset/dataset1,./testthat/data/metadata/metadata1.csv,500;501;503;504,0,None,0;10
test2,test2,./testthat/data/dataset/dataset2,./testthat/data/metadata/metadata2.csv,all,0,None,0;5
test3,None,./testthat/data/dataset/dataset3,./testthat/data/metadata/metadata3.csv,502;505;507;508;509;510,2,2.5;4.55,0;10
- `dataset` - name of the dataset.
- `batch` - batch name (Default: `None`).
- `input_path` - absolute path to the NMR dataset in Bruker format.
- `metadata_file` - absolute path to the metadata file to be merged with the dataset.
- `selected_sample_names` - selection of sample names, ";" separated (Default: `all`).
- `target_value` - PPM value of the signal used as the internal reference for spectra (Default: `0`).
- `referencing_range` - if `target_value` differs from the default, the range in which the referencing signal will be searched (Default: `None`).
- `window_selection_range` - range of the informative part of the spectra, separated by ";" (Default: `0;10`).
As of version 1.0.0 of the NASQQ pipeline, the only supported input files are in Bruker format. These must include, at a minimum, the files `acqu`, `acqus`, and `pulseprogram`, and the `pdata` subfolder. An exemplary dataset folder structure is shown below.
data/dataset/
├── 10
│ ├── acqu
│ ├── acqus
│ ├── fid
│ ├── pdata
│ │ └── 1
│ │ ├── 1i
│ │ ├── 1r
│ │ ├── outd
│ │ ├── proc
│ │ ├── procs
│ │ └── title
│ ├── pulseprogram
│ ├── scon2
│ ├── specpar
│ ├── uxnmr.info
│ └── uxnmr.par
├── 11
│ ├── acqu
│ ├── acqus
│ ├── audita.txt
│ ├── fid
│ ├── pdata
│ │ └── 1
│ │ ├── 1i
│ │ ├── 1r
│ │ ├── outd
│ │ ├── proc
│ │ ├── procs
│ │ └── title
│ ├── pulseprogram
│ ├── scon2
│ ├── specpar
│ ├── uxnmr.info
│ └── uxnmr.par
- If the necessary `metadata_file` referenced in manifest.csv does not yet exist, create it and provide the appropriate paths. The file should contain meta-information about the corresponding `input_path` datasets. Each metadata file must be in CSV format, with three columns: `patient_no`, `batch`, and a column with state information relevant to the data analysis module, such as disease, gender, or checkpoint; this column's name is not hardcoded. Because the pipeline design supports only pairwise classification of samples, having more than two groups will trigger a warning, and the pipeline will terminate during the Data Analysis stage. If there is no batch separation, the `batch` column should be filled with `"None"` values.
Note: The manifest.csv file should include dataset-specific local parameters. Metadata files may contain more records than the specific dataset, but only those records that can be mapped to folders found in `input_path` will be used.
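For illustration only, a minimal metadata file matching the first manifest entry might look like the following; the patient numbers are taken from the manifest example above, while the state column name and its values are hypothetical.

```csv
patient_no,batch,disease
500,test1,healthy
501,test1,disease
503,test1,healthy
504,test1,disease
```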
- The final file that needs to be created is params.yml. This document outlines the required inputs for configuring the data processing pipeline and consists of global parameters for the pipeline and its stages. These parameters are applied run-wise. Ensure that you fill in the necessary values according to the tables below (a-d).
a. Pipeline

| Input | Description | Datatype |
| --- | --- | --- |
| manifest | Absolute path to the manifest.csv file containing metadata information for the analysis | string |
| outDir | Absolute path to the directory where the output files will be stored | string |
| reportsDir | Absolute path to the directory where the analysis reports will be generated | string |
| workDir | Absolute path to the directory where the intermediate work files will be stored | string |
| launchDir | Absolute path to the directory from which the pipeline is launched | string |
| maxRetries | Number of attempts the pipeline should make to process a task before giving up | integer |
| errorStrategy | The strategy to handle errors during pipeline execution (terminate/ignore/retry) | string |

b. Spectral processing

| Input | Description | Datatype |
| --- | --- | --- |
| check_pulse_samples | The pulse program specified in the manifest file for processing | string |
| rm_duplicated_names | Enable/disable removing duplicated sample names | boolean |
| lambda_bc | Baseline correction lambda parameter, controlling the smoothness of the baseline | integer |
| p_bc | Baseline correction parameter, controlling the stickiness of the baseline | float |
| reverse_axis_samples | Specifies whether to reverse the axis for all samples or for selected samples based on a threshold | string |
| run_bucketing | Enable/disable bucketing for simplifying the density of peaks before metabolite quantification | boolean |
| intmeth | Type of bucketing | string |
| mb | Number of buckets | integer |
| run_warping | Enable/disable warping for spectra re-alignment based on a reference spectrum | boolean |
| type_norm | Normalization type | string |
| removal_regions | Spectral regions to be removed | string |
| ncores | Number of threads allocated for the ASICS quantification task | integer |
| quantif_method | Metabolite quantification method | string |

c. Data analysis

| Input | Description | Datatype |
| --- | --- | --- |
| run_combine_project_batches | Enable/disable merging datasets for data analysis where batch is not "None" | boolean |
| run_batch_correction | Enable/disable ComBat batch correction | boolean |
| log1p | Enable/disable log1p normalization of metabolites before data analysis | boolean |
| metadata_column | The column name containing state information for the data analysis module | string |
| zeronan_threshold | Threshold for zero or NaN values in multivariate analysis | float |
| test_size | Test size for splitting data in multivariate analysis | float |
| cross_val_fold | Cross-validation folds for the logistic regression CV model | integer |
| pvalue_shapiro | P-value threshold for normality (Shapiro-Wilk test) | float |

d. Biological interpretation

| Input | Description | Datatype |
| --- | --- | --- |
| top_n | Number of metabolites to include in enrichment for pathway analysis | integer |
| kegg_org_id | KEGG organism ID | string |
Note: The params.yml file consolidates all global parameters required for executing the pipeline with the `-params-file` flag. Alternatively, the pipeline can be executed without creating this file by specifying each parameter as a separate flag, such as `--run_warping True` or `--ncores 3`, in accordance with the Nextflow configuration.
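To illustrate, a minimal params.yml might look like the sketch below. The parameter names come from tables a-d above, while every value and path is a hypothetical placeholder, not a recommended default.

```yaml
# Sketch of a params.yml; all values are illustrative placeholders.
manifest: /home/user/nasqq/manifest.csv
outDir: /home/user/nasqq/results
reportsDir: /home/user/nasqq/reports
workDir: /home/user/nasqq/work
launchDir: /home/user/nasqq
maxRetries: 2
errorStrategy: retry

# Spectral processing
check_pulse_samples: zgesgp   # pulse program name, dataset-specific
rm_duplicated_names: true
lambda_bc: 1000000
p_bc: 0.05
reverse_axis_samples: all
run_bucketing: false
intmeth: r
mb: 500
run_warping: false
type_norm: mean
removal_regions: None
ncores: 3
quantif_method: FWER          # placeholder; see the pipeline docs for valid values

# Data analysis
run_combine_project_batches: false
run_batch_correction: false
log1p: true
metadata_column: disease
zeronan_threshold: 0.4
test_size: 0.2
cross_val_fold: 5
pvalue_shapiro: 0.05

# Biological interpretation
top_n: 20
kegg_org_id: hsa
```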
- After completing every step, open run.sh and adjust the paths for workflow execution, or run it manually using the command:
nextflow run ../main.nf \
-c ../nextflow.config \
-profile standard \
-params-file params.yml
To run the test data, simply go to the tests directory and launch the test run:
./tests/run.sh
Please remember that your local machine must have resources proportional to the number of datasets provided in the manifest (see this thread: nextflow-io/nextflow#1787). A lack of resources can lead to incorrect memory allocation in the script. It is recommended to adjust the max_cpus and max_memory params in the nextflow.config file according to the resources available on your local machine. An example of such a failure:
*** caught segfault ***
address 0x7ff0000000000003, cause 'memory not mapped'
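A minimal sketch of such an override in nextflow.config is shown below; the values are illustrative, and placing the caps inside a params block follows the common nf-core convention rather than anything confirmed about NASQQ's config layout.

```groovy
// Illustrative resource caps in nextflow.config; adjust to your machine.
params {
    max_cpus   = 8
    max_memory = '16.GB'
}
```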
Be aware that Nextflow is not a resource orchestration system. If you need one, you must create a custom executor, such as AWS Batch or Kubernetes.
Note: The computation settings cannot be set lower than the defaults:
- cpus = 2
- memory = 2.GB RAM
NASQQ is distributed under the MIT License. See LICENSE.md for more information.
For contact purposes, there is a dedicated email address: [email protected]
The scripts and workflow were originally created as part of Łukasz Pruss's PhD project, in collaboration between Ardigen S.A. and Wrocław University of Science and Technology (WUST). A special acknowledgment goes to Oskar Gniewek, whose expertise and critical feedback significantly contributed to the Nextflow implementation. He also played a crucial role in managing unit and integration tests, as well as handling dependencies across various systems for pipeline execution.
Furthermore, many people were involved in the evolution of the pipeline, turning it from a concept into an end-to-end solution. These contributors include:
Special thanks for assistance in the development process, code reviews, and tips are extended to:
An extensive list of references and packages used by the pipeline can be found in our publication:
NASQQ: Nextflow automatization and standardization for qualitative and quantitative 1H NMR metabolomics data preparation and analysis.
Łukasz Pruss, Oskar Gniewek, Tomasz Jetka, Wojciech Wojtowicz, Kaja Milanowska-Zabel, Piotr Młynarz.
DOI: --
If you want to utilize NASQQ for your analysis, please refer to LICENSE.md.
To cite the `nf-core` publication, use:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.