From fd8b961309d71e59f791e10f99dccec8ed1391a5 Mon Sep 17 00:00:00 2001
From: jhayer
Date: Wed, 17 May 2023 15:25:56 +0200
Subject: [PATCH 1/5] Statement of need paper.md (fix #16 and #17)

I have rephrased some parts in the paragraph Statement of need. It is easier to read and understand now.
---
 paper/paper.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/paper/paper.md b/paper/paper.md
index e52597a..374e33d 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -61,6 +61,22 @@ conduct reproducible analyses.
 
 # Statement of need
 
+High Throughput Sequencing technologies produce a significant amount of data and
+researchers are producing genomics data all over the world on a daily basis.
+These technologies are notably used for studying bacterial genomes in order to understand
+the spread of bacterial pathogens and their resistance to antibiotics. In the bacterial genomics field,
+it is possible to sequence the DNA from multiple bacterial strains at the same time.
+The analysis of these sequencing data requires the use of a wide range of bioinformatics programs
+to be able to identify the genes and their functions, and among those, the genes and
+mutations conferring resistance to antimicrobial drugs. In order to make the results of
+these analyses comparable, it is crucial to standardize, automate and parallelize all the steps.
+The *baargin* workflow allows the user to perform a complete *in silico* analysis of bacterial genomes,
+from the quality control of the raw data, to the detection of AMR genes and mutations, on multiple datasets
+of the same bacterial species in parallel. It compiles and summarize the results from all the analysis steps,
+allowing comparative studies. As a last step, *baargin* performs a pangenome analysis of all the strains provided,
+producing the basis for the construction of a phylogenetic tree. The use of Nextflow and containers ensures
+the reproducibility of the data analysis.
+
 The Hight Throughput Sequencing technologies produce a significant amount of data,
 and the DNA from multiple bacterial strains can be sequenced at the same time on a same
 sequencing run. Moreover, researchers are producing genomics data all over the world on a

From 019200f098ed8199c9c730f939da410ea24e1b84 Mon Sep 17 00:00:00 2001
From: jhayer
Date: Fri, 26 May 2023 11:44:34 +0200
Subject: [PATCH 2/5] remove text left from previous version

---
 paper/paper.md | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/paper/paper.md b/paper/paper.md
index 374e33d..fd6edc4 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -77,23 +77,6 @@ allowing comparative studies. As a last step, *baargin* performs a pangenome ana
 producing the basis for the construction of a phylogenetic tree. The use of Nextflow and containers ensures
 the reproducibility of the data analysis.
 
-The Hight Throughput Sequencing technologies produce a significant amount of data,
-and the DNA from multiple bacterial strains can be sequenced at the same time on a same
-sequencing run. Moreover, researchers are producing genomics data all over the world on a
-daily basis, notably to better understand the spread of bacterial pathogens and
-their resistance to antibiotics. The analysis of this data requires the use of a
-wide range of bioinformatics programs to be able to identify the genomic structure,
-the genes and their functions, and among those, the genes and mutations conferring
-resistance to antimicrobial drugs. In order to make the results of these analyses
-comparable, it is crucial to standardize, automate and parallelize all the steps to ensure the
-reproducibility of the data analysis. The workflow that we have developed allows
-the user to perform a complete *in silico* analysis of a bacterial genome, from
-the quality control of the raw data, to the detection of AMR genes and mutations,
-on multiple datasets of bacterial strains of the same species in parallel.
-It compiles and summarize the results from all the analysis steps, allowing comparative
-studies, and it also performs a pangenome analysis of all the strains provided,
-providing the basis for the construction of a phylogenetic tree.
-
 
 # Materials and Methods
 

From ac85347c9dd3279ab33115647fa764463ec33ed4 Mon Sep 17 00:00:00 2001
From: jhayer
Date: Thu, 8 Jun 2023 16:26:18 +0200
Subject: [PATCH 3/5] paper.bib add reference (fix #29 and #30)

---
 paper/paper.bib | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/paper/paper.bib b/paper/paper.bib
index cbaba20..e0ef6b2 100644
--- a/paper/paper.bib
+++ b/paper/paper.bib
@@ -29,6 +29,17 @@ @article{DITommaso:2017
 year = {2017}
 }
 
+@article{Petit:2020,
+abstract = {Bactopia: a flexible pipeline for complete analysis of bacterial genomes.},
+author = {Petit III, R. A., & Read, T. D.},
+doi = {10.1128/mSystems.00190-20},
+issn = {2379-5077},
+journal = {Msystems},
+publisher = {American Society for Microbiology},
+title = {{Bactopia: a flexible pipeline for complete analysis of bacterial genomes}},
+year = {2020}
+}
+
 @article{Chen:2018,
 abstract = {Motivation Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2-5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp.},
 author = {Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia},

From dd2e39fac8a4214a12fbe4432ccf4fd1d836e583 Mon Sep 17 00:00:00 2001
From: jhayer
Date: Thu, 8 Jun 2023 16:34:10 +0200
Subject: [PATCH 4/5] state of the field (fix #29 and #30)

Added a section on other existing workflows
---
 paper/paper.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/paper/paper.md b/paper/paper.md
index fd6edc4..416d06d 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -76,6 +76,12 @@ of the same bacterial species in parallel. It compiles and summarize the results
 allowing comparative studies. As a last step, *baargin* performs a pangenome analysis of all the strains provided,
 producing the basis for the construction of a phylogenetic tree. The use of Nextflow and containers ensures
 the reproducibility of the data analysis.
+Only few bacterial genomics workflows are available, like Bactopia [@Petit:2020], which is highly
+flexible and complete in term of tools available. Therefore, we needed a lighter workflow,
+with only a few tools and databases installed for our collaborators that have only limited computing resources and storage.
+Also, *baargin* is specifically designed for detecting AMR genes and plasmid features, and include a
+decontamination step of the assembly, allowing the downstream analyses to be performed especially
+on the contigs belonging to the targeted species.
 
 
 # Materials and Methods

From 7d21baca0350b5cc61ff8872708c82f61621181f Mon Sep 17 00:00:00 2001
From: jhayer
Date: Thu, 8 Jun 2023 16:39:18 +0200
Subject: [PATCH 5/5] Fix typos

---
 paper/paper.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/paper/paper.md b/paper/paper.md
index 416d06d..f9e27b9 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -2,7 +2,7 @@
 title: 'Baargin: a Nextflow workflow for the automatic analysis of bacterial genomics
 data with a focus on Antimicrobial Resistance'
 tags:
-  - NextFlow
+  - Nextflow
   - Whole Genome Shotgun
   - Genomics
   - Long reads sequencing technology
@@ -49,12 +49,12 @@ at a time. As a counterpart, these experiments produce large amount of data that
 needs to be analysed by various bioinformatics methods and tools for reconstructing
 the genomes and therefore identify their specific features and the genetic determinants
 of the AMR. For automating the bioinformatics analysis of multiple
-strains, we have developed a NextFlow [@DITommaso:2017] workflow called *baargin*
-(Bacterial Assembly and Antimicrobial Resistance Genes detection In NextFlow)
+strains, we have developed a Nextflow [@DITommaso:2017] workflow called *baargin*
+(Bacterial Assembly and Antimicrobial Resistance Genes detection In Nextflow)
 [https://github.com/jhayer/baargin](https://github.com/jhayer/baargin). It enables to
 conduct sequencing reads quality control, genome assembly and annotation,
 Multi-Locus Sequence Typing and plasmid identification, as well as antimicrobial
-resistance determinants detection, and pangenome analysis. The use of NextFlow,
+resistance determinants detection, and pangenome analysis. The use of Nextflow,
 a workflow management system, makes our workflow portable, flexible, and able to
 conduct reproducible analyses.
 
@@ -108,18 +108,18 @@ Fastp [@Chen:2018].
 3. De novo assembly is run using SPAdes [@Prjibelski:2020] if only short reads
 were provided, and with Unicycler [@Wick:2017] for a hybrid assembly when short
 and long reads are provided.
-4. The contigs are taxonomically assigned using Kraken2 [@Wood:2019] and the contigs classified at
+4. Taxonomic assignment of the contigs is performed using Kraken2 [@Wood:2019] and the contigs classified at
 the taxonomic level provided by the user (with the taxid, and including the children taxa)
 are retrieved and therefore named as *"deconta"* for decontaminated contigs [@Lu:2022].
 The dataset containing all the contigs are named as *"raw"*. From here all the steps
 except the annotation (8) will be performed on both sets of contigs *"raw"* and *"deconta"*.
-5. A quality check of the assembly is achieved by Quast [@Gurevich:2013] and BUSCO [@Manni:2021].
+5. A quality check of the assembly is conducted using Quast [@Gurevich:2013] and BUSCO [@Manni:2021].
 For BUSCO, the users have the possibility to specify the taxonomic lineage
 database to use for searching the housekeeping genes (at the class level
 of the strain to analyse for example: *enterobacterales_odb10*)
 6. The contigs (*raw* and *deconta*) are then screened to identify the sequence type
 of the strain using MLST tool (Multi-Locus Sequence Typing) [@Seemann:2022].
-7. The contigs are then submitted to a plasmid identification with PlasmidFinder [@Carattoli:2014]
+7. The contigs are subsequently submitted to plasmid identification using PlasmidFinder [@Carattoli:2014]
 and additionally with Platon if the user provides a database for it [@Schwengers:2020].
 8. Antimicrobial Resistance Genes (ARGs) are then searched in the contigs using both
 CARD RGI [@Alcock:2023] and the NCBI AMRFinderPlus [@Feldgarden:2021]. For certain species only, AMRFinderPlus
@@ -151,7 +151,7 @@ We presented here an easy-to-use workflow for Bacterial Assembly and Antimicrobi
 Resistance Genes detection In Nextflow: *baargin*. It allows the users to analyse genomic
 datasets from short and long sequencing reads, of several bacterial strains from the
 same species in one command line. The workflow will automatically assemble the
-genomes, check for contamination and specifically extract the sequences that belong
+genomes, check for contamination and specifically extract the sequences that belong
 to the expected taxon. It will then identify their sequence type and screen the
 assemblies for plasmids sequences and ARGs. The fact that *baargin* is implemented
 in Nextflow and is based on containers makes the analyses reproducible. Its modular design makes it