Merge pull request #17 from jhayer/software-paper_quality-of-writing

Statement of need paper.md (fix #16, #18, #29 and #30)
jhayer · Jul 13, 2023 · a867de1
2 parents 17ef3b2 + 7d21bac

Showing 2 changed files with 40 additions and 24 deletions.
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -29,6 +29,17 @@ @article{DITommaso:2017
 year = {2017}
 }
 
+@article{Petit:2020,
+abstract = {Bactopia: a flexible pipeline for complete analysis of bacterial genomes.},
+author = {Petit, III, R. A. and Read, T. D.},
+doi = {10.1128/mSystems.00190-20},
+issn = {2379-5077},
+journal = {mSystems},
+publisher = {American Society for Microbiology},
+title = {{Bactopia: a flexible pipeline for complete analysis of bacterial genomes}},
+year = {2020}
+}
+
 @article{Chen:2018,
 abstract = {Motivation Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2-5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp.},
 author = {Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia},

diff --git a/paper/paper.md b/paper/paper.md
@@ -2,7 +2,7 @@
 title: 'Baargin: a Nextflow workflow for the automatic analysis of bacterial
 genomics data with a focus on Antimicrobial Resistance'
 tags:
-  - NextFlow
+  - Nextflow
   - Whole Genome Shotgun
   - Genomics
   - Long reads sequencing technology
@@ -49,34 +49,39 @@ at a time. As a counterpart, these experiments produce large amount of data that
 needs to be analysed by various bioinformatics methods and tools for reconstructing
 the genomes and therefore identify their specific features and the genetic
 determinants of the AMR. For automating the bioinformatics analysis of multiple
-strains, we have developed a NextFlow [@DITommaso:2017] workflow called *baargin*
-(Bacterial Assembly and Antimicrobial Resistance Genes detection In NextFlow)
+strains, we have developed a Nextflow [@DITommaso:2017] workflow called *baargin*
+(Bacterial Assembly and Antimicrobial Resistance Genes detection In Nextflow)
 [https://github.com/jhayer/baargin](https://github.com/jhayer/baargin).
 It enables sequencing read quality control, genome assembly and annotation,
 Multi-Locus Sequence Typing and plasmid identification, as well as antimicrobial
-resistance determinants detection, and pangenome analysis. The use of NextFlow,
+resistance determinants detection, and pangenome analysis. The use of Nextflow,
 a workflow management system, makes our workflow portable, flexible, and able to
 conduct reproducible analyses.
 
 
 # Statement of need
 
-The Hight Throughput Sequencing technologies produce a significant amount of data,
-and the DNA from multiple bacterial strains can be sequenced at the same time on a same
-sequencing run. Moreover, researchers are producing genomics data all over the world on a
-daily basis, notably to better understand the spread of bacterial pathogens and
-their resistance to antibiotics. The analysis of this data requires the use of a
-wide range of bioinformatics programs to be able to identify the genomic structure,
-the genes and their functions, and among those, the genes and mutations conferring
-resistance to antimicrobial drugs. In order to make the results of these analyses
-comparable, it is crucial to standardize, automate and parallelize all the steps to ensure the
-reproducibility of the data analysis. The workflow that we have developed allows
-the user to perform a complete *in silico* analysis of a bacterial genome, from
-the quality control of the raw data, to the detection of AMR genes and mutations,
-on multiple datasets of bacterial strains of the same species in parallel.
-It compiles and summarize the results from all the analysis steps, allowing comparative
-studies, and it also performs a pangenome analysis of all the strains provided,
-providing the basis for the construction of a phylogenetic tree.
+High Throughput Sequencing technologies produce a significant amount of data and 
+researchers are producing genomics data all over the world on a daily basis. 
+These technologies are notably used for studying bacterial genomes in order to understand 
+the spread of bacterial pathogens and their resistance to antibiotics. In the bacterial genomics field, 
+it is possible to sequence the DNA from multiple bacterial strains at the same time. 
+The analysis of these sequencing data requires the use of a wide range of bioinformatics programs 
+to be able to identify the genes and their functions, and among those, the genes and 
+mutations conferring resistance to antimicrobial drugs. In order to make the results of 
+these analyses comparable, it is crucial to standardize, automate and parallelize all the steps. 
+The *baargin* workflow allows the user to perform a complete *in silico* analysis of bacterial genomes, 
+from the quality control of the raw data, to the detection of AMR genes and mutations, on multiple datasets 
+of the same bacterial species in parallel. It compiles and summarizes the results from all the analysis steps,
+allowing comparative studies. As a last step, *baargin* performs a pangenome analysis of all the strains provided, 
+producing the basis for the construction of a phylogenetic tree. The use of Nextflow and containers ensures 
+the reproducibility of the data analysis.
+Only a few bacterial genomics workflows are available, such as Bactopia [@Petit:2020], which is highly
+flexible and complete in terms of tools available. However, we needed a lighter workflow,
+with only a few tools and databases installed, for our collaborators who have limited computing resources and storage.
+In addition, *baargin* is specifically designed for detecting AMR genes and plasmid features, and includes a
+decontamination step of the assembly, allowing the downstream analyses to be performed specifically
+on the contigs belonging to the targeted species.
 
 
 # Materials and Methods
@@ -103,18 +108,18 @@ Fastp [@Chen:2018].
 3. De novo assembly is run using SPAdes [@Prjibelski:2020] if only short reads were
 provided, and with Unicycler [@Wick:2017] for a hybrid assembly when short and
 long reads are provided.
-4. The contigs are taxonomically assigned using Kraken2 [@Wood:2019] and the contigs classified at
+4. Taxonomic assignment of the contigs is performed using Kraken2 [@Wood:2019] and the contigs classified at
 the taxonomic level provided by the user (with the taxid, and including the children taxa)
 are retrieved and therefore named as *"deconta"* for decontaminated contigs [@Lu:2022]. The dataset
 containing all the contigs is named *"raw"*. From here all the steps except the
 annotation (8) will be performed on both sets of contigs *"raw"* and *"deconta"*.
-5. A quality check of the assembly is achieved by Quast [@Gurevich:2013] and BUSCO [@Manni:2021].
+5. A quality check of the assembly is conducted using Quast [@Gurevich:2013] and BUSCO [@Manni:2021].
 For BUSCO, the users have the possibility to specify the taxonomic lineage database
 to use for searching the housekeeping genes (at the class level of the strain to
 analyse for example: *enterobacterales_odb10*)
 6. The contigs (*raw* and *deconta*) are then screened to identify the sequence type of
 the strain using the MLST tool (Multi-Locus Sequence Typing) [@Seemann:2022].
-7. The contigs are then submitted to a plasmid identification with PlasmidFinder [@Carattoli:2014]
+7. The contigs are subsequently submitted to plasmid identification using PlasmidFinder [@Carattoli:2014]
 and additionally with Platon if the user provides a database for it [@Schwengers:2020].
 8. Antimicrobial Resistance Genes (ARGs) are then searched in the contigs using both CARD RGI [@Alcock:2023]
 and the NCBI AMRFinderPlus [@Feldgarden:2021]. For certain species only, AMRFinderPlus
@@ -146,7 +151,7 @@ We presented here an easy-to-use workflow for Bacterial Assembly and Antimicrobi
 Resistance Genes detection In Nextflow: *baargin*. It allows the users to analyse
 genomic datasets from short and long sequencing reads, of several bacterial strains from
 the same species in one command line. The workflow will automatically assemble the
-genomes, check for contamination and specifically extract the sequences that belong
+genomes, check for contamination and specifically extract the sequences that belong to
 the expected taxon. It will then identify their sequence type and screen the assemblies
 for plasmid sequences and ARGs. The fact that *baargin* is implemented in Nextflow and
 is based on containers makes the analyses reproducible. Its modular design makes it