Commit

Merge pull request #17 from jhayer/software-paper_quality-of-writing
Statement of need paper.md (fix #16, #18, #29 and #30)
Juke34 authored Jul 13, 2023
2 parents 17ef3b2 + 7d21bac commit a867de1
Showing 2 changed files with 40 additions and 24 deletions.
11 changes: 11 additions & 0 deletions paper/paper.bib
@@ -29,6 +29,17 @@ @article{DITommaso:2017
year = {2017}
}

+ @article{Petit:2020,
+ abstract = {Bactopia: a flexible pipeline for complete analysis of bacterial genomes.},
+ author = {Petit III, R. A., & Read, T. D.},
+ doi = {10.1128/mSystems.00190-20},
+ issn = {2379-5077},
+ journal = {Msystems},
+ publisher = {American Society for Microbiology},
+ title = {{Bactopia: a flexible pipeline for complete analysis of bacterial genomes}},
+ year = {2020}
+ }
+
@article{Chen:2018,
abstract = {Motivation Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2-5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp.},
author = {Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia},
53 changes: 29 additions & 24 deletions paper/paper.md
@@ -2,7 +2,7 @@
title: 'Baargin: a Nextflow workflow for the automatic analysis of bacterial
genomics data with a focus on Antimicrobial Resistance'
tags:
- - NextFlow
+ - Nextflow
  - Whole Genome Shotgun
  - Genomics
  - Long reads sequencing technology
@@ -49,34 +49,39 @@ at a time. As a counterpart, these experiments produce large amount of data that
needs to be analysed by various bioinformatics methods and tools for reconstructing
the genomes and therefore identify their specific features and the genetic
determinants of the AMR. For automating the bioinformatics analysis of multiple
- strains, we have developed a NextFlow [@DITommaso:2017] workflow called *baargin*
- (Bacterial Assembly and Antimicrobial Resistance Genes detection In NextFlow)
+ strains, we have developed a Nextflow [@DITommaso:2017] workflow called *baargin*
+ (Bacterial Assembly and Antimicrobial Resistance Genes detection In Nextflow)
[https://github.com/jhayer/baargin](https://github.com/jhayer/baargin).
It enables to conduct sequencing reads quality control, genome assembly and annotation,
Multi-Locus Sequence Typing and plasmid identification, as well as antimicrobial
- resistance determinants detection, and pangenome analysis. The use of NextFlow,
+ resistance determinants detection, and pangenome analysis. The use of Nextflow,
a workflow management system, makes our workflow portable, flexible, and able to
conduct reproducible analyses.


# Statement of need

- The Hight Throughput Sequencing technologies produce a significant amount of data,
- and the DNA from multiple bacterial strains can be sequenced at the same time on a same
- sequencing run. Moreover, researchers are producing genomics data all over the world on a
- daily basis, notably to better understand the spread of bacterial pathogens and
- their resistance to antibiotics. The analysis of this data requires the use of a
- wide range of bioinformatics programs to be able to identify the genomic structure,
- the genes and their functions, and among those, the genes and mutations conferring
- resistance to antimicrobial drugs. In order to make the results of these analyses
- comparable, it is crucial to standardize, automate and parallelize all the steps to ensure the
- reproducibility of the data analysis. The workflow that we have developed allows
- the user to perform a complete *in silico* analysis of a bacterial genome, from
- the quality control of the raw data, to the detection of AMR genes and mutations,
- on multiple datasets of bacterial strains of the same species in parallel.
- It compiles and summarize the results from all the analysis steps, allowing comparative
- studies, and it also performs a pangenome analysis of all the strains provided,
- providing the basis for the construction of a phylogenetic tree.
+ High Throughput Sequencing technologies produce a significant amount of data and
+ researchers are producing genomics data all over the world on a daily basis.
+ These technologies are notably used for studying bacterial genomes in order to understand
+ the spread of bacterial pathogens and their resistance to antibiotics. In the bacterial genomics field,
+ it is possible to sequence the DNA from multiple bacterial strains at the same time.
+ The analysis of these sequencing data requires the use of a wide range of bioinformatics programs
+ to be able to identify the genes and their functions, and among those, the genes and
+ mutations conferring resistance to antimicrobial drugs. In order to make the results of
+ these analyses comparable, it is crucial to standardize, automate and parallelize all the steps.
+ The *baargin* workflow allows the user to perform a complete *in silico* analysis of bacterial genomes,
+ from the quality control of the raw data, to the detection of AMR genes and mutations, on multiple datasets
+ of the same bacterial species in parallel. It compiles and summarize the results from all the analysis steps,
+ allowing comparative studies. As a last step, *baargin* performs a pangenome analysis of all the strains provided,
+ producing the basis for the construction of a phylogenetic tree. The use of Nextflow and containers ensures
+ the reproducibility of the data analysis.
+ Only few bacterial genomics workflows are available, like Bactopia [@Petit:2020], which is highly
+ flexible and complete in term of tools available. Therefore, we needed a lighter workflow,
+ with only a few tools and databases installed for our collaborators that have only limited computing resources and storage.
+ Also, *baargin* is specifically designed for detecting AMR genes and plasmid features, and include a
+ decontamination step of the assembly, allowing the downstream analyses to be performed especially
+ on the contigs belonging to the targeted species.


# Materials and Methods
@@ -103,18 +108,18 @@ Fastp [@Chen:2018].
3. De novo assembly is run using SPAdes [@Prjibelski:2020] if only short reads were
provided, and with Unicycler [@Wick:2017] for a hybrid assembly when short and
long reads are provided.
- 4. The contigs are taxonomically assigned using Kraken2 [@Wood:2019] and the contigs classified at
+ 4. Taxonomic assignment of the contigs is performed using Kraken2 [@Wood:2019] and the contigs classified at
the taxonomic level provided by the user (with the taxid, and including the children taxa)
are retrieved and therefore named as *"deconta"* for decontaminated contigs [@Lu:2022]. The dataset
containing all the contigs are named as *"raw"*. From here all the steps except the
annotation (8) will be performed on both sets of contigs *"raw"* and *"deconta"*.
- 5. A quality check of the assembly is achieved by Quast [@Gurevich:2013] and BUSCO [@Manni:2021].
+ 5. A quality check of the assembly is conducted using Quast [@Gurevich:2013] and BUSCO [@Manni:2021].
For BUSCO, the users have the possibility to specify the taxonomic lineage database
to use for searching the housekeeping genes (at the class level of the strain to
analyse for example: *enterobacterales_odb10*)
6. The contigs (*raw* and *deconta*) are then screened to identify the sequence type of
the strain using MLST tool (Multi-Locus Sequence Typing) [@Seemann:2022].
- 7. The contigs are then submitted to a plasmid identification with PlasmidFinder [@Carattoli:2014]
+ 7. The contigs are subsequently submitted to plasmid identification using PlasmidFinder [@Carattoli:2014]
and additionally with Platon if the user provides a database for it [@Schwengers:2020].
8. Antimicrobial Resistance Genes (ARGs) are then searched in the contigs using both CARD RGI [@Alcock:2023]
and the NCBI AMRFinderPlus [@Feldgarden:2021]. For certain species only, AMRFinderPlus
@@ -146,7 +151,7 @@ We presented here an easy-to-use workflow for Bacterial Assembly and Antimicrobial
Resistance Genes detection In Nextflow: *baargin*. It allows the users to analyse
genomic datasets from short and long sequencing reads, of several bacterial strains from
the same species in one command line. The workflow will automatically assemble the
- genomes, check for contamination and specifically extract the sequences that belong
+ genomes, check for contamination and specifically extract the sequences that belong to
the expected taxon. It will then identify their sequence type and screen the assemblies
for plasmids sequences and ARGs. The fact that *baargin* is implemented in Nextflow and
is based on containers makes the analyses reproducible. Its modular design makes it