
A fast, easy solution for metagenomic data analysis. Forked from Sander Vermeulen's repository, developed during his internship.





A fast, easy solution for metagenomic data analysis. An outline of the program is given below:


A report of FastDeme containing some benchmarks can be found here: Sander Vermeulen internship report


The program can be downloaded as an archive or with the following git command:

git clone

Next, the databases should be downloaded. This can be done by running the download script located in the db/ directory, which downloads and unzips the four databases that the program needs. Please make sure enough space is available on the drive: the combined download size is ~116 GB, and the databases take up ~165 GB in total.
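Because the download is large, it can help to check the free disk space before starting. The following is a minimal sketch, not part of the program: the function name `has_room_for_databases` is hypothetical, the sizes are the approximate figures quoted above, and gigabytes are assumed to be decimal (10^9 bytes).

```python
import shutil

# Approximate sizes quoted in this README, in GB (decimal gigabytes assumed).
DOWNLOAD_GB = 116
UNPACKED_GB = 165


def has_room_for_databases(path=".", required_gb=UNPACKED_GB):
    """Return True if the drive holding `path` has at least `required_gb` GB free.

    `path` must exist; shutil.disk_usage reports on the filesystem it lives on.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    # Assumption: archives and unpacked files may coexist while unzipping,
    # so the conservative requirement is the sum of both figures.
    worst_case_gb = DOWNLOAD_GB + UNPACKED_GB
    if has_room_for_databases(".", worst_case_gb):
        print("Enough free space for the databases.")
    else:
        print(f"Less than {worst_case_gb} GB free; the download may not fit.")
```

Whether the archives are deleted after unzipping is not stated here, so the worst-case figure in the example is deliberately conservative.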

Other requirements include NumPy:

pip3 install numpy

After downloading, the program can be invoked with


The program has two mandatory arguments, --inp and --output. To use the basic version of the program, the following command can be run:

./ --inp file.fastq.gz --output /path/to/output/folder/ --trimming

This will only result in the file getting trimmed.

To trim the input files, screen them for host contamination, perform taxonomic identification with Kaiju and analyse the resistome with GROOT, the following command can be used:

./ --pe --inp file_R1.fastq.gz file_R2.fastq.gz --output /path/to/output/folder/ --trimming --screening --kaiju --groot

The flag --pe is needed when using paired-end files. The flags --kaiju, --groot, --kraken, --trimming, --kma and --screening turn on the respective modules.

By default, the program uses all available CPU cores, up to a maximum of 16. To limit or increase the number of CPU cores used, use --threads. Note that trimming will not use more than 16 cores, even when more are specified.
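The thread rules described above can be sketched as follows. This is an illustration only, not the program's actual code: the function `resolve_threads` and the constant names are hypothetical, and the default is assumed to be the number of available cores capped at 16, as the --threads help text states.

```python
import os

MAX_DEFAULT_THREADS = 16   # default cap stated in this README
MAX_TRIMMING_THREADS = 16  # trimming never uses more cores than this


def resolve_threads(requested=None, available=None):
    """Return (general_threads, trimming_threads) for a run.

    requested: value passed to --threads, or None if the flag is absent.
    available: number of cores on the machine; detected when None.
    """
    if available is None:
        available = os.cpu_count() or 1
    if requested is None:
        # No --threads given: use every available core, up to 16.
        general = min(available, MAX_DEFAULT_THREADS)
    else:
        # Explicit request: honour it, but never go below one thread.
        general = max(1, requested)
    # Trimming is capped separately, even when more threads are requested.
    trimming = min(general, MAX_TRIMMING_THREADS)
    return general, trimming
```

For example, `resolve_threads(32, available=64)` gives 32 threads to the other modules while trimming stays at 16.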

Database information


The GROOT database consists of a mixture of the ResFinder, ARG-ANNOT and CARD databases. See the GROOT documentation for more details.


The KMA database consists of the ResFinder database.


The Kaiju database was made with assembled and annotated bacterial reference genomes from the NCBI RefSeq database.


The Kraken2 database was made with the complete bacterial reference genomes from the NCBI RefSeq database.


The Mash database was made from the complete vertebrate_mammalian and vertebrate_other databases from NCBI RefSeq. Since the bloom filters BioBloomCategorizer uses for filtering the host reads are quite large, only filters for common host species are included in the standard database to reduce download size.

Species in standard database  GCF ID
Bos indicus                   GCF_000247795.1
Bos mutus                     GCF_000298355.1
Bos taurus                    GCF_002263795.1
Canis lupus familiaris        GCF_000002285.3
Capra hircus                  GCF_001704415.1
Chinchilla lanigera           GCF_000276665.1
Equus caballus                GCF_002863925.1
Felis catus                   GCF_000181335.3
Gorilla gorilla gorilla       GCF_000151905.2
Homo sapiens                  GCF_000001405.38
Mus musculus                  GCF_000001635.26
Ovis aries                    GCF_000298735.2
Ovis aries musimon            GCF_000765115.1
Pan troglodytes               GCF_002880755.1
Rattus norvegicus             GCF_000001895.5, GCF_000002265.2
Sus scrofa                    GCF_000003025.6
Danio rerio                   GCF_000002035.6
Gallus gallus                 GCF_000002315.5
Meleagris gallopavo           GCF_000146605.2

In case contamination is detected and the host is not in the standard database, the corresponding bloom filter will be downloaded automatically.


The output depends on which modules are used for analysis. The following files are expected as output for each module:


QC report


For paired end:



For single end:



For paired end:





For single end:



These are the (trimmed) FASTQ files with host DNA removed (noMatch) and the files holding the removed host reads (GCF).


Output for Kaiju depends on which taxonomic rank is selected.


These files contain the names and abundance of the selected taxonomic ranks in the samples.


Output for Kraken2 depends on which taxonomic rank is selected.


Contains output of Bracken, sorted by taxonomic rank.


Contains output of Kraken2.


Contains all output of Bracken.



Contains information about the found antibiotic resistance genes.


Folder with .gfa files of the found antibiotic resistance genes.



Contains alignments of resistance genes against input.


Contains sequences of found resistance genes in FASTA format.


Contains information about the found antibiotic resistance genes.


usage: --inp file.fastq.gz --output /path/to/output/folder/ [OPTIONS]

  -h, --help           show this help message and exit  
  --pe                 specify paired-end data, default is single end  
  --inp INP [INP ...]  input files in fastq.gz format, if paired-end input  
                       both files with a space between them                       
  --threads THREADS    specify number of threads to be used, default is max
                       available threads up to 16 threads                       
  --kaiju              use kaiju for taxonomic identification  
  --kraken             use kraken2 for taxonomic identification  
  --groot              use groot for resistome analysis
  --kma                use kma for resistome analysis
  --tax_rank TAX_RANK  set taxonomic rank for output. choose one: phylum,
                       class, order, family, genus, species, default is all
  --prefix PREFIX      prefix for all output files, default is name of input
  --trimming           turn on trimming with fastp  
  --screening          turn on host contamination screening with mash and
  --output OUTPUT      set output directory
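The option list above reads like the help output of Python's argparse module. As a rough sketch under that assumption, an equivalent interface could be declared as shown below. This is not the program's source: `build_parser` and `check_inputs` are hypothetical names, the help strings are paraphrased, and the rule that --pe requires exactly two input files is an assumption drawn from the --inp description.

```python
import argparse

TAX_RANKS = ["phylum", "class", "order", "family", "genus", "species"]
MODULE_FLAGS = ["kaiju", "kraken", "groot", "kma", "trimming", "screening"]


def build_parser():
    """Build a parser mirroring the options listed in the usage text."""
    parser = argparse.ArgumentParser(
        description="Sketch of the command-line interface listed above."
    )
    parser.add_argument("--pe", action="store_true",
                        help="paired-end data (default is single end)")
    parser.add_argument("--inp", nargs="+", required=True,
                        help="input file(s) in fastq.gz format")
    parser.add_argument("--threads", type=int, default=None,
                        help="number of threads (default: available, up to 16)")
    for flag in MODULE_FLAGS:
        # Each module is off unless its flag is given.
        parser.add_argument("--" + flag, action="store_true",
                            help="turn on the " + flag + " module")
    parser.add_argument("--tax_rank", choices=TAX_RANKS, default=None,
                        help="taxonomic rank for output (default: all ranks)")
    parser.add_argument("--prefix", default=None,
                        help="prefix for output files (default: input name)")
    parser.add_argument("--output", required=True, help="output directory")
    return parser


def check_inputs(args):
    """Assumed rule: two files with --pe, one file otherwise."""
    expected = 2 if args.pe else 1
    if len(args.inp) != expected:
        raise ValueError(
            "expected %d input file(s), got %d" % (expected, len(args.inp))
        )
    return args
```

Calling `build_parser().parse_args(["--pe", "--inp", "file_R1.fastq.gz", "file_R2.fastq.gz", "--output", "out/", "--trimming"])` yields a namespace with `pe` and `trimming` set to True and the remaining module flags set to False.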




Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.



Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132.


BioBloom Tools

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters. Justin Chu, Sara Sadeghi, Anthony Raymond, Shaun D. Jackman, Ka Ming Nip, Richard Mar, Hamid Mohamadi, Yaron S. Butterfield, A. Gordon Robertson, Inanç Birol. Bioinformatics 2014; 30 (23): 3402-3404.



Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.



Lu J, Breitwieser FP, Thielen P, Salzberg SL. 2017. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3:e104.



Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257.



Will P M Rowe, Martyn D Winn; Indexed variation graphs for efficient and accurate resistome profiling, Bioinformatics, Volume 34, Issue 21, 1 November 2018, Pages 3601–3608.



Philip T.L.C. Clausen, Frank M. Aarestrup & Ole Lund, "Rapid and precise alignment of raw reads against redundant databases with KMA", BMC Bioinformatics, 2018;19:307.








