Read contamination removal.
Make the executable src/readItAndKeep by running:
cd src && make
From an existing environment:
conda install -c bioconda read-it-and-keep
Using a new environment (recommended):
conda create -n read-it-and-keep -c bioconda python=3 read-it-and-keep
conda activate read-it-and-keep
Get a Docker container of the latest release:
docker pull ghcr.io/globalpathogenanalysisservice/read-it-and-keep:latest
Alternatively, build a Docker container by cloning this repository and running:
docker build -f Dockerfile -t <TAG> .
Releases include a Singularity image to download, called readItAndKeep_vX.Y.Z.img, where X.Y.Z is the release version.
Alternatively, build a Singularity container by cloning this repository and running:
sudo singularity build readItAndKeep.sif Singularity.def
If you are using readItAndKeep for SARS-CoV-2 reads, we recommend using the reference genome MN908947.3 with the poly-A tail removed. This is explained in the publication https://doi.org/10.1093/bioinformatics/btac311. A FASTA file of MN908947.3 without the poly-A tail is available in this repository: tests/MN908947.3.no_poly_A.fa.
ReadItAndKeep works by keeping the reads that match the provided target genome.
To run on paired Illumina reads in two files, reads1.fq.gz and reads2.fq.gz, keeping only reads that match the genome in ref_genome.fasta:
readItAndKeep --ref_fasta ref_genome.fasta --reads1 reads1.fq.gz --reads2 reads2.fq.gz -o out
This will output out.reads_1.fastq.gz and out.reads_2.fastq.gz.
To run on one file of nanopore reads, reads.fq.gz:
readItAndKeep --tech ont --ref_fasta ref_genome.fasta --reads1 reads.fq.gz -o out
This will output out.reads.fastq.gz.
If the input reads files are in FASTA format, then the output reads are also written in FASTA format, and the output files are named *.fasta.* instead of *.fastq.*.
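The naming rules above can be sketched as a small Python helper, which may be handy when a pipeline needs to know the output paths in advance. The function name expected_outputs and its parameters are hypothetical, and the .gz suffix is an assumption taken from the examples above (the README only says *.fasta.* for FASTA input).

```python
def expected_outputs(outprefix, paired, fasta=False):
    """Return the output file names expected for a given output prefix.

    Hypothetical helper, not part of readItAndKeep itself.
    """
    # "fastq" for FASTQ input, "fasta" for FASTA input, as described above.
    fmt = "fasta" if fasta else "fastq"
    # Assumption: outputs are gzipped, as in the README examples.
    if paired:
        return [
            f"{outprefix}.reads_1.{fmt}.gz",
            f"{outprefix}.reads_2.{fmt}.gz",
        ]
    return [f"{outprefix}.reads.{fmt}.gz"]
```

For example, expected_outputs("out", paired=True) gives the two file names listed for the paired Illumina run above.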
readItAndKeep always writes the counts of input and output reads to STDOUT in tab-delimited format, for example:
Input reads file 1 1000
Input reads file 2 1000
Kept reads 1 950
Kept reads 2 950
All logging messages are sent to STDERR.
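Because the counts go to STDOUT and logging goes to STDERR, the counts can be captured and parsed on their own. Below is a minimal Python sketch, assuming each line is a label followed by a tab and an integer, as in the example above. The function name parse_counts is hypothetical.

```python
def parse_counts(stdout_text):
    """Parse tab-delimited read counts into a dict of label -> count.

    Hypothetical helper. Assumes each line looks like "<label>\t<integer>".
    """
    counts = {}
    for line in stdout_text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not tab-delimited.
        if "\t" not in line:
            continue
        # Split on the last tab so the label may itself contain spaces.
        label, _, value = line.rpartition("\t")
        counts[label] = int(value)
    return counts


example = (
    "Input reads file 1\t1000\n"
    "Input reads file 2\t1000\n"
    "Kept reads 1\t950\n"
    "Kept reads 2\t950\n"
)
counts = parse_counts(example)
# Fraction of reads kept from the first file in the example.
kept_fraction = counts["Kept reads 1"] / counts["Input reads file 1"]
```

With the example counts above, kept_fraction is 0.95.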
Required arguments:
- --ref_fasta: reference genome in FASTA format.
- --reads1: at least one reads file in FASTA[.GZ] or FASTQ[.GZ] format.
- -o,--outprefix: prefix of output files.
Please note there is an option --tech, which defaults to illumina. Use --tech ont for nanopore reads.
Optional arguments:
- --reads2: name of second reads file, i.e. mates file for paired reads
- --enumerate_names: rename the reads 1,2,3,... (for paired reads, will also add /1 or /2 on the end of names)
- --debug: debug mode. More verbose and writes debugging files
- --min_map_length: minimum length of match required to keep a read in bp (default 50)
- --min_map_length_pc: minimum length of match required to keep a read, as a percent of the read length (default 50.0)
- -V,--version: show version and exit
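When calling readItAndKeep from a script, the arguments listed above can be assembled into a command list. The following Python sketch uses only flags documented in this README. The function names build_command and run_and_capture are hypothetical, and the default exe value assumes readItAndKeep is on your $PATH.

```python
import subprocess


def build_command(ref_fasta, reads1, outprefix, reads2=None, tech="illumina",
                  min_map_length=None, min_map_length_pc=None,
                  exe="readItAndKeep"):
    """Build the argument list for a readItAndKeep run (hypothetical helper)."""
    cmd = [exe, "--tech", tech, "--ref_fasta", ref_fasta, "--reads1", reads1]
    # Optional arguments are added only when the caller sets them,
    # so the tool's own defaults apply otherwise.
    if reads2 is not None:
        cmd += ["--reads2", reads2]
    if min_map_length is not None:
        cmd += ["--min_map_length", str(min_map_length)]
    if min_map_length_pc is not None:
        cmd += ["--min_map_length_pc", str(min_map_length_pc)]
    cmd += ["-o", outprefix]
    return cmd


def run_and_capture(cmd):
    """Run the command and return STDOUT (the read counts) as text."""
    # check=True raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.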
Additional arguments need to be supplied to allow Docker to access input and output files. Below is a functional example:
docker run -v /path/to/read-it-and-keep/tests:/tests [-v /path/to/input:/input -v /path/to/output:/output] <TAG> --ref_fasta /tests/MN908947.3.fa --reads1 /input/<SAMPLE>_1.fastq.gz --reads2 /input/<SAMPLE>_2.fastq.gz --outprefix /output/
The tests are under development. To run them you will need:
- Python 3
- Python package pytest (pip install pytest)
- Python package pyfastaq (pip install pyfastaq)
- ART read simulator installed, so that art_illumina is in your $PATH
- badread for nanopore read simulation.
Run the tests after compiling the source code, i.e.:
cd src
make
make test
This repository includes unedited copies of the code from:
- gzstream, LGPL 2.1 licence
- minimap2, MIT licence
- CLI11 header file, licence is at start of cli11.hpp