-
Notifications
You must be signed in to change notification settings - Fork 7
sylph cookbook
sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50
-
-1/-2
: Input the first and second paired-end reads; can be gzipped. -
-t
: parallelized sketching over the samples. Each sample uses 1 thread: e.g. 5 samples would use 5 threads. -
-d
: all read sketches are output to a new folder calledmy_sketches
with.sylsp
suffixes. The folder will be created if it is not present.
sylph sketch -r reads1.fq reads2.fa -S sample1 sample2
sylph sketch reads1.fq -S sample1
# DON'T DO: sylph sketch reads.fa
-
-r
: you can sketch many single-end reads at once. - If
-r
is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r. -
-S
: (Optional) renames the sketches tosample1.sylsp
andsample2.sylsp
instead ofreads1.fq.sylsp
andreads2.fa.sylsp
.
sylph sketch genomes/*.fa.gz -o database -t 50 -c 200
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
- fasta files are assumed to be genomes. All genomes are combined into a new file called
database.syldb
. -
-t
: sketching can use many threads -
-c
: the compression parameter. Memory/runtime scale like1/c
; higherc
is faster but less sensitive at low coverage. - Default
c = 200
. The-c
for genomes must be than >= the-c
for reads (strict > is allowed)
sylph sketch all_genomes.fa -i -o contig_database
-
-i
considers every record in a fasta (i.e. contig) as a genome - This is single-threaded currently, so a bit slower
sylph sketch -l genome_list.txt -t 50
-
-l
is a newline delimited text file, with each genome on a separate line.
sylph profile *.sylsp *.syldb -t 50 -o results.tsv
-
*.syldb
: all syldb files (generated by sketching genomes) are aggregated together into one large database. -
*.sylsp
: each sylsp file (generated by sketching reads) is queried/profiled against the combined database. -
-t
: 50 threads used. Even single-sample vs single-database is multi-threaded. -
-o
: output results to a file. Can also redirect using>
instead of-o
.
# raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp
# raw reads
sylph profile raw_reads.fq database.syldb
# raw paired-end reads (since sylph v0.6.0)
sylph profile database.syldb -1 *_1.fq.gz -2 *_2.fq.gz
- If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.
- Parameters can be set in the
sylph profile
command for sketching. We recommend usingsylph sketch
instead, though, as there are more options.
sylph sketch -c 100 virus_genomes.fa -i -o viruses
sylph sketch -c 200 prokaryotic_genomes/* -o proks -t 50
sylph sketch -c 100 read.fq
sylph profile *.sylsp *.syldb -t 50 --min-number-kmers 20 -o results.tsv
IMPORTANT FOR SMALL GENOMES
-
--min-number-kmers
states the minimum number of k-mers needed for sylph to output a result. This is (approximately) the contig length divided by-c
. With default settings,--min-number-kmers 10
can work with contigs ~2500 bp. For smaller contigs, consider-c 100
.
Notes:
- Sketch viruses at
-c 100
, reads at-c 100
, but genomes at-c 200
. sylph actually runs without issues if the -c for all genomes is >= the -c for reads, which is a useful technique. - Sketching smaller genomes with smaller
-c
is preferable. - Be sure to set
--min-num-kmers
to smaller than default (50) if you care about outputting results for small genomes.
sylph can estimate the percentage of your reads that are not classified or unknown. While sylph does not classify reads directly, it can see how many genomes are detected vs how many reads you have.
sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv
# input read identities. see below for information on the --read-seq-id parameter.
sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv
-
-u
option multiplies theSequence_abundance
column by the percent of classified reads.
-u
needs the percent identity of your sequences. You can supply a --read-seq-id
, e.g. 99-99.5 works fine for Illumina reads. If you do not, -u
will estimate sequence identity.
This estimate works well for metagenomes that are (1) not too complex, e.g. human microbiomes and even gut and (2) > 1GB sequencing depth.
For Soil/Ocean, very complex metagenomes, and low depth this estimate does not work well, and you should set --read-seq-id
to something like 99 instead.
Tip
We have developed a new method called fairy for calculating contig coverages quickly. Consider using fairy instead of sylph for this task, especially if you want to bin contigs.
sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv
-
-i
and-1,-2
are used for sketching contigs and reads -- all at once.
Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.
-
-u
is needed to estimate the true coverage. If you do not use-u
, you'll get the effective coverage instead, which is smaller. -
--read-len
is needed for small reads. If 2x150bp reads,--read-len
is 150. (Removed in v0.5 -- automatically detected instead now) -
--read-seq-id
set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.
IMPORTANT FOR SMALL CONTIGS
-
--min-number-kmers
states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by-c
. With default settings,--min-number-kmers 10
can work with contigs ~2500 bp. For smaller contigs, consider-c 100
and maybe 3-4 for this value.