-
Notifications
You must be signed in to change notification settings - Fork 7
sylph cookbook
sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50
-
-1/-2
: Input the first and second paired-end reads; can be gzipped. -
-t
: parallelized sketching over the samples. Each sample uses 1 thread, but e.g. 5 samples would use 5 threads. -
-d
: all read sketches are output to a new folder calledmy_sketches
. The folder will be created if it is not present
sylph sketch -r reads1.fq reads2.fa
sylph sketch reads1.fq
# DON'T DO: sylph sketch reads.fa
-
-r
: you can sketch many single-end reads at once. - If
-r
is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r.
sylph sketch genomes/*.fa.gz -o database -t 50
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
- fasta files are assumed to be genomes. All genomes are combined into a new file called
database.syldb
. -
-t
: sketching can use many threads
sylph sketch all_genomes.fa -i -o contig_database
-
-i
considers every record in a fasta (i.e. contig) as a genome - This is single-threaded currently, so a bit slower
sylph sketch -l genome_list.txt -t 50
-
-l
is a newline delimited text file, with each genome on a separate line.
sylph profile *.sylsp *.syldb -t 50 -o results.tsv
-
*.syldb
: all syldb files are aggregated together into one large database. -
*.sylsp
: each sylsp file is queried/profiled against the combined database. -
-t
: 50 threads used. Even single-sample vs single-database is multi-threaded. -
-o
: output results to a file. Can also redirect using>
instead of-o
.
#raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp
#raw reads
sylph profile raw_reads.fq database.syldb
- If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.
sylph sketch -c 100 virus_genomes.fa -i -o viruses
sylph sketch -c 200 prokaryotic_genomes/* -o proks -t 50
sylph sketch -c 100 read.fq
sylph profile *.sylsp *.syldb -t 50 -o results.tsv --min-number-kmers 20 -o results.tsv
- Sketch viruses at
-c 100
, reads at-c 100
, but genomes at-c 200
. sylph actually runs without issues if the -c for all genomes is >= the -c for reads, which is a useful technique. - Sketching smaller genomes with smaller
-c
is preferable. - Be sure to set
--min-num-kmers
to smaller than default (50) if you care about outputting results for small genomes.
sylph can estimate the percentage of your reads that are not classified or unknown. While sylph does not classify reads directly, it can see how many genomes are detected vs how many reads you have.
sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv
# input read identities. see below for information on the --read-seq-id parameter.
sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv
-
-u
option multiplies theSequence_abundance
column by the percent of classified reads.
-u
needs the percent identity of your sequences. You can supply a --read-seq-id
, e.g. 99-99.5 works fine for Illumina reads. If you do not, -u
will estimate sequence identity.
This estimate works well for metagenomes that are (1) not too complex, e.g. human microbiomes and even gut and (2) > 1GB sequencing depth.
For Soil/Ocean, very complex metagenomes, and low depth this estimate does not work well, and you should set --read-seq-id
to something like 99 instead.
sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv
-
-i
and-1,-2
are used for sketching contigs and reads -- all at once.
Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.
-
-u
is needed to estimate the true coverage. If you do not use-u
, you'll get the effective coverage instead, which is smaller. -
--read-len
is needed for small reads. If 2x150bp reads,--read-len
is 150. -
--read-seq-id
set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.
IMPORTANT FOR SMALL CONTIGS
-
--min-number-kmers
states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by-c
. With default settings,--min-number-kmers 10
can work with contigs ~2500 bp. For smaller contigs, consider-c 100
and maybe 3-4 for this value.