-
Notifications
You must be signed in to change notification settings - Fork 7
sylph cookbook
- Read sketching options
- Database sketching options
- Profiling/querying
- Taxonomy integration with sylph-utils
"Sketching" is equivalently to "indexing". Sketching reads gives you a small index that you can efficiently reuse.
sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50
-
-1/-2
: Input the first and second paired-end reads; can be gzipped. -
-t
: parallelized sketching over the samples. Each sample uses 1 thread: e.g. 5 samples would use 5 threads. -
-d
: all read sketches are output to a new folder calledmy_sketches
with.sylsp
suffixes. The folder will be created if it is not present.
Warning
If you have identical read names but different folder names (e.g. folder1/reads.fq, folder2/reads.fq
) sylph will only output one reads.fq.sylsp
file. This will cause issues unless you use the -S option shown below.
sylph sketch -1 folder1/reads_1.fq folder2/reads_1.fq -2 folder1/reads_2.fq folder2/reads_2.fq -S sample1 sample2
ls sample1.paired.sylsp sample2.paired.sylsp
-
-S
: renames the output sketch tosample{x}.paired.sylsp
. -
--lS
: input a newline delimited list of sample names
sylph sketch -r reads1.fq reads2.fa
sylph sketch reads1.fq
# DON'T DO: sylph sketch reads.fa
-
-r
: you can sketch many single-end reads at once. - If
-r
is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r.
You can create a custom database with sylph. This requires only fasta files.
sylph sketch genomes/*.fa.gz -o database -t 50 -c 200
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
- fasta files are assumed to be genomes. All genomes are combined into a new file called
database.syldb
. -
-t
: sketching can use many threads -
-c
: the compression parameter. Memory/runtime scale like1/c
; higherc
is faster but less sensitive at low coverage. - Default
c = 200
. The-c
for genomes must be than >= the-c
for reads (strict > is allowed)
sylph sketch all_genomes.fa -i -o contig_database
-
-i
considers every record in a fasta (i.e. contig) as a genome - This is single-threaded currently, so a bit slower
sylph sketch -l genome_list.txt -t 50
-
-l
is a newline delimited text file, with each genome on a separate line.
Tip
Most of the time, you should use sylph profile
instead of sylph query
. See the tutorial for how profile
and query
differ.
sylph profile *.syldb *.sylsp -t 50 -o results.tsv
# sylph query *.syldb *.sylsp -t 50 -o results.tsv
-
*.syldb
: all syldb files (generated by sketching genomes) are aggregated together into one large database. -
*.sylsp
: each sylsp file (generated by sketching reads) is queried/profiled against the combined database. -
-t
: 50 threads used. Even single-sample vs single-database is multi-threaded. -
-o
: output results to a file. Can also redirect using>
instead of-o
.
# raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp
# raw reads
sylph profile database.syldb raw_reads.fq
# raw paired-end reads (since sylph v0.6.0)
sylph profile database.syldb -1 *_1.fq.gz -2 *_2.fq.gz
- If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.
- Parameters can be set in the
sylph profile
command for sketching. We recommend usingsylph sketch
instead, though, as there are more options.
sylph sketch -c 100 virus_genomes.fa -i -o viruses
sylph sketch -c 200 prokaryotic_genomes/* -o proks -t 50
sylph sketch -c 100 read.fq
sylph profile *.syldb *.sylsp -t 50 --min-number-kmers 20 -o results.tsv
Notes:
- Sketch viruses at
-c 100
, reads at-c 100
, but genomes at-c 200
. sylph actually runs without issues if the -c for all genomes is >= the -c for reads, which is a useful technique. - Sketching smaller genomes with smaller
-c
is preferable. - Be sure to set
--min-num-kmers
to smaller than default (50) if you care about outputting results for small genomes.
Sylph can estimate the percentage of your reads that are present at the species level using the -u
or ---estimate-unknown
option. Sylph does not classify reads directly but estimates this percentage from the lengths and coverages of detected genomes.
sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv
# input read identities. see below for information on the --read-seq-id parameter.
sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv
Important
-u
needs the percent identity of your sequences (i.e. 100 - error percent). You can supply a --read-seq-id
, e.g. 99-99.9 works fine for Illumina reads.
If you do not provide --read-seq-id
, -u
will estimate sequence identity.
This estimate works well for metagenomes that are (1) not too complex, e.g. host-associated metagenomes, and (2) > 1GB sequencing depth.
For Soil/Ocean, i.e. complex metagenomes, and low sequencing depth this estimate does not work well. You could set --read-seq-id
to something like 99.5 instead. For short reads, sylph v0.6 automatically sets --read-seq-id
to 99.5 if median k-mer depth is < 3.
Important
- The
-u
option multiplies theSequence_abundance
column by the percent of classified reads. - The sum of
Sequence_abundance
column is the percentage of classified reads. -
-u
changes theEff_cov
column toTrue_cov
.-u
estimates the true coverage, not the effective coverage (which is influenced by read length and error rate).
Tip
We have developed a new method called fairy for calculating contig coverages quickly. Consider using fairy instead of sylph for this task, especially if you want to bin contigs.
sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv
-
-i
and-1,-2
are used for sketching contigs and reads -- all at once.
Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.
-
-u
is needed to estimate the true coverage. If you do not use-u
, you'll get the effective coverage instead, which is smaller. -
--read-len
is needed for small reads. If 2x150bp reads,--read-len
is 150. (Removed in v0.5 -- automatically detected instead now) -
--read-seq-id
set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.
IMPORTANT FOR SMALL CONTIGS
-
--min-number-kmers
states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by-c
. With default settings,--min-number-kmers 10
can work with contigs ~2500 bp. For smaller contigs, consider-c 100
and maybe 3-4 for this value.
Integrating taxonomy is done through the sylph-utils repository.
git clone https://github.com/bluenote-1577/sylph-utils
More details are available here for constructing custom taxonomies and output formats.
### clone the repository if you have not:
### git clone https://github.com/bluenote-1577/sylph-utils
sylph profile gtdb-r220-c200-dbv1.syldb -1 read_name1.fq -2 read_name2.fq -o result.tsv
python sylph-utils/sylph_to_taxprof.py -s result.tsv -m sylph-utils/prokaryote/gtdb_r220_metadata.tsv.gz -o PREFIX_
### new file will be called PREFIX_read_name1.fq.sylphmpa
head PREFIX_read_name1.fq.sylphmpa
-
sylph_to_taxprof.py
takes sylph's result (-s
) and a taxonomy metadata file (-m
) and outputs.sylphmpa
files with prefix (-o
) -
gtdb-r220...syldb
is used, so we usesylph-utils/prokaryote/gtdb_r220_metadata.tsv.gz
as the metadata file. - When using paired-end reads, the "sample name" is the first read (
read_name1.fq
). For single-end reads, it is just the read name.
sylph profile gtdb-r220-c200-dbv1.syldb\
fungi-refseq-2024-07-25-c200-v0.3.syldb \
-1 A1.fq B1.fq -2 A2.fq B2.fq -o result.tsv
python sylph-utils/sylph_to_taxprof.py -s result.tsv \
-m sylph-utils/prokaryote/gtdb_r220_metadata.tsv.gz sylph-utils/eukaryote/fungi_refseq_2024-07-25_metadata.tsv.gz
# multiple .sylphmpa files are output
head A1.fq.sylphmpa
head B1.fq.sylphmpa
-
-m
can take multiplemetadata
files. This is needed becausegtdb-r220...
andfungi-refseq...
are both used (these databases are concatenated by sylph into a big database for profiling). -
sylph_to_taxprof.py
works on multiple samples at once. BothA1.fq.sylphmpa
andB1.fq.sylphmpa
are output.