Skip to content

sylph cookbook

Jim Shaw edited this page Jul 9, 2024 · 36 revisions

Sylph cookbook

Read sketching options

Sketching many paired-end reads

sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50 
  • -1/-2: Input the first and second paired-end reads; can be gzipped.
  • -t: parallelized sketching over the samples. Each sample uses 1 thread: e.g. 5 samples would use 5 threads.
  • -d: all read sketches are output to a new folder called my_sketches with .sylsp suffixes. The folder will be created if it is not present.

Sketching single-end reads

sylph sketch -r reads1.fq reads2.fa -S sample1 sample2
sylph sketch reads1.fq -S sample1

# DON'T DO: sylph sketch reads.fa
  • -r: you can sketch many single-end reads at once.
  • If -r is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r.
  • -S: (Optional) renames the sketches to sample1.sylsp and sample2.sylsp instead of reads1.fq.sylsp and reads2.fa.sylsp.

Database sketching options

Creating a database of fasta files

sylph sketch genomes/*.fa.gz -o database -t 50 -c 200
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
  • fasta files are assumed to be genomes. All genomes are combined into a new file called database.syldb.
  • -t: sketching can use many threads
  • -c: the compression parameter. Memory/runtime scale like 1/c; higher c is faster but less sensitive at low coverage.
  • Default c = 200. The -c for genomes must be than >= the -c for reads (strict > is allowed)

Creating a database of contigs or if genomes are all in one fasta file

sylph sketch all_genomes.fa -i -o contig_database
  • -i considers every record in a fasta (i.e. contig) as a genome
  • This is single-threaded currently, so a bit slower

Sketching a large database

sylph sketch -l genome_list.txt -t 50
  • -l is a newline delimited text file, with each genome on a separate line.

Profiling/querying

Standard profiling and querying

sylph profile *.sylsp *.syldb -t 50 -o results.tsv
  • *.syldb: all syldb files (generated by sketching genomes) are aggregated together into one large database.
  • *.sylsp: each sylsp file (generated by sketching reads) is queried/profiled against the combined database.
  • -t: 50 threads used. Even single-sample vs single-database is multi-threaded.
  • -o: output results to a file. Can also redirect using > instead of -o.

Lazy profiling and querying without sketching

# raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp

# raw reads
sylph profile raw_reads.fq database.syldb

# raw paired-end reads (since sylph v0.6.0)
sylph profile database.syldb -1 *_1.fq.gz -2 *_2.fq.gz
  • If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.
  • Parameters can be set in the sylph profile command for sketching. We recommend using sylph sketch instead, though, as there are more options.

Profiling small genomes such as viruses

sylph sketch -c 100 virus_genomes.fa -i -o viruses
sylph sketch -c 200 prokaryotic_genomes/* -o proks -t 50
sylph sketch -c 100 read.fq 
sylph profile *.sylsp *.syldb -t 50 --min-number-kmers 20 -o results.tsv

IMPORTANT FOR SMALL GENOMES

  • --min-number-kmers states the minimum number of k-mers needed for sylph to output a result. This is (approximately) the contig length divided by -c. With default settings, --min-number-kmers 10 can work with contigs ~2500 bp. For smaller contigs, consider -c 100.

Notes:

  • Sketch viruses at -c 100, reads at -c 100, but genomes at -c 200. sylph actually runs without issues if the -c for all genomes is >= the -c for reads, which is a useful technique.
  • Sketching smaller genomes with smaller -c is preferable.
  • Be sure to set --min-num-kmers to smaller than default (50) if you care about outputting results for small genomes.

Estimating percentage of unknown reads

sylph can estimate the percentage of your reads that are not classified or unknown. While sylph does not classify reads directly, it can see how many genomes are detected vs how many reads you have.

sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv

# input read identities. see below for information on the --read-seq-id parameter. 
sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv
  • -u option multiplies the Sequence_abundance column by the percent of classified reads.

Important notes for estimating unknown percentage

-u needs the percent identity of your sequences. You can supply a --read-seq-id, e.g. 99-99.5 works fine for Illumina reads. If you do not, -u will estimate sequence identity.

This estimate works well for metagenomes that are (1) not too complex, e.g. human microbiomes and even gut and (2) > 1GB sequencing depth. For Soil/Ocean, very complex metagenomes, and low depth this estimate does not work well, and you should set --read-seq-id to something like 99 instead.

Estimating coverage for contigs

Tip

We have developed a new method called fairy for calculating contig coverages quickly. Consider using fairy instead of sylph for this task, especially if you want to bin contigs.

sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv 
  • -i and -1,-2 are used for sketching contigs and reads -- all at once.

Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.

  • -u is needed to estimate the true coverage. If you do not use -u, you'll get the effective coverage instead, which is smaller.
  • --read-len is needed for small reads. If 2x150bp reads, --read-len is 150. (Removed in v0.5 -- automatically detected instead now)
  • --read-seq-id set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.

IMPORTANT FOR SMALL CONTIGS

  • --min-number-kmers states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by -c. With default settings, --min-number-kmers 10 can work with contigs ~2500 bp. For smaller contigs, consider -c 100 and maybe 3-4 for this value.