Skip to content

sylph cookbook

Jim Shaw edited this page Nov 15, 2023 · 36 revisions

Sylph cookbook

Read sketching options

Sketching many paired-end reads

sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50 
  • -1/-2: Input the first and second paired-end reads; can be gzipped.
  • -t: parallelized sketching over the samples. Each sample uses 1 thread, but e.g. 5 samples would use 5 threads.
  • -d: all read sketches are output to a new folder called my_sketches. The folder will be created if it is not present

Sketching single-end reads

sylph sketch -r reads1.fq reads2.fa
sylph sketch reads1.fq 

# DON'T DO: sylph sketch reads.fa
  • -r: you can sketch many single-end reads at once.
  • If -r is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r.

Database sketching options

Creating a database of fasta files

sylph sketch genomes/*.fa.gz -o database -t 50
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
  • fasta files are assumed to be genomes. All genomes are combined into a new file called database.syldb.
  • -t: sketching can use many threads

Creating a database of contigs or if genomes are all in one fasta file

sylph sketch all_genomes.fa -i -o contig_database
  • -i considers every record in a fasta (i.e. contig) as a genome
  • This is single-threaded currently, so a bit slower

Sketching a large database

sylph sketch -l genome_list.txt -t 50
  • -l is a newline delimited text file, with each genome on a separate line.

Profiling/querying

Standard profiling and querying

sylph profile *.sylsp *.syldb -t 50 -o results.tsv
  • *.syldb: all syldb files are aggregated together into one large database.
  • *.sylsp: each sylsp file is queried/profiled against the combined database.
  • -t: 50 threads used. Even single-sample vs single-database is multi-threaded.
  • -o: output results to a file. Can also redirect using > instead of -o.

Lazy profiling and querying without sketching

#raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp

#raw reads
sylph profile raw_reads.fq database.syldb
  • If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.

Profiling small genomes such as viruses

sylph sketch -c 100 virus_genomes.fa -i -o viruses
sylph sketch -c 200 prokaryotic_genomes/* -o proks -t 50
sylph sketch -c 100 read.fq 
sylph profile *.sylsp *.syldb -t 50 -o results.tsv --min-number-kmers 20 -o results.tsv
  • Sketch viruses at -c 100, reads at -c 100, but genomes at -c 200. sylph actually runs without issues if the -c for all genomes is >= the -c for reads, which is a useful technique.
  • Sketching smaller genomes with smaller -c is preferable.
  • Be sure to set --min-num-kmers to smaller than default (50) if you care about outputting results for small genomes.

Estimating percentage of unknown reads

sylph can estimate the percentage of your reads that are not classified or unknown. While sylph does not classify reads directly, it can see how many genomes are detected vs how many reads you have.

sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv

# input read identities. see below for information on the --read-seq-id parameter. 
sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv
  • -u option multiplies the Sequence_abundance column by the percent of classified reads.

Important notes for estimating unknown percentage

-u needs the percent identity of your sequences. You can supply a --read-seq-id, e.g. 99-99.5 works fine for Illumina reads. If you do not, -u will estimate sequence identity.

This estimate works well for metagenomes that are (1) not too complex, e.g. human microbiomes and even gut and (2) > 1GB sequencing depth. For Soil/Ocean, very complex metagenomes, and low depth this estimate does not work well, and you should set --read-seq-id to something like 99 instead.

Estimating coverage for contigs

sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv 
  • -i and -1,-2 are used for sketching contigs and reads -- all at once.

Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.

  • -u is needed to estimate the true coverage. If you do not use -u, you'll get the effective coverage instead, which is smaller.
  • --read-len is needed for small reads. If 2x150bp reads, --read-len is 150.
  • --read-seq-id set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.

IMPORTANT FOR SMALL CONTIGS

  • --min-number-kmers states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by -c. With default settings, --min-number-kmers 10 can work with contigs ~2500 bp. For smaller contigs, consider -c 100 and maybe 3-4 for this value.