Skip to content

sylph cookbook

Jim Shaw edited this page Nov 11, 2023 · 36 revisions

sylph cookbook

Read sketching options

sketching many paired-end reads

sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50 
  • -1/-2: Input the first and second paired-end reads; can be gzipped.
  • -t: parallelized sketching over the samples. Each sample uses 1 thread, but many 5 samples would use 5 threads.
  • -d: all read sketches are output to a new folder called my_sketches. Folder will be created if not present

sketching single-end reads

sylph sketch -r reads1.fq reads2.fa
sylph sketch reads1.fq 

# DON"T DO: sylph sketch reads.fa
  • -r: you can sketch many single-end reads at once.
  • If -r is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r.

Database sketching options

Creating a database of fasta files

sylph sketch genomes/*.fa.gz -o database -t 50
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
  • fasta files are assumed to be genomes. All genomes combined to new file called database.syldb.
  • -t: sketching can use many threads

Creating a database of contigs or if genomes are all in one fasta file (e.g. viruses)

sylph sketch all_genomes.fa -i -o contig_database
  • -i considers every record in a fasta (i.e. contig) as a genome
  • This is single-threaded currently, so a bit slower

sketching a large database (> 50,000 files)

sylph sketch -l genome_list.txt -t 50
  • -l is a newline delimited text file, with each genome on a separate line.

Profiling/querying

standard profiling/querying

sylph profile *.sylsp *.syldb -t 50 -o results.tsv
  • *.syldb: all syldb files are aggregated together into one large database.
  • *.sylsp: each sylsp file is queried/profiled against the combined database.
  • -t: 50 threads used. Even single-sample vs single-database is multi-threaded.
  • -o: output results to a file. Can also redirect using > instead of -o.

lazy profiling/querying without sketching

#raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp

#raw reads
sylph profile raw_reads.fq database.syldb
  • If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.

Estimating percentage of unknown reads

sylph can estimate the percentage of your reads that are not classified or unknown. While sylph does not classify reads directly, it can see how many genomes are detected vs how many reads you have.

sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv

sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv
  • -u option multiplies the Sequence_abundance column by the percent of classified reads.

IMPORTANT -u ACCURACY NOTE

-u needs the percent identity of your sequences. You can supply a --read-seq-id, e.g. 99-99.5 works fine for Illumina reads. If you do not, -u will estimate sequence identity.

This estimate works well for metagenomes that are (1) not too complex, e.g. human microbiomes and even gut and (2) > 1GB sequencing depth. For Soil/Ocean, very complex metagenomes, and low depth this estimate does not work well, and you should set --read-seq-id to something like 99 instead.

Estimating coverage for contigs

sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv 
  • -i and -1,-2 are used for sketching contigs and reads -- all at once.

Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.

  • -u is needed to estimate the true coverage. If you do not use -u, you'll get the effective coverage instead, which is smaller.
  • --read-len is needed for small reads. If 2x150bp reads, --read-len is 150.
  • --read-seq-id set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.

IMPORTANT FOR SMALL CONTIGS

  • --min-number-kmers states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by -c. With default settings, --min-number-kmers 10 can work with contigs ~2500 bp. For smaller contigs, consider -c 100 and maybe 3-4 for this value.