Skip to content

sylph cookbook

Jim Shaw edited this page Dec 13, 2024 · 36 revisions

Sylph cookbook

Read sketching options

"Sketching" is equivalently to "indexing". Sketching reads gives you a small index that you can efficiently reuse.

Sketching many paired-end reads

sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50 
  • -1/-2: Input the first and second paired-end reads; can be gzipped.
  • -t: parallelized sketching over the samples. Each sample uses 1 thread: e.g. 5 samples would use 5 threads.
  • -d: all read sketches are output to a new folder called my_sketches with .sylsp suffixes. The folder will be created if it is not present.

Renaming (identical) reads with -S

Warning

If you have identical read names but different folder names (e.g. folder1/reads.fq, folder2/reads.fq) sylph will only output one reads.fq.sylsp file. This will cause issues unless you use the -S option shown below.

sylph sketch -1 folder1/reads_1.fq folder2/reads_1.fq -2 folder1/reads_2.fq folder2/reads_2.fq -S sample1 sample2
ls sample1.paired.sylsp sample2.paired.sylsp
  • -S: renames the output sketch to sample{x}.paired.sylsp.
  • --lS: input a newline delimited list of sample names

Sketching single-end reads

sylph sketch -r reads1.fq reads2.fa
sylph sketch reads1.fq 

# DON'T DO: sylph sketch reads.fa
  • -r: you can sketch many single-end reads at once.
  • If -r is not specified, fastq files are assumed to be reads. If fasta reads, you must use -r.

Database sketching options

You can create a custom database with sylph. This requires only fasta files.

Creating a database of fasta files

sylph sketch genomes/*.fa.gz -o database -t 50 -c 200
# EQUIVALENT
sylph sketch -g genomes/*.fa.gz -o database -t 50
  • fasta files are assumed to be genomes. All genomes are combined into a new file called database.syldb.
  • -t: sketching can use many threads
  • -c: the compression parameter. Memory/runtime scale like 1/c; higher c is faster but less sensitive at low coverage.
  • Default c = 200. The -c for genomes must be than >= the -c for reads (strict > is allowed)

Creating a database of contigs or if genomes are all in one fasta file

sylph sketch all_genomes.fa -i -o contig_database
  • -i considers every record in a fasta (i.e. contig) as a genome
  • This is single-threaded currently, so a bit slower

Sketching a large database

sylph sketch -l genome_list.txt -t 50
  • -l is a newline delimited text file, with each genome on a separate line.

Profiling/querying

Tip

Most of the time, you should use sylph profile instead of sylph query. See the tutorial for how profile and query differ.

Standard profiling and querying

sylph profile *.syldb *.sylsp  -t 50 -o results.tsv
# sylph query *.syldb *.sylsp  -t 50 -o results.tsv
  • *.syldb: all syldb files (generated by sketching genomes) are aggregated together into one large database.
  • *.sylsp: each sylsp file (generated by sketching reads) is queried/profiled against the combined database.
  • -t: 50 threads used. Even single-sample vs single-database is multi-threaded.
  • -o: output results to a file. Can also redirect using > instead of -o.

Lazy profiling and querying without sketching

# raw genomes
sylph profile genome1.fa genome2.fa sample.sylsp

# raw reads
sylph profile database.syldb raw_reads.fq 

# raw paired-end reads (since sylph v0.6.0)
sylph profile database.syldb -1 *_1.fq.gz -2 *_2.fq.gz
  • If you input fastas and fastqs into profile/query, sylph will sketch them and then use them.
  • Parameters can be set in the sylph profile command for sketching. We recommend using sylph sketch instead, though, as there are more options.

Profiling small genomes such as viruses

sylph sketch -c 100 virus_genomes.fa -i -o viruses
sylph sketch -c 200 prokaryotic_genomes/* -o proks -t 50
sylph sketch -c 100 read.fq 
sylph profile *.syldb  *.sylsp -t 50 --min-number-kmers 20 -o results.tsv

Notes:

  • Sketch viruses at -c 100, reads at -c 100, but genomes at -c 200. sylph actually runs without issues if the -c for all genomes is >= the -c for reads, which is a useful technique.
  • Sketching smaller genomes with smaller -c is preferable.
  • Be sure to set --min-num-kmers to smaller than default (50) if you care about outputting results for small genomes.

Estimating percentage of unknown reads in database

Sylph can estimate the percentage of your reads that are present at the species level using the -u or ---estimate-unknown option. Sylph does not classify reads directly but estimates this percentage from the lengths and coverages of detected genomes.

sylph profile -u database.syldb sample.sylsp -o results_with_unknown.tsv

# input read identities. see below for information on the --read-seq-id parameter. 
sylph profile -u --read-seq-id 99.1 database.syldb sample.sylsp -o results_with_unknown.tsv

Important

Important notes for estimating unknown percentage

-u needs the percent identity of your sequences (i.e. 100 - error percent). You can supply a --read-seq-id, e.g. 99-99.9 works fine for Illumina reads. If you do not provide --read-seq-id, -u will estimate sequence identity. This estimate works well for metagenomes that are (1) not too complex, e.g. host-associated metagenomes, and (2) > 1GB sequencing depth.

For Soil/Ocean, i.e. complex metagenomes, and low sequencing depth this estimate does not work well. You could set --read-seq-id to something like 99.5 instead. For short reads, sylph v0.6 automatically sets --read-seq-id to 99.5 if median k-mer depth is < 3.

Important

  1. The -u option multiplies the Sequence_abundance column by the percent of classified reads.
  2. The sum of Sequence_abundance column is the percentage of classified reads.
  3. -u changes the Eff_cov column to True_cov. -u estimates the true coverage, not the effective coverage (which is influenced by read length and error rate).

Estimating coverage for contigs

Tip

We have developed a new method called fairy for calculating contig coverages quickly. Consider using fairy instead of sylph for this task, especially if you want to bin contigs.

sylph sketch -i contigs.fa -o contigs -1 reads_1.fq -2 reads_2.fq
sylph profile contigs.syldb reads_1.paired.fq.sylsp -u --min-number-kmers 10 --read-len 150 --read-seq-id 99 -o results.tsv 
  • -i and -1,-2 are used for sketching contigs and reads -- all at once.

Calculating coverage accurately -- this scales the coverage to a true value, otherwise a consistent underestimate is output. Accurately scaled coverages may not be needed for some applications.

  • -u is needed to estimate the true coverage. If you do not use -u, you'll get the effective coverage instead, which is smaller.
  • --read-len is needed for small reads. If 2x150bp reads, --read-len is 150. (Removed in v0.5 -- automatically detected instead now)
  • --read-seq-id set to 99 for reads with 99% accuracy. You can omit this (see above). If you are calculating reads from multiple samples, setting read-seq-id to be constant across your batch is preferable.

IMPORTANT FOR SMALL CONTIGS

  • --min-number-kmers states the minimum number of k-mers needed for sylph to output a result. This is approximately contig length divided by -c. With default settings, --min-number-kmers 10 can work with contigs ~2500 bp. For smaller contigs, consider -c 100 and maybe 3-4 for this value.

Taxonomy integration with sylph-tax

Integrating taxonomy is done through the sylph-tax repository.

conda install -c bioconda sylph-tax
sylph-tax download --download-to my_existing_folder/

More details are available here for constructing custom taxonomies and output formats.

Standard taxonomic integration (one metagenome, one database)

sylph profile  gtdb-r220-c200-dbv1.syldb -1 read_name1.fq -2 read_name2.fq -o result.tsv
sylph-tax taxprof result.tsv -t GTDB_r220 -o PREFIX_

### new file will be called PREFIX_read_name1.fq.sylphmpa
head PREFIX_read_name1.fq.sylphmpa
  • sylph-tax taxprof takes sylph's result and a taxonomy (-t) and outputs .sylphmpa files with prefix (-o)
  • GTDB_r220 database is used, so we use GTDB_r220 as the metadata file. See sylph-tax taxprof -h for available databases.
  • When using paired-end reads, the "sample name" is the first read (read_name1.fq). For single-end reads, it is just the read name.

Taxonomic integration: more than 1 metagenome, more than 1 database

sylph profile  gtdb-r220-c200-dbv1.syldb\
               fungi-refseq-2024-07-25-c200-v0.3.syldb \
              -1 A1.fq B1.fq -2 A2.fq B2.fq -o result.tsv
sylph-tax taxprof result.tsv -t GTDB_r220 FungiRefSeq-2024-07-25

# multiple .sylphmpa files are output
head A1.fq.sylphmpa
head B1.fq.sylphmpa
  • -t can take multiple taxonomies. This is needed because gtdb-r220... and fungi-refseq... are both used (these databases are concatenated by sylph into a big database for profiling).
  • sylph-tax taxprof works on multiple samples at once. Both A1.fq.sylphmpa and B1.fq.sylphmpa are output.