sylph-tax - incorporating taxonomy into sylph
Note
This repo replaces the old sylph-utils scripts. sylph-tax is easier to download/install and use than sylph-utils.
Sylph is an efficient and accurate metagenome profiler. However, its output does not include taxonomic information. sylph-tax can turn sylph's TSV output into a taxonomic profile like those produced by Kraken or MetaPhlAn. It does this by using custom taxonomy files to annotate sylph's output.
The following pre-built sylph databases have available taxonomic annotations. Custom taxonomies can also be incorporated.
sylph-tax identifier | Database description | Clades |
---|---|---|
GTDB_r220 | GTDB-r220 (April 2024) | Prokaryote |
GTDB_r214 | GTDB-r214 (April 2023) | Prokaryote |
OceanDNA | OceanDNA - ocean MAGs from Nishimura & Yoshizawa | Prokaryote |
SoilSMAG | Soil MAGs (SMAG) from Ma et al. | Prokaryote |
FungiRefSeq-2024-07-25 | RefSeq fungi representative genomes collected on 2024-07-25 | Eukaryote |
TaraEukaryoticSMAG | TARA eukaryotic SMAGs from Delmont et al. | Eukaryote |
IMGVR_4.1 | IMG/VR 4.1 high-confidence viral OTU genomes | Virus |
# option 1: install with conda
conda install -c bioconda sylph-tax
# option 2: install from source with pip
git clone https://github.com/bluenote-1577/sylph-tax
cd sylph-tax
pip install .
Important
Please see this manual for more information on
- output format information
- how to create taxonomy metadata for customized genome databases
# download all taxonomy files (~50 MB)
sylph-tax download --download-to /any/folder
# incorporate GTDB-r220 and IMGVR-4.1 taxonomies into sylph's results
sylph-tax taxprof sylph_results/*.tsv -t GTDB_r220 IMGVR_4.1 -o output_prefix-
ls output_prefix-sample1.sylphmpa
ls output_prefix-sample2.sylphmpa
...
# merge multiple results
sylph-tax merge *.sylphmpa --column relative_abundance -o merged_abundance_file.tsv
sylph-tax download --download-to /my/folder/sylph_taxonomy_files/
- Downloads taxonomic annotation files (~50 MB; see here) to --download-to.
- This folder (must exist) can be wherever you want. Its location is written to ~/.config/sylph-tax/config.json.
- If you don't have access to $HOME, you can specify a custom location in the SYLPH_TAXONOMY_CONFIG environment variable. E.g. export SYLPH_TAXONOMY_CONFIG=/write_access_folder/sylph-tax-config.json.
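The config lookup described above can be sketched in Python. This is an illustration only, not sylph-tax's own code: `resolve_config_path` is a hypothetical helper, and the assumption that the environment variable takes precedence over the default path is ours.

```python
import os
from pathlib import Path


def resolve_config_path(environ=None):
    """Return the config file location following the order described above.

    Hypothetical helper (not part of sylph-tax). Assumes that
    SYLPH_TAXONOMY_CONFIG, when set, overrides the default path.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("SYLPH_TAXONOMY_CONFIG")
    if custom:
        return Path(custom)
    # Default location named in this README.
    return Path.home() / ".config" / "sylph-tax" / "config.json"


if __name__ == "__main__":
    print(resolve_config_path())
```

Passing a plain dict instead of `os.environ` makes it easy to check both branches without touching the real environment.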
sylph-tax taxprof sylph_results/*.tsv -o prefix_or_folder/ -t {sylph-tax identifier}
- sylph_results/*.tsv: outputs from sylph. The databases used for sylph must be the same as the -t option.
- -t/--taxonomy-metadata: A list of sylph-tax identifiers in the above table (e.g. GTDB_r220 or IMGVR_4.1). Multiple taxonomy metadata files can be input. Custom taxonomy files are also possible; see below.
- -o: prepends this prefix to all of the output files. One file is output per sample in sylph_output.tsv.
- -a/--annotate-virus-hosts: annotates found viral genomes with host information metadata (only available for IMGVR_4.1 right now).
- Output suffix is .sylphmpa.
Tip
In python/pandas, pd.read_csv('output.sylphmpa', sep='\t', comment='#') works.
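If pandas is not available, a similar result can be had with the standard library by skipping lines that start with `#`. The sketch below is a minimal example under stated assumptions: the file content is a made-up placeholder, and the column names (`clade_name`, `relative_abundance`) and the comment line are assumed rather than taken from the format specification. Check the manual for the real layout.

```python
import csv
import io

# Placeholder content for illustration only. The comment line and the
# column names are assumptions, not the documented .sylphmpa layout.
EXAMPLE = (
    "# header comment line (format assumed)\n"
    "clade_name\trelative_abundance\n"
    "d__Archaea\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0965\n"
)


def read_sylphmpa(handle):
    """Return rows as dicts, ignoring lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


rows = read_sylphmpa(io.StringIO(EXAMPLE))
for row in rows:
    print(row["clade_name"], float(row["relative_abundance"]))
```

For a real file, replace `io.StringIO(EXAMPLE)` with `open('output.sylphmpa', newline='')`.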
Merge multiple taxonomic profiles from sylph-tax taxprof into a TSV table
sylph-tax merge *.sylphmpa --column {ANI, relative_abundance, sequence_abundance} -o output_table.tsv
- *.sylphmpa: files that are outputs from sylph-tax taxprof.
- --column: can be ANI, relative abundance, or sequence abundance (see paper for difference between abundances).
- -o: output file in TSV format.
clade_name sample1.fastq.gz sample2.fastq.gz
d__Archaea 0.0 1.1
d__Archaea|p__Methanobacteriota 0.0 0.0965
...
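A merged table like the one above mixes all taxonomic ranks in one file, so a common next step is to keep a single rank. The following sketch assumes the table is tab-separated with a `clade_name` column followed by one column per sample, as in the example rows above; `rows_at_rank` is a hypothetical helper, not something sylph-tax provides.

```python
import csv
import io

# The two example rows shown above, written out as tab-separated text.
MERGED = (
    "clade_name\tsample1.fastq.gz\tsample2.fastq.gz\n"
    "d__Archaea\t0.0\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0\t0.0965\n"
)


def rows_at_rank(handle, rank_prefix):
    """Keep rows whose deepest clade starts with rank_prefix (e.g. 'p__').

    Hypothetical helper: returns {clade_name: {sample: value}}.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    selected = {}
    for row in reader:
        clade = row.pop("clade_name")
        # The last '|'-separated field is the most specific rank of the row.
        if clade.split("|")[-1].startswith(rank_prefix):
            selected[clade] = {sample: float(value) for sample, value in row.items()}
    return selected


phyla = rows_at_rank(io.StringIO(MERGED), "p__")
print(phyla)
```

Filtering on the last `|`-separated field avoids double counting, since each deeper row repeats its parent clades in its name.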