
Introduction to fairy

Jim Shaw edited this page Apr 1, 2024 · 3 revisions

Introduction - multi-sample coverage problem

After metagenomic assembly, the best-performing binning workflows align the reads from every sample against every assembly to obtain per-contig coverages. A binner such as MetaBAT2 then uses these coverages to generate metagenome-assembled genomes (MAGs).

Unfortunately, all-to-all alignment of samples to assemblies is very slow.
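To make the cost concrete, here is a minimal sketch of how the number of read-to-assembly alignment jobs grows with the number of samples. It assumes one assembly per sample, which is an illustrative simplification; the function names are hypothetical and not part of fairy.

```python
def per_sample_jobs(n_samples: int) -> int:
    """Jobs needed when each sample's reads are aligned only to its own assembly."""
    return n_samples


def all_to_all_jobs(n_samples: int) -> int:
    """Jobs needed when every sample's reads are aligned to every assembly.

    Assumption: exactly one assembly per sample, so there are n * n pairs.
    """
    return n_samples * n_samples


for n in (1, 10, 50, 100):
    print(f"{n} samples: {per_sample_jobs(n)} per-sample jobs, "
          f"{all_to_all_jobs(n)} all-to-all jobs")
```

Under this assumption the all-to-all workload grows quadratically, so going from 10 to 100 samples multiplies the number of jobs by 100 rather than 10.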

Fairy resolves this bottleneck by computing coverage with a fast, alignment-free k-mer method instead of aligning reads. Fairy's coverages are approximate, but they correlate with the coverages obtained from aligners, and fairy is 10-1000x faster than BWA for all-to-all coverage calculation.
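The general idea of alignment-free, k-mer-based coverage can be illustrated with a toy sketch. This is not fairy's actual implementation, whose internals are not described on this page; it is a simplified illustration that counts exact, forward-strand k-mers only, and all function names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_read_kmers(reads, k: int) -> Counter:
    """Count how often each k-mer occurs across all reads of one sample."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def contig_kmer_coverage(contig: str, read_counts: Counter, k: int):
    """Toy coverage estimate: median read k-mer count over the contig's k-mers."""
    hits = [read_counts.get(km, 0) for km in kmers(contig, k)]
    return median(hits) if hits else 0


# Made-up sequences, purely for illustration.
reads = ["ACGTTGCA"] * 3
counts = count_read_kmers(reads, k=4)
print(contig_kmer_coverage("ACGTTGCA", counts, k=4))  # contig present in the reads
print(contig_kmer_coverage("GGGGCCCC", counts, k=4))  # contig absent from the reads
```

No read is ever aligned: each sample is reduced to a k-mer count table once, and each contig is scored by table lookups. A real tool would also need to handle reverse complements, sequencing errors, and memory use, which this sketch ignores.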

Important: fairy is designed for multi-sample usage with short reads or nanopore reads. Do not use fairy for single-sample binning.

Short-reads

Fairy seems to be comparable to BWA for multi-sample binning (roughly a +5% to -15% change in sensitivity). Preliminary testing indicates that fairy may perform as well as (and sometimes better than) BWA on host-associated datasets and slightly worse (but still usable) on environmental datasets.

Long-reads

Non-HiFi: For simplex nanopore reads, fairy seems to be comparable to minimap2.

HiFi (strain-resolved assemblies): Fairy is worse than minimap2 when reads with >99.9% identity are used to produce strain-resolved assemblies (e.g. with hifiasm or meta-mdbg).

Results

(A) Number of bins at the indicated contamination/completeness thresholds for different environments and numbers of samples (n). (B) Ratio (# fairy bins)/(# BWA bins), counting only bins that are > 50% complete and < 5% contaminated, for several binners.
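The ratio in panel (B) can be computed as in the following minimal sketch. It assumes each bin is represented as a (completeness %, contamination %) pair; the function names are hypothetical and the numbers are made-up placeholders, not results from fairy or BWA.

```python
def is_quality_bin(completeness: float, contamination: float) -> bool:
    """Panel (B) threshold: > 50% complete and < 5% contaminated."""
    return completeness > 50.0 and contamination < 5.0


def count_quality_bins(bins) -> int:
    """Count the bins that pass the completeness/contamination threshold."""
    return sum(1 for comp, cont in bins if is_quality_bin(comp, cont))


def bin_ratio(fairy_bins, bwa_bins) -> float:
    """(# fairy bins) / (# BWA bins) among bins passing the threshold."""
    n_bwa = count_quality_bins(bwa_bins)
    if n_bwa == 0:
        raise ValueError("no BWA bins pass the quality threshold")
    return count_quality_bins(fairy_bins) / n_bwa


# Placeholder (completeness %, contamination %) values, for illustration only.
fairy_bins = [(95.0, 1.0), (72.0, 3.0), (55.0, 6.0), (61.0, 4.5)]
bwa_bins = [(96.0, 1.0), (70.0, 2.0), (60.0, 4.0), (52.0, 0.5), (45.0, 2.0)]
print(bin_ratio(fairy_bins, bwa_bins))
```

A ratio of 1 means both coverage methods yield the same number of qualifying bins; values below 1 mean fewer bins with fairy coverages, and values above 1 mean more.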