Advanced usage manual

Jim Shaw edited this page Oct 18, 2024 · 3 revisions

Preset parameter values

devider v0.0.1 offers four presets.

  • old-long-reads - ~90% accuracy (old technologies)
  • nanopore-r9 (default) - ~95% accuracy
  • nanopore-r10 - ~98% accuracy
  • hi-fi - high-fidelity reads with > 99.9% accuracy

The four presets affect the following parameters (discussed further below):

  • k: for more accurate technologies, higher -k (k-mer length) is allowed.
  • resolution: the more accurate the technology, the lower the --resolution parameter is.
  • SNP-downsampling (no option): we downsample SNPs if too many are present. This depends on the accuracy of the preset.

Important algorithmic parameters

Choosing -k automatically (default) or manually

  • -k - the length of the SNP-encoded k-mers.

If -k is not set, devider chooses it automatically by taking the 33rd percentile of the number of SNPs contained in each read. devider also ensures that the chosen -k does not span more than 3/4 of the reference. This works well in general.

For the different presets, we enforce an additional constraint: -k may be at most 10, 20, 35, and 100 for the four presets, in order of increasing accuracy. This avoids very long k-mers for noisy reads.
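The selection rule above can be sketched roughly as follows. This is an illustration only, not devider's actual code: the helper name `choose_k`, the nearest-rank percentile method, and the reading of "3/4 of the reference" as 3/4 of all SNP sites are assumptions made for the example.

```python
import math

# Maximum k per preset, as listed above (in order of increasing accuracy).
MAX_K = {"old-long-reads": 10, "nanopore-r9": 20, "nanopore-r10": 35, "hi-fi": 100}


def choose_k(snps_per_read, total_snps, preset="nanopore-r9"):
    """Hypothetical helper: pick k from the number of SNPs covered by each read."""
    counts = sorted(snps_per_read)
    # 33rd percentile by nearest rank (the exact percentile method is an assumption).
    rank = max(1, math.ceil(0.33 * len(counts)))
    k = counts[rank - 1]
    # Assumption: "3/4 of the reference" is approximated as 3/4 of all SNP sites.
    k = min(k, (3 * total_snps) // 4)
    # Preset-specific ceiling to avoid very long k-mers on noisy reads.
    k = min(k, MAX_K[preset])
    return max(k, 1)


if __name__ == "__main__":
    print(choose_k([5, 8, 12, 30, 40, 50], total_snps=200))
    print(choose_k([60] * 9, total_snps=400, preset="nanopore-r10"))
```

In the second call the reads cover many SNPs, so the preset ceiling (35 for nanopore-r10) is what determines the result.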

If you believe that -k is not chosen correctly, you can set -k to a specific value. This will bypass all of the automatic selection.

Coverage/abundance filtering parameters

  • --min-cov - minimum coverage of reported haplotypes
  • --min-abund - minimum abundance of reported haplotypes

Only haplotypes with coverage > --min-cov and abundance > --min-abund are reported. devider's coverage slightly underestimates the true coverage when the reads are noisy.

Abundance is calculated as the normalized coverage times 100.

Note

The abundances across all reported haplotypes may not sum to 100% if low-coverage or low-abundance haplotypes are filtered out.
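A minimal sketch of the filtering described above, assuming that "normalized coverage" means each haplotype's coverage divided by the total coverage over all haplotypes before filtering. The function name `report_haplotypes` and the example values are hypothetical.

```python
def report_haplotypes(coverages, min_cov, min_abund):
    """Return {haplotype: abundance} for haplotypes passing both filters.

    Assumption: abundance = 100 * coverage / (sum of all coverages),
    computed before any haplotype is filtered out.
    """
    total = sum(coverages.values())
    if total == 0:
        return {}
    reported = {}
    for name, cov in coverages.items():
        abund = 100.0 * cov / total
        # Both thresholds are strict, matching the ">" in the text above.
        if cov > min_cov and abund > min_abund:
            reported[name] = abund
    return reported


if __name__ == "__main__":
    covs = {"hap1": 60.0, "hap2": 35.0, "hap3": 5.0}
    kept = report_haplotypes(covs, min_cov=10, min_abund=0)
    print(kept, sum(kept.values()))
```

Here hap3 is dropped by the coverage filter, so the reported abundances sum to 95 rather than 100.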

The resolution parameter

  • --resolution - haplotypes that differ by a fraction of SNPs less than --resolution are merged.

The resolution is set to 0.02, 0.01, 0.005, and 0.001 for the four presets (in order of increasing accuracy). So for nanopore-r9 (resolution 0.01), two haplotypes that differ by fewer than 1 SNP per 100 SNPs are merged.
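The merge criterion can be illustrated with a small sketch. It assumes haplotypes are compared as equal-length strings of SNP alleles and that the comparison is strict ("less than"), as the option description states; the function names are hypothetical and this is not devider's implementation.

```python
# Preset resolutions as listed above (in order of increasing accuracy).
RESOLUTION = {
    "old-long-reads": 0.02,
    "nanopore-r9": 0.01,
    "nanopore-r10": 0.005,
    "hi-fi": 0.001,
}


def differing_fraction(hap_a, hap_b):
    """Fraction of SNP positions at which two haplotypes carry different alleles."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP positions")
    diffs = sum(1 for a, b in zip(hap_a, hap_b) if a != b)
    return diffs / len(hap_a)


def should_merge(hap_a, hap_b, resolution):
    """Merge when the differing fraction is strictly below the resolution."""
    return differing_fraction(hap_a, hap_b) < resolution


if __name__ == "__main__":
    ref = "0" * 200
    one_diff = "1" + "0" * 199   # 1 of 200 SNPs differs (0.005)
    two_diff = "11" + "0" * 198  # 2 of 200 SNPs differ (0.01)
    print(should_merge(ref, one_diff, RESOLUTION["nanopore-r9"]))
    print(should_merge(ref, two_diff, RESOLUTION["nanopore-r9"]))
```

Under this strict reading, a resolution of 0 never triggers a merge, because no fraction is below 0.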

If you truly believe there are very similar haplotypes, you can set --resolution to 0. However, systematic errors in long-read sequencing (e.g. methylation, homopolymer errors, context-specific errors) are inevitable, so you should be careful.

Strand-bias filters for SNP calls

  • --strand-bias-fdr - SNPs are filtered out if (1) their FDR-adjusted p-value (Fisher's exact test) is < --strand-bias-fdr and (2) the odds ratio of the 2x2 table is > 1.5 or < 1/1.5.

Strand-specific systematic errors can lead to false SNP calls. This is the leading cause of false SNPs for nanopore sequencing, so it is crucial that these false SNPs are filtered out.
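To make the two conditions concrete, here is a self-contained sketch. Several details are assumptions for illustration and may differ from devider: the 2x2 table is taken to be (ref forward, ref reverse, alt forward, alt reverse) read counts, the FDR adjustment is Benjamini-Hochberg, the test is two-sided, and an odds ratio with a zero denominator is treated as infinite. The FDR value 0.005 in the example is arbitrary.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Sum all tables that are at most as likely as the observed one.
    return min(1.0, sum(q for q in probs if q <= p_obs * (1 + 1e-9)))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c); zero denominators are handled by assumption."""
    if b * c == 0:
        return float("inf") if a * d > 0 else 1.0
    return (a * d) / (b * c)


def strand_biased(tables, fdr, min_odds=1.5):
    """Flag each SNP whose table meets both filter conditions."""
    adj = benjamini_hochberg([fisher_exact_two_sided(*t) for t in tables])
    flags = []
    for table, q in zip(tables, adj):
        ratio = odds_ratio(*table)
        flags.append(q < fdr and (ratio > min_odds or ratio < 1 / min_odds))
    return flags


if __name__ == "__main__":
    # First SNP: balanced across strands. Second SNP: alt allele on one strand only.
    tables = [(20, 20, 20, 20), (40, 0, 0, 40)]
    print(strand_biased(tables, fdr=0.005))
```

With a threshold of 0, no adjusted p-value can be below it, so nothing is flagged in this sketch.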

If you have a VCF file that is already filtered, you can turn this filtering off with --strand-bias-fdr 0.

Mapping filtering parameters

  • --mapq-cutoff: only consider primary alignments with MAPQ > --mapq-cutoff.
  • --supp-mapq-cutoff: only use supplementary alignments if MAPQ > --supp-mapq-cutoff.
  • --dont-use-supp-aln: don't use supplementary alignments.
  • --min-qual: only consider bases with base quality > --min-qual.

These parameters are straightforward. Use higher MAPQ cutoffs if you want more stringent alignments.

Note that we always filter out secondary alignments; if you want to use secondary alignments, you will have to change them to supplementary alignments manually.
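The sketch below illustrates how these filters could interact. It is not devider's code. It assumes that raising a cutoff is more stringent, so an alignment passes when its MAPQ exceeds the cutoff (whether the comparison is strict is an assumption), and it uses the standard SAM flag bits 0x100 (secondary) and 0x800 (supplementary). The `Aln` class, the function names, and the cutoff values in the example are hypothetical.

```python
from dataclasses import dataclass

SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


@dataclass
class Aln:
    """Hypothetical minimal alignment record: SAM flag and mapping quality."""
    flag: int
    mapq: int


def keep_alignment(aln, mapq_cutoff, supp_mapq_cutoff, use_supp=True):
    """Decide whether an alignment survives the mapping filters."""
    if aln.flag & SECONDARY:
        return False  # secondary alignments are always filtered out
    if aln.flag & SUPPLEMENTARY:
        # use_supp=False corresponds to --dont-use-supp-aln.
        return use_supp and aln.mapq > supp_mapq_cutoff
    return aln.mapq > mapq_cutoff


def secondary_to_supplementary(flag):
    """Rewrite a secondary flag as supplementary, leaving other bits untouched."""
    if flag & SECONDARY:
        return (flag & ~SECONDARY) | SUPPLEMENTARY
    return flag


if __name__ == "__main__":
    alns = [Aln(0, 60), Aln(SECONDARY, 60), Aln(SUPPLEMENTARY, 30)]
    print([keep_alignment(a, mapq_cutoff=15, supp_mapq_cutoff=40) for a in alns])
    print(hex(secondary_to_supplementary(0x110)))
```

The flag rewrite is the blunt manual workaround mentioned above; in practice it would be applied to the BAM records before running devider.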