Skip to content

smudgeplot hetkmers

Kamil S. Jaron edited this page Feb 23, 2021 · 5 revisions

This is an algorithm that extracts kmer pairs from kmer dump. Although the algorithm does not take any parameters, it's important to realize that the output is heavily dependent on choice of L and U when the kmer dump was generated, look at wikipage chosing L and U for details.

Usage

usage: smudgeplot.py hetkmers [-h] [-o O] [--middle] [infile]

Calculate unique kmer pairs from a Jellyfish or KMC dump file.

positional arguments:
  infile      Alphabetically sorted Jellyfish or KMC dump file (stdin).

optional arguments:
  -h, --help  show this help message and exit
  -o O        The pattern used to name the output (kmerpairs).
  --middle    Get all kmer pairs that are exactly the same but in the middle nt. When this flag is used, the input dump must be alphabetically sorted/ (default: different by SNP at any position).

Memory requirements

The memory required scale linearly with the number of kmers and it is approximately 15x higher than the size of the dump file (for a 20Gb dump file you will need approx ~250Gb of RAM). Alternatively, you can estimate the RAM requirement by number of dumped kmers. It's approximately 350x higher than number of kmers in the dump file. If your file has too many kmers you can decrease computational requirement by rerunning the kmer spectra with a smaller k (i.e. kmer size) or by more strict filtering of the dumped kmers (higher L and smaller U).

We have not calculated the complexity of the algorithm yet. Usually for smaller genomes (<250Gb) it's couple of hours, the longest computation took bit more than one day.

The biggest genome we analyzed so far was a triplod genome with a haploid size 3.5Gbp. We have processed 1.5e9 genomic kmers and it have required 520GB of memory and two days of computation on eight cores.

--middle

The default algorithm extracts kmers pairs different by any nt, which mean that a heterozygous SNP might get into k kmer pairs. Here we simplify algorithm to extract only the kmer pairs that are different by the middle nucleotide. This means that we will miss all the heterozygous loci where the SNPs are closer than k / 2 nts from each other. Long story short, it seems that it moreless works, the preliminary results were very promising, but many species that were very heterozygous or just strange the middle kmer algorithm did not work, including our awesome strawberry example. Therefore I would say, everytime you have the resources to extract kmer pair with the default algorithm, do so.

This algorithm was introduced as a way how to speed up the extraction process and reduce the memory requirements to a moreless constant (it's less than 10G). And in this sense we works very well, we managed to process ~8.5Gbp genome in just a few hours. So if you feel like experimenting, you can fiddle with --middle.

One really important note, you need to have the kmer dump on input alphabetically sorted (kmc has not problem in generate the dump sorted).

Output

Two files are generated: <output_pattern>_coverages.tsv and <output_pattern>_sequences.tsv. The coverage file has the following format

136	538
236	436
202	212
221	393
...

The less covered kmer is always in the first column. This is important as the R part of the program assumes this coverage sorting when calculating normalized minor kmer coverage (simply because R is not the fastest program to sort too many kmers). The file with sequences contains corresponding kmer sequences.