
Genome build

Haibao Tang edited this page May 22, 2024 · 10 revisions

We have included a suite of tools to check the quality of a genome build: a genome size survey, genetic map concordance and a Hi-C heatmap.

Tip

Download the test dataset here.

Genome size survey

The raw sequencing data provide a way to estimate the size, ploidy, heterozygosity and repeat content of a genome, similar to GenomeScope. Let's say that you have a kmer count histogram (commonly generated by Jellyfish or another kmer counter) in a file reads.histo.

1 1281576854
2 89292133
3 21588481
4 9347716
5 5569400
6 4705214

The 1st column is the frequency of a kmer in the sequencing data, and the 2nd column is the number of distinct kmers with that frequency. From this histogram it is easy to infer all the genome statistics and annotate them directly on the plot.

python -m jcvi.assembly.kmer histogram reads.histo "*S. species* ‘Variety 1’" 21

This takes the kmer count histogram, the species name that goes in the title, and finally the kmer size K that was used to generate the histogram. Behind the scenes, a negative binomial mixture model is applied to approximate the various genome statistics, including the ploidy of the genome.

reads.png

You can then simply read the various genome statistics from the plot, and see that the genome is a tetraploid.
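For intuition about how a kmer histogram relates to genome size, here is a minimal sketch of a naive peak-based estimate. It is not the negative binomial mixture model that the histogram command fits, and it ignores ploidy and heterozygosity. The function names, the min_freq error cutoff and the toy histogram are assumptions made for illustration.

```python
def parse_histo(text):
    """Parse 'frequency count' lines into a list of (frequency, count) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), int(parts[1])))
    return rows


def naive_genome_size(rows, min_freq=5):
    """Naive estimate: total kmers above an error cutoff divided by the peak depth.

    min_freq is a hypothetical cutoff below which kmers are treated as
    sequencing errors; it is not a jcvi parameter.
    """
    solid = [(f, c) for f, c in rows if f >= min_freq]
    peak_freq, _ = max(solid, key=lambda fc: fc[1])  # most abundant frequency
    total_kmers = sum(f * c for f, c in solid)
    return total_kmers // peak_freq, peak_freq


# Made-up toy histogram, only to show the calculation.
toy = "1 1000\n2 100\n5 10\n10 50\n20 100\n30 50\n"
size, depth = naive_genome_size(parse_histo(toy))
print(size, depth)  # 202 20
```

On a real histogram the lowest-frequency rows are typically dominated by sequencing errors, which is why the sketch discards them before locating the peak.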

Genetic map concordance

After genome assembly, we would often like to perform quality control (QC). One such QC step is to compare the assembly to the genetic maps of the organism. Assume that you have the genetic map input matrix (MSTMap format) in the file geneticmap.matrix.

geneticmap.matrix.png

The first column indicates the marker position in the current genome assembly, in the format chr1.12345, and the following columns give the genotype of each mapping individual.
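As a small illustration of the chr1.12345 naming convention, the sketch below splits marker names into a chromosome and a position and groups them by chromosome. The helper names are hypothetical and not part of jcvi.

```python
from collections import defaultdict


def parse_marker(name):
    """Split a marker name such as 'chr1.12345' into ('chr1', 12345)."""
    # rsplit on the last dot so chromosome names that contain dots still work
    chrom, pos = name.rsplit(".", 1)
    return chrom, int(pos)


def group_by_chromosome(names):
    """Return {chromosome: sorted list of positions} for a list of marker names."""
    grouped = defaultdict(list)
    for name in names:
        chrom, pos = parse_marker(name)
        grouped[chrom].append(pos)
    return {chrom: sorted(positions) for chrom, positions in grouped.items()}


print(group_by_chromosome(["chr1.12345", "chr2.500", "chr1.100"]))
```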

The genetic map can then be visualized as a quality control heatmap with one command:

python -m jcvi.assembly.geneticmap heatmap geneticmap.matrix

geneticmap.subsample.png

Entries in the heatmap correspond to the linkage disequilibrium ($r^2$) between pairs of markers. From the heatmap, you can see discontinuities on chr4 and chr6, suggesting a potential mis-assembly (or possibly a rearrangement between the mapping parents).
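To make the $r^2$ values concrete, here is a minimal sketch that computes the squared correlation between the genotype rows of two markers. It assumes genotypes are coded as A, B and X (heterozygous), with any other symbol treated as missing. That coding and the function name are assumptions for illustration, and the heatmap command may compute its values differently.

```python
# Assumed numeric coding of genotype calls; anything else counts as missing.
CODES = {"A": 0.0, "X": 0.5, "B": 1.0}


def r_squared(geno_a, geno_b):
    """Squared Pearson correlation between two genotype sequences."""
    pairs = [
        (CODES[a], CODES[b])
        for a, b in zip(geno_a, geno_b)
        if a in CODES and b in CODES  # drop individuals missing either call
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a marker with no variation carries no linkage signal
    return cov * cov / (var_x * var_y)


print(r_squared("AABB", "AABB"))  # 1.0, perfectly linked
print(r_squared("AABB", "ABAB"))  # 0.0, unlinked
```

Markers that are adjacent in a correct assembly should give values near 1, so a sudden drop along the diagonal is what shows up as a discontinuity in the heatmap.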

Hi-C heatmap

Similarly, the genome quality can also be assessed using a Hi-C heatmap. Nowadays this is often more common than using a genetic map.

Assume that you have the Hi-C reads mapped to the genome assembly, in hic.bam.

python -m jcvi.assembly.hic bam2mat hic.bam

This will generate two files, hic.resolution_500000.npy and hic.resolution_500000.json, which can then be visualized.

python -m jcvi.assembly.hic heatmap \
    hic.resolution_500000.npy \
    hic.resolution_500000.json \
    --title="*S. species* Hi-C contact map" \
    --groups=groups

hic.resolution_500000.png
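The 500000 in the file names suggests a bin resolution of 500 kb. As a rough sketch of how genomic positions can be mapped onto the rows and columns of such a binned contact matrix, the code below lays chromosomes end to end and assigns each position a bin index. The layout jcvi actually stores in the .npy and .json files is not described here, so the function names, the chromosome sizes and the end-to-end ordering are all assumptions for illustration.

```python
def bin_offsets(chrom_sizes, resolution=500_000):
    """Return ({chrom: first bin index}, total bins) with chromosomes laid end to end."""
    offsets = {}
    total = 0
    for chrom, size in chrom_sizes:
        offsets[chrom] = total
        total += -(-size // resolution)  # ceiling division: partial bins count
    return offsets, total


def position_to_bin(chrom, pos, offsets, resolution=500_000):
    """Map a (chromosome, position) pair to its global bin index."""
    return offsets[chrom] + pos // resolution


# Hypothetical chromosome sizes, not taken from the test dataset.
sizes = [("Chr01_A", 1_200_000), ("Chr01_B", 900_000)]
offsets, total_bins = bin_offsets(sizes)
print(total_bins)                                      # 5
print(position_to_bin("Chr01_B", 600_000, offsets))    # 4
```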

Aside from the configurable title, the groups file controls whether certain chromosomes should be highlighted together with a specific color. For example,

Chr01_A,Chr01_B g
Chr02_A,Chr02_B g
Chr03_A,Chr03_B g

This allows Chr01_A and Chr01_B to be plotted together with a green (g) highlight.
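The groups file format shown above (comma-separated chromosome names, whitespace, then a color code) can be read with a few lines of Python. This is a sketch of the format as it appears in the example, with a hypothetical function name, and is not the parser jcvi uses.

```python
def parse_groups(text):
    """Parse groups lines into a list of (chromosome names, color) tuples."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        members, color = line.split()
        groups.append((members.split(","), color))
    return groups


example = "Chr01_A,Chr01_B g\nChr02_A,Chr02_B g\nChr03_A,Chr03_B g\n"
for members, color in parse_groups(example):
    print(members, color)
```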