Skip to content

2. Overall k mer evaluation

Arang Rhie edited this page Dec 11, 2024 · 9 revisions

Overall k-mer evaluation is performed using the k-mer multiplicity of illumina WGS reads.
Here, we re-evaluate the a. thaliana trio binned assembly (Koren et al., 2018), downloadable here: COL and CVI.
The FALCON-Unzip (Chin et al., 2016) assembly (F1 Col-0–Cvi-0) is also useful to see how the k-mer based evaluation reports pseudo-haplotype assemblies. For convenience, I renamed the fasta files to col.fasta / cvi.fasta for the trio binned assembly, and pri.fasta / alt.fasta for the FALCON-Unzip assembly.

Let’s make a dir for each assembly of interest.

mkdir triocanu_clr
mv col.fasta triocanu_clr/
mv cvi.fasta triocanu_clr/

mkdir falcon_unzip
mv pri.fasta falcon_unzip/
mv alt.fasta falcon_unzip/

Download the pre-built F1.k18.meryl.tar.gz and untar with tar zxf. This 18-mer db was generated from the F1 illumina reads from the FALCON-Unzip paper.

Link F1.k18.meryl and merqury.sh. Run it.

cd triocanu_clr
ln -s ../F1.k18.meryl
nohup ./merqury.sh F1.k18.meryl col.fasta cvi.fasta triocanu_clr > triocanu.log &

It takes a bit time (at most 8 hours on a 3Gb genome, 24 cores) when running on a single cluster.
Use _submit_merqury.sh for submitting to a cluster. Everything finishes within 2~4 hours.

1. Reference free QV estimation

Assuming the k-mers found only in the assembly being bp errors, we can use the k-mer survival rate to estimate base level QV.

head */*.qv
==> triocanu_clr/triocanu_.qv <==
col	682359	123511626	35.1183	0.000307729
cvi	506306	122349407	36.3761	0.00023035
Both	1188665	245861033	35.6991	0.00026921

Each column is showing:

  1. Assembly of interest. Both is the combination of the above two.
  2. Total (present) k-mers uniquely found only in the assembly
  3. Total (present) k-mers found in the assembly
  4. QV
  5. Error rate

These results suggest the cvi assembly has slightly higher QV (36.4) than the col assembly (35.1), but still are very similar.

2. Copy-number spectrum analysis

This provides the merqury version of spectra-cn plots, first introduced by KAT. In addition to the stacked histogram, Merqury generates unstacked line and filled histograms to enhance visibility of the shape. All runs at once, no need to re-name and concatenate fasta files for diploid assemblies.

Here are the copy number analysis outputs.

Combined spectra-cn

For both col and cvi assemblies:

triocanu_clr.spectra-cn.st.png	# st: stacked
triocanu_clr.spectra-cn.fl.png	# fl: filled
triocanu_clr.spectra-cn.ln.png	# ln: line

Stacked Filled Line

Per-assembly spectra-cn

COL haplotype assembly:

triocanu_clr.col.spectra-cn.st.png
triocanu_clr.col.spectra-cn.fl.png
triocanu_clr.col.spectra-cn.ln.png

Stacked Filled Line

CVI haplotype assembly:

triocanu_clr.cvi.spectra-cn.st.png
triocanu_clr.cvi.spectra-cn.fl.png
triocanu_clr.cvi.spectra-cn.ln.png

Stacked Filled Line

spectra-asm

The spectra-asm shows the distinct k-mers shared between col.fasta and cvi.fasta.

triocanu_clr.spectra-asm.st.png
triocanu_clr.spectra-asm.fl.png
triocanu_clr.spectra-asm.ln.png

Stacked Filled Line

3. k-mer completeness (recovery rate)

The k-mer completeness can be measured by the fraction of recovered solid k-mer. Solid k-mers are distinct k-mers filtered for erroneous low copy k-mers.

k-mer completeness = found solid k-mers in an assembly / solid k-mers in a read set

This is how the report looks like:

col	all	104975080	125303808	83.7764
cvi	all	104809523	125303808	83.6443
both	all	123134729	125303808	98.2689

Columns are

  1. Assembly
  2. k-mer set used for measuring completeness - all = read set (This gets expended with hap-mers later)
  3. solid k-mers in the assembly
  4. Total solid k-mers in the read set
  5. Completeness (%)

Here, both is a higher because each col and cvi assemblies contain haplotype specific part of the genome, as we can see from the spectra-asm plots.

4. Track error bases in the assembly

asm_only.bed and asm_only.wig files are provided to trace the bp errors. This is useful for finding mis-assembly sites, which usually involves low bp quality region.

Note: As of Merqury v1.3, asm_only.wig replaces asm_only.tdf (tiled data format to visualize on IGV). asm_only.tdf can be generated using IGVTools:

igvtools count ${asm}_only.bed ${asm}_only.tdf $ref.fai