Phasing and scaffolding polyploid genomes based on Pore-C, Ultra-long, or Hi-C data
.One of the major problems with Hi-C scaffolding of polyploid genomes is a large proportion of ambiguous short-read mapping, leading to a high-level of switched or chimeric assemblies. Now, the long-read-based chromosome conformation capture technology, e.g., Pore-C, provides an effective way to overcome this problem. Here, we developed a new pipeline, namely C-Phasing
, which is specifically tailored for polyploid phasing by leveraging the advantage of Pore-C data. It also works on Hi-C data and diploid genome assembly.
The advantages of C-Phasing
:
- High speed.
- High anchor rate of genome.
- High accuracy of polyploid phasing.
## Download C-Phasing and install all dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
conda env create -f environment.yml
conda activate cphasing
## Add these command into .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH
export PYTHONPATH=/path/to/CPhasing:$PYTHONPATH
## Download C-Phasing and install python dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
pip install .
## Add these into .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH
The -n 8:4
parameter of the following commands means assembling a tetraploid (4) with 8 chromosome basic numbers. If you set -n 0:0
means partition in both rounds automatically, also support it set to -n 8:0
or -n 0:4
.
- Start from a pore-c data
cphasing pipeline -f draft.asm.fasta -pcd sample.fastq.gz -t 10 -n "8:4"
- Start from multiple pore-c data: specified multiple
-pcd
parameters.
cphasing pipeline -f draft.asm.fasta -pcd sample1.fastq.gz -pcd sample2.fastq.gz -t 10 -n "8:4"
Note: If you want to run on cluster system and submit them to multiple nodes, you can use cphasing mapper
and cphasing-rs porec-merge
to generate the merged porec.gz
file and input it by -pct
parameter.
- Start from a pore-c table (porec.gz), which is generated by
cphasing mapper
.
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -n "8:4"
- Start from a paired-end Hi-C data
cphasing pipeline -f draft.asm.fasta -hic1 Lib_R1.fastq.gz -hic2 Lib_R2.fastq.gz -t 10 -n "8:4"
Note: If you want to run multiple samples, you can use cphasing hic mapper
and cphasing-rs pairs-merge
to generate the merged pairs.gz
file, and input it by -prs
parameter.
- Start from a 4DN pairs file,
cphasing pipeline -f draft.asm.fasta -prs sample.pairs.gz -t 10 -n "8:4"
- Skip some steps
## skip steps 1.alleles and 2.prepare steps
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -ss 1,2
- Only run specified steps
## run 3.hyperpartition
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -s 3
- Add the
-hcr
parameter to remove the greedy contacts (several regions contact with the whole genome) to improve the phasing quality.
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -hcr
C-Phasing enable to use ultra-long to correct chimeric and identify the high confidence regions (HCRs) to help assembly.
-
mapping
Mapping pore-c data into contig-level assembly and output the pore-c table and 4DN pairs.
## results are `sample.porec.gz` and `sample.pairs.gz` cphasing mapper draft.asm.fasta sample.fastq.gz -t 10
For Hi-C data please use
cphasing hic mapper
.Note: If you are mapping multiple pore-c data, the multiple
pairs.gz
files should be merged by following steps:cphasing-rs pairs-merge sample1.pairs.gz sample2.pairs.gz -o sample.merge.pairs.gz
0_1. hcr (Optional)
Only retain high confidence regions (HCRs) to subsequencing analysis, which will remove greedy contacts (contact to whole genome or multiple chromosomes.)
## results are `sample.hcr.bed`, `sample.hcr.porec.gz` and `sample.hcr.pairs.gz`
cphasing -pct sample.porec.gz -cs drfat.asm.contigsizes
-
alleles (Optional for phasing mode)
Identify the inter-allelic contig pairs.
- Step1
alleles
## result is `draft.asm.allele.table` cphasing alleles -f draft.asm.fasta
Note: the memory usage of this step is depend on the genome size of input fasta, when input the length larger 10 Gb, the peak memory will reach 130 Gb, or more. If you want to reduce the memory usage, you can increase the
k
andw
, for example, when we benchmark a 20 Gb hexaploid, we set thek
andw
to 27 and 24, respectively. - Step1
-
prepare
Prepare some data for subsequence analysis
## results are `sample.counts_AAGCTT.txt`, `sample.clm.gz`, `sample.split.contacts`, `sample.contacts` cphasing prepare draft.asm.fasta sample.pairs.gz
-
hyperpartition
## result is `output.clusters.txt` ## for haploid scaffolding cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt ## for polyploid or diploid phasing must add prune information and use the incremental partition mode ### chromsome number aware, 8:4 indicate that this is a tetraploid with 8 chromosome in each haplotype cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -n 8:4 -t 4 ### auto generate groups cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -t 4
Note: If you are not satisfied with the number of groups, you can adjust the resolution parameters (
-r1
or-r2
) to make difference groups.-r1
controls the first cluster while-r2
controls the second cluster in phasing mode. Increasing the resolution can generate more groups while decreasing it will get fewer groups.
If you want to group some contigs, you can input a contig list with the-wl
parameter. -
scaffolding
## result is `groups.agp` cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm.gz -sc sample.split.contacts -f draft.asm.fasta -t 4 ## for polyploid can specified allele table to adjust the orientation of different haplotypes cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm -at draft.asm.allele.table -sc sample.split.contacts -f draft.asm.fasta
-
plot
Adjust the contig-level contacts matrix into chromosome-level and plot a heatmap.
## result is `sample.10000.cool` and `groups.wg.png` cphasing pairs2cool sample.pairs.gz draft.asm.contigsizes sample.10000.cool cphasing plot -a groups.agp -m sample.10000.cool -o groups.wg.png
plot
also enable only plot the heatmap from a matrix file (.cool)## directly plot the heatmap in the input resolution, cphasing plot -m sample.500k.chrom.cool -o sample.500k.chrom.png ## or first coarsen small binsize to larger binsize, e.g. 10k to 500k, specified the `--coarsen` and `-k 50` parameters. cphasing plot -m sample.10k.chrom.cool -o sample.500k.chrom.png --coarsen -k 50
Note: If the number of bins in matrix is too large (large genome with small binsize), the memory may overflow. To fix it, users can improve the binsize or specified several chromosomes with
-c
parameter.cphasing plot -m sample.100k.chrom -c Chr01g1,Chr01g2,Chr01g3,Chr01g4 -o Chr01.100k.png
- generate
.assembly
and.hic
, depend on 3d-dna
cphasing pairs2mnd sample.pairs.gz -o sample.mnd.txt
cphasing utils agp2assembly groups.agp > groups.assembly
bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh sample.assembly sample.mnd.txt
Note: if chimeric corrected, please use groups.corrected.agp
and generate a new corrected.pairs.gz
by cphasing-rs pairs-break
- After curation
## convert assembly to agp
cphasing utils assembly2agp groups.review.assembly -n 8:4
## or haploid or a homologous group
cphasing utils assembly2agp groups.review.assembly -n 8
## extract contigs from agp
cphasing agp2fasta groups.review.agp draft.asm.fasta --contigs > contigs.fasta
## extract chromosome-level fasta from agp
cphasing agp2fasta groups.review.agp draft.asm.fasta > groups.review.asm.fasta
Rename and orient chromosome according a monoploid reference (or genome of closely related species).
cphasing rename -r mono.fasta -f draft.asm.fasta -a groups.review.agp -t 20
Note: To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the scaffolding
step. If not, you can set —unphased
to align all haplotypes to the monoploid to adjust the orientation.
In our two-round partition algorithm, the first round partition depends on the h-trans errors between homologous chromosomes; if you input a contig assembly with low level switch errors or input a high accuracy pore-c data, the h-trans will be not enough to cluster all contigs to correct homologous groups, resulting in unsatisfactory results. You can set the -q1 0
for hyperpartition
to increase the rate of h-trans errors. However, this parameter may raise error of out of memory
when you input huge pore-c data in porec table or hic contacts in pairs file.
The aneuploid genome, such as modern cultivated sugarcane, contains unequal homologous chromosomes. The -n
parameter can be set to zero (-n 10:0
) to automatically partition contigs into different chromosomes within a homologous group.
However, we also allow the user to input a file with two columns: the first column is the index(1-base) of the first round partition, and the second column is the chromosome number of each homologous. And then specified the -n 10:second.number.tsv
in cphasing pipeline
or cphasing hyperpartition
.
second.number.tsv
1 13
2 12
3 12
4 11
5 10
6 12
7 12
8 10
9 12
10 12