PacBio Assembly Tool Suite: Reads in ⇨ Assembly out
- Availability
- Scope
- General Overview
- What's New in PB Assembly
- Usage
- Configuration
- Example Data Set
- Tutorial
- FAQ
- Acknowledgements
- Citations
- Disclaimer
The latest pre-release, experts-only linux/mac binaries can be installed via bioconda.
conda install pb-assembly
Alternatively, if you don't have administrator access you can install the pb-assembly suite into an environment in your $HOME directory
conda create -n denovo_asm
source activate denovo_asm
conda install pb-assembly
To updates the code run:
conda update --all
Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.
Latest release notes can be found here.
GitHub issues can be filed here for problems and questions.
FALCON has been update to be compatibile with HiFi data. Please refer to this documentation for more information.
If you are looking for a GUI based long read de novo genome assembler, you are urged to learn about HGAP4
pb-assembly is the bioconda recipe encompassing all code and dependencies necessary to run:
- FALCON assembly pipeline
- FALCON-Unzip to phase the genome and perform phased-polishing with Arrow
- FALCON-Phase to extend phasing between unzipped haplotig blocks (requires HiC data)
Installed package recipes include:
- pb-falcon
- pb-dazzler
- genomicconsensus
- etc (all other dependencies)
FALCON and FALCON-Unzip are de novo genome assemblers for PacBio long reads, also known as Single-Molecule Real-Time (SMRT) sequences. FALCON is a diploid-aware assembler which follows the hierarchical genome assembly process (HGAP) and is optimized for large genome assembly though microbial genomes can also be assembled. FALCON produces a set of primary contigs (p-contigs) as the primary assembly and a set of associate contigs (a-contigs) which represent divergent allelic variants. Each a-contig is associated with a homologous genomic region on an p-contig.
FALCON-Unzip is a true diploid assembler. It takes the contigs from FALCON and phases the reads based on heterozygous SNPs identified in the initial assembly. It then produces a set of partially-phased primary contigs and fully-phased haplotigs which represent divergent haplotypes.
This method maps HiC data to the FALCON-Unzip assembly to fix phase switches between haplotigs within primary contigs. Read the preprint and manual. A stand-alone version 1 of the software is available through our co-developer, Phase Genomics.
Assembly with PacBio data uses the hierarchical genome assembly process (HGAP). The first round is pre-assembly or error correcction of the long reads. This involves
the selection of seed reads or the longest reads in the dataset (user-defined length_cutoff
). All
shorter reads are aligned to the seed reads, in order to generate consensus sequences with high accuracy.
We refer to these as pre-assembled reads and they can also be thought of as “error corrected” reads. Preassembled
reads tend to have accuracy > 99%.
In the next round of HGAP, the preads are aligned to each other and assembled into genomic contigs.
Assembly is typically followed by a round of polishing where all raw PacBio subreads are aligned to the draft contigs and genomic consensus is performed. Polishing greatly increases base quality of the assembly.
For more complex genomes assembled with FALCON, “bubbles” in the contig-assembly graph that result from structural variation between haplotypes may be resolved as associate and primary contigs. The unzip process will extend haplotype phasing beyond “bubble” regions, increasing the amount of phased contig sequence. It is important to note that while individual haplotype blocks are phased, phasing does not extend between haplotigs. Thus, in part C) of the figure below, haplotig_1 and haplotig_2 may originate from different parental haplotypes. Additional information is needed to phase the haplotype blocks with each other such as Hi-C, see FALCON-Phase as a method for extended phasing.
FALCON-Phase involves processing the FALCON-Unzip contigs into unzipped blocks (haplotigs pairs) and collapsed haplotypes and then mappin the HiC data in order to correctly separate the unzipped regions into pahses.
FALCON and FALCON-Unzip are now compatibile with HiFi data. Please refer to this documentation for details on how to run it.
See the latest release notes for the bioconda distribution here.
-
gcpp instead of GenomicConsensus for polishing (Falcon_unzip 1.2.0)
-
Use all subreads for polishing (Falcon_unzip 1.1.5) We used to use only 1 per zmw, same as assembly typically. Chemistry v3+ has longer polymerase reads, resulting in multiple passes of library inserts in many cases. Read more here. Config: [Unzip]polish_include_zmw_all_subreads is "true"
-
Use pbmm2 instead of blasr by default (Falcon_unzip 1.1.5) Config: [Unzip]polish_use_blasr = true for blasr
-
Repeat Masking Integration of Tandem repeat masking (done) and general repeat masking (done)
-
New! GFA and Placement Files -GFA-1 and GFA-2 output for assembly graphs -placement files for associate contigs (contig.gfa2)
-
Increased Accuracy of Associate Contigs -algorithm and alignment improvements (Edlib integration)
-
Performance Improvements -general workflow and resource specification improvements -easier integration of future features with Pbsmrtpipe
-
Improved Haplotig Extraction -algorithm and data structure improvements reduce haplotype switching and improve extraction -can now handle circular contigs!
-
New! Placement Files -haplotig placement (PAF format) generated in 3-unzip stage
-
Performance Improvements -use of minimap2 instead of BLASR for phasing in Unzip significantly reduces runtime -unzipping and polishing now part of single workflow -reduced memory requirements
- New integration into pb-assembly pipeline
fc_run fc_run.cfg
fc_unzip.py fc_unzip.cfg
Running with HiFi data:
fc_unzip.py --target='ccs' fc_unzip.cfg
fc_phase.py fc_phase.cfg
Both FALCON and FALCON-Unzip take a config file as their only input parameter.
Here is a sample fc_run.cfg that was designed to work with the 200kb test case found below.
Here is a sample fc_run.cfg that was used with a recent ~2.9Gb human genome assembly.
Example of FALCON configuration for HiFi data: fc_run_HiFi.cfg
Up to date configuration info can be found in the wiki.
Detailed explanations for each pipeline stage are below:
The FALCON pipeline has three main steps which occur in distinct directories:
Subdirectory | Description |
---|---|
0-rawreads |
raw read overlapping and consensus calling, also known as pre-assembly |
1-preads_ovl |
pre-assembled read overlapping or pread overlapping |
2-asm-falcon |
contig assembly |
Many of the tools that comprise the FALCON Assembly pipeline were written by Gene Myers and are extensively documented at his dazzlerblog.
Below is a breakdown of the configuration options available to FALCON:
[General]
input_fofn=input.fofn
input_type=raw
pa_DBdust_option=
pa_fasta_filter_option=streamed-median
Your list of paths to the input fasta files is specified in your input_fofn
and your input_type
can be
either raw
or preads
. If specifying preads
, the pipeline will skip the entire 0-rawreads
pre-assembly phase.
By default, dusting is turned on and is run after generating the raw read database with default options as
recommended by Gene Meyer's. If you wish to modify your
dusting parameters you can set the
flag pa_DBdust_option
.
Filtering options for your input data for pre-assembly can also be set with the pa_fasta_filter_option
flag.
The default is streamed-internal-median
which uses the median-length subread for each
ZMW (sequencing reaction well). Choosing the longest
subread can lead to an enrichment in chimeric molecules. Users will rarely need to change this option from the
default.
Recognized values are described below.
Value | Setting |
---|---|
pass | The no-op filter - passes every FASTA record to the database. |
streamed-median | Applies the median-length ZMW filter by running a single-pass over the data. The input subreads should be groupped by ZMW. |
streamed-internal-median | Applies the median-length ZMW filter only on internal subreads (ZMWs with >= 3 subreads) by running a single pass over the data. The input subreads should be groupped by ZMW. For ZMWs with < 3 subreads, the maximum-length one is selected. |
# large genomes
pa_DBsplit_option=-x500 -s200
ovlp_DBsplit_option=-x500 -s200
# small genomes (<10Mb)
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50
For the first and second stages of FALCON, the data needs to be read in to a
dazzler DB. The -x
flag filters
reads smaller than what's specified while the -s
flag controls the size of DB blocks. The -a
option should
not be used here in conjunction with pa_fasta_filter_option=pass
as it uses all reads per ZMW which can lead
to errors is preassembly.
pa_HPCTANmask_option=
pa_REPmask_code=0,300;0,300;0,300
Repeat masking occurs in two phases, Tandem and Interspersed. Tandem repeat masking is run
with a modified version of daligner
called datander
and thus uses a similar
parameter set. Whatever settings you use
for pre-assembly daligner overlapping in the next section (pa_daligner_option
) will be used here for tandem
repeat masking. You can supply additional arguments for tandem repeat masking that will be passed to HPC.TANmask
with the pa_HPCTANmask_option
.
The second phase of masking deals with interspersed repeats and can be run in up to 3 iterations specified with the
pa_REPmask_code
option. The parameters needed for each iteration are both the group size and coverage specified
as group,coverage
pairs separated by semicolons as seen above.
For information and theory on how to set up your rounds of repeat masking, consult this blog post.
genome_size=1000000000
seed_coverage=30
length_cutoff=-1
pa_HPCdaligner_option=-v -B128 -M24
pa_daligner_option=-e0.8 -l2000 -k18 -h480 -w8 -s100
falcon_sense_option=--output-multi --min-idt 0.70 --min-cov 3 --max-n-read 400
falcon_sense_greedy=False
During pre-assembly, the PacBio subreads are aligned and error correction is performed. The longest subreads are
chosen as seed reads and all shorter reads are aligned to them and consensus sequences are generated from the
alignments. These consensus sequences are called pre-assembled reads or preads
and generally have accuracy
greater than 99% or QV20.
If you wish to auto-calculate your seed read coverage, then it's necessary to enter your genome_size
in base
pairs, the desired seed_coverage
as well as set length_cutoff=-1
to force the auto-calculation. We generally
recommend 20-40x
seed coverage. Alternatively, if you don't know your genome size, are unsure of the seed_coverage
you would like to use or if you would rather just leverage all reads above a specific length, you can use the the
length_cutoff
flag to manually set that limit. It's important to note that whatever value length_cutoff
gets set
to is a limit that carries through to the unzipping algorithm, and any reads smaller than that cutoff will not be
used for phasing. For assembly alone, there is likely no harm in setting a high length_cutoff
, unless you are
expecting a certain feature like micro chromosomes or short circular plasmids. Howevere, if you are planning to unzip,
then you will be artificially limiting your phasing dataset and it's probably in your interest to have a
lower length_cutoff
. The majority of computation occurs in preassembly so if compute time is important to you,
increasing length_cutoff
will increase efficiency but with the tradeoffs described above.
Overlap options for daligner
are set with the pa_HPCdaligner_option
and pa_daligner_option
flags. Previous
versions of FALCON had a single parameter. This is now split into two flags, one that affects requested
resources pa_HPCdaligner_option
and one that affects the overlap search pa_daligner_option
.
For pa_HPCdaligner_option
, the -v
parameter is passed to the LAsort
and LAmerge
programs while -B
and -M
parameters are passed to the daligner
sub-commands.
To understand the theory and how to configure daligner
see
this blog post
and this command reference guide.
For daligner
, in general we recommend the following:
-e
: average correlation rate (average sequence identity)
0.70
(low quality data) - 0.80
(high quality data). A higher value will help prevent haplotype collapse.
-l
: minimum length of overlap
1000
(shorter library) - 5000
(longer library)
-k
: kmer size
14
(low quality data) - 18
(high quality data)
Lower values of -k
have higher sensitivity at the tradeoff of increased diskspace, memory consumption and
slower run time and tend to work best with lower quality data. In contrast, a larger kmer value for -k
has a
higher specificity, uses less system resources and runs faster, but will only be suitable for high quality data.
You can configure basic pre-assembly consensus calling options with the falcon_sense_option
flag. The --output-multi
flag is necessary for generating proper fasta headers and should not be removed unless your specific use case requires
it. The parameters --min-idt
, --min-cov
and --max-n-read
set the minimum alignment identity, minimum
coverage necessary and max number of reads, respectively, for calling consensus to make the preads.
By default, -fo
are the parameters passed to LA4Falcon
. The option falcon_sense_greedy
changes this
parameter set to -fog
which essentially attempts to maintain relative information between reads that have
been broken due to regions of low quality.
ovlp_daligner_option=-e.96 -s1000 -h60
ovlp_HPCdaligner_option=-v -M24 -l500
The second phase of error-corrected read overlapping occurs in a similar fashion to the overlapping performed in the pre-assembly, however no repeat masking is performed and no consensus is called. Overlaps are identified and fed into the final assembly. The parameter options work the same way as described above in the pre-assembly section.
Recommendation for preads:
-e
: average correlation rate (average sequence identity)
0.93
(inbred) - 0.96
(outbred)
-l
: minimum length of overlap
1800
(poor preassembly, short/low quality library) - 6000
(long, high quality library)
-k
: kmer size
18
(low quality) - 24
(most cases)
overlap_filtering_setting=--max-diff 100 --max-cov 100 --min-cov 2
fc_ovlp_to_graph_option=
length_cutoff_pr=1000
The option overlap_filter_setting
allows setting criteria for filtering pread overlaps.
--max-diff
filters overlaps that have a coverage difference between the 5' and 3' ends larger than specified.
--max-cov
filters highly represented overlaps typically caused by contaminants or repeats and --min-cov
allows
specification of a minimum overlap coverage. Setting --min-cov
too low will allow more overlaps to be detected
at the expense of additional chimeric / mis-assemblies.
length_cutoff_pr
is the minimum length of pre-assembled preads used for the final assembly. Typically, this value
is set to allow for approximately 15 to 30-fold coverage of corrected reads in the final assembly.
Additional configuration options that don't necessarily fit into one of the previous categories are described here.
target=assembly
skip_checks=False
LA4Falcon_preload=false
FALCON can be configured to stop after any of its three stages with the target
flag set to either
overlapping
, pre-assembly
or assembly
. Each option will stop the pipeline at the end of its corresponding
stage, 0-rawreads
, 1-preads_ovl
or 2-asm-falcon
respectively. The default is full assembly
pipeline.
The flag skip_checks
disables .las
file checks with LAcheck
which has been known to cause errors on certain
systems in the past.
The option LA4Falcon_preload
passes the -P
parameter to LA4Falcon
which causes all the reads to be loaded
into memory. On slow filesystems this can speed things up significantly; but it will dramatically increase the
memory requirement for the consensus stage.
[job.defaults]
job_type=sge
pwatcher_type=blocking
JOB_QUEUE = default
MB = 32768
NPROC = 6
njobs = 32
submit = qsub -S /bin/bash -sync y -V \
-q ${JOB_QUEUE} \
-N ${JOB_NAME} \
-o "${JOB_STDOUT}" \
-e "${JOB_STDERR}" \
-pe smp ${NPROC} \
-l h_vmem=${MB}M \
"${JOB_SCRIPT}"
[job.step.da]
NPROC=4
MB=49152
njobs=240
Default job configuration options are specified in the [job.defaults]
section of your config file. The first option
you should set that controls the flow of your job is the process watcher, pwatcher_type
. There two possible values
you can set, either blocking
or fs_based
. fs_based
is the default and relies on the pipeline polling the file
system periodically to determine whether a sentinel
file has appeared that would signal the pipeline to continue.
The other option is to use a blocking
process watcher which can help with systems that have issues with filesystem
latency. In this case, the end of the job is determined by the finishing of the system call, rather than by file
system polling.
The next most important option is the job_type
. Allowed values are sge
, pbs
, torque
, slurm
, lsf
and local
.
If running on a cluster, you need to configure the submit
string to work with your job scheduler. The submit
string in the sample above is a tested and working SGE submit string.
Some example job_type
configurations and submit
string are listed for convenience in the following table.
You may need to modify some of the parameters to work with your job scheduler.
job_type | submit |
---|---|
local |
bash -C ${CMD} >| ${STDOUT_FILE} 2>| ${STDERR_FILE} |
sge |
qsub -S /bin/bash -sync y -V -q myqueue -N ${JOB_NAME} -o "${JOB_STDOUT}" -e "${JOB_STDERR}" -pe smp ${NPROC} -l h_vmem=${MB}M "${JOB_SCRIPT}" |
lsf |
bsub -K -q myqueue -J ${JOB_NAME} -o ${JOB_STDOUT} -e ${JOB_STDERR} ${JOB_SCRIPT} |
slurm |
srun --wait=0 -p myqueue -J ${JOB_NAME} -o ${JOB_STDOUT} -e ${JOB_STDERR} --mem-per-cpu=${MB}M --cpus-per-task=${NPROC} ${JOB_SCRIPT} |
For further information about how to configure for job schedulers other than SGE (PBS/LSF/Slurm/hermit) see the pypeflow wiki
Next you will find your job distribution settings. You will find settings for your default job queue JOB_QUEUE
,
memory allocated per job MB
, number of processors per job NPROC
as well as number of concurrently running
jobs njobs
.
Each stage of the assembly pipeline can be given different default parameters with different [job.step.*]
sections.
There are 6 optional stages you can configure by using different 2-3 letter codes. da
and la
refer to
pre-assembly daligner jobs and pre-assembly LAshow/LAsort jobs respectively. cns
refers
to the pread consensus calling stage. pda
and pla
refer to the pread daligner overlapping and pread
LAshow/LAsort stages respectively while asm
refers to the final assembly. If you omit a specific [job.step.*]
section, the [job.defaults]
will be applied. [job.step.da]
, [job.step.la]
, [job.step.cns]
, [job.step.pda]
,
[job.step.pla]
, and [job.step.asm]
are the available sections.
FALCON-Unzip has two main steps which occur in distinct directories:
Subdirectory | Description |
---|---|
3-unzip |
read alignment, SNP calling, read phasing, and diploid assembly of primary contigs and haplotigs |
4-polish |
phased polishing in which reads are used to polish in a haplotype-specific manner using BLASR and arrow |
[General]
max_n_open_files = 1000
[Unzip]
input_fofn=input.fofn
input_bam_fofn=input_bam.fofn
FALCON-Unzip configuration is quite simple as the majority of the options have to do
exclusively with job distribution. The first and only setting in the [General] section is for max_n_open_files
.
During the read tracking stage the pipeline can be writing to many .sam
files at the same time. This can cause
problems with certain networked filesystems, so the default is to set max_n_open_files=300
. Feel free to raise
this number if file system latency is not an issue for you.
Similar to FALCON, the parameter input_fofn
simply refers to the input file of fasta names. This setting should
be redundant with your fc_run.cfg
. Finally, if you wish to polish your unzipped genome, you will need to also
specify a list of your input bam files with input_bam_fofn
.
Here is a sample fc_unzip.cfg that will need to be tuned to your compute environment.
Here is a sample fc_unzip_HiFi.cfg for HiFi data.
Configuration of your [job.defaults]
section is identical to FALCON as described previously. The only differences
are the job specific settings specific to FALCON-Unzip. Available sections are [job.step.unzip_track_reads]
,
[job.step.unzip_blasr_aln]
, [job.step.unzip.phasing]
and [job.step.unzip.hasm]
An example fc_phase.cfg.
stay tuned for better documentation on falcon phase
To test your installation above you can download and run this small 200kb test case.
git clone https://github.com/cdunn2001/git-sym.git
git clone https://github.com/pb-cdunn/FALCON-examples.git
cd FALCON-examples
../git-sym/git-sym update run/greg200k-sv2
Once you have the data, you can test the pipeline in local or distributed mode by editing the fc_run.cfg file found in the run directory.
cd run/greg200k-sv2
fc_run fc_run.cfg
fc_unzip.py fc_unzip.cfg
If everything was installed properly the test case will exit cleanly and you should find fasta files with a
size greater than 0 in the 4-polish/cns-output
directory.
In this section we will run the full pb-assembly pipeline, FALCON
, FALCON-Unzip
, and FALCON-Phase
on a test dataset. The data is subsampled from the F1 bull from Koren et al. 2018;
full dataset is available at NCBI BioProject PRJNA432857.
We will work through the commands and results and give you ideas of how to assess
the perfomance of pb-assembly on your dataset so you can modify parameters and trouble-shoot more
effectively.
This tutorial was run using v0.0.2 of the pb-assembly bioconda metapackage.
The dataset can be download here and then unpacked. e.g.:
$ wget https://downloads.pacbcloud.com/public/dataset/assembly_test_data/F1_bull_test_data.tar.gz
$ tar -xvzf F1_bull_test_data.tar.gz
Inside the F1_bull_test_data/
directory you'll find the following files with md5sums so you can be sure
the file transfer is complete.
00e1c1ad1d33e4cd481747d7efdffcc0 F1_bull_test.subreads.fasta.gz - PacBio subreads for falcon assembly
6cf93f0d096ddf0ce3017f25c56ff7e4 F1_bull_test.HiC_R2.fastq.gz - HiC data for falcon phase
ce8f6057e07459bb091ceeca7f6ff04e F1_bull_test.HiC_R1.fastq.gz - HiC data for falcon phase
39ec303e6527dd227770cd33dbdb3609 fc_run.cfg - contig file for fc_run (falcon)
c52064770d25def32974e7a11211d9c8 fc_unzip.cfg - contig file for fc_unzip.py
51dc7c9e1e26c44d9cc5f5e491f53d2b fc_phase.cfg - config file for fc_phase.py
81033c7c4ed46fe8b1c89e9d33cc1e84 F1_bull_test.subreads.bam - PacBio subreads for unzip
Next, create two "files-of-file-names", ("fofn") for the PacBio subread data in fasta and bam format.
You need the subreads.fasta
for FALCON and FALCON-Unzip and the subreads.bam
for the polishing step
of FALCON-Unzip.
$ cat subreads.fasta.fofn
/path/to/my/data/dir/F1_bull_test.subreads.fasta
$ cat subreads.bam.fofn
/path/to/my/data/dir/F1_bull_test.subreads.bam
You can use the three configuration files, fc_run.cfg
, fc_unzip.cfg
, and fc_phase.cfg
, from the tarball
as a starting point but you will need to adjust the resource allocation for your particular compute setup.
Consult the detailed Configuration section for more information.
source activate my-pbasm-env
(my-pbasm-env) $
See the Availability section for more details about installation and set up from bioconda.
Make sure you are using screen
or tmux
or sending your job to a cluster scheduler so your job persists.
You're good to go! Let's run it!
(my-pbasm-env) $ fc_run fc_run.cfg
FALCON prints a lot to screen to help you monitor your job. I like to run my job in the background and capture
both stderr and stdout for later trouble shooting, if needed. I find that not all useful information is captured in the all.log
file,
in particular scheduler/cluster or connectivity errors tend to print to screen.
(my-pbasm-env) $ fc_run fc_run.cfg &> run0.log &
The majority of run-time is spent in preassembly; this version of FALCON relies on daligner for subread overlapping.
For example, to see how many daligner jobs there are (hint, there are 9 for the test data):
$ ls 0-rawreads/daligner-chunks/ | wc -l
9
To see how many jobs have completed, count the sentinal files in the daligner-run dirs.
$ find 0-rawreads/daligner-runs/j_*/uow-00 -name "daligner.done" | wc -l
9
Yo, initial overlaps are done!
It is also helpful to know what stage an individual job is running. On an SGE cluster, for example, I can get at list of my running processes like this:
$ qstat | grep myname
9580947 1.03953 Pf02af7214 myname r 11/16/2018 15:12:12 bigmem@mp1306-sge66 8
9580952 1.03953 P1f6ce5c60 myname r 11/16/2018 15:12:12 bigmem@mp1304-sge66 8
9580954 1.03953 P941e007f7 myname r 11/16/2018 15:12:12 bigmem@mp1304-sge66 8
9580957 1.03953 Ped31ec762 myname r 11/16/2018 15:12:12 bigmem@mp1304-sge66 8
9580959 1.03953 Pa1c326bb0 myname r 11/16/2018 15:12:12 bigmem@mp1301-sge66 8
And I can get details about these processes like this:
$ qstat -j 9580947
The above command outputs all sorts of useful information like stderr paths and submission times:
...
stdout_path_list: NONE:NONE:/path/to/my_job_dir/1-preads_ovl/daligner-runs/j_0077/run-Pf02af721455dae.bash.stdout
...
submission_time: Fri Nov 16 15:12:09 2018
...
The first step in FALCON is to build a dazzler database. From the 0-rawreads/build/raw_reads.db
you can easily extract
a histogram of read lengths for the subreads used in assembly. In addition, the total base pairs can be used to calculate
raw subread coverage if you know your genome size. You may notice that the base pairs from the dazzDB is lower than the total yield of
your SMRT cells, particulary if you are using version 3.0 chemistry of sequel. Consult the FAQ section for more details.
(my-pbasm-env) $ DBstats raw_reads.db
Statistics for all reads of length 500 bases or more
116,757 reads out of 116,757 (100.0%)
1,625,908,631 base pairs out of 1,625,908,631 (100.0%)
13,925 average read length
8,316 standard deviation
Base composition: 0.297(A) 0.199(C) 0.212(G) 0.292(T)
Distribution of Read Lengths (Bin size = 1,000)
Bin: Count % Reads % Bases Average
68,000: 1 0.0 0.0 68655
67,000: 0 0.0 0.0 68655
66,000: 0 0.0 0.0 68655
65,000: 3 0.0 0.0 66171
64,000: 0 0.0 0.0 66171
63,000: 0 0.0 0.0 66171
62,000: 1 0.0 0.0 65359
61,000: 2 0.0 0.0 64282
60,000: 0 0.0 0.0 64282
59,000: 3 0.0 0.0 62735
...
You can extract the same information from the preads:
(my-pbasm-env) $ DBstats 1-preads_ovl/build/preads.db
Statistics for all reads of length 70 bases or more
39,573 reads out of 39,573 (100.0%)
558,983,583 base pairs out of 558,983,583 (100.0%)
14,125 average read length
8,124 standard deviation
Base composition: 0.299(A) 0.200(C) 0.200(G) 0.300(T)
Distribution of Read Lengths (Bin size = 1,000)
Bin: Count % Reads % Bases Average
57,000: 2 0.0 0.0 57484
56,000: 0 0.0 0.0 57484
55,000: 1 0.0 0.0 56712
54,000: 0 0.0 0.0 56712
53,000: 1 0.0 0.0 55899
52,000: 3 0.0 0.1 54468
51,000: 4 0.0 0.1 53341
50,000: 2 0.0 0.1 52870
49,000: 6 0.0 0.2 51819
48,000: 4 0.1 0.2 51221
47,000: 9 0.1 0.3 50182
46,000: 4 0.1 0.3 49768
45,000: 9 0.1 0.4 48911
44,000: 11 0.1 0.5 48052
...
In addition to the DBstats command, consult the 0-rawreads/report/pre_assembly_stats.json
file for details about
pre-assembly performance:
$ {
"genome_length": 20000000,
"length_cutoff": 17886, # calculated min length of seed reads when "autocut" (length_cutoff = -1) is used.
"preassembled_bases": 558983583,
"preassembled_coverage": 27.949,
"preassembled_esize": 18798.024,
"preassembled_mean": 14125.378,
"preassembled_n50": 18750,
"preassembled_p95": 28394,
"preassembled_reads": 39573,
"preassembled_seed_fragmentation": 1.507, # number split preads / seed reads
"preassembled_seed_truncation": 3475.765, # ave bp lost per pread due to low cov
"preassembled_yield": 0.699, # total pread bp / seed read bp (>50% is acceptable)
"raw_bases": 1625908631,
"raw_coverage": 81.295,
"raw_esize": 18892.366,
"raw_mean": 13925.577,
"raw_n50": 17703,
"raw_p95": 30005,
"raw_reads": 116757,
"seed_bases": 800080383,
"seed_coverage": 40.004,
"seed_esize": 26357.341,
"seed_mean": 24841.045,
"seed_n50": 24687,
"seed_p95": 36872,
"seed_reads": 32208
}
A note on these statistics: in the process of created preads, seeds reads with insufficient raw read coverage (falcon_sense_option = --min_cov option) will be split or truncated. The preassembled seed fragmentation, truncation, and yield stats summarize the quality of pread assembly. A good preassembled yield should be greater than 50%. Often coverage-limited assemblies (<30X coverage) show poor preassembled yield.
When your run is complete, you can summarize your assembly stats using the get_asm_stats.py
included in the
documatation package.
$ git clone https://github.com/PacificBiosciences/pb-assembly.git
$ python pb-assembly/scripts/get_asm_stats.py 2-asm-falcon/p_ctg.fa
{
"asm_contigs": 4,
"asm_esize": 6087779,
"asm_max": 7022361,
"asm_mean": 4455235,
"asm_median": 4601161,
"asm_min": 32628,
"asm_n50": 6164792,
"asm_n90": 4601161,
"asm_n95": 4601161,
"asm_total_bp": 17820942
}
- Check the assembly
completeness
: isasm_total_bp
what you expect your genome size to be? - Check the
contiguity
: asm_n50 larger than 1Mb is best for downstream scaffolding and annotion effort. - Check the
correctness
of the contigs using third party tools: align your contigs against a similar reference (if available) using minimap or mummer and visualize with a tool like dgenies, assemblytics, or dot. - Check base level correctness (after polishing): Use BUSCO to look for single copy conserved genes.
Command-line FALCON does not automatically polish the assembly, but you need to do at least 1 round of polishing at this point if you do not proceed to FALCON-Unzip! Polishing increased base pair accuracy by mapping the PacBio raw data back to the draft assembly and then computing the consensus sequence of the aligned reads. Assembly polishing may be run using the resequencing pipeline of pbsmrtpipe (command line instruction available in the SMRT_Tools_Reference_Guide or through the SMRT Link GUI. Resequencing requires PacBio subread BAM inputs.
If your sample in not haploid or an inbred diploid, you should consider running FALCON-Unzip as well, to separately assembly the haplotypes. You run FALCON-Unzip in the same directory with a different configuration file. (Here I preserved the all.log file before launching Unzip, which will overwrite it.)
(my-pbasm-env) $ mv all.log all0.log
(my-pbasm-env) $ fc_unzip.py fc_unzip.cfg &> run1.std &
The first stage of FALCON-Unzip involves calling variants in the FALCON assembly, binning reads by haplotype, then haplotype-specific re-assembly. This occurs in the
3-unzip
directory. You can assess the performance of Unzip by running the get_asm_stats.py
script on the primary and haplotig fasta files:
python pb-assembly/scripts/get_asm_stats.py 3-unzip/all_p_ctg.fa
{
"asm_contigs": 4,
"asm_esize": 6082188,
"asm_max": 7018959,
"asm_mean": 4447277,
"asm_median": 4568005,
"asm_min": 32608,
"asm_n50": 6169539,
"asm_n90": 4568005,
"asm_n95": 4568005,
"asm_total_bp": 17789111
}
python pb-assembly/scripts/get_asm_stats.py 3-unzip/all_h_ctg.fa
{
"asm_contigs": 50,
"asm_esize": 747994,
"asm_max": 1654630,
"asm_mean": 338144,
"asm_median": 169041,
"asm_min": 4675,
"asm_n50": 617201,
"asm_n90": 223476,
"asm_n95": 95301,
"asm_total_bp": 16907217
}
The assembly stats are largely the same for 3-unzip/all_p_ctg.fa
as they were after running FALCON. The total length of the alternate haplotigs (3-unzip/all_h_ctg.fa
)
is typically shorter than the primary contigs and the assembly is more fragmented. For this test data, 16.9Mb/17.8Mb = 95% of the genome "unzipped" or is haplotype-resolved.
This sample was specifically sequences in order to separate haplotypes; samples with lower heterozygosity will have a smaller proportion of the genome unzipped.
A new feature of FALCON-Unzip is the haplotig placement
file, 3-unzip/all_h_ctg.paf
, which specifies where each alternate haplotig
aligns to the primary contigs in Pairwise mApping Format.
$ head 3-unzip/all_h_ctg.paf
000002F_001 978968 0 978968 + 000002F 4568005 3132263 4100391 968128 968128 60
000002F_002 126260 0 126260 + 000002F 4568005 562806 690698 127892 127892 60
000002F_003 1654630 0 1654630 + 000002F 4568005 692926 2340316 1647390 1647390 60
000002F_004 266093 0 266093 + 000002F 4568005 2842384 3110211 267827 267827 60
000002F_005 102771 0 102771 + 000002F 4568005 2722867 2828717 105850 105850 60
000002F_006 481464 0 481464 + 000002F 4568005 67901 546495 478594 478594 60
000002F_007 443941 0 443941 + 000002F 4568005 4123628 4568005 444377 444377 60
000002F_008 357850 0 357850 + 000002F 4568005 2352024 2700428 348404 348404 60
000003F_001 22390 0 22390 + 000003F 32608 10258 32608 22350 22350 60
000000F_001 434620 0 434620 + 000000F 7018959 2452537 2885704 433167 433167 60
You can think of the coordinates: 000002F_001:0-978968
and 000002F:3132263-4100391
as a phase block
containing the two phased haplotypes.
The second stage of FALCON-Unzip is phased-polishing, which occurs in the 4-polish
directory. This method of polishing preserves the haplotype
differences by polishing the primary contigs and alternate haplotigs with reads that are binned into the two haplotypes.
Some residual indel errors, particularly around homopolymer stretches may remain so consider concatenating your primary contigs and haplotigs
into a single reference and polishing with resequening as described above in the polishing section of this tutorial.
After polishing, your final Unzip assembly files are: 4-polish/cns-output/cns_p_ctg.fasta
and 4-polish/cns-output/cns_h_ctg.fasta
.
$ python pb-assembly/scripts/get_asm_stats.py 4-polish/cns-output/cns_p_ctg.fasta
{
"asm_contigs": 4,
"asm_esize": 6097543,
"asm_max": 7037049,
"asm_mean": 4458862,
"asm_median": 4582352,
"asm_min": 32736,
"asm_n50": 6183312,
"asm_n90": 4582352,
"asm_n95": 4582352,
"asm_total_bp": 17835449
}
$ python pb-assembly/scripts/get_asm_stats.py 4-polish/cns-output/cns_h_ctg.fasta
{
"asm_contigs": 46,
"asm_esize": 752544,
"asm_max": 1658804,
"asm_mean": 367180,
"asm_median": 229107,
"asm_min": 16669,
"asm_n50": 618770,
"asm_n90": 223906,
"asm_n95": 95389,
"asm_total_bp": 16890306
}
The stats are largely unchanged after polishing. The coordinates for the PAF have shifted and unfortunately, we do not produce a PAF for the polished assembly at this time. However, the first stage of FALCON-Phase produces a haplotig placement file for the polished assembly using sequence alignment.
The length of the phase-blocks (haplotigs) produced by FALCON-Unzip are limited by the magnitude and distribution of heterozygosity in the diploid genome, PacBio read lengths, the coverage depth. Regions of low heterozysity are resolved as collapsed haplotypes because they contain insufficient information for read phasing. As a consequence, linkage information is lost between sequential phase blocks that are separated by a collapsed haplotype region. To address the problem of phase switching between blocks on the primary contigs, PacBio and Phase Genomics implemented a novel stochastic algorithm called FALCON-Phase which integrates ultra-range genotype information in the form of Hi-C read pairs.
The preprint is available on biorXiv here.
(my-pbasm-env) $ mv all.log all1.log
(my-pbasm-env) $ fc_phase.py fc_phase.cfg &> run2.std &
A cartoon of the FALCON-Phase pipeline is above. The pipeline begins by producing a haplotig placement file
using alignment of haplotig to primary contigs with mummer4. The placement file guides the mincing
of the primary contigs to define
phase blocks of haplotig and primary contigs sequences. HiC reads are mapped to these minced contigs and based on the density of read pairs
haplotype phase switch errors are corrected.
Two options for output fasta files are available: unzip
produces primary contig and haplotigs but with the phase switch errors corrected.
pseudohap
produces two contigs similar to the primary contigs but with the unzipped regions all in phase with each other. If you want a
haploid version of your genome, choose unzip
style. If you prefer a diploid version of the genome, choose the pseudohaplotype format.
See the haplotig placement file:
$ head 5-phase/placement-output/haplotig.placement
000000F_004 902477 0 902477 + 000000F 7037049 8 897889 902477 902477 60
000000F_008 52009 0 52009 + 000000F 7037049 927907 978232 52009 52009 60
000000F_007 641978 0 641978 + 000000F 7037049 978232 1620331 641978 641978 60
000000F_006 565777 0 565777 + 000000F 7037049 1803676 2370153 565777 565777 60
000000F_017 88633 0 88633 + 000000F 7037049 2370153 2458441 88633 88633 60
000000F_001 435877 0 435877 + 000000F 7037049 2458441 2892768 435877 435877 60
000000F_020 52718 0 52718 + 000000F 7037049 2892768 2946870 52718 52718 60
000000F_014 229107 0 229107 + 000000F 7037049 2946870 3175808 229107 229107 60
000000F_019 83728 0 83728 + 000000F 7037049 3175808 3259330 83728 83728 60
000000F_015 223906 0 223906 + 000000F 7037049 3259330 3487449 223906 223906 60
Final output stats in the pseudohap
format are similar to the primary contigs from FALCON and FALCON-Unzip:
$ python pb-assembly/scripts/get_asm_stats.py 5-phase/output/phased.0.fasta
{
"asm_contigs": 4,
"asm_esize": 6094408,
"asm_max": 7038334,
"asm_mean": 4457216,
"asm_median": 4587500,
"asm_min": 32780,
"asm_n50": 6170253,
"asm_n90": 4587500,
"asm_n95": 4587500,
"asm_total_bp": 17828867
}
$ python pb-assembly/scripts/get_asm_stats.py 5-phase/output/phased.1.fasta
{
"asm_contigs": 4,
"asm_esize": 6099451,
"asm_max": 7039100,
"asm_mean": 4462161,
"asm_median": 4598741,
"asm_min": 32736,
"asm_n50": 6178068,
"asm_n90": 4598741,
"asm_n95": 4598741,
"asm_total_bp": 17848645
}
Note: the contigs in the phased.0.fasta
and phased.1.fasta
files are not necessarilly in phase with each other, although the
phase blocks within each contig are in phase.
Yes, see details here
https://github.com/PacificBiosciences/pbbioconda/issues
Please use this handy bug report template.
When planning for a project, you should consider two types of coverage: total coverage
is the total bases generated divided by the
genome size. Unique molecular coverage
is the number of unique bases divided by the genome size.
PacBio sequencing can generate multiple subreads for a single template molecule because within a reaction well,
the polymerase may make multiple passes around the circular library molecule. How many passes depends on the movie length
and the length of the insert, among other factors. For de novo genome assembly, we recommend selecting only a single subread per reaction well.
This reduces the rate of chimerism/misassembly in the resulting contigs. However, we recommend using all subreads when polishing your contigs
in order to get the highest base qualities.
In general, we recommend:
- 30-50X
unique molecular coverage
per haplotye for assembly
Coverage requirements for scale linearly by the number of unique haplotypes. For example, a highly heterozygous diploid may require double the coverage recommend above, while a homozygous tetraploid may also require double coverage, (in a case where haplotypes are identical, but homeologs are not). In the latter example, assume genome length is 1N the length of one subgenome.
We recommend
- 15-20X
HiFi coverage
for haploids or diploids
Unique molecular yield is now reported in SMRT Link (v7.0.0). To calculate it at the command line0220, make sure you have pb-assembly
and bamtools
loaded in your path and run the following commands:
# convert subreads.bam to subreads.fasta
bamtools convert -format fasta -in movie.subreads.bam -out movie.subreads.fasta
# generate fasta file with just as single, median-length subread
python -m falcon_kit.mains.fasta_filter median movie.subreads.fasta > movie.median.fasta
With the new fasta file you can run samtools faidx
to get the lengths and then sum them up with a utility like datamash
.
samtools faidx movie.median.fasta
cut -f2 movie.median.fasta.fai | datamash sum 1 > movie.umy
Then divide unique molecular yield
by your genome size
to get unique molecular coverage.
Yes. The option input_type
can be set to either raw
or preads
. In the case of the latter,
fc_run.py
will assume the fasta files in input_fofn
are all error-corrected reads and it
will ignore any pre-assembly step and go directly into the final assembly overlapping step.
Primary contigs can be thought of as the longest continuous stretches of assembled sequence, while
associate contigs can be thought of mostly as structural variants that occur over the length of the
primary contigs. Thus, each alternate primary contig configuration (associated contig) can be
"associated" with its primary based on its XXXXXXF
prefix.
Some basic information about how the associated contigs are generated can be found in this speakerdeck, and also here (pg.14-15).
Conceptually, if a genome is haploid, then all contigs should be primary contigs. However, often there will usually still be some associated contigs generated. This is likely due to:
- Sequencing errors
- Segmental duplications
For the first case, Quiver should help by filtering out low quality contigs. Since there is more sequence in the set of primary contigs for blasr to anchor reads and there is no true unique region in the erroneous associated contigs, the raw read coverage of them should be low. We can thus filter low quality associated contig consensus as there won't be much raw read data to support them.
For the second case, one could potentially partition the reads into different haplotype groups and construct an assembly graph for each haplotype and generate contigs accordingly.
If a genome is a diploid, most of the associated contigs will be locally alternative alleles. Typically, when there are big structural variations between homologous chromosomes, there will be alternative paths in the assembly graph and the alternative paths correspond to the associated contigs. In such case, the primary contigs are “fused contigs” from both haplotypes.
FALCON-Unzip is currently being developed to resolve the haplotypes so haplotigs can be generated. Two videos illustrating the concept - (Video 1 , Video 2)
A slide illustrating the method on a synthetic genome.
The file a_ctg_base.fasta
contains the sequences in the primary contigs fasta that correspond to the associated
contigs inside a_ctg.fasta
. Namely, each sequence of a_ctg_base.fasta is a contiguous sub-sequence of a primary
contig. For each sequence inside a_ctg_base.fasta
, there are one or more associated contigs in a_ctg.fasta
.
It's useful to first understand that not all genomes are alike. Haploid genomes are the ideal use case of genome assembly since there is only one haplotype phase present and assembly is trivial if you have reads long enough to span repeats. Diploid and (allo/auto)polyploid genomes become difficult as there are two or more haplotype phases present. This fact, coupled with widely varying levels of heterozygosity and structural variation lead to complications during the assembly process. To understand your FALCON output, it's useful to look at this supplemental figure from the FALCON-Unzip paper:
Consider the first line as a cartoon illustrating 3 ranges of heterozygosity (low/medium/high). In general, all genomes will have regions that fall into each of these three categories depending on organismal biology. During the first step of the FALCON assembly process, a diploid aware assembly graph is generated. At this point, in medium heterozygosity regions structural variation information is captured as bubbles or alternative pathways in the assembly graph whereas at high levels of heterozygosity the haplotype phases assemble into distinct primary assembly graphs.
The FALCON-Unzip add-on module to the FALCON pipeline is an attempt to leverage the heterozygous SNP information to phase the medium level heterozygosity regions of the genome. Low heterozygosity regions have insufficient SNP density for phasing, while high heterozygosity regions will likely have already been assembled as distinct haplotypes in the primary contigs.
FALCON-Unzip yields two fasta files. One containing primary contigs, and one containing haplotigs. The primary contigs fasta file is the main output that most people consider first and should consist of the majority of your genome. Primary contigs are considered partially-phased. What this means is that even after the unzipping process, certain regions with insufficient SNP density are unable to be phased and are thus represented as collapsed haplotypes. The presence of these regions of low heterozygosity makes it impossible to maintain phase across the entire primary contig. Therefore, primary contigs may contain phase-switches between unzipped regions. The haplotigs file will consist of regions of the genome that are able to be unzipped or phased and are considered fully phased. This means there should be no phase switching within a haplotig and each haplotig should represent only one phase. See this figure for reference:
It's also important to note that in high heterozygosity situations, we often see the primary contig fasta file approaching 1.5X+ the expected haploid genome size, due to the assembly of both phases of certain chromosomes or chromosomal regions in the primary assembly.
Also, one needs to consider that FALCON-Unzip was designed to phase the plant and fungal genomes in the 2016 Nature Methods paper above. Many people have successfully used it to help phase their genome of interest, but as always with free software on the internet, your results may vary.
The magnitude of haplotype divergence determines the structure of the resulting FALCON-Unzip assembly. Genomic regions with low heterozygosity will be assembled as a collapsed haplotype on a single primary contig. Haplotypes up to ~5% diverged will be unzipped, while highly divergent haplotypes will be assembled on different primary contigs. In the latter case, it is up to the user to identify these contigs as homologous using gene annotation or sequence alignment.
For a variety of FALCON-Unzip assemblies, here is the distribution of haplotype divergence for unzipped regions.
Each haplotig was aligned to the corresponding primary contig with nucmer,
filtered with delta-filter and divergence was estimated with show-coords
. (Data credits to John Williams, Tim Smith,
Paolo Ajmone-Marsan, David Hume, Erich Jarvis, John Henning, Dave Hendrix, Carlos Machado, and Iago Hale).
Preassembly performance is summarized in the file: 0-rawreads/report/pre_assembly_stats.json
.
$ cat 0-rawreads/report/pre_assembly_stats.json
"genome_length": 4652500,
"length_cutoff": 15000,
"preassembled_bases": 350302363,
"preassembled_coverage": 75.293,
"preassembled_mean": 10730.33,
"preassembled_n50": 16120,
"preassembled_p95": 22741,
"preassembled_reads": 32646,
"preassembled_seed_fragmentation": 1.451, # total preads / seed reads
"preassembled_seed_truncation": 4453.782, # ave bp lost per pread due to low cov
"preassembled_yield": 0.758, # total pread bp / seed read bp
"raw_bases": 964281429,
"raw_coverage": 207.261,
"raw_mean": 10626.042,
"raw_n50": 14591,
"raw_p95": 23194,
"raw_reads": 90747,
"seed_bases": 461851093,
"seed_coverage": 99.269,
"seed_mean": 20029.103,
"seed_n50": 19855,
"seed_p95": 28307,
"seed_reads": 23059
A note on these statistics: in the process of created preads, seeds read bases with insufficient
raw read coverage (specific by --min_cov
in falcon_sense_option
) will cause the read to be split or truncated.
The preassembled seed fragmentation, truncation, and yield stats summarize the quality of pread assembly.
A good preassembled yield should be greater than 50%. Insufficient coverage or low quality data are common reasons
for poor preassembled yield.
FALCON was designed for whole genome shotgun assembly rather than amplicon assembly. In whole genome shotgun assembly we suppress repetitive high copy regions to assemble less repetitive regions first. When you assemble the PCR product of a short region in a genome, FALCON sees the whole thing as a high copy repeat and filters out a significant portion of the data.
You can try to downsample your data and make the daligner
block size even smaller ( reduce -s50
in
pa_DBsplit_option
) and increase the overlap filter thresholds (--max_diff 100
--max_cov 100
in overlap_filtering_setting
) to try to make it work. However, this use case is not really within
the scope of the FALCON algorithm.
Associate contig IDs contain the name of their primary contig but the precise location of alignment must
be determined with third party tools such as NUCmer. For example, in a FALCON assembly, 000123F-010-01
is an associated contig to primary contig 000123F. In a FALCON-Unzip assembly, 000123F_001 is a haplotig
of primary contig 000123F. The alignment position can be found in a
PAF file: 3-unzip/all_h_ctg.paf
. The alignment
coordinates after polishing is not yet produced but can be generated through alignments.
Thanks to Jason Chin for the original concept and Chris Dunn/Ivan Sovic/Zev Kronenberg for their numerous improvements.
- FALCON/FALCON-Unzip Chin et al. (2016) Nature Methods 13:1050–1054
- HGAP Chin et al. (2013) Nature Methods 10:563-9
- HiFi assembly: Wenger et al. (2019) Nature Methods
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.