Progressive Cactus is included in the Cactus Software Package. It can be used to align hundreds of vertebrate genomes. A phylogenetic tree is required as input in order to define the progressive decomposition. For aligning samples from the same species, use the Minigraph-Cactus pipeline instead.
Please cite the Progressive Cactus paper when using Cactus.
Please cite the HAL paper when using HAL tools.
- Progressive Cactus Data
- Quick-start
- Interface
- Using the HAL Output
- Using Docker
- Running in the Cloud
- Running on a Cluster
- Running Step-by-step
- Running on Terra or Cromwell
- Updating Alignments
- GPU Acceleration
- Pre-Alignment Checklist
- Frequently Asked Questions
A small test example is included in cactus/examples/evolverMammals.txt
. To align these genomes, install the latest release and run from the cactus directory:
cactus ./js ./examples/evolverMammals.txt ./evolverMammals.hal
The alignment can be inspected with
halStats ./evolverMammals.hal
hal v2.2
((simHuman_chr6:0.144018,((simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.171974,simGorilla:0.075)AncGorilla:0.1)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;
GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
Anc0, 2, 535128, 13, 0, 17165
Anc1, 2, 561672, 7, 19963, 24668
simHuman_chr6, 0, 601863, 1, 25791, 0
AncGorilla, 2, 578003, 4, 26318, 57158
mr, 2, 607607, 5, 55240, 60749
simMouse_chr6, 0, 636262, 1, 62021, 0
simRat_chr6, 0, 647215, 1, 61761, 0
simGorilla, 0, 599081, 1, 58887, 0
Anc2, 2, 573075, 19, 20396, 61873
simCow_chr6, 0, 602619, 1, 61912, 0
simDog_chr6, 0, 593897, 1, 61666, 0
or converted to MAF with
cactus-hal2maf ./js evolverMammals.hal evolverMammals.maf.gz --refGenome simHuman_chr6 --chunkSize 1000000
Note: See the step-by-step interface to see how to run Progressive Cactus one job at a time.
Cactus takes as input a set of softmasked genome assemblies in fasta format (they can be gzipped) and a phylogenetic tree relating them. Each input genome should correspond to a leaf in the tree. It outputs a multiple genome alignment in HAL, which includes ancestral sequences, one for each internal node in the tree.
To run Cactus, the basic format is:
cactus <jobStorePath> <seqFile> <outputHal>
The jobStorePath
is where intermediate files, as well as job metadata, will be kept by Toil. It must be accessible to all worker systems.
The seqFile
is a text file containing the locations of the input sequences as well as their phylogenetic tree. The tree will be used to progressively decompose the alignment by iteratively aligning sibling genomes to estimate their parents in a bottom-up fashion. Polytomies in the tree are no longer allowed, as changes to chaining have caused them to lead to drops in coverage. Cactus uses the predicted branch lengths from the tree to determine appropriate pairwise alignment parameters, allowing closely related species to be aligned more quickly with no loss in accuracy. The file is formatted as follows:
NEWICK tree
name1 path1
name2 path2
...
nameN pathN
An optional * can be placed at the beginning of a name to specify that its assembly is of reference quality. This implies that it can be used as an outgroup for sub-alignments. If no genomes are marked in this way, all genomes are assumed to be of reference quality. The star should only be placed on the name-path lines and not inside the tree.
- The tree must be on a single line. All leaves must be labeled and these labels must be unique. Ancestors may be named, or left blank (in which case the ancestors in the final output will automatically be labeled Anc0, Anc1, etc.) Labels must not contain any spaces.
- Branch lengths that are not specified are assumed to be 1.
- Lines beginning with # are ignored.
- Sequence paths must point to either a FASTA file or a directory containing 1 or more FASTA files.
- FASTA files may be gzipped (since Cactus v2.2.0)
- Sequence paths must not contain spaces.
- Each name / path pair must be on its own line
http://
,s3://
, etc. URLs may be used.
Please ensure your genomes are soft-masked, ideally with RepeatMasker. We do some basic masking as a preprocessing step to ensure highly repetitive elements are masked when repeat libraries are incomplete, but genomes that aren't properly masked can still take tens of times longer to align that those that are masked. Hard-masking (totally replacing repeats with stretches of Ns) isn't necessary, and is strongly discouraged (you will miss a lot of alignments!).
An example seqfile can be found here.
The number of genome assemblies for any given species is increasing. To align several genomes of the same species, you are probably best to use the Pangenome pipeline, but there are cases such as for diploid assemblies where you may want to include multiple genomes from one species in a progressive alignment. An example of such a dataset is the T2T Primates from NHGRI. These assemblies are all diploid, in that for each species, two haploid genomes are provided. For species (bonobo and gorilla) for which parental sequencing data was available, the haplotypes could be labeled as maternal or paternal, while for the other species the assignment into haplotypes is based on assembly quality. The nomenclature for this data is as follows
- primary assembly: most complete version of each autosome plus chrX and chrY
- alternate assembly: other version of each autosome (no sex chromosomes)
- maternal assembly: maternal autosomes (assigned with trio data) plus chrX
- paternal assembly: paternal autosomes (assigned with trio data) plus chrY
Note that all the samples were male. In other projects it's quite possible that the primary assembly does not include a chrY. Also, the same principle applies to other types of chromosomes, such as chrW and chrZ.
In all cases, we want progressive cactus, where possible, to correctly reconstruct sex chromosomes in the ancestors. To do this, each of these chromosomes should be present in at least two genomes in each alignment. The --chromInfo
option can be used to guide the outgroup selection process to ensure that this is the case. This option specifies a two-column text file mapping input genomes to lists of the sex chromosomes they contain. For example, it may contain something like
chimp_primary chrX,chrY
hg38 chrX,chrY
gorilla_maternal chrX
gorilla_paternal chrY
Note that genomes with no chromosomes to specify (chimp_alt
in this example) can be left out of the file (or included in a single column).
Normally, cactus greedily chooses the N (default=2, override with cactus --maxOutgroups
or in the configuration XML) nearest genomes to the ancestor in question to be outgroups. When --chromFile
is specified, it will choose (greedily, by distance) as many extra outgroups are needed to cover the set of specified chromosomes in the ingroups.
The outputHal
file represents the multiple alignment, including all input and inferred ancestral sequences. It is stored in HAL format, and can be accessed with HAL tools, which are all included in Cactus either as static binaries for the binary release, or within the Docker image for the Docker release.
Please cite HAL.
The HAL format represents the alignment in a reference-free, indexed way, but isn't readable by many tools. Many applications will require direct access to alignment columns, which requires "transposing" the row-based HAL format. This is best done by converting to MAF, which can be computationally expensive for larger files. HAL tools includes a conversion tool called hal2maf
but it has several limitations:
hal2maf
is extremely slowhal2maf
produces very small, fragmented alignment blockshal2maf
can write multiple rows per block per species, representing paralogies, which are not supported by many tools. Filtering these usinghal2maf --noDupes
often leads to weird breaks in synteny.hal2maf
does not properly support insertions relative to the reference. It has an option for this,--maxRefGap
, but it is extremely slow and doesn't work properly.
These issues are all at least partially addressed by a new tool, cactus-hal2maf
, included in Cactus. It is therefore always recommended to use cactus-hal2maf
rather than running hal2maf
directly (or hal2mafMP.py
).
cactus-hal2maf
uses Toil to support distributed computation, which allows very large alignments to be quickly converted on clusters or the cloud.cactus-hal2maf
uses TAFFY-based normalization to merge together adjacent blocks, resulting in far less fragmentationcactus-hal2maf
has a--dupeMode
option to greedily filter down to a single row / species, using contiguity and identity to reference which, while not perfect, is much better than before.cactus-hal2maf
uses TAFFY to correctly add insertions back into the alignment directly from the HAL file.
Genome Selection
--refGenome
(required): A genome from the alignment (can be an ancestor) to use as the MAF reference onto which all other alignments are projected. It will be the first row of each MAF block in the output.--refSequence
: Only process the given chromosome(s) from the reference genome--noAncestors
: Do not included reconstructed ancestors in the output (exAnc0
etc)--targetGenomes
: Only include comma-separated list of genomes specified by this option in the output--start
/--length
: Only process the given subrange(s) of the reference sequence(s)--bedRanges
: Only process the given subranges of reference genome
Computational Resources
--chunkSize
(required): The size (in bp) of each chunk on the reference to process in parallel. I typically use500000
for whole-genome alignments.--batchCount
: The number of Toil jobs to distribute computation to (1
by default). For example, on a3gb
genome with1mb
chunks, there would be3000
chunks to process. If--batchCount
is10
, then these chunks would be divided into10
multithreaded jobs, each processing300
chunks. If you are running on a single machine, there is little reason to use more than1
job.--batchSize
: An alternative to--batchCount
, this will set the number of jobs by specifying the maximum number of chunks for each job. In the above example--batchSize 300
would also divide the work into10
Toil jobs.--batchCores
: Number of threads per Toil job, which determines how manyhal2maf
processes are run.--batchMemory
: Amount of memory to assign to each Toil Job (defaults to an estimate based on the input size).--batchParallelHal2maf
: Number ofhal2maf
jobs to run concurrently. (defaults to--batchCores
)--batchParallelTaf
: Number oftaffy
normalization jobs to run concurrently. (defaults ot--batchCores
)
Normalization
--filterGapCausingDupes
(recommended): Filter paralogous rows that would otherwise cause a block to be broken. In practice, it has a negligible effect on coverage, but a very large effect on the number of blocks.--maximumBlockLengthToMerge
: Merge adjacent blocks if one or both is less than this many bases long (see TAFFY documentation for default value and more explanation).--maximumGapLength
: Merge adjacent blocks if the maximum number of unaligned bases between them is less than this value (see TAFFY documentation for default value and more explanation).--fractionSharedRows
: Minimum fraction of shared rows between adjacent blocks in order to merge (see TAFFY documentation for default value and more explanation).
Duplication Filtering is specified with the --dupeMode
option. Possible values are:
- "single" : Uses greedy heuristics to pick the copy for each species that results in fewest mutations and block breaks.
- "consensus" : Uses maf_stream merge_dups consensus to make a single "consensus" row for all duplicate rows. This row won't actually reflect a real sequence in the fasta, but the individual columns will be more sensitive to the true coverage than when using "single". Recommended when only looking for maximum coverage of columns, without considering duplications.
- "ancestral" : Restricts the duplication relationships shown to only those orthologous to the reference genome according to the HAL tree. There may be multiple orthologs per genome. This relies on the dating of the duplication in the hal tree (ie in which genome it is explicitly self-aligned) and is still a work in progress. For example, in a tree with
((human,chimp),gorilla)
, if a duplication in human is collapsed (ie a single copy) in the human-chimp ancestor, then it would not show up on the human-referenced MAF using this option. But if the duplication is not collapsed in this ancestor (presumably because each copy has an ortholog in chimp and gorilla), then it will be in the MAF because the duplication event was higher in the tree. - "all" : (default) All duplications are written, including ancestral events (orthologs) and paralogs in the reference.
The output path argument of cactus-hal2maf
must end with .maf.gz
, .maf
, .taf.gz
or .taf
. In the latter two cases, the output will be written as TAF instead of MAF. Conversion between MAF and TAF with taffy view
is lossless, and TAF is much more compressed than MAF.
Files in .maf.gz
, .maf
, .taf.gz
and .taf
can be quickly queried by reference coordinate interval (analogous to bcftools view -r
) using taffy view
. As in VCF
, the files must be indexed first. Use cactus-hal2maf --index
to generate a .tai
index of the output alignment file. See the TAFFY documentation for more information on indexing. taffy
can index and query MAF and TAF files interchangeably, ie an interval from an indexed TAF file can be output to MAF and vice versa.
A table of alignment coverage and identity statistics with suffix .cov.tsv
will be generated with cactus-hal2maf --coverage
. Breakdowns by gap length and sex chromosomes vs autosomes can be added using the --coverageGapThresholds
and --coverageSexChroms
options, respectively. See the taffy coverage
documentation on the TAFFY site for more details.
Export the simulated mammals example on a local machine, filtering the output so each genome only appears in at most one row per block
cactus-hal2maf ./js evolverMammals.hal evolverMammals.maf.gz --refGenome simHuman_chr6 --chunkSize 1000000 --dupeMode single
Exporting a MAF for each reference in an 8-way ape alignment on UCSC Slurm cluster:
for i in hs1 hg38 GCA_028858775.2 GCA_028885655.2 GCA_028885625.2 GCA_028878055.2 GCA_029281585.2 GCA_029289425.2; do TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-hal2maf ./js_hal2maf8 ./8-t2t-apes-2023v2.hal ./8-t2t-apes-2023v2.${i}.maf.gz --filterGapCausingDupes --refGenome $i --chunkSize 500000 --batchCores 64 --noAncestors --batchCount 16 --batchSystem slurm --logFile ./8-t2t-apes-2023v2.${i}.gz.log --batchLogsDir batch-logs-8apes --coordinationDir /data/tmp ;done
Note that the output will contain paralogies. These can be filtered post-hoc with (note this isn't run on Slurm, to do so you'd need to run sbatch
yourself)
for i in hs1 hg38 GCA_028858775.2 GCA_028885655.2 GCA_028885625.2 GCA_028878055.2 GCA_029281585.2 GCA_029289425.2; do zcat ./8-t2t-apes-2023v2.${i}.maf.gz | mafDuplicateFilter -km - | bgzip > ./8-t2t-apes-2023v2.${i}.single-copy.maf.gz
You can accomplish the same thing by passing --dupeMode single
to the original cactus-hal2maf
command. I usually prefer to do it in two steps to be able to keep around both versions of the MAF without needing to run the whole conversion process twice.
Exporting a MAF on human for the 447-way Zoonomia with extra primates alignment on the UCSC Slurm cluster:
cactus-hal2maf ./js ../447-mammalian-2022v1.hal 447-mammalian-2022v1.maf.gz --chunkSize 100000 --batchCores 96 --batchCount 10 --noAncestors --filterGapCausingDupes --batchParallelTaf 32 --batchSystem slurm --maxLocalJobs 800 --refGenome hg38 --logFile 447-cmammalian-2022v1.maf.gz.log
zcat 447-mammalian-2022v1.maf.gz | mafDuplicateFilter -k -m - | bgzip > 447-mammalian-2022v1-single-copy.maf.gz
Because the Zoonomia assemblies are so fragmented, and the alignment file is so huge, the TAFFY normalization processes that read the HAL file take lots of memory. To compensate for this, the --batchParallelTaf 32
flag was used above to throttle down the number of taffy
processes to run in parallel.
This is an earlier run of the above example on an AWS/EC2 autoscale cluster.
cactus-hal2maf aws:us-west-2:cactus-hprc-jobstore-cow s3://vg-k8s/users/hickey/zoo/447-mammalian-2022v1.hal s3://vg-k8s/users/hickey/zoo/447-mammalian-2022v1.maf.gz --refGenome Homo_sapiens --noAncestors --chunkSize 5000000 --batchCount 20 --batchCores 32 --batchParallelTaf 8 --batchSystem mesos --provisioner aws --defaultPreemptable --nodeStorage 4000 --maxNodes 20 --nodeTypes r5.8xlarge --logFile 447-mammalian-2022v1.maf.gz.log --retryCount 0 --betaInertia 0 --targetTime 1
UCSC BigMaf is an indexed version of MAF (see above) that can viewed on the Genome Browser. cactus-maf2bigmaf
can be used to convert MAF (as output by cactus-hal2maf
) into BigMaf.
It is recommended to create the maf using the --noAncestors
options with cactus-hal2maf
. The Browser does not support duplicates, so cactus-maf2bigmaf
will automatically filter them out using mafDuplicateFilter -k
.
cactus-maf2bigmaf ./js ./evolverMammals.maf.gz ./evolverMammals.bigmaf.bb --refGenome simHuman_chr6 --halFile evolverMammals.hal
This will produce the BigMaf file evolverMammals.bigmaf.bb
along with the summary file evolverMammals.bigmaf.summary.bb
which is used by the browser for zoomed out summary display.
The chromosome sizes of the reference genome must be provided via the original hal file via --halFile
.
The UCSC Chain Format is a concise way to represent pairwise alignments, and is used by the Genome Browser and some of its tools. HAL files can be converted into sets of Chain files using cactus-hal2chains
. Use the --queryGenomes
and --targetGenomes
flags to specify one or more query and/or target genomes. Chains between all pairs will be computed (so watch out for large alignments!). If either --queryGenomes
or --targetGenomes
is unspecifiefd, then all leaf genomes in the HAL file will be used.
For example
cactus-hal2chains ./js ./evolverMammals.hal chains-dir --queryGenomes simHuman_chr6
will create ./chains-dir
and populate it with a Chain alignment between simHuman and each other leaf genome in evolverMammals.hal.
By default, chains will be created using halLiftover
as in CAT. An option --useHalSynteny
is provided to use that tool instead.
In order to view your chains on the UCSC Genome Browser, you need to convert to bigChain. Use the --bigChain
flag to have cactus-hal2chains
produce bigChain.bb
and bigChain.link.bb
output files in addtion to chain.gz
.
You can use the alignment to generate gene annotatations for your assemblies, using the Comparative Annotation Toolkit.
HAL alignments can also be displayed in the UCSC genome browser via creation of assembly hubs as described here. hal2assemblyHub.py
is included in Cactus.
Conservation scores can be computed using phast either directly from the HAL (halPhyloP
) or from the MAF. The phast binaries are included in the Cactus releases.
The Cactus Docker image contains everything you need to run Cactus (python environment, all binaries, system dependencies). For example, to run the test data:
wget -q https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactus/master/examples/evolverMammals.txt -O evolverMammals.txt
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.8.2 cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
Or you can proceed interactively by running
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.8.2 bash
cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
Cactus supports running on AWS using Toil's autoscaling features. For more details on running in AWS, check out these instructions.
Cactus can also be run on Google Cloud Platform via Terra.
Cactus supports SLURM since version 2.6.1. To run on SLURM, add --batchSystem slurm
to your cactus command line and run it from your cluster's head node. For example,
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal --batchSystem slurm
In order to run cactus, you will need a shared filesystem across the cluster nodes where the jobstore (./js
in the exmaple above) and output file will be written. It is best to use local scratch space for jobs, so you can use --workDir
or the TEMPDIR
environment to set this to a place that will be available to write to on all cluster nodes.
Also, you should use --coordinationDir
to specify a local, physical disk (non-network filesystem) for Toil to use for its locks. This can usually be the same as used for --workDir
.
You can use TOIL_SLURM_ARGS
to add flags to the slurm submission commands that Toil uses. For example if you are running Progressive Cactus and generating loads of lastz
jobs, you may want to run
export TOIL_SLURM_ARGS="--nice=5000"
to avoid making too many enemies.
You can (and probably should) use the --batchLogsDir
option in order to enable more SLURM logging. You must pass it a directory that already exists. Ex.
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal --batchSystem slurm --batchLogsDir batch-logs
You'll want to clean out this directory after a successful run.
You cannot run cactus --batchSystem slurm
from inside the Cactus docker container, because the Cactus docker container doesn't contain slurm. Therefore in order to use slurm, you must be able to pip install
Cactus inside a virtualenv on the head node. You can still use --binariesMode docker
or --binariesMode
singularity to run cactus binaries from a container, but the Cactus Python module needs to be installed locally.
IMPORTANT
Slurm can strictly enforce memory usage. This means that Cactus must specify the memory limit of each job before it is run, and it uses the input file sizes to do this. Accurately predicting the memory usage is challenging though, and remains a work in progress. If Cactus predicts too little memory, then the job will be evicted by the cluster. If it predicts too much, ie more than available on any node, then the job will never run. In both these cases, the workflow risks being doomed though there is a change that --maxMemory
or --defaultMemory
can be used in conjunction with --restart
(see below) to rescue it. Some better Toil support would also be useful.
Most of memory usage in Cactus occurs in the cactus_consolidated
jobs and, as such, these are the most important to get right for slurm. The parameters used to estimate memory for cactus_consolidated
can be found in the <consolidatedMemory>
element in the configuraiton XML. Importantly, they can be overridden with the --consMemory
option.
To be extra safe, consider setting --consMemory
to the maximum amount of memory available on any node of your cluster. This way you know that all your jobs will be scheduled.
NOTE Since v2.7.2
--maxMemory
will act as a global memory override. So it is best to set --maxMemory
whenever using Slurm to the most memory available on your cluster to make sure that Cactus never asks for more memory. Likewise --defaultMemory
can be used to specifiy a minimum memory requirement for both jobs. Both of these options can be used with --restart
to try to rescue jobs that died due to invalid memory requirements (though they may not always work).
Cactus (through Toil) supports many other cluster workload managers in theory, including LSF, GridEngine, Parasol, and Torque, but unlike slurm they are untested and difficult for us to support. Add --batchSystem <batchSystem>
, e.g. --batchSystem gridEngine
. If your batch system needs additional configuration, Toil exposes some environment variables that can help.
Breaking Cactus up into smaller jobs can be practical, both for development and debugging, and managing larger workflows. Here is an example of how to break the Evolver Mammals example up into three steps: 1) Preprocessing 2) Blast 3) Multiple Aligment:
cactus-prepare examples/evolverMammals.txt --outDir steps-output --outSeqFile steps-output/evovlerMammals.txt --outHal steps-output/evolverMammals.hal --jobStore jobstore
It will print the sequence of commands to run the alignment step-by-step. Blocks of commands within each alignment run can be run in parallel
cactus-prepare
can also be used to simplify preprocessing sequences without decomposing the remaining workflow:
cactus-prepare examples/evolverMammals.txt --outDir steps-output --outSeqFile steps-output/evovlerMammals.txt --outHal steps-output/evolverMammals.hal --jobStore jobstore --preprocessOnly
cactus-prepare-toil
shares the interface of cactus-prepare
except instead of printing the command lines or WDL script, it runs them directly from Toil. An example use case of this is within UCSC's kubernetes cluster. Like many computing environments, the number of jobs that can be scheduled is limited, so running cactus
directly using Toil's kubernetes
batch system will swamp the cluster. But if the computation can be broken up into a handful of steps, and a job is only created for each step (as in the Cromwell/WDL method), then it can run through. So cactus-prepare-toil
will run as a high-level Toil workflow on the specified batch system, and it will launch jobs for each command (cactus-preprocess, cactus-blast, cactus-align
), and each one of these jobs will get scheduled on a node and run its command with the singleMachine
batch system. Here is an example invocation for kubernetes:
cactus-prepare-toil aws:us-west-2:<JOBSTORE-NAME> examples/evolverMammals.txt --binariesMode singularity --batchSystem kubernetes --outHal s3://<BUCKET-NAME>/out.hal --defaultDisk 20G --defaultMemory 12G --defaultCores 4
The --wdl
option in cactus-prepare
can be used to generate a bespoke WDL script for running the alignment from the input seqFile. Here is an example on how to run locally in Cromwell
cactus-prepare examples/evolverMammals.txt --wdl > evolver.wdl
wget https://github.com/broadinstitute/cromwell/releases/download/49/cromwell-49.jar
javac -jar ./cromwell-49.jar run evolver.wdl
To run on Terra, use the --noLocalInputs
option to make sure no local files are embedded in the script. Also, care must be taken to specify some minimum resource requirements.
cactus-prepare examples/evolverMammals.txt --wdl --noLocalInputs --alignCores 2 --defaultMemory 16G > evolver_terra.wdl
Then in Terra's workspace menu:
- Create a new workspace if necessary with the "+" button
- Click on the workspace
- Click on the "DATA" tab in the workspace menu and use the "Files" link to upload
examples/evolverMammals.txt
to Goggle Cloud - Click on the "WORKFLOWS" tab
- Click the "+" button to add a workflow
- Click the link in the bottom right to the "Broad Methods Repository"
- Click the "Create New Method... +" button
- Choose and namespace and name, then either upload or paste
evolver_terra.wdl
as created above and click "Upload" - If this WDL is valid, you can use the "Export To Workspace" button to link it to the Terra Workspace (using a blank configuration)
- You can select the option to go back to the Terra Workspace, otherwise the workflow should now appear as a card in the Terra "workflows" tab the next time you navigate there or refresh
- To run it, click the workflow then click the "INPUTS" tab, and select the
evolverMammals.txt
file in the Attribute field for Task=cactus_prepare
Variable=prep_seq_file
- Tick "Run workflow with inputs defined by file paths"
- Save and click "RUN ANALYSIS"
In the evolver example, all input sequences are specified in public URLs. If sequences are not specified as URLs in the seqfile, then they must be uploaded in similar fashion to how the evolverMammals.txt was uploaded and selected in the example above.
Here is an example of some settings that have worked on a mammalian-sized genome alignment on Terra. It's important to align the resources requested (CPU and memory) to N1 instance types as found here. Note that disk will be rounded up to the nearest multiple of 375G, with only these multiples supported: [0, 1, 2, 3, 4, 5, 6, 7, 8, 16, 24]
. In the example below this is:
- preprocess/blast:
n1-standard-32 + 8 v100 GPUs
- align:
n1-highmem-64
(can lower ton1-highmem-32
using--alignCores 32 --alignMemory 208Gi
for shorter branches/smaller genomes) - append:
n1-highmem-16
cactus-prepare --wdl mammals.txt --noLocalInputs \
--preprocessDisk 375Gi --preprocessCores 32 --preprocessMemory 120Gi \
--blastDisk 375Gi --blastCores 32 --gpu 8 --blastMemory 120Gi \
--alignDisk 375Gi --alignCores 64 --alignMemory 416Gi \
--halAppendDisk 3000Gi --defaultMemory 104Gi > mammals.wdl
If the workflow fails for whatever reason, it can be edited (to, say, increase job requirements) then resumed as follows:
- In the Workflows tab, click the scripts link beside "Source:" to go back to the Firecloud page to edit the WDL script
- Edit it and "Save a New Snapshot"
- Back in the Terra Workflows tab for the workflow, refresh the page, and select the new snapshot from the "Snapshots" menu.
- Click the "Save" button, ensure that "Use call caching" is ticked, then "Run Analysis" again to resume the workflow.
Important note on resuming Terra workflows
The above instructions to use Terra's call caching do not work reliably anymore. This is really frustrating, as you can be many days and dollars into a workflow, need to adjust the WDL for whatever reason, and Terra will ignore all intermediate files and restart from scratch when you rerun for reasons only it understands. But if the results are in your GCP bucket somewhere (which they will be as long as you are not explicitly removing them), you can still use them by editing the WDL to incorporate them. This can be done with cactus-terra-helper
. You can use the Terra interface (by clicking on pretty much any file) to find the root bucket prefix of all intermediate files of a given run. Once you have that, run
gsutil ls -r gs://<BUCKET/PREFIX> | cactus-terra-helper resume mammals.wdl > mammals-resume.wdl
The output script will remove all WDL calls for which output files were found in the bucket, and replace references to them to full paths of the intermediate files.
The same script can be used to download all the logs off Terra, which can be useful. This command (the -l
is important) will download the latest version of each log to the present directory.
gsutil ls -l -r gs://<BUCKET/PREFIX> | cactus-terra-helper scrape-logs
Cactus supports incrementally updating existing alignments to add, remove, or update genomes. The process involves minor surgery on the output HAL files. See this document for details. cactus-update-prepare can be used to simplify this process!
SegAlign, a GPU-accelerated version of lastz, can be used in the "preprocess" and "blast" phases to speed up the runtime considerably, provided the right hardware is available. Unlike lastz, the input sequences do not need to be chunked before running SegAlign, so it also reduces the number of Toil jobs substantially. The GPU-enabled Docker releases have SegAlign turned on by default and require no extra options from the user. Otherwise, it is possible to manually install it and then enable it in cactus
using the --gpu
command line option. One effective way of ensuring that only GPU-enabled parts of the workflow are run on GPU nodes is on Terra with cactus-prepare --gpu --wdl
(see above example).
By default --gpu
will give all available GPUs to each SegAlign job. This can be tuned by passing in a numeric value, ex --gpu 8
to assign 8 GPUs to each SegAlign job. In non-single-machine batch systems, it is mandatory to set an exact value with --gpu
.
GPUs must
- support CUDA
- have at least 8GB GPU memory (for mammal-sized input)
- have 1-2 CPU cores available each for spawning lastz jobs
We've tested SegAlign on Nvidia V100 and A10G GPUs. See the Terra example above for suggested node type on GCP.
Please cite SegAlign.
Since SegAlign
is only released in the GPU-enabled docker image, that's the easiest way to run it. When running on a cluster, this usually means the best way to use it is with --binariesMode docker --gpu <N>
. This way cactus is installed locally on your virtual environment and can run slurm commands like sbatch
(that aren't available in the Cactus container), but SegAlign itself will be run from inside Docker.
Important: Consider using --lastzMemory
when using GPU acceleration on a cluster. Like --consMemory
, it lets you override the amount of memory Toil requests which can help with errors if Cactus's automatic estimate is either too low (cluster evicts the job) or too high (cluster cannot schedule the job).
- Are the input sequences softmasked (ideally with RepeatMasker, but WindowMasker may be sufficient)? For mammals we expect at least 40% of the genome to be masked this way. You can use the
cactus_analyseAssembly
tool (included in cactus, and who's output is logged by Cactus) to check how masked the gnomes are. - Have you run a small test alignment to make sure Cactus is properly installed?
- Do you have at least one outgroup species?
Q: I'm running under macOS using the Docker functionality and get an error from Docker: docker: Error response from daemon: Mounts denied: [...]
A: Go to your Docker preferences. In the "File Sharing" tab, double-click the last entry ("/path/to/exported/directory") and type in /var/folders
. (Don't use the +
button, it won't work because it resolves symlinks before adding).
The reason you have to do this is that the Docker VM requires explicitly listing the directories that can be bind-mounted. The default temp directory on macOS (/var/folders/...
) is symlinked to a directory that is already listed as bind-mountable, but Docker checks the listing before resolving the symlink, returning an error.
Q: Why does cactus write some files to /tmp
when I specify another location with --workDir
.
A: This is a bug, but you can work around it by running export TMPDIR=<your desired path>
before running cactus.
Q: How exactly does Cactus use the branch lengths in the input tree?
A: The branch lengths are expected to be denoted in substitutions per site and are used in four ways.
-
to determine lastz parameters. The pairwise distance between species is measured using the branch lengths, and mapped to a set of lastz parameters using the
<divergences>
(inside<constants>
) and<divergence>
(inside<blast>
) elements in the configuration XML. Faster parameters are used for more closely-related species. If you are aligning human and chimp with the correct branch lengths, their distance will be about 0.02, and it will use the fasest parameters which will be several times faster than if, say, the default branch length of 1 was used. -
(since v2.6.0) to determine cactus chaining parameters. Similar to above the
<annealingRounds>
(inside<caf>
) alements are used to set the minimum chain length based on the divergence. A longer length is used for more closely related species, which will result in more syntenic, less fragmented alignments. Shorter lengths are used at higher divergences to boost sensitvity, allowing that longer synteny may not be possible due to structural changes. -
to estimate ancestral bases. Here the relative branch lengths are more important -- the base of a descendant that is much nearer to the ancestor will provide more information and will be wieghted higher when estimating it.
-
to calculate outgroups. Branch lengths are taken into account by the greedy heuristic used to find the nearest outgroup to the given ancestral event.
If you do not know the branch lengths, you can leave them out and Cactus will use its default (1). This will cause the alignment to be slower than necessary but the results shouldn't be affected much. Otherwise, even inexact branch lengths should be fine. For closely related species mash dist
is a very easy way to estimate a pairwise distance (mash
is now included in Cactus). Often you can use mash
and already-published trees to come up with your branch lengths.
We are currently working on incorporating a fast genome tree estimation workflow with Cactus.
Q: How much do I have to worry about uncertainty in my guide tree?
A: Changes in the tree topology will affect the alignment, but like the Progressive Cactus paper shows, it'll be fairly minimal for small, local changes. For most data, there is no single correct tree anyway, due to things like incomplete lineage sorting, hybridization and lateral gene transfer.
That said, cactus can handle multifurcations up to a point: runtime increases quadratically with the number of species in the split. So depending on your genome size, you probably won't be able to go much beyond 5. This should be more accurate in theory, but I haven't run the experiments to back it up.
Q: I'm running out of memory, or getting crashes, or very long runtimes in one of the paf_xxxx
tools (paf_tile, paf_to_bed
etc.). What can I do?
A: This is almost always due to Cactus having found too many pairwise alignments in the all-to-all lastz mapping (blast) phase. The only way to get around this is my softmasking the input genomes before running Cactus. For most species, this is best done with RepeatMasker. We do intend to work on lifting this requirement in the future by making cactus's own repeatmasking more robust. As of v2.6.0, Cactus is more tolerant of repetative sequence but the input still needs to be softmasked.
Q: The --gpu
option isn't working for me.
A: Unless you've set up SegAlign yourself, the GPU option will only work using the gpu-enabled Docker image (name ending in -gpu
). If you are running directly from the container make sure you use docker run
's --gpus
option to enable GPUs in your container. If you are using singularity
, the option is --nv
.
Q: But what if I want to use --gpu
on my cluster? When I try from inside the GPU-enabled container, none of my cluster commands (ex qsub
) are available.
A: Install the Cactus binary release as described in the instructions on the Releases page. But run Cactus with --binariesMode docker
(or singularity
). This will let Cactus run SegAlign (and all other binaries) directly from the container, while itself running from the Python virtualenv.
Q: I get an error to the effect of ERROR: No matching distribution found for toil[aws]==xxxx
when trying to install Toil.
A: This is probably happening because you are using Python 3.6. Toil and Cactus require Python >= 3.7. Use python3 --version
to check your Python version.
Q: I get an error to the effect of toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job cactus_cons is requesting 66623310306 bytes of memory, more than the maximum of 34359738368 bytes of memory that SingleMachineBatchSystem was configured with, or enforced by --maxMemory. Scale is set to 1.0.”
. What's going on?
A: As of version 2.6.0, Cactus is now trying to (conservatively) estimate the memory usage of each job, which is required for must cluster schedulers. This can be annoying if, like in the above scenario, the estimate is too conservative to even try running on your machine. So you can use the --consMemory
option to override it. Ex. use --consMemory 32Gi
to force Cactus to reserve exactly 32 Gigs for each cactus consolidated job. --maxMemory
and --defaultMemory
can also be used to clamp the memory of big jobs from above and below, respectively.