merge
RagTag Version: v2.1.0
Draft genome assemblies are often scaffolded multiple times using different approaches. For example, one might scaffold an assembly using different genome maps (physical, linkage, Hi-C, etc.), different methods, or different method parameters. RagTag merge is a tool to merge and reconcile different scaffoldings of the same assembly. In this way, one can leverage the advantages of multiple techniques to synergistically improve scaffolding.
Most tools write scaffolding results in the AGP file format, which encodes adjacency and gap information in a plain text file. To run RagTag merge, one must supply the assembly in FASTA format and at least two AGP files, each defining a scaffolding of the assembly. Each AGP file can optionally be assigned a weight, which sets the relative influence of that AGP file on the final result.
If available, users can supply Hi-C alignments to the draft assembly to resolve conflicts in the merging graph. In this scenario, the input AGP files are used to build the initial graph, but then Hi-C alignments are used to re-weight the graph before computing the scaffolding solution.
usage: ragtag.py merge <asm.fa> <scf1.agp> <scf2.agp> [...]

Scaffold merging: derive a consensus scaffolding solution by reconciling distinct scaffoldings of 'asm.fa'

positional arguments:
  <asm.fasta>          assembly fasta file (uncompressed or bgzipped)
  <scf1.agp> <scf2.agp> [...]
                       scaffolding AGP files

optional arguments:
  -h, --help           show this help message and exit

merging options:
  -f FILE              CSV list of (AGP file,weight) [null]
  -j <skip.txt>        list of query headers to leave unplaced [null]
  -l INT               minimum assembly sequence length [100000]
  -e FLOAT             minimum edge weight. NA if using Hi-C [0.0]
  --gap-func STR       function for merging gap lengths {'min', 'max', or 'mean'} [min]

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

Hi-C options:
  -b FILE              Hi-C alignments in BAM format, sorted by read name [null]
  -r STR               CSV list of restriction enzymes/sites or 'DNase' [GATC]
  -p FLOAT             portion of the sequence termini to consider for links [1.0]
  --list-enzymes       list all available restriction enzymes/sites
<asm.fasta> is the input genome assembly that we wish to scaffold. This is followed by at least two AGP files (<scf1.agp> <scf2.agp> [...]), each representing an individual scaffolding of <asm.fasta>. AGP files provided as positional arguments are assigned a weight of 1. Every sequence in <asm.fasta> must be represented as a component in each AGP file. Please use agpcheck to ensure that AGP files are properly formatted.
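The requirement that every assembly sequence appears as a component in each AGP file can be checked before merging. The following is a minimal sketch and is not part of RagTag: the function names are hypothetical, and it assumes the standard tab-delimited AGP layout in which column 5 is the component type (N or U for gaps) and column 6 is the component ID. It does not replace agpcheck, which validates AGP formatting.

```python
import io

GAP_TYPES = {"N", "U"}  # AGP component types that denote gaps (assumed standard AGP)


def fasta_headers(handle):
    """Return the set of sequence names (first word after '>') in a FASTA stream."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }


def agp_components(handle):
    """Return the set of component IDs on the non-gap lines of an AGP stream."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[4] not in GAP_TYPES:
            names.add(fields[5])
    return names


def missing_components(fasta_handle, agp_handle):
    """Assembly sequences that the AGP file never mentions as a component."""
    return fasta_headers(fasta_handle) - agp_components(agp_handle)


# Toy data: ctg3 is in the assembly but absent from the AGP file.
fasta = ">ctg1\nACGT\n>ctg2\nGGCC\n>ctg3\nTTAA\n"
agp = (
    "scf1\t1\t4\t1\tW\tctg1\t1\t4\t+\n"
    "scf1\t5\t104\t2\tU\t100\tscaffold\tyes\talign_genus\n"
    "scf1\t105\t108\t3\tW\tctg2\t1\t4\t-\n"
)
print(sorted(missing_components(io.StringIO(fasta), io.StringIO(agp))))  # ['ctg3']
```

With real files, the same functions can be called on handles returned by open() for an uncompressed FASTA file and each AGP file in turn.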
Instead of providing AGP files as positional arguments, one can provide a list of (at least two) AGP files with -f. Use the second column of the provided CSV file to indicate the weight of each AGP file. Use -j to specify a list of assembly sequences to leave unplaced. Assembly sequences shorter than -l will also be left unplaced.
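As an illustration of the -f input, the sketch below writes a two-column CSV with the AGP file path in the first column and its weight in the second. The file names, the weights, and the function name are hypothetical, and the absence of a header row is an assumption based only on the "(AGP file,weight)" description in the help text.

```python
import csv
import io


def write_agp_weights(rows, handle):
    """Write (AGP path, weight) pairs as a two-column CSV with no header row."""
    writer = csv.writer(handle, lineterminator="\n")
    for path, weight in rows:
        writer.writerow([path, weight])


# Hypothetical example: give the Hi-C scaffolding twice the influence
# of the reference-guided scaffolding.
buffer = io.StringIO()
write_agp_weights([("hic.agp", 2), ("refguided.agp", 1)], buffer)
print(buffer.getvalue(), end="")
```

To produce a file for -f, pass a handle opened with open(path, "w", newline="") instead of the in-memory buffer.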
The edges in the merging graph represent scaffolding adjacencies. If an AGP file supports a particular adjacency, its weight is added to the edge weight. Any edges with a weight lower than -e will be removed from the graph. For example, if merging 3 AGP files, each with a weight of 1, you may consider only retaining scaffolding joins that are supported by all three (-e 3) or two out of three (-e 2) AGP files. This does not apply when using -b.
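The weighting and filtering described here can be sketched in a few lines. This is a deliberately simplified illustration, not RagTag's implementation: it treats an adjacency as an unordered pair of sequence names and ignores orientation, and the function names and toy scaffoldings are hypothetical.

```python
from collections import defaultdict


def merge_edge_weights(scaffoldings):
    """Sum AGP weights per adjacency.

    scaffoldings: list of (weight, scaffolds), where scaffolds is a list of
    ordered lists of sequence names, one list per scaffold.
    """
    edges = defaultdict(float)
    for weight, scaffolds in scaffoldings:
        for order in scaffolds:
            # Each pair of neighbouring sequences is one supported adjacency.
            for left, right in zip(order, order[1:]):
                edges[frozenset((left, right))] += weight
    return dict(edges)


def filter_edges(edges, min_weight):
    """Drop edges whose weight is lower than min_weight (the role of -e)."""
    return {edge: w for edge, w in edges.items() if w >= min_weight}


# Three hypothetical scaffoldings of the same four sequences, each with weight 1.
scaffoldings = [
    (1, [["ctg1", "ctg2", "ctg3"], ["ctg4"]]),
    (1, [["ctg1", "ctg2", "ctg3"], ["ctg4"]]),
    (1, [["ctg2", "ctg1"], ["ctg3", "ctg4"]]),
]
edges = merge_edge_weights(scaffoldings)

for threshold in (2, 3):
    kept = filter_edges(edges, threshold)
    print(threshold, sorted(sorted(edge) for edge in kept))
```

In this toy input the ctg1/ctg2 join is supported by all three scaffoldings, ctg2/ctg3 by two, and ctg3/ctg4 by one, so a threshold of 2 keeps two joins and a threshold of 3 keeps only the first.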
Scaffold gaps can differ between input AGP files. For example, a Hi-C derived AGP file might place 100 bp gaps between sequences while a reference-guided AGP file might infer gap sizes based on a reference genome. Use --gap-func to specify how gap sizes should be computed from the supporting AGP files.
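To make the three --gap-func choices concrete, the sketch below applies each one to the gap lengths that different AGP files report for the same join. The function name and the gap lengths are hypothetical, and how RagTag rounds a non-integer mean is not covered by this documentation, so the sketch leaves the mean unrounded.

```python
from statistics import mean

# The three choices accepted by --gap-func, mapped to Python functions.
GAP_FUNCS = {"min": min, "max": max, "mean": mean}


def merged_gap_length(gap_lengths, gap_func="min"):
    """Combine the gap lengths reported by the supporting AGP files."""
    return GAP_FUNCS[gap_func](gap_lengths)


# Hypothetical join: a Hi-C AGP reports a fixed 100 bp gap, while a
# reference-guided AGP infers a 5321 bp gap.
gaps = [100, 5321]
for name in ("min", "max", "mean"):
    print(name, merged_gap_length(gaps, name))
```

With the default of min, a fixed placeholder gap such as 100 bp will usually win over a larger inferred gap, so max or mean may be worth considering when the inferred sizes are trusted.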
By default, RagTag places all of the scaffolding output and intermediate files in a directory named ragtag_output, but this can be changed with -o. Use -w to overwrite any preexisting intermediate output files.
Use the -u option to add the "_RagTag" suffix to each sequence in the scaffold output, even unplaced query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants unplaced query sequences to retain their original header, do not use -u.
Use -b to provide Hi-C alignments to the assembly sequences. Alignments must be in BAM format and sorted by read name (not by coordinate). -r provides a list of restriction enzymes/cut-sites used for the Hi-C experiment. Use --list-enzymes to see the list of acceptable restriction enzymes/sites and further details.
All output is in ragtag_output, or whichever directory -o specifies. All RagTag 'merge' files begin with the ragtag.merge. prefix.
If using -b, Hi-C link quantification will be stored in ragtag.merge.links. ragtag.merge.agp provides the merged scaffold results and ragtag.merge.fasta is the associated FASTA file. Any RagTag standard error messages will be stored in ragtag.merge.err.
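For scripted pipelines, it can be handy to confirm that the expected files exist after a run. The helper below is a hypothetical sketch that only assembles the file names listed above under a chosen output directory. It does not run RagTag, and the function name is an assumption.

```python
from pathlib import Path


def expected_merge_outputs(out_dir="ragtag_output", used_hic=False):
    """Paths of the 'ragtag.merge.' output files described above."""
    names = ["ragtag.merge.agp", "ragtag.merge.fasta", "ragtag.merge.err"]
    if used_hic:
        # The links file is only described for runs that use -b.
        names.append("ragtag.merge.links")
    return [Path(out_dir) / name for name in names]


# Report any expected file that is not present on disk.
for path in expected_merge_outputs(used_hic=True):
    if not path.exists():
        print("missing:", path)
```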
Are these docs confusing or incomplete? Please open an issue and let me know.