Michael Alonge edited this page Nov 1, 2021 · 15 revisions

RagTag Version: v2.1.0

Draft genome assemblies are often scaffolded multiple times using different approaches. For example, one might scaffold an assembly using different genome maps (physical, linkage, Hi-C, etc.), different methods, or different method parameters. RagTag merge is a tool to merge and reconcile different scaffoldings of the same assembly. In this way, one can leverage the advantages of multiple techniques to synergistically improve scaffolding.

Most tools write scaffolding results in the AGP file format, which encodes adjacency and gap information in a plain text file. To run RagTag merge, one must supply the assembly in FASTA format and at least two AGP files, each defining a scaffolding of the assembly. Each AGP file can optionally be given a weight that sets its relative influence on the final result.
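To illustrate what an AGP file encodes, here is a minimal Python sketch that splits AGP lines into component and gap records. It assumes the standard nine-column, tab-delimited AGP layout in which component types `N` and `U` mark gaps. The function name `parse_agp_line` and the sample scaffold are hypothetical and are not part of RagTag.

```python
def parse_agp_line(line):
    """Parse one tab-delimited AGP line into a dict (illustrative helper, not RagTag's API)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = {
        "object": fields[0],
        "object_beg": int(fields[1]),
        "object_end": int(fields[2]),
        "part_number": int(fields[3]),
        "component_type": fields[4],
    }
    if fields[4] in ("N", "U"):
        # Gap line: columns 6-9 describe the gap itself.
        record.update(
            is_gap=True,
            gap_length=int(fields[5]),
            gap_type=fields[6],
            linkage=fields[7],
            linkage_evidence=fields[8],
        )
    else:
        # Component line: columns 6-9 describe the placed assembly sequence.
        record.update(
            is_gap=False,
            component_id=fields[5],
            component_beg=int(fields[6]),
            component_end=int(fields[7]),
            orientation=fields[8],
        )
    return record


# Hypothetical scaffold: two contigs joined by a 100 bp gap.
example_agp = [
    "scf1\t1\t1000\t1\tW\tctgA\t1\t1000\t+",
    "scf1\t1001\t1100\t2\tU\t100\tscaffold\tyes\talign_genus",
    "scf1\t1101\t1600\t3\tW\tctgB\t1\t500\t-",
]
records = [parse_agp_line(line) for line in example_agp]
```

Read in order, the records describe one adjacency (ctgA next to ctgB) and the gap placed between them, which is the information RagTag merge reconciles across AGP files.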

If available, users can supply Hi-C alignments to the draft assembly to resolve conflicts in the merging graph. In this scenario, the input AGP files are used to build the initial graph, but then Hi-C alignments are used to re-weight the graph before computing the scaffolding solution.

Usage

usage: ragtag.py merge <asm.fa> <scf1.agp> <scf2.agp> [...]

Scaffold merging: derive a consensus scaffolding solution by reconciling distinct scaffoldings of 'asm.fa'

positional arguments:
  <asm.fasta>           assembly fasta file (uncompressed or bgzipped)
  <scf1.agp> <scf2.agp> [...]
                        scaffolding AGP files

optional arguments:
  -h, --help            show this help message and exit

merging options:
  -f FILE               CSV list of (AGP file,weight) [null]
  -j <skip.txt>         list of query headers to leave unplaced [null]
  -l INT                minimum assembly sequence length [100000]
  -e FLOAT              minimum edge weight. NA if using Hi-C [0.0]
  --gap-func STR        function for merging gap lengths {'min', 'max', or 'mean'} [min]

input/output options:
  -o PATH               output directory [./ragtag_output]
  -w                    overwrite intermediate files
  -u                    add suffix to unplaced sequence headers

Hi-C options:
  -b FILE               Hi-C alignments in BAM format, sorted by read name [null]
  -r STR                CSV list of restriction enzymes/sites or 'DNase' [GATC]
  -p FLOAT              portion of the sequence termini to consider for links [1.0]
  --list-enzymes        list all available restriction enzymes/sites

positional arguments

<asm.fasta> is the input genome assembly that we wish to scaffold. This is followed by at least two AGP files (<scf1.agp> <scf2.agp> [...]), each representing an individual scaffolding of <asm.fasta>. AGP files provided as positional arguments are assigned a weight of 1. Every sequence in <asm.fasta> must be represented as a component in each AGP file. Please use agpcheck to ensure that AGP files are properly formatted.
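The requirement that every assembly sequence appears as a component in each AGP file can be sketched as a simple set comparison. This is only an illustration and not a substitute for agpcheck. The helper name `missing_components` and the sample data are hypothetical, and the sketch assumes tab-delimited AGP lines with gap types `N` and `U`.

```python
def missing_components(fasta_headers, agp_lines):
    """Return assembly sequences that never appear as an AGP component (illustrative helper)."""
    placed = set()
    for line in agp_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[4] not in ("N", "U"):  # ignore gap lines
            placed.add(fields[5])
    return sorted(set(fasta_headers) - placed)


# Hypothetical assembly with three sequences; the AGP only mentions two of them.
headers = ["ctgA", "ctgB", "ctgC"]
agp = [
    "scf1\t1\t1000\t1\tW\tctgA\t1\t1000\t+",
    "scf1\t1001\t1100\t2\tU\t100\tscaffold\tyes\talign_genus",
    "scf1\t1101\t1600\t3\tW\tctgB\t1\t500\t+",
]
absent = missing_components(headers, agp)
```

A non-empty result would indicate that the AGP file does not account for every sequence in the assembly.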

merging options

Instead of providing AGP files as positional arguments, one can provide a list of (at least two) AGP files with -f. Use the second column of the provided CSV file to indicate the weight of each AGP file. Use -j to specify a list of assembly sequences to leave unplaced. Assembly sequences shorter than -l will also be left unplaced.
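As a rough sketch of preparing input for -f, the following Python snippet formats (AGP file, weight) pairs as two-column CSV text. The exact layout RagTag expects (for example, no header row) is an assumption here, and the file names and the helper `format_agp_weights` are hypothetical.

```python
import csv
import io


def format_agp_weights(agp_weights):
    """Format (AGP path, weight) pairs as two-column CSV text (assumed layout, no header)."""
    if len(agp_weights) < 2:
        raise ValueError("at least two AGP files are required")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for path, weight in agp_weights:
        writer.writerow([path, weight])
    return buffer.getvalue()


# Hypothetical example: give the Hi-C scaffolding twice the influence of the reference-guided one.
csv_text = format_agp_weights([("hic.agp", 2), ("reference.agp", 1)])
```

The resulting text could be saved to a file and passed with -f in place of the positional AGP arguments.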

The edges in the merging graph represent scaffolding adjacencies. If an AGP file supports a particular adjacency, its weight is added to the edge weight. Any edges with a weight lower than -e will be removed from the graph. For example, if merging three AGP files, each with a weight of 1, you may consider retaining only scaffolding joins that are supported by all three (-e 3) or by at least two of the three (-e 2) AGP files. The -e threshold does not apply when Hi-C alignments are supplied with -b.
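The weighting and filtering described above can be sketched in a few lines of Python. This is a simplified model, not RagTag's implementation: it treats each adjacency as an unordered pair of sequence names and ignores orientation. The function `merge_edge_weights` and the sequence names are hypothetical.

```python
from collections import defaultdict


def merge_edge_weights(agp_adjacencies, min_weight=0.0):
    """Sum per-AGP weights for each adjacency and drop edges below min_weight.

    agp_adjacencies: iterable of (set of (seq_a, seq_b) pairs, AGP weight).
    """
    weights = defaultdict(float)
    for adjacencies, agp_weight in agp_adjacencies:
        for seq_a, seq_b in adjacencies:
            # frozenset makes (A, B) and (B, A) the same edge.
            weights[frozenset((seq_a, seq_b))] += agp_weight
    return {edge: w for edge, w in weights.items() if w >= min_weight}


# Three hypothetical AGP files, each with a weight of 1.
inputs = [
    ({("A", "B"), ("B", "C")}, 1),
    ({("A", "B"), ("B", "C")}, 1),
    ({("B", "A"), ("C", "D")}, 1),
]
supported_by_two = merge_edge_weights(inputs, min_weight=2)   # analogous to -e 2
supported_by_all = merge_edge_weights(inputs, min_weight=3)   # analogous to -e 3
```

In this toy example the A-B join is supported by all three files, B-C by two, and C-D by one, so raising the threshold progressively removes the less supported joins.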

Scaffold gaps can differ between input AGP files. For example, a Hi-C derived AGP file might place 100 bp gaps between sequences while a reference-guided AGP file might infer gap sizes based on a reference genome. Use --gap-func to specify how gap sizes should be computed from the supporting AGP files.
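A minimal sketch of how the three --gap-func choices could combine the gap lengths reported by the supporting AGP files. The helper `merged_gap_length` is hypothetical, and rounding the mean to an integer is an assumption rather than documented RagTag behavior.

```python
from statistics import mean

# The three choices accepted by --gap-func, mapped to Python functions.
GAP_FUNCS = {"min": min, "max": max, "mean": mean}


def merged_gap_length(gap_lengths, gap_func="min"):
    """Combine gap lengths from supporting AGP files (illustrative; rounding is an assumption)."""
    if gap_func not in GAP_FUNCS:
        raise ValueError(f"unknown gap function: {gap_func}")
    return int(round(GAP_FUNCS[gap_func](gap_lengths)))


# Hypothetical join supported by a Hi-C AGP (100 bp gap) and a reference-guided AGP (5000 bp gap).
gaps = [100, 5000]
by_min = merged_gap_length(gaps, "min")
by_max = merged_gap_length(gaps, "max")
by_mean = merged_gap_length(gaps, "mean")
```

With the default `min`, a fixed placeholder gap from one tool can override a larger inferred gap from another, so `max` or `mean` may be preferable when inferred sizes are trusted.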

input/output options

By default, RagTag places all of the scaffolding output and intermediate files in a directory named ragtag_output, but this can be changed with -o. Use -w to overwrite any preexisting intermediate output files.

Use the -u option to add the "_RagTag" suffix to each sequence in the scaffold output, even unplaced query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants unplaced query sequences to retain their original header, do not use -u.

Hi-C options

Use -b to provide Hi-C alignments to the assembly sequences. Alignments must be in BAM format and sorted by read name (not by coordinate). Use -r to provide a comma-separated list of the restriction enzymes/cut sites used for the Hi-C experiment, or 'DNase' (the default is GATC). Use --list-enzymes to list the accepted restriction enzymes/sites. Use -p to set the portion of the sequence termini to consider for links (default 1.0).
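To show how the positional arguments and Hi-C options fit together, the following Python sketch assembles an argument list for `ragtag.py merge`. Only flags listed in the usage text are used. The helper `build_merge_command`, the file names, and the choice to place options before the positional arguments are assumptions for illustration.

```python
def build_merge_command(assembly, agps, bam=None, enzymes=None, out_dir=None):
    """Build an argument list for 'ragtag.py merge' (illustrative helper, not part of RagTag)."""
    if len(agps) < 2:
        raise ValueError("at least two AGP files are required")
    cmd = ["ragtag.py", "merge"]
    if out_dir:
        cmd += ["-o", out_dir]
    if bam:
        cmd += ["-b", bam]  # BAM must be sorted by read name
        if enzymes:
            cmd += ["-r", ",".join(enzymes)]  # CSV list of enzymes/sites or 'DNase'
    cmd += [assembly, *agps]
    return cmd


# Hypothetical file names.
cmd = build_merge_command("asm.fa", ["scf1.agp", "scf2.agp"], bam="hic.bam", enzymes=["GATC"])
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where RagTag is installed.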

Output

All output is in ragtag_output, or whichever directory -o specifies. All RagTag 'merge' files begin with the ragtag.merge. prefix.

If using -b, Hi-C link quantification will be stored in ragtag.merge.links. ragtag.merge.agp provides the merged scaffold results and ragtag.merge.fasta is the associated FASTA file. Any RagTag standard error messages will be stored in ragtag.merge.err.
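For scripting around RagTag merge, the output files named above can be collected with a small helper. This is a sketch built only from the file names given in this section; the function `expected_merge_outputs` is hypothetical.

```python
from pathlib import Path


def expected_merge_outputs(out_dir="ragtag_output", used_hic=False):
    """List the output files described above (illustrative helper, not part of RagTag)."""
    names = ["ragtag.merge.agp", "ragtag.merge.fasta", "ragtag.merge.err"]
    if used_hic:
        names.append("ragtag.merge.links")  # only produced when -b is used
    return [Path(out_dir) / name for name in names]


default_outputs = expected_merge_outputs()
hic_outputs = expected_merge_outputs("my_run", used_hic=True)
```

A pipeline could check that each returned path exists after the run before moving on to downstream steps.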
