correct
RagTag Version: v2.1.0
RagTag offers a correction module that uses a reference genome to identify and correct potential misassemblies in a query assembly. RagTag also provides the option to verify putative misassemblies by aligning reads (from the same genotype) to the query assembly and observing read coverage near misassembly break points. In all cases, sequence is never added or subtracted. Query sequences are only broken at points of putative misassembly.
usage: ragtag.py correct <reference.fa> <query.fa>
Homology-based misassembly correction: Correct sequences in 'query.fa' by comparing them to sequences in 'reference.fa'.
positional arguments:
<reference.fa> reference fasta file (uncompressed or bgzipped)
<query.fa> query fasta file (uncompressed or bgzipped)
optional arguments:
-h, --help show this help message and exit
correction options:
-f INT minimum unique alignment length [1000]
--remove-small remove unique alignments shorter than -f
-q INT minimum mapq (NA for Nucmer alignments) [10]
-d INT maximum alignment merge distance [100000]
-b INT minimum break distance from contig ends [5000]
-e <exclude.txt> list of reference headers to ignore [null]
-j <skip.txt> list of query headers to leave uncorrected [null]
--inter only break misassemblies between reference sequences
--intra only break misassemblies within reference sequences
--gff <features.gff> don't break sequences within gff intervals [null]
input/output options:
-o PATH output directory [./ragtag_output]
-w overwrite intermediate files
-u add suffix to unaltered sequence headers
mapping options:
-t INT number of minimap2/unimap threads [1]
--aligner PATH whole genome aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
--mm2-params STR space delimited minimap2 whole genome alignment parameters (overrides '-t') ['-x asm5']
--unimap-params STR space delimited unimap parameters (overrides '-t') ['-x asm5']
--nucmer-params STR space delimited nucmer whole genome alignment parameters ['--maxmatch -l 100 -c 500']
validation options:
--read-aligner PATH read aligner executable (only 'minimap2' is allowed) [minimap2]
-R <reads.fasta> validation reads (uncompressed or gzipped) [null]
-F <reads.fofn> same as '-R', but a list of files [null]
-T STR read type. 'sr', 'ont' and 'corr' accepted for Illumina, nanopore and error corrected long-reads, respectively [null]
-v INT coverage validation window size [10000]
--max-cov INT break sequences at regions at or above this coverage level [AUTO]
--min-cov INT break sequences at regions at or below this coverage level [AUTO]
RagTag 'correct' breaks sequences in <query.fa> when they discordantly map to <reference.fa>. These files can be uncompressed or bgzipped. Use -e to provide a single-column file listing any reference.fa headers that should be ignored (e.g. chr0/chrUn or alt contigs). Similarly, use -j to provide a single-column file listing any query.fa headers that should not be broken. If an alignment is not entirely unique, at least -f bp of the alignment must be unique for it to be considered for correction. By default, entirely unique alignments are considered regardless of their length, but this can be disabled with --remove-small. Doing so ensures that only alignments at least -f bp in length are considered for correction. -q sets the minimum Minimap2/Unimap mapq score for alignments. For each query sequence, syntenic alignments within -d bp of each other are merged into longer alignments. Breaks will not be made within -b bp of query sequence termini.
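The single-column files expected by -e and -j can be written by hand, but for references with many unplaced sequences it may be easier to generate them. The following is a minimal sketch, not part of RagTag. It assumes an uncompressed FASTA and assumes that the sequence name is the first whitespace-delimited token of each header line. The function names and example headers are hypothetical.

```python
def select_headers(fasta_lines, prefixes):
    """Return FASTA sequence names that start with any of the given prefixes."""
    selected = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        # Assumption: the sequence name is the first whitespace-delimited token.
        if fields and fields[0].startswith(tuple(prefixes)):
            selected.append(fields[0])
    return selected


def write_header_list(headers, out_path):
    """Write one header per line, i.e. a single-column file."""
    with open(out_path, "w") as out:
        for header in headers:
            out.write(header + "\n")


# Hypothetical reference headers, for illustration only.
example_fasta = [
    ">chr1 some description\n",
    "ACGT\n",
    ">chrUn_001\n",
    "ACGT\n",
    ">chr0\n",
    "ACGT\n",
]
to_exclude = select_headers(example_fasta, ["chrUn", "chr0"])
print(to_exclude)
```

For a real file, one would pass an open file handle as fasta_lines, call write_header_list(to_exclude, "exclude.txt"), and then supply that file to RagTag with -e. The same helper applies to query headers for -j.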
One can also direct RagTag to only break misassemblies between (--inter, query maps to >1 reference sequence) or within (--intra, query maps discordantly to 1 reference sequence) reference sequences. If one has annotations associated with the query assembly, provide them with the --gff option to ensure that the query assembly is never broken within annotation intervals. Using --gff also allows users to later update the GFF coordinates with respect to the new broken assembly using updategff.
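To make the --gff behaviour concrete, the sketch below checks whether a candidate break position falls inside any annotation interval. It is an illustration of the idea only, not RagTag's implementation. It assumes tab-delimited GFF lines with 1-based inclusive start and end in columns 4 and 5, and it assumes the break position is expressed in the same 1-based coordinates. The function names and the example feature are hypothetical.

```python
def load_gff_intervals(gff_lines):
    """Collect (start, end) intervals per sequence ID from GFF lines."""
    intervals = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        intervals.setdefault(seqid, []).append((start, end))
    return intervals


def is_protected(seqid, position, intervals):
    """True if the position lies within any interval on the given sequence."""
    return any(start <= position <= end
               for start, end in intervals.get(seqid, []))


# Hypothetical annotation, for illustration only.
example_gff = [
    "##gff-version 3\n",
    "contig1\texample\tgene\t100\t500\t.\t+\t.\tID=gene1\n",
]
features = load_gff_intervals(example_gff)
print(is_protected("contig1", 300, features))
print(is_protected("contig1", 501, features))
```

A break at position 300 of contig1 would land inside the gene and so would be skipped under this model, while a break at 501 would not be affected.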
By default, RagTag places all output and intermediate files in a directory named ragtag_output, but this can be changed with -o. RagTag will not overwrite intermediate files that already exist in the output directory. This is to save time producing expensive alignment files. Users can set -w to overwrite any preexisting files.
Use the -u option to add the "_RagTag" suffix to each sequence in the output, even uncorrected query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants uncorrected query sequences to retain their original header, do not use -u.
Use -t to set the number of threads Minimap2 or Unimap uses for mapping (overridden by --mm2-params and --unimap-params). This option does not apply to Nucmer alignments. If the aligner executable is not in one's PATH, or one would like to use Nucmer or Unimap instead of Minimap2, use the --aligner option to specify the PATH of the appropriate aligner executable. The --mm2-params, --unimap-params, and --nucmer-params options allow one to specify custom alignment parameters for Minimap2, Unimap, and Nucmer, respectively.
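Because --mm2-params overrides -t, a custom Minimap2 parameter string presumably needs to carry its own thread setting. The sketch below assembles a command line under that assumption, using only options from the help text above plus Minimap2's own -t flag. The helper name build_correct_command and the example values are hypothetical, and the command is only printed, not executed.

```python
import shlex


def build_correct_command(reference, query, out_dir=None, threads=None,
                          aligner=None, mm2_params=None):
    """Return an argument list for 'ragtag.py correct'."""
    cmd = ["ragtag.py", "correct"]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    if aligner is not None:
        cmd += ["--aligner", aligner]
    if mm2_params is not None:
        # Assumption: since --mm2-params overrides -t, the thread count is
        # appended to the Minimap2 parameter string instead.
        if threads is not None:
            mm2_params = f"{mm2_params} -t {threads}"
        cmd += ["--mm2-params", mm2_params]
    elif threads is not None:
        cmd += ["-t", str(threads)]
    cmd += [reference, query]
    return cmd


default_cmd = build_correct_command("reference.fa", "query.fa", threads=8)
custom_cmd = build_correct_command("reference.fa", "query.fa", threads=8,
                                   mm2_params="-x asm10")
print(shlex.join(default_cmd))
print(shlex.join(custom_cmd))
```

Building the command as a list keeps the parameter string as a single argument, which is what a space-delimited option value requires when the list is later passed to subprocess.run.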
Use the validation options to verify putative misassemblies by querying read coverage near misassembly break points. Without validation, the module will break at any point of reference discordance as defined by the "correction options". With validation, RagTag maps reads to the query assembly and verifies putative break points if they are near regions of exceptionally low or high coverage. The reads (-R/-F) used for validation should come from the same genotype as the query assembly to ensure that coverage abnormalities don't arise from true biological variation. RagTag correction accepts only short reads, Oxford Nanopore (ONT) long reads, or error-corrected long reads (such as PacBio CCS); specify the read type with -T.
One can adjust the sensitivity of misassembly validation to reduce false positives. -v specifies the window around the putative misassembly break point that RagTag examines for exceptionally low or high read coverage. The larger this window size, the more likely it is to find an unrelated coverage abnormality. One can also define the low and high coverage thresholds with --min-cov and --max-cov, respectively.
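The interaction between the window size and the coverage thresholds can be illustrated with a toy model. The sketch below is not RagTag's algorithm. It assumes a per-base coverage list, a window centered on the break point, and explicit thresholds (how RagTag derives its AUTO thresholds is not covered here). The function name and the example data are hypothetical.

```python
def coverage_flags_break(coverage, break_pos, window, min_cov, max_cov):
    """True if any base in the window around break_pos has coverage
    at or below min_cov, or at or above max_cov."""
    half = window // 2
    # Assumption: the window is centered on the break point and clipped
    # to the sequence boundaries.
    start = max(0, break_pos - half)
    end = min(len(coverage), break_pos + half + 1)
    return any(c <= min_cov or c >= max_cov for c in coverage[start:end])


# Hypothetical coverage track: uniform 30x with a dip at position 50.
coverage = [30] * 100
coverage[50] = 2

near = coverage_flags_break(coverage, 48, 10, min_cov=5, max_cov=100)
far_small_window = coverage_flags_break(coverage, 20, 10, min_cov=5, max_cov=100)
far_large_window = coverage_flags_break(coverage, 20, 80, min_cov=5, max_cov=100)
print(near, far_small_window, far_large_window)
```

In this toy example, the break point at position 20 is unrelated to the dip at position 50, yet it is flagged once the window grows from 10 to 80. This is the false-positive risk of a large -v.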
RagTag can only use minimap2 for read alignment. If you don't have the minimap2 executable in your PATH, you can specify the path with --read-aligner.
All output is in ragtag_output, or whichever directory -o specifies.
ragtag.correct.fasta
The corrected query assembly in FASTA format.
ragtag.correct.agp
The AGP file defining the exact coordinates of query sequence breaks.
The "object" AGP field represents the original query sequences, while the "component" AGP field represents the broken query subsequences. If a query sequence was not broken, it will be represented as a single AGP line where the object and component share the same original sequence header. Some programs/databases don't like when the component and object are the same, so use the -u option to make the object header distinct from the component header (will also be reflected in the FASTA file), even though they represent the same sequence. If --gff was used during correction, use this AGP file to update the GFF coordinates to refer to the new broken query assembly.
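Since each object is an original query sequence and each component is a broken subsequence, the break coordinates can be read directly from the AGP file. The sketch below lists, for every object with more than one component, the object coordinate at which each component except the last one ends. It assumes the standard tab-delimited AGP column order (object, object_beg, object_end, part_number, component_type, component_id, ...). The function name and the example lines, including the component names, are hypothetical.

```python
from collections import defaultdict


def find_breaks(agp_lines):
    """Map each broken object to the 1-based object coordinates after
    which a break was made."""
    components = defaultdict(list)
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[4] in ("N", "U"):
            continue  # skip malformed lines and gap lines
        obj, obj_beg, obj_end = fields[0], int(fields[1]), int(fields[2])
        components[obj].append((obj_beg, obj_end, fields[5]))
    breaks = {}
    for obj, parts in components.items():
        parts.sort()
        if len(parts) > 1:
            # Every component except the last one ends at a break point.
            breaks[obj] = [end for _, end, _ in parts[:-1]]
    return breaks


# Hypothetical AGP lines, for illustration only.
example_agp = [
    "contigA\t1\t1000\t1\tW\tcontigA_part1\t1\t1000\t+\n",
    "contigA\t1001\t2500\t2\tW\tcontigA_part2\t1\t1500\t+\n",
    "contigB\t1\t800\t1\tW\tcontigB\t1\t800\t+\n",
]
print(find_breaks(example_agp))
```

Here contigA was broken once, between positions 1000 and 1001, while contigB appears as a single line and is therefore absent from the result.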
Reference-guided misassembly signatures are sometimes caused by true biological structural variation if the reference and query assemblies represent distinct genotypes (or haplotypes). The read validation feature should help to avoid some of these misassembly false positives, and the validation sensitivity can be tuned with command line parameters. However, it is ultimately up to the discretion of the user to decide if misassembly correction is appropriate. One should validate all RagTag results with independent data (usually physical, optical, or genetic maps), when possible.
Are these docs confusing or incomplete? Please open an issue and let me know.