scaffold
RagTag Version: v2.1.0
Scaffolding is the process of ordering and orienting draft assembly (query) sequences into longer sequences. Gaps (stretches of "N" characters) are placed between adjacent query sequences to indicate the presence of unknown sequence. RagTag uses whole-genome alignments to a reference assembly to scaffold query sequences. RagTag does not alter input query sequence in any way and only orders and orients sequences, joining them with gaps.
usage: ragtag.py scaffold <reference.fa> <query.fa>
Homology-based assembly scaffolding: Order and orient sequences in 'query.fa' by comparing them to sequences in 'reference.fa'
positional arguments:
<reference.fa> reference fasta file (uncompressed or bgzipped)
<query.fa> query fasta file (uncompressed or bgzipped)
optional arguments:
-h, --help show this help message and exit
scaffolding options:
-e <exclude.txt> list of reference sequences to ignore [null]
-j <skip.txt> list of query sequences to leave unplaced [null]
-J <hard-skip.txt> list of query headers to leave unplaced and exclude from 'chr0' ('-C') [null]
-f INT minimum unique alignment length [1000]
--remove-small remove unique alignments shorter than '-f'
-q INT minimum mapq (NA for Nucmer alignments) [10]
-d INT maximum alignment merge distance [100000]
-i FLOAT minimum grouping confidence score [0.2]
-a FLOAT minimum location confidence score [0.0]
-s FLOAT minimum orientation confidence score [0.0]
-C concatenate unplaced contigs and make 'chr0'
-r infer gap sizes. if not, all gaps are 100 bp
-g INT minimum inferred gap size [100]
-m INT maximum inferred gap size [100000]
input/output options:
-o PATH output directory [./ragtag_output]
-w overwrite intermediate files
-u add suffix to unplaced sequence headers
mapping options:
-t INT number of minimap2/unimap threads [1]
--aligner PATH aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
--mm2-params STR space delimited minimap2 parameters (overrides '-t') ['-x asm5']
--unimap-params STR space delimited unimap parameters (overrides '-t') ['-x asm5']
--nucmer-params STR space delimited nucmer parameters ['--maxmatch -l 100 -c 500']
RagTag orders and orients sequences in <query.fa> according to their mappings to <reference.fa>. These files can be uncompressed or bgzipped. Use -e to provide a single-column file listing any <reference.fa> sequences that should be ignored during scaffolding (e.g. chr0/chrUn or alt contigs). Similarly, use -j to provide a single-column file listing any <query.fa> sequences that should automatically be left unplaced. If an alignment is not entirely unique, at least -f bp of the alignment must be unique for it to be considered. By default, entirely unique alignments are considered regardless of their length, but this can be disabled with --remove-small. Doing so ensures that only alignments at least -f bp in length are considered. -q sets the minimum Minimap2/Unimap mapq score for alignments. For each query sequence, syntenic alignments within -d bp of each other are merged into longer alignments.
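As a small illustration of the single-column files that -e and -j expect, the following Python sketch collects sequence names from FASTA header lines and writes the ones matching a pattern, one name per line. The helper names, the sample headers, and the name pattern are hypothetical. The sketch also assumes that a sequence name is the header text up to the first whitespace, which is worth checking against your own files.

```python
import re
import tempfile
from pathlib import Path


def fasta_names(fasta_text):
    """Return sequence names, taken as the header text up to the first whitespace."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def write_name_list(names, pattern, out_path):
    """Write the names matching `pattern` to a single-column file and return them."""
    selected = [name for name in names if re.search(pattern, name)]
    Path(out_path).write_text("".join(name + "\n" for name in selected))
    return selected


# Hypothetical reference headers, for illustration only.
reference_text = ">chr1 primary\nACGT\n>chrUn_0001\nACGT\n>chr2_alt\nACGT\n"

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "exclude.txt"
    excluded = write_name_list(fasta_names(reference_text), r"chrUn|_alt$", out_path)
    written = out_path.read_text()

print(written, end="")
```

In practice the list would be written to a persistent path and passed to -e (for reference sequences) or -j (for query sequences).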
-i, -a, and -s specify the minimum grouping, location, and orientation confidence scores, respectively. These scores are described in the original publication. Briefly, these scores, between 0 and 1, indicate how ambiguous scaffolding was for each contig given the reference genome alignments. For example, a query sequence that aligns equally well to two distinct reference sequences will receive a grouping confidence score of 0.5. If every alignment for this query sequence is on the reverse strand, it will receive an orientation confidence score of 1.
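The two examples above can be reproduced with a simplified calculation. The sketch below is not RagTag's implementation (the exact definitions are in the original publication). It assumes, purely for illustration, that grouping confidence is the fraction of a query's aligned bp that falls on its best-matching reference sequence, and that orientation confidence is the fraction of aligned bp on the majority strand. The function names and the (reference, strand, aligned_bp) tuple layout are hypothetical.

```python
from collections import defaultdict


def grouping_confidence(alignments):
    """Fraction of aligned bp on the reference sequence with the most aligned bp."""
    per_ref = defaultdict(int)
    for ref, _strand, aligned_bp in alignments:
        per_ref[ref] += aligned_bp
    total = sum(per_ref.values())
    return max(per_ref.values()) / total if total else 0.0


def orientation_confidence(alignments):
    """Fraction of aligned bp on the strand with the most aligned bp."""
    per_strand = defaultdict(int)
    for _ref, strand, aligned_bp in alignments:
        per_strand[strand] += aligned_bp
    total = sum(per_strand.values())
    return max(per_strand.values()) / total if total else 0.0


# A query that aligns equally well to two reference sequences.
split_query = [("chr1", "+", 5000), ("chr2", "+", 5000)]
# A query whose alignments are all on the reverse strand.
reverse_query = [("chr1", "-", 3000), ("chr1", "-", 2000)]

print(grouping_confidence(split_query))       # 0.5
print(orientation_confidence(reverse_query))  # 1.0
```

Under these assumptions, a query scoring below the -i or -s threshold would be the kind of ambiguous case that is left unplaced.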
By default, RagTag appends unplaced query sequences as-is to the end of the output AGP and FASTA files. Use -C to concatenate all unplaced sequences (with gaps for padding) into a single scaffold called chr0. For gap padding generally, RagTag places 100 bp gaps between adjacent query sequences by default. Invoke -r to infer gap sizes from the alignments. The minimum and maximum inferred gap can be adjusted with -g and -m.
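To make the default padding concrete, here is a minimal sketch of joining already ordered and oriented sequences with fixed 100 bp gaps of "N" characters. It illustrates the idea only and is not RagTag's code. The function name and the toy sequences are hypothetical.

```python
def join_with_gaps(sequences, gap_size=100):
    """Join ordered, oriented sequences with runs of 'N' of length gap_size."""
    return ("N" * gap_size).join(sequences)


# Toy sequences, for illustration only.
contigs = ["ACGTAC", "GGCC", "TTA"]
scaffold = join_with_gaps(contigs)

# 6 + 4 + 3 sequence bp plus two 100 bp gaps.
print(len(scaffold))  # 213
```

The input sequences appear unchanged in the result; only gap characters are added between them.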
By default, RagTag places all of the output and intermediate files in a directory named ragtag_output, but this can be changed with -o. RagTag will not overwrite intermediate files that already exist in the output directory. This is to save time producing expensive alignment files. Users can set -w to overwrite any preexisting files.
Use the -u option to add the "_RagTag" suffix to each sequence in the scaffold output, even unplaced query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants unplaced query sequences to retain their original header, do not use -u.
Use -t to set the number of threads Minimap2 or Unimap uses for mapping (overridden by --mm2-params and --unimap-params). This option does not apply to Nucmer alignments. Use the --aligner option to specify the PATH of the appropriate aligner executable. The --mm2-params, --unimap-params, and --nucmer-params options allow one to specify custom alignment parameters for Minimap2, Unimap, and Nucmer, respectively.
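Because --mm2-params overrides -t, one way to keep multithreading while customizing Minimap2 parameters is to put the thread count in the parameter string itself. The sketch below only assembles an argument list and does not run anything. The helper name and file names are hypothetical, the RagTag flags are those listed in the usage text, and the '-t' inside the parameter string is Minimap2's own thread option. That RagTag forwards the string unchanged is an assumption here.

```python
import shlex


def build_scaffold_command(reference, query, out_dir="ragtag_output",
                           threads=1, mm2_params=None):
    """Assemble a 'ragtag.py scaffold' argument list (hypothetical helper)."""
    cmd = ["ragtag.py", "scaffold", "-o", out_dir]
    if mm2_params is None:
        # No custom parameters: RagTag's own -t sets the thread count.
        cmd += ["-t", str(threads)]
    else:
        # Custom parameters override -t, so the thread count goes in the string.
        cmd += ["--mm2-params", f"{mm2_params} -t {threads}"]
    cmd += [reference, query]
    return cmd


cmd = build_scaffold_command("reference.fa", "query.fa",
                             threads=8, mm2_params="-x asm5")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once RagTag and the aligner are installed.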
All output is in ragtag_output, or whichever directory -o specifies.
ragtag.scaffold.agp
The ordering and orientations of query sequences in AGP format.
ragtag.scaffold.fasta
The scaffolds in FASTA format, defined by the ordering and orientations of ragtag.scaffold.agp.
ragtag.scaffold.stats
Summary statistics for the scaffolding process. "placed_sequences" and "placed_bp" provide the number of query sequences and total query bp localized to one of the reference sequences. "unplaced_sequences" and "unplaced_bp" provide the number of query sequences and total query bp that were left unplaced. "gap_sequences" and "gap_bp" provide the number of gap sequences and total gap bp.
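Similar sequence and gap counts can be derived directly from an AGP file. The sketch below assumes the standard tab-delimited AGP layout: column 5 is the component type, "N" or "U" marks a gap whose length is in column 6, and any other type marks a sequence component spanning columns 7 to 8. The function name, the dictionary keys, and the sample records are hypothetical, and the counts are not guaranteed to match how RagTag computes ragtag.scaffold.stats.

```python
def summarize_agp(agp_text):
    """Count sequence components and gaps, with total bp for each, in AGP text."""
    summary = {"component_sequences": 0, "component_bp": 0,
               "gap_sequences": 0, "gap_bp": 0}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if fields[4] in ("N", "U"):
            summary["gap_sequences"] += 1
            summary["gap_bp"] += int(fields[5])
        else:
            summary["component_sequences"] += 1
            # component_beg and component_end are 1-based and inclusive
            summary["component_bp"] += int(fields[7]) - int(fields[6]) + 1
    return summary


# Hypothetical AGP records, for illustration only.
sample_agp = "\n".join([
    "\t".join(["chr1_RagTag", "1", "500", "1", "W", "contigA", "1", "500", "+"]),
    "\t".join(["chr1_RagTag", "501", "600", "2", "U", "100", "scaffold", "yes", "align_genus"]),
    "\t".join(["chr1_RagTag", "601", "900", "3", "W", "contigB", "1", "300", "-"]),
])

print(summarize_agp(sample_agp))
```

For a real run, the text would come from reading ragtag.scaffold.agp in the output directory.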
Are these docs confusing or incomplete? Please open an issue and let me know.