.\" import www macros (URL, TAG, MTO)
.mso www.tmac
.\" ============================================================================
.TH vsearch 1 "December 20, 2024" "version 2.29.2" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em a versatile open-source tool for microbiome analysis,
including chimera detection, clustering, dereplication and
rereplication, extraction, FASTA/FASTQ/SFF file processing, masking,
orienting, pairwise alignment, restriction site cutting, searching,
shuffling, sorting, subsampling, and taxonomic classification of
amplicon sequences for metagenomics, genomics, and population
genetics.
.\" ============================================================================
.SH SYNOPSIS
.\" left justified, ragged right
.ad l
Chimera detection:
.RS
\fBvsearch\fR (\-\-uchime_denovo | \-\-uchime2_denovo |
\-\-uchime3_denovo) \fIfastafile\fR (\-\-chimeras | \-\-nonchimeras |
\-\-uchimealns | \-\-uchimeout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-uchime_ref \fIfastafile\fR (\-\-chimeras |
\-\-nonchimeras | \-\-uchimealns | \-\-uchimeout) \fIoutputfile\fR
\-\-db \fIfastafile\fR [\fIoptions\fR]
.PP
.RE
Clustering:
.RS
\fBvsearch\fR (\-\-cluster_fast | \-\-cluster_size |
\-\-cluster_smallmem | \-\-cluster_unoise) \fIfastafile\fR (\-\-alnout
| \-\-biomout | \-\-blast6out | \-\-centroids | \-\-clusters |
\-\-mothur_shared_out | \-\-msaout | \-\-otutabout | \-\-profile |
\-\-samout | \-\-uc | \-\-userout) \fIoutputfile\fR \-\-id \fIreal\fR
[\fIoptions\fR]
.PP
.RE
Dereplication and rereplication:
.RS
\fBvsearch\fR \-\-fastx_uniques (\fIfastafile\fR | \fIfastqfile\fR)
(\-\-fastaout | \-\-fastqout | \-\-tabbedout | \-\-uc) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-derep_fulllength | \-\-derep_id | \-\-derep_prefix)
\fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-derep_smallmem (\fIfastafile\fR | \fIfastqfile\fR)
\-\-fastaout \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Extraction of sequences:
.RS
\fBvsearch\fR \-\-fastx_getseq \fIfastafile\fR (\-\-fastaout |
\-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR
\-\-label \fIlabel\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_getseqs \fIfastafile\fR (\-\-fastaout |
\-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR
(\-\-label \fIlabel\fR | \-\-labels \fIlabelfile\fR | \-\-label_word
\fIlabel\fR | \-\-label_words \fIlabelfile\fR) [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_getsubseq \fIfastafile\fR (\-\-fastaout |
\-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR
\-\-label \fIlabel\fR [\-\-subseq_start \fIposition\fR]
[\-\-subseq_end \fIposition\fR] [\fIoptions\fR]
.PP
.RE
FASTA/FASTQ/SFF file processing:
.RS
\fBvsearch\fR \-\-fasta2fastq \fIfastafile\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_chars \fIfastqfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_convert \fIfastqfile\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-fastq_eestats | \-\-fastq_eestats2) \fIfastqfile\fR
\-\-output \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_filter \fIfastqfile\fR [\-\-reverse
\fIfastqfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout
| \-\-fastqout_discarded | \-\-fastaout_rev | \-\-fastaout_discarded_rev
| \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_join \fIfastqfile\fR \-\-reverse
\fIfastqfile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_mergepairs \fIfastqfile\fR \-\-reverse
\fIfastqfile\fR (\-\-fastaout | \-\-fastqout |
\-\-fastaout_notmerged_fwd | \-\-fastaout_notmerged_rev |
\-\-fastqout_notmerged_fwd | \-\-fastqout_notmerged_rev |
\-\-eetabbedout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_stats \fIfastqfile\fR
[\-\-log \fIlogfile\fR] [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_filter \fIinputfile\fR [\-\-reverse
\fIinputfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout
| \-\-fastqout_discarded | \-\-fastaout_rev | \-\-fastaout_discarded_rev
| \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_revcomp \fIinputfile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Masking:
.RS
\fBvsearch\fR \-\-fastx_mask \fIfastxfile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-maskfasta \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Orienting:
.RS
\fBvsearch\fR \-\-orient \fIfastxfile\fR \-\-db \fIfastxfile\fR
(\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-tabbedout)
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Pairwise alignment:
.RS
\fBvsearch\fR \-\-allpairs_global \fIfastafile\fR (\-\-alnout |
\-\-blast6out | \-\-matched | \-\-notmatched | \-\-samout | \-\-uc |
\-\-userout) \fIoutputfile\fR (\-\-acceptall | \-\-id \fIreal\fR)
[\fIoptions\fR]
.PP
.RE
Restriction site cutting:
.RS
\fBvsearch\fR \-\-cut \fIfastafile\fR \-\-cut_pattern \fIpattern\fR
(\-\-fastaout | \-\-fastaout_rev | \-\-fastaout_discarded |
\-\-fastaout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Searching:
.RS
\fBvsearch\fR \-\-search_exact \fIfastafile\fR \-\-db \fIfastafile\fR
(\-\-alnout | \-\-biomout | \-\-blast6out | \-\-mothur_shared_out |
\-\-otutabout | \-\-samout | \-\-uc | \-\-userout | \-\-lcaout)
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-usearch_global \fIfastafile\fR \-\-db
\fIfastafile\fR (\-\-alnout | \-\-biomout | \-\-blast6out |
\-\-mothur_shared_out | \-\-otutabout | \-\-samout | \-\-uc |
\-\-userout | \-\-lcaout) \fIoutputfile\fR \-\-id \fIreal\fR
[\fIoptions\fR]
.PP
.RE
Shuffling and sorting:
.RS
\fBvsearch\fR (\-\-shuffle | \-\-sortbylength | \-\-sortbysize)
\fIfastafile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Subsampling:
.RS
\fBvsearch\fR \-\-fastx_subsample \fIfastafile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR (\-\-sample_pct \fIreal\fR |
\-\-sample_size \fIpositive integer\fR) [\fIoptions\fR]
.PP
.RE
Taxonomic classification:
.RS
\fBvsearch\fR \-\-sintax \fIfastafile\fR \-\-db \fIfastafile\fR
\-\-tabbedout \fIoutputfile\fR [\-\-sintax_cutoff \fIreal\fR]
[\fIoptions\fR]
.PP
.RE
UDB database handling:
.RS
\fBvsearch\fR \-\-makeudb_usearch \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-udb2fasta \fIudbfile\fR \-\-output \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-udbinfo | \-\-udbstats) \fIudbfile\fR
[\fIoptions\fR]
.PP
.RE
.\" left and right justified (default)
.ad b
.\" ============================================================================
.SH DESCRIPTION
Environmental or clinical molecular diversity studies generate large
volumes of amplicons (e.g., SSU-rRNA sequences) that need to be
checked for chimeras, dereplicated, masked, sorted, searched,
clustered or compared to reference sequences. The aim of \fBvsearch\fR
is to offer an all-in-one open source tool to perform these tasks,
using optimized algorithm implementations and harvesting the full
potential of modern computers, thus providing fast and accurate data
processing.
.PP
Comparing nucleotide sequences is at the core of \fBvsearch\fR. To
speed up comparisons, \fBvsearch\fR implements an extremely fast
Needleman-Wunsch algorithm, making use of the Streaming SIMD
Extensions (SSE2) of post-2003 x86-64 CPUs.  If SSE2 instructions are
not available, \fBvsearch\fR exits with an error message. On Power8
CPUs it will use AltiVec/VSX/VMX instructions, and on ARMv8 CPUs it
will use Neon instructions. On other systems it can use the SIMD
Everywhere (simde) library, if available. Memory usage increases
rapidly with sequence length: for example, comparing two sequences of
length 1 kb requires 8 MB of memory per thread, and comparing two 10
kb sequences requires 800 MB of memory per thread. For comparisons
involving sequences with a length product greater than 25 million (for
example two sequences of length 5 kb), \fBvsearch\fR uses a slower
alignment method described by Hirschberg (1975) and Myers and Miller
(1988), with much smaller memory requirements.
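.PP
As a rough illustration of the figures above, the following Python
sketch estimates the per-thread memory of a full alignment matrix and
checks the length-product threshold. The value of 8 bytes per matrix
cell is an assumption inferred from the two examples given above (8 MB
for two 1 kb sequences), and the function names are hypothetical. It
is not taken from the \fBvsearch\fR source code.
.PP
.nf
```python
# Assumption: 8 bytes per matrix cell, inferred from the examples in the
# text (1 kb x 1 kb -> 8 MB, 10 kb x 10 kb -> 800 MB).
BYTES_PER_CELL = 8
# Length product above which the text says a slower, low-memory
# alignment method is used.
LENGTH_PRODUCT_THRESHOLD = 25_000_000


def full_matrix_megabytes(len_a, len_b):
    """Estimate the memory (in MB) of a full alignment matrix, per thread."""
    return len_a * len_b * BYTES_PER_CELL / 1_000_000


def uses_low_memory_method(len_a, len_b):
    """Return True if the length product exceeds the stated threshold."""
    return len_a * len_b > LENGTH_PRODUCT_THRESHOLD
```
.fi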
.\" ----------------------------------------------------------------------------
.SS Input
\fBvsearch\fR accepts as input fasta or fastq files containing one or
several nucleotide entries. In fasta files, each entry is made of a
header and a sequence. The header is defined as the string located
between the initial '>' symbol and the first space, tab or the end of
the line, unless the \-\-notrunclabels option is in effect, in which
case the entire line is included. The header should contain printable
ascii characters (33-126). The program will terminate with a fatal
error if there are unprintable ascii characters. A warning will be
issued if non-ascii characters (128-255) are encountered.
.PP
If the header matches the pattern '>[;]size=\fIinteger\fR;label', the
pattern '>label;size=\fIinteger\fR;label', or the
pattern '>label;size=\fIinteger\fR[;]', \fBvsearch\fR will interpret
\fIinteger\fR as the number of occurrences (or abundance) of the
sequence in the study. That abundance information is used or created
during chimera detection, clustering, dereplication, sorting and
searching.
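.PP
The header patterns above can be approximated with a regular
expression. The Python sketch below is one possible reading of those
patterns and may differ from the actual \fBvsearch\fR parser in edge
cases. The names SIZE_PATTERN and header_abundance are hypothetical.
.PP
.nf
```python
import re

# Assumed reading of the patterns described above: "size=" preceded by the
# start of the header or a semicolon, followed by digits, and then a
# semicolon or the end of the header.
SIZE_PATTERN = re.compile("(?:^|;)size=([0-9]+)(?:;|$)")


def header_abundance(header):
    """Return the abundance found in a header (given without '>'), or None."""
    match = SIZE_PATTERN.search(header)
    return int(match.group(1)) if match else None
```
.fi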
.PP
The sequence is defined as a string of IUPAC symbols
(ACGTURYSWKMDBHVN), starting after the end of the identifier line and
ending before the next identifier line, or the file end. \fBvsearch\fR
silently ignores ascii characters 9 to 13, and exits with an error
message if ascii characters 0 to 8, 14 to 31, '.' or '-' are
present. All other ascii or non-ascii characters are stripped and
complained about in a warning message.
.PP
In fastq files, each entry is made of a sequence header starting with a
symbol '@', a nucleotide sequence (same rules as for fasta
sequences), a quality header starting with a symbol '+' and a string
of ASCII characters (offset 33 or 64), each one encoding the quality
value of the corresponding position in the nucleotide sequence.
.PP
\fBvsearch\fR operations are case insensitive, except when soft
masking is activated. Masking is automatically applied during chimera
detection, clustering, masking, pairwise alignment and searching. Soft
masking is specified with the options '\-\-dbmask soft' (for searching
and chimera detection with a reference) or '\-\-qmask soft' (for
searching, \fIde novo\fR chimera detection, clustering and
masking). When using soft masking, lower case letters indicate masked
symbols, while upper case letters indicate regular symbols. Masked
symbols are never included in the unique index words used for sequence
comparisons; otherwise they are treated as normal symbols.
.PP
When comparing sequences during chimera detection, dereplication,
searching and clustering, T and U are considered identical, regardless
of their case. When aligning sequences, identical symbols will receive
a positive match score (default +2). If two symbols are not identical,
their alignment results in a negative mismatch score (default
-4). Aligning a pair of symbols where at least one of them is an
ambiguous symbol (BDHKMNRSVWY) will always result in a score of
zero. Alignment of two identical ambiguous symbols (for example, R vs
R) also receives a score of zero. When computing the amount of
similarity by counting matches and mismatches after alignment,
ambiguous nucleotide symbols will count as matching to other symbols
if they have at least one of the nucleotides (ACGTU) they may
represent in common. For example: W will match A and T, but also any
of MRVHDN. When showing alignments (for example with the \-\-alnout
option) matches involving ambiguous symbols will be shown with a plus
character (+) between them while exact matches between non-ambiguous
symbols will be shown with a vertical bar character (|).
.PP
\fBvsearch\fR can read data from standard files and write to standard
files, but it can also read from pipes and write to pipes! For
example, multiple fasta files can be piped into \fBvsearch\fR for
dereplication. To do so, file names can be replaced with:
.RS
.IP - 2
the symbol '-', representing '/dev/stdin' for input files
or '/dev/stdout' for output files (with an exception for '\-\-db \-',
see * below),
.IP -
a named pipe created with the command mkfifo,
.IP -
a process substitution '<(command)' as input or '>(command)' as
output.
.IP *
\-\-db \- is not accepted, to prevent potential concurrent reads from
stdin. A workaround for advanced users is to call '\-\-db /dev/stdin'
directly.
.RE
.PP
\fBvsearch\fR can automatically read compressed gzip or bzip2 files if
the appropriate libraries are present during the
compilation. \fBvsearch\fR can also read pipes streaming compressed
gzip or bzip2 data if the options \-\-gzip_decompress or
\-\-bzip2_decompress are selected. When reading from a pipe, the
progress indicator is not updated.
.\" ----------------------------------------------------------------------------
.SS Options
\fBvsearch\fR recognizes a large number of command-line commands and
options. For easier navigation, options are grouped below by theme
(chimera detection, clustering, dereplication and rereplication,
FASTA/FASTQ file processing, masking, pairwise alignment, searching,
shuffling, sorting, and subsampling). We start with the general
options that apply to all themes. Options start with a double dash
(\-\-). A single dash (\-) may also be used, except on NetBSD
systems. Option names may be shortened as long as they are not
ambiguous (e.g. \-\-derep_f).
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG help-and-version-commands
Help and version commands:
.PP
.RS
.TAG help
.TAG h
.TP 9
.B \-\-help \-\-h
Display help text with brief information about all commands and
options.
.TAG version
.TAG v
.TP
.B \-\-version \-\-v
Output version information and a citation for the VSEARCH
publication. Show the status of the support for gzip- and
bzip2-compressed input files.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG general-options
General options:
.RS
.TAG bzip2_decompress
.TP 9
.B \-\-bzip2_decompress
When reading from a pipe streaming bzip2-compressed data, decompress
the data. This option is not needed when reading from a standard
bzip2-compressed file.
.TAG fasta_width
.TP
.BI \-\-fasta_width\~ "positive integer"
Fasta files produced by \fBvsearch\fR are wrapped (sequences are
written on lines of \fIinteger\fR nucleotides, 80 by default). Set
the value to zero to eliminate the wrapping.
.TAG gzip_decompress
.TP
.B \-\-gzip_decompress
When reading from a pipe streaming gzip-compressed data, decompress
the data. This option is not needed when reading from a standard
gzip-compressed file.
.TAG label_suffix
.TP
.BI \-\-label_suffix\~ string
When writing FASTA or FASTQ files, add the suffix \fIstring\fR to
sequence headers.
.TAG log
.TP
.BI \-\-log \0filename
Write messages to the specified log file. Information written includes
program version, amount of memory available, number of cores and
command line options, and if need be, informational messages, warnings
and fatal errors. The start and finish times are also recorded as well
as the elapsed time and the maximum amount of memory consumed. The
different \fBvsearch\fR commands can also write additional
information to the log file.
.TAG maxseqlength
.TP
.BI \-\-maxseqlength\~ "positive integer"
All \fBvsearch\fR operations discard sequences longer than
\fIinteger\fR (50,000 nucleotides by default).
.TAG minseqlength
.TP
.BI \-\-minseqlength\~ "positive integer"
All \fBvsearch\fR operations discard sequences shorter than
\fIinteger\fR: 1 nucleotide by default for sorting or shuffling, 32
nucleotides for clustering and dereplication as well as the commands
\-\-makeudb_usearch, \-\-sintax, and \-\-usearch_global.
.\" note: minseqlength can be set to zero (keep empty entries)
.TAG no_progress
.TP
.B \-\-no_progress
Do not show the gradually increasing progress indicator.
.TAG notrunclabels
.TP
.B \-\-notrunclabels
Do not truncate sequence labels at first space or tab, but use the full
header in output files. Turned off by default for all commands except
the sintax command.
.TAG quiet
.TP
.B \-\-quiet
Suppress all messages to stdout and stderr except for warnings and
fatal error messages.
.TAG sample
.TP
.BI \-\-sample\~ string
When writing FASTA or FASTQ files, add the given sample identifier
\fIstring\fR to sequence headers. For instance, if the given string is
ABC, the text ";sample=ABC" will be added to the header. Note that
\fIstring\fR will be truncated at the first ';' or blank
character. Other characters (alphabetical, numerical and punctuation)
are accepted.
.TAG threads
.TP
.BI \-\-threads\~ "positive integer"
Number of computation threads to use (1 to 1024). The number of threads
should be less than or equal to the number of available CPU cores. The
default is to use all available resources and to launch one thread per
core. The following commands are multi-threaded:
allpairs_global, cluster_fast, cluster_size, cluster_smallmem,
cluster_unoise, fastq_mergepairs, fastx_mask, maskfasta, search_exact,
sintax, uchime_ref, and usearch_global. Only one thread is used for
the other commands.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG chimera-detection-options
Chimera detection options:
.PP
.RS
Chimera detection is based on a scoring function controlled by five
options (\-\-dn, \-\-mindiffs, \-\-mindiv, \-\-minh,
\-\-xn). Sequences are first sorted by decreasing abundance, if
available, and compared on their \fIplus\fR strand only (case
insensitive).
.PP
Input sequences are masked as specified with the \-\-qmask and
\-\-hardmask options. Masking of the database for reference based
chimera detection is specified with the \-\-dbmask option.
.PP
In \fIde novo\fR mode, the input fasta file must contain abundance
annotations (i.e. a pattern [;]size=\fIinteger\fR[;] in the fasta
header). Input order matters for chimera detection, so we recommend
sorting sequences by decreasing abundance (default of the
\-\-derep_fulllength command). If your sequence set needs to be
sorted, please see the \-\-sortbysize command in the sorting section.
.PP
.TAG abskew
.TP 9
.BI \-\-abskew \0real
When using \-\-uchime_denovo, the abundance skew is used to
distinguish in a three-way alignment which sequence is the chimera and
which are the parents. The assumption is that chimeras appear later in
the PCR amplification process and are therefore less abundant than
their parents. For \-\-uchime3_denovo the default value is 16.0. For
the other commands, the default value is 2.0, which means that the
parents should be at least 2 times more abundant than their
chimera. Any positive value equal to or greater than 1.0 can be used.
.TAG alignwidth
.TP
.BI \-\-alignwidth\~ "positive integer"
When using \-\-uchimealns, set the width of the three-way alignments
(80 nucleotides by default). Set to zero to eliminate wrapping.
.TAG borderline
.TP
.BI \-\-borderline \0filename
Output borderline chimeric sequences to \fIfilename\fR, in fasta
format. Borderline chimeric sequences are sequences that have a high
enough score but which are not sufficiently different from their
closest parent.
.TAG chimeras
.TP
.BI \-\-chimeras \0filename
Output chimeric sequences to \fIfilename\fR, in fasta format. Output
order may vary when using multiple threads.
.TAG db
.TP
.BI \-\-db \0filename
When using \-\-uchime_ref, detect chimeras using the reference
sequences contained in \fIfilename\fR. Reference sequences are assumed
to be chimera-free. Chimeras cannot be detected if their parents, or
sufficiently close relatives, are not present in the database. The
file name must refer to a FASTA file or to a UDB file. If a UDB file
is used, it should be created using the \-\-makeudb_usearch command
with the \-\-dbmask dust option.
.TAG dn
.TP
.BI \-\-dn\~ "strictly positive real number"
Pseudo-count prior on the number of no votes, corresponding to the
parameter \fIn\fR in the chimera scoring function (default value is
1.4). Increasing \-\-dn reduces the likelihood of tagging a sequence
as a chimera (fewer false positives, but also more false negatives).
.TAG fasta_score
.TP
.B \-\-fasta_score
Add the chimera score to the headers in the fasta output files for
chimeras, non-chimeras and borderline sequences, using the
format ';uchime_denovo=\fIfloat\fR;'.
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG mindiffs
.TP
.BI \-\-mindiffs\~ "positive integer"
Minimum number of differences per segment (default value is 3). The
parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo.
.TAG mindiv
.TP
.BI \-\-mindiv \0real
Minimum divergence from closest parent (default value is 0.8). The
parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo.
.TAG minh
.TP
.BI \-\-minh \0real
Minimum score (\fIh\fR). Increasing this value tends to reduce the
number of false positives and to decrease sensitivity. Default value
is 0.28, and values ranging from 0.0 to 1.0 included are accepted. The
parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo.
.TAG nonchimeras
.TP
.BI \-\-nonchimeras \0filename
Output non-chimeric sequences to \fIfilename\fR, in fasta
format. Output order may vary when using multiple threads.
.TAG relabel
.TP
.BI \-\-relabel \0string
Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3,
etc.) to construct the new headers. Use \-\-sizeout to conserve the
abundance annotations.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Relabel sequences using the MD5 message digest algorithm applied to
each sequence. Former sequence headers are discarded. The sequence is
converted to upper case and each 'U' is replaced by a 'T' before
computation of the digest. The MD5 digest is a cryptographic hash
function designed to minimize the probability that two different
inputs give the same output, even for very similar, but non-identical
inputs. Still, there is a very small, but non-zero, probability that
two different inputs give the same digest (i.e. a collision). MD5
generates a 128-bit (16-byte) digest that is written as 32
hexadecimal digits (0123456789abcdef). Use
\-\-sizeout to conserve the abundance annotations.
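.IP
The normalization described above (upper case, 'U' replaced by 'T')
can be sketched in Python with the standard hashlib module. This is an
illustration only: the function name is hypothetical, and whether it
reproduces the labels written by \fBvsearch\fR byte for byte is an
assumption that has not been verified here. Passing "sha1" gives the
analogous digest for \-\-relabel_sha1.
.IP
.nf
```python
import hashlib


def sequence_digest(sequence, algorithm="md5"):
    """Return the hexadecimal digest of a normalized nucleotide sequence."""
    # Normalize as described in the text: upper case, then U -> T.
    normalized = sequence.upper().replace("U", "T")
    digest = hashlib.new(algorithm)
    digest.update(normalized.encode("ascii"))
    return digest.hexdigest()
```
.fi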
.\" The probablity of collision for two sequences is 1/2^128
.TAG relabel_self
.TP
.B \-\-relabel_self
Relabel sequences using each sequence itself as a label.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Relabel sequences using the SHA1 message digest algorithm applied to
each sequence. It is similar to the \-\-relabel_md5 option but uses
the SHA1 algorithm instead of the MD5 algorithm. SHA1 generates a
160-bit (20-byte) digest that is written as 40 hexadecimal digits.
The probability of a collision (two non-identical
sequences resulting in the same digest) is smaller for the SHA1
algorithm than it is for the MD5 algorithm.
.\" The probablity of collision for two sequences is 1/2^160
.TAG self
.TP
.B \-\-self
When using \-\-uchime_ref, ignore a reference sequence when its label
matches the label of the query sequence (useful to estimate
false-positive rate in reference sequences).
.\" I am not sure the statement above is true.
.TAG selfid
.TP
.B \-\-selfid
When using \-\-uchime_ref, ignore a reference sequence when its
nucleotide sequence is strictly identical to the nucleotidic sequence
of the query.
.TP
.B \-\-sizein
In \fIde novo\fR mode, abundance annotations
(pattern '[>;]size=\fIinteger\fR[;]') present in sequence headers are
taken into account by default (\-\-sizein is always implied). This
option is ignored by \-\-uchime_ref.
.TAG sizeout
.TP
.B \-\-sizeout
When relabelling, add abundance annotations to fasta headers (using
the format ';size=\fIinteger\fR;').
.TAG uchime_denovo
.TP
.BI \-\-uchime_denovo \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR, without
external references (i.e. \fIde novo\fR). Automatically sort the
sequences in \fIfilename\fR by decreasing abundance beforehand (see
the sorting section for details). Multithreading is not supported.
.TAG uchime2_denovo
.TP
.BI \-\-uchime2_denovo \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR, using
the UCHIME2 algorithm. This algorithm is designed for denoised
amplicons (see \-\-cluster_unoise). Automatically sort the sequences
in \fIfilename\fR by decreasing abundance beforehand (see the sorting
section for details).  Multithreading is not supported.
.TAG uchime3_denovo
.TP
.BI \-\-uchime3_denovo \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR, using
the UCHIME2 algorithm. The only difference from \-\-uchime2_denovo is
that the default minimum abundance skew (\-\-abskew) is set to 16.0
rather than 2.0.
.TAG uchime_ref
.TP
.BI \-\-uchime_ref \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR by
comparing them with reference sequences (option
\-\-db). Multithreading is supported.
.TAG uchimealns
.TP
.BI \-\-uchimealns \0filename
Write the three-way global alignments (parentA, parentB, chimera) to
\fIfilename\fR using a human-readable format. Use \-\-alignwidth to
modify alignment length. Output order may vary when using multiple
threads. All sequences are converted to upper case before
alignment. Lower case letters indicate disagreement in the alignment.
.TAG uchimeout
.TP
.BI \-\-uchimeout \0filename
Write chimera detection results to \fIfilename\fR using an 18-field,
tab\-separated uchime\-like format. Use \-\-uchimeout5 to use a format
compatible with usearch v5 and earlier versions. Row output order may
vary when using multiple threads.
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
score: higher score means a more likely chimeric alignment.
.IP \n+[step].
Q: query sequence label.
.IP \n+[step].
A: parent A sequence label.
.IP \n+[step].
B: parent B sequence label.
.IP \n+[step].
T: top parent sequence label (i.e. parent most similar to the
query). That field is removed when using \-\-uchimeout5.
.IP \n+[step].
idQM: percentage of similarity of query (Q) and model (M)
constructed as a part of parent A and a part of parent B.
.IP \n+[step].
idQA: percentage of similarity of query (Q) and parent A.
.IP \n+[step].
idQB: percentage of similarity of query (Q) and parent B.
.IP \n+[step].
idAB: percentage of similarity of parent A and parent B.
.IP \n+[step].
idQT: percentage of similarity of query (Q) and top parent (T).
.IP \n+[step].
LY: yes votes in the left part of the model.
.IP \n+[step].
LN: no votes in the left part of the model.
.IP \n+[step].
LA: abstain votes in the left part of the model.
.IP \n+[step].
RY: yes votes in the right part of the model.
.IP \n+[step].
RN: no votes in the right part of the model.
.IP \n+[step].
RA: abstain votes in the right part of the model.
.IP \n+[step].
div: divergence, defined as (idQM - idQT).
.IP \n+[step].
YN: query is chimeric (Y), or not (N), or is a borderline case (?).
.RE
.RE
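.IP
The field list above maps naturally onto a small parser. The Python
sketch below reads tab-separated \-\-uchimeout rows into dictionaries
keyed by the field names listed above. The function and constant names
are hypothetical, and all values are kept as strings. With
uchimeout5 set, the fifth field (T) is assumed to be absent.
.IP
.nf
```python
import csv

# Field names in the order given in the list above.
UCHIMEOUT_FIELDS = [
    "score", "Q", "A", "B", "T", "idQM", "idQA", "idQB", "idAB", "idQT",
    "LY", "LN", "LA", "RY", "RN", "RA", "div", "YN",
]


def parse_uchimeout(lines, uchimeout5=False):
    """Parse tab-separated result rows into a list of dictionaries."""
    # The 17-field variant drops the top parent label (T).
    fields = [f for f in UCHIMEOUT_FIELDS if not (uchimeout5 and f == "T")]
    records = []
    for row in csv.reader(lines, dialect="excel-tab", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip empty lines
        if len(row) != len(fields):
            raise ValueError("unexpected number of fields: %d" % len(row))
        records.append(dict(zip(fields, row)))
    return records
```
.fi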
.TAG uchimeout5
.TP
.B \-\-uchimeout5
When using \-\-uchimeout, write chimera detection results using a
17\-field, tab\-separated uchime\-like format (drop the 5th field of
\-\-uchimeout), compatible with usearch version 5 and earlier
versions.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xn
.TP
.BI \-\-xn\~ "strictly positive real number"
Weight of no votes, corresponding to the parameter \fIbeta\fR in the
scoring function (default value is 8.0). Increasing \-\-xn reduces the
likelihood of tagging a sequence as a chimera (fewer false positives,
but also more false negatives).
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG clustering-options
Clustering options:
.RS
.PP
\fBvsearch\fR implements a single-pass, greedy centroid-based
clustering algorithm, similar to the algorithms implemented in
usearch, DNAclust and sumaclust for example. Important parameters are
the global clustering threshold (\-\-id) and the pairwise identity
definition (\-\-iddef).
.PP
Input sequences are masked as specified with the \-\-qmask and
\-\-hardmask options.
.TAG biomout
.TP 9
.BI \-\-biomout \0filename
Generate an OTU table in the biom version 1.0 JSON file format as
specified at
.URL https://biom-format.org/documentation/format_versions/biom-1.0.html "(link)"
<https://biom-format.org/documentation/format_versions/biom-1.0.html>.
The format describes how to store a sparse matrix containing the
abundances of the OTUs in the different samples. This format is much
more efficient than the classic and mothur OTU table formats available
with the \-\-otutabout and \-\-mothur_shared_out options,
respectively, and is recommended at least for large tables. The OTUs
are represented by the cluster centroids. Taxonomy information will be
included for the OTUs if available. Sample identifiers will be
extracted from the headers of all sequences in the input file. If the
header contains ';sample=abc123;' or ';barcodelabel=abc123;' or a
similar string somewhere, then the given sample identifier
(here 'abc123') will be used. The semicolon is not mandatory at the
beginning or end of the header. The sample identifier may contain any
printable character except semicolons. If no such sample label is
found, the identifier in the initial part of the header will be used,
but only letters, digits and underscores are allowed. OTU identifiers
will be extracted from the headers of the cluster centroid
sequences. If the header contains ';otu=def789;' or a similar string
somewhere, then the given OTU identifier (here 'def789') will be
used. The semicolon is not mandatory at the beginning or end of the
header. The OTU identifier may contain any printable character except
semicolons. If no such OTU label is found, the identifier in the
initial part of the header will be used, and all characters except
semicolons are allowed. Alternatively, OTU identifiers can be
generated using the relabelling options (\-\-relabel,
\-\-relabel_self, \-\-relabel_sha1, or \-\-relabel_md5). Taxonomy
information, if present, will also be extracted from the headers of
the centroid sequences. If the header contains ';tax=Homo_sapiens;' or
a similar string somewhere, then the given taxonomy information
(here 'Homo_sapiens') will be used. The semicolon is not mandatory at
the beginning or end of the header. The taxonomy information may
contain any printable character except semicolons. If an OTU table in
the biom version 2.1 HDF5 file format is required, the biom utility
may be used as described at
.URL https://biom-format.org/documentation/biom_conversion.html "(link)"
<https://biom-format.org/documentation/biom_conversion.html>.
.TAG centroids
.TP
.BI \-\-centroids \0filename
Output cluster centroid sequences to \fIfilename\fR, in fasta
format. The centroid is the sequence that seeded the cluster (i.e. the
first sequence of the cluster).
.TAG clusterout_id
.TP
.BI \-\-clusterout_id
Add cluster identifier information to the output files
when using the \-\-centroids, \-\-consout and \-\-profile options.
.TAG clusterout_sort
.TP
.BI \-\-clusterout_sort
Sort some output files by decreasing abundance instead of input
order. It applies to the \-\-consout, \-\-msaout, \-\-profile,
\-\-centroids, and \-\-uc options. For \-\-uc, the sorting applies
only to the centroid information part (the C lines).
.TAG cluster_fast
.TP
.BI \-\-cluster_fast \0filename
Clusterize the fasta sequences in \fIfilename\fR, after automatically
sorting them by decreasing sequence length.
.TAG cluster_size
.TP
.BI \-\-cluster_size \0filename
Clusterize the fasta sequences in \fIfilename\fR, after automatically
sorting them by decreasing sequence abundance.
.TAG cluster_smallmem
.TP
.BI \-\-cluster_smallmem \0filename
Clusterize the fasta sequences in \fIfilename\fR without automatically
modifying their order beforehand. Sequences are expected to be sorted
by decreasing sequence length, unless \-\-usersort is used.
.TAG cluster_unoise
.TP
.BI \-\-cluster_unoise \0filename
Perform denoising of the fasta sequences in \fIfilename\fR according
to the UNOISE version 3 algorithm by Robert Edgar, but without the
\fIde novo\fR chimera removal step, which may be performed afterwards
with \-\-uchime3_denovo. The options \-\-minsize (default 8) and
\-\-unoise_alpha (default 2.0) may be specified. In this
algorithm, clustering of sequences depends on both the sequence
distance and the abundance ratio. The abundance ratio (skew) is the
abundance of a new sequence divided by the abundance of the centroid
sequence. For the sequences to be clustered together, this skew must
not be larger than beta, which is calculated as 2 raised to the power
of (\-1 \- alpha * d), where d is the sequence distance. The sequence
distance used is the number of mismatches in the alignment, ignoring
gaps. This means that the abundance must be exponentially lower as the
distance increases from the centroid for a new sequence to be included
in the cluster. Nearer sequences with higher abundances will form
their own new clusters.
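.IP
The skew test described above can be sketched as follows. This is a
minimal illustration in Python, not vsearch code; the function names
are hypothetical and only the formula for beta is taken from the text.
.IP
.nf
```python
def unoise_beta(distance, alpha=2.0):
    """Return beta = 2 ** (-1 - alpha * distance)."""
    return 2.0 ** (-1.0 - alpha * distance)


def may_join_cluster(seq_abundance, centroid_abundance, distance, alpha=2.0):
    """Apply the skew test: the skew must not be larger than beta."""
    skew = seq_abundance / centroid_abundance
    return skew <= unoise_beta(distance, alpha)


# With alpha = 2.0, one mismatch gives beta = 2 ** -3 = 0.125.
print(unoise_beta(1))                 # 0.125
print(may_join_cluster(10, 100, 1))   # True: skew 0.1 <= 0.125
print(may_join_cluster(20, 100, 1))   # False: skew 0.2 > 0.125
```
.fi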
.TAG clusters
.TP
.BI \-\-clusters \0string
Output each cluster to a separate fasta file using the prefix
\fIstring\fR and a ticker (0, 1, 2, etc.) to construct the path and
filenames.
.TAG consout
.TP
.BI \-\-consout \0filename
Output cluster consensus sequences to \fIfilename\fR. For each
cluster, a center-star multiple sequence alignment is computed with
the centroid as the center, using a fast algorithm (not accurate when
using low pairwise identity thresholds). A consensus sequence is
constructed by taking the majority symbol (nucleotide or gap) from
each column of the alignment. Columns containing a majority of gaps
are skipped, except for terminal gaps. If the \-\-sizein option is
specified, sequence abundances will be taken into account.
.TAG cons_truncate
.TP
.B \-\-cons_truncate
This option is ignored. A warning is issued.
.\" .TP
.\" .B \-\-cons_truncate
.\" when using the \-\-consout option to build consensus sequences,
.\" do not ignore terminal gaps. That option skips terminal columns
.\" if they contain a majority of gaps, yielding shorter consensus
.\" sequences than when using \-\-consout alone.
.TAG id
.TP
.BI \-\-id \0real
Do not add the target to the cluster if the pairwise identity with the
centroid is lower than \fIreal\fR (value ranging from 0.0 to 1.0
included). The pairwise identity is defined as (matching columns) /
(alignment length - terminal gaps). That definition can be modified
by \-\-iddef.
.TAG iddef
.TP
.BI \-\-iddef\~ "0|1|2|3|4"
Change the pairwise identity definition used in \-\-id. Values
accepted are:
.RS
.RS
.nr step 0 1
.IP \n[step]. 4
CD-HIT definition: (matching columns) / (shortest sequence length).
.IP \n+[step].
edit distance: (matching columns) / (alignment length).
.IP \n+[step].
edit distance excluding terminal gaps (same as \-\-id).
.IP \n+[step].
Marine Biological Lab definition counting each gap opening (internal
or terminal) as a single mismatch, whether or not the gap was
extended: 1.0 - [(mismatches + gap openings)/(longest sequence
length)]
.IP \n+[step].
BLAST definition, equivalent to \-\-iddef 1 in a context of global
pairwise alignment.
.RE
.RE
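.IP
As an illustration, the five definitions can be written as one
function of the alignment counts. This is a sketch in Python, not
vsearch code; the function and parameter names are hypothetical, and
the example counts are invented for the purpose of the demonstration.
.IP
.nf
```python
def pairwise_identity(iddef, matches, mismatches, gap_openings,
                      alignment_length, terminal_gaps,
                      shortest_length, longest_length):
    """Return the pairwise identity for a given iddef value (0 to 4)."""
    if iddef == 0:        # CD-HIT definition
        return matches / shortest_length
    if iddef in (1, 4):   # edit distance; BLAST (4) is equivalent here
        return matches / alignment_length
    if iddef == 2:        # edit distance excluding terminal gaps
        return matches / (alignment_length - terminal_gaps)
    if iddef == 3:        # Marine Biological Lab definition
        return 1.0 - (mismatches + gap_openings) / longest_length
    raise ValueError("iddef must be 0, 1, 2, 3 or 4")


# Invented example: 100 alignment columns, 90 matches, 5 mismatches,
# 5 gap columns (4 of them terminal) opened by 2 gap openings.
counts = dict(matches=90, mismatches=5, gap_openings=2,
              alignment_length=100, terminal_gaps=4,
              shortest_length=95, longest_length=100)
for iddef in range(5):
    print(iddef, round(pairwise_identity(iddef, **counts), 4))
```
.fi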
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG minsize
.TP
.BI \-\-minsize\~ "positive integer"
Specify the minimum abundance of sequences for denoising using
\-\-cluster_unoise. The default is 8.
.TAG msaout
.TP
.BI \-\-msaout \0filename
Output a multiple sequence alignment and a consensus sequence for each
cluster to \fIfilename\fR, in fasta format. Be warned that vsearch
computes center-star multiple sequence alignments using a fast method
whose accuracy can decrease significantly when using low pairwise
identity thresholds. The consensus sequence is constructed by taking
the majority symbol (nucleotide or gap) from each column of the
alignment. Columns containing a majority of gaps are skipped, except
for terminal gaps. If the \-\-sizein option is specified, sequence
abundances will be taken into account when computing the consensus.
.TAG mothur_shared_out
.TP
.BI \-\-mothur_shared_out \0filename
Output an OTU table in the mothur 'shared' tab-separated plain text
format as described at
.URL https://www.mothur.org/wiki/Shared_file (link)
<https://www.mothur.org/wiki/Shared_file>. The
format describes how a matrix containing the abundances of the OTUs in
the different samples is stored. The first line will start with the
strings 'label', 'group' and 'numOtus' and is followed by a list of
all OTU identifiers. The following lines, one for each sample, start
with the string 'vsearch' followed by the sample identifier, the total
number of OTUs, and a list of abundances for each OTU in that sample,
in the order given on the first line. The OTU and sample identifiers
are extracted from the FASTA headers of the sequences. The OTUs are
represented by the cluster centroids. See the \-\-biomout option for
further details.
.TAG otutabout
.TP
.BI \-\-otutabout \0filename
Output an OTU table in the classic tab-separated plain text format as
a matrix containing the abundances of the OTUs in the different
samples. The first line will start with the string '#OTU ID' and is
followed by a tab-separated list of all sample identifiers. The
following lines, one for each OTU, start with the OTU identifier and
are followed by a tab-separated list of abundances for that OTU in each
sample, in the order given on the first line. The OTU and sample
identifiers are extracted from the FASTA headers of the sequences (see
the \-\-sample option). The OTUs are represented by the cluster
centroids. An extra column is added to the right of the table if
taxonomy information is available for at least one of the OTUs. This
column will be labelled 'taxonomy' and each row will then contain the
taxonomy information extracted for that OTU. See the \-\-biomout
option for further details.
.TAG profile
.TP
.BI \-\-profile \0filename
Output a sequence profile to a text file with the frequency of each
nucleotide in each position in the multiple alignment for each
cluster. There is a FASTA-like header line for each cluster, followed
by the profile information in a tab-separated format. The eight
columns are: position (0-based), consensus nucleotide, number of As,
number of Cs, number of Gs, number of Ts or Us, number of gap symbols,
and finally the total number of ambiguous nucleotide symbols (B, D, H,
K, M, N, R, S, Y, V or W). All numbers are integers. If the \-\-sizein
option is specified, sequence abundances will be taken into account.
.TAG qmask
.TP
.BI \-\-qmask\~ "none|dust|soft"
Mask regions in sequences using the
\fIdust\fR or the \fIsoft\fR methods, or do not mask
(\fInone\fR). Warning, when using \fIsoft\fR masking, clustering
becomes case sensitive. The default is to mask using \fIdust\fR.
.TAG qsegout
.TP
.BI \-\-qsegout \0filename
Write the aligned part of each query sequence to \fIfilename\fR in
FASTA format.
.TAG relabel
.TP
.BI \-\-relabel \0string
Relabel sequence identifiers in the output files produced by the
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Relabel sequence identifiers in the output files produced by the
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG relabel_self
.TP
.B \-\-relabel_self
Relabel sequence identifiers in the output files produced by the
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Relabel sequence identifiers in the output files produced by the
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG sizein
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
headers).
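.IP
A minimal sketch of how such an annotation could be read, in Python
and for illustration only. It assumes the header is given without its
leading ">" and that a missing annotation counts as an abundance of 1;
the names used are hypothetical.
.IP
.nf
```python
import re

# "size=" must follow the start of the header or a semicolon.
SIZE_PATTERN = re.compile("(?:^|;)size=([0-9]+)")


def abundance(header, default=1):
    """Return the abundance annotated in a header, or the default."""
    match = SIZE_PATTERN.search(header)
    return int(match.group(1)) if match else default


print(abundance("seq1;size=42;"))   # 42
print(abundance("size=7;seq2"))     # 7
print(abundance("seq3"))            # 1
```
.fi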
.TAG sizeorder
.TP
.B \-\-sizeorder
When an amplicon is close to two or more centroids, all within the
distance specified with the \-\-id option, resolve the ambiguity by
clustering it with the centroid having the highest abundance, not
necessarily the closest one. The option only has effect when the value
specified with \-\-maxaccepts is higher than one. The \-\-sizeorder
option turns on what is sometimes referred to as abundance-based
greedy clustering (AGC), in contrast to the default distance-based
greedy clustering (DGC).
.TAG sizeout
.TP
.B \-\-sizeout
Add abundance annotations to the output fasta files (add the
pattern ';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
specified, abundance annotations are reported to output files, and
each cluster centroid receives a new abundance value corresponding to
the total abundance of the amplicons included in the cluster
(\-\-centroids option). If \-\-sizein is not specified, input
abundances are set to 1 for amplicons, and to the number of amplicons
per cluster for centroids.
.TAG strand
.TP
.BI \-\-strand\~ "plus|both"
When comparing sequences with the cluster seed, check the \fIplus\fR
strand only (default) or check \fIboth\fR strands.
.TAG tsegout
.TP
.BI \-\-tsegout \0filename
Write the aligned part of each target sequence to \fIfilename\fR in
FASTA format.
.TAG uc
.TP
.BI \-\-uc \0filename
Output clustering results in \fIfilename\fR using a tab-separated
uclust-like format with 10 columns and 3 different types of entries (S,
H or C). Each fasta sequence in the input file can be either a cluster
centroid (S) or a hit (H) assigned to a cluster. Cluster records (C)
summarize information (size, centroid label) for each cluster. In the
context of clustering, the option \-\-uc_allhits has no effect on the
\-\-uc output. Column content varies with the type of entry (S, H or
C):
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
Record type: S, H, or C.
.IP \n+[step].
Cluster number (zero-based).
.IP \n+[step].
Centroid length (S), query length (H), or cluster size (C).
.IP \n+[step].
Percentage of similarity with the centroid sequence (H), or set to '*'
(S, C).
.IP \n+[step].
Match orientation + or - (H), or set to '*' (S, C).
.IP \n+[step].
Not used, always set to '*' (S, C) or to zero (H).
.IP \n+[step].
Not used, always set to '*' (S, C) or to zero (H).
.IP \n+[step].
Set to '*' (S, C) or, for H, compact representation of the pairwise
alignment using the CIGAR format (Compact Idiosyncratic Gapped
Alignment Report): M (match/mismatch), D (deletion) and I
(insertion). The equal sign '=' indicates that the query is identical
to the centroid sequence.
.IP \n+[step].
Label of the query sequence (H), or of the centroid sequence (S, C).
.IP \n+[step].
Label of the centroid sequence (H), or set to '*' (S, C).
.RE
.RE
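.IP
The ten columns listed above can be read with a few lines of Python.
This sketch is illustrative only: the field names are hypothetical,
and the example record is invented rather than taken from real vsearch
output.
.IP
.nf
```python
from collections import namedtuple

TAB = chr(9)  # the tab character

UcRecord = namedtuple("UcRecord", [
    "record_type", "cluster", "length_or_size", "identity", "strand",
    "col6", "col7", "alignment", "query_label", "centroid_label"])


def parse_uc_line(line):
    """Split one uc line into its 10 tab-separated columns."""
    fields = line.rstrip().split(TAB)
    if len(fields) != 10:
        raise ValueError("expected 10 columns, got %d" % len(fields))
    return UcRecord(*fields)


# Invented H record: a query assigned to the centroid of cluster 0.
example = TAB.join(["H", "0", "250", "98.4", "+", "0", "0",
                    "250M", "seq2;size=3", "seq1;size=10"])
record = parse_uc_line(example)
print(record.record_type, record.cluster, record.centroid_label)
```
.fi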
.TAG unoise_alpha
.TP
.BI \-\-unoise_alpha\~ real
Specify the alpha parameter to the \-\-cluster_unoise command. The
default is 2.0.
.TAG usersort
.TP
.B \-\-usersort
When using \-\-cluster_smallmem, allow any sequence input order, not
just a decreasing length ordering.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.TP
.B ...
Most searching options as well as score filtering, gap penalties and
masking also apply to clustering (see the Searching section for
definitions): \-\-alnout, \-\-blast6out, \-\-fastapairs, \-\-matched,
\-\-notmatched, \-\-maxaccepts, \-\-maxrejects, \-\-samout, \-\-userout,
\-\-userfields.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG dereplication-and-rereplication-options
Dereplication and rereplication options:
.PP
.RS
VSEARCH can dereplicate sequences with the commands
\-\-derep_fulllength, \-\-derep_id, \-\-derep_smallmem,
\-\-derep_prefix and \-\-fastx_uniques. The \-\-derep_fulllength
command is deprecated and is replaced by the new \-\-fastx_uniques
command that can also handle FASTQ files in addition to FASTA
files. The \-\-derep_fulllength, \-\-derep_smallmem, and
\-\-fastx_uniques commands require strictly identical sequences of
the same length, but ignore upper/lower case and treat T and U as
identical symbols. The \-\-derep_id command requires both identical
sequences and identical headers/labels. The \-\-derep_prefix command
will group sequences with a common prefix and does not require them to
be equally long. The \-\-derep_smallmem command uses much less memory
when dereplicating than the other commands, but may be a bit slower
and cannot read its input from a pipe. It takes both FASTA and
FASTQ files as input but only writes FASTA output to the file
specified with the \-\-fastaout option. The \-\-fastx_uniques command
can write FASTQ output (specified with \-\-fastqout) or FASTA output
(specified with \-\-fastaout) as well as a special tab-separated
column text format (with \-\-tabbedout). The other commands can write
FASTA output to the file specified with the \-\-output option. All
dereplication commands, except \-\-derep_smallmem, can write output to
a special UCLUST-like file specified with the \-\-uc option. The
\-\-rereplicate command can duplicate sequences in the input file
according to the abundance of each input sequence. Other valid options
are \-\-fastq_ascii, \-\-fastq_asciiout, \-\-fastq_qmax,
\-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout,
\-\-fastq_qout_max, \-\-lengthout, \-\-maxuniquesize,
\-\-minuniquesize, \-\-relabel, \-\-relabel_keep, \-\-relabel_md5,
\-\-relabel_self, \-\-relabel_sha1, \-\-sizein, \-\-sizeout,
\-\-strand, \-\-topn, \-\-xlength, and \-\-xsize.
.PP
.TAG derep_fulllength
.TP 9
.BI \-\-derep_fulllength \0filename
Merge strictly identical sequences contained in
\fIfilename\fR. Identical sequences are defined as having the same
length and the same string of nucleotides (case insensitive, T and U
are considered the same). See the options \-\-sizein and \-\-sizeout
to take into account and compute abundance values. This command does
not support multithreading.
.TAG derep_id
.TP
.BI \-\-derep_id \0filename
Merge strictly identical sequences contained in \fIfilename\fR, as
with the \-\-derep_fulllength command, but the sequence labels
(identifiers) on the header line need to be identical too.
.TAG derep_smallmem
.TP
.BI \-\-derep_smallmem \0filename
Merge strictly identical sequences contained in \fIfilename\fR, as
with the \-\-derep_fulllength command, but using much less memory. The
output is written to a FASTA file specified with the \-\-fastaout
option. The output is written in the order that the sequences first
appear in the input, and not in descending abundance order as with the
other dereplication commands. It can read, but not write, FASTQ
files. This command cannot read from a pipe; the input must be a
regular file, as it is read twice. Dereplication is performed with a
128-bit hash function and it is not verified that grouped sequences
are identical; however, the probability that two different sequences
are grouped in a dataset of one billion unique sequences is
approximately 1e-21. The memory footprint is approximately 24 bytes
times the number of unique sequences. Multithreading and the options
\-\-topn, \-\-uc, and \-\-tabbedout are not supported.
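.IP
The order of magnitude of these two figures can be checked with a
short calculation. The Python sketch below is illustrative only; it
assumes an ideal, uniformly distributed hash and uses the usual
birthday approximation (number of pairs divided by the number of hash
values). The function names are hypothetical.
.IP
.nf
```python
def collision_probability(n_unique, hash_bits=128):
    """Approximate the chance that any two of n_unique sequences collide."""
    pairs = n_unique * (n_unique - 1) / 2
    return pairs / 2 ** hash_bits


def memory_bytes(n_unique, bytes_per_sequence=24):
    """Rough memory footprint, using the 24 bytes quoted above."""
    return n_unique * bytes_per_sequence


n = 10 ** 9
print(collision_probability(n))   # about 1.5e-21
print(memory_bytes(n) / 1e9)      # 24.0 (gigabytes)
```
.fi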
.TAG derep_prefix
.TP
.BI \-\-derep_prefix \0filename
Merge sequences with identical prefixes contained in \fIfilename\fR.
A short sequence identical to an initial segment (prefix) of another
sequence is considered a replicate of the longer sequence. If a
sequence is identical to the prefix of two or more longer sequences,
it is clustered with the shortest of them. If they are equally long,
it is clustered with the most abundant. Remaining ties are resolved
using sequence headers and sequence input order. Sequence comparisons
are case insensitive, and T and U are considered identical. This
command does not support multithreading.
.TAG fastaout
.TP
.BI \-\-fastaout \0filename
Write the dereplicated sequences to \fIfilename\fR, in fasta format
and sorted by decreasing abundance. Identical sequences receive the
header of the first sequence of their group. If \-\-sizeout is used,
the number of occurrences (i.e. abundance) of each sequence is
indicated at the end of their fasta header using the
pattern ';size=\fIinteger\fR;'. This option is only valid for
\-\-fastx_uniques and \-\-derep_smallmem.
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
Write the dereplicated sequences to \fIfilename\fR, in fastq format
and sorted by decreasing abundance. Identical sequences receive the
header of the first sequence of their group. If \-\-sizeout is used,
the number of occurrences (i.e. abundance) of each sequence is
indicated at the end of their fastq header using the
pattern ';size=\fIinteger\fR;'. This option is only valid for
\-\-fastx_uniques.
.TAG fastq_ascii
.TP
.BI \-\-fastq_ascii\~ "positive integer"
Define the ASCII character number used as the basis for the FASTQ
quality score. The default is 33, which is used by the Sanger /
Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64). Only 33
and 64 are valid arguments.
.TAG fastq_asciiout
.TP
.BI \-\-fastq_asciiout\~ "positive integer"
When using \-\-fastq_convert, \-\-sff_convert or \-\-fasta2fastq,
define the ASCII character number used as the basis for the FASTQ
quality score when writing FASTQ output files. The default is 33. Only
33 and 64 are valid arguments.
.TAG fastq_qmax
.TP
.BI \-\-fastq_qmax\~ "positive integer"
Specify the maximum quality score accepted when reading FASTQ
files. The default is 41, which is usual for recent Sanger/Illumina
1.8+ files.
.TAG fastq_qmaxout
.TP
.BI \-\-fastq_qmaxout\~ "positive integer"
Specify the maximum quality score used when writing
FASTQ files. The default
is 41, which is usual for recent Sanger/Illumina 1.8+ files. Older
formats may use a maximum quality score of 40.
.TAG fastq_qmin
.TP
.BI \-\-fastq_qmin\~ "positive integer"
Specify the minimum quality score accepted for FASTQ files. The
default is 0, which is usual for recent Sanger/Illumina 1.8+
files. Older formats may use scores between -5 and 2.
.TAG fastq_qminout
.TP
.BI \-\-fastq_qminout\~ "positive integer"
Specify the minimum quality score used when writing FASTQ files. The
default is 0, which is usual for Sanger/Illumina 1.8+ files. Older
versions of the format may use scores between -5 and 2.
.TAG fastq_qout_max
.TP
.BI \-\-fastq_qout_max
For \-\-fastx_uniques, indicate that the new quality scores computed
when dereplicating FASTQ files should be equal to the maximum (best)
of the input quality scores for each position (corresponding to the
lowest error probability). The default is to output a quality score
corresponding to the average of the error probabilities for each
position.
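.IP
The difference between the two modes can be sketched with the standard
Phred relation Q = -10 * log10(p). The Python code below is
illustrative only: the function names are hypothetical, and any
rounding or capping applied by vsearch when writing the scores is not
modelled.
.IP
.nf
```python
import math


def error_probability(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def merged_quality(scores, use_max=False):
    """Combine the quality scores observed at one position."""
    if use_max:
        return float(max(scores))
    mean_p = sum(error_probability(q) for q in scores) / len(scores)
    return -10 * math.log10(mean_p)


print(merged_quality([40, 20], use_max=True))   # 40.0
print(round(merged_quality([40, 20]), 2))       # 22.97
```
.fi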
.TAG fastx_uniques
.TP
.BI \-\-fastx_uniques \0filename
Merge strictly identical sequences contained in FASTA or FASTQ file
\fIfilename\fR. Identical sequences are defined as having the same
length and the same string of nucleotides (case insensitive, T and U
are considered the same). See the options \-\-sizein and \-\-sizeout
to take into account and compute abundance values. This command does
not support multithreading. By default, the quality scores in FASTQ
output files will correspond to the average error probability of the
nucleotides in each position. If the \-\-fastq_qout_max option is
given, the quality score will be the highest (best) quality score
observed in each position.
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA and
FASTQ format by adding a ";length=\fIinteger\fR" attribute in the
header.
.TAG maxuniquesize
.TP
.BI \-\-maxuniquesize\~ "positive integer"
Discard sequences with a post-dereplication abundance value greater
than \fIinteger\fR.
.TAG minuniquesize
.TP
.BI \-\-minuniquesize\~ "positive integer"
Discard sequences with a post-dereplication abundance value smaller
than \fIinteger\fR.
.TAG output
.TP
.BI \-\-output \0filename
Write the dereplicated sequences to \fIfilename\fR, in fasta format
and sorted by decreasing abundance. Identical sequences receive the
header of the first sequence of their group. If \-\-sizeout is used,
the number of occurrences (i.e. abundance) of each sequence is
indicated at the end of their fasta header using the
pattern ';size=\fIinteger\fR;'. This option is not allowed for
\-\-fastx_uniques or \-\-derep_smallmem.
.TP
.TAG relabel
.BI \-\-relabel \0string
Please see the description of the same option under Chimera detection
for details.
.TP
.TAG relabel_keep
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TP
.TAG relabel_md5
.B \-\-relabel_md5
Please see the description of the same option under Chimera detection
for details.
.TP
.TAG relabel_self
.B \-\-relabel_self
Please see the description of the same option under Chimera detection
for details.
.TP
.TAG relabel_sha1
.B \-\-relabel_sha1
Please see the description of the same option under Chimera detection
for details.
.TP
.TAG rereplicate
.BI \-\-rereplicate \0filename
Duplicate each sequence the number of times indicated by the abundance
of each sequence in the specified file (option \-\-sizein is always
implied). The sequence labels are identical for the same sequence,
unless \-\-relabel, \-\-relabel_self, \-\-relabel_sha1 or
\-\-relabel_md5 is used to create unique labels. Output is written to
the file specified with the \-\-output option, in FASTA format. The
output file does not contain abundance information unless \-\-sizeout
is specified, in which case an abundance of 1 is used.
.TAG sizein
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
headers). That option is active by default when rereplicating.
.TAG sizeout
.TP
.B \-\-sizeout
Add abundance annotations to the output fasta file (add the
pattern ';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
specified, each unique sequence receives a new abundance value
corresponding to its total abundance (sum of the abundances of its
occurrences). If \-\-sizein is not specified, input abundances are set
to 1, and each unique sequence receives a new abundance value
corresponding to its number of occurrences in the input file.
.TAG strand
.TP
.BI \-\-strand\~ "plus|both"
When searching for strictly identical sequences, check the \fIplus\fR
strand only (default) or check \fIboth\fR strands.
.TAG tabbedout
.TP
.BI \-\-tabbedout \0filename
Output clustering info to the specified tab-separated text file with 6
columns and a row for each input sequence. Column 1 contains the
original label/header of the sequence. Column 2 contains the label of
the output sequence which is equal to the label/header of the first
sequence in each cluster, but potentially relabelled. Column 3
contains the cluster number, starting from 0. Column 4 contains the
sequence number within each cluster, starting at 0. Column 5 contains
the number of sequences in the cluster. Column 6 contains the original
label/header of the first sequence in the cluster before any potential
relabelling. This option is only valid for the \-\-fastx_uniques
command.
.TAG topn
.TP
.BI \-\-topn\~ "positive integer"
Output only the top \fIinteger\fR sequences (i.e. the most abundant).
.TAG uc
.TP
.BI \-\-uc \0filename
Output full-length or prefix-dereplication results in \fIfilename\fR
using a tab-separated uclust-like format with 10 columns and 3
different types of entries (S, H or C). Each fasta sequence in the
input file can be either a cluster centroid (S) or a hit (H) assigned
to a cluster. Cluster records (C) summarize information (size,
centroid label) for each cluster. In the context of dereplication, the
option \-\-uc_allhits has no effect on the \-\-uc output. Column
content varies with the type of entry (S, H or C):
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
Record type: S, H, or C.
.IP \n+[step].
Cluster number (zero-based).
.IP \n+[step].
Sequence length (S, H), or cluster size (C).
.IP \n+[step].
Percentage of similarity with the centroid sequence (H), or set to '*'
(S, C).
.IP \n+[step].
Match orientation + or - (H), or set to '*' (S, C).
.IP \n+[step].
Not used, always set to '*' (S, C) or 0 (H).
.IP \n+[step].
Not used, always set to '*' (S, C) or 0 (H).
.IP \n+[step].
Not used, always set to '*'.
.IP \n+[step].
Label of the query sequence (H), or of the centroid sequence (S, C).
.IP \n+[step].
Label of the centroid sequence (H), or set to '*' (S, C).
.RE
.RE
.RE
.PP
.RS
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output file.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG extraction-options
Extraction options:
.RS
.PP
Sequences with headers matching certain criteria can be extracted from
FASTA and FASTQ files using the \-\-fastx_getseq, \-\-fastx_getseqs
and \-\-fastx_getsubseq commands.
.PP
The \-\-fastx_getseq command requires the header to match a label
specified with the \-\-label option.  If the \-\-label_substr_match
option is given, the label may be a substring located anywhere in the
header, otherwise the entire header must match the label. These
matches are not case-sensitive. The headers in the input file are
truncated at the first space or tab character unless the
\-\-notrunclabels option is given.  The matching sequences will be
written to the files specified with the \-\-fastaout and \-\-fastqout
options, in FASTA and FASTQ format, respectively. Sequences that do
not match are written to the files specified with the \-\-notmatched
and \-\-notmatchedfq options, respectively.
.PP
The \-\-fastx_getsubseq command is similar to the \-\-fastx_getseq
command, but will extract a subsequence of the matching sequences. The
start position is specified with the \-\-subseq_start option and the
end position is specified with the \-\-subseq_end option. The
positions are 1-based, meaning that the first symbol of the sequence
is at position 1. If the start or end position option is not
specified, the default is to start at the first position and end at
the last position in the sequence.
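.PP
The 1-based coordinates can be illustrated with a short Python sketch.
It is not vsearch code: the function name is hypothetical, and the
sketch assumes that the end position is inclusive.
.PP
.nf
```python
def get_subseq(sequence, start=None, end=None):
    """Return positions start..end of sequence, 1-based and inclusive."""
    if start is None:
        start = 1
    if end is None:
        end = len(sequence)
    return sequence[start - 1:end]


print(get_subseq("ACGTACGT", start=3, end=6))   # GTAC
print(get_subseq("ACGTACGT", end=4))            # ACGT
print(get_subseq("ACGTACGT", start=5))          # ACGT
```
.fi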
.PP
The \-\-fastx_getseqs command is similar to the \-\-fastx_getseq
command but allows more flexibility in specifying the label(s) to be
matched. A single label may be specified using the \-\-label option as
described above. Alternatively, a file containing a list of labels to
be matched may be specified with the \-\-labels option. The file must
be a plain text file with one label on each line. The \-\-label_word
and \-\-label_words options may be used to specify either a single
word or a file containing a list of words, respectively, to be
matched. Words are defined as character sequences delimited either by
a character that is not alpha-numeric (A-Z, a-z, or 0-9) or by the
beginning or end of the header. Word matching is case-sensitive. The
\-\-label_field option will limit the matching of words to a certain
field in the header.
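.PP
The word definition given above can be illustrated with a short Python
sketch. It is not vsearch code: the function names are hypothetical,
and the restriction to a field (\-\-label_field) is not modelled.
.PP
.nf
```python
import re


def header_words(header):
    """Split a header into words at every non-alphanumeric character."""
    return [w for w in re.split("[^A-Za-z0-9]+", header) if w]


def matches_word(header, word):
    """Case-sensitive test: is word one of the words in header?"""
    return word in header_words(header)


print(header_words("seq1;abc=123"))          # ['seq1', 'abc', '123']
print(matches_word("seq1;abc=123", "abc"))   # True
print(matches_word("seq1;abc=123", "ABC"))   # False
```
.fi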
.PP
.TAG fastaout
.TP 9
.BI \-\-fastaout \0filename
Write the extracted sequences in FASTA format to the file with the
given name.
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
Write the extracted sequences in FASTQ format to the file with the
given name. This option is illegal if the input is in FASTA format.
.TAG fastx_getseq
.TP
.BI \-\-fastx_getseq \0filename
Extract sequences from the given FASTA or FASTQ file. Specify a label
to match using the \-\-label option. Output files are specified with
the \-\-fastaout, \-\-fastqout, \-\-notmatched and \-\-notmatchedfq
options.
.TAG fastx_getseqs
.TP
.BI \-\-fastx_getseqs \0filename
Extract sequences from the given FASTA or FASTQ file. Specify the
label or labels to match using one of the following options: \-\-label,
\-\-labels, \-\-label_word, or \-\-label_words. Output
files are specified with the \-\-fastaout, \-\-fastqout,
\-\-notmatched and \-\-notmatchedfq options.
.TAG fastx_getsubseq
.TP
.BI \-\-fastx_getsubseq \0filename
Extract a certain part of some of the sequences in the given FASTA or
FASTQ file. Specify labels to match using the \-\-label
option. Specify the subsequence range to be extracted with the
\-\-subseq_start and \-\-subseq_end options. Output files are
specified with the \-\-fastaout, \-\-fastqout, \-\-notmatched and
\-\-notmatchedfq options.
.TAG label
.TP
.BI \-\-label \0string
Specify the label to match in the sequence header. Unless the
\-\-label_substr_match option is given, the label must match the
entire header. The comparison is not case-sensitive.
.TAG label_field
.TP
.BI \-\-label_field \0string
Specify a field name to be used when matching using the \-\-label_word
or \-\-label_words option. The field name is a string like "abc" that
must precede the word to be matched with an equals sign (=) in
between. The field must be delimited by semicolons or the beginning or
end of the header. The following header will match the label 123 in
the field abc: "seq1;abc=123".
.TAG label_substr_match
.TP
.BI \-\-label_substr_match
The labels specified with the \-\-label or the \-\-labels option may
match anywhere in the header if this option is given. Otherwise a
label needs to match the entire header.
.TAG label_word
.TP
.BI \-\-label_word \0string
Specify a word to match in the sequence header. Words are defined as
strings delimited by either the start or end of the header or by any
symbol that is not a letter (A-Z, a-z) or digit (0-9). The comparison is
case-sensitive.
.TAG label_words
.TP
.BI \-\-label_words \0filename
Specify a file containing words to be matched against the sequence
headers. The plain text file must contain one word on each line.
Words are defined as strings delimited by either the start or end of
the header or by any symbol that is not a letter (A-Z, a-z) or digit
(0-9). The comparison is case-sensitive.
.TAG labels
.TP
.BI \-\-labels \0filename
Specify a file containing labels to be matched against the sequence
headers. The plain text file must contain one label on each
line. Unless the \-\-label_substr_match option is given, a label must
match the entire header. The comparison is not case-sensitive.
.TAG notmatched
.TP
.BI \-\-notmatched \0filename
Write the sequences that were not extracted to the file with the given
name, in FASTA format.
.TAG notmatchedfq
.TP
.BI \-\-notmatchedfq \0filename
Write the sequences that were not extracted to the file with the given
name, in FASTQ format. This option is illegal if the input is in FASTA
format.
.TAG subseq_end
.TP
.BI \-\-subseq_end\~ "positive integer"
Specify the end position in the sequences when extracting
subsequences using the \-\-fastx_getsubseq command. Positions are
1-based, so the sequences start at position 1. The default is to end
at the end of the sequence if this option is not specified.
.TAG subseq_start
.TP
.BI \-\-subseq_start\~ "positive integer"
Specify the starting position in the sequences when extracting
subsequences using the \-\-fastx_getsubseq command. Positions are
1-based, so the sequences start at position 1. The default is to start
at the beginning of the sequence (position 1), if this option is not
specified.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG fasta-fastq-file-processing-options
FASTA/FASTQ/SFF file processing options:
.RS
.PP
Analyse, trim, filter, convert, merge, join or reverse complement
sequences in FASTA, FASTQ or SFF files. The \-\-fastq_chars command
can be used to analyse FASTQ files to identify the quality encoding
and the range of quality score values used. To convert between
different FASTQ file variants, use the \-\-fastq_convert
command. Statistical analysis of the quality and length of the
sequences in a FASTQ file may be performed with the \-\-fastq_stats,
\-\-fastq_eestats, and \-\-fastq_eestats2 commands.  Sequences may be
trimmed, filtered and converted by the \-\-fastq_filter or
\-\-fastx_filter commands.  The \-\-sff_convert command can be used to
convert SFF files to FASTQ, while the \-\-fasta2fastq command will
convert a FASTA file to a FASTQ file with fake quality scores.
Paired-end reads can be merged using the \-\-fastq_mergepairs command
or joined with the \-\-fastq_join command.  The \-\-fastx_revcomp
command will reverse-complement sequences.
.PP
.TAG eeout
.TP 9
.B \-\-eeout
When using \-\-fastq_filter, \-\-fastx_filter or \-\-fastq_mergepairs,
include the number of expected errors (ee) in the sequence header of
FASTQ and FASTA output files. This option is a synonym of the
\-\-fastq_eeout option. Use the \-\-xee option to remove this
information from headers.
.TAG eetabbedout
.TP
.BI \-\-eetabbedout \0filename
When specified with the \-\-fastq_mergepairs command, write statistics
with expected errors of each merged read to the given file. The file
is a tab-separated file with four columns: the number of expected
errors in the forward read, the number of expected errors in the
reverse read, the number of observed errors in the forward read, and
the number of observed errors in the reverse read. The observed number
of errors is the number of differences in the overlap region of the
merged sequence relative to each of the reads in the pair.
.TAG fasta2fastq
.TP
.BI \-\-fasta2fastq \0filename
Add a fake nucleotide quality score to the sequences in the given
FASTA file and write them to the FASTQ file specified with the
\-\-fastqout option. The quality score may be adjusted using the
\-\-fastq_qmaxout option (default 41). The \-\-fastq_asciiout option
may be used to adjust the FASTQ output quality ASCII base character
(default 33).
.TAG fastaout
.TP
.BI \-\-fastaout \0filename
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
write to the given FASTA-formatted file the sequences passing the
filter, or the merged sequences.
.TAG fastaout_rev
.TP
.BI \-\-fastaout_rev \0filename
When using \-\-fastq_filter or \-\-fastx_filter,
write to the given FASTA-formatted file the reverse reads passing the
filter.
.TAG fastaout_notmerged_fwd
.TP
.BI \-\-fastaout_notmerged_fwd \0filename
When using \-\-fastq_mergepairs, write forward reads not merged to the
specified FASTA file.
.TAG fastaout_notmerged_rev
.TP
.BI \-\-fastaout_notmerged_rev \0filename
When using \-\-fastq_mergepairs, write reverse reads not merged to the
specified FASTA file.
.TAG fastaout_discarded
.TP
.BI \-\-fastaout_discarded \0filename
Write sequences that do not pass the filter of the \-\-fastq_filter or
\-\-fastx_filter command to the given FASTA-formatted file.
.TAG fastaout_discarded_rev
.TP
.BI \-\-fastaout_discarded_rev \0filename
Write reverse reads that do not pass the filter of the
\-\-fastq_filter or \-\-fastx_filter command to the given
FASTA-formatted file.
.TAG fastq_allowmergestagger
.TP
.B \-\-fastq_allowmergestagger
When using \-\-fastq_mergepairs, allow merging of staggered read
pairs. Staggered pairs are pairs where the 3' end of the reverse read
has an overhang to the left of the 5' end of the forward read. This
situation can occur when a very short fragment is sequenced. The 3'
overhang of the reverse read is not included in the merged
sequence. The opposite option is the \-\-fastq_nostagger option. The
default is to discard staggered pairs.
.TAG fastq_ascii
.TP
.BI \-\-fastq_ascii\~ "positive integer"
Define the ASCII character number used as the basis for the FASTQ
quality score. The default is 33, which is used by the Sanger /
Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64). Only 33
and 64 are valid arguments.
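.IP
The role of the ASCII base can be sketched in Python as follows. This
is an illustration, not vsearch code: the function names are
hypothetical, and the error probability is assumed to follow the usual
Phred relation P = 10^(-Q/10).
.nf
```python
def phred_score(char, ascii_base=33):
    # The quality score is the character code minus the ASCII base.
    return ord(char) - ascii_base


def error_probability(q):
    # Usual Phred relation (assumed here): P = 10^(-Q/10).
    return 10 ** (-q / 10)


print(phred_score("I", 33))  # 40 with phred+33
print(phred_score("h", 64))  # 40 with phred+64
print(error_probability(40))
```
.fi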
.TAG fastq_asciiout
.TP
.BI \-\-fastq_asciiout\~ "positive integer"
When using \-\-fastq_convert, \-\-sff_convert or \-\-fasta2fastq,
define the ASCII character number used as the basis for the FASTQ
quality score when writing FASTQ output files. The default is 33. Only
33 and 64 are valid arguments.
.TAG fastq_chars
.TP
.BI \-\-fastq_chars \0filename
Summarize the composition of sequence and quality strings contained in
the input FASTQ file. For each sequence symbol, \-\-fastq_chars gives
the number of occurrences of the symbol, its relative frequency and
the length of the longest run of that symbol. For each character
present in the quality strings, \-\-fastq_chars gives the ASCII value
of the character, its relative frequency, and the number of times a
\fIk\fR-mer of that character appears at the end of quality
strings. The length of the \fIk\fR-mer can be set using \-\-fastq_tail
(4 by default). The command \-\-fastq_chars tries to automatically
detect the quality encoding (Solexa, Illumina 1.3+, Illumina 1.5+ or
Illumina 1.8+/Sanger) by analyzing the range of observed quality score
values. In case of success, \-\-fastq_chars suggests values for the
\-\-fastq_ascii (33 or 64), \-\-fastq_qmin and \-\-fastq_qmax options
to be used with the other commands that require a FASTQ input file.
.TAG fastq_convert
.TP
.BI \-\-fastq_convert \0filename
Convert between the different variants of the FASTQ file format. The
quality encoding of the input file must be specified with the
\-\-fastq_ascii option (either 33 or 64, the default is 33), and the
output quality encoding must be specified with the \-\-fastq_asciiout
option (default 33). The minimum and maximum output quality scores may
be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout
options. The output file is specified with the \-\-fastqout option.
.TAG fastq_eeout
.TP
.B \-\-fastq_eeout
When using \-\-fastq_filter, \-\-fastx_filter or \-\-fastq_mergepairs,
include the number of expected errors (ee) in the sequence header of
FASTQ and FASTA files. This option is a synonym of the \-\-eeout
option. Use the \-\-xee option to remove this information from
headers.
.TAG fastq_eestats
.TP
.BI \-\-fastq_eestats \0filename
Analyze a FASTQ file and report statistics on the distributions of
quality scores, error probabilities and expected accumulated
errors. The report, a table of 21 tab-separated columns, is written to
the file specified with the \-\-output option. The first column
corresponds to the position in the reads (Pos). The second and third
columns correspond to the number of reads (Reads) and percentage of
reads (PctRecs) that include this position. The remaining columns
include information about the distribution of quality scores in this
position (Q), error probabilities in this position (Pe), and finally
the expected number of accumulated errors from the beginning of the
reads and until the current position (EE). For each of the Q, Pe and
EE distributions, the following statistics are included: minimum value
(Min), lower quartile (Low), median (Med), mean (Mean), upper quartile
(Hi), and maximum value (Max). The quality encoding and the range of
quality values may be specified with \-\-fastq_ascii, \-\-fastq_qmin
and \-\-fastq_qmax.
.TAG fastq_eestats2
.TP
.BI \-\-fastq_eestats2 \0filename
Analyze the specified FASTQ file and report statistics on the number
of sequences that would be retained at a combination of selected
cutoffs for length truncation and maximum expected errors, that could
potentially be used as arguments to the \-\-fastq_trunclen and
\-\-fastq_maxee options to the \-\-fastq_filter command.  The result,
a table of two or more columns, is written to the file specified with
the \-\-output option. There is a line for each length truncation
cutoff. The first column on each line contains the selected truncation
length, while the following columns contain the number of sequences
and, in parentheses, the percentage of sequences that would be
retained at the selected EE levels.  The truncation length cutoffs may
be specified with the \-\-length_cutoffs option, which requires a list of
three comma-separated integers indicating the shortest cutoff, the
longest cutoff, and the increment between cutoffs. The longest cutoff
may be specified with a star (*) which indicates that the limit is
equal to the longest sequence in the input file. The default setting
is "50,*,50" meaning that truncation lengths of 50, 100, 150 and so on
up to the longest sequence length should be used.  The maximum
expected error (EE) cutoffs may be specified with the \-\-ee_cutoffs
option which requires a comma-separated list of floating point numbers
as its argument. The default setting is "0.5,1.0,2.0" that indicates
that expected error levels of 0.5, 1.0 and 2.0 should be used.
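.IP
How a \-\-length_cutoffs argument expands into a list of truncation
lengths can be sketched in Python. This is a minimal illustration
under the description above, not vsearch code: the function name
expand_length_cutoffs and the example longest length of 230 are
hypothetical.
.nf
```python
def expand_length_cutoffs(spec, longest):
    # Hypothetical helper. spec is "shortest,longest,increment";
    # a star (*) stands for the longest sequence in the input file.
    first, last, step = spec.split(",")
    stop = longest if last == "*" else int(last)
    return list(range(int(first), stop + 1, int(step)))


# Default setting, assuming the longest input sequence has 230 bases.
print(expand_length_cutoffs("50,*,50", 230))  # [50, 100, 150, 200]
```
.fi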
.TAG fastq_filter
.TP
.BI \-\-fastq_filter \0filename
Trim and/or filter sequences in the given FASTQ file. Similar to
the \-\-fastx_filter command, but works only on FASTQ files. See
\-\-fastx_filter for details.
.TAG fastq_join
.TP
.BI \-\-fastq_join\0 filename
Join paired-end sequence reads into one sequence and add a gap between
them using a padding sequence. The sequences are not merged as with
the \-\-fastq_mergepairs command, but simply joined with a gap. The
forward reads are specified as the argument to this option and the
reverse reads are specified with the \-\-reverse option. The resulting
sequences consist of the forward read, the padding sequence and the
reverse complement of the reverse read. The padding sequence is
specified with the \-\-join_padgap option and the padding quality is
specified with the \-\-join_padgapq option. The default padding
sequence string is NNNNNNNN and the default padding quality string is
IIIIIIII, corresponding to a base quality score of 40 (a very high
quality score with error probability 0.0001). The joined sequences are
output to the file(s) specified with the \-\-fastaout or \-\-fastqout
options.
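.IP
The structure of a joined sequence can be sketched in Python. This is
an illustration rather than the vsearch implementation: the function
name join_pair is hypothetical, only unambiguous nucleotides are
complemented, and the reversal of the reverse read's quality string is
an assumption made so that each quality value stays with its base.
.nf
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def join_pair(fwd_seq, fwd_qual, rev_seq, rev_qual,
              padgap="NNNNNNNN", padgapq="IIIIIIII"):
    # Forward read, then padding, then the reverse complement of
    # the reverse read.
    rev_rc = rev_seq.translate(COMPLEMENT)[::-1]
    seq = fwd_seq + padgap + rev_rc
    # Assumption: the reverse read's qualities are reversed so that
    # they follow the reverse-complemented bases.
    qual = fwd_qual + padgapq + rev_qual[::-1]
    return seq, qual


seq, qual = join_pair("ACGT", "IIII", "AAGG", "ABCD")
print(seq)   # ACGTNNNNNNNNCCTT
print(qual)  # IIIIIIIIIIIIDCBA
```
.fi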
.TAG fastq_maxdiffs
.TP
.BI \-\-fastq_maxdiffs\~ "positive integer"
When using \-\-fastq_mergepairs, specify the maximum number of
non-matching nucleotides allowed in the overlap region. This option
has a strong influence on the merging success rate. The default
value is 10.
.TAG fastq_maxdiffpct
.TP
.BI \-\-fastq_maxdiffpct\~ real
When using \-\-fastq_mergepairs, specify the maximum percentage of
non-matching nucleotides allowed in the overlap region. The default
value is 100.0%. There are other more sophisticated rules in the
merging algorithm that will discard read pairs with a high fraction of
mismatches.
.TAG fastq_maxee
.TP
.BI \-\-fastq_maxee\~ real
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
discard sequences with an expected error greater than the specified
number (value ranging from 0.0 to infinity). For a given sequence, the
expected error is the sum of error probabilities for all the positions
in the sequence. Since error probabilities can be small but never zero,
the expected error is always greater than zero, and at most equal to
the length of the sequence when all positions in the sequence have an
error probability of 1.0.

Using the expected error as the \fIlambda\fR parameter in the Poisson
distribution, it is possible to compute the probability of observing
\fIk\fR errors. For instance, a read with an expected error of 1.0
has:
.RS
.IP - 2
36.8% chance of having zero errors,
.IP -
36.8% chance of having one error,
.IP -
18.4% chance of having two errors,
.IP -
6.1% chance of having three errors,
.IP -
1.5% chance of having four errors,
.IP -
0.3% chance of having five errors,
.IP -
etc.
.RE
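.IP
The percentages listed above can be reproduced with the short Python
sketch below. It is an illustration, not vsearch code: the function
names are hypothetical, and error probabilities are assumed to follow
the usual Phred relation P = 10^(-Q/10) with an ASCII base of 33.
.nf
```python
import math


def expected_error(quality, ascii_base=33):
    # Sum of the error probabilities over all positions.
    return sum(10 ** (-(ord(c) - ascii_base) / 10) for c in quality)


def prob_k_errors(ee, k):
    # Poisson probability of k errors, with lambda = expected error.
    return math.exp(-ee) * ee ** k / math.factorial(k)


# Percentages for a read with an expected error of 1.0:
# 36.8, 36.8, 18.4, 6.1, 1.5 and 0.3 for k = 0 to 5.
for k in range(6):
    print(k, round(100 * prob_k_errors(1.0, k), 1))
```
.fi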
.PP
.TAG fastq_maxee_rate
.TP
.BI \-\-fastq_maxee_rate\~ real
When using \-\-fastq_filter or \-\-fastx_filter, discard sequences
with an average expected error greater than the specified number
(value ranging from 0.0 to 1.0 included). For a given sequence, the
average expected error is the sum of error probabilities for all the
positions in the sequence, divided by the length of the sequence.
.TAG fastq_maxlen
.TP
.BI \-\-fastq_maxlen\~ "positive integer"
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
discard sequences with more than the specified number of bases.
.TAG fastq_maxmergelen
.TP
.BI \-\-fastq_maxmergelen\~ "positive integer"
When using \-\-fastq_mergepairs, specify the maximum length of the
merged sequence (default is 1,000,000).
.TAG fastq_maxns
.TP
.BI \-\-fastq_maxns\~ "positive integer"
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
discard sequences with more than the specified number of N's.
.TAG fastq_mergepairs
.TP
.BI \-\-fastq_mergepairs\0 filename
Merge paired-end sequence reads into one sequence. The forward reads
are specified as the argument to this option and the reverse reads are
specified with the \-\-reverse option. Reads with the same
index/position in the forward and reverse files are considered to form
a pair, even if their labels are different. Thus, forward and reverse
reads \fBmust\fR appear in the same order and total number in both
files. A warning is emitted if the forward and reverse files contain
different numbers of reads. The merged sequences are written to the
file(s) specified with the \-\-fastaout or \-\-fastqout options. The
non-merged reads can be output to the files specified with the
\-\-fastaout_notmerged_fwd, \-\-fastaout_notmerged_rev,
\-\-fastqout_notmerged_fwd and \-\-fastqout_notmerged_rev
options. Statistics may be output to the file specified with the
\-\-eetabbedout option. Sequences are truncated as specified with the
\-\-fastq_truncqual option to remove low-quality bases in the 3'
end. Sequences shorter than specified with \-\-fastq_minlen (after
truncation) are discarded (1 by default). Sequences with too many
ambiguous bases (N's), as specified with the \-\-fastq_maxns option, are also
discarded (no limit by default). Staggered reads are not merged unless
the \-\-fastq_allowmergestagger option is specified. The minimum
length of the overlap region between the reads may be specified with
the \-\-fastq_minovlen option (at least 5, default 10). The overlap
region may not include more mismatches than specified with the
\-\-fastq_maxdiffs option (10 by default) or a higher percentage of
mismatches than specified with the \-\-fastq_maxdiffpct option (100.0%
by default), otherwise the read pair is discarded. Additional rules
will avoid merging of reads that cannot be aligned reliably and
unambiguously. The minimum and maximum length of the merged sequence
may be specified with the \-\-fastq_minmergelen and
\-\-fastq_maxmergelen options, respectively. The quality value limits
for output files may be specified with the \-\-fastq_qminout and
\-\-fastq_qmaxout options, but they apply only to the merged region.
Other relevant options are: \-\-fastq_ascii, \-\-fastq_maxee,
\-\-fastq_nostagger, \-\-fastq_qmax, \-\-fastq_qmin, and
\-\-label_suffix.
.TAG fastq_minlen
.TP
.BI \-\-fastq_minlen\~ "positive integer"
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
discard input sequences with less than the specified number of bases
(default 1).
.TAG fastq_minmergelen
.TP
.BI \-\-fastq_minmergelen\~ "positive integer"
When using \-\-fastq_mergepairs, specify the minimum length of the
merged sequence. The default is 1.
.TAG fastq_minovlen
.TP
.BI \-\-fastq_minovlen\~ "positive integer"
When using \-\-fastq_mergepairs, specify the minimum overlap between
the merged reads. The default is 10. Must be at least 5.
.TAG fastq_nostagger
.TP
.B \-\-fastq_nostagger
When using \-\-fastq_mergepairs, forbid the merging of staggered read
pairs. This is the default behaviour of \-\-fastq_mergepairs. To
change that behaviour, see the \-\-fastq_allowmergestagger option.
.TAG fastq_qmax
.TP
.BI \-\-fastq_qmax\~ "positive integer"
Specify the maximum quality score accepted when reading FASTQ
files. The default is 41, which is usual for recent Sanger/Illumina
1.8+ files.
.TAG fastq_qmaxout
.TP
.BI \-\-fastq_qmaxout\~ "positive integer"
When using \-\-fastq_mergepairs, \-\-fastq_convert, \-\-sff_convert or
\-\-fasta2fastq, specify the maximum quality score used when writing
FASTQ files. For the \-\-fasta2fastq command, the value specified here
is the fake quality score used for the FASTQ output file. The default
is 41, which is usual for recent Sanger/Illumina 1.8+ files. Older
formats may use a maximum quality score of 40. The limit only applies
to the merged region when using \-\-fastq_mergepairs.
.TAG fastq_qmin
.TP
.BI \-\-fastq_qmin\~ "positive integer"
Specify the minimum quality score accepted for FASTQ files. The
default is 0, which is usual for recent Sanger/Illumina 1.8+
files. Older formats may use scores between -5 and 2.
.TAG fastq_qminout
.TP
.BI \-\-fastq_qminout\~ "positive integer"
When using \-\-fastq_mergepairs, \-\-fastq_convert or \-\-sff_convert,
specify the minimum quality score used when writing FASTQ files. The
default is 0, which is usual for Sanger/Illumina 1.8+ files. Older
versions of the format may use scores between -5 and 2. The limit
applies only to the merged region when using \-\-fastq_mergepairs.
.TAG fastq_stats
.TP
.BI \-\-fastq_stats \0filename
Analyze a FASTQ file and report the number of reads it contains. The
quality encoding and the range of quality values may be specified with
\-\-fastq_ascii, \-\-fastq_qmin and \-\-fastq_qmax. This command
requires the \-\-log option and outputs the following detailed
statistics on read length, quality score, length vs. quality
distributions, and length / quality filtering:
.RS
.TP
Read length distribution:
.RS
.nr step 1 1
.IP \n[step]. 4
L: read length.
.IP \n+[step].
N: number of reads.
.IP \n+[step].
Pct: fraction of reads with this length.
.IP \n+[step].
AccPct: fraction of reads with this length or longer.
.RE
.TP
Quality score distribution:
.RS
.nr step 1 1
.IP \n[step]. 4
ASCII: character encoding the quality score.
.IP \n+[step].
Q: Phred quality score.
.IP \n+[step].
Pe: probability of error associated with the quality score.
.IP \n+[step].
N: number of bases with this quality score.
.IP \n+[step].
Pct: fraction of bases with this quality score.
.IP \n+[step].
AccPct: fraction of bases with this quality score or higher.
.RE
.TP
Length vs. quality distribution:
.RS
.nr step 1 1
.IP \n[step]. 4
L: position in reads (starting from position 2).
.IP \n+[step].
PctRecs: fraction of reads with at least this length.
.IP \n+[step].
AvgQ: average quality score over all reads up to this position.
.IP \n+[step].
P(AvgQ): error probability corresponding to AvgQ.
.IP \n+[step].
AvgP: average error probability.
.IP \n+[step].
AvgEE: average expected error over all reads up to this position.
.IP \n+[step].
Rate: growth rate of AvgEE between this position and position - 1.
.IP \n+[step].
RatePct: Rate (as explained above) expressed as a percentage.
.RE
.TP
Effect of expected error and length filtering:
.RS
The first column indicates read lengths (\fIL\fR). The next four
columns indicate the number of reads that would be retained by the
\-\-fastq_filter command if the reads were truncated at length \fIL\fR
(option \-\-fastq_trunclen \fIL\fR) and filtered to have a maximum
expected error of 1.0, 0.5, 0.25 or 0.1 (with the option
\-\-fastq_maxee \fIfloat\fR). The last four columns indicate the
fraction of reads that would be retained by the \-\-fastq_filter
command using the same length and maximum expected error parameters.
.RE
.TP
Effect of minimum quality and length filtering:
.RS
The first column indicates read lengths (\fILen\fR). The next four
columns indicate the fraction of reads that would be retained by the
\-\-fastq_filter command if the reads were truncated at length
\fILen\fR (option \-\-fastq_trunclen \fILen\fR) or at the first
position with a quality \fIQ\fR below 5, 10, 15 or 20 (option
\-\-fastq_truncqual \fIQ\fR).
.RE
.RE
.TAG fastq_stripleft
.TP
.BI \-\-fastq_stripleft\~ "positive integer"
When using \-\-fastq_filter or \-\-fastx_filter, strip the specified
number of bases from the left end of the reads. If the length of the
resulting read is zero, then the read is discarded.
.TAG fastq_stripright
.TP
.BI \-\-fastq_stripright\~ "positive integer"
When using \-\-fastq_filter or \-\-fastx_filter, strip the specified
number of bases from the right end of the reads. If the length of the
resulting read is zero, then the read is discarded.
.TAG fastq_tail
.TP
.BI \-\-fastq_tail\~ "positive integer"
When using \-\-fastq_chars, count the number of times a series of
characters of length \fIk\fR appears at the end of quality strings. By
default, \fIk\fR = 4.
.TAG fastq_truncee
.TP
.BI \-\-fastq_truncee\~ real
When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences so
that their total expected error is not higher than the specified
value.
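.IP
One way to read this rule is sketched below in Python. This is an
assumption-laden illustration, not the vsearch implementation: the
function name is hypothetical, the longest prefix whose accumulated
expected error stays within the limit is kept, and error probabilities
are assumed to follow the Phred relation P = 10^(-Q/10) with an ASCII
base of 33.
.nf
```python
def truncate_by_expected_error(seq, qual, truncee, ascii_base=33):
    # Hypothetical helper: accumulate error probabilities from the
    # start of the read and cut before the position where the total
    # would exceed the limit.
    total = 0.0
    for i, c in enumerate(qual):
        total += 10 ** (-(ord(c) - ascii_base) / 10)
        if total > truncee:
            return seq[:i], qual[:i]
    return seq, qual


# Each "+" encodes Q = 10, that is an error probability of 0.1.
print(truncate_by_expected_error("ACGT", "++++", 0.25))
```
.fi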
.TAG fastq_trunclen
.TP
.BI \-\-fastq_trunclen\~ "positive integer"
When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences to
the specified length. Shorter sequences are discarded.
.TAG fastq_trunclen_keep
.TP
.BI \-\-fastq_trunclen_keep\~ "positive integer"
When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences to
the specified length. Shorter sequences are not discarded.
.TAG fastq_truncqual
.TP
.BI \-\-fastq_truncqual\~ "positive integer"
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
truncate sequences starting from the first base with the specified
base quality score value or lower.
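.IP
The truncation rule can be sketched in Python as follows. This is an
illustration, not vsearch code: the function name is hypothetical and
an ASCII base of 33 is assumed.
.nf
```python
def truncate_at_quality(seq, qual, truncqual, ascii_base=33):
    # Hypothetical helper: cut the read at the first base whose
    # quality score is equal to or lower than truncqual.
    for i, c in enumerate(qual):
        if ord(c) - ascii_base <= truncqual:
            return seq[:i], qual[:i]
    return seq, qual


# "#" encodes Q = 2 with phred+33, so the read is cut at that base.
print(truncate_at_quality("ACGTAC", "IIII#I", 2))
```
.fi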
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
When using \-\-fastq_filter, \-\-fastq_mergepairs, \-\-fastx_filter or
\-\-fasta2fastq, write to the given FASTQ-formatted file the sequences
passing the filter, or the merged or converted sequences.
.TAG fastqout_rev
.TP
.BI \-\-fastqout_rev \0filename
When using \-\-fastq_filter or \-\-fastx_filter,
write to the given FASTQ-formatted file the reverse reads passing the
filter.
.TAG fastqout_discarded
.TP
.BI \-\-fastqout_discarded \0filename
When using \-\-fastq_filter or \-\-fastx_filter, write sequences that
do not pass the filter to the given FASTQ-formatted file.
.TAG fastqout_discarded_rev
.TP
.BI \-\-fastqout_discarded_rev \0filename
When using \-\-fastq_filter or \-\-fastx_filter, write reverse reads that
do not pass the filter to the given FASTQ-formatted file.
.TAG fastqout_notmerged_fwd
.TP
.BI \-\-fastqout_notmerged_fwd \0filename
When using \-\-fastq_mergepairs, write forward reads not merged to the
specified FASTQ file.
.TAG fastqout_notmerged_rev
.TP
.BI \-\-fastqout_notmerged_rev \0filename
When using \-\-fastq_mergepairs, write reverse reads not merged to the
specified FASTQ file.
.TAG fastx_filter
.TP
.BI \-\-fastx_filter \0filename
Trim and/or filter the sequences in the given FASTA or FASTQ file and
output the remaining sequences to the FASTQ file specified with the
\-\-fastqout option and/or to the FASTA file specified with the
\-\-fastaout option. Discarded sequences are written to the files
specified with the \-\-fastaout_discarded and \-\-fastqout_discarded
options. The input format (FASTA or FASTQ) is automatically
detected. If the input consists of paired sequences, an input file
with reverse reads may be specified with the \-\-reverse option, and
corresponding output will be written to the files specified with the
\-\-fastqout_rev, \-\-fastaout_rev, \-\-fastqout_discarded_rev, and
\-\-fastaout_discarded_rev options. Output cannot be written to FASTQ
files if the input is in FASTA format. The sequences are first trimmed
and then filtered based on the remaining bases. Sequences may be
trimmed using the options \-\-fastq_stripleft, \-\-fastq_stripright,
\-\-fastq_truncee, \-\-fastq_trunclen, \-\-fastq_trunclen_keep and
\-\-fastq_truncqual. The sequences may be filtered using the options
\-\-fastq_maxee, \-\-fastq_maxee_rate, \-\-fastq_maxlen,
\-\-fastq_maxns, \-\-fastq_minlen (default 1), \-\-fastq_trunclen,
\-\-maxsize, and \-\-minsize. Sequences not satisfying the
requirements are discarded. For pairs of sequences, both sequences in
a pair must satisfy the requirements, otherwise both are discarded. If
no shortening or filtering options are given, all sequences are
written to the output files, possibly after conversion from FASTQ to
FASTA format. The \-\-relabel option may be used to relabel the output
sequences. The \-\-eeout option may be used to output the expected
number of errors in each sequence. After all sequences have been
processed, the number of kept and discarded sequences will be shown,
as well as how many of the kept sequences were trimmed. When the input
is in FASTA format, the following options are not accepted because
quality scores are not available: \-\-eeout, \-\-fastq_ascii,
\-\-fastq_eeout, \-\-fastq_maxee, \-\-fastq_maxee_rate, \-\-fastqout,
\-\-fastq_qmax, \-\-fastq_qmin, \-\-fastq_truncee,
\-\-fastq_truncqual, \-\-fastqout_discarded,
\-\-fastqout_discarded_rev, \-\-fastqout_rev.
.TAG fastx_revcomp
.TP
.BI \-\-fastx_revcomp \0filename
Reverse-complement the sequences in the given FASTA or FASTQ file to a
file specified with the \-\-fastaout and/or \-\-fastqout options. If
the input file is in FASTA format, the output can not be written back
to a FASTQ file due to missing base quality scores.
.TAG join_padgap
.TP
.BI \-\-join_padgap\~ string
When running \-\-fastq_join, use the \fIstring\fR as a sequence
padding string. The default is NNNNNNNN (8 N's).
.TAG join_padgapq
.TP
.BI \-\-join_padgapq\~ string
When running \-\-fastq_join, use the \fIstring\fR as a quality padding
string. The default is a string of I's equal in length to the sequence
padding string. The letter I corresponds to a base quality score of 40
indicating a very high quality base with error probability of 0.0001.
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA or
FASTQ format by adding a ";length=\fIinteger\fR" attribute in the
header.
.TAG maxsize
.TP
.BI \-\-maxsize\~ "positive integer"
When using \-\-fastq_filter or \-\-fastx_filter, discard sequences
with an abundance higher than the specified value.
.TAG minsize
.TP
.BI \-\-minsize\~ "positive integer"
When using \-\-fastq_filter or \-\-fastx_filter, discard sequences
with an abundance lower than the specified value.
.TAG output
.TP
.BI \-\-output \0filename
When using \-\-fastq_eestats or \-\-fastq_eestats2, write tabulated
results to \fIfilename\fR. See \-\-fastq_eestats's and
\-\-fastq_eestats2's documentation for a complete description of the
table.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When using \-\-relabel, keep the old identifier in the header after a
space.
.TAG relabel
.TP
.BI \-\-relabel \0string
Please see the description of the same option under Chimera detection
for details.
.TAG relabel_md5
.TP
.BI \-\-relabel_md5
Please see the description of the same option under Chimera detection
for details.
.TAG relabel_self
.TP
.BI \-\-relabel_self
Please see the description of the same option under Chimera detection
for details.
.TAG relabel_sha1
.TP
.BI \-\-relabel_sha1
Please see the description of the same option under Chimera detection
for details.
.TAG reverse
.TP
.BI \-\-reverse \0filename
When using \-\-fastq_filter, \-\-fastx_filter, \-\-fastq_mergepairs or
\-\-fastq_join, specify the FASTQ file containing the
reverse reads.
.TAG sff_convert
.TP
.BI \-\-sff_convert \0filename
Convert the given SFF file to FASTQ. The FASTQ output file is
specified with the \-\-fastqout option. The sequence may be clipped as
specified in the SFF file if the option \-\-sff_clip is specified,
otherwise no clipping occurs. Bases that would have been clipped are
converted to lower case, while the rest is in upper case. The output
quality encoding may be specified with the \-\-fastq_asciiout option
(default 33). The minimum and maximum output quality scores may be
limited using the \-\-fastq_qminout and \-\-fastq_qmaxout options.
.TAG sff_clip
.TP
.BI \-\-sff_clip
Specifies that the sequences converted by the \-\-sff_convert command
should be clipped in both ends as indicated in the SFF file. By
default no clipping is performed.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.TAG xee
.TP
.B \-\-xee
Strip information about expected errors (ee) from the output file
headers. This information is added by the \-\-fastq_eeout and
\-\-eeout options.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG masking-options
Masking options:
.RS
.PP
An input sequence can be composed of lower- or uppercase letters. When
soft masking is specified, lower case letters are treated as symbols
that should be masked. Otherwise the case of the input sequences is
ignored.
.PP
Masking is performed by the commands for chimera detection
(uchime_denovo, uchime_ref), clustering (cluster_fast,
cluster_smallmem, cluster_size), masking (maskfasta, fastx_mask),
pairwise alignment (allpairs_global) and searching (search_exact,
usearch_global).
.PP
Masking is usually specified with the \-\-qmask option, while the
\-\-dbmask option is used for the database sequences specified with
the \-\-db option with the \-\-usearch_global, \-\-search_exact and
\-\-uchime_ref commands.
.PP
The argument to the \-\-qmask and \-\-dbmask option may be none, soft
or dust. If the argument is none, then no masking is performed. If the
argument is soft, the lower case symbols are masked. Finally, if the
argument is dust, the sequence is masked using the DUST algorithm by
Tatusov and Lipman to mask low-complexity regions.
.PP
If the \-\-hardmask option is specified, all masked regions are
converted to N's, otherwise masked regions are indicated by lower case
letters.
.PP
If any sequence is masked, the masked version of the sequence (with
lower case letters or N's) is used in all output files. Otherwise the
sequence is unmodified. The exception is the sequences in the output
file specified with the \-\-uchimealns option, where the input
sequences are converted to upper case first and lower case letters
indicate disagreement between the aligned sequences.
.PP
The \-\-qmask option (or \-\-dbmask for database sequences) may be
combined with the \-\-hardmask option. The results of using the none,
dust or soft argument to \-\-qmask or \-\-dbmask are presented below,
assuming each input sequence contains both lower and uppercase
symbols.
.PP
Results if the \-\-hardmask option is off (default):
.RS
.TP 9
.B none:
no masking, all symbols used, no change
.TP
.B dust:
masked symbols lowercased, rest uppercased
.TP
.B soft:
lowercase symbols masked, no case changes
.RE
.PP
Results if the \-\-hardmask option is on:
.RS
.TP 9
.B none:
no masking, all symbols used, no change
.TP
.B dust:
masked symbols changed to Ns, rest unchanged
.TP
.B soft:
lowercase symbols masked and changed to Ns
.RE
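.PP
For the soft argument, the effect of \-\-hardmask on the output can be
sketched in Python. This is an illustration only, not vsearch code:
the function name is hypothetical, and the dust argument is left out
because it requires the DUST algorithm itself.
.nf
```python
def soft_masked_output(seq, hardmask=False):
    # Hypothetical helper for the soft argument: lowercase symbols
    # are the masked ones. Without hardmask the case is unchanged,
    # with hardmask the masked symbols become N.
    if hardmask:
        return "".join("N" if c.islower() else c for c in seq)
    return seq


print(soft_masked_output("ACgtAC"))                 # ACgtAC
print(soft_masked_output("ACgtAC", hardmask=True))  # ACNNAC
```
.fi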
.PP
When a sequence region is masked, words in the region are not included
in the indices used in the heuristic search algorithm. In all other
aspects, the region is treated as other regions.
.PP
Regions in sequences that are hardmasked (with N's) have a zero
alignment score and do not contribute to an alignment.
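.PP
As an illustrative sketch (the file names \fIinput.fasta\fR and
\fImasked.fasta\fR are hypothetical), the following command would mask
low-complexity regions with DUST and replace the masked symbols with
N's:
.PP
.RS
\fBvsearch\fR \-\-fastx_mask \fIinput.fasta\fR \-\-qmask dust
\-\-hardmask \-\-fastaout \fImasked.fasta\fR
.RE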
.RE
.PP
.RS
.TAG fastaout
.TP 9
.BI \-\-fastaout \0filename
Write the masked sequences to \fIfilename\fR, in fasta format. Applies
only to the \-\-fastx_mask command.
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
Write the masked sequences to \fIfilename\fR, in fastq format. Applies
only to the \-\-fastx_mask command.
.TAG fastx_mask
.TP
.BI \-\-fastx_mask \0filename
Mask regions in sequences contained
in the specified fasta or fastq file. The default is to mask using
DUST (use \-\-qmask to modify that behaviour). The output files
are specified with the \-\-fastaout and \-\-fastqout options. The
minimum and maximum percentage of unmasked residues may be specified
with the \-\-min_unmasked_pct and \-\-max_unmasked_pct options,
respectively.
.TAG hardmask
.TP
.B \-\-hardmask
Symbols in masked regions are replaced by N's. The default is to
replace the masked regions by lower case letters.
.TAG maskfasta
.TP
.BI \-\-maskfasta \0filename
Mask regions in sequences contained in the fasta file
\fIfilename\fR. The default is to mask using \fIdust\fR (use \-\-qmask
to modify that behaviour). The output file is specified with the
\-\-output option. This command is deprecated, please use
\-\-fastx_mask instead.
.TAG max_unmasked_pct
.TP
.BI \-\-max_unmasked_pct \0real
Discard sequences with more than the specified maximum percentage of
unmasked residues. Works only with \-\-fastx_mask.
.TAG min_unmasked_pct
.TP
.BI \-\-min_unmasked_pct \0real
Discard sequences with less than the specified minimum percentage of
unmasked residues. Works only with \-\-fastx_mask.
.TAG output
.TP
.BI \-\-output \0filename
Write the masked sequences to \fIfilename\fR, in fasta format. Applies
only to the \-\-maskfasta command.
.TAG qmask
.TP
.BI \-\-qmask\~ "none|dust|soft"
If the argument is dust, mask regions in sequences using the
\fIDUST\fR algorithm that detects simple repeats and low-complexity
regions. This is the default. If the argument is soft, mask the lower
case letters in the input sequence. If the argument is none, do not
mask.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG orienting-options
Orienting options:
.RS
.PP
The \-\-orient command can be used to orient the sequences in a given
file in either the forward or the reverse complementary direction
based on a reference database specified with the \-\-db option. The
two strands of each input sequence are compared to the reference
database using nucleotide words. If one of the strands shares many
more words with at least one sequence in the database than the other,
that strand is chosen. The correctly oriented sequences may be written
to a FASTA file specified with the \-\-fastaout option, and to a FASTQ file
specified with the \-\-fastqout option (as long as the input was also
in FASTQ format). If the result is uncertain, because the number of
matching words is too similar, the original sequence is written to the
file specified with the \-\-notmatched option. The results may also be
written to a tab-delimited text file specified with the \-\-tabbedout
option. This file will contain the query label, the direction (+, - or
?), the number of matching words on the forward strand, and the number
of matching words on the reverse complementary strand. By default, a
word length of 12 is used for this command. The word length may be
adjusted using the \-\-wordlength option. There have to be at least 4
times as many matches on one strand as on the other for a strand to be
selected. In addition to the common options, the following options may
also be specified for this command: \-\-dbmask, \-\-qmask,
\-\-relabel, \-\-relabel_keep, \-\-relabel_md5, \-\-relabel_self,
\-\-relabel_sha1, \-\-sizein, and \-\-sizeout.
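.PP
As a usage sketch, assuming FASTQ input reads and a FASTA reference
database (all file names below are placeholders), the oriented reads,
the undetermined reads and the per-query word counts could be written
with:
.PP
.RS
\fBvsearch\fR \-\-orient \fIreads.fastq\fR \-\-db \fIreference.fasta\fR
\-\-fastqout \fIoriented.fastq\fR \-\-notmatched
\fIundetermined.fastq\fR \-\-tabbedout \fIorient.tsv\fR
.RE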
.PP
.TAG db
.TP 9
.BI \-\-db \0filename
Read the reference database from the given file. It may be in FASTA,
FASTQ or UDB format. If a UDB file is used it should have been
created with a wordlength of 12.
.TAG fastaout
.TP
.BI \-\-fastaout \0filename
Write the correctly oriented sequences to \fIfilename\fR, in fasta format.
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
Write the correctly oriented sequences to \fIfilename\fR, in fastq format.
.TAG notmatched
.TP
.BI \-\-notmatched \0filename
Write the sequences with undetermined direction to \fIfilename\fR, in
the original format.
.TAG orient
.TP
.BI \-\-orient \0filename
Orient the sequences in the given file.
.TAG tabbedout
.TP
.BI \-\-tabbedout \0filename
Write the results to a tab-delimited text file with the specified
\fIfilename\fR. This file will contain the query label, the direction
(+, - or ?), the number of matching words on the forward strand, and
the number of matching words on the reverse complementary strand.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG pairwise-alignment-options
Pairwise alignment options:
.RS
.PP
The results of the n * (n-1) / 2 pairwise alignments are written to
the result files specified with \-\-alnout, \-\-blast6out,
\-\-fastapairs, \-\-matched, \-\-notmatched, \-\-qsegout, \-\-samout,
\-\-tsegout, \-\-uc or \-\-userout (see Searching section
below). Specify either the \-\-acceptall option to output all pairwise
alignments, or specify an identity level with \-\-id to discard weak
alignments. Most other accept/reject options (see Searching options
below) may also be used. Sequences are aligned on their \fIplus\fR
strand only. Masking is performed as usual and specified with
\-\-qmask and \-\-hardmask.
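.PP
A minimal usage sketch (the file names are placeholders and the
identity threshold of 0.9 is an arbitrary choice) that keeps only
pairwise alignments with at least 90% identity:
.PP
.RS
\fBvsearch\fR \-\-allpairs_global \fIsequences.fasta\fR \-\-id 0.9
\-\-uc \fIpairs.uc\fR
.RE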
.TAG acceptall
.TP 9
.B \-\-acceptall
Write the results of all alignments to output files. This option
overrides all other accept/reject options (including \-\-id).
.TAG allpairs_global
.TP
.BI \-\-allpairs_global \0filename
Perform optimal global pairwise alignments of the fasta sequences
contained in \fIfilename\fR. Each sequence is compared to all sequences
that come after it in the file, resulting in a total of n * (n-1) / 2
pairwise alignments, where n is the total number of sequences. This
command is multi-threaded.
.TAG id
.TP
.BI \-\-id \0real
Reject the sequence match if the pairwise identity is lower than
\fIreal\fR (value ranging from 0.0 to 1.0 included).
.TAG threads
.TP
.BI \-\-threads\~ "positive integer"
Number of computation threads to use (1 to 1024). The number of
threads should be less than or equal to the number of available CPU
cores. The default is to use all available resources and to launch one
thread per logical core.
.TAG uc
.TP
.BI \-\-uc \0filename
Output pairwise alignment results in \fIfilename\fR using a
tab-separated uclust-like format with 10 columns. Each sequence is
compared to all other sequences, and all hits (\-\-acceptall) or only
some hits (\-\-id \fIreal\fR) are reported, with one pairwise
comparison per line:
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
Record type, always set to 'H'.
.IP \n+[step].
Ordinal number of the target sequence (based on input order, starting
from zero).
.IP \n+[step].
Sequence length.
.IP \n+[step].
Percentage of similarity with the target sequence.
.IP \n+[step].
Match orientation, always set to '+'.
.IP \n+[step].
Not used, always set to zero.
.IP \n+[step].
Not used, always set to zero.
.IP \n+[step].
Compact representation of the pairwise alignment using the CIGAR
format (Compact Idiosyncratic Gapped Alignment Report): M
(match/mismatch), D (deletion) and I (insertion). The equal sign '='
indicates that the query is identical to the centroid sequence.
.IP \n+[step].
Label of the query sequence.
.IP \n+[step].
Label of the target sequence.
.RE
.RE
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG restriction-site-cutting-options
Restriction site cutting options:
.RS
.PP
The input sequences in the file specified with the \-\-cut command are
cut into fragments at all restriction sites matching the pattern given
with the \-\-cut_pattern option. The fragments on the forward strand
are written to the file specified with the \-\-fastaout option and the
fragments on the reverse strand are written to the file specified with
the \-\-fastaout_rev option. Input sequences that do not match are
written to the file specified with the option \-\-fastaout_discarded,
and their reverse complements are also written to the file specified
with the \-\-fastaout_discarded_rev option. The relabel options
(\-\-relabel, \-\-relabel_self, \-\-relabel_keep, \-\-relabel_md5, and
\-\-relabel_sha1) may be used to relabel the output sequences.
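.PP
As an illustrative sketch (file names are hypothetical), cutting at
EcoRI sites while keeping the uncut sequences in a separate file could
look like this:
.PP
.RS
\fBvsearch\fR \-\-cut \fIgenome.fasta\fR \-\-cut_pattern "G^AATT_C"
\-\-fastaout \fIfragments.fasta\fR \-\-fastaout_discarded
\fIuncut.fasta\fR
.RE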
.TAG cut
.TP 9
.BI \-\-cut \0filename
Specify the input file with sequences in FASTA format.
.TAG cut_pattern
.TP
.BI \-\-cut_pattern \0string
Specify the restriction site cutting pattern and positions. The
pattern is a string of lower- or uppercase letters specifying the
nucleotides that must match, and may include ambiguous nucleotide
symbols. The special characters "^" (circumflex) and "_" (underscore)
are used to indicate the cutting position on the forward and reverse
strand, respectively. For example, the pattern "G^AATT_C" is the
pattern for the EcoRI restriction site. For such palindromic patterns
(identical to their reverse complement) the command will output all
possible fragments on both strands. For non-palindromic sites, it may
be necessary to run the command also on the reverse complemented input
sequences. Exactly one cutting site on each strand must be indicated.
.TAG fastaout
.TP
.BI \-\-fastaout \0filename
Specify the output file for the resulting fragments on the forward
strand.
.TAG fastaout_rev
.TP
.BI \-\-fastaout_rev \0filename
Specify the output file for the resulting fragments on the reverse
strand.
.TAG fastaout_discarded
.TP
.BI \-\-fastaout_discarded \0filename
Specify the output file for the non-matching sequences.
.TAG fastaout_discarded_rev
.TP
.BI \-\-fastaout_discarded_rev \0filename
Specify the output file for the non-matching sequences, reverse
complemented.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG searching-options
Searching options:
.RS
.TAG alnout
.TP 9
.BI \-\-alnout \0filename
Write pairwise global alignments to \fIfilename\fR using a
human-readable format. Use \-\-rowlen to modify the width of alignment
lines. Output order may vary when using multiple threads.
.TAG biomout
.TP
.BI \-\-biomout \0filename
Write search results to an OTU table in the biom version 1.0 file
format. The query file contains the samples, while the database file
contains the OTUs. Sample and OTU identifiers are extracted from the
header of these sequences. See the \-\-biomout option in the
Clustering section for further details.
.TAG blast6out
.TP
.BI \-\-blast6out \0filename
Write search results to \fIfilename\fR using a blast-like
tab-separated format of twelve fields (listed below), with one line
per query-target match (or lack of match if \-\-output_no_hits
is used). Warning, vsearch uses global pairwise alignments, not
blast's seed-and-extend algorithm. Therefore, some common blast output
values (alignment start and end, evalue, bit score) are reported
differently. Output order may vary when using multiple threads. A
similar output can be obtained with \-\-userout \fIfilename\fR and
\-\-userfields
query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits.  A
complete list and description is available in the section 'Userfields'
of this manual.
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
\fIquery\fR: query label.
.IP \n+[step].
\fItarget\fR: target (database sequence) label. The field is set
to '*' if there is no alignment.
.IP \n+[step].
\fIid\fR: percentage of identity (real value ranging from 0.0 to
100.0). The percentage identity is defined as 100 * (matching columns)
/ (alignment length - terminal gaps). See fields id0 to id4 for other
definitions.
.IP \n+[step].
\fIalnlen\fR: length of the query-target alignment (number of
columns). The field is set to 0 if there is no alignment.
.IP \n+[step].
\fImism\fR: number of mismatches in the alignment (zero or positive
integer value).
.IP \n+[step].
\fIopens\fR: number of columns containing a gap opening (zero or
positive integer value, excluding terminal gaps).
.IP \n+[step].
\fIqlo\fR: first nucleotide of the query aligned with the
target. Always equal to 1 if there is an alignment, 0 otherwise (see
\fIqilo\fR to ignore initial gaps).
.IP \n+[step].
\fIqhi\fR: last nucleotide of the query aligned with the
target. Always equal to the length of the pairwise alignment, 0
otherwise (see \fIqihi\fR to ignore terminal gaps).
.IP \n+[step].
\fItlo\fR: first nucleotide of the target aligned with the
query. Always equal to 1 if there is an alignment, 0 otherwise (see
\fItilo\fR to ignore initial gaps).
.IP \n+[step].
\fIthi\fR: last nucleotide of the target aligned with the
query. Always equal to the length of the pairwise alignment, 0
otherwise (see \fItihi\fR to ignore terminal gaps).
.IP \n+[step].
\fIevalue\fR: expectancy-value (not computed for nucleotide
alignments). Always set to -1.
.IP \n+[step].
\fIbits\fR: bit score (not computed for nucleotide
alignments). Always set to 0.
.RE
.RE
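.RS
.PP
As an illustration of the equivalence mentioned above (file names are
placeholders and the 0.97 identity threshold is arbitrary), the
following command would write both output files, which should contain
similar tables:
.PP
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fasta\fR \-\-db
\fIreferences.fasta\fR \-\-id 0.97 \-\-blast6out \fIhits.b6\fR
\-\-userout \fIhits.tsv\fR \-\-userfields
query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits
.RE
.RE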
.TAG db
.TP
.BI \-\-db \0filename
Compare query sequences (specified with \-\-usearch_global) to the
target sequences contained in \fIfilename\fR in FASTA or FASTQ format,
using global pairwise alignment. Alternatively, the name of a
preformatted UDB database created using the makeudb_usearch command
(see below) may be specified.
.TAG dbmask
.TP
.BI \-\-dbmask\~ "none|dust|soft"
Mask regions in the target database sequences using the dust method or
the soft method, or do not mask (none). Warning, when using soft
masking search commands become case sensitive. The default is to mask
using dust.
.TAG dbmatched
.TP
.BI \-\-dbmatched \0filename
Write database target sequences matching at least one query sequence
to \fIfilename\fR, in fasta format. If the option \-\-sizeout is used,
the number of queries that matched each target sequence is indicated
using the pattern ";size=\fIinteger\fR;".
.TAG dbnotmatched
.TP
.BI \-\-dbnotmatched \0filename
Write database target sequences not matching query sequences to
\fIfilename\fR, in fasta format.
.TAG fastapairs
.TP
.BI \-\-fastapairs \0filename
Write pairwise alignments of query and target sequences to
\fIfilename\fR, in fasta format.
.TAG fulldp
.TP
.B \-\-fulldp
Dummy option for compatibility with usearch. To maximize search
sensitivity, \fBvsearch\fR uses an 8-way 16-bit SIMD vectorized full
dynamic programming algorithm (Needleman-Wunsch), whether or not
\-\-fulldp is specified.
.TAG gapext
.TP
.BI \-\-gapext \0string
Set penalties for a gap extension. See \-\-gapopen for a complete
description of the penalty declaration system. The default is to
initialize the six gap extending penalties using a penalty of 2 for
extending internal gaps and a penalty of 1 for extending terminal
gaps, in both query and target sequences (i.e. 2I/1E).
.TAG gapopen
.TP
.BI \-\-gapopen \0string
Set penalties for a gap opening. A gap opening can occur in six
different contexts: in the query (Q) or in the target (T) sequence, at
the left (L) or right (R) extremity of the sequence, or inside the
sequence (I). Sequence symbols (Q and T) can be combined with location
symbols (L, I, and R), and numerical values to declare penalties for
all possible contexts: aQL/bQI/cQR/dTL/eTI/fTR, where abcdef are zero
or positive integers, and '/' is used as a separator.
.br
To simplify declarations, the location symbols (L, I, and R) can be
combined, the symbol (E) can be used to treat both extremities (L and
R) equally, and the symbols Q and T can be omitted to treat query and
target sequences equally. For instance, the default is to declare a
penalty of 20 for opening internal gaps and a penalty of 2 for opening
terminal gaps (left or right), in both query and target sequences
(i.e. 20I/2E). If only a numerical value is given, without any
sequence or location symbol, then the penalty applies to all gap
openings. To forbid gap-opening, an infinite penalty value can be
declared with the symbol '*'. To use \fBvsearch\fR as a semi-global
aligner, a null-penalty can be applied to the left (L) or right (R)
gaps.
.br
\fBvsearch\fR always initializes the six gap opening
penalties using the default parameters (20I/2E). The user is then free
to declare only the values they want to modify. The \fIstring\fR is
scanned from left to right, accepted symbols are (0123456789/LIREQT*),
and later values override previous values.
.br
Please note that \fBvsearch\fR, in contrast to usearch, only allows
integer gap penalties. Because the lowest gap penalties are 0.5 by
default in usearch, all default scores and gap penalties in
\fBvsearch\fR have been doubled to maintain equivalent penalties and
to produce identical alignments.
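.RS
.PP
As a hedged illustration of the declaration syntax (file names and the
identity threshold are placeholders), the following sketch keeps the
default internal penalties but sets the terminal gap opening and
extension penalties to zero, which is one way to approximate a
semi-global alignment:
.PP
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fasta\fR \-\-db
\fIreferences.fasta\fR \-\-id 0.97 \-\-gapopen 20I/0E \-\-gapext 2I/0E
\-\-alnout \fIhits.aln\fR
.RE
.RE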
.TAG hardmask
.TP
.B \-\-hardmask
Mask sequence regions by replacing them with Ns instead of setting
them to lower case as is the default. For more information, please see
the Masking section.
.TAG id
.TP
.BI \-\-id \0real
Reject the sequence match if the pairwise identity is lower than
\fIreal\fR (value ranging from 0.0 to 1.0 included). The search
process sorts target sequences by decreasing number of \fIk\fR-mers
they have in common with the query sequence, using that information as
a proxy for sequence similarity. That efficient pre-filtering also
prevents pairwise alignments with very short, or with weakly matching
targets, as there needs to be by default at least 12 shared
\fIk\fR-mers to start the pairwise alignment, and at least one out of
every 16 \fIk\fR-mers from the query needs to match the target (see
options \-\-wordlength and \-\-minwordmatches to change that
behaviour). Consequently, using values lower than \-\-id 0.5 is not
likely to capture more weakly matching targets. The pairwise identity
is by default defined as (matching columns) / (alignment
length - terminal gaps). That definition can be modified by \-\-iddef.
.TAG iddef
.TP
.BI \-\-iddef\~ "0|1|2|3|4"
Change the pairwise identity definition used in \-\-id. Values
accepted are:
.RS
.RS
.nr step 0 1
.IP \n[step]. 4
CD-HIT definition: (matching columns) / (shortest sequence length).
.IP \n+[step].
edit distance: (matching columns) / (alignment length).
.IP \n+[step].
edit distance excluding terminal gaps (default definition for \-\-id).
.IP \n+[step].
Marine Biological Lab definition counting each gap opening (internal
or terminal) as a single mismatch, whether or not the gap was
extended: 1.0 - [(mismatches + gap openings)/(longest sequence
length)]
.IP \n+[step].
BLAST definition, equivalent to \-\-iddef 1 for global pairwise
alignments.
.RE
.PP
The option \-\-userfields accepts the fields id0 to id4, in addition
to the field id, to report the pairwise identity values corresponding
to the different definitions.
.RE
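.RS
.PP
As a worked illustration with made-up numbers, consider an alignment
of 100 columns, of which 90 are matching columns and 5 are terminal
gap columns. Under \-\-iddef 1 the identity is 90 / 100 = 0.90, whereas
under the default \-\-iddef 2 it is 90 / (100 - 5), or approximately
0.947.
.RE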
.TAG idprefix
.TP
.BI \-\-idprefix\~ "positive integer"
Reject the sequence match if the first \fIinteger\fR nucleotides of
the target do not match the query.
.TAG idsuffix
.TP
.BI \-\-idsuffix\~ "positive integer"
Reject the sequence match if the last \fIinteger\fR nucleotides of the
target do not match the query.
.TAG lca_cutoff
.TP
.BI \-\-lca_cutoff \0real
Adjust the fraction of matching hits required for the last common
ancestor (LCA) output with the \-\-lcaout option during searches. The
default value is 1.0 which requires all hits to match at each
taxonomic rank for that rank to be included. If a lower cutoff value
is used, e.g. 0.95, a small fraction of non-matching hits are allowed
while that rank will still be reported. The argument to this option
must be larger than 0.5, but not larger than 1.0.
.TAG lcaout
.TP
.BI \-\-lcaout \0filename
Output last common ancestor (LCA) information about the hits of each
query to a text file in a tab-separated format. The first column
contains the query id, while the second column contains the taxonomic
information. The headers of the sequences in the database must contain
taxonomic information in the same format as used with the \-\-sintax
command, e.g. "tax=k:Archaea,p:Euryarchaeota,c:Halobacteria". Only the
initial parts of the taxonomy that are common to a large fraction of
the hits of each query will be output. It is necessary to set the
\-\-maxaccepts option to a value different from 1 for this
information to be useful. The \-\-top_hits_only option may also be
useful. The fraction of matching hits required may be adjusted by the
\-\-lca_cutoff option (default 1.0).
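.RS
.PP
A possible invocation, given here only as a sketch (file names, the
identity threshold and the number of accepted hits are arbitrary
choices), reports the last common ancestor of the top hits while
tolerating up to 5% of disagreeing hits at each rank:
.PP
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fasta\fR \-\-db
\fIreferences.fasta\fR \-\-id 0.97 \-\-maxaccepts 10 \-\-top_hits_only
\-\-lcaout \fIlca.tsv\fR \-\-lca_cutoff 0.95
.RE
.RE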
.TAG leftjust
.TP
.B \-\-leftjust
Reject the sequence match if the pairwise alignment begins with gaps.
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG match
.TP
.BI \-\-match\~ "integer"
Score assigned to a match (i.e. identical nucleotides) in the pairwise
alignment. The default value is 2.
.TAG matched
.TP
.BI \-\-matched \0filename
Write query sequences matching database target sequences to
\fIfilename\fR, in fasta format.
.TAG maxaccepts
.TP
.BI \-\-maxaccepts\~ "positive integer"
Maximum number of matching target sequences to accept before stopping
the search for a given query. The default value is 1. This option
works in tandem with \-\-maxrejects. The search process sorts target
sequences by decreasing number of \fIk\fR-mers they have in common
with the query sequence, using that information as a proxy for
sequence similarity. After pairwise alignments, if the first target
sequence passes the acceptance criteria, it is accepted as best hit
and the search process stops for that query. If \-\-maxaccepts is set
to a higher value, more matching targets are accepted. If
\-\-maxaccepts and \-\-maxrejects are both set to 0, the complete
database is searched. See \-\-maxhits option for a control on the
number of hits reported per query when search is done on both strands.
.TAG maxdiffs
.TP
.BI \-\-maxdiffs\~ "positive integer"
Reject the sequence match if the alignment contains at least
\fIinteger\fR substitutions, insertions or deletions.
.TAG maxgaps
.TP
.BI \-\-maxgaps\~ "positive integer"
Reject the sequence match if the alignment contains at least
\fIinteger\fR insertions or deletions.
.TAG maxhits
.TP
.BI \-\-maxhits\~ "non-negative integer"
Maximum number of hits to show once the search is terminated for a
given query (hits are sorted by decreasing identity). When searching
only on the plus strand (default situation, see \-\-strand), the
number of matching targets (\-\-maxaccepts) and the number of hits
(\-\-maxhits) are the same. However, when searching on both strands,
there could be two hits per target (one per strand): \-\-maxhits then
controls the overall number of reported hits per query. Unlimited by
default or if the argument is zero. This option applies to \-\-alnout,
\-\-blast6out, \-\-fastapairs, \-\-samout, \-\-uc, or \-\-userout
output files.
.TAG maxid
.TP
.BI \-\-maxid \0real
Reject the sequence match if the percentage of identity between the
two sequences is greater than \fIreal\fR.
.TAG maxqsize
.TP
.BI \-\-maxqsize\~ "positive integer"
Reject query sequences with an abundance greater than \fIinteger\fR.
.TAG maxqt
.TP
.BI \-\-maxqt \0real
Reject if the query/target sequence length ratio is greater than
\fIreal\fR.
.TAG maxrejects
.TP
.BI \-\-maxrejects\~ "positive integer"
Maximum number of non-matching target sequences to consider before
stopping the search for a given query. The default value is 32. This
option works in tandem with \-\-maxaccepts. The search process sorts
target sequences by decreasing number of \fIk\fR-mers they have in
common with the query sequence, using that information as a proxy for
sequence similarity. After pairwise alignments, if none of the first
32 examined target sequences pass the acceptance criteria, the search
process stops for that query (no hit). If \-\-maxrejects is set to a
higher value, more target sequences are considered. If \-\-maxaccepts
and \-\-maxrejects are both set to 0, the complete database is
searched.
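.RS
.PP
For example, an exhaustive search of the complete database could be
sketched as follows (file names and the identity threshold are
placeholders; such a search is expected to be slower than the default
heuristic):
.PP
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fasta\fR \-\-db
\fIreferences.fasta\fR \-\-id 0.97 \-\-maxaccepts 0 \-\-maxrejects 0
\-\-blast6out \fIhits.b6\fR
.RE
.RE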
.TAG maxsizeratio
.TP
.BI \-\-maxsizeratio \0real
Reject if the query/target abundance ratio is greater than
\fIreal\fR.
.TAG maxsl
.TP
.BI \-\-maxsl \0real
Reject if the shorter/longer sequence length ratio is
greater than \fIreal\fR.
.TAG maxsubs
.TP
.BI \-\-maxsubs\~ "positive integer"
Reject the sequence match if the pairwise alignment contains more than
\fIinteger\fR substitutions.
.TAG mid
.TP
.BI \-\-mid \0real
Reject the sequence match if the percentage of identity is lower than
\fIreal\fR (ignoring all gaps, internal and terminal).
.TAG mincols
.TP
.BI \-\-mincols\~ "positive integer"
Reject the sequence match if the alignment length is shorter than
\fIinteger\fR.
.TAG minqt
.TP
.BI \-\-minqt \0real
Reject if the query/target sequence length ratio is lower than
\fIreal\fR.
.TAG minsizeratio
.TP
.BI \-\-minsizeratio \0real
Reject if the query/target abundance ratio is lower than \fIreal\fR.
.TAG minsl
.TP
.BI \-\-minsl \0real
Reject if the shorter/longer sequence length ratio is lower than
\fIreal\fR.
.TAG mintsize
.TP
.BI \-\-mintsize\~ "positive integer"
Reject target sequences with an abundance lower than \fIinteger\fR.
.TAG minwordmatches
.TP
.BI \-\-minwordmatches\~ "non-negative integer"
Minimum number of \fIk\fR-mers or word matches required for a sequence
to be considered further. Default value is 12 for the default word
length 8. For word lengths 3-15, the default minimum word matches are
18, 17, 16, 15, 14, 12, 11, 10, 9, 8, 7, 5 and 3, respectively. If the
query sequence has fewer unique words than the number specified, all
words in the query must match. If the argument is 0, no word matches
are required.
.TAG mismatch
.TP
.BI \-\-mismatch\~ "integer"
Score assigned to a mismatch (i.e. different nucleotides) in the
pairwise alignment. The default value is -4.
.TAG mothur_shared_out
.TP
.BI \-\-mothur_shared_out \0filename
Write search results to an OTU table in the mothur 'shared'
tab-separated plain text file format. The query file contains the
samples, while the database file contains the OTUs. Sample and OTU
identifiers are extracted from the header of these sequences. See the
\-\-otutabout option in the Clustering section for further details.
.TAG notmatched
.TP
.BI \-\-notmatched \0filename
Write query sequences not matching database target sequences to
\fIfilename\fR, in fasta format.
.TAG otutabout
.TP
.BI \-\-otutabout \0filename
Write search results to an OTU table in the classic tab-separated
plain text format. The query file contains the samples, while the
database file contains the OTUs. Sample and OTU identifiers are
extracted from the header of these sequences (\-\-sample option). See
the \-\-mothur_shared_out option in the Clustering section for further
details.
.TAG output_no_hits
.TP
.B \-\-output_no_hits
Write both matching and non-matching queries to \-\-alnout,
\-\-blast6out, \-\-samout or \-\-userout output files. Non-matching
queries are labelled 'No hits' in \-\-alnout files.
.TAG pattern
.TP
.B \-\-pattern \fIstring\fR
This option is ignored. It is provided for compatibility with usearch.
.TAG qmask
.TP
.BI \-\-qmask\~ "none|dust|soft"
Mask regions in the query sequences
using the dust or the soft algorithms, or do not mask
(none). Warning, when using soft masking search commands
become case sensitive. The default is to mask using \fIdust\fR.
.TAG qsegout
.TP
.BI \-\-qsegout \0filename
Write the aligned part of each query sequence to \fIfilename\fR in
FASTA format.
.TAG query_cov
.TP
.BI \-\-query_cov \0real
Reject if the fraction of the query aligned to the target sequence is
lower than \fIreal\fR (value ranging from 0.0 to 1.0 included). The
query coverage is computed as (matches + mismatches) / query sequence
length. Internal or terminal gaps are not taken into account.
.TAG rightjust
.TP
.B \-\-rightjust
Reject the sequence match if the pairwise alignment ends with gaps.
.TAG rowlen
.TP
.BI \-\-rowlen\~ "positive integer"
Width of alignment lines in \-\-alnout output. The default value is
64. Set to 0 to eliminate wrapping.
.TAG samheader
.TP
.B \-\-samheader
Include header lines in the SAM file when \-\-samout is specified. The
header includes lines starting with @HD, @SQ and @PG, but no @RG lines
(see
.URL https://github.com/samtools/hts-specs (link)
<https://github.com/samtools/hts-specs>). By default no header line is
written.
.TAG samout
.TP
.BI \-\-samout \0filename
Write alignment results to \fIfilename\fR using the SAM format (a
tab-separated text file). When using the \-\-samheader option, the SAM
file starts with header lines. Each non-header line is a SAM record,
which represents either a query-target alignment or the absence of
match for a query (output order may vary when using multiple
threads). Each record contains 11 mandatory fields and optional fields
(see
.URL https://github.com/samtools/hts-specs (link)
<https://github.com/samtools/hts-specs> for a complete description of
the format):
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
query sequence label.
.IP \n+[step].
combination of bitwise flags. Possible values are: 0 (top hit), 4 (no
hit), 16 (reverse-complemented hit), 256 (secondary hit, i.e. all hits
except the top hit).
.IP \n+[step].
target sequence label.
.IP \n+[step].
first position of a target aligned with the query (always 1 for global
pairwise alignments, 0 if there is no match).
.IP \n+[step].
mapping quality (ignored, always set to '*').
.IP \n+[step].
CIGAR string (set to '*' if there is no match).
.IP \n+[step].
name of the target sequence matching with the next read of the query
(for mate reads only, ignored and always set to '*').
.IP \n+[step].
position of the primary alignment of the next read of the query (for
mate reads only, ignored and always set to 0).
.IP \n+[step].
target sequence length (for multi-segment targets, ignored and always
set to 0).
.IP \n+[step].
query sequence (complete, not only the segment aligned to the target
as usearch does).
.IP \n+[step].
quality string (ignored, always set to '*').
.RE
.TP
Optional fields for query-target matches (number and order of fields may vary):
.RS
.nr step 12 1
.IP \n[step]. 4
AS:i:? alignment score (i.e. percentage of identity).
.IP \n+[step].
XN:i:? next best alignment score (always set to 0).
.IP \n+[step].
XM:i:? number of mismatches.
.IP \n+[step].
XO:i:? number of gap openings (excluding terminal gaps).
.IP \n+[step].
XG:i:? number of gap extensions (excluding terminal gaps).
.IP \n+[step].
NM:i:? edit distance to the target (sum of XM and XG).
.IP \n+[step].
MD:Z:? string for mismatching positions.
.IP \n+[step].
YT:Z:UU string representing the alignment type.
.RE
.RE
.TAG search_exact
.TP
.BI \-\-search_exact \0filename
Search for exact full-length matches to the query sequences contained
in \fIfilename\fR in the database of target sequences (\-\-db). Only
100% exact matches are reported and this command is much faster than
\-\-usearch_global. The \-\-id, \-\-maxaccepts and \-\-maxrejects
options are ignored, but the rest of the searching options may be
specified.
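.RS
.PP
A minimal usage sketch (all file names are hypothetical) that reports
exact full-length matches in the uclust-like format:
.PP
.RS
\fBvsearch\fR \-\-search_exact \fIqueries.fasta\fR \-\-db
\fIreferences.fasta\fR \-\-uc \fIexact.uc\fR
.RE
.RE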
.TAG self
.TP
.B \-\-self
Reject the sequence match if the query and target labels are
identical.
.TAG selfid
.TP
.B \-\-selfid
Reject the sequence match if the query and target sequences are
strictly identical.
.TAG sizeout
.TP
.B \-\-sizeout
Add abundance annotations to the output of the option \-\-dbmatched
(using the pattern ';size=\fIinteger\fR;'), to report the number of
queries that matched each target.
.TAG strand
.TP
.BI \-\-strand\~ "plus|both"
When searching for similar sequences, check the \fIplus\fR strand only
(default) or check \fIboth\fR strands.
.TAG target_cov
.TP
.BI \-\-target_cov \0real
Reject the sequence match if the fraction of the target sequence
aligned to the query sequence is lower than \fIreal\fR. The target
coverage is computed as (matches + mismatches) / target sequence
length.  Internal or terminal gaps are not taken into account.
.TAG top_hits_only
.TP
.B \-\-top_hits_only
Only the top hits with an equally high percentage of identity between
the query and database sequence sets are written to the output
specified with the options \-\-lcaout, \-\-alnout, \-\-samout,
\-\-userout, \-\-blast6out, \-\-uc, \-\-fastapairs, \-\-matched or
\-\-notmatched (but not \-\-dbmatched and \-\-dbnotmatched). For each
query, the top hit is the one presenting the highest percentage of
identity (see the \-\-iddef option to change the way identity is
measured). For a given query, if several top hits present exactly the
same percentage of identity, the number of matching targets reported
is controlled by the \-\-maxaccepts value (1 by default), and the
number of hits is controlled by the \-\-maxhits option.
.TAG tsegout
.TP
.BI \-\-tsegout \0filename
Write the aligned part of each target sequence to \fIfilename\fR in
FASTA format.
.TAG uc
.TP
.BI \-\-uc \0filename
Output searching results in \fIfilename\fR using a tab-separated
uclust-like format with 10 columns. When using the \-\-search_exact
command, the table layout is the same as with the
\-\-allpairs_global command. When using the \-\-usearch_global command, the
table presents two different types of entries: hit (H) or no hit
(N). Each query sequence is compared to all other sequences, and the
best hit (\-\-maxaccepts 1) or several hits (\-\-maxaccepts > 1) are
reported (H). Output order may vary when using multiple
threads. Column content varies with the type of entry (H or N):
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
Record type: H, or N ('hit' or 'no hit').
.IP \n+[step].
Ordinal number of the target sequence (based on input order, starting
from zero). Set to '*' for N.
.IP \n+[step].
Sequence length. Set to '*' for N.
.IP \n+[step].
Percentage of similarity with the target sequence. Set to '*' for N.
.IP \n+[step].
Match orientation (+ or -). Set to '.' for N.
.IP \n+[step].
Not used, always set to zero for H, or '*' for N.
.IP \n+[step].
Not used, always set to zero for H, or '*' for N.
.IP \n+[step].
Compact representation of the pairwise alignment using the CIGAR
format (Compact Idiosyncratic Gapped Alignment Report): M
(match/mismatch), D (deletion) and I (insertion). The equal sign '='
indicates that the query is identical to the centroid sequence. Set
to '*' for N.
.IP \n+[step].
Label of the query sequence.
.IP \n+[step].
Label of the target centroid sequence. Set to '*' for N.
.RE
.RE
.TAG uc_allhits
.TP
.B \-\-uc_allhits
When using the \-\-uc option, show all hits, not just the top hit for
each query.
.TAG usearch_global
.TP
.BI \-\-usearch_global \0filename
Compare target sequences (\-\-db) to the query sequences contained in
\fIfilename\fR in FASTA or FASTQ format, using global pairwise
alignment.
.TAG userfields
.TP
.BI \-\-userfields \0string
When using \-\-userout, select and order the fields written to the
output file. Fields are separated by '+' (e.g. query+target+id). See
the 'Userfields' section for a complete list of fields.
.TAG userout
.TP
.BI \-\-userout \0filename
Write user-defined tab-separated output to \fIfilename\fR. Select the
fields with the option \-\-userfields. Output order may vary when
using multiple threads. If \-\-userfields is empty or not present,
\fIfilename\fR is empty.
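.IP
For instance, the following sketch (file names and the identity
threshold are placeholders) writes the query label, the target label
and the percentage of identity of each hit:
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fas\fR \-\-db
\fIdatabase.fas\fR \-\-id 0.97 \-\-userout \fIresults.tsv\fR
\-\-userfields query+target+id
.RE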
.TAG weak_id
.TP
.BI \-\-weak_id \0real
Show hits with percentage of identity of at least \fIreal\fR, without
terminating the search. A normal search stops as soon as enough hits
are found (as defined by \-\-maxaccepts, \-\-maxrejects, and
\-\-id). As \-\-weak_id reports weak hits that are not deducted from
\-\-maxaccepts (but count towards \-\-maxrejects), high \-\-id values
can be used, hence preserving both speed and sensitivity. Logically,
\fIreal\fR must be smaller than the value indicated by \-\-id.
.TAG wordlength
.TP
.BI \-\-wordlength\~ "positive integer"
Length of words (i.e. \fIk\fR-mers) for database indexing. The range
of possible values goes from 3 to 15, but values near 8 or 9 are
generally recommended. Longer words may reduce the sensitivity/recall
for weak similarities, but can increase precision. On the other hand,
shorter words may increase sensitivity or recall, but may reduce
precision. Computation time generally increases with shorter words and
decreases with longer words, but it increases again for very long
words. Memory requirements for a part of the index increase by a
factor of 4 each time word length increases by one nucleotide, and
this generally becomes significant for long words (12 or more). The
default value is 8.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG shuffling-options
Shuffling options:
.RS
Fasta entries in the input file are written in a pseudo-random
order.
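.PP
For instance, to shuffle a file reproducibly and keep only the first
100 sequences (a sketch; file names and the seed value are arbitrary
placeholders):
.PP
.RS
\fBvsearch\fR \-\-shuffle \fIqueries.fas\fR \-\-output
\fIshuffled.fas\fR \-\-randseed 42 \-\-topn 100
.RE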
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG output
.TP 9
.BI \-\-output \0filename
Write the shuffled sequences to \fIfilename\fR, in fasta format.
.TAG randseed
.TP
.BI \-\-randseed\~ "positive integer"
When shuffling sequence order, use \fIinteger\fR as seed. A given seed
always produces the same output order (useful for replicability). Set
to 0 to use a pseudo-random seed (default behaviour).
.TAG relabel
.TP
.BI \-\-relabel \0string
Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3,
etc.) to construct the new headers. Use \-\-sizeout to conserve the
abundance annotations.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Relabel sequences using the MD5 message digest algorithm applied to
each sequence. Former sequence headers are discarded. The sequence is
converted to upper case and U is replaced by T before the digest is
computed. The MD5 digest is a cryptographic hash function designed to
minimize the probability that two different inputs give the same
output, even for very similar, but non-identical inputs. Still, there
is always a very small, but non-zero probability that two different
inputs give the same result. The MD5 digest generates a 128-bit
(16-byte) digest that is represented by 16 hexadecimal numbers (using
32 symbols among 0123456789abcdef). Use \-\-sizeout to conserve the
abundance annotations.
.TAG relabel_self
.TP
.B \-\-relabel_self
Relabel sequences using the sequence itself as the label.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Relabel sequences using the SHA1 message digest algorithm applied to
each sequence. It is similar to the \-\-relabel_md5 option but uses
the SHA1 algorithm instead of the MD5 algorithm. The SHA1 digest
generates a 160-bit (20-byte) result that is represented by 20
hexadecimal numbers (40 symbols). The probability of a collision (two
non-identical sequences having the same digest) is smaller for the
SHA1 algorithm than it is for the MD5 algorithm. Use \-\-sizeout to
conserve the abundance annotations.
.TAG sizeout
.TP
.B \-\-sizeout
When using \-\-relabel, \-\-relabel_self, \-\-relabel_md5 or
\-\-relabel_sha1, preserve and report abundance annotations to the
output fasta file (using the pattern ';size=\fIinteger\fR;').
.TAG shuffle
.TP
.BI \-\-shuffle \0filename
Pseudo-randomly shuffle the order of sequences contained in
\fIfilename\fR.
.TAG topn
.TP
.BI \-\-topn\~ "positive integer"
Output only the first \fIinteger\fR sequences after pseudo-random
reordering.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG sorting-options
Sorting options:
.RS
Fasta entries are sorted by decreasing abundance (\-\-sortbysize) or
sequence length (\-\-sortbylength). To obtain a stable sorting order,
ties are sorted by decreasing abundance (if present) and then by
increasing alphanumerical label order (\-\-sortbylength), or just by
increasing alphanumerical label order (\-\-sortbysize). Label sorting
assumes that all sequences have unique labels. The same applies to the
automatic sorting performed during chimera checking
(\-\-uchime_denovo), dereplication (\-\-derep_fulllength), and
clustering (\-\-cluster_fast and \-\-cluster_size).
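.PP
For instance, to sort by decreasing abundance while discarding
sequences with an abundance of 1 (a sketch; file names are
placeholders):
.PP
.RS
\fBvsearch\fR \-\-sortbysize \fIqueries.fas\fR \-\-output
\fIsorted.fas\fR \-\-minsize 2
.RE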
.PP
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG maxsize
.TP 9
.BI \-\-maxsize\~ "positive integer"
When using \-\-sortbysize, discard sequences with an abundance value
greater than \fIinteger\fR.
.TAG minsize
.TP
.BI \-\-minsize\~ "positive integer"
When using \-\-sortbysize, discard sequences with an abundance value
smaller than \fIinteger\fR.
.TAG output
.TP
.BI \-\-output \0filename
Write the sorted sequences to \fIfilename\fR, in fasta format.
.TAG relabel
.TP
.BI \-\-relabel \0string
Please see the description of the same option under Chimera detection
for details.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Please see the description of the same option under Chimera detection
for details.
.TAG relabel_self
.TP
.B \-\-relabel_self
Please see the description of the same option under Chimera detection
for details.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Please see the description of the same option under Chimera detection
for details.
.TAG sizeout
.TP
.B \-\-sizeout
When using \-\-relabel, report abundance annotations to the output
fasta file (using the pattern ';size=\fIinteger\fR;').
.TAG sortbylength
.TP
.BI \-\-sortbylength \0filename
Sort by decreasing length the sequences contained in
\fIfilename\fR. See the general options \-\-minseqlength and
\-\-maxseqlength to eliminate short and long sequences.
.TAG sortbysize
.TP
.BI \-\-sortbysize \0filename
Sort by decreasing abundance the sequences contained in \fIfilename\fR
(missing abundance values are assumed to be ';size=1'). See the
options \-\-minsize and \-\-maxsize to eliminate rare and dominant
sequences.
.TAG topn
.TP
.BI \-\-topn\~ "positive integer"
Output only the top \fIinteger\fR sequences (i.e. the longest or the
most abundant).
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG subsampling-options
Subsampling options:
.RS
Subsampling randomly extracts a certain number or a certain percentage
of the sequences in the input file. If the \-\-sizein option is in
effect, the abundances of the input sequences are taken into account
and the sampling is performed as if the input sequences were
rereplicated, subsampled and dereplicated before being written to the
output file. The extraction is performed as a random sampling with a
uniform distribution among the input sequences and is performed
without replacement. The input file is specified with the
\-\-fastx_subsample option, the output files are specified with the
\-\-fastaout and \-\-fastqout options and the number of sequences to
be sampled is specified with the \-\-sample_pct or \-\-sample_size
options. The sequences not sampled may be written to files specified
with the options \-\-fastaout_discarded and \-\-fastqout_discarded. The
\-\-fastq_ascii, \-\-fastq_qmin and \-\-fastq_qmax options are also
available.
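.PP
For instance, to draw 1000 reads from a FASTQ file with a fixed seed
and keep the reads that were not sampled in a separate file (a sketch;
file names and values are placeholders):
.PP
.RS
\fBvsearch\fR \-\-fastx_subsample \fIreads.fastq\fR \-\-sample_size
1000 \-\-randseed 1 \-\-fastqout \fIsampled.fastq\fR
\-\-fastqout_discarded \fIremaining.fastq\fR
.RE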
.PP
.TAG fastaout
.TP 9
.BI \-\-fastaout \0filename
Write the sampled sequences to \fIfilename\fR, in fasta format.
.TAG fastaout_discarded
.TP
.BI \-\-fastaout_discarded \0filename
Write the sequences not sampled to \fIfilename\fR, in fasta format.
.TAG fastq_ascii
.TP
.BI \-\-fastq_ascii\~ "positive integer"
Define the ASCII character number used as the basis for the FASTQ
quality score. The default is 33, which is used by the Sanger /
Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64). Only 33
and 64 are valid arguments.
.TAG fastq_qmax
.TP
.BI \-\-fastq_qmax\~ "positive integer"
Specify the maximum quality score accepted when reading FASTQ
files. The default is 41, which is usual for recent Sanger/Illumina
1.8+ files.
.TAG fastq_qmin
.TP
.BI \-\-fastq_qmin\~ "positive integer"
Specify the minimum quality score accepted for FASTQ files. The
default is 0, which is usual for recent Sanger/Illumina 1.8+
files. Older formats may use scores between -5 and 2.
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
Write the sampled sequences to \fIfilename\fR, in fastq
format. Requires input in fastq format.
.TAG fastqout_discarded
.TP
.BI \-\-fastqout_discarded \0filename
Write the sequences not sampled to \fIfilename\fR, in fastq
format. Requires input in fastq format.
.TAG fastx_subsample
.TP
.BI \-\-fastx_subsample \0filename
Perform subsampling from the sequences in the specified input file
that is in FASTA or FASTQ format.
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG randseed
.TP
.BI \-\-randseed\~ "positive integer"
Use \fIinteger\fR as a seed for the pseudo-random generator. A given
seed always produces the same output, which is useful for
replicability. Set to 0 to use a pseudo-random seed (default
behaviour).
.TAG relabel
.TP
.BI \-\-relabel \0string
Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3,
etc.) to construct the new headers. Use \-\-sizeout to conserve the
abundance annotations.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Relabel sequences using the MD5 message digest algorithm applied to
each sequence. Former sequence headers are discarded. The sequence is
converted to upper case and U is replaced by T before the digest is
computed. The MD5 digest is a cryptographic hash function designed to
minimize the probability that two different inputs give the same
output, even for very similar, but non-identical inputs. Still, there
is always a very small, but non-zero probability that two different
inputs give the same result. The MD5 digest generates a 128-bit
(16-byte) digest that is represented by 16 hexadecimal numbers (using
32 symbols among 0123456789abcdef). Use \-\-sizeout to conserve the
abundance annotations.
.TAG relabel_self
.TP
.B \-\-relabel_self
Relabel sequences using the sequence itself as the label.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Relabel sequences using the SHA1 message digest algorithm applied to
each sequence. It is similar to the \-\-relabel_md5 option but uses
the SHA1 algorithm instead of the MD5 algorithm. The SHA1 digest
generates a 160-bit (20-byte) result that is represented by 20
hexadecimal numbers (40 symbols). The probability of a collision (two
non-identical sequences having the same digest) is smaller for the
SHA1 algorithm than it is for the MD5 algorithm. Use \-\-sizeout to
conserve the abundance annotations.
.TAG sample_pct
.TP
.BI \-\-sample_pct\~ "real"
Subsample the given percentage of the input sequences. Accepted values
range from 0.0 to 100.0.
.TAG sample_size
.TP
.BI \-\-sample_size\~ "positive integer"
Extract the given number of sequences.
.TAG sizein
.TP
.B \-\-sizein
Take the abundance information of the input file into account,
otherwise the abundance of each sequence is considered to be 1.
.TAG sizeout
.TP
.B \-\-sizeout
Write abundance information to the output file.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG taxonomic-classification-options
Taxonomic classification options:
.RS
The vsearch command \-\-sintax will classify the input sequences
according to the Sintax algorithm as described by Robert Edgar (2016)
in SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS
sequences, BioRxiv, 074161. Preprint. doi: 10.1101/074161
.URL https://doi.org/10.1101/074161 (link)
.PP
The name of the fasta file containing the input sequences to be
classified is given as an argument to the \-\-sintax command. The
reference sequence database is specified with the \-\-db option. The
results are written in a tab delimited text file whose name is
specified with the \-\-tabbedout option. The \-\-sintax_cutoff option
may be used to set a minimum level of bootstrap support for the
taxonomic ranks to be reported. The \-\-randseed option may be
included to specify a seed for initialisation of the random number
generator used by the algorithm. Please note that when using multiple
threads, the \-\-randseed option may not work as intended, because
sequences may be processed in a random order by different threads. To
ensure the same results each time, use a single thread (\-\-threads 1)
in combination with a fixed random seed specified with \-\-randseed.
.PP
Multithreading is supported. Databases in UDB files are supported.
The strand option may be specified.
.PP
The reference database must contain taxonomic information in the
header of each sequence in the form of a string starting with ";tax="
and followed by a comma-separated list of up to nine taxonomic
identifiers. Each taxonomic identifier must start with an indication
of the rank by one of the letters d (for domain), k (kingdom), p
(phylum), c (class), o (order), f (family), g (genus), s (species), or
t (strain). The letter is followed by a colon (:) and the name of that
rank. Commas and semicolons are not allowed in the name of the rank.
Non-ascii characters should be avoided in the names.
.PP
Example:

>X80725_S000004313;\:tax=d:Bacteria,\:p:Proteobacteria,\:c:Gammaproteobacteria,\:o:Enterobacteriales,\:f:Enterobacteriaceae,\:g:Escherichia/Shigella,\:s:Escherichia_coli,\:t:str._K-12_substr._MG1655

.PP
The option \-\-notrunclabels is turned on by default for this command,
allowing spaces in the taxonomic identifiers.
.PP
If two sequences in the reference database have equally many kmer
matches with the query, the shortest sequence will be chosen by
default. If they are equally long, the sequence appearing first in the
database will be chosen. If the recommended option \-\-sintax_random
is specified, sequences with an equal number of kmer matches will
instead be chosen by a random draw.
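.PP
For instance, to classify sequences reproducibly and list in column 4
only the ranks with a bootstrap support of at least 0.9 (a sketch;
file names and the seed value are placeholders):
.PP
.RS
\fBvsearch\fR \-\-sintax \fIqueries.fas\fR \-\-db
\fIreference.fas\fR \-\-tabbedout \fIresults.tsv\fR \-\-sintax_cutoff
0.9 \-\-randseed 1 \-\-threads 1
.RE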
.PP
.TAG db
.TP 9
.BI \-\-db \0filename
Read the reference sequences from \fIfilename\fR, in FASTA, FASTQ or
UDB format. These sequences need to be annotated with taxonomy.
.TAG randseed
.TP
.BI \-\-randseed\~ "positive integer"
Use \fIinteger\fR as seed for the random number generator used in the
Sintax algorithm. A given seed always produces the same output order
(useful for replicability). Set to 0 to use a pseudo-random seed
(default behaviour). Does not work correctly with multiple threads;
please use \-\-threads 1 to ensure correct behaviour.
.TAG sintax
.TP
.BI \-\-sintax \0filename
Read the input sequences from \fIfilename\fR, in FASTA or FASTQ format.
.TAG sintax_cutoff
.TP
.BI \-\-sintax_cutoff\~ "real"
Specify a minimum level of bootstrap support for the taxonomic ranks
that will be included in column 4 of the output file. For instance
0.9, corresponding to 90%.
.TAG sintax_random
.TP
.B \-\-sintax_random
Break ties between sequences with equally many kmer matches by a
random draw. This option is recommended and may be made the default in
the future.
.TAG tabbedout
.TP
.BI \-\-tabbedout \0filename
Write the results to \fIfilename\fR, in a tab-separated text
format. Column 1 contains the query label. Column 2 contains the
predicted taxonomy in the same format as for the reference data, with
bootstrap support indicated in parentheses after each rank. Column 3
contains the strand. If the \-\-sintax_cutoff option is used, the
predicted taxonomy will be repeated in column 4 while omitting the
bootstrap values and including only the ranks with support at or above
the threshold.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG udb-options
UDB options:
.RS
Databases to be used with the \-\-usearch_global command may be
prepared from FASTA files and stored to a binary UDB formatted file in
order to speed up searching. This may be worthwhile when searching a
large database repeatedly. The sequences are indexed and stored in a
way that can be quickly loaded into memory. The commands and options
below can be used to create and inspect UDB files. A UDB file may be
specified with the \-\-db option instead of a FASTA formatted file
with the \-\-usearch_global command.
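.PP
For instance, to index a database once and then search against the
resulting UDB file (a sketch; file names and the identity threshold
are placeholders):
.PP
.RS
\fBvsearch\fR \-\-makeudb_usearch \fIdatabase.fas\fR \-\-output
\fIdatabase.udb\fR
.PP
\fBvsearch\fR \-\-usearch_global \fIqueries.fas\fR \-\-db
\fIdatabase.udb\fR \-\-id 0.97 \-\-blast6out \fIresults.b6\fR
.RE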
.PP
.TAG dbmask
.TP 9
.BI \-\-dbmask\~ "none|dust|soft"
Specify the sequence masking method used with the \-\-makeudb_usearch
command, either none, dust or soft. No masking is performed when none
is specified. When dust is specified, the DUST algorithm will be used
for masking low complexity regions (short repeats and skewed
composition). Lower case letters in the input file will be masked when
soft is specified (soft masking).
.TAG hardmask
.TP
.B \-\-hardmask
Mask sequences by replacing letters with N for the \-\-makeudb_usearch
command. The default is to use lower case letters (soft masking).
.TAG makeudb_usearch
.TP
.BI \-\-makeudb_usearch \0filename
Create a UDB database file from the FASTA-formatted sequences in the
file with the given \fIfilename\fR. The UDB database is written to the
file specified with the \-\-output option.
.TAG output
.TP
.BI \-\-output \0filename
Specify the \fIfilename\fR of a UDB or FASTA output file for the
\-\-makeudb_usearch or the \-\-udb2fasta command, respectively.
.TAG udb2fasta
.TP
.BI \-\-udb2fasta \0filename
Read the UDB database in the file with the given \fIfilename\fR and
output the sequences in FASTA format in the file specified by the
\-\-output option.
.TAG udbinfo
.TP
.BI \-\-udbinfo \0filename
Show information about the UDB database in the file with the given
\fIfilename\fR.
.TAG udbstats
.TP
.BI \-\-udbstats \0filename
Report statistics about the indexed words in the UDB database in the
file with the given \fIfilename\fR.
.TAG wordlength
.TP
.BI \-\-wordlength\~ "positive integer"
Specify the length of the words to be used when creating the UDB
database index using the \-\-makeudb_usearch command. Valid numbers
range from 3 to 15. The default is 8.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG userfields
Userfields (fields accepted by the \-\-userfields option):
.RS
.TP 9
.B aln
Print a string of M (match/mismatch, i.e. not a gap), D (delete,
i.e. a gap in the query) and I (insert, i.e. a gap in the target)
representing the pairwise alignment. Empty field if there is no
alignment.
.TP
.B alnlen
Print the length of the query-target alignment (number of
columns). The field is set to 0 if there is no alignment.
.TP
.B bits
Bit score (not computed for nucleotide alignments). Always set to 0.
.TP
.B caln
Compact representation of the pairwise alignment using the CIGAR
format (Compact Idiosyncratic Gapped Alignment Report): M
(match/mismatch), D (deletion) and I (insertion). Empty field if there
is no alignment.
.TP
.B evalue
E-value (not computed for nucleotide alignments). Always set to -1.
.TP
.B exts
Number of columns containing a gap extension (zero or positive integer
value).
.TP
.B gaps
Number of columns containing a gap (zero or positive integer value,
excluding terminal gaps).
.TP
.B id
The percentage of identity, according to the identity definition
specified by the \-\-iddef option.  Equal to id0, id1, id2, id3 or id4
below. By default the same as id2.
.TP
.B id0
CD-HIT definition of the percentage of identity (real value ranging
from 0.0 to 100.0) using the length of the shortest sequence in the
pairwise alignment as denominator: 100 * (matching columns) /
(shortest sequence length).
.TP
.B id1
The percentage of identity (real value ranging from 0.0 to 100.0) is
defined as the edit distance: 100 * (matching columns) / (alignment
length).
.TP
.B id2
The percentage of identity (real value ranging from 0.0 to 100.0) is
defined as the edit distance, excluding terminal gaps.
.TP
.B id3
Marine Biological Lab definition of the percentage of identity (real
value ranging from 0.0 to 100.0), counting each gap opening (internal
or terminal) as a single mismatch, whether or not the gap was
extended, and using the length of the longest sequence in the pairwise
alignment as denominator: 100 * (1.0 - [(mismatches + gaps) / (longest
sequence length)]).
.TP
.B id4
BLAST definition of the percentage of identity (real value ranging
from 0.0 to 100.0), equivalent to \-\-iddef 1 in a context of global
pairwise alignment. The field id4 is always equal to the field id1.
.TP
.B ids
Number of matches in the alignment (zero or positive integer value).
.TP
.B mism
Number of mismatches in the alignment (zero or positive integer
value).
.TP
.B opens
Number of columns containing a gap opening (zero or positive integer
value, excluding terminal gaps).
.TP
.B pairs
Number of columns containing only nucleotides. That value corresponds
to the length of the alignment minus the gap-containing columns (zero
or positive integer value).
.TP
.B pctgaps
Number of columns containing gaps expressed as a percentage of the
alignment length (real value ranging from 0.0 to 100.0).
.TP
.B pctpv
Percentage of positive columns. When working with nucleotide
sequences, this is equivalent to the percentage of matches (real value
ranging from 0.0 to 100.0).
.TP
.B pv
Number of positive columns. When working with nucleotide sequences,
this is equivalent to the number of matches (zero or positive integer
value).
.TP
.B qcov
Fraction of the query sequence that is aligned with the target
sequence (real value ranging from 0.0 to 100.0). The query coverage is
computed as 100.0 * (matches + mismatches) / query sequence length.
Internal or terminal gaps are not taken into account. The field is set
to 0.0 if there is no alignment.
.TP
.B qframe
Query frame (-3 to +3). That field only concerns coding sequences and
is not computed by \fBvsearch\fR. Always set to +0.
.TP
.B qhi
Last nucleotide of the query aligned with the target. Always equal to
the length of the pairwise alignment if there is an alignment, 0
otherwise (see \fIqihi\fR to ignore terminal gaps).
.TP
.B qihi
Last nucleotide of the query aligned with the target (ignoring
terminal gaps). Nucleotide numbering starts from 1. The field is set
to 0 if there is no alignment.
.TP
.B qilo
First nucleotide of the query aligned with the target (ignoring
initial gaps). Nucleotide numbering starts from 1. The field is set to
0 if there is no alignment.
.TP
.B ql
Query sequence length (positive integer value). The field is set to 0
if there is no alignment.
.TP
.B qlo
First nucleotide of the query aligned with the target. Always equal to
1 if there is an alignment, 0 otherwise (see \fIqilo\fR to ignore
initial gaps).
.TP
.B qrow
Print the sequence of the query segment as seen in the pairwise
alignment (i.e. with gap insertions if need be). Empty field if there
is no alignment.
.TP
.B qs
Query segment length. Always equal to query sequence length.
.\" The meaning of that field is not clear to us.
.TP
.B qstrand
Query strand orientation (+ or - for nucleotide sequences). Empty
field if there is no alignment.
.TP
.B query
Query label.
.TP
.B raw
Raw alignment score (negative, null or positive integer value). The
score is the sum of match rewards minus mismatch penalties, gap
openings and gap extensions. The field is set to 0 if there is no
alignment.
.TP
.B target
Target label. The field is set to '*' if there is no alignment.
.TP
.B tcov
Fraction of the target sequence that is aligned with the query
sequence (real value ranging from 0.0 to 100.0). The target coverage
is computed as 100.0 * (matches + mismatches) / target sequence
length.  Internal or terminal gaps are not taken into account.  The
field is set to 0.0 if there is no alignment.
.TP
.B tframe
Target frame (-3 to +3). That field only concerns coding sequences and
is not computed by \fBvsearch\fR. Always set to +0.
.TP
.B thi
Last nucleotide of the target aligned with the query. Always equal to
the length of the pairwise alignment if there is an alignment, 0
otherwise (see \fItihi\fR to ignore terminal gaps).
.TP
.B tihi
Last nucleotide of the target aligned with the query (ignoring
terminal gaps). Nucleotide numbering starts from 1. The field is set
to 0 if there is no alignment.
.TP
.B tilo
First nucleotide of the target aligned with the query (ignoring
initial gaps). Nucleotide numbering starts from 1. The field is set to
0 if there is no alignment.
.TP
.B tl
Target sequence length (positive integer value). The field is set to 0
if there is no alignment.
.TP
.B tlo
First nucleotide of the target aligned with the query. Always equal to
1 if there is an alignment, 0 otherwise (see \fItilo\fR to ignore
initial gaps).
.TP
.B trow
Print the sequence of the target segment as seen in the pairwise
alignment (i.e. with gap insertions if need be). Empty field if there
is no alignment.
.TP
.B ts
Target segment length. Always equal to target sequence length. The
field is set to 0 if there is no alignment.
.TP
.B tstrand
Target strand orientation (+ or - for nucleotide sequences). Always
set to '+', so reverse strand matches have tstrand '+' and
qstrand '\-'. Empty field if there is no alignment.
.RE
.PP
.\" ============================================================================
.SH DELIBERATE CHANGES
If you are a usearch user, our objective is to make you feel at
home. That's why \fBvsearch\fR was designed to behave like usearch, to
some extent. Like any complex software, usearch is not free from
quirks and inconsistencies. We decided not to reproduce some of them,
and for complete transparency, to document here the deliberate changes
we made.
.PP
During a search with usearch, when using the options \-\-blast6out and
\-\-output_no_hits, for queries with no match the number of fields
reported is 13, where it should be 12. This is corrected in
\fBvsearch\fR.
.PP
The field raw of the \-\-userfields option is not informative in
usearch. This is corrected in \fBvsearch\fR.
.PP
The fields qlo, qhi, tlo, thi now have counterparts (qilo, qihi, tilo,
tihi) reporting alignment coordinates ignoring terminal gaps.
.PP
In usearch, when using the option \-\-output_no_hits, queries that
receive no match are reported in the \-\-blast6out file, but not in the
alignment output file. This is corrected in \fBvsearch\fR.
.PP
\fBvsearch\fR introduces a new \-\-cluster_size command that sorts
sequences by decreasing abundance before clustering.
.PP
\fBvsearch\fR reintroduces \-\-iddef alternative pairwise identity
definitions that were removed from usearch.
.PP
\fBvsearch\fR extends the \-\-topn option to sorting commands.
.PP
\fBvsearch\fR extends the \-\-sizein option to dereplication
(\-\-derep_fulllength) and clustering (\-\-cluster_fast).
.PP
\fBvsearch\fR treats T and U as identical nucleotides during
dereplication.
.PP
\fBvsearch\fR sorting is stabilized by using sequence abundances or
sequence labels as secondary or tertiary keys.
.PP
\fBvsearch\fR by default uses the DUST algorithm for masking
low-complexity regions. Masking behaviour is also slightly changed to
be more consistent.
.PP
.\" ============================================================================
.SH NOVELTIES
\fBvsearch\fR introduces new commands and new options not present in
usearch 7. They are described in the 'Options' section of this
manual. Here is a short list:
.RS
.IP - 2
uchime2_denovo, uchime3_denovo, alignwidth, borderline, fasta_score
(chimera checking)
.IP -
cluster_size, cluster_unoise, clusterout_id, clusterout_sort, profile
(clustering)
.IP -
fasta_width, gzip_decompress, bzip2_decompress (general options)
.IP -
iddef (clustering, pairwise alignment, searching)
.IP -
maxuniquesize (dereplication)
.IP -
relabel_md5, relabel_self and relabel_sha1 (chimera detection,
dereplication, FASTQ processing, shuffling, sorting)
.IP -
shuffle (shuffling)
.IP -
fastq_eestats, fastq_eestats2, fastq_maxlen, fastq_truncee (FASTQ
processing)
.IP -
fastaout_discarded, fastqout_discarded (subsampling)
.IP -
rereplicate (dereplication/rereplication)
.RE
.PP
.\" ============================================================================
.SH EXAMPLES
.PP
Align all sequences in a database with each other and output all
pairwise alignments:
.PP
.RS
\fBvsearch\fR \-\-allpairs_global \fIdatabase.fas\fR \-\-alnout
\fIresults.aln\fR \-\-acceptall
.RE
.PP
Check for the presence of chimeras (\fIde novo\fR); parents should be
at least 1.5 times more abundant than chimeras. Output non-chimeric
sequences in fasta format (no wrapping):
.PP
.RS
\fBvsearch\fR \-\-uchime_denovo \fIqueries.fas\fR \-\-abskew 1.5
\-\-nonchimeras \fIresults.fas\fR \-\-fasta_width 0
.RE
.PP
Cluster with a 97% similarity threshold, collect cluster centroids,
and write cluster descriptions using a uclust-like format:
.PP
.RS
\fBvsearch\fR \-\-cluster_fast \fIqueries.fas\fR \-\-id 0.97
\-\-centroids \fIcentroids.fas\fR \-\-uc \fIclusters.uc\fR
.RE
.PP
Dereplicate the sequences contained in \fIqueries.fas\fR, take into
account the abundance information already present, write unwrapped
fasta sequences to \fIqueries_unique.fas\fR with the new abundance
information, discard all sequences with an abundance of 1:
.PP
.RS
\fBvsearch\fR \-\-derep_fulllength \fIqueries.fas\fR \-\-sizein
\-\-fasta_width 0 \-\-sizeout \-\-output \fIqueries_unique.fas\fR
\-\-minuniquesize 2
.RE
.PP
Mask simple repeats and low complexity regions in the input fasta file
with the DUST algorithm (masked regions are lowercased), and write the
results to the output file:
.PP
.RS
\fBvsearch\fR \-\-maskfasta \fIqueries.fas\fR \-\-qmask dust
\-\-output \fIqueries_masked.fas\fR
.RE
.PP
Search queries in a reference database, with an 80%-similarity
threshold, take terminal gaps into account when calculating pairwise
similarities, output pairwise alignments:
.PP
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fas\fR \-\-db
\fIreferences.fas\fR \-\-id 0.8 \-\-iddef 1 \-\-alnout
\fIresults.aln\fR
.RE
.PP
Search a sequence dataset against itself (ignore self hits), get all
matches with at least 60% similarity, and collect results in a
blast-like tab-separated format. Accept an unlimited number of hits
(\-\-maxaccepts 0), and compare each query to all other sequences,
including unlikely candidates (\-\-maxrejects 0):
.PP
.RS
\fBvsearch\fR \-\-usearch_global \fIqueries.fas\fR \-\-db
\fIqueries.fas\fR \-\-self \-\-id 0.6 \-\-blast6out
\fIresults.blast6\fR \-\-maxaccepts 0 \-\-maxrejects 0
.RE
.PP
Shuffle the input fasta file (change the order of sequences) in a
repeatable fashion (fixed seed), and write unwrapped fasta sequences
to the output file:
.PP
.RS
\fBvsearch\fR \-\-shuffle \fIqueries.fas\fR \-\-output
\fIqueries_shuffled.fas\fR \-\-randseed 13 \-\-fasta_width 0
.RE
.PP
Sort by decreasing abundance the sequences contained in
\fIqueries.fas\fR (using the 'size=\fIinteger\fR' information),
relabel the sequences while preserving the abundance information (with
\-\-sizeout), keep only sequences with an abundance equal to or
greater than 2:
.PP
.RS
\fBvsearch\fR \-\-sortbysize \fIqueries.fas\fR \-\-output
\fIqueries_sorted.fas\fR \-\-relabel sampleA_ \-\-sizeout \-\-minsize
2
.RE
.PP
.\"
.\" ============================================================================
.SH AUTHORS
Implementation and documentation by Torbjørn Rognes, Frédéric Mahé and Tomás Flouri.
.PP
.\" ============================================================================
.SH CITATION
.PP
Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016)
VSEARCH: a versatile open source tool for metagenomics.
\fIPeerJ\fR 4:e2584 doi: 10.7717/peerj.2584
.URL https://doi.org/10.7717/peerj.2584 (link)
.PP
.\" ============================================================================
.SH REPORTING BUGS
Submit suggestions and bug reports at
.URL https://github.com/torognes/vsearch/issues (link)
<https://github.com/torognes/vsearch/issues>, send a pull request on
.URL https://github.com/torognes/vsearch (link)
<https://github.com/torognes/vsearch>, or compose a friendly or
curmudgeonly e-mail to Torbjørn Rognes
.MTO torognes@ifi.uio.no (link)
<torognes@ifi.uio.no>.
.PP
.\" ============================================================================
.SH AVAILABILITY
Source code and binaries are available at
<https://github.com/torognes/vsearch>.
.PP
.\" ============================================================================
.SH COPYRIGHT
Copyright (C) 2014-2024, Torbjørn Rognes, Frédéric Mahé and Tomás
Flouri
.PP
All rights reserved.
.PP
Contact: Torbjørn Rognes <torognes@ifi.uio.no>,
Department of Informatics, University of Oslo,
PO Box 1080 Blindern, NO-0316 Oslo, Norway
.PP
This software is dual-licensed and available under a choice
of one of two licenses, either under the terms of the GNU
General Public License version 3 or the BSD 2-Clause License.
.PP
\fBGNU General Public License version 3\fR
.PP
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
.PP
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
.PP
You should have received a copy of the GNU General Public License
along with this program.  If not, see
.URL https://www.gnu.org/licenses/ (link)
<https://www.gnu.org/licenses/>.
.PP
.PP
\fBThe BSD 2-Clause License\fR
.PP
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
.PP
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
.PP
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
.PP
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
.PP
We would like to thank the authors of the following projects for
making their source code available:
.RS
.IP - 2
\fBvsearch\fR includes code from Google's CityHash project by Geoff
Pike and Jyrki Alakuijala, providing some excellent hash functions
available under an MIT license.
.IP -
\fBvsearch\fR includes code derived from Tatusov and Lipman's DUST
program that is in the public domain.
.IP -
\fBvsearch\fR includes public domain code written by Alexander Peslyak
for the MD5 message digest algorithm.
.IP -
\fBvsearch\fR includes public domain code written by Steve Reid and
others for the SHA1 message digest algorithm.
.IP -
\fBvsearch\fR binaries may include code from the zlib library,
copyright Jean-Loup Gailly and Mark Adler.
.IP -
\fBvsearch\fR binaries may include code from the bzip2 library,
copyright Julian R. Seward.
.RE
.PP
.\" ============================================================================
.SH SEE ALSO
\fBswipe\fR, an extremely fast pairwise local (Smith-Waterman)
database search tool by Torbjørn Rognes, available at
.URL https://github.com/torognes/swipe "(link)"
<https://github.com/torognes/swipe>.
.PP
\fBswarm\fR, a fast and accurate amplicon clustering method by
Frédéric Mahé and Torbjørn Rognes, available at
.URL https://github.com/torognes/swarm "(link)"
<https://github.com/torognes/swarm>.
.PP
.\" ============================================================================
.SH VERSION HISTORY
New features and important modifications of \fBvsearch\fR (short-lived
or minor bug releases may not be mentioned):
.TP
.BR v1.0.0\~ "released November 28th, 2014"
First public release.
.TP
.BR v1.0.1\~ "released December 1st, 2014"
Bug fixes (sortbysize, semicolon after size annotation in headers) and
minor changes (labels as secondary sort key for most sorts, treat T
and U as identical for dereplication, only output size in
\-\-dbmatched file if \-\-sizeout specified).
.TP
.BR v1.0.2\~ "released December 6th, 2014"
Bug fixes (ssse3/sse4.1 requirement, memory leak).
.TP
.BR v1.0.3\~ "released December 6th, 2014"
Bug fix (now writes help to stdout instead of stderr).
.TP
.BR v1.0.4\~ "released December 8th, 2014"
Added \-\-allpairs_global option. Reduce memory requirements slightly
and eliminate memory leaks.
.TP
.BR v1.0.5\~ "released December 9th, 2014"
Fixes a minor bug with \-\-allpairs_global and \-\-acceptall options.
.TP
.BR v1.0.6\~ "released December 14th, 2014"
Fixes a memory allocation bug in chimera detection (\-\-uchime_ref
option).
.TP
.BR v1.0.7\~ "released December 19th, 2014"
Fixes a bug in the output from chimera detection with the
\-\-uchimeout option.
.TP
.BR v1.0.8\~ "released January 22nd, 2015"
Introduces several changes and bug fixes:
.RS
.IP - 2
a new linear memory aligner for alignment of sequences longer than
5,000 nucleotides,
.IP -
a new \-\-cluster_size command that sorts sequences by decreasing
abundance before clustering,
.IP -
meaning of userfields qlo, qhi, tlo, thi changed for compatibility
with usearch,
.IP -
new userfields qilo, qihi, tilo, tihi give alignment coordinates
ignoring terminal gaps,
.IP -
in \-\-uc output files, a perfect alignment is indicated with a '='
sign,
.IP -
the option \-\-cluster_fast now sorts sequences by decreasing length,
then by decreasing abundance and finally by sequence identifier,
.IP -
default \-\-maxseqlength value set to 50,000 nucleotides,
.IP -
fix for bug in alignment in rare cases,
.IP -
fix for lack of detection of under- or overflow in SIMD aligner.
.RE
.TP
.BR v1.0.9\~ "released January 22nd, 2015"
Fixes a bug in the function sorting sequences by decreasing abundance
(\-\-sortbysize).
.TP
.BR v1.0.10\~ "released January 23rd, 2015"
Fixes a bug where the \-\-sizein option was ignored and always treated
as on, affecting clustering and dereplication commands.
.TP
.BR v1.0.11\~ "released February 5th, 2015"
Introduces the possibility to output results in SAM format (for
clustering, pairwise alignment and searching).
.TP
.BR v1.0.12\~ "released February 6th, 2015"
Temporarily fixes a problem with long headers in FASTA files.
.TP
.BR v1.0.13\~ "released February 17th, 2015"
Fix a memory allocation problem when computing multiple sequence
alignments with the \-\-msaout and \-\-consout options, as well as a
memory leak.  Also increased line buffer for reading FASTA files to
4MB.
.TP
.BR v1.0.14\~ "released February 17th, 2015"
Fix a bug where the multiple alignment and consensus sequence computed
after clustering ignored the strand of the sequences. Also decreased
size of line buffer for reading FASTA files to 1MB again due to
excessive stack memory usage.
.TP
.BR v1.0.15\~ "released February 18th, 2015"
Fix bug in calculation of identity metric between sequences when using
the MBL definition (\-\-iddef 3).
.TP
.BR v1.0.16\~ "released February 19th, 2015"
Integrated patches from Debian for increased compatibility with
various architectures.
.TP
.BR v1.1.0\~ "released February 20th, 2015"
Added the \-\-quiet option to suppress all output to stdout and stderr
except for warnings and fatal errors. Added the \-\-log option to
write messages to a log file.
.TP
.BR v1.1.1\~ "released February 20th, 2015"
Added info about \-\-log and \-\-quiet options to help text.
.TP
.BR v1.1.2\~ "released March 18th, 2015"
Fix bug with large datasets. Fix format of help info.
.TP
.BR v1.1.3\~ "released March 18th, 2015"
Fix more bugs with large datasets.
.TP
.BR v1.2.0-1.2.19\~ "released July 6th to September 8th, 2015"
Several new commands and options added. Bugs fixed. Documentation
updated.
.TP
.BR v1.3.0\~ "released September 9th, 2015"
Changed to autotools build system.
.TP
.BR v1.3.1\~ "released September 14th, 2015"
Several new commands and options. Bug fixes.
.TP
.BR v1.3.2\~ "released September 15th, 2015"
Fixed memory leaks. Added '-h' shortcut for help. Removed extra 'v' in
version number.
.TP
.BR v1.3.3\~ "released September 15th, 2015"
Fixed bug in hexadecimal digits of MD5 and SHA1 digests. Added
\-\-samheader option.
.TP
.BR v1.3.4\~ "released September 16th, 2015"
Fixed compilation problems with zlib and bzip2lib.
.TP
.BR v1.3.5\~ "released September 17th, 2015"
Minor configuration/makefile changes to compile to native CPU and
simplify makefile.
.TP
.BR v1.4.0\~ "released September 25th, 2015"
Added \-\-sizeorder option.
.TP
.BR v1.4.1\~ "released September 29th, 2015"
Inserted public domain MD5 and SHA1 code to eliminate dependency on
crypto and openssl libraries and their licensing issues.
.TP
.BR v1.4.2\~ "released October 2nd, 2015"
Dynamic loading of libraries for reading gzip and bzip2 compressed
files if available. Circumvention of missing gzoffset function in
zlib 1.2.3 and earlier.
.TP
.BR v1.4.3\~ "released October 3rd, 2015"
Fix a bug with determining amount of memory on some versions of Apple
OS X.
.TP
.BR v1.4.4\~ "released October 3rd, 2015"
Remove debug message.
.TP
.BR v1.4.5\~ "released October 6th, 2015"
Fix memory allocation bug when reading long FASTA sequences.
.TP
.BR v1.4.6\~ "released October 6th, 2015"
Fix subtle bug in SIMD alignment code that reduced accuracy.
.TP
.BR v1.4.7\~ "released October 7th, 2015"
Fixes a problem with searching for or clustering sequences with
repeats. In this new version, vsearch looks at all words occurring at
least once in the sequences in the initial step. Previously only words
occurring exactly once were considered. In addition, vsearch now
requires at least 10 words to be shared by the sequences, previously
only 6 were required. If the query contains fewer than 10 words, all
words must be present for a match. This change seems to lead to
slightly reduced recall, but somewhat increased precision, ending up
with slightly improved overall accuracy.
.TP
.BR v1.5.0\~ "released October 7th, 2015"
This version introduces the new option \-\-minwordmatches that allows
the user to specify the minimum number of matching unique words before
a sequence is considered further. New default values for different
word lengths are also set. The minimum word length is increased to 7.
.TP
.BR v1.6.0\~ "released October 9th, 2015"
This version adds the relabeling options (\-\-relabel, \-\-relabel_md5
and \-\-relabel_sha1) to the shuffle command. It also adds the
\-\-xsize option to the clustering, dereplication, shuffling and
sorting commands.
.TP
.BR v1.6.1\~ "released October 14th, 2015"
Fix bugs and update manual and help text regarding relabelling. Add
all relabelling options to the subsampling command. Add the \-\-xsize
option to chimera detection, dereplication and fastq filtering
commands. Refactoring of code.
.TP
.BR v1.7.0\~ "released October 14th, 2015"
Add \-\-relabel_keep option.
.TP
.BR v1.8.0\~ "released October 19th, 2015"
Added \-\-search_exact, \-\-fastx_mask and \-\-fastq_convert commands.
Changed most commands to read FASTQ input files as well as FASTA
files.  Modified \-\-fastx_revcomp and \-\-fastx_subsample to write
FASTQ files.
.TP
.BR v1.8.1\~ "released November 2nd, 2015"
Fixes for compatibility with QIIME and older OS X versions.
.TP
.BR v1.9.0\~ "released November 12th, 2015"
Added the \-\-fastq_mergepairs command and associated options. This
command has not been tested well yet. Included additional files to
avoid dependency on autoconf for compilation. Fixed an error where
identifiers in fasta headers were not truncated at tabs, just spaces.
Fixed a bug in detection of the file format (FASTA/FASTQ) of a gzip
compressed input file.
.TP
.BR v1.9.1\~ "released November 13th, 2015"
Fixed memory leak and a bug in score computation in
\-\-fastq_mergepairs, and improved speed.
.TP
.BR v1.9.2\~ "released November 17th, 2015"
Fixed a bug in the computation of some values with \-\-fastq_stats.
.TP
.BR v1.9.3\~ "released November 19th, 2015"
Workaround for missing x86intrin.h with old compilers.
.TP
.BR v1.9.4\~ "released December 3rd, 2015"
Fixed incrementation of counter when relabeling dereplicated
sequences.
.TP
.BR v1.9.5\~ "released December 3rd, 2015"
Fixed bug resulting in inferior chimera detection performance.
.TP
.BR v1.9.6\~ "released January 8th, 2016"
Fixed bug in aligned sequences produced with \-\-fastapairs and
\-\-userout (qrow, trow) options.
.TP
.BR v1.9.7\~ "released January 12th, 2016"
Masking behaviour is changed somewhat to keep the letter case of the
input sequences unchanged when no masking is performed. Masking is now
performed also during chimera detection. Documentation updated.
.TP
.BR v1.9.8\~ "released January 22nd, 2016"
Fixed bug causing segfault when chimera detection is performed on
extremely short sequences.
.TP
.BR v1.9.9\~ "released January 22nd, 2016"
Adjusted default minimum number of word matches during searches for
improved performance.
.TP
.BR v1.9.10\~ "released January 25th, 2016"
Fixed bug related to masking and lower case database sequences.
.TP
.BR v1.10.0\~ "released February 11th, 2016"
Parallelized and improved merging of paired-end reads and adjusted
some defaults. Removed progress indicator when stderr is not a
terminal. Added \-\-fasta_score option to report chimera scores in
FASTA files. Added \-\-rereplicate and \-\-fastq_eestats
commands. Fixed typos. Added relabelling to files produced with
\-\-consout and \-\-profile options.
.TP
.BR v1.10.1\~ "released February 23rd, 2016"
Fixed a bug affecting the \-\-fastq_mergepairs command causing FASTQ
headers to be truncated at first space (despite the bug fix release
1.9.0 of November 12th, 2015). Full headers are now included in the
output (no matter if \-\-notrunclabels is in effect or not).
.TP
.BR v1.10.2\~ "released March 18th, 2016"
Fixed a bug causing a segmentation fault when running
\-\-usearch_global with an empty query sequence. Also fixed a bug
causing imperfect alignments to be reported with an alignment string
of '=' in uc output files. Fixed typos in man file. Fixed fasta/fastq
processing code regarding presence or absence of compression library
header files.
.TP
.BR v1.11.1\~ "released April 13th, 2016"
Added strand information in UC file for \-\-derep_fulllength and
\-\-derep_prefix. Added expected errors (ee) to header of FASTA files
specified with \-\-fastaout and \-\-fastaout_discarded when \-\-eeout
or \-\-fastq_eeout option is in effect for fastq_filter and
fastq_mergepairs. The options \-\-eeout and \-\-fastq_eeout are now
equivalent.
.TP
.BR v1.11.2\~ "released June 21st, 2016"
Two bugs were fixed. The first issue was related to the \-\-query_cov
option that used a different coverage definition than the qcov
userfield. The coverage is now defined as the fraction of the whole
query sequence length that is aligned with matching or mismatching
residues in the target. All gaps are ignored. The other issue was
related to the consensus sequences produced during clustering when
only N's were present in some positions. Previously these would be
converted to A's in the consensus. The behaviour is changed so that
N's are produced in the consensus, and it should now be more
compatible with usearch.
.TP
.BR v2.0.0\~ "released June 24th, 2016"
This major new version supports reading from pipes. Two new options
are added: \-\-gzip_decompress and \-\-bzip2_decompress. One of these
options must be specified if reading compressed input from a pipe, but
they are not required when reading from ordinary files. The vsearch header
that was previously written to stdout is now written to stderr. This
enables piping of results for further processing. The file name '\-'
now represents standard input (/dev/stdin) or standard output
(/dev/stdout) when reading or writing files, respectively. Code for
reading FASTA and FASTQ files has been refactored.
.TP
.BR v2.0.1\~ "released June 30th, 2016"
Avoid segmentation fault when masking very long sequences.
.TP
.BR v2.0.2\~ "released July 5th, 2016"
Avoid warnings when compiling with GCC 6.
.TP
.BR v2.0.3\~ "released August 2nd, 2016"
Fixed bad compiler options resulting in Illegal instruction errors
when running precompiled binaries.
.TP
.BR v2.0.4\~ "released September 1st, 2016"
Improved error message for bad FASTQ quality values. Improved manual.
.TP
.BR v2.0.5\~ "released September 9th, 2016"
Add options \-\-fastaout_discarded and \-\-fastqout_discarded to
output discarded sequences from subsampling to separate files. Updated
manual.
.TP
.BR v2.1.0\~ "released September 16th, 2016"
New command: \-\-fastx_filter. New options: \-\-fastq_maxlen,
\-\-fastq_truncee. Allow \-\-minwordmatches down to 3.
.TP
.BR v2.1.1\~ "released September 23rd, 2016"
Fixed bugs in output to UC-files. Improved help text and manual.
.TP
.BR v2.1.2\~ "released September 28th, 2016"
Fixed incorrect abundance output from fastx_filter and fastq_filter
when relabelling.
.TP
.BR v2.2.0\~ "released October 7th, 2016"
Added OTU table generation options \-\-biomout, \-\-mothur_shared_out
and \-\-otutabout to the clustering and searching commands.
.TP
.BR v2.3.0\~ "released October 10th, 2016"
Allowed zero-length sequences in FASTA and FASTQ files. Added
\-\-fastq_trunclen_keep option. Fixed bug with output of OTU tables to
pipes.
.TP
.BR v2.3.1\~ "released November 16th, 2016"
Fixed bug where \-\-minwordmatches 0 was interpreted as the default
minimum word matches for the given word length instead of zero. When
used in combination with \-\-maxaccepts 0 and \-\-maxrejects 0 it will
allow complete bypass of kmer-based heuristics.
.TP
.BR v2.3.2\~ "released November 18th, 2016"
Fixed bug where vsearch reported the ordinal number of the target
sequence instead of the cluster number in column 2 on H-lines in the
uc output file after clustering. For search and alignment commands
both usearch and vsearch report the target sequence number here.
.TP
.BR v2.3.3\~ "released December 5th, 2016"
A minor speed improvement.
.TP
.BR v2.3.4\~ "released December 9th, 2016"
Fixed bug in output of sequence profiles and updated documentation.
.TP
.BR v2.4.0\~ "released February 8th, 2017"
Added support for Linux on Power8 systems (ppc64le) and Windows on
x86_64. Improved detection of pipes when reading FASTA and FASTQ
files. Corrected option for specifying output from fastq_eestats
command in help text.
.TP
.BR v2.4.1\~ "released March 1st, 2017"
Fixed an overflow bug in fastq_stats and fastq_eestats affecting
analysis of very large FASTQ files. Fixed maximum memory usage
reporting on Windows.
.TP
.BR v2.4.2\~ "released March 10th, 2017"
Default value for fastq_minovlen increased to 16 in accordance with
help text and for compatibility with usearch. Minor changes for
improved accuracy of paired-end read merging.
.TP
.BR v2.4.3\~ "released April 6th, 2017"
Fixed bug with progress bar for shuffling. Fixed missing N-lines in UC
files with usearch_global, search_exact and allpairs_global when the
output_no_hits option was not specified.
.TP
.BR v2.4.4\~ "released August 28th, 2017"
Fixed a few minor bugs, improved error messages and updated
documentation.
.TP
.BR v2.5.0\~ "released October 5th, 2017"
Support for UDB database files. New commands: fastq_stripright,
fastq_eestats2, makeudb_usearch, udb2fasta, udbinfo, and udbstats. New
general option: no_progress. New options minsize and maxsize to
fastx_filter. Minor bug fixes, error message improvements and
documentation updates.
.TP
.BR v2.5.1\~ "released October 25th, 2017"
Fixed bug with bad default value of 1 instead of 32 for minseqlength
when using the makeudb_usearch command.
.TP
.BR v2.5.2\~ "released October 30th, 2017"
Fixed bug where '-' as an argument to the fastq_eestats2 option
was treated literally instead of equivalent to stdin.
.TP
.BR v2.6.0\~ "released November 10th, 2017"
Rewritten paired-end reads merger with improved accuracy. Decreased
default value for fastq_minovlen option from 16 to 10. The default
value for the fastq_maxdiffs option is increased from 5 to 10. There
are now other more important restrictions that will avoid merging
reads that cannot be reliably aligned.
.TP
.BR v2.6.1\~ "released December 8th, 2017"
Improved parallelisation of paired end reads merging.
.TP
.BR v2.6.2\~ "released December 18th, 2017"
Fixed option xsize that was partially inactive for commands
uchime_denovo, uchime_ref, and fastx_filter.
.TP
.BR v2.7.0\~ "released February 13th, 2018"
Added commands cluster_unoise, uchime2_denovo and uchime3_denovo
contributed by Davide Albanese based on Robert Edgar's
papers. Refactored fasta and fastq print functions as well as code for
extraction of abundance and other attributes from the headers.
.TP
.BR v2.7.1\~ "released February 16th, 2018"
Fix several bugs on Windows related to large files, use of "-" as a
file name to mean stdin or stdout, alignment errors, missed kmers and
corrupted UDB files. Added documentation of UDB-related commands.
.TP
.BR v2.7.2\~ "released April 20th, 2018"
Added the sintax command for taxonomic classification. Fixed a bug with
incorrect FASTA headers of consensus sequences after clustering.
.TP
.BR v2.8.0\~ "released April 24th, 2018"
Added the fastq_maxdiffpct option to the fastq_mergepairs command.
.TP
.BR v2.8.1\~ "released June 22nd, 2018"
Fixes for compilation warnings with GCC 8.
.TP
.BR v2.8.2\~ "released August 21st, 2018"
Fix for wrong placement of semicolons in header lines in some cases
when using the sizeout or xsize options. Reduced memory requirements
for full-length dereplication in cases with many duplicate sequences.
Improved wording of fastq_mergepairs report. Updated manual regarding
use of sizein and sizeout with dereplication. Changed a compiler
option.
.TP
.BR v2.8.3\~ "released August 31st, 2018"
Fix for segmentation fault for \-\-derep_fulllength with \-\-uc.
.TP
.BR v2.8.4\~ "released September 3rd, 2018"
Further reduce memory requirements for dereplication when not using
the uc option. Fix output during subsampling when quiet or log options
are in effect.
.TP
.BR v2.8.5\~ "released September 26th, 2018"
Fixed a bug in fastq_eestats2 that caused the values for large lengths
to be much too high when the input sequences had varying lengths.
.TP
.BR v2.8.6\~ "released October 9th, 2018"
Fixed a bug introduced in version 2.8.2 that caused derep_fulllength
to include the full FASTA header in its output instead of stopping at
the first space (unless the notrunclabels option is in effect).
.TP
.BR v2.9.0\~ "released October 10th, 2018"
Added the fastq_join command.
.TP
.BR v2.9.1\~ "released October 29th, 2018"
Changed compiler options that select the target cpu and tuning to
allow the software to run on any 64-bit x86 system, while tuning for
more modern variants. Avoid illegal instruction error on some
architectures. Update documentation of rereplicate command.
.TP
.BR v2.10.0\~ "released December 6th, 2018"
Added the sff_convert command to convert SFF files to FASTQ. Added
some additional option argument checks. Fixed segmentation fault bug
after some fatal errors when a log file was specified.
.TP
.BR v2.10.1\~ "released December 7th, 2018"
Improved sff_convert command. It will now read several variants of the
SFF format. It is also able to read from a pipe. Warnings are given if
there are minor problems. Error messages have been improved. Minor
speed and memory usage improvements.
.TP
.BR v2.10.2\~ "released December 10th, 2018"
Fixed bug in sintax with reversed order of domain and kingdom.
.TP
.BR v2.10.3\~ "released December 19th, 2018"
Ported to Linux on ARMv8 (aarch64). Fixed compilation warning with gcc
version 8.1.0 and 8.2.0.
.TP
.BR v2.10.4\~ "released January 4th, 2019"
Fixed serious bug in x86_64 SIMD alignment code introduced in version
2.10.3. Added link to BioConda in README. Fixed bug in fastq_stats
with sequence length 1. Fixed use of equals symbol in UC files for
identical sequences with cluster_fast.
.TP
.BR v2.11.0\~ "released February 13th, 2019"
Added ability to trim and filter paired-end reads using the reverse
option with the fastx_filter and fastq_filter commands. Added \-\-xee
option to remove ee attributes from FASTA headers. Minor invisible
improvement to the progress indicator.
.TP
.BR v2.11.1\~ "released February 28th, 2019"
Minor change to the handling of the weak_id and id options when using
cluster_unoise.
.TP
.BR v2.12.0\~ "released March 19th, 2019"
Take sequence abundance into account when computing consensus
sequences or profiles after clustering. Warn when rereplicating
sequences without abundance info. Guess offset 33 in more cases with
fastq_chars. Stricter checking of option arguments and option
combinations.
.TP
.BR v2.13.0\~ "released April 11th, 2019"
Added the \-\-fastx_getseq, \-\-fastx_getseqs and \-\-fastx_getsubseq
commands to extract sequences from a FASTA or FASTQ file based on
their labels. Improved handling of ambiguous nucleotide
symbols. Corrected behaviour of \-\-uchime_ref command with
options \-\-self and \-\-selfid. Strict detection of illegal options
for each command.
.TP
.BR v2.13.1\~ "released April 26th, 2019"
Minor changes to the allowed options for each command. All commands
now allow the log, quiet and threads options. If more than 1 thread is
specified for commands that are not multi-threaded, a warning will be
issued. Minor changes to the manual.
.TP
.BR v2.13.2\~ "released April 30th, 2019"
Fixed bug related to improper handling of newlines on Windows.
Allowed option strand plus to uchime_ref for compatibility.
.TP
.BR v2.13.3\~ "released April 30th, 2019"
Fixed bug in FASTQ parsing introduced in version 2.13.2.
.TP
.BR v2.13.4\~ "released May 10th, 2019"
Added information about support for gzip- and bzip2-compressed input
files to the output of the version command. Adapted source code for
compilation on FreeBSD and NetBSD systems.
.TP
.BR v2.13.5\~ "released July 2nd, 2019"
Added cut command to fragment sequences at restriction sites. Silenced
output from the fastq_stats command if quiet option was given. Updated manual.
.TP
.BR v2.13.6\~ "released July 2nd, 2019"
Added info about cut command to output of help command.
.TP
.BR v2.13.7\~ "released September 2nd, 2019"
Fixed bug in consensus sequence introduced in version 2.13.0.
.TP
.BR v2.14.0\~ "released September 11th, 2019"
Added relabel_self option. Made fasta_width, sizein, sizeout and
relabelling options valid for certain commands.
.TP
.BR v2.14.1\~ "released September 18th, 2019"
Fixed bug with sequences written to file specified with fastaout_rev
for commands fastx_filter and fastq_filter.
.TP
.BR v2.14.2\~ "released January 28th, 2020"
Fixed some issues with the cut, fastx_revcomp, fastq_convert,
fastq_mergepairs, and makeudb_usearch commands. Updated manual.
.TP
.BR v2.15.0\~ "released June 19th, 2020"
Update manual and documentation. Turn on notrunclabels option for
sintax command by default. Change maxhits 0 to mean unlimited hits,
like the default. Allow non-ascii characters in headers, with a
warning. Sort centroids and uc too when clusterout_sort specified. Add
cluster id to centroids output when clusterout_id specified. Improve
error messages when parsing FASTQ files. Add missing fastq_qminout
option and fix label_suffix option for fastq_mergepairs. Add derep_id
command that dereplicates based on both label and sequence. Remove
compilation warnings.
.TP
.BR v2.15.1\~ "released October 28th, 2020"
Fix for dereplication when including reverse complement sequences and
headers. Make some extra checks when loading compression libraries and
add more diagnostic output about them to the output of the version
command. Report an error when fastx_filter is used with FASTA input
and options that require FASTQ input. Update manual.
.TP
.BR v2.15.2\~ "released January 26th, 2021"
No real functional changes, but some code and compilation
changes. Compiles successfully on macOS running on Apple Silicon
(ARMv8).  Binaries available. Code updated for C++11. Minor
adaptations for Windows compatibility, including the use of the C++
standard library for regular expressions. Minor changes for
compatibility with Power8. Switch to C++ header files.
.TP
.BR v2.16.0\~ "released March 22nd, 2021"
This version adds the orient command. It also handles empty input
files properly. Documentation has been updated.
.TP
.BR v2.17.0\~ "released March 29th, 2021"
The fastq_mergepairs command has been changed. It now allows merging
of sequences with overlaps as short as 5 bp if the \-\-fastq_minovlen
option has been adjusted down from the default 10. In addition, far
fewer pairs of reads should now be rejected with the reason 'multiple
potential alignments' as the algorithm for detecting those has been
changed.
.TP
.BR v2.17.1\~ "released June 14th, 2021"
Modernized code. Minor changes to help info.
.TP
.BR v2.18.0\~ "released August 27th, 2021"
Added the fasta2fastq command. Fixed search bug on ppc64le. Fixed bug
with removal of size and ee info in uc files. Fixed compilation errors
in some cases. Made some general code improvements. Updated manual.
.TP
.BR v2.19.0\~ "released December 21st, 2021"
Added the lcaout and lca_cutoff options to enable the output of last
common ancestor (LCA) information about hits when searching. The
randseed option was added as a valid option to the sintax
command. Code improvements.
.TP
.BR v2.20.0\~ "released January 10th, 2022"
Added the fastx_uniques command and the fastq_qout_max option for
dereplication of FASTQ files. Some code cleaning.
.TP
.BR v2.20.1\~ "released January 11th, 2022"
Fixes a bug in fastq_mergepairs that caused an occasional hang at the
end when using multiple threads.
.TP
.BR v2.21.0\~ "released January 12th, 2022"
This version adds the sample, qsegout and tsegout options. It enables
the use of UDB databases with uchime_ref.
.TP
.BR v2.21.1\~ "released January 18th, 2022"
Fix a problem with dereplication of empty input files. Update Altivec
code on ppc64le for improved compiler compatibility (vector->__vector).
.TP
.BR v2.21.2\~ "released September 12th, 2022"
Fix problems with the lcaout option when using maxaccepts above 1 and
either lca_cutoff below 1 or top_hits_only enabled. Update
documentation. Update code to avoid compiler warnings.
.TP
.BR v2.22.0\~ "released September 19th, 2022"
Add the derep_smallmem command for dereplication using little memory.
.TP
.BR v2.22.1\~ "released September 19th, 2022"
Fix compiler warning.
.TP
.BR v2.23.0\~ "released July 7th, 2023"
Update documentation. Add citation file. Modernize and improve
code. Fix several minor bugs. Fix compilation with GCC 13. Print stats
after fastq_mergepairs to log file instead of stderr. Handle sizein
option correctly with dbmatched option for usearch_global. Allow
maxseqlength option for makeudb_usearch. Fix memory allocation problem
with chimera detection. Add lengthout and xlength options. Increase
precision for eeout option. Add warning about sintax algorithm, random
seed and multiple threads. Refactor chimera detection code. Add
undocumented experimental long_chimeras_denovo command. Fix segfault
with clustering. Add more references.
.TP
.BR v2.24.0\~ "released October 26th, 2023"
Update documentation. Improve code. Allow up to 20 parents for the
undocumented and experimental chimeras_denovo command. Fix compilation
warnings for sha1.c. Compile for release (not debug) by default.
.TP
.BR v2.25.0\~ "released November 10th, 2023"
Allow a given percentage of mismatches between chimeras and parents
for the experimental chimeras_denovo command.
.TP
.BR v2.26.0\~ "released November 24th, 2023"
Enable the maxseqlength and minseqlength options for the chimera
detection commands. When the usearch_global or search_exact commands
are used, OTU tables will include samples and OTUs with no matches.
.TP
.BR v2.26.1\~ "released November 25th, 2023"
No real changes, but the previous version was released without proper
updates to the source code.
.TP
.BR v2.27.0\~ "released January 19th, 2024"
The usearch_global and search_exact commands now support FASTQ files
as well as FASTA files as input. This version of vsearch includes
clarifications and updates to the manual. Some code has been
refactored. Generic Dockerfiles for major Linux distributions have
been included. Some warnings from compilers and other tools have been
eliminated. The release for Windows will also include DLLs for the
two compression libraries.
.TP
.BR v2.27.1\~ "released April 6th, 2024"
This version fixes the weak_id option and makes searches report weak
hits in some cases. It also updates the names of the compression
libraries to libz.so.1 and libbz2.so.1 on Linux to make them work on
common Linux distributions without installing additional packages.
README.md has been updated with information about compression
libraries on Windows.
.TP
.BR v2.28.0\~ "released April 26th, 2024"
The sintax command has been improved in several ways in this version
of vsearch. Please note that several details of this algorithm are not
clearly described in the preprint, and the implementation in vsearch
differs from that in usearch. The former vsearch version did not
always choose the most common taxonomic entity over the 100 bootstraps
among the database sequences with the highest amount of word
similarity to the query. Instead, if several sequences had an equal
similarity with the query, the sequence encountered in the earliest
bootstrap was chosen. The confidence level was calculated based on
this sequence compared to the selected sequences from the other 99
bootstraps. This could lead to a suboptimal choice with a low
confidence. In the new version, the most common of the sequences with
the highest amount of word similarity across the 100 bootstraps will
be selected, and ties will be broken randomly. Another problem with
the old implementation was that if several sequences had the same
amount of word similarity, the shortest one in the reference database
would be chosen, and if they were equally long, the earliest in the
database file would be chosen. A new option called sintax_random has
now been introduced. This option will randomly select one of the
sequences with the highest number of shared words with the query,
without considering their length or position. This avoids a bias
towards shorter reference sequences. This option is strongly
recommended and will probably soon be the default. Furthermore, a
ninth taxonomic rank, strain (letter t), is now recognized. The speed
of the sintax command has also been significantly improved at least in
some cases. Run vsearch with the randseed option and 1 thread to
ensure reproducibility of the random choices in the algorithm.
.TP
.BR v2.28.1\~ "released April 26th, 2024"
Fix a segmentation fault that could occur with the blast6out and
output_no_hits options.
.TP
.BR v2.29.0\~ "released September 26th, 2024"
This version fixes seven bugs (see changelog below), adds
initial support for RISC-V architectures, and improves code quality
and code testing (1,210 new tests):
.RS
.IP - 2
add: experimental support for RISCV64 and other 64-bit little-endian
architectures, thanks to Michael R. Crusoe and his fellow Debian
developers (issue #566),
.IP -
add: official support for clang-19 and gcc 14,
.IP -
add: beta support for clang-20,
.IP -
remove: unused \-\-output option for command \-\-fastq_stats (issue #572),
.IP -
fix: bug in \-\-sintax when selecting the best lineage (only low
confidence values below 0.5 were affected) (issue #573),
.IP -
fix: out-of-bounds error in \-\-fastq_stats when processing empty
reads (issue #571),
.IP -
fix: bug in \-\-cut, patterns with multiple cutting sites were not
detected (commit 4c4f9fa70f14b28d50185dbf322cf5727087e86a),
.IP -
fix: memory error (segmentation fault) when using \-\-derep_id and
\-\-strand (issue #565),
.IP -
fix: \-\-fastq_join now obeys the \-\-quiet and \-\-log options
(commit 87f968b09f17c17ebf8db00aebe86e89b13a3948),
.IP -
fix: \-\-fastq_join quality padding is now also set to Q40 when
quality offset is 64 (commit be0bf9b48d782286c4ce38f0bf1a4c82bd230250),
.IP -
fix: (partial) \-\-fastq_join's handling of abundance annotations
(commit f2bbcb421dc2f4dfa6603b9f31ec3e4598c1b591),
.IP -
improve: additional safeguards to validate input values and to make
sure that they are within acceptable limits. Changes concern options
\-\-abskew (commit a530dd8990f8a05cb25fc0b6a5da5a14d28fbedd) and
\-\-fastq_maxdiffs (commit 4b254db7f120bfd49e86185ef3cd9070c236f940),
.IP -
improve: code quality (1.3k+ commits, 6k+ clang-tidy warnings eliminated),
.IP -
improve: documentation and help messages (issue #568),
.IP -
improve: complete refactoring and modernization of a subset of
commands (\-\-sortbylength, \-\-sortbysize, \-\-shuffle,
\-\-rereplicate, \-\-cut, \-\-fastq_join, \-\-fasta2fastq,
\-\-fastq_chars),
.IP -
improve: code-coverage of our test-suite for the above-mentioned
commands (1,210 new tests, 4,753 in total).
.RE
.LP
.TP
.BR v2.29.1\~ "released October 24th, 2024"
Fix a segmentation fault that could occur during alignment in version
2.29.0, for example with \-\-uchime_ref. Some improvements to code and
documentation.
.TP
.BR v2.29.2\~ "released December 20th, 2024"
Fix a segmentation fault during clustering when the set of clusters is empty.
Initial documentation in markdown format available on GitHub Pages.
.\" ============================================================================
.\" TODO:
.\"
.\" NOTES
.\" visualize and output to pdf
.\" man -l vsearch.1
.\" man -t ./doc/vsearch.1 | ps2pdf - > ./doc/vsearch_manual.pdf