.\" import www macros (URL, TAG, MTO)
.mso www.tmac
.\" ============================================================================
.TH vsearch 1 "December 20, 2024" "version 2.29.2" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em a versatile open-source tool for microbiome analysis,
including chimera detection, clustering, dereplication and
rereplication, extraction, FASTA/FASTQ/SFF file processing, masking,
orienting, pairwise alignment, restriction site cutting, searching,
shuffling, sorting, subsampling, and taxonomic classification of
amplicon sequences for metagenomics, genomics, and population
genetics.
.\" ============================================================================
.SH SYNOPSIS
.\" left justified, ragged right
.ad l
Chimera detection:
.RS
\fBvsearch\fR (\-\-uchime_denovo | \-\-uchime2_denovo |
\-\-uchime3_denovo) \fIfastafile\fR (\-\-chimeras | \-\-nonchimeras |
\-\-uchimealns | \-\-uchimeout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-uchime_ref \fIfastafile\fR (\-\-chimeras |
\-\-nonchimeras | \-\-uchimealns | \-\-uchimeout) \fIoutputfile\fR
\-\-db \fIfastafile\fR [\fIoptions\fR]
.PP
.RE
Clustering:
.RS
\fBvsearch\fR (\-\-cluster_fast | \-\-cluster_size |
\-\-cluster_smallmem | \-\-cluster_unoise) \fIfastafile\fR (\-\-alnout
| \-\-biomout | \-\-blast6out | \-\-centroids | \-\-clusters |
\-\-mothur_shared_out | \-\-msaout | \-\-otutabout | \-\-profile |
\-\-samout | \-\-uc | \-\-userout) \fIoutputfile\fR \-\-id \fIreal\fR
[\fIoptions\fR]
.PP
.RE
Dereplication and rereplication:
.RS
\fBvsearch\fR \-\-fastx_uniques (\fIfastafile\fR | \fIfastqfile\fR)
(\-\-fastaout | \-\-fastqout | \-\-tabbedout | \-\-uc) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-derep_fulllength | \-\-derep_id | \-\-derep_prefix)
\fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-derep_smallmem (\fIfastafile\fR | \fIfastqfile\fR)
\-\-fastaout \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Extraction of sequences:
.RS
\fBvsearch\fR \-\-fastx_getseq \fIfastafile\fR (\-\-fastaout |
\-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR
\-\-label \fIlabel\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_getseqs \fIfastafile\fR (\-\-fastaout |
\-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR
(\-\-label \fIlabel\fR | \-\-labels \fIlabelfile\fR | \-\-label_word
\fIlabel\fR | \-\-label_words \fIlabelfile\fR) [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_getsubseq \fIfastafile\fR (\-\-fastaout |
\-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR
\-\-label \fIlabel\fR [\-\-subseq_start \fIposition\fR]
[\-\-subseq_end \fIposition\fR] [\fIoptions\fR]
.PP
.RE
FASTA/FASTQ/SFF file processing:
.RS
\fBvsearch\fR \-\-fasta2fastq \fIfastafile\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_chars \fIfastqfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_convert \fIfastqfile\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-fastq_eestats | \-\-fastq_eestats2) \fIfastqfile\fR
\-\-output \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_filter \fIfastqfile\fR [\-\-reverse
\fIfastqfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout
| \-\-fastqout_discarded | \-\-fastaout_rev | \-\-fastaout_discarded_rev
| \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_join \fIfastqfile\fR \-\-reverse
\fIfastqfile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_mergepairs \fIfastqfile\fR \-\-reverse
\fIfastqfile\fR (\-\-fastaout | \-\-fastqout |
\-\-fastaout_notmerged_fwd | \-\-fastaout_notmerged_rev |
\-\-fastqout_notmerged_fwd | \-\-fastqout_notmerged_rev |
\-\-eetabbedout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastq_stats \fIfastqfile\fR
[\-\-log \fIlogfile\fR] [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_filter \fIinputfile\fR [\-\-reverse
\fIinputfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout
| \-\-fastqout_discarded | \-\-fastaout_rev | \-\-fastaout_discarded_rev
| \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_revcomp \fIinputfile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Masking:
.RS
\fBvsearch\fR \-\-fastx_mask \fIfastxfile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-maskfasta \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Orienting:
.RS
\fBvsearch\fR \-\-orient \fIfastxfile\fR \-\-db \fIfastxfile\fR
(\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-tabbedout)
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Pairwise alignment:
.RS
\fBvsearch\fR \-\-allpairs_global \fIfastafile\fR (\-\-alnout |
\-\-blast6out | \-\-matched | \-\-notmatched | \-\-samout | \-\-uc |
\-\-userout) \fIoutputfile\fR (\-\-acceptall | \-\-id \fIreal\fR)
[\fIoptions\fR]
.PP
.RE
Restriction site cutting:
.RS
\fBvsearch\fR \-\-cut \fIfastafile\fR \-\-cut_pattern \fIpattern\fR
(\-\-fastaout | \-\-fastaout_rev | \-\-fastaout_discarded |
\-\-fastaout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Searching:
.RS
\fBvsearch\fR \-\-search_exact \fIfastafile\fR \-\-db \fIfastafile\fR
(\-\-alnout | \-\-biomout | \-\-blast6out | \-\-mothur_shared_out |
\-\-otutabout | \-\-samout | \-\-uc | \-\-userout | \-\-lcaout)
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-usearch_global \fIfastafile\fR \-\-db
\fIfastafile\fR (\-\-alnout | \-\-biomout | \-\-blast6out |
\-\-mothur_shared_out | \-\-otutabout | \-\-samout | \-\-uc |
\-\-userout | \-\-lcaout) \fIoutputfile\fR \-\-id \fIreal\fR
[\fIoptions\fR]
.PP
.RE
Shuffling and sorting:
.RS
\fBvsearch\fR (\-\-shuffle | \-\-sortbylength | \-\-sortbysize)
\fIfastafile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Subsampling:
.RS
\fBvsearch\fR \-\-fastx_subsample \fIfastafile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR (\-\-sample_pct \fIreal\fR |
\-\-sample_size \fIpositive integer\fR) [\fIoptions\fR]
.PP
.RE
Taxonomic classification:
.RS
\fBvsearch\fR \-\-sintax \fIfastafile\fR \-\-db \fIfastafile\fR
\-\-tabbedout \fIoutputfile\fR [\-\-sintax_cutoff \fIreal\fR]
[\fIoptions\fR]
.PP
.RE
UDB database handling:
.RS
\fBvsearch\fR \-\-makeudb_usearch \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-udb2fasta \fIudbfile\fR \-\-output \fIoutputfile\fR
[\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-udbinfo | \-\-udbstats) \fIudbfile\fR
[\fIoptions\fR]
.PP
.RE
.\" left and right justified (default)
.ad b
.\" ============================================================================
.SH DESCRIPTION
Environmental or clinical molecular diversity studies generate large
volumes of amplicons (e.g. SSU-rRNA sequences) that need to be
checked for chimeras, dereplicated, masked, sorted, searched,
clustered or compared to reference sequences. The aim of \fBvsearch\fR
is to offer an all-in-one open-source tool to perform these tasks,
using optimized algorithm implementations and harvesting the full
potential of modern computers, thus providing fast and accurate data
processing.
.PP
Comparing nucleotide sequences is at the core of \fBvsearch\fR. To
speed up comparisons, \fBvsearch\fR implements an extremely fast
Needleman-Wunsch algorithm, making use of the Streaming SIMD
Extensions (SSE2) of post-2003 x86-64 CPUs. If SSE2 instructions are
not available, \fBvsearch\fR exits with an error message. On Power8
CPUs it will use AltiVec/VSX/VMX instructions, and on ARMv8 CPUs it
will use Neon instructions. On other systems it can use the SIMD
Everywhere (simde) library, if available. Memory usage increases
rapidly with sequence length: for example, comparing two sequences of
length 1 kb requires 8 MB of memory per thread, and comparing two 10
kb sequences requires 800 MB of memory per thread. For comparisons
involving sequences with a length product greater than 25 million (for
example two sequences of length 5 kb), \fBvsearch\fR uses a slower
alignment method described by Hirschberg (1975) and Myers and Miller
(1988), with much smaller memory requirements.
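.PP
The figures above can be turned into a rough estimate. The following
Python sketch is not part of vsearch. The value of 8 bytes per matrix
cell is an assumption inferred from the two examples given (1 kb x 1 kb
for 8 MB, 10 kb x 10 kb for 800 MB), and both function names are
hypothetical.
.PP
.nf
```python
# Illustrative estimate only; constants are inferred from the text above.
BYTES_PER_CELL = 8                  # assumption: 8 MB / (1000 * 1000) cells
LENGTH_PRODUCT_LIMIT = 25_000_000   # threshold quoted in the text


def full_matrix_bytes(length_a, length_b):
    """Estimated per-thread memory of the fast alignment method."""
    return length_a * length_b * BYTES_PER_CELL


def uses_low_memory_method(length_a, length_b):
    """True when the length product exceeds the quoted threshold."""
    return length_a * length_b > LENGTH_PRODUCT_LIMIT


print(full_matrix_bytes(1000, 1000))         # 8000000 bytes, i.e. 8 MB
print(uses_low_memory_method(10000, 10000))  # True
```
.fi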
.\" ----------------------------------------------------------------------------
.SS Input
\fBvsearch\fR accepts as input fasta or fastq files containing one or
several nucleotide entries. In fasta files, each entry is made of a
header and a sequence. The header is defined as the string comprised
between the initial '>' symbol and the first space, tab or the end of
the line, unless the \-\-notrunclabels option is in effect, in which
case the entire line is included. The header should contain printable
ascii characters (33-126). The program will terminate with a fatal
error if there are unprintable ascii characters. A warning will be
issued if non-ascii characters (128-255) are encountered.
.PP
If the header matches the pattern '>[;]size=\fIinteger\fR;label', the
pattern '>label;size=\fIinteger\fR;label', or the
pattern '>label;size=\fIinteger\fR[;]', \fBvsearch\fR will interpret
\fIinteger\fR as the number of occurrences (or abundance) of the
sequence in the study. That abundance information is used or created
during chimera detection, clustering, dereplication, sorting and
searching.
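.PP
As an illustration of the three header patterns above, the following
Python sketch extracts the abundance from a header. It is a simplified
reading of those patterns, not the parser used by vsearch, and the
function name is hypothetical. The header is passed without its
leading '>' symbol.
.PP
.nf
```python
import re

# "size=<integer>" at the start of the header or after a ';',
# followed by a ';' or by the end of the header.
SIZE_PATTERN = re.compile(r"(?:^|;)size=([0-9]+)(?:;|$)")


def abundance(header):
    """Return the annotated abundance as an int, or None if absent."""
    match = SIZE_PATTERN.search(header)
    return int(match.group(1)) if match else None


print(abundance("seq1;size=12;"))   # 12
print(abundance("seq1"))            # None
```
.fi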
.PP
The sequence is defined as a string of IUPAC symbols
(ACGTURYSWKMDBHVN), starting after the end of the identifier line and
ending before the next identifier line, or the file end. \fBvsearch\fR
silently ignores ascii characters 9 to 13, and exits with an error
message if ascii characters 0 to 8, 14 to 31, '.' or '-' are
present. All other ascii or non-ascii characters are stripped and
complained about in a warning message.
.PP
In fastq files, each entry is made of a sequence header starting with a
symbol '@', a nucleotide sequence (same rules as for fasta
sequences), a quality header starting with a symbol '+' and a string
of ASCII characters (offset 33 or 64), each one encoding the quality
value of the corresponding position in the nucleotidic sequence.
.PP
\fBvsearch\fR operations are case insensitive, except when soft
masking is activated. Masking is automatically applied during chimera
detection, clustering, masking, pairwise alignment and searching. Soft
masking is specified with the options '\-\-dbmask soft' (for searching
and chimera detection with a reference) or '\-\-qmask soft' (for
searching, \fIde novo\fR chimera detection, clustering and
masking). When using soft masking, lower case letters indicate masked
symbols, while upper case letters indicate regular symbols. Masked
symbols are never included in the unique index words used for sequence
comparisons, otherwise they are treated as normal symbols.
.PP
When comparing sequences during chimera detection, dereplication,
searching and clustering, T and U are considered identical, regardless
of their case. When aligning sequences, identical symbols will receive
a positive match score (default +2). If two symbols are not identical,
their alignment results in a negative mismatch score (default
-4). Aligning a pair of symbols where at least one of them is an
ambiguous symbol (BDHKMNRSVWY) will always result in a score of
zero. Alignment of two identical ambiguous symbols (for example, R vs
R) also receives a score of zero. When computing the amount of
similarity by counting matches and mismatches after alignment,
ambiguous nucleotide symbols will count as matching to other symbols
if they have at least one of the nucleotides (ACGTU) they may
represent in common. For example: W will match A and T, but also any
of MRVHDN. When showing alignments (for example with the \-\-alnout
option) matches involving ambiguous symbols will be shown with a plus
character (+) between them while exact matches between non-ambiguous
symbols will be shown with a vertical bar character (|).
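.PP
The scoring and matching rules above can be sketched in Python as
follows. This is an illustration under the default scores quoted in the
text, not the vsearch implementation, and the function names are
hypothetical.
.PP
.nf
```python
# IUPAC codes and the nucleotides they may represent (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
AMBIGUOUS = set("BDHKMNRSVWY")
MATCH, MISMATCH = 2, -4   # default scores


def alignment_score(a, b):
    """Score one aligned pair of symbols (case insensitive)."""
    a, b = a.upper(), b.upper()
    if a in AMBIGUOUS or b in AMBIGUOUS:
        return 0          # any pair involving an ambiguous symbol
    return MATCH if IUPAC[a] == IUPAC[b] else MISMATCH


def counts_as_match(a, b):
    """True if both symbols share at least one possible nucleotide."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


print(alignment_score("T", "u"))   # 2
print(alignment_score("R", "R"))   # 0
print(counts_as_match("W", "M"))   # True
```
.fi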
.PP
\fBvsearch\fR can read data from standard files and write to standard
files, but it can also read from pipes and write to pipes! For
example, multiple fasta files can be piped into \fBvsearch\fR for
dereplication. To do so, file names can be replaced with:
.RS
.IP - 2
the symbol '-', representing '/dev/stdin' for input files
or '/dev/stdout' for output files (with an exception for '\-\-db \-',
see * below),
.IP -
a named pipe created with the command mkfifo,
.IP -
a process substitution '<(command)' as input or '>(command)' as
output.
.IP *
\-\-db \- is not accepted, to prevent potential concurrent reads from
stdin. A workaround for advanced users is to call '\-\-db /dev/stdin'
directly.
.RE
.PP
\fBvsearch\fR can automatically read compressed gzip or bzip2 files if
the appropriate libraries are present during the
compilation. \fBvsearch\fR can also read pipes streaming compressed
gzip or bzip2 data if the options \-\-gzip_decompress or
\-\-bzip2_decompress are selected. When reading from a pipe, the
progress indicator is not updated.
.\" ----------------------------------------------------------------------------
.SS Options
\fBvsearch\fR recognizes a large number of command-line commands and
options. For easier navigation, options are grouped below by theme
(chimera detection, clustering, dereplication and rereplication,
FASTA/FASTQ file processing, masking, pairwise alignment, searching,
shuffling, sorting, and subsampling). We start with the general
options that apply to all themes. Options start with a double dash
(\-\-). A single dash (\-) may also be used, except on NetBSD
systems. Option names may be shortened as long as they are not
ambiguous (e.g. \-\-derep_f).
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG help-and-version-commands
Help and version commands:
.PP
.RS
.TAG help
.TAG h
.TP 9
.B \-\-help \-\-h
Display help text with brief information about all commands and
options.
.TAG version
.TAG v
.TP
.B \-\-version \-\-v
Output version information and a citation for the VSEARCH
publication. Show the status of the support for gzip- and
bzip2-compressed input files.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG general-options
General options:
.RS
.TAG bzip2_decompress
.TP 9
.B \-\-bzip2_decompress
When reading from a pipe streaming bzip2-compressed data, decompress
the data. This option is not needed when reading from a standard
bzip2-compressed file.
.TAG fasta_width
.TP
.BI \-\-fasta_width\~ "positive integer"
Fasta files produced by \fBvsearch\fR are wrapped (sequences are
written on lines of \fIinteger\fR nucleotides, 80 by default). Set
the value to zero to eliminate the wrapping.
.TAG gzip_decompress
.TP
.B \-\-gzip_decompress
When reading from a pipe streaming gzip-compressed data, decompress
the data. This option is not needed when reading from a standard
gzip-compressed file.
.TAG label_suffix
.TP
.BI \-\-label_suffix\~ string
When writing FASTA or FASTQ files, add the suffix \fIstring\fR to
sequence headers.
.TAG log
.TP
.BI \-\-log \0filename
Write messages to the specified log file. Information written includes
program version, amount of memory available, number of cores and
command line options, and if need be, informational messages, warnings
and fatal errors. The start and finish times are also recorded as well
as the elapsed time and the maximum amount of memory consumed. The
different \fBvsearch\fR commands can also write additional
information to the log file.
.TAG maxseqlength
.TP
.BI \-\-maxseqlength\~ "positive integer"
All \fBvsearch\fR operations discard sequences longer than
\fIinteger\fR (50,000 nucleotides by default).
.TAG minseqlength
.TP
.BI \-\-minseqlength\~ "positive integer"
All \fBvsearch\fR operations discard sequences shorter than
\fIinteger\fR: 1 nucleotide by default for sorting or shuffling, 32
nucleotides for clustering and dereplication as well as the commands
\-\-makeudb_usearch, \-\-sintax, and \-\-usearch_global.
.\" note: minseqlength can be set to zero (keep empty entries)
.TAG no_progress
.TP
.B \-\-no_progress
Do not show the gradually increasing progress indicator.
.TAG notrunclabels
.TP
.B \-\-notrunclabels
Do not truncate sequence labels at first space or tab, but use the full
header in output files. Turned off by default for all commands except
the sintax command.
.TAG quiet
.TP
.B \-\-quiet
Suppress all messages to stdout and stderr except for warnings and
fatal error messages.
.TAG sample
.TP
.BI \-\-sample\~ string
When writing FASTA or FASTQ files, add the given sample identifier
\fIstring\fR to sequence headers. For instance, if the given string is
ABC, the text ";sample=ABC" will be added to the header. Note that
\fIstring\fR will be truncated at the first ';' or blank
character. Other characters (alphabetical, numerical and punctuations)
are accepted.
.TAG threads
.TP
.BI \-\-threads\~ "positive integer"
Number of computation threads to use (1 to 1024). The number of threads
should be less than or equal to the number of available CPU cores. The
default is to use all available resources and to launch one thread per
core. The following commands are multi-threaded:
allpairs_global, cluster_fast, cluster_size, cluster_smallmem,
cluster_unoise, fastq_mergepairs, fastx_mask, maskfasta, search_exact,
sintax, uchime_ref, and usearch_global. Only one thread is used for
the other commands.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG chimera-detection-options
Chimera detection options:
.PP
.RS
Chimera detection is based on a scoring function controlled by five
options (\-\-dn, \-\-mindiffs, \-\-mindiv, \-\-minh,
\-\-xn). Sequences are first sorted by decreasing abundance, if
available, and compared on their \fIplus\fR strand only (case
insensitive).
.PP
Input sequences are masked as specified with the \-\-qmask and
\-\-hardmask options. Masking of the database for reference based
chimera detection is specified with the \-\-dbmask option.
.PP
In \fIde novo\fR mode, the input fasta file must contain abundance
annotations (i.e. a pattern [;]size=\fIinteger\fR[;] in the fasta
header). Input order matters for chimera detection, so we recommend
sorting sequences by decreasing abundance (default of
\-\-derep_fulllength command). If your sequence set needs to be
sorted, please see the \-\-sortbysize command in the sorting section.
.PP
.TAG abskew
.TP 9
.BI \-\-abskew \0real
When using \-\-uchime_denovo, the abundance skew is used to
distinguish in a three-way alignment which sequence is the chimera and
which are the parents. The assumption is that chimeras appear later in
the PCR amplification process and are therefore less abundant than
their parents. For \-\-uchime3_denovo the default value is 16.0. For
the other commands, the default value is 2.0, which means that the
parents should be at least 2 times more abundant than their
chimera. Any positive value equal to or greater than 1.0 can be used.
.TAG alignwidth
.TP
.BI \-\-alignwidth\~ "positive integer"
When using \-\-uchimealns, set the width of the three-way alignments
(80 nucleotides by default). Set to zero to eliminate wrapping.
.TAG borderline
.TP
.BI \-\-borderline \0filename
Output borderline chimeric sequences to \fIfilename\fR, in fasta
format. Borderline chimeric sequences are sequences that have a high
enough score but which are not sufficiently different from their
closest parent.
.TAG chimeras
.TP
.BI \-\-chimeras \0filename
Output chimeric sequences to \fIfilename\fR, in fasta format. Output
order may vary when using multiple threads.
.TAG db
.TP
.BI \-\-db \0filename
When using \-\-uchime_ref, detect chimeras using the reference
sequences contained in \fIfilename\fR. Reference sequences are assumed
to be chimera-free. Chimeras cannot be detected if their parents, or
sufficiently close relatives, are not present in the database. The
file name must refer to a FASTA file or to a UDB file. If a UDB file
is used, it should be created using the \-\-makeudb_usearch command
with the \-\-dbmask dust option.
.TAG dn
.TP
.BI \-\-dn\~ "strictly positive real number"
Pseudo-count prior on the number of no votes, corresponding to the
parameter \fIn\fR in the chimera scoring function (default value is
1.4). Increasing \-\-dn reduces the likelihood of tagging a sequence
as a chimera (fewer false positives, but also more false negatives).
.TAG fasta_score
.TP
.B \-\-fasta_score
Add the chimera score to the headers in the fasta output files for
chimeras, non-chimeras and borderline sequences, using the
format ';uchime_denovo=\fIfloat\fR;'.
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG mindiffs
.TP
.BI \-\-mindiffs\~ "positive integer"
Minimum number of differences per segment (default value is 3). The
parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo.
.TAG mindiv
.TP
.BI \-\-mindiv \0real
Minimum divergence from closest parent (default value is 0.8). The
parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo.
.TAG minh
.TP
.BI \-\-minh \0real
Minimum score (\fIh\fR). Increasing this value tends to reduce the
number of false positives and to decrease sensitivity. Default value
is 0.28, and values ranging from 0.0 to 1.0 included are accepted. The
parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo.
.TAG nonchimeras
.TP
.BI \-\-nonchimeras \0filename
Output non-chimeric sequences to \fIfilename\fR, in fasta
format. Output order may vary when using multiple threads.
.TAG relabel
.TP
.BI \-\-relabel \0string
Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3,
etc.) to construct the new headers. Use \-\-sizeout to conserve the
abundance annotations.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Relabel sequences using the MD5 message digest algorithm applied to
each sequence. Former sequence headers are discarded. The sequence is
converted to upper case and each 'U' is replaced by a 'T' before
computation of the digest. The MD5 digest is a cryptographic hash
function designed to minimize the probability that two different
inputs give the same output, even for very similar, but non-identical
inputs. Still, there is a very small, but non-zero, probability that
two different inputs give the same digest (i.e. a collision). MD5
generates a 128-bit (16-byte) digest that is represented by 16
hexadecimal numbers (using 32 symbols among 0123456789abcdef). Use
\-\-sizeout to conserve the abundance annotations.
.\" The probablity of collision for two sequences is 1/2^128
.TAG relabel_self
.TP
.B \-\-relabel_self
Relabel sequences using each sequence itself as a label.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Relabel sequences using the SHA1 message digest algorithm applied to
each sequence. It is similar to the \-\-relabel_md5 option but uses
the SHA1 algorithm instead of the MD5 algorithm. SHA1 generates a
160-bit (20-byte) digest that is represented by 20 hexadecimal numbers
(40 symbols). The probability of a collision (two non-identical
sequences resulting in the same digest) is smaller for the SHA1
algorithm than it is for the MD5 algorithm.
.\" The probablity of collision for two sequences is 1/2^160
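.IP
The labels produced by \-\-relabel_md5 and \-\-relabel_sha1 can be
approximated with the Python sketch below, which follows the
normalisation described above (upper case, 'U' replaced by 'T'). The
function name is hypothetical, and agreement with the labels written by
\fBvsearch\fR has not been verified here.
.IP
.nf
```python
import hashlib


def digest_label(sequence, algorithm="md5"):
    """Return the hex digest of the normalised sequence."""
    normalised = sequence.upper().replace("U", "T")
    return hashlib.new(algorithm, normalised.encode("ascii")).hexdigest()


print(len(digest_label("acgu")))           # 32 hexadecimal symbols
print(len(digest_label("acgu", "sha1")))   # 40 hexadecimal symbols
```
.fi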
.TAG self
.TP
.B \-\-self
When using \-\-uchime_ref, ignore a reference sequence when its label
matches the label of the query sequence (useful to estimate
false-positive rate in reference sequences).
.\" I am not sure the statement above is true.
.TAG selfid
.TP
.B \-\-selfid
When using \-\-uchime_ref, ignore a reference sequence when its
nucleotide sequence is strictly identical to the nucleotide sequence
of the query.
.TP
.B \-\-sizein
In \fIde novo\fR mode, abundance annotations
(pattern '[>;]size=\fIinteger\fR[;]') present in sequence headers are
taken into account by default (\-\-sizein is always implied). This
option is ignored by \-\-uchime_ref.
.TAG sizeout
.TP
.B \-\-sizeout
When relabelling, add abundance annotations to fasta headers (using
the format ';size=\fIinteger\fR;').
.TAG uchime_denovo
.TP
.BI \-\-uchime_denovo \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR, without
external references (i.e. \fIde novo\fR). Automatically sort the
sequences in \fIfilename\fR by decreasing abundance beforehand (see
the sorting section for details). Multithreading is not supported.
.TAG uchime2_denovo
.TP
.BI \-\-uchime2_denovo \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR, using
the UCHIME2 algorithm. This algorithm is designed for denoised
amplicons (see \-\-cluster_unoise). Automatically sort the sequences
in \fIfilename\fR by decreasing abundance beforehand (see the sorting
section for details). Multithreading is not supported.
.TAG uchime3_denovo
.TP
.BI \-\-uchime3_denovo \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR, using
the UCHIME2 algorithm. The only difference from \-\-uchime2_denovo is
that the default minimum abundance skew (\-\-abskew) is set to 16.0
rather than 2.0.
.TAG uchime_ref
.TP
.BI \-\-uchime_ref \0filename
Detect chimeras present in the fasta-formatted \fIfilename\fR by
comparing them with reference sequences (option
\-\-db). Multithreading is supported.
.TAG uchimealns
.TP
.BI \-\-uchimealns \0filename
Write the three-way global alignments (parentA, parentB, chimera) to
\fIfilename\fR using a human-readable format. Use \-\-alignwidth to
modify alignment length. Output order may vary when using multiple
threads. All sequences are converted to upper case before
alignment. Lower case letters indicate disagreement in the alignment.
.TAG uchimeout
.TP
.BI \-\-uchimeout \0filename
Write chimera detection results to \fIfilename\fR using an 18-field,
tab\-separated uchime\-like format. Use \-\-uchimeout5 to use a format
compatible with usearch v5 and earlier versions. The output order of
rows may vary when using multiple threads.
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
score: higher score means a more likely chimeric alignment.
.IP \n+[step].
Q: query sequence label.
.IP \n+[step].
A: parent A sequence label.
.IP \n+[step].
B: parent B sequence label.
.IP \n+[step].
T: top parent sequence label (i.e. parent most similar to the
query). That field is removed when using \-\-uchimeout5.
.IP \n+[step].
idQM: percentage of similarity of query (Q) and model (M)
constructed as a part of parent A and a part of parent B.
.IP \n+[step].
idQA: percentage of similarity of query (Q) and parent A.
.IP \n+[step].
idQB: percentage of similarity of query (Q) and parent B.
.IP \n+[step].
idAB: percentage of similarity of parent A and parent B.
.IP \n+[step].
idQT: percentage of similarity of query (Q) and top parent (T).
.IP \n+[step].
LY: yes votes in the left part of the model.
.IP \n+[step].
LN: no votes in the left part of the model.
.IP \n+[step].
LA: abstain votes in the left part of the model.
.IP \n+[step].
RY: yes votes in the right part of the model.
.IP \n+[step].
RN: no votes in the right part of the model.
.IP \n+[step].
RA: abstain votes in the right part of the model.
.IP \n+[step].
div: divergence, defined as (idQM - idQT).
.IP \n+[step].
YN: query is chimeric (Y), or not (N), or is a borderline case (?).
.RE
.RE
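.IP
For downstream processing, a row of this output can be mapped to the
field names listed above. The Python sketch below is a hypothetical
helper, not part of \fBvsearch\fR. It assumes one result row per line
and no header line.
.IP
.nf
```python
# Field names in the order listed above (18-field format).
FIELDS = ["score", "Q", "A", "B", "T", "idQM", "idQA", "idQB", "idAB",
          "idQT", "LY", "LN", "LA", "RY", "RN", "RA", "div", "YN"]
TAB = chr(9)   # the tab character


def parse_uchimeout_line(line, uchimeout5=False):
    """Map one tab-separated row to a dict of field name to text."""
    # With uchimeout5, the fifth field (T) is absent.
    names = [f for f in FIELDS if not (uchimeout5 and f == "T")]
    # strip() removes the trailing newline; the first and last
    # fields are assumed to be non-empty.
    values = line.strip().split(TAB)
    if len(values) != len(names):
        raise ValueError("unexpected number of fields")
    return dict(zip(names, values))
```
.fi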
.TAG uchimeout5
.TP
.B \-\-uchimeout5
When using \-\-uchimeout, write chimera detection results using a
17\-field, tab\-separated uchime\-like format (drop the 5th field of
\-\-uchimeout), compatible with usearch version 5 and earlier
versions.
.TAG xlength
.TP
.B \-\-xlength
Strip header attribute ";length=\fIinteger\fR" from input
sequences. This attribute is added to output sequences by the
\-\-lengthout option.
.TAG xn
.TP
.BI \-\-xn\~ "strictly positive real number"
Weight of no votes, corresponding to the parameter \fIbeta\fR in the
scoring function (default value is 8.0). Increasing \-\-xn reduces the
likelihood of tagging a sequence as a chimera (fewer false positives,
but also more false negatives).
.TAG xsize
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
.RE
.PP
.\" ----------------------------------------------------------------------------
.TAG clustering-options
Clustering options:
.RS
.PP
\fBvsearch\fR implements a single-pass, greedy centroid-based
clustering algorithm, similar to the algorithms implemented in
usearch, DNAclust and sumaclust for example. Important parameters are
the global clustering threshold (\-\-id) and the pairwise identity
definition (\-\-iddef).
.PP
Input sequences are masked as specified with the \-\-qmask and
\-\-hardmask options.
.TAG biomout
.TP 9
.BI \-\-biomout \0filename
Generate an OTU table in the biom version 1.0 JSON file format as
specified at
.URL https://biom-format.org/documentation/format_versions/biom-1.0.html "(link)"
<https://biom-format.org/documentation/format_versions/biom-1.0.html>.
The format describes how to store a sparse matrix containing the
abundances of the OTUs in the different samples. This format is much
more efficient than the classic and mothur OTU table formats available
with the \-\-otutabout and \-\-mothur_shared_out options,
respectively, and is recommended at least for large tables. The OTUs
are represented by the cluster centroids. Taxonomy information will be
included for the OTUs if available. Sample identifiers will be
extracted from the headers of all sequences in the input file. If the
header contains ';sample=abc123;' or ';barcodelabel=abc123;' or a
similar string somewhere, then the given sample identifier
(here 'abc123') will be used. The semicolon is not mandatory at the
beginning or end of the header. The sample identifier may contain any
printable character except semicolons. If no such sample label is
found, the identifier in the initial part of the header will be used,
but only letters, digits and underscores are allowed. OTU identifiers
will be extracted from the headers of the cluster centroid
sequences. If the header contains ';otu=def789;' or a similar string
somewhere, then the given OTU identifier (here 'def789') will be
used. The semicolon is not mandatory at the beginning or end of the
header. The OTU identifier may contain any printable character except
semicolons. If no such OTU label is found, the identifier in the
initial part of the header will be used, and all characters except
semicolons are allowed. Alternatively, OTU identifiers can be
generated using the relabelling options (\-\-relabel,
\-\-relabel_self, \-\-relabel_sha1, or \-\-relabel_md5). Taxonomy
information, if present, will also be extracted from the headers of
the centroid sequences. If the header contains ';tax=Homo_sapiens;' or
a similar string somewhere, then the given taxonomy information
(here 'Homo_sapiens') will be used. The semicolon is not mandatory at
the beginning or end of the header. The taxonomy information may
contain any printable character except semicolons. If an OTU table in
the biom version 2.1 HDF5 file format is required, the biom utility
may be used as described at
.URL https://biom-format.org/documentation/biom_conversion.html "(link)"
<https://biom-format.org/documentation/biom_conversion.html>.
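.IP
The sample identifier rules described above can be approximated as
follows. This is only a sketch under stated assumptions: the function
name sample_identifier is hypothetical, only the two attribute names
quoted above (sample and barcodelabel) are recognized, and the header
is assumed to be given without its leading ">" character. The actual
matching rules used by vsearch may differ.
.IP
.nf
```python
import re

# Attribute form: "sample=..." or "barcodelabel=...", at the start of
# the header or after a semicolon; the value runs up to the next
# semicolon. Assumption: only these two attribute names are handled.
SAMPLE_RE = re.compile("(?:^|;)(?:sample|barcodelabel)=([^;]+)")
# Fallback: leading run of letters, digits and underscores.
LEADING_RE = re.compile("[A-Za-z0-9_]+")


def sample_identifier(header):
    """Return the sample identifier found in a header, or None."""
    match = SAMPLE_RE.search(header)
    if match:
        return match.group(1)
    match = LEADING_RE.match(header)
    return match.group(0) if match else None


print(sample_identifier("seq1;sample=abc123;size=5"))
print(sample_identifier("plate7_read42;size=5"))
```
.fi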
.TAG centroids
.TP
.BI \-\-centroids \0filename
Output cluster centroid sequences to \fIfilename\fR, in fasta
format. The centroid is the sequence that seeded the cluster (i.e. the
first sequence of the cluster).
.TAG clusterout_id
.TP
.B \-\-clusterout_id
Add cluster identifier information to the output files
when using the \-\-centroids, \-\-consout and \-\-profile options.
.TAG clusterout_sort
.TP
.B \-\-clusterout_sort
Sort some output files by decreasing abundance instead of input
order. It applies to the \-\-consout, \-\-msaout, \-\-profile,
\-\-centroids, and \-\-uc options. For \-\-uc, the sorting applies
only to the centroid information part (the C lines).
.TAG cluster_fast
.TP
.BI \-\-cluster_fast \0filename
Cluster the fasta sequences in \fIfilename\fR, after automatically
sorting them by decreasing sequence length.
.TAG cluster_size
.TP
.BI \-\-cluster_size \0filename
Cluster the fasta sequences in \fIfilename\fR, after automatically
sorting them by decreasing sequence abundance.
.TAG cluster_smallmem
.TP
.BI \-\-cluster_smallmem \0filename
Cluster the fasta sequences in \fIfilename\fR without automatically
modifying their order beforehand. Sequences are expected to be sorted
by decreasing sequence length, unless \-\-usersort is used.
.TAG cluster_unoise
.TP
.BI \-\-cluster_unoise \0filename
Perform denoising of the fasta sequences in \fIfilename\fR according
to the UNOISE version 3 algorithm by Robert Edgar, but without the
\fIde novo\fR chimera removal step, which may be performed afterwards
with \-\-uchime3_denovo. The options \-\-minsize (default 8) and
\-\-unoise_alpha (default 2.0) may be specified. In this
algorithm, clustering of sequences depends on both the sequence
distance and the abundance ratio. The abundance ratio (skew) is the
abundance of a new sequence divided by the abundance of the centroid
sequence. For the sequences to be clustered together, this skew must
not be larger than beta, where beta = 2^(-1 - alpha * d) and d is
the sequence distance. The sequence
distance used is the number of mismatches in the alignment, ignoring
gaps. This means that the abundance must be exponentially lower as the
distance increases from the centroid for a new sequence to be included
in the cluster. Nearer sequences with higher abundances will form
their own new clusters.
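.IP
The skew criterion described above can be written out as a short
Python sketch. The function names unoise_beta and may_join_cluster are
hypothetical, and only the abundance-ratio test is modelled; any other
condition applied by vsearch when clustering is left out.
.IP
.nf
```python
def unoise_beta(distance, alpha=2.0):
    """Largest skew allowed at this distance: 2 ** (-1 - alpha * distance)."""
    return 2.0 ** (-1.0 - alpha * distance)


def may_join_cluster(abundance, centroid_abundance, distance, alpha=2.0):
    """True if the skew (abundance ratio) does not exceed beta."""
    skew = abundance / centroid_abundance
    return skew <= unoise_beta(distance, alpha)


# With alpha = 2.0, beta is 0.5 at distance 0, 0.125 at 1, 0.03125 at 2.
for d in range(3):
    print(d, unoise_beta(d))
```
.fi
.IP
For example, with a centroid of abundance 100 and one mismatch, a
sequence of abundance 10 (skew 0.1) passes the test, whereas a sequence
of abundance 20 (skew 0.2) does not.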
.TAG clusters
.TP
.BI \-\-clusters \0string
Output each cluster to a separate fasta file using the prefix
\fIstring\fR and a ticker (0, 1, 2, etc.) to construct the path and
filenames.
.TAG consout
.TP
.BI \-\-consout \0filename
Output cluster consensus sequences to \fIfilename\fR. For each
cluster, a center-star multiple sequence alignment is computed with
the centroid as the center, using a fast algorithm (not accurate when
using low pairwise identity thresholds). A consensus sequence is
constructed by taking the majority symbol (nucleotide or gap) from
each column of the alignment. Columns containing a majority of gaps
are skipped, except for terminal gaps. If the \-\-sizein option is
specified, sequence abundances will be taken into account.
.TAG cons_truncate
.TP
.B \-\-cons_truncate
This option is ignored. A warning is issued.
.\" .TP
.\" .B \-\-cons_truncate
.\" when using the \-\-consout option to build consensus sequences,
.\" do not ignore terminal gaps. That option skips terminal columns
.\" if they contain a majority of gaps, yielding shorter consensus
.\" sequences than when using \-\-consout alone.
.TAG id
.TP
.BI \-\-id \0real
Do not add the target to the cluster if the pairwise identity with the
centroid is lower than \fIreal\fR (value ranging from 0.0 to 1.0,
inclusive). The pairwise identity is defined as (matching
columns) / (alignment length - terminal gaps). That definition can be
modified by \-\-iddef.
.TAG iddef
.TP
.BI \-\-iddef\~ "0|1|2|3|4"
Change the pairwise identity definition used in \-\-id. Values
accepted are:
.RS
.RS
.nr step 0 1
.IP \n[step]. 4
CD-HIT definition: (matching columns) / (shortest sequence length).
.IP \n+[step].
edit distance: (matching columns) / (alignment length).
.IP \n+[step].
edit distance excluding terminal gaps (same as \-\-id).
.IP \n+[step].
Marine Biological Lab definition counting each gap opening (internal
or terminal) as a single mismatch, whether or not the gap was
extended: 1.0 - [(mismatches + gap openings)/(longest sequence
length)].
.IP \n+[step].
BLAST definition, equivalent to \-\-iddef 1 in a context of global
pairwise alignment.
.RE
.RE
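.IP
The five definitions can be compared side by side with a small Python
sketch. The function pairwise_identity and its arguments are
hypothetical: they take alignment statistics that are assumed to have
been counted already, and the example numbers are made up.
.IP
.nf
```python
def pairwise_identity(iddef, matches, mismatches, alignment_length,
                      terminal_gaps, gap_openings, shortest, longest):
    """Identity for --iddef 0 to 4, from precomputed alignment counts."""
    if iddef == 0:   # CD-HIT definition
        return matches / shortest
    if iddef in (1, 4):  # edit distance; BLAST (global alignment)
        return matches / alignment_length
    if iddef == 2:   # edit distance excluding terminal gaps
        return matches / (alignment_length - terminal_gaps)
    if iddef == 3:   # Marine Biological Lab definition
        return 1.0 - (mismatches + gap_openings) / longest
    raise ValueError("iddef must be 0, 1, 2, 3 or 4")


# Made-up alignment: 100 columns, of which 90 matches, 5 mismatches,
# 4 terminal gap columns and 1 internal gap column (2 gap openings).
counts = dict(matches=90, mismatches=5, alignment_length=100,
              terminal_gaps=4, gap_openings=2, shortest=95, longest=100)
for iddef in range(5):
    print(iddef, pairwise_identity(iddef, **counts))
```
.fi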
.TAG lengthout
.TP
.B \-\-lengthout
Write sequence length information to the output files in FASTA format
by adding a ";length=\fIinteger\fR" attribute in the header.
.TAG minsize
.TP
.BI \-\-minsize\~ "positive integer"
Specify the minimum abundance of sequences for denoising using
\-\-cluster_unoise. The default is 8.
.TAG msaout
.TP
.BI \-\-msaout \0filename
Output a multiple sequence alignment and a consensus sequence for each
cluster to \fIfilename\fR, in fasta format. Be warned that vsearch
computes center-star multiple sequence alignments using a fast method
whose accuracy can decrease significantly when using low pairwise
identity thresholds. The consensus sequence is constructed by taking
the majority symbol (nucleotide or gap) from each column of the
alignment. Columns containing a majority of gaps are skipped, except
for terminal gaps. If the \-\-sizein option is specified, sequence
abundances will be taken into account when computing the consensus.
.TAG mothur_shared_out
.TP
.BI \-\-mothur_shared_out \0filename
Output an OTU table in the mothur 'shared' tab-separated plain text
format as described at
.URL https://www.mothur.org/wiki/Shared_file (link)
<https://www.mothur.org/wiki/Shared_file>. The
format describes how a matrix containing the abundances of the OTUs in
the different samples is stored. The first line will start with the
strings 'label', 'group' and 'numOtus' and is followed by a list of
all OTU identifiers. The following lines, one for each sample, start
with the string 'vsearch' followed by the sample identifier, the total
number of OTUs, and a list of abundances for each OTU in that sample,
in the order given on the first line. The OTU and sample identifiers
are extracted from the FASTA headers of the sequences. The OTUs are
represented by the cluster centroids. See the \-\-biomout option for
further details.
.TAG otutabout
.TP
.BI \-\-otutabout \0filename
Output an OTU table in the classic tab-separated plain text format as
a matrix containing the abundances of the OTUs in the different
samples. The first line will start with the string '#OTU ID' and is
followed by a tab-separated list of all sample identifiers. The
following lines, one for each OTU, start with the OTU identifier and
are followed by a tab-separated list of abundances for that OTU in each
sample, in the order given on the first line. The OTU and sample
identifiers are extracted from the FASTA headers of the sequences (see
the \-\-sample option). The OTUs are represented by the cluster
centroids. An extra column is added to the right of the table if
taxonomy information is available for at least one of the OTUs. This
column will be labelled 'taxonomy' and each row will then contain the
taxonomy information extracted for that OTU. See the \-\-biomout
option for further details.
.TAG profile
.TP
.BI \-\-profile \0filename
Output a sequence profile to a text file with the frequency of each
nucleotide in each position in the multiple alignment for each
cluster. There is a FASTA-like header line for each cluster, followed
by the profile information in a tab-separated format. The eight
columns are: position (0-based), consensus nucleotide, number of As,
number of Cs, number of Gs, number of Ts or Us, number of gap symbols,
and finally the total number of ambiguous nucleotide symbols (B, D, H,
K, M, N, R, S, Y, V or W). All numbers are integers. If the \-\-sizein
option is specified, sequence abundances will be taken into account.
.TAG qmask
.TP
.BI \-\-qmask\~ "none|dust|soft"
Mask regions in sequences using the
\fIdust\fR or the \fIsoft\fR methods, or do not mask
(\fInone\fR). Warning, when using \fIsoft\fR masking, clustering
becomes case sensitive. The default is to mask using \fIdust\fR.
.TAG qsegout
.TP
.BI \-\-qsegout \0filename
Write the aligned part of each query sequence to \fIfilename\fR in
FASTA format.
.TAG relabel
.TP
.BI \-\-relabel \0string
Relabel sequence identifiers in the output files produced by
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG relabel_keep
.TP
.B \-\-relabel_keep
When relabelling, keep the old identifier in the header after a space.
.TAG relabel_md5
.TP
.B \-\-relabel_md5
Relabel sequence identifiers in the output files produced by
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG relabel_self
.TP
.B \-\-relabel_self
Relabel sequence identifiers in the output files produced by
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG relabel_sha1
.TP
.B \-\-relabel_sha1
Relabel sequence identifiers in the output files produced by
\-\-consout, \-\-profile and \-\-centroids options. Please see the
description of the same option under Chimera detection for details.
.TAG sizein
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
headers).
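.IP
A rough Python equivalent of the pattern above is sketched below. The
function name header_abundance is hypothetical. The header is assumed
to be given without its leading ">" character, and the trailing
semicolon is assumed to be optional at the end of the header; vsearch
itself may be stricter or more lenient.
.IP
.nf
```python
import re

# "size=" must start the header or follow a semicolon, and the integer
# must be followed by a semicolon or by the end of the header.
SIZE_RE = re.compile("(?:^|;)size=([0-9]+)(?:;|$)")


def header_abundance(header):
    """Return the abundance annotation as an int, or None if absent."""
    match = SIZE_RE.search(header)
    return int(match.group(1)) if match else None


print(header_abundance("seq1;size=42;"))
print(header_abundance("seq1"))
```
.fi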
.TAG sizeorder
.TP
.B \-\-sizeorder
When an amplicon is close to 2 or more centroids, all within the
distance specified with the \-\-id option, resolve the ambiguity by
clustering it with the centroid having the highest abundance, not
necessarily the closest one. This option only has an effect when the value
specified with \-\-maxaccepts is higher than one. The \-\-sizeorder
option turns on what is sometimes referred to as abundance-based
greedy clustering (AGC), in contrast to the default distance-based
greedy clustering (DGC).
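.IP
The difference between the two strategies can be illustrated with a
short Python sketch. The function pick_centroid and the candidate
tuples are hypothetical: the candidates are assumed to be centroids
that already satisfy the \-\-id threshold, and tie-breaking is not
modelled.
.IP
.nf
```python
def pick_centroid(candidates, sizeorder=False):
    """Choose among (label, identity, abundance) candidate centroids."""
    if sizeorder:
        # Abundance-based greedy clustering (AGC): most abundant centroid.
        return max(candidates, key=lambda c: c[2])[0]
    # Distance-based greedy clustering (DGC): closest centroid.
    return max(candidates, key=lambda c: c[1])[0]


# Made-up candidates: c1 is closer, c2 is more abundant.
candidates = [("c1", 0.99, 10), ("c2", 0.97, 500)]
print(pick_centroid(candidates))
print(pick_centroid(candidates, sizeorder=True))
```
.fi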
.TAG sizeout