forked from ogotoh/spaln
-
Notifications
You must be signed in to change notification settings - Fork 0
/
spaln.1
319 lines (309 loc) · 8.73 KB
/
spaln.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
.\"
.\" Copyright (c) 2007-2018 Osamu Gotoh all rights reserved.
.\"
.TH spaln 1 "2018-09-06" \" -*- nroff -*-
.SH NAME
spaln \- mapping and alignment of cDNA/protein
sequences onto genomic sequence or rapid homology search
against protein sequence database and (semi)global alignment
.SH SYNOPSIS
spaln -WGenome.bk[n|p] -K[D|P] [W_Options] Genome.mfa(.gz)
.br
spaln -WProsDb.bka -KA [W_Options] ProteinSeqenceDB.faa(.gz)
.br
spaln [-Q[0|1|2|3]] [R_options] Genome_segment [cDNA|protein]_queries
.br
spaln -Q[4|5|6|7] -dGenome [R_options] [cDNA|protein]_queries
.br
spaln -Q[4|5|6|7] -aProsDb [R_options] Genomic_segment
.br
spaln -Q[4|5|6|7] -aProsDb [R_options] protein_queries
.br
spaln -Q[4|5|6|7] [R_options] Genome.mfa [cDNA|protein]_queries
.br
spaln -Q[4|5|6|7] [R_options] ProsDb.faa protein_queries
.SH DESCRIPTION
\fBspaln\fR is a stand-alone program that maps and aligns a set
of cDNA/protein sequences onto a genomic sequence.
From version 1.4, \fBspaln\fR supports a combination of protein
sequence database and a given genomic segment. From version 2.2,
it also supports rapid similarity search of protein sequences
against a protein sequence database followd by ordinary alignment.
\fBSpaln\fR runs with relatively small internal memory, making it feasible
to search a whole mammalian genome in a single job on a
conventional computer.
.SH EXAMPLES
.TP
.B spaln -Wddignm.bkn -KD -Xk10 -Xb8192 ddignm.mfa
Make the block index table 'ddignm.bkn' of the genomic
sequence 'ddignm.mfa' in multi-fasta format with
the word size of 10 nucleotides and the block size of 8192.
.TP
.B spaln -O1 'chr1.fa 10001 40000 <' cdna.fa
Align the cDNA sequence 'cdna.fa' onto the genomic segment of 'chr1.fa'
within the range from the 10001-th to the 40000-th
nucleotide of the complementary strand in the semi-global mode.
.TP
.B spaln -Q7 -LS -t10 -dddignm ddicdna.mfa > ddi.exon
Report the gene structures corresponding to individual
cDNAs in 'ddicdna.mfa'. The local alignment mode is used and the
results are written in 'ddi.exon' in the exon-oriented format.
The computation proceeds with 10 threads in parallel.
.TP
.B spaln -Q7 -O5 -XG1M -yX -TMus -ommu.intron -dmmugnm 'ratcdna.mfa (1 20)'
Cross-species comparison between the genome 'mmugnm' and the cDNAs
in 'ratcdna.mfa' from the first to the 20-th entries.
The maximal expected gene size is reset to 1 Mbp.
The outputs go to 'mmu.intron' in the intron-oriented format. The
tetrapod-specific parameter set is used.
.TP
.B spaln -Q7 -O0 -Tarabthal -aSwiss -oyour_genes.gff3 your_genomic_segment
A set of ORFs in the genomic segment are translated and rapidly searched
against the protein database, and then the best-hit sequence is used as
the template of the spliced alignment to infer the organization of the gene
located in the relevant genomic region. The output is shown in Gff3 gene format.
The dicot-specific parameter set is used.
.TP
.B spaln -Q4 -O0 -aSwiss -M4 -t10 aa_queries > output
Find up to four SwissProt sequences most similar to each query sequence,
and report global alignment statistics. To show alignment themselves,
use -O1 option in stead of -O0.
The computation proceeds with 10 threads in parallel.
.SH OPTIONS
Conventions: #: number; $: string; default values in ()
.br
(default for DNA, with -yX option, default for protein)
.SH "FORMATING OPTIONS"
.TP
.B -K$
Format the genomic sequence for DNA ($=D) or protein ($=P) queries, or
format the protein database sequences ($=A) for rapid search of template.
.TP
.B -Xk#
Word size (11 for DNA, 5 for protein)
.TP
.B -Xb#
Block size (4096)
.TP
.B -XG#
Maximum gene size (262144)
.TP
.B -Xa#
Abundance factor (10)
.TP
.B -Xs#
Shifts between adjacent seeds (k)
.SH "RUN TIME OPTIONS"
.TP
.B -C#
NCBI transl_table number indicating the genetic code (1)
.TP
.B -H#
Minimum alignment score for report (35)
.TP
.B -LS
Smith-Waterman-type local alignment
.TP
.B -M[#]
Multiple loci (0)
.RS
#=empty: Multiple loci maximally up to 4
.br
#=0: Single locus
.br
#=1: Re-search unaligned parts
.br
#>1: Multiple loci maximally up to #
.RE
.TP
.B -O[#]
Output format (4)
.RS
#=0: GFF3 format (gene) in genome vs [cDNA|protein] mode
.br
#=0: Alignment statistics in protein vs protein mode
.br
#=1: Alignment
.br
#=2: GFF3 format (match)
.br
#=3: Bed format
.br
#=4: Exon-oriented format similar to output of megablast -D 3
.br
#=5: Intron-oriented output
.br
#=6: Concatenated exon sequence
.br
#=7: Translated amino-acid sequence
.br
#=8: Mapping (block) information only. Use with -Q4
.br
#=12: Output the same information as -O4 in binary formats
.RE
.TP
.B -Q[#]
Select algorithm (3)
.RS
0<=#<=3: Genomic segment in the fasta format given by the first
argument vs. cdna/protein given by the second argument
.br
4<=#<=7: Genome mapping and alignment
.br
#=0,4: DP procedure without HSP search
.br
#=1-3,5-7: Recursive HSP searches up to the level of (# % 4)
.RE
.TP
.B -R$
Read block index table from the file $
.TP
.B -S[#]
Specify the orientation of query (0)
.RS
0: depend on the query annotation
.br
1: forward direction only
.br
2: reverse direction only
.br
3: both directions;
.RE
.TP
.B -T$
Specify the species (in combination with the -yS option for cDNA query)
.TP
.B -U
Map/align without splicing
.TP
.B -V#
Minimum space to induce the Hirschberg's algorithm (16M)
.TP
.B -W$
Write block index table to file $
.TP
.B -i[a|p]
Paired-end reads as the queries from a single or two files
.RS
-ia: 5' and 3' matching pairs must appear alternatively in a file
.br
-ip: 5' and 3' matching pairs must appear in the same order in the two files
.RE
.TP
.B -o$
Destination of output file (stdout)
.TP
.B -pa
Suppress trimming of terminal polyA or polyT sequence
.TP
.B -pq
Suppress some outputs to stderr
.TP
.B -pw
Report the result irrespective of the alignment score
.TP
.B -u#
Gap-extension penalty (3, 2, 2)
.TP
.B -v#
Gap-opening penalty (8, 6, 9)
.TP
.B -xB$
Bit pattern of the seeds used for HSP search at level 1
.TP
.B -xb$
Bit pattern of the seeds used for HSP search at level 3
.TP
.B -ya#
Dinucleotide pairs at the ends of an intron (0)
.RS
0: canonical only (GT..AG, GC..AG, AT..AC)
.br
1: relaxed to (GT..AG, GC..AG, AT..AN)
.br
2: #=1 + allow 1 mismatch from GT..AG
.br
3: any;
.RE
.TP
.B -yi#
Intron penalty (11, 8, 11)
.TP
.B -yj#
Incline of long gap penalty (0.6)
.TP
.B -yk#
Flex point where the incline of gap penalty changes (7)
.TP
.B -yl#
Double affine gap penalty if #=3; affine penalty otherwise
.TP
.B -ym#
Score for a nucleotide match (2, 2)
.TP
.B -yn#
Penalty for a nucleotide mismatch (6, 2)
.TP
.B -yo#
Penalty for in-frame termination codon (100)
.TP
.B -yp#
PAM level used in the alignment (third) phase (150)
.TP
.B -yq#
PAM level used in the second phase (50)
.TP
.B -yx#
Penalty for a frame shift (100)
.TP
.B -yy#
Relative contribution of splicing signal (8)
.TP
.B -yz#
Relative contribution of coding potential (2)
.TP
.B -yA#
Relative contribution of the translational initiation or termination signal (8)
.TP
.B -yB#
Relative contribution of branch point signal (0)
.TP
.B -yE#
Minimum exon length (2)
.TP
.TP
.B -yI$
Intron distribution parameters
.TP
.TP
.B -yJ#
Relative contribution of the bonus given to a conserved intron position
.TP
.B -yL#
The minimum intron length (20)
.TP
.B -yS#
Percent contribution of the species-specific splice signal. if #=0, only the ubiquitous signal given to the dinucleotide pair at the ends of an intron is used. By default #=0 for DNA and #=100 for protein queries. -yX option automatically sets #=100 for DNA and #=30 for protein.
.B -yS
For cDNA queries, use species-specific exon-intron boundary signals. For protein queries, invoke 'salvage' procedure in phase 1
.TP
.B -yX
For a DNA query, this option sets parameter values for cross-species comparison. Conversely, this option specifies an intra-species mode for a protein query.
.TP
.B -yY#
Relative contribution of length-dependent part of intron penalty (8)
.TP
.B -yZ#
Relative contribution of oligomer composition within an intron (0)
.sp
.SH REFERENCES
(1) "A Space-Efficient and Accurate Method for Mapping and Aligning
cDNA Sequences onto Genomic Sequence",
Osamu Gotoh, Nucleic Acid Res., 36 (8), 2630-2638 (2008).
.br
(2) "Direct Mapping and Alignment of Protein Sequences onto Genomic Sequence",
Osamu Gotoh, Bioinformatics, 24 (21) 2438-2444 (2008).
.br
(3) "Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features", Iwata, H. and Gotoh, O.", Nucleic Acids Res., 40 (20) e161 (2012).
.SH AUTHOR
Osamu Gotoh <[email protected]>