name: GraphClust workflow step by step
description: Step by step instructions for using GraphClust for clustering RNA sequences
title_default: "<b>GraphClust step by step</b>"
tags:
- "GraphClust"
steps:
- title: "<b>A tutorial on GraphClust (Clustering RNA sequences)</b>"
content: "This tour will walk you through the process of <b>GraphClust</b> to cluster RNA sequences.<br><br>
Read and follow the instructions before clicking <b>'Next'</b>.<br><br>
Click <b>'Prev'</b> if you missed a step."
backdrop: true
- title: "<b>A tutorial on GraphClust</b>"
content: "Together we will go through the following 11 steps:<br>
<dir>
<b>
<li>Data Acquisition</li>
<li>Pre-processing</li>
<li>Creating a graph from FASTA</li>
<li>Creating a sparse vector</li>
<li>Computing candidate clusters</li>
<li>Preprocessing data for computing best subtree</li>
<li>Computing best subtree using LocaRna</li>
<li>Finding consensus motifs</li>
<li>Building Covariance Model using cmbuild</li>
<li>Searching homologous sequences using cmsearch</li>
<li>Post-processing</li>
</b>
</dir>"
backdrop: true
- title: "<b>Data Acquisition</b>"
content: "We will start with a simple small <b>FASTA</b> file.<br><br>
You will get one FASTA file with RNA sequences that we want to cluster.<br><br>"
backdrop: true
- title: "<b>Data Acquisition</b>"
element: ".upload-button"
intro: "We will import the FASTA file into the history we just created.<br><br>
Click <b>'Next'</b> and the tour will take you to the Upload screen."
position: "right"
postclick:
- ".upload-button"
- title: "<b>Data Acquisition</b>"
element: "button#btn-new"
intro: "The sample training data available on GitHub is a good place to start.<br><br>
Simply click <b>'Next'</b> and the link to the training data will be inserted automatically, ready for upload.<br><br>
Later on, when you want to upload other data, you can do so by clicking the <b>'Paste/Fetch Data'</b> button or
<b>'Choose local file'</b> to upload a locally stored file."
position: "top"
postclick:
- "button#btn-new"
- title: "<b>Data Acquisition</b>"
element: ".upload-text-content:first"
intro: "Link acquired!"
position: "top"
textinsert:
https://github.com/BackofenLab/docker-galaxy-graphclust/raw/master/data/Rfam-cliques-dataset/cliques-low-representatives.fa
- title: "<b>Data Acquisition</b>"
element: "button#btn-start"
intro: "Click on <b>'Start'</b> to upload the data into your Galaxy history."
position: "top"
- title: "<b>Data Acquisition</b>"
element: "button#btn-close"
intro: "The upload may take a while.<br><br>
Click the <b>close</b> button once you see that the file has been uploaded into your history."
position: "top"
- title: "<b>Data Acquisition</b>"
element: "#current-history-panel > div.controls"
intro: "You've acquired your data. Now let's start using the GraphClust tools.<br><br>"
position: "left"
- title: "<b>GraphClust</b>"
intro: "Once we have the data for analysis we can start clustering the RNA sequences step by step. <br>
Navigate to the tool panel on the left side and click on <b>GraphClust</b>. This will open a section
with the set of tools needed for the GraphClust process. <br>
The first step of the process is <b>Preprocessing</b>."
position: "right"
- title: "<b>Pre-processing</b>"
intro: "This tool takes a file of sequences in FASTA format as input
and creates the final input for GraphClust based on the given parameters. The parameters allow us to
split long sequences into smaller fragments to enable the detection of local signals."
position: "right"
postclick:
- "#preproc > div.toolTitle > div > a"
- title: "<b>Pre-processing</b>"
element: "#s2id_uid-18_select > a"
intro: "Here we should define our input file.<br>"
position: "top"
- title: "<b>Pre-processing</b>"
element: "#uid-11"
intro: "<b>'Window size'</b> defines the length of the fragments into which the input sequences will be split.
By default it is set to a very high number so that the sequences are not split at all.<br><br>"
position: "top"
- title: "<b>Pre-processing</b>"
element: "#uid-64 > div.ui-form-title > span"
intro: "<b>'Window shift in percent'</b> defines the shift, in percent, for fragments of the input sequences.
By default it is 100% because we don't split the input sequences.<br><br>"
position: "top"
- title: "<b>Pre-processing</b>"
element: "#uid-46 > div.ui-form-field"
intro: "<b>Minimum sequence length</b> defines the minimal length of input sequences.<br><br>"
position: "top"
- title: "<b>Pre-processing</b>"
element: "#execute"
intro: "To run the tool, press the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Understanding the Output</b>"
element: "#current-history-panel"
intro: "After the tool is executed several output files are created.
<br> By clicking on the <b>'eye'</b> icon you can see the content of the files.
<br><br>"
position: "left"
- title: "<b>Preprocessing : DONE</b>"
intro: "Once preprocessing of the data is done we can move on to the next step: creating a graph from the FASTA file.<br>"
position: "right"
- title: "<b>Creating a graph from FASTA</b>"
intro: "To create a graph we need to use a tool called <b>fasta_to_gspan</b>, which you
can find in the GraphClust section of the tool panel.<br>"
position: "right"
postclick:
- "#gspan > div.toolTitle > div > a"
- title: "<b>Creating a graph from FASTA</b>"
element: "#uid-71 > div.ui-form-field > div.ui-select-content > div.ui-options > div.btn-group.ui-radiobutton"
intro: "As input, this tool takes the pre-processed FASTA file: <b>'data.fasta'</b>.<br>"
position: "right"
- title: "<b>Creating a graph from FASTA</b>"
intro: "You can find a detailed description of each parameter in the help section at the bottom of the page."
position: "top"
- title: "<b>Creating a graph from FASTA</b>"
element: "#execute"
intro: "To run the tool, press the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Creating a graph from FASTA</b>"
intro: "Once the tool has run, it produces the <b>gspan.zip</b> file.<br>"
position: "top"
- title: "<b>Creating a graph from FASTA : DONE</b>"
intro: "Now we have the <b>gspan.zip</b> file and can move on to the next step.<br>
The next step is to create a sparse vector using NSPDK. <br>
From the GraphClust tools choose <b>NSPDK_sparseVect</b> to start the process. <br>"
position: "top"
postclick:
- "#nspdk_sparse > div.toolTitle > div > a"
- title: "<b>Creating a sparse vector</b>"
intro: "This tool will create an explicit sparse feature encoding using <b>NSPDK</b>.<br>"
position: "top"
- title: "<b>Creating a sparse vector</b>"
intro: "This tool requires 2 input files:
<br>
<dir>
<b>
<li>data.fasta : from pre-processing step</li>
<li>gspan.zip : from the previous step</li>
</b>
</dir>"
position: "top"
- title: "<b>Creating a sparse vector</b>"
intro: "You can find more information about NSPDK in the help section.<br>"
position: "top"
- title: "<b>Creating a sparse vector</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Creating a sparse vector</b>"
intro: "After the tool has run, it produces <b>data_svector</b>, which we will use in the next step."
position: "top"
- title: "<b>Creating a sparse vector: DONE</b>"
intro: "The <b>data_svector</b> file we created will be used in the next step to compute
candidate clusters. To do that, click on the <b>NSPDK_candidateClusters</b> tool."
position: "top"
postclick:
- "#NSPDK_candidateClust > div.toolTitle > div > a"
- title: "<b>Computing candidate clusters</b>"
intro: "During this step we will compute a global feature index and get the top dense sets.
The candidate clusters are chosen as the top-ranking neighborhoods, provided that the
size of their overlap is below a specified threshold."
position: "top"
- title: "<b>Computing candidate clusters</b>"
intro: "Here we have to specify 3 input files:
<br>
<dir>
<b>
<li>data_svector : from the previous step</li>
<li>data_fasta : from pre-processing step</li>
<li>data_names : from pre-processing step</li>
</b>
</dir>"
position: "top"
- title: "<b>Computing candidate clusters</b>"
intro: "Another important parameter for this tool is <b>Multiple iterations</b>.
By default it is set to <b>'no'</b>, which means only a single iteration is run.
Setting it to <b>'yes'</b> requires additional input files produced by previous
iterations. In this tutorial we will run just a single iteration."
position: "top"
- title: "<b>Computing candidate clusters</b>"
intro: "For more information about this tool check the <b>help</b> section.<br>"
position: "top"
- title: "<b>Computing candidate clusters</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Computing candidate clusters</b>"
intro: "This step will produce 3 output files:
<br>
<dir>
<b>
<li>fast_cluster</li>
<li>fast_cluster_sim</li>
<li>blacklist</li>
</b>
</dir>
<b>fast_cluster</b> contains the IDs of the candidate clusters. <br>
<b>fast_cluster_sim</b> contains the similarity scores. <br>
<b>blacklist</b> contains the IDs of sequences that were already clustered, which is why it is empty
for a single iteration. <br>"
position: "top"
- title: "<b>Computing candidate clusters : DONE</b>"
intro: "Once the output files are ready we can move to the next step by clicking on
the <b>premlocarna</b> tool.<br>"
position: "top"
postclick:
- "#preMloc > div.toolTitle > div > a"
- title: "<b>Preprocessing data for computing best subtree</b>"
intro: "This tool will do some pre-processing for computing best subtrees.<br>"
position: "top"
- title: "<b>Preprocessing data for computing best subtree</b>"
intro: "This step needs 4 input files:
<br>
<dir>
<b>
<li>fast_cluster : from the previous step</li>
<li>fast_cluster_sim : from the previous step</li>
<li>data_fasta : from pre-processing step</li>
<li>data_names : from pre-processing step</li>
</b>
</dir>
"
position: "top"
- title: "<b>Preprocessing data for computing best subtree</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Preprocessing data for computing best subtree</b>"
intro: "Execution of this tool results in 5 datasets:
<br>
<dir>
<b>
<li>centers</li>
<li>trees</li>
<li>tree_matrix</li>
<li>cmfinder_fa</li>
<li>model_tree_fa</li>
</b>
</dir>
These datasets will be used in the next steps.
<br>"
position: "top"
- title: "<b>Preprocessing data for computing best subtree : DONE</b>"
intro: "Preprocessing for the best subtree computation is done, so we can now do the actual computation
of the best subtree. For that we need the tool called <b>locarna_best_subtree</b>.<br>"
position: "top"
postclick:
- "#locarna_best_subtree > div.toolTitle > div > a"
- title: "<b>Computing best subtree using LocaRna</b>"
intro: "This step computes a multiple sequence-structure alignment of RNA sequences using <b>LocaRna</b>.
It takes the tree file from the previous step, which contains a guide tree in NEWICK format,
and uses it as the guide tree for the progressive alignment. This saves the calculation
of pairwise all-vs-all similarities and the construction of the guide tree. At the end, the tool returns the best subtree.<br>"
position: "top"
- title: "<b>Computing best subtree using LocaRna</b>"
intro: "This step takes the following files as input:
<dir>
<b>
<li>centers : from the previous step</li>
<li>trees : from the previous step</li>
<li>tree_matrix : from the previous step</li>
<li>data_map : from pre-processing step</li>
</b>
</dir>
<br>"
position: "top"
- title: "<b>Computing best subtree using LocaRna</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Computing best subtree using LocaRna</b>"
intro: "The output of this tool is <b>model.tree.stk</b>, which will
be used in the next step to find consensus motifs.<br>"
position: "top"
- title: "<b>Computing best subtree using LocaRna : DONE</b>"
intro: "Now we have <b>model.tree.stk</b>, so we can find consensus motifs using the next tool,
<b>CMFinder_v0</b>.<br>"
position: "top"
postclick:
- "#cmFinder > div.toolTitle > div > a"
- title: "<b>Finding consensus motifs</b>"
intro: "In this step, files in CLUSTAL format are converted to STOCKHOLM format.
CMFinder is then used to determine consensus motifs for the sequences.<br>"
position: "top"
- title: "<b>Finding consensus motifs</b>"
intro: "This tool takes the following files as input:
<dir>
<b>
<li>model_tree_stk : from previous step</li>
<li>cmfinder_fa : from 'Preprocessing data for computing best subtree' step </li>
<li>tree_matrix</li>
</b>
</dir>
<br>"
position: "top"
- title: "<b>Finding consensus motifs</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Finding consensus motifs</b>"
intro: "The output of the tool is in STOCKHOLM format and contains the consensus structure. <br>"
position: "top"
- title: "<b>Finding consensus motifs : DONE</b>"
intro: "Once we have the consensus structure we can build a covariance model with the <b>cmbuild</b> tool.
To do that, click on the tool named <b>Build covariance models</b>.<br>"
position: "top"
postclick:
- "#infernal > div:nth-child(3) > div > a "
- title: "<b>Building Covariance Model using cmbuild</b>"
intro: "In this step we build a covariance model from an RNA multiple alignment.
<b>cmbuild</b> uses the consensus structure to determine the architecture of the covariance model. <br>"
position: "top"
- title: "<b>Building Covariance Model using cmbuild</b>"
intro: "As input for this tool we give the 'model_cmfinder_stk' file containing the consensus
structure from the previous step. <br>
For more information about this tool, read the <b>help</b> section of the page. <br>"
position: "top"
- title: "<b>Building Covariance Model using cmbuild</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Building Covariance Model using cmbuild : DONE</b>"
intro: "After the covariance model is built we can move on to <b>cmsearch</b>.
Simply click on <b>Search covariance model(s)</b>.<br>"
position: "top"
postclick:
- "#infernal > div:nth-child(4) > div > a"
- title: "<b>Searching homologous sequences using cmsearch</b>"
intro: "<b>cmsearch</b> uses covariance models (consensus RNA secondary structure profiles)
to search nucleic acid sequence databases for homologous RNAs. <br>"
position: "top"
- title: "<b>Searching homologous sequences using cmsearch</b>"
intro: "As the <b>sequence database</b>, choose the <b>data_fasta_scan</b> file generated during
pre-processing. <br>
Then, to use the <b>covariance model</b> generated in the previous step,
select the <b>Covariance model from your history</b> option from the <b>Subject covariance models</b> dropdown menu."
position: "top"
- title: "<b>Searching homologous sequences using cmsearch</b>"
element: "#execute"
intro: "Run the tool by pressing the <b>'Execute'</b> button.<br>"
position: "top"
- title: "<b>Searching homologous sequences using cmsearch : DONE</b>"
intro: "Finally, we have reached the last step of our workflow!
Click on <b>Report_Results</b> to run the final step. <br>"
position: "top"
postclick:
- "#glob_report > div.toolTitle > div > a"
- title: "<b>Post-processing</b>"
intro: "The final step of our workflow is <b>Post-processing</b>.<br>
In this step we will report the clusters and merge them if needed."
position: "top"
- title: "<b>Post-processing</b>"
intro: "This tool takes the following files as input:
<dir>
<b>
<li>FASTA.zip : from pre-processing step</li>
<li>cmsearch_results : from the previous step</li>
<li>model_tree_files : from 'Preprocessing data for computing best subtree' step</li>
</b>
</dir>
<br>"
position: "top"
- title: "<b>Post-processing</b>"
intro: "The final output: <br>
<b>'cluster.final.stat'</b> file contains general information about clusters,
e.g. number of clusters, number of sequences in each cluster etc.
<br> By clicking on the <b>'eye'</b> icon you can see the content of the file.
<br>
<b>'CLUSTERS'</b> dataset collection contains one file for each cluster. <br>
Each file contains information about sequences in that cluster. Each line in the file contains: cluster number,
cm_score, sequence origin (whether it comes from model or from Infernal search) and sequence id.
<br>"
position: "top"
- title: "<b>A tutorial on GraphClust workflow</b>"
intro: "Thank you for going through our tutorial."
backdrop: true