vecscreen_plus_taxonomy
Author: Alejandro Schaffer
Documentation help and code review: Eric Nawrocki
github: https://github.com/aaschaffer/vecscreen_plus_taxonomy.git
Version: 0.17
December 2020
------------------
README
This file describes
-- a script called from_vecscreen_to_summary.pl and its constituent
programs
and
-- a summary classification script compare_vector_matches_wtaxa.pl
that can be used to run vecscreen and parse its outputs, including
taxonomy information.
There is a related repository
github: https://github.com/aaschaffer/generate_vecscreen_candidates.git
Outline of this file:
BACKGROUND
SETTING UP ENVIRONMENT VARIABLES
SAMPLE RUN, USAGE AND OUTPUT OF from_vecscreen_to_summary.pl
SAMPLE RUN, USAGE AND OUTPUT OF compare_vector_matches_wtaxa.pl
METHODS
TEST SCRIPT
RELEVANT FILES
CREATING THE TAXONOMY FILE
AN ADDITIONAL TAXONOMY HELPER PROGRAM
COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH
**************
**BACKGROUND**
**************
vecscreen is the established NCBI program to identify matches between
(query) sequences and (subject) vectors in UniVec. These matches may
represent (true) vector contamination, but experience has shown that
there can be many false positives. A primary reason for false
positives is that the query and the matching subject segment come from
the same genus or closely related genera. Therefore, knowing the genus
of the query sequence is helpful to interpret the output. For what
follows below, it is relevant that vecscreen distinguishes matches by
two characteristics:
1) Location: Internal or Terminal
A match is Terminal if and only if it includes a nucleotide within 25
positions of either end of the query; otherwise, the match is Internal.
2) Strength: Strong, Moderate, or Weak
A match is Strong if either: it is Terminal with a raw score of at
least 24 or it is Internal with a raw score of at least 30.
A match is Moderate if either: it is Terminal with a raw score in the
interval [19,23] or Internal with a raw score in the interval [25,29].
A match is Weak if it is Terminal with a raw score in the interval
[16,18] or Internal with a raw score in the interval [23,24].
vecscreen also reports Internal alignments with raw scores in the
range [16,22] when there is also a reportable match for the same
(query, vector) pair. The score range [16,22] is below the Weak range
for Internal matches. In the script
from_vecscreen_to_summary.pl, these below-Weak Internal matches are
assigned the level None.
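The location and strength rules above can be sketched in a few lines of
Python. This is only an illustration of the thresholds listed in this
section, not code from vecscreen or from this repository. The function
names are hypothetical, and the reading of "within 25 positions of
either end" as positions 1..25 or the last 25 positions of the query is
an assumption. The sketch also does not check the extra condition that
a level-None Internal match is reported only when the same (query,
vector) pair has another reportable match.

```python
from typing import Optional

TERMINAL_WINDOW = 25  # nucleotides counted from either end of the query


def match_location(start: int, end: int, query_length: int) -> str:
    """Return 'Terminal' or 'Internal' for a match covering the 1-based
    query positions start..end.

    Assumption: 'within 25 positions of either end' means positions
    1..25 or (query_length - 24)..query_length.
    """
    if start <= TERMINAL_WINDOW or end > query_length - TERMINAL_WINDOW:
        return "Terminal"
    return "Internal"


def match_strength(location: str, raw_score: int) -> Optional[str]:
    """Map (location, raw score) to 'Strong', 'Moderate', 'Weak' or 'None'.

    The Python value None is returned for raw scores below 16, which
    this README does not describe as being reported at all.
    """
    if location == "Terminal":
        thresholds = [(24, "Strong"), (19, "Moderate"), (16, "Weak")]
    elif location == "Internal":
        thresholds = [(30, "Strong"), (25, "Moderate"), (23, "Weak"), (16, "None")]
    else:
        raise ValueError(f"unknown location: {location!r}")
    # Thresholds are listed from highest to lowest, so the first hit wins.
    for minimum_score, level in thresholds:
        if raw_score >= minimum_score:
            return level
    return None
```

For example, match_strength(match_location(1, 30, 500), 24) evaluates
to "Strong", while the same raw score for an Internal match is only
"Weak".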
More information about vecscreen can be found at
https://www.ncbi.nlm.nih.gov/tools/vecscreen/about/
In this document, 'vecscreen' refers to the command-line version, not
the Web page version.
The programs and data in this vecscreen_plus_taxonomy repository are
for prospective analysis of nucleotide sequences, especially
sequences that one might want to submit to GenBank.
The related repository
github: https://github.com/aaschaffer/generate_vecscreen_candidates.git
does a preprocessing step relevant to only retrospective analysis of
sequences already in a database set up for BLAST, such as the non-redundant
(nr) database in GenBank. In retrospective analysis, one wants to start
by searching the database for sequences that may be expected to have vecscreen
matches, so as to avoid running vecscreen on the entire database.
************************************
**SETTING UP ENVIRONMENT VARIABLES**
************************************
Before you can run from_vecscreen_to_summary.pl or
compare_vector_matches_wtaxa.pl you will need to update some of your
environment variables. To do this, add the following four lines to
your .bashrc file (if you use bash shell) or .cshrc file (if you use C
shell or tcsh). The .bashrc or .cshrc file will be in your home
directory. To determine what shell you use, type 'echo $SHELL'. If it
returns '/bin/bash', then update your .bashrc file; if it returns
'/bin/csh' or '/bin/tcsh', then update your .cshrc file.
The 4 lines to add to your .bashrc file are shown below. Make sure that
you replace PATH/TO/VEC/PLUS with the path to the directory
created when you cloned the vecscreen_plus_taxonomy github repository:
-----------
export VECPLUSDIR="PATH/TO/VEC/PLUS"
export PERL5LIB="$VECPLUSDIR:$PERL5LIB"
export PATH="$VECPLUSDIR/scripts:$PATH"
export BLASTDB="$VECPLUSDIR/univec-files:$BLASTDB"
-----------
The 4 lines to add to your .cshrc file:
-----------
setenv VECPLUSDIR "PATH/TO/VEC/PLUS"
setenv PERL5LIB "$VECPLUSDIR":"$PERL5LIB"
setenv PATH "$VECPLUSDIR/scripts":"$PATH"
setenv BLASTDB "$VECPLUSDIR/univec-files":"$BLASTDB"
-----------
Then, after adding those 4 lines, execute this command:
source ~/.bashrc
OR
source ~/.cshrc
If PERL5LIB was not already defined (you will know if you get an error
message when you run the above 'source' command), replace line 2 of
the 4 lines with
export PERL5LIB="$VECPLUSDIR"
for .bashrc, OR
setenv PERL5LIB "$VECPLUSDIR"
for .cshrc.
Similarly, if BLASTDB was not already defined (you will know if you
get an error message when you run the above 'source' command), replace
line 4 of the 4 lines with
export BLASTDB="$VECPLUSDIR/univec-files"
for .bashrc, OR
setenv BLASTDB "$VECPLUSDIR/univec-files"
for .cshrc.
To check that your environment variables are properly set up do the
following four commands:
echo $VECPLUSDIR
echo $PERL5LIB
echo $PATH
echo $BLASTDB
The first command should return only:
PATH/TO/VEC/PLUS
The second echo command should return a potentially longer
string that begins with the same path:
PATH/TO/VEC/PLUS
The third echo command should return a (potentially longer) string
that begins with:
PATH/TO/VEC/PLUS/scripts
The fourth echo command should return a (potentially longer) string
that begins with:
PATH/TO/VEC/PLUS/univec-files
If that is not the case, please email Alejandro Schaffer
([email protected]). If you do see the expected output, the
following sample run should work.
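The four checks above can also be automated. The following Python
sketch is not part of the repository; the function name
check_environment is hypothetical. It assumes that entries in
PERL5LIB, PATH and BLASTDB are separated by ':' as in the lines shown
earlier in this section, and it compares the first entry of each
variable with the expected directory.

```python
from typing import List, Mapping


def check_environment(env: Mapping[str, str], install_dir: str) -> List[str]:
    """Return a list of problems; an empty list means all four checks pass.

    env is a mapping such as os.environ, and install_dir is the
    directory created when cloning the repository (PATH/TO/VEC/PLUS).
    """
    expected_first_entry = {
        "PERL5LIB": install_dir,
        "PATH": install_dir + "/scripts",
        "BLASTDB": install_dir + "/univec-files",
    }
    problems = []
    # VECPLUSDIR must be exactly the installation directory.
    if env.get("VECPLUSDIR") != install_dir:
        problems.append("VECPLUSDIR should be exactly " + install_dir)
    # The other three variables may be longer, but must begin with
    # the expected directory as their first ':'-separated entry.
    for name, expected in expected_first_entry.items():
        value = env.get(name)
        if value is None:
            problems.append(name + " is not set")
        elif value.split(":")[0] != expected:
            problems.append(name + " should begin with " + expected)
    return problems
```

To check the current shell environment, import os and call
check_environment(os.environ, "PATH/TO/VEC/PLUS") with the real
directory path substituted.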
Finally, you will need to gunzip two executable files in
PATH/TO/VEC/PLUS/scripts, and make the files executable (in case they
are not already executable). Perform the following commands:
> cd $VECPLUSDIR/scripts
> gunzip vecscreen.gz
> gunzip srcchk.gz
> chmod +x vecscreen
> chmod +x srcchk
After this, you should be able to successfully complete the sample
runs below, as well as the test script in the TEST SCRIPT section
below.
****************************************************************
**SAMPLE RUN, USAGE AND OUTPUT OF from_vecscreen_to_summary.pl**
****************************************************************
Here is an example command that will run from_vecscreen_to_summary.pl
on the FASTA file $VECPLUSDIR/test-files/test.input_sequence_file.fa
using the taxonomy file included as
PATH/TO/VEC/PLUS/info-files/taxonomy_tree_wlevels.txt.
> from_vecscreen_to_summary.pl --output_root mytest --input_fasta $VECPLUSDIR/test-files/test.input_sequence_file.fa --input_taxa $VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt --combine_output --verbose > mytest.out
This command is contained in the file
$VECPLUSDIR/scripts/sample_run.sh. In fact, it is better to just run
that shell script file, because it will also check that the observed
output matches the expected output:
> sh $VECPLUSDIR/scripts/sample_run.sh
When you run the above 'sh' command, you should see output like this:
--
comparing expected output test-files/expected.output_combined_wtaxonomy.txt to observed output mytest.output_combined_wtaxonomy.txt
SUCCESS: Files are identical
--
After running the above from_vecscreen_to_summary.pl command, the
one-line per sequence output is in the file
mytest.output_combined_wtaxonomy.txt. This 11-column format is
described below under 'Verbose output format (enabled with --verbose
option): 11 columns'.
The output of the script to stdout, which describes briefly what the
script is doing, is in the file mytest.out.
The input sequence file
($VECPLUSDIR/test-files/test.input_sequence_file.fa) can be any
nucleotide sequence file in FASTA format.
The taxonomy file ($VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt)
is a file in a special format that includes taxonomy information based
on NCBI's taxonomy. This can either be created by the user from the
NCBI taxonomy file, or the user can use the provided file
taxonomy_tree_wlevels.txt, which was created in April 2018. Because
NCBI's taxonomy changes over time, the provided file will not include
taxids added after that date, so for current work you should create a
new up-to-date file. Instructions for doing that are provided below in
the section CREATING THE TAXONOMY FILE.
The options that can be provided to from_vecscreen_to_summary.pl are:
--input_fasta <s> : REQUIRED: file name <s> with sequences in fasta format
--input_taxa <s> : REQUIRED: file name <s> mapping vecscreen matches to taxa
--output_root <s> : REQUIRED: output files will be named starting with <s>
--verbose : output 11 columns instead of 5
--combine_output : combine internal and terminal matches
--keep : keep all intermediate files (e.g. vecscreen output)
The --input_fasta <s>, --input_taxa <s> and --output_root <s>
options are required when running from_vecscreen_to_summary.pl, while
--verbose, --combine_output and --keep are optional. By default
--verbose, --combine_output and --keep are all turned off.
The --combine_output option determines whether one or two files of
output are produced. Currently the names of the output files are
partly fixed at:
<output_root>.output_combined_wtaxonomy.txt (if --combine_output is applied)
<output_root>.output_internal_wtaxonomy.txt (if --combine_output is not applied)
<output_root>.output_terminal_wtaxonomy.txt (if --combine_output is not applied)
The option --verbose determines how many columns of output are
produced, as described in the two output format descriptions below. It
does not cause more diagnostic output to be printed during the
execution of the script. The extra columns are important for
classifying some matches as true or false contamination.
The option --keep determines whether the output file from running
vecscreen within the script is kept (--keep used) or deleted (--keep
not used, default).
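As an illustration of how the options and output file names fit
together, here is a small Python sketch. It is not part of the
repository and the two function names are hypothetical. It uses only
the option names and the output file name patterns documented above.

```python
from typing import List


def build_command(input_fasta: str, input_taxa: str, output_root: str,
                  verbose: bool = False, combine_output: bool = False,
                  keep: bool = False) -> List[str]:
    """Assemble the argument list for from_vecscreen_to_summary.pl."""
    command = [
        "from_vecscreen_to_summary.pl",
        "--output_root", output_root,
        "--input_fasta", input_fasta,
        "--input_taxa", input_taxa,
    ]
    # The three optional flags are off unless requested.
    if combine_output:
        command.append("--combine_output")
    if verbose:
        command.append("--verbose")
    if keep:
        command.append("--keep")
    return command


def expected_output_files(output_root: str, combine_output: bool) -> List[str]:
    """Return the summary file name(s) described for --combine_output."""
    if combine_output:
        return [output_root + ".output_combined_wtaxonomy.txt"]
    return [
        output_root + ".output_internal_wtaxonomy.txt",
        output_root + ".output_terminal_wtaxonomy.txt",
    ]
```

The list returned by build_command can be passed to subprocess.run
from the Python standard library once the environment variables are
set up as described above.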
from_vecscreen_to_summary.pl will create one line of tabular output
per vecscreen hit in the input sequence file. There are two possible
output formats. For both formats, columns are separated by tabs.
--------------------------------
Default output format: 5 columns
--------------------------------
By default (if --verbose is not used), the format of those lines
will be the following five columns:
Column 1: Accession of query
Column 2: Genus of query if known, or 1 otherwise
Column 3: Matching vector, starting with uv|
Column 4: One end of the alignment in the vector
Column 5: The other end of the alignment in the vector
This 5 column format was agreed on for internal NCBI usage in JIRA
ticket SM-187.
----------------------------------------------------------------
Verbose output format (enabled with --verbose option): 11 columns
----------------------------------------------------------------
If --verbose is used, then each line of output will include the
following 11 columns:
Column 1: Accession of query
Column 2: Genus of query if known, or 1 otherwise
Column 3: Species of query if known, or 1 otherwise
Column 4: Lower end of the alignment in the query
Column 5: Upper end of the alignment in the query
Column 6: Matching vector, starting with uv|
Column 7: One end of the alignment in the vector
Column 8: The other end of the alignment in the vector
Column 9: The strength of this vecscreen match
Column 10: The strength of the strongest vecscreen match for this query
Column 11: Whether there is any dangling part (called "Suspect" by
vecscreen) at either end of the query
A dangling part is an unmatched segment of <= 25 nucleotides.
This alternative 11-column format has been shown to be useful for some
purposes, such as correcting vector-contaminated sequences in GenBank.
Another circumstance in which the 11-column format
is essential is if there is an input sequence that has a known species
(not 1 in column 3) but does not have a known genus (1 in column 2).
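For downstream processing, the 11-column output can be read into
named fields. The Python sketch below is not part of the repository.
The field names in VERBOSE_COLUMNS and both function names are
hypothetical labels chosen here for the 11 columns described above;
the files themselves have no header line according to this README.
All values are kept as strings.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical field names for the 11 tab-separated columns, in order.
VERBOSE_COLUMNS = [
    "accession", "genus", "species", "query_lower", "query_upper",
    "vector", "vector_end_1", "vector_end_2", "strength",
    "strongest_strength", "dangling",
]


def read_verbose_summary(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per non-empty line of 11-column output."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != len(VERBOSE_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(VERBOSE_COLUMNS)} "
                f"columns, found {len(fields)}"
            )
        yield dict(zip(VERBOSE_COLUMNS, fields))


def species_known_genus_unknown(row: Dict[str, str]) -> bool:
    """True for the case noted above: species is not 1 but genus is 1."""
    return row["species"] != "1" and row["genus"] == "1"
```

An open file handle for a file such as
mytest.output_combined_wtaxonomy.txt can be passed directly as the
lines argument.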
*******************************************************************
**SAMPLE RUN, USAGE AND OUTPUT OF compare_vector_matches_wtaxa.pl**
*******************************************************************
The compare_vector_matches_wtaxa.pl script takes as input the
one-line per-sequence output file from
from_vecscreen_to_summary.pl. That input file must be the 11-column
output of from_vecscreen_to_summary.pl that is created when the
--verbose option is used for that script.
Here is the example usage of compare_vector_matches_wtaxa.pl:
compare_vector_matches_wtaxa.pl \
--input_summary $VECPLUSDIR/test-files/test.sample_input_final_step.txt \
--input_taxa $VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt \
--input_artificial_vectors $VECPLUSDIR/info-files/artificial_whole_sequences.txt \
--input_artificial_segments $VECPLUSDIR/info-files/artificial_intervals.txt \
--input_univec_sources $VECPLUSDIR/info-files/biological_exclusions.txt \
--input_amr $VECPLUSDIR/info-files/UniVec10_vs_amr_distinct_intervals.txt \
--input_sequences $VECPLUSDIR/test-files/test.sample_candidates.fa \
--outfile my_output_final_step.txt
This command is contained in the file
$VECPLUSDIR/scripts/sample_compare_run.sh. As above, it's better to
just run that shell script file, because it will also check that the
observed output matches the expected output:
> sh $VECPLUSDIR/scripts/sample_compare_run.sh
When you run the above 'sh' command, you should see output like this:
--
comparing expected output test-files/expected.output_final_step.txt to observed output my_output_final_step.txt
SUCCESS: Files are identical
--
After running this compare_vector_matches_wtaxa.pl command, the file
'my_output_final_step.txt' will contain the output of
compare_vector_matches_wtaxa.pl, which is the content of the input file
test-files/test.sample_input_final_step.txt with 3 additional columns:
Column 12: the classification of the match
Column 13: Most pertinent taxid of the vector interval
Column 14: Lowest common ancestor of column 2 and column 13
The possible classifications in column 12 are currently:
NO_DATA, TRUE_ARTIFICIAL, TRUE_ARTIFICIAL_MICROSAT, FALSE_AMR,
FALSE_BIOLOGICAL, TRUE_BIOLOGICAL, TRUE_MICROSAT, LIKELY_FALSE_BACTERIAL.
These are explained below.
In practice, one mainly wants to distinguish between:
{TRUE_ARTIFICIAL, TRUE_ARTIFICIAL_MICROSAT, TRUE_BIOLOGICAL, TRUE_MICROSAT}, which are true contamination,
versus
{FALSE_AMR, FALSE_BIOLOGICAL, LIKELY_FALSE_BACTERIAL}, which are false contamination.
In some cases, the classification TRUE_BIOLOGICAL may be too bold. This
can be seen when Column 14 is not much higher up the taxonomy
tree than column 13. When this happens, the conservation of the
vector source needs to be propagated from genus_level_exclusions.txt
to one of the higher-level files*:
superkingdom_level_exclusions.txt
kingdom_level_exclusions.txt
phylum_level_exclusions.txt
class_level_exclusions.txt
order_level_exclusions.txt
family_level_exclusions.txt
tribe_level_exclusions.txt
* Please email [email protected] if you find any examples of
vector sources that should be propagated to a higher level of taxonomy.
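To make the meaning of column 14 concrete, here is a minimal Python
sketch of a lowest common ancestor computation on a taxonomy tree. It
is an illustration only and may differ from what
compare_vector_matches_wtaxa.pl does internally. The function names
are hypothetical, and the tree is assumed to be given as a dictionary
from each taxid to its parent taxid, with taxid 1 as the root.

```python
from typing import Dict, List

ROOT_TAXID = 1  # root of the taxonomy tree


def lineage(taxid: int, parent: Dict[int, int]) -> List[int]:
    """Return [taxid, parent of taxid, ..., root]."""
    path = [taxid]
    while taxid != ROOT_TAXID:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxid_a: int, taxid_b: int,
                           parent: Dict[int, int]) -> int:
    """Return the deepest taxid that is on both lineages."""
    ancestors_of_a = set(lineage(taxid_a, parent))
    # Walk upward from taxid_b; the first shared node is the deepest one.
    for taxid in lineage(taxid_b, parent):
        if taxid in ancestors_of_a:
            return taxid
    return ROOT_TAXID
```

With a toy tree in which taxids 100 and 101 share the parent 10, the
lowest common ancestor of 100 and 101 is 10. If two taxids share only
the root, the result is 1.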
***********
**METHODS**
***********
Given the input taxonomy file and an input sequence file in FASTA
format, from_vecscreen_to_summary.pl will do the following:
1) Run vecscreen on the input FASTA-formatted sequence file to
identify high-scoring matches between the input sequences and known
vector sequences in UniVec.
2) Parse the vecscreen output into two tab-delimited files for
internal and terminal matches by calling parse_vecscreen.pl.
3) Optionally combine the two summary files into one by calling
combine_summaries.pl.
4) Add taxonomy information to the vecscreen output by calling srcchk
and add_taxonomy.pl.
compare_vector_matches_wtaxa.pl uses the data files listed in the
RELEVANT FILES section below to classify each vecscreen match. This
program is separate because an in-house usage needed somewhat
different I/O specifications to fit into an existing software
framework.
***************
**TEST SCRIPT**
***************
There is a 'test' script included in the vecscreen_plus_taxonomy
distribution that you should run to make sure that everything is set
up correctly, in addition to doing the two example sample runs above.
To run the test script, execute the following command:
> $VECPLUSDIR/scripts/test_vecscreen_plus_taxonomy_scripts.pl
You should see the following output:
--
Checking that required input files exist ... done.
Testing combine_summaries.pl ... done.
Testing add_taxonomy.pl ... done.
Testing from_vecscreen_to_summary.pl ... done.
Testing compare_vector_matches_wtaxa.pl ... done.
# All tests passed.
# SUCCESS
--
If you do not see this output, make sure that you've set up your
environment variable $VECPLUSDIR correctly as explained above in the
'SETTING UP ENVIRONMENT VARIABLES' section. If you still have
problems, email [email protected].
******************
**RELEVANT FILES**
******************
Several executable files are required for from_vecscreen_to_summary.pl
and compare_vector_matches_wtaxa.pl to work. Two of these executables
were developed by others; for users outside of NCBI, gzipped copies for
64-bit Linux are provided. All of these files are included in the
vecscreen_plus_taxonomy github repository, so you do not need to
create or move any files in order to get from_vecscreen_to_summary.pl
to work. After cloning the git repository, these files will be in the
PATH/TO/VEC/PLUS/scripts directory.
The first two files are NCBI executable programs that were not authored by
Alejandro Schaffer:
1. vecscreen and associated UniVec database
Identifies vector contamination in input sequences.
Within NCBI, vecscreen can be found at
/netopt/ncbi_tools64/c++.stable/ReleaseMT/bin/vecscreen
At NCBI, to make vecscreen available in your current directory, create
a symbolic link with this command:
> ln -s /netopt/ncbi_tools64/c++.stable/ReleaseMT/bin/vecscreen .
For users outside NCBI, we provide a gzipped executable of vecscreen
for 64-bit Linux computers in
scripts/vecscreen.gz
Run
gunzip vecscreen.gz
chmod +x vecscreen
and make sure that vecscreen is on the execution path (which it should
be if you followed the steps above in SETTING UP ENVIRONMENT VARIABLES).
vecscreen requires that the BLASTable database UniVec be
accessible. This means that these files must be in a directory
that is contained in your $BLASTDB environment variable. If you
followed the instructions in the SETTING UP ENVIRONMENT VARIABLES
section above, you have already added the appropriate directory
(PATH/TO/VEC/PLUS/univec-files) to $BLASTDB. The UniVec database
is represented (for purposes of vecscreen) in the three files:
univec-files/UniVec.nhr
univec-files/UniVec.nin
univec-files/UniVec.nsq
Do not try to edit the three UniVec files under any circumstances.
See also the section COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH
2. srcchk
Determines the taxonomy of input sequences, with respect to the NCBI
taxonomy tree.
Within NCBI, srcchk can be found at
/netopt/ncbi_tools64/bin/srcchk
At NCBI, to make srcchk available in your current directory, create
a symbolic link with this command:
> ln -s /netopt/ncbi_tools64/bin/srcchk .
For users outside NCBI, we provide a gzipped executable of srcchk
for 64-bit Linux computers in
scripts/srcchk.gz
Run
gunzip srcchk.gz
chmod +x srcchk
and make sure that srcchk is on the execution path (which it should
be if you followed the steps above in SETTING UP ENVIRONMENT VARIABLES)
See also the section COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH
The next 6 files were authored by Alejandro Schaffer:
3. from_vecscreen_to_summary.pl
This is the main script that coordinates the work by calling the
other executables listed below.
4. parse_vecscreen.pl
Auxiliary script called by from_vecscreen_to_summary.pl that parses
the vecscreen output.
5. combine_summaries.pl
Auxiliary script called by from_vecscreen_to_summary.pl that
combines two different output formats of parse_vecscreen.pl
6. add_taxonomy.pl
Auxiliary script called by from_vecscreen_to_summary.pl that adds
taxonomy information to the output of parse_vecscreen.pl.
7. assign_levels_to_taxonomy.pl
Independent script that is used to create a taxonomy file from the
NCBI taxonomy. The file produced by assign_levels_to_taxonomy.pl is
required input to from_vecscreen_to_summary.pl.
8. compare_vector_matches_wtaxa.pl
Program to classify vecscreen matches that have already been parsed
with from_vecscreen_to_summary.pl. Matches can be classified as:
A. NO_DATA: there is no data about the source of the vector
segment in the match.
B. TRUE_ARTIFICIAL: the vector segment matched is an ARTIFICIAL
sequence and hence the match is TRUE
contamination.
C. TRUE_ARTIFICIAL_MICROSAT: the vector segment matched is an
ARTIFICIAL sequence and hence the match
is TRUE contamination and the vector
contains a microsatellite.
D. FALSE_AMR: the query sequence is bacterial; the vector
segment matches a known sequence that confers
anti-microbial resistance and these can often
be transferred horizontally between bacteria
that may be taxonomically distant.
E. FALSE_BIOLOGICAL: the subject's biological origin is known and
its taxid is deemed close enough to that of the
query, so that the match is not contamination.
F. TRUE_BIOLOGICAL : the query and the matching subject originate from
taxa that are too far apart for the vector to occur
plausibly in the query.
G. TRUE_MICROSAT : the query and the matching subject originate from
taxa that are too far apart for the vector to
occur plausibly in the query, and the vector
contains a microsatellite.
H. LIKELY_FALSE_BACTERIAL: the query is from uncultured bacteria and
the matching subject is from bacteria.
Additionally, several sets of data files are required for
compare_vector_matches_wtaxa.pl to work. These were all included with
this software distribution. After cloning the github repository, these
files will be in the PATH/TO/VEC/PLUS/info-files directory.
1. taxonomy_tree_wlevels.txt
A compact form of NCBI's taxonomy tree with added fields to
indicate for each taxid, its level and whether it is a descendant
of the node Bacteria. Descendants of Bacteria are treated
specially because many sequences are now assigned to the generic
taxid "Uncultured bacteria"
2. UniVec10_vs_amr_distinct_intervals.txt
Intervals of vectors from UniVec version 10 that overlap with known
antimicrobial resistance (AMR) regions.
3. artificial_intervals.txt, artificial_whole_sequences.txt
Vector intervals or whole vectors that were generated in a
laboratory, not from a biological source. Many of these are known
also as "adaptors".
4. biological_exclusions.txt (which lists the following files)
superkingdom_level_exclusions.txt
kingdom_level_exclusions.txt
phylum_level_exclusions.txt
class_level_exclusions.txt
order_level_exclusions.txt
family_level_exclusions.txt
tribe_level_exclusions.txt
genus_level_exclusions.txt
The last listed file genus_level_exclusions.txt describes the
biological sources of vector segments at genus level. The other
files summarize in silico sequence analysis that shows that some
vector segments are conserved at seven taxonomic levels higher than
genus. In these files, as well as artificial_intervals.txt:
Column 1 is the vector segment using UniVec notation
Columns 2 and 3 are the start and end of the interval
Column 4 is the taxid
Column 5 is either the Latin or English name for the taxid
5. Microsatellite_vectors.txt
A list of vectors that contain microsatellites. Sequences that have
known microsatellites and match to these vectors are classified
specially as TRUE_MICROSAT or TRUE_ARTIFICIAL_MICROSAT.
Additionally, the files in the 'test-files' directory created when
cloning the github repository are all used by the
scripts/test_vecscreen_plus_taxonomy_scripts.pl script for testing that
your installation and setup are working properly.
Finally, the file epn-options.pm is a perl module authored
by Eric Nawrocki to handle command line options. It will be in the
top-level PATH/TO/VEC/PLUS directory after cloning the repository.
******************************
**CREATING THE TAXONOMY FILE**
******************************
This section describes how to create the taxonomy input file (passed
in to from_vecscreen_to_summary.pl with the --input_taxa <s> option)
using the NCBI taxonomy file.
NCBI's taxonomy is available from directory
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
We start from any of the files:
taxdmp.tar.Z
taxdump.tar.gz
taxdmp.zip
from which one can extract the file nodes.dmp.
Then execute the following commands:
> cut -f1,3,5 nodes.dmp > taxonomy_tree.txt
> assign_levels_to_taxonomy.pl --input_taxa taxonomy_tree.txt --outfile taxonomy_tree_wlevels.txt
taxonomy_tree.txt will have three columns:
Column 1: taxid as an integer
Column 2: parent taxid as an integer
Column 3: rank (e.g., phylum) as a word
Running assign_levels_to_taxonomy.pl will add
-- a fourth column which is the level in the taxonomy tree, where
the root has level 1 and each child taxid has a level one greater
than the level of its parent taxid;
-- a fifth column that is 1 if the node is a descendant of
Bacteria and 0 if not.
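The two added columns can be illustrated with the following Python
sketch. It is not the code of assign_levels_to_taxonomy.pl, and the
function name assign_levels is hypothetical. It assumes that the root
has taxid 1 and that Bacteria has taxid 2, as in NCBI's taxonomy, and
it makes the further assumption that the Bacteria node itself is
flagged with 1 along with its descendants.

```python
from typing import Dict, List, Tuple

ROOT_TAXID = 1      # root of NCBI's taxonomy tree
BACTERIA_TAXID = 2  # taxid of Bacteria in NCBI's taxonomy


def assign_levels(
    rows: List[Tuple[int, int, str]]
) -> List[Tuple[int, int, str, int, int]]:
    """Add level and Bacteria-flag columns to (taxid, parent, rank) rows."""
    parent = {taxid: parent_taxid for taxid, parent_taxid, _rank in rows}
    level: Dict[int, int] = {ROOT_TAXID: 1}
    in_bacteria: Dict[int, int] = {ROOT_TAXID: 0}

    def resolve(taxid: int) -> None:
        # Walk up to the nearest node whose level is already known,
        # then fill in levels and flags on the way back down.
        pending = []
        node = taxid
        while node not in level:
            pending.append(node)
            node = parent[node]
        for node in reversed(pending):
            above = parent[node]
            level[node] = level[above] + 1
            # Assumption: the Bacteria node itself is flagged too.
            in_bacteria[node] = 1 if node == BACTERIA_TAXID else in_bacteria[above]

    result = []
    for taxid, parent_taxid, rank in rows:
        resolve(taxid)
        result.append((taxid, parent_taxid, rank, level[taxid], in_bacteria[taxid]))
    return result
```

The rows may be given in any order, because the parent table is built
before any level is computed.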
*****************************************
**AN ADDITIONAL TAXONOMY HELPER PROGRAM**
*****************************************
The repository for vecscreen_plus_taxonomy also includes the helper program
find_taxonomy_ancestors.pl
The purpose of this program is to solve the following taxonomy-related problem:
given as input one or more accessions, what are the taxid ancestors of those
accessions at some specified taxonomy level, such as order or class?
Usage:
find_taxonomy_ancestors.pl \
--input_summary <input file of identifiers> \
--input_taxa <input file with NCBI's taxonomy> \
--input_level <desired taxonomy level> \
--outfile <output file>
The input has three columns:
1) accession
2) taxid of the accession, typically at species level
3) taxonomy name of the taxid in column 2
The output repeats the input columns and adds a fourth column with the
taxid of the ancestor at the specified level. If there is no ancestor
at the specified level, or if the taxid is not recognized, then the
value 1 is printed instead because 1 is the root of the taxonomy tree.
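The lookup described above can be sketched in Python as follows. This
is an illustration, not the code of find_taxonomy_ancestors.pl. The
function name ancestor_at_rank is hypothetical, and the taxonomy is
assumed to be available as two dictionaries: taxid to parent taxid and
taxid to rank name. As a further assumption, a taxid that is itself at
the requested rank is returned unchanged.

```python
from typing import Dict

ROOT_TAXID = 1  # printed when no ancestor at the requested rank exists


def ancestor_at_rank(taxid: int, desired_rank: str,
                     parent: Dict[int, int], rank: Dict[int, str]) -> int:
    """Return the ancestor of taxid whose rank is desired_rank, or 1."""
    if taxid not in parent:
        # Unrecognized taxid: report the root, as described above.
        return ROOT_TAXID
    node = taxid
    while node != ROOT_TAXID:
        if rank.get(node) == desired_rank:
            return node
        # A missing parent entry is treated like reaching the root.
        node = parent.get(node, ROOT_TAXID)
    return ROOT_TAXID
```

For a toy lineage species 2000, order 200, phylum 20, asking for
"order" returns 200, while asking for "class" returns 1 because no
node on that lineage has the rank class.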
find_taxonomy_ancestors.pl is not currently used within
vecscreen_plus_taxonomy, but is used in a related project.
*********************************************************
**COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH**
*********************************************************
We provide gzipped executables of vecscreen and srcchk for 64-bit Linux.
These have been tested on two non-NCBI computers, but it is not possible
to prove that they work on all 64-bit Linux computers that are sufficiently
up to date. In this context, sufficiently up to date means that all libraries
associated with gcc version 4.8 or higher are installed. The executables
provided were compiled with gcc version 4.9.1 because that was the oldest
version of gcc available among the versions >= 4.8.
srcchk and vecscreen are part of the NCBI C++ toolkit, which can
be downloaded at the time of writing from
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/ncbi_cxx--18_0_0.tar.gz
The current version is 18.0.0, but the version may increase in the future.
Modify the above address as necessary to get the current version.
At this time, the retrieved file will be named ncbi_cxx--18_0_0.tar.gz.
If the version retrieved is higher, replace all occurrences of
18_0_0 below accordingly.
Run
gunzip ncbi_cxx--18_0_0.tar.gz
tar xvf ncbi_cxx--18_0_0.tar
cd ncbi_cxx--18_0_0
In case the user wants to compile vecscreen and srcchk from scratch,
we found that the following set of steps works on Linux, but we
caution that compiling the NCBI C++ toolkit is a moving target and
instructions may need to change in the future. Let the token <GCC
version> represent the gcc version number without the decimal
points. For example, on the computer we used, the token would be 491.
After the following four commands are run, the executables for srcchk
and vecscreen should be found in the subdirectory (of ncbi_cxx--18_0_0)
called GCC<GCC version>-DebugMT64/bin.
The commands below will create files that collectively take about 25 GB
of disk space (as of the current version). After the build is complete,
you can copy only the srcchk and vecscreen binaries from GCC<GCC
version>-DebugMT64/bin to PATH/TO/VEC/PLUS, or anywhere else you want,
and delete the rest of the files, if desired.
1) ./configure --without-gui --without-internal --without-boost --with-bin-release --with-flat-makefile
2) cd GCC<GCC version>-Debug
3) make -C build -f Makefile.flat all_files
4) make -C build -f Makefile.flat app/
*******************************************
Send any comments or questions to Alejandro Schaffer