-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
193 lines (148 loc) · 10.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
---
output: github_document
always_allow_html: yes
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# BOLDmineR
<!-- badges: start -->
[![DOI](https://zenodo.org/badge/205451668.svg)](https://zenodo.org/badge/latestdoi/205451668)
<!-- badges: end -->
DNA barcodes are not only used by researchers, but also by decision-makers (e.g. to control food fraud or illegal species commercialization). The big-scale demand of both online services and information to identify species contrasts with the limited ways to automatize either species identification or assessment of barcode quality per species, directly from the web interface.
[BOLD system](http://www.boldsystems.org/) is the main database of DNA barcode worldwide. This database has been stepply growing through time since its release ([Ratnasingham and Hebert 2007](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1471-8286.2007.01678.x)) and its accessibility is pivotal for projects focused on DNA barcodes. Nowdays APIs, to some extant, are offering access to well-know databases such as [FishBase](https://fishbase.ropensci.org/), [WoRMS](http://www.marinespecies.org/rest/) or [BOLD](http://www.boldsystems.org/index.php/api_home). Despite BOLD's API mostly involves only public data, this leverages its data retrieving for wider purposes. The API's applicability, however, seems to be wholly held up by its own needs of having either standalone softwares or functions which could wrap up blocks of information. The main objective of these functions (i.e. BOLD-mineR's functions) is justly circumscribe the BOLD's API performance with R-based scripts to get insights about DNA barcodes by using public information.
## Installation
You can install the development version from GitHub with:
``` r
# library(devtools)
devtools::install_github("Ulises-Rosas/boldminer")
```
## Usage
``` r
library(boldminer)
```
### SpecimenData
This function lets us to mine associated metadata from any specimen according to following arguments:
* `taxon` (e.g. `Aves|Elasmobranchii`).
* `ids` (e.g. `ANGBF12704-15`).
* `bin` (e.g. `BOLD:AAA4689`).
* `container` (e.g. `FIPP`).
* `institution` (e.g. `Smithsonian Institution`).
* `researchers` (including identifiers and collectors).
* `geo` (e.g. `Peru`).
If we want to get information, for instance, from all specimens of elasmobranchs distributed in Peru and stored in BOLD, we can use the following line:
```{r}
specimendata <- boldminer::SpecimenData(taxon = "Elasmobranchii", geo = "Peru")
```
Then,we use _tibble_ package for just assessing `specimendata` dimension:
```{r}
tibble::as_tibble(specimendata)
```
Given its dimension, we can summarize our `specimendata` data frame with the `sumSData()` utility:
```{r}
boldminer::sumSData(df = specimendata, cols = c("species_name", "country"))
```
Where `n` column shows up unique counts after grouping data frame by values from `cols` argument. You can also plot geographical information, when available, by using `leafletPlot()`:
```{r eval=FALSE}
boldminer::leafletPlot(specimendata)
```
![](man/figures/Rplot02.png)
If only sequences are desired, the argument `seq = "only"` should be stated. You can also combine metadata with sequences by using `seq = "combined"`
```{r}
seqs <- boldminer::SpecimenData(taxon = "Elasmobranchii", geo = "Peru", seq = "only")
seqs
```
Above sequences can be also exported into a file by using `write.dna()` function from the _ape_ package, which, in turn, the _boldminer_ package depends:
```{r eval=FALSE}
# library(ape)
ape::write.dna(x = seqs, file = 'sequences.txt', format = 'fasta', nbcol = 1, colw = 90)
```
### ID_engine
This function finds best matches between a query sequence and a database of BOLD by using BLASTn-based algorithms. Arguments of this function are `query` and `db`. The first one are query sequences and the second one are one of avilable databases in BOLD:
* `COX1`
* `COX1_SPECIES`
* `COX1_SPECIES_PUBLIC`
* `COX1_L640bp`
This script also take account for those sequences which are not, by mistake, at the right sense and sends them to perform a BLAST search through its [API](https://ncbi.github.io/blast-cloud/dev/api.html)
This function starts with a fasta-formated file. In order to test this script, we can take sample file within the package:
```{r}
fasta_file <- system.file("sequences.fa", package = "boldminer")
```
Then, we can run the following line get its identification:
```{r eval=FALSE}
id_out <- boldminer::ID_engine(query = fasta_file, db = "COX1_SPECIES")
```
First five rows for each id_out item can be taken by using `lookID()` function:
```{r eval=FALSE}
boldminer::lookID(id_out)
#> Sample ID taxonomicidentification similarity
#> 1 seq1 PHANT458-08 Alopias pelagicus 1
#> 2 seq1 IRREK872-08 Alopias pelagicus 1
#> 3 seq1 IRREK873-08 Alopias pelagicus 1
#> 4 seq1 IRREK874-08 Alopias pelagicus 1
#> 5 seq1 IRREK876-08 Alopias pelagicus 1
#> 6 seq2 ANGBF10914-15 Alopias pelagicus 0.9986
#> 7 seq2 ESHKB029-07 Alopias pelagicus 0.9985
#> 8 seq2 ANGBF10913-15 Alopias pelagicus 0.9971
#> 9 seq2 PHANT458-08 Alopias pelagicus 0.9969
#> 10 seq2 IRREK873-08 Alopias pelagicus 0.9969
#> 11 seq3 ANGBF12626-15 Alopias pelagicus 1
#> 12 seq3 ANGBF12623-15 Alopias pelagicus 1
#> 13 seq3 ANGBF11723-15 Alopias pelagicus 1
#> 14 seq3 ANGBF10915-15 Alopias pelagicus 1
#> 15 seq3 ESHKB036-07 Alopias pelagicus 1
```
### auditOnID
This function adds an **audition** step ([Oliveira _et al._ 2016](https://onlinelibrary.wiley.com/doi/full/10.1111/jfb.13169)) to each selected specimen by `ID_engine()` (see above), given a certain threshold. This function, in turn, uses another function called `AuditionBarcodes()` (see below). This prior function is coupled with `auditOnID()` and can validate species names by taking accepted names from [Worms database](http://www.marinespecies.org/).
As seen with `ID_engine()`, this function starts with a fasta-formated file. We can take the same sample file we used: `fasta_file`. Then, we can identify those sequences under `audiOnID()` functionality by using a `threshold = 0.99` (i.e. 99%
of similarity):
```{r eval=FALSE}
aoID_out <- boldminer::auditOnID(seqs = fasta_file, threshold = 0.99)
#> |*****************************************************************| 100%
```
```{r eval=FALSE}
aoID_out
#> Samples Match Species Grades Observations
#> 1 seq1 Unique Alopias pelagicus A There were 49 matches. External congruence.
#> 2 seq2 Unique Alopias pelagicus A There were 39 matches. External congruence.
#> 3 seq3 Unique Alopias pelagicus A There were 94 matches. External congruence.
```
We can also skip audition step by adding `just_ID = TRUE` within arguments.
### AuditionBarcodes
Despite `AuditionBarcodes()` function is coupled with `auditOnID()` function, it can also work with just a list of names. Furthermore, there is an argument which enables to chose if sequences from GenBank are considered. It is pending, however, assess whether these sequences used to assess barcode's quality come from either a published article or direct submission:
```{r eval=FALSE}
species <- c( "Caretta caretta", "Bathygobius lineatus",
"Albula esuncula", "Vibilia armata",
"Alepisaurus ferox", "Diodon hystrix",
"Vesicomya galatheae", "Caranx ruber")
audit_out <- boldminer::AuditionBarcodes(species, exclude_ncbi = T, validate_name = T)
#> Auditing for:
#>
#> Caretta caretta
#> Bathygobius lineatus
#> Eretmochelys imbricata
#> Vibilia armata
#> Alepisaurus ferox
#> Diodon hystrix
#> Vesicomya galatheae
#> Caranx ruber
```
```{r eval=FALSE}
audit_out
#> Species Grades Observations BIN_structure
#> 1 Caretta caretta A Matched BIN with external congruence 'BOLD:AAB8364':{'Caretta caretta':11}
#> 2 Bathygobius lineatus B Matched BIN with internal congruence only 'BOLD:AAF0181':{'Bathygobius lineatus':4}
#> 3 Albula esuncula C Splitted BIN 'BOLD:AAF1162':{'Albula esuncula':3}, 'BOLD:AAA3538':{'Albula esuncula':1}
#> 4 Vibilia armata D Insufficient data. Institution storing: 2. Specimen records: 20 <NA>
#> 5 Alepisaurus ferox E** Mixtured BIN 'BOLD:AAC5235':{'Alepisaurus ferox':8}, 'BOLD:AAC5236':{'Alepisaurus brevirostris':3,'Alepisaurus ferox':3}
#> 6 Diodon hystrix E* Merged BIN 'BOLD:AAB0446':{'Diodon hystrix':17,'Diodon eydouxii': 1}
#> 7 Vesicomya galatheae F Barcodes mined from GenBank or unvouchered. <NA>
#> 8 Caranx ruber NA No specimen data available <NA>
```
Please notice that grades are obtained with accepted names of species according to [WoRMS database](http://www.marinespecies.org/) Rest service by using its taxamatch algorithm. Hence, since currently accepted names within `species` vector has not been figured out, unevenness between the column `BIN_structure` and `species` could pop up.