-
Notifications
You must be signed in to change notification settings - Fork 3
/
AnnotationHub.Rmd
433 lines (337 loc) · 14 KB
/
AnnotationHub.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
---
title: "AnnotationHub: Access the AnnotationHub Web Service"
output:
html_document:
toc: true
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
# Introduction
Finding and using public genomics data such as browser or chip-seq tracks; annotation for
genes, exons, transcripts; gene ontology and functional gene information; etc. often
requires quite a bit of work. Bioconductor has done some of this work already by
1. Finding and curating popular genomic resources
2. Using "recipes" to create R object versions of these resources
3. Make those resources available as a web service that is accessible from R
The `AnnotationHub` server provides easy _R / Bioconductor_ access to
large collections of publicly available whole genome resources,
e.g,. ENSEMBL genome fasta or gtf files, UCSC chain resources, ENCODE
data tracks at UCSC, etc.
To get started, make sure that you have the `AnnotationHub` package installed:
```{r eval=FALSE}
BiocManager::install('AnnotationHub')
```
# AnnotationHub objects
The `r Biocpkg("AnnotationHub")` package provides a client interface
to resources stored at the AnnotationHub web service.
```{r library, message=FALSE}
library(AnnotationHub)
```
The `r Biocpkg("AnnotationHub")` package is straightforward to use.
Create an `AnnotationHub` object
```{r AnnotationHub}
ah = AnnotationHub()
```
Now at this point you have already done everything you need in order
to start retrieving annotations. For most operations, using the
`AnnotationHub` object should feel a lot like working with a familiar
`list` or `data.frame`.
Lets take a minute to look at the show method for the hub object ah
```{r show}
ah
```
You can see that it gives you an idea about the different types of data that are present inside the hub. You can see where the data is coming from (dataprovider), as well as what species have samples present (species), what kinds of R data objects could be returned (rdataclass). We can take a closer look at all the kinds of data providers that are available by simply looking at the contents of dataprovider as if it were the column of a data.frame object like this:
```{r dataprovider}
unique(ah$dataprovider)
```
In the same way, you can also see data from different species inside the hub by looking at the contents of species like this:
```{r species}
head(unique(ah$species))
```
And this will also work for any of the other types of metadata present. You can learn which kinds of metadata are available by simply hitting the tab key after you type 'ah$'. In this way you can explore for yourself what kinds of data are present in the hub right from the command line. This interface also allows you to access the hub programatically to extract data that matches a particular set of criteria.
Another valuable types of metadata to pay attention to is the rdataclass.
```{r rdataclass}
head(unique(ah$rdataclass))
```
The rdataclass allows you to see which kinds of R objects the hub will return to you. This kind of information is valuable both as a means to filter results and also as a means to explore and learn about some of the kinds of annotation objects that are widely available for the project. Right now this is a pretty short list, but over time it should grow as we support more of the different kinds of annotation objects via the hub.
Now lets try getting the Chain Files from UCSC using the query and subset methods to selectively pare down the hub based on specific criteria.
The query method lets you search rows for
specific strings, returning an `AnnotationHub` instance with just the
rows matching the query.
From the show method, one can easily see that one of the dataprovider is
UCSC and there is a rdataclass for ChainFile
One can get chain files for Drosophila melanogaster from UCSC with:
```{r dm1}
dm <- query(ah, c("ChainFile", "UCSC", "Drosophila melanogaster"))
dm
```
Query has worked and you can now see that the only species present is
Drosophila melanogaster.
The metadata underlying this hub object can be retrieved by you
```{r show2}
df <- mcols(dm)
# what is df?
class(df)
head(df[,1:5])
```
By default the show method will only display the first 5 and last 5 rows.
There are already thousands of records present in the hub.
```{r length}
length(ah)
```
Lets look at another example, where we pull down only Inparanoid8 data
from the hub and use subset to return a smaller base object (here we
are finding cases where the genome column is set to panda).
```{r subset}
ahs <- query(ah, c('inparanoid8', 'ailuropoda'))
ahs
```
We can also look at the `AnnotationHub` object in a browser using the
`display()` function. We can then filter the `AnnotationHub` object
for _chainFile__ by either using the Global search field on the top
right corner of the page or the in-column search field for `rdataclass'.
```{r display, eval=FALSE}
d <- display(ah)
```
![](images/annotationHub.png)
Displaying and filtering the Annotation Hub object in a browser
By default 1000 entries are displayed per page, we can change this using
the filter on the top of the page or navigate through different pages
using the page scrolling feature at the bottom of the page.
We can also select the rows of interest to us and send them back to
the R session using 'Return rows to R session' button ; this sets a
filter internally which filters the `AnnotationHub` object. The names
of the selected AnnotationHub elements displayed at the top of the
page.
# Using `AnnotationHub` to retrieve data
Looking back at our chain file example, if we are interested in the file
dm1ToDm2.over.chain.gz, we can gets its metadata using
```{r dm2}
dm
dm["AH15146"]
```
We can download the file using
```{r dm3}
dm[["AH15146"]]
```
Each file is retrieved from the AnnotationHub server and the file is
also cache locally, so that the next time you need to retrieve it,
it should download much more quickly.
# Accessing Genome-Scale Data
## Non-model organism gene annotations
_Bioconductor_ offers pre-built `org.*` annotation packages for model
organisms, with their use described in the
[OrgDb](http://bioconductor.org/help/workflows/annotation/Annotation_Resources/#OrgDb)
section of the Annotation work flow. Here we discover available `OrgDb`
objects for less-model organisms.
The `query()` interface allows us to do full-text searching of
_AnnotationHub_ objects. How many OrgDb packages are represented in
_AnnotationHub_?
```{r less-model-org}
library(AnnotationHub)
## create the annotationhub object
ah <- AnnotationHub()
## Query the annotationhub metadata
query(ah, "OrgDb")
```
Let's assume that we are working with yeast (_Saccharomyces
cerevisiae_) and get the OrgDb package.
```{r}
sub_ah = query(ah, c("OrgDb", "cerevisiae"))
sub_ah
orgdb <- query(sub_ah, "OrgDb")[[1]]
```
Look at the `orgdb` object.
```{r}
orgdb
```
The `orgdb` object works like a little database. We can look at the
columns in the database.
```{r}
columns(orgdb)
```
Not all columns can always be used for "lookup", though. We need to know
which columns represent `keys` that we can use to retrieve data.
```{r}
keytypes(orgdb)
```
You can examine the values available for lookup by using the `keys()` method.
```{r}
head(keys(orgdb, keytype="ORF"))
head(keys(orgdb, keytype="GO"))
```
Notice that there are two columns that look useful for the `derisi`
data--`ORF` and `GENENAME`. Let's assume that we had not been given
the gene name as part of the `derisi` data, but only the ORF. We can
use the `OrgDb` `select()` interface to retrieve the gene names for
these five ORF ids, just to get a sense of how `select()` works to get
information from an `OrgDb` object.
```{r}
orfids = c("YAL001C", "YAL002W", "YAL003W", "YAL004W", "YAL005C")
select(orgdb, keys = orfids, columns = "GENENAME", keytype="ORF")
select(orgdb, keys = orfids, columns = "GO", keytype="ORF")
```
*Exercise:* Use the `derisi` ORF column to look up the associated gene
names.
*Exercise:* Use the `derisi` ORF column to look up the gene ontology [`GO`](http://geneontology.org/)
annotations for the ORFs. What is the general relationship between
`GO` terms and ORFs?
*Bonus Exercise:* Read the documentation associated with the GO.db
package
[here](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf). Use
the GO.db package and the select interface to look up the descriptions
and terms for the gene ontology terms from the previous exercise.
## Roadmap Epigenomics Project
All Roadmap Epigenomics files are hosted
[here](http://egg2.wustl.edu/roadmap/data/byFileType/). If one had to
download these files on their own, one would navigate through the web
interface to find useful files, then use something like the following
_R_ code.
```{r, eval=FALSE}
url <- "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3K4me1.broadPeak.gz"
filename <- basename(url)
download.file(url, destfile=filename)
if (file.exists(filename))
data <- import(filename, format="bed")
```
This would have to be repeated for all files, and the onus would lie
on the user to identify, download, import, and manage the local disk
location of these files.
`r Biocpkg("AnnotationHub")` reduces this task to just a few lines of _R_ code
```{r results='hide'}
library(AnnotationHub)
ah = AnnotationHub()
epiFiles <- query(ah, "EpigenomeRoadMap")
```
A look at the value returned by `epiFiles` shows us that
`r length(epiFiles)` roadmap resources are available via
`r Biocpkg("AnnotationHub")`. Additional information about
the files is also available, e.g., where the files came from
(dataprovider), genome, species, sourceurl, sourcetypes.
```{r}
epiFiles
```
A good sanity check to ensure that we have files only from the Roadmap Epigenomics
project is to check that all the files in the returned smaller hub object
come from _Homo sapiens_ and the `r unique(epiFiles$genome)` genome
```{r}
unique(epiFiles$species)
unique(epiFiles$genome)
```
Broadly, one can get an idea of the different files from this project
looking at the sourcetype
```{r}
table(epiFiles$sourcetype)
```
To get a more descriptive idea of these different files one can use:
```{r}
sort(table(epiFiles$description), decreasing=TRUE)
```
The 'metadata' provided by the Roadmap Epigenomics Project is also
available. Note that the information displayed about a hub with a
single resource is quite different from the information displayed when
the hub references more than one resource.
```{r}
metadata.tab <- query(ah , c("EpigenomeRoadMap", "Metadata"))
metadata.tab
```
So far we have been exploring information about resources, without
downloading the resource to a local cache and importing it into R.
One can retrieve the resource using `[[` as indicated at the
end of the show method
```{r echo=FALSE, results='hide'}
metadata.tab <- ah[["AH41830"]]
```
```{r}
metadata.tab <- ah[["AH41830"]]
```
The metadata.tab file is returned as a _data.frame_. The first 6 rows
of the first 5 columns are shown here:
```{r}
metadata.tab[1:6, 1:5]
```
One can keep constructing different queries using multiple arguments to
trim down these `r length(epiFiles)` to get the files one wants.
For example, to get the ChIP-Seq files for consolidated epigenomes,
one could use
```{r}
bpChipEpi <- query(ah , c("EpigenomeRoadMap", "broadPeak", "chip", "consolidated"))
```
To get all the bigWig signal files, one can query the hub using
```{r}
allBigWigFiles <- query(ah, c("EpigenomeRoadMap", "BigWig"))
```
To access the 15 state chromatin segmentations, one can use
```{r}
seg <- query(ah, c("EpigenomeRoadMap", "segmentations"))
```
If one is interested in getting all the files related to one sample
```{r}
E126 <- query(ah , c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126
```
Hub resources can also be selected using `$`, `subset()`, and
`display()`.
Hub resources are imported as the appropriate _Bioconductor_ object
for use in further analysis. For example, peak files are returned as
_GRanges_ objects.
```{r echo=FALSE, results='hide'}
peaks <- E126[['AH29817']]
```
```{r}
peaks <- E126[['AH29817']]
seqinfo(peaks)
```
BigWig files are returned as _BigWigFile_ objects. A _BigWigFile_ is a
reference to a file on disk; the data in the file can be read in using
`rtracklayer::import()`, perhaps querying these large files for
particular genomic regions of interest as described on the help page
`?import.bw`.
Each record inside `r Biocpkg("AnnotationHub")` is associated with a
unique identifier. Most _GRanges_ objects returned by
`r Biocpkg("AnnotationHub")` contain the unique AnnotationHub identifier of
the resource from which the _GRanges_ is derived. This can come handy
when working with the _GRanges_ object for a while, and additional
information about the object (e.g., the name of the file in the cache,
or the original sourceurl for the data underlying the resource) that
is being worked with.
```{r}
metadata(peaks)
ah[metadata(peaks)$AnnotationHubName]$sourceurl
```
# Configuring `AnnotationHub` objects
Most users will not need to do any specific configuration, but this
section is included for completeness.
When you create the `AnnotationHub` object, it will set up the object
for you with some default settings. See `?AnnotationHub` for ways to
customize the hub source, the local cache, and other instance-specific
options, and `?getAnnotationHubOption` to get or set package-global
options for use across sessions.
If you look at the object you will see some helpful information about
it such as where the data is cached and where online the hub server is
set to.
```{r show-2}
ah
```
By default the `AnnotationHub` object is set to the latest
`snapshotData` and a snapshot version that matches the version of
_Bioconductor_ that you are using. You can also learn about these data
with the appropriate methods.
```{r snapshot}
snapshotDate(ah)
```
If you are interested in using an older version of a snapshot, you can
list previous versions with the `possibleDates()` like this:
```{r possibleDates}
pd <- possibleDates(ah)
pd
```
Set the dates like this:
```{r setdate, eval=FALSE}
snapshotDate(ah) <- pd[1]
```
# Session info
```{r sessionInfo}
sessionInfo()
```