-
Notifications
You must be signed in to change notification settings - Fork 3
/
F_epigenomic_ranges.Rmd.bak
845 lines (662 loc) · 27 KB
/
F_epigenomic_ranges.Rmd.bak
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
---
title: 'Epigenomic Data Science with Bioconductor'
subtitle: "Using the Bioconductor Ranges Infrastructure"
author: 'Martin Morgan & Sean Davis <[email protected]>'
output: html_document
---
# _Bioconductor_ 'infrastructure' for sequence analysis
```{r style, echo = FALSE, results = 'asis'}
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(GenomicRanges)
library(GenomicAlignments)
})
```
We will need to install a few bioconductor packages for this section.
```{r eval=FALSE}
source('https://bioconductor.org/biocLite.R')
biocLite(c("GenomicAlignments",
"BSgenome.Hsapiens.UCSC.hg19",
"AnnotationHub",
"TxDb.Hsapiens.BioMart.igis",
"rtracklayer",
"readr",
"ComplexHeatmap",
"ggbio"))
```
## Introduction
### Classes, methods, and packages
This section focuses on classes, methods, and packages, with the goal
being to learn to navigate the help system and interactive discovery
facilities.
### Motivation
Sequence analysis is specialized
- Large data needs to be processed in a memory- and time-efficient manner
- Specific algorithms have been developed for the unique
characteristics of sequence data
Additional considerations
- Re-use of existing, tested code is easier to do and less error-prone
than re-inventing the wheel.
- Interoperability between packages is easier when the packages share
similar data structures.
Solution: use well-defined _classes_ to represent complex data;
_methods_ operate on the classes to perform useful functions. Classes
and methods are placed together and distributed as _packages_ so that
we can all benefit from the hard work and tested code of others.
## Core packages
<pre>
VariantAnnotation
|
v
GenomicFeatures
|
v
BSgenome
|
v
rtracklayer
|
v
GenomicAlignments
| |
v v
SummarizedExperiment Rsamtools ShortRead
| | | |
v v v v
GenomicRanges Biostrings
| |
v v
GenomeInfoDb (XVector)
| |
v v
IRanges
|
v
(S4Vectors)
</pre>
## _IRanges_ and _GRanges_
The [IRanges][] package defines an important class for specifying
integer ranges, e.g.,
```{r iranges}
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
```
There are many interesting operations to be performed on ranges, e.g,
`flank()` identifies adjacent ranges
```{r iranges-flank}
flank(ir, 3)
```
The `IRanges` class is part of a class hierarchy. To see this, ask R for
the class of `ir`, and for the class definition of the `IRanges` class
```{r iranges-class}
class(ir)
getClass(class(ir))
```
Notice that `IRanges` extends the `Ranges` class. Show
Now try entering `?flank` (if not using _RStudio_, enter
`?"flank,<tab>"` where `<tab>` means to press the tab key to ask for
tab completion). You can see that there are help pages for `flank`
operating on several different classes. Select the completion
```{r iranges-flank-method, eval=FALSE}
?"flank,Ranges-method"
```
and verify that you're at the page that describes the method relevant
to an `IRanges` instance. Explore other range-based operations.
The [GenomicRanges][] package extends the notion of ranges to include
features relevant to application of ranges in sequence analysis,
particularly the ability to associate a range with a sequence name
(e.g., chromosome) and a strand. Create a `GRanges` instance based on
our `IRanges` instance, as follows
```{r granges}
library(GenomicRanges)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
```
The notion of flanking sequence has a more nuanced meaning in
biology. In particular we might expect that flanking sequence on the
`+` strand would precede the range, but on the minus strand would
follow it. Verify that `flank` applied to a `GRanges` object has this
behavior.
```{r granges-flank}
flank(gr, 3)
```
Discover what classes `GRanges` extends, find the help page
documenting the behavior of `flank` when applied to a `GRanges` object,
It seems like there might be a number of helpful methods available for
working with genomic ranges; we can discover some of these from the
command line, indicating that the methods should be on the current
`search()` path
```{r granges-methods}
methods(class="GRanges")
```
Notice that the available `flank()` methods have been augmented by the
methods defined in the _GenomicRanges_ package, including those that are relevant (via inheritance) to the _GRanges_ class.
```{r granges-flank-method}
grep("flank", methods(class="GRanges"), value=TRUE)
```
Verify that the help page documents the behavior we just observed.
```{r granges-flank-method-help, eval=FALSE}
?"flank,GenomicRanges-method"
```
Use `help()` to list the help pages in the `GenomicRanges` package,
and `vignettes()` to view and access available vignettes; these are
also available in the Rstudio 'Help' tab.
```{r granges-man-and-vignettes, eval=FALSE}
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
```
### The `GRanges` and `GRangesList` classes
Aside: 'TxDb' packages provide an R representation of gene models
```{r txdb}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
```
`exons()`: _GRanges_
```{r txdb-exons}
exons(txdb)
```
`exonsBy()`: _GRangesList_
```{r txdb-exonsby}
exonsBy(txdb, "tx")
```
_GRanges_ / _GRangesList_ are incredibly useful
- Represent **annotations** -- genes, variants, regulatory elements,
copy number regions, ...
- Represent **data** -- aligned reads, ChIP peaks, called variants,
...
### Algebra of genomic ranges
Many biologically interesting questions represent operations on ranges
- Count overlaps between aligned reads and known genes --
`GenomicRanges::summarizeOverlaps()`
- Genes nearest to regulatory regions -- `GenomicRanges::nearest()`,
[ChIPseeker][]
- Called variants relevant to clinical phenotypes --
[VariantFiltering][]
_GRanges_ Algebra
- Intra-range methods
- Independent of other ranges in the same object
- GRanges variants strand-aware
- `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`,
`restrict()`, `trim()`
- See `?"intra-range-methods"`
- Inter-range methods
- Depends on other ranges in the same object
- `range()`, `reduce()`, `gaps()`, `disjoin()`
- `coverage()` (!)
- see `?"inter-range-methods"`
- Between-range methods
- Functions of two (or more) range objects
- `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`,
`%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`,
`pintersect()`, `psetdiff()`
![Alt Ranges Algebra](images/RangeOperations.png)
## _Biostrings_ (DNA or amino acid sequences)
Classes
- XString, XStringSet, e.g., DNAString (genomes),
DNAStringSet (reads)
Methods --
- [Cheat sheat](http://bioconductor.org/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf)
- Manipulation, e.g., `reverseComplement()`
- Summary, e.g., `letterFrequency()`
- Matching, e.g., `matchPDict()`, `matchPWM()`
Related packages
- [BSgenome][]
- Whole-genome representations
- Model and custom
- [ShortRead][]
- FASTQ files
Example
- Whole-genome sequences are distrubuted by ENSEMBL, NCBI, and others
as FASTA files; model organism whole genome sequences are packaged
into more user-friendly `BSgenome` packages. The following
calculates GC content across chr14.
```{r BSgenome-require, message=FALSE}
library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
chr14_dna <- getSeq(Hsapiens, chr14_range)
letterFrequency(chr14_dna, "GC", as.prob=TRUE)
```
## Other formats and packages
![Alt Files and the Bioconductor packages that input them](images/FilesToPackages.png)
Acknowledgements
- Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester,
Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan
Tenenbaum.
- The research reported in this presentation was supported by the
National Cancer Institute and the National Human Genome Research
Institute of the National Institutes of Health under Award numbers
U24CA180996 and U41HG004059, and the National Science Foundation
under Award number 1247813. The content is solely the responsibility
of the authors and does not necessarily represent the official views
of the National Institutes of Health or the National Science
Foundation.
[AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi
[BSgenome]: http://bioconductor.org/packages/BSgenome
[BiocParallel]: http://bioconductor.org/packages/BiocParallel
[Biostrings]: http://bioconductor.org/packages/Biostrings
[CNTools]: http://bioconductor.org/packages/CNTools
[ChIPQC]: http://bioconductor.org/packages/ChIPQC
[ChIPpeakAnno]: http://bioconductor.org/packages/ChIPpeakAnno
[DESeq2]: http://bioconductor.org/packages/DESeq2
[DiffBind]: http://bioconductor.org/packages/DiffBind
[GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments
[GenomicRanges]: http://bioconductor.org/packages/GenomicRanges
[IRanges]: http://bioconductor.org/packages/IRanges
[KEGGREST]: http://bioconductor.org/packages/KEGGREST
[PSICQUIC]: http://bioconductor.org/packages/PSICQUIC
[rtracklayer]: http://bioconductor.org/packages/rtracklayer
[Rsamtools]: http://bioconductor.org/packages/Rsamtools
[ShortRead]: http://bioconductor.org/packages/ShortRead
[VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation
[VariantFiltering]: http://bioconductor.org/packages/VariantFiltering
[VariantTools]: http://bioconductor.org/packages/VariantTools
[biomaRt]: http://bioconductor.org/packages/biomaRt
[cn.mops]: http://bioconductor.org/packages/cn.mops
[h5vc]: http://bioconductor.org/packages/h5vc
[edgeR]: http://bioconductor.org/packages/edgeR
[ensemblVEP]: http://bioconductor.org/packages/ensemblVEP
[limma]: http://bioconductor.org/packages/limma
[metagenomeSeq]: http://bioconductor.org/packages/metagenomeSeq
[phyloseq]: http://bioconductor.org/packages/phyloseq
[snpStats]: http://bioconductor.org/packages/snpStats
[org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db
[TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[BSgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
## Public Data and Annotation from _AnnotationHub_
### Roadmap Epigenomics Project
All Roadmap Epigenomics files are hosted
[here](http://egg2.wustl.edu/roadmap/data/byFileType/). If one had to
download these files on their own, one would navigate through the web
interface to find useful files, then use something like the following
_R_ code.
```{r, eval=FALSE}
url <- "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3K4me1.broadPeak.gz"
filename <- basename(url)
download.file(url, destfile=filename)
if (file.exists(filename))
data <- import(filename, format="bed")
```
This would have to be repeated for all files, and the onus would lie
on the user to identify, download, import, and manage the local disk
location of these files.
AnnotationHub reduces this task to just a few lines of _R_ code
```{r results='hide', cache=FALSE}
library(AnnotationHub)
ah = AnnotationHub()
epiFiles <- query(ah, "EpigenomeRoadMap")
```
A look at the value returned by `epiFiles` shows us that
`r length(epiFiles)` roadmap resources are available via
AnnotationHub. Additional information about
the files is also available, e.g., where the files came from
(dataprovider), genome, species, sourceurl, sourcetypes.
```{r}
epiFiles
```
A good sanity check to ensure that we have files only from the Roadmap Epigenomics
project is to check that all the files in the returned smaller hub object
come from _Homo sapiens_ and the `r unique(epiFiles$genome)` genome
```{r}
unique(epiFiles$species)
unique(epiFiles$genome)
```
Broadly, one can get an idea of the different files from this project
looking at the sourcetype
```{r}
table(epiFiles$sourcetype)
```
To get a more descriptive idea of these different files one can use:
```{r}
head(sort(table(epiFiles$description), decreasing=TRUE))
```
The 'metadata' provided by the Roadmap Epigenomics Project is also
available. Note that the information displayed about a hub with a
single resource is quite different from the information displayed when
the hub references more than one resource.
```{r}
metadata.tab <- query(ah , c("EpigenomeRoadMap", "Metadata"))
metadata.tab
```
So far we have been exploring information about resources, without
downloading the resource to a local cache and importing it into R.
One can retrieve the resource using `[[` as indicated at the
end of the show method
```{r echo=FALSE, results='hide'}
metadata.tab <- ah[["AH41830"]]
```
```{r}
metadata.tab <- ah[["AH41830"]]
```
The metadata.tab file is returned as a _data.frame_. The first 6 rows
of the first 5 columns are shown here:
```{r}
metadata.tab[1:6, 1:5]
```
One can keep constructing different queries using multiple arguments to
trim down these `r length(epiFiles)` to get the files one wants.
For example, to get the ChIP-Seq files for consolidated epigenomes,
one could use
```{r}
bpChipEpi <- query(ah , c("EpigenomeRoadMap", "broadPeak", "chip", "consolidated"))
```
To get all the bigWig signal files, one can query the hub using
```{r}
allBigWigFiles <- query(ah, c("EpigenomeRoadMap", "BigWig"))
```
To access the 15 state chromatin segmentations, one can use
```{r}
seg <- query(ah, c("EpigenomeRoadMap", "segmentations"))
```
If one is interested in getting all the files related to one sample
```{r}
E126 <- query(ah , c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126
```
Hub resources can also be selected using `$`, `subset()`, and
`display()`; see the main
[_AnnotationHub_ vignette](AnnotationHub.html) for additional detail.
Hub resources are imported as the appropriate _Bioconductor_ object
for use in further analysis. For example, peak files are returned as
_GRanges_ objects.
```{r echo=FALSE, results='hide'}
peaks <- E126[['AH29817']]
```
```{r}
peaks <- E126[['AH29817']]
seqinfo(peaks)
```
BigWig files are returned as _BigWigFile_ objects. A _BigWigFile_ is a
reference to a file on disk; the data in the file can be read in using
`rtracklayer::import()`, perhaps querying these large files for
particular genomic regions of interest as described on the help page
`?import.bw`.
Each record inside AnnotationHub is associated with a
unique identifier. Most _GRanges_ objects returned by
AnnotationHub contain the unique AnnotationHub identifier of
the resource from which the _GRanges_ is derived. This can come handy
when working with the _GRanges_ object for a while, and additional
information about the object (e.g., the name of the file in the cache,
or the original sourceurl for the data underlying the resource) that
is being worked with.
```{r}
metadata(peaks)
ah[metadata(peaks)$AnnotationHubName]$sourceurl
```
### Ensembl GTF TxDb gene models
_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
_TxDb.Hsapiens.UCSC.hg38.knownGene_
or can be constructed using functions such as
`GenomicFeatures::makeTxDbFromBiomart()`.
_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. We are going to be working with human data
that are mapped to GRCh37 (hg19). We have a choice of providers (UCSC,
NCBI, Ensembl), but we will just choose Ensembl here to minimize the
amount of ID changing.
```{r ah-gene-models}
query(ah,c('gtf','ensembl','sapiens','GRCh37'))
```
We see that there is a GTF file describing gene models.
The GTF file is imported as a _GRanges_ instance, but with a number of
columns that allow the _Granges_ object to encode the complexity of
exons, CDS, and utrs.
```{r}
gtf <- ah[['AH10684']]
head(gtf, 3)
```
It is trivial to make a transcript database, or _TxDb_, instance of
these data. We will use this _TxDb_ to quickly access gene annotation
in genome context.
```{r gtf2txdb,cache=FALSE,message=FALSE}
library(GenomicFeatures) # for makeTxDbFromGRanges
txdb <- makeTxDbFromGRanges(gtf)
````
## Epigenomics case study
We are going to use data from the
[ENCODE project](https://www.genome.gov/10005107/)
as a playground for using the Bioconductor ranges infrastructure. To
keep things manageable, we will focus on only the K562 leukemia cell
line and on histone marks. To keep the data smallish, but without loss
of generality, we will use data that have been processed using a
peak-caller and summarized as regions of statistically significant
regions.
Bioconductor has processed all of the ENCODE data into Bioconductor
data structures (as well as thousands of other datasets) and then
stored and made available via the
_AnnotationHub_ system we saw above.
```{r}
query(ah,c('K562','h3k','E123','narrow'))
histoneSets = names(query(ah,c('K562','E123','h3k','narrow')))
```
The `histoneSets` variable now contains the *names* _AnnotationHub_
datasets from histone modification experiments done in K562 cells and
summarized using a bioinformatics protocol tunes to produce narrow
peaks. However, the data still need to be downloaded and then loaded
into R memory.
_To load your own peaks into R, check out the `rtracklayer::import()` function._
### Peak density across the genome
Start (arbitrarily) with the first `histoneSet` by loading the dataset.
```{r}
histonePeaks = ah[[histoneSets[1]]]
```
The first time the code above is executed, the data will be *both* downloaded *and* loaded into the variable `histonePeaks`. The _next_ time that this line is run, a "cached" version of the data will be used, so no download needs to occur the second time.
```{r}
si = seqinfo(histonePeaks)
si
```
The next block of code looks long, but the main *purpose* of the code
is to produce 100kb windows across the genomic so that we can count
the number of H3K4me1 peaks in each 100kb window. Most of the lines
are do deal with small issues of keeping chromosome names and lengths
straight when transforming our data.
```{r}
fullchroms = as(si,'GRanges')
seqinfo(fullchroms)
# 100kb bins
# This is the work--using tile()
tile0.1mb = unlist(tile(fullchroms,width=1e5))
seqinfo(tile0.1mb) = si
tile0.1mb = keepStandardChromosomes(tile0.1mb,pruning.mode = 'coarse')
histonePeaks = keepStandardChromosomes(histonePeaks,pruning.mode = 'coarse')
seqlevelsStyle(tile0.1mb) = seqlevelsStyle(histonePeaks)
histonePeaks=dropSeqlevels(histonePeaks,'chrM')
genome(tile0.1mb) = 'hg19'
```
Counting peaks in each 100kb window is *VERY* fast using the
`counterOverlaps()` method.
```{r}
tile0.1mb$peakCount = countOverlaps(tile0.1mb,histonePeaks)
```
While the following plot is not as informative as it could be,
sometimes using a circular layout is a good way to show an overview of
full genome datasets. Data on the scale of 100kb is probably still too
fine-level for such a plot.
```{r message=FALSE}
library(ggbio)
ggbio() + circle(tile0.1mb,geom='point', aes(y = peakCount),
color='red', alpha=0.25, size=0.25, grid=TRUE,
radius=60, trackWidth=30) +
circle(tile0.1mb, geom = "scale", size = 2,radius=90) +
circle(tile0.1mb, geom='ideo',fill='gray70')
```
Getting the `GC` content of the peaks is also straightforward. When you have two peaksets
```{r}
dnaInPeaks = getSeq(BSgenome.Hsapiens.UCSC.hg19,histonePeaks)
percentGC = letterFrequency(dnaInPeaks,'GC',as.prob=TRUE)
```
To get a comparison set, just shift all regions to the left by 5kb (for example).
```{r}
shiftedPeaks = shift(histonePeaks,-5000)
head(ranges(histonePeaks))
# data shifted 5kb to the left
head(ranges(shiftedPeaks))
```
Grab the `GC` content and plot.
```{r}
dnaInShiftedPeaks = getSeq(BSgenome.Hsapiens.UCSC.hg19,shift(histonePeaks,-5000))
percentGC1 = letterFrequency(dnaInShiftedPeaks,'GC',as.prob=TRUE)
plot(density(percentGC))
lines(density(percentGC1),col='red')
```
### Compare peak sets
As with many datasets in biology, learning about how data relate to
each other can be a very powerful technique for hypothesis
generation. The goal of the next section is to determine the
similarities among a set of 10 histone mark datasets in hopes of
generating some hypotheses about the functionality of the histone
marks.
Start by loading the datasets that we found above for the K562 cell line.
```{r message=FALSE}
hpList = sapply(histoneSets,function(ahname) ah[[ahname]])
# get the "titles" for the experiment
# And cleanup the messy names
hptitles = gsub('E123-|\\.narrowPeak\\.gz','',ah[histoneSets]$title)
names(hpList) = hptitles
hptitles
```
How can we compare two sets of regions?
The Jaccard Similarity measure quantifies the proportion of shared peaks (or proportion of bases covered by the shared peaks) to that covered by the _union_ of those peak sets. In mathematical terms, the _Jaccard Index_ measures this similarity.
$$ J(A,B) = \frac{\left | A \bigcap B \right |}{\left | A \bigcup B \right |} $$
The _Jaccard Distance_ simply measures how dissimilar two sets of regions are (1-J(A,b)).
$$ d_J(A,B) = 1 - J(A,B) = 1 - \frac{\left | A \bigcap B \right |}{\left | A \bigcup B \right |} $$
The `regionJaccard()` function takes as input two sets of regions and calculates the Jaccard Index using base-level overlaps (as opposed to whole-region overlaps).
```{r}
#' Calculate Jaccard Similarity between two sets of ranges
#'
#' @param a A set of ranges
#' @param b A second set of ranges
#'
#' @details We calculate the Jaccard Distance
#' as a simple ratio of the number of
#' shared bases in the regions to
#' total number of bases covered by
#' both sets.
#'
#' @return a value between 0 and 1
regionJaccard = function(a,b) {
sum(width(intersect(a,b)))/sum(width(union(a,b)))
}
```
We can try this on two peakSets:
```{r}
regionJaccard(hpList[[1]],hpList[[2]])
```
How would you interpret the result?
To make addressing the question above a bit more intuitive, we can perform the same calculation, but between all pairs of regions using the `multiRegStat()` function below.
```{r}
multiRegStat <- function(a,b, fun) {
res <- matrix(0L, length(a), length(b))
for (i in seq_along(a))
{
for (j in seq_along(b))
{
res[i,j] = fun(a[[i]],b[[j]])
}
}
if(!is.null(names(a))) rownames(res) = names(a)
if(!is.null(names(b))) colnames(res) = names(b)
res
}
```
```{r}
res = multiRegStat(hpList,hpList,regionJaccard)
res
```
The `res` matrix now describes the pairwise similarities between the various histone mark peak sets. Making a heatmap and annotating it with knowledge of [wikipedia, for example](https://en.wikipedia.org/wiki/Histone_code) is potentially helpful.
```{r}
library(ComplexHeatmap)
annot = data.frame(exprStatus=c('Active','Active','Active','Active',
'Active','Repressed','Active','Repressed',NA,'Active'),
locale=c(NA,NA,'Promoter','Promoter',NA,'Body','Enhancer','Body','Body',NA),
row.names=hptitles)
Heatmap(res,top_annotation=HeatmapAnnotation(
df=annot,col=list(exprStatus=c('Active'='yellow',Repressed='blue'))))
```
What do you think of the plot above?
### Histone profiles
In the following code block, we:
1. Grab the "regions" of "promoters", defined as a window around the transcript start site.
- Subsample the promoters randomly to reduce runtime and memory usage.
2. Calculate "coverage" of our peaks. For each base in the genome, we will have a single number, mostly zeros, and some ones.
3. Create slices of our coverage data, called "Views".
4. Convert the slices of coverage to rows of a matrix.
5. Sum the columns of the matrix to produce a summary binding profile.
```{r}
wndw = 2000
proms = promoters(txdb, upstream = wndw, downstream = wndw,
columns = c('tx_name','gene_id'))
proms = proms[sample(seq_along(proms),20000,replace=FALSE)]
proms = keepStandardChromosomes(proms,pruning.mode='tidy')
covg = coverage(hpList[[1]])
seqlevelsStyle(proms) = seqlevelsStyle(covg)
covg = covg[seqlevels(proms)]
vi = Views(covg,as(proms[strand(proms)=='+',],'RangesList'))
m = as.matrix(vi)
plot(colSums(m,na.rm=TRUE),type='l')
```
We can wrap all this up as a function to reuse as we see fit.
```{r}
histoneProfile = function(promoters,peaks,n_promoters=10000) {
proms = promoters
proms = proms[sample(seq_along(proms),min(length(proms),n_promoters),replace=FALSE)]
proms = keepStandardChromosomes(proms,pruning.mode='tidy')
covg = coverage(peaks)
seqlevelsStyle(proms) = seqlevelsStyle(covg)
covg = covg[seqlevels(proms)]
vi = Views(covg,as(proms[strand(proms)=='+',],'RangesList'))
return(colSums(as.matrix(vi),na.rm = TRUE))
}
```
Now, we are going to load gene expression for K562 so that we can look
at how the binding profiles vary with expression. The gene expression
from ENCODE data is available from many sources. In this case, we are
downloading directly from Ensembl in Europe.
```{r}
library(readr)
k562expr = read_tsv('https://www.encodeproject.org/files/ENCFF812ADD/@@download/ENCFF812ADD.tsv')
head(k562expr)
```
These data have several columns of values, including `tpm` or
transcripts-per-million, which we be using as our measure of
expression. Go ahead and take a look at the distribution and other
characteristics of the `tpm` column.
```{r}
# Fix gene names
k562expr$gene_id=sub('\\..*','',k562expr$gene_id)
k562expr = subset(k562expr,grepl('ENSG',gene_id))
```
Below, I am simply attaching the expression values to their associated
promoters based on gene id matching.
```{r}
proms$TPM = k562expr$TPM[match(unlist(proms$gene_id),k562expr$gene_id)]
proms$logTPM =
log10(k562expr$TPM[match(unlist(proms$gene_id),k562expr$gene_id)])
head(proms)
```
Now, subset promoters into those with low expression and those with high.
```{r }
promslow = proms[which(proms$logTPM<median(proms$logTPM,na.rm=TRUE))]
promshigh = proms[which(proms$logTPM>=median(proms$logTPM,na.rm=TRUE))]
```
And finally, produce a plot of two histone marks with low and high
expression.
```{r warning=FALSE}
mat = cbind(histoneProfile(promslow,hpList[[1]]),
histoneProfile(promshigh,hpList[[1]]),
histoneProfile(promslow,hpList[[10]]),
histoneProfile(promshigh,hpList[[10]]))
matplot(mat,type='l',main=paste(hptitles[1],hptitles[[10]], sep=" & "))
legend(0,2000,legend=c('H3K4me1_low','H3K4me1_high','H3K79me2_low','H3K79me2_high'),
col=1:4,lty=1:4)
```
## `sessionInfo()`
```{r sessionInfo}
sessionInfo()
```