-
Notifications
You must be signed in to change notification settings - Fork 1
/
feather_genoprob.Rmd
244 lines (178 loc) · 5.74 KB
/
feather_genoprob.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: "Feather Genoprob Features"
author: "Brian S. Yandell"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Create feather_genoprob for qtl2 data}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 7, fig.height = 5)
```
The motivation is to reduce memory usage. Matt told us that memory is his greatest obstacle, when running QTL browsers because he cannot add more `R/plumber` workers. Also, if you have a cheap Windows laptop you cannot even open many DO `.RData` projects simply because haplotype probability object is too big.
The solution would be to store data in some database and load just a slice you need at a time. I experimented with `SQLite` but it was too slow. Then I try feather and it run fine (R script attached – crunching 1000 SNPs at a time). It just does not use `qtl2scan` effectively because the package is not prepared for that.
Task: refactor this Rmd to use DOex and run some examples.
```{r}
suppressPackageStartupMessages({
library(qtl2)
library(qtl2feather)
})
```
```{r}
DOex <-
read_cross2(
file.path("https://raw.githubusercontent.com/rqtl",
"qtl2data/master/DOex",
"DOex.zip"))
```
Calculate genotype probabilities and convert to allele probabilities
```{r}
pr <- calc_genoprob(DOex, error_prob=0.002)
apr <- genoprob_to_alleleprob(pr)
```
Write feather database of pr.
```{r}
tmpdir <- tempdir()
```
```{r}
fpr <- feather_genoprob(pr, "pr", tmpdir)
fapr <- feather_genoprob(apr, "apr", tmpdir)
```
```{r}
list.files(tmpdir)
```
Methods `names()` and `length()` both work properly.
```{r}
names(fpr)
```
```{r}
length(fpr)
```
Methods `dim()` and `dimnames()` give dimensions across the chromosomes.
Notice that `fpr` has different genotypes for the `X` chromosome.
```{r}
dim(fpr)
```
```{r}
str(dimnames(fpr))
```
### Selecting one chromosome
Selecting a chromosome causes reading from feather database and creation of an array.
```{r}
dim(fapr[["X"]])
```
Alternate form.
```{r}
dim(fapr$X)
```
### Subsetting by ind, chr, mar
Subsetting using `subset(object,ind,chr,mar)` or `[ind,chr,mar]` only adjusts the vector of
individuals, chromosomes and markers, but does not alter the feather database.
```{r}
dim(subset(fapr, ind=1:20, chr=c("2","3")))
```
You can also use `[]` to subset.
```{r}
dim(fapr[1:20, c("2","3")][["3"]])
```
```{r}
f2 <- fapr[,"2"]
f23 <- fapr[,c("2","3")]
fx <- fapr[,"X"]
```
There is a third dimension for markers. However, be careful that if you select a subset of markers that excludes one or more chromosomes, those will be dropped.
```{r}
dim(subset(fapr, mar=1:30))
```
```{r}
dim(fapr[ , , dimnames(fapr)$mar$X[1:20]])
```
### Binding by columns or rows.
Binding by columns (chromosomes) or rows (individuals) may cause creation of a new feather database if input objects arose from different feather databases. However, if objects are subsets of the same `feather_genoprob` object, then it reuses the one feather database. Further, if objects have the same directory and file basename for their feather databases, they will be combined without creation of any new feather databases.
See `example(cbind.feather_genoprob)` and `example(rbind.feather_genoprob)` with objects having distinct feather databases.
```{r}
newf23 <- cbind(f2,f23)
```
Row bind.
```{r}
f23a <- fapr[1:20, c("2","3")]
f23b <- fapr[40:79, c("2","3")]
f23 <- rbind(f23a, f23b)
```
Subset on markers. This way only extracts the selected `markers` from feather database before creating the array.
```{r}
markers <- dimnames(fapr$X)[[3]][1:10]
dim(fapr[,,markers]$X)
```
This way extracts all markers on `X`, creates the array, then subsets on selected `markers`.
```{r}
markers <- dimnames(fapr$X)[[3]][1:10]
dim(fapr$X[,,markers])
```
Two `feather_genoprob` objects using same area. Combine using `cbind`. Notice that the order of chromosomes is reversed by joining `fapr2` to `fapr3`. Be sure to not overwrite existing feather databases!
```{r}
fapr2 <- feather_genoprob(subset(apr, chr="2"), "aprx", tmpdir)
fapr3 <- feather_genoprob(subset(apr, chr="3"), "aprx", tmpdir)
```
```{r}
fapr32 <- cbind(fapr3,fapr2)
```
```{r}
dim(fapr32)
```
```{r}
list.files(tmpdir)
```
### Looking under the hood at `feather_genoprob` object
Here are the names of elements in a `feather_genoprob` object:
```{r}
names(unclass(fapr))
```
```{r}
unclass(fapr)$feather
```
```{r}
sapply(unclass(fapr)[c("ind","chr","mar")], length)
```
A `feather_genoprob` object has all the original information. Thus, it is possible to restore the original object from a `subset` (but not necessarily from a `cbind` or `rbind`). Here is an example.
```{r}
fapr23 <- subset(fapr, chr=c("2","3"))
dim(fapr23)
```
```{r}
dim(feather_genoprob_restore(fapr23))
```
## Time trial simulation
Compare times to create `pr` once and to Read `fpr` 100 times.
```{r}
system.time(pr <- calc_genoprob(DOex, error_prob=0.002))
```
```{r}
system.time(fpr <- feather_genoprob(pr, "pr", tmpdir, verbose = FALSE))
```
```{r}
tmpfile <- tempfile()
saveRDS(pr, file = tmpfile)
size_pr <- file.size(tmpfile) * 1e-6
unlink(tmpfile)
```
The `pr` object is `r format(object.size(pr), units = "MB")` in `R` or `r size_pr` Mb if saved as RDS file, while the `fpr` object is `r format(object.size(fpr), units = "MB")` in R with additional
`r sum(file.size(unclass(fpr)$feather)) * 1e-6` Mb for saved feather databases.
#### Time to extract chromosome information.
Extract chromosome `"2"` 100 times.
```{r}
system.time({
for(i in seq(100))
tmp <- fpr[["2"]]
})
```
Extract first 50 markers from chromosome `"2"` 100 times.
```{r}
markers <- dimnames(fpr)$mar[["2"]][1:50]
system.time({
for(i in seq(100))
tmp <- fpr[["2"]]
})
```