-
Notifications
You must be signed in to change notification settings - Fork 6
/
data.table.tutorial.Rmd
401 lines (298 loc) · 11.5 KB
/
data.table.tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
---
title: "data.table tutorial"
author: "ned haughton"
date: >
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(comment = NA)
format(Sys.time(), '%B %d, %Y')
```
output:
html_document:
toc: true
toc_depth: 3
---
```{r library load}
# install.packages('data.table')
library(data.table)
```
# R `data.table` tutorial
This tutorial is based largely on these documents:
- the [data.table intro](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf), as well as `help(data.table)`
- [Matt Dowle's tutorial](http://user2014.stat.ucla.edu/files/tutorial_Matt.pdf)
- the [datacamp tutorial](http://blog.datacamp.com/data-table-r-tutorial/)
- Stackoverflow questions:
[here](http://stackoverflow.com/questions/27511604/dplyr-on-data-table-am-i-really-using-data-table), [here](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly/27840349#27840349)
## What is data.table, and why should you use it?
- (almost) a drop in replacement for `data.frame`
- works the same with plotting functions
- More concise syntax
- More memory efficient, and often much quicker with data transformations, especially for big data.
- Built in support for groupby operations
#### cons:
- steep learning curve
- some new gotchas
- some data.frame syntax won't work
- by-reference modifications means you might need to `copy()` in some instances
## As a drop-in replacement
```{r classes}
class(iris)
iris_dt <- data.table(iris)
class(iris_dt)
```
`data.table`s print nicely:
```{r printing}
iris_dt
```
It's easy to convert back to a `data.frame`
```{r convert to df}
identical(iris, as.data.frame(iris_dt))
```
In-place conversion with `setDT(df)` and `setDF(dt)` - no assignment.
`data.table` works fine with `dplyr`, you can freely mix and match syntax.
`data.table` implements a bunch of optimisations for standard R methods for working with data.frame: `order()`, `merge()`, etc. so you can use that syntax if you prefer, and retain the speed and memory benefits.
## Selecting
data.table's selection uses the format `dt[i, j, by]`.
- `i` is the row selector. It is usually a number, a vector of numbers, or a boolean vector.
- `j` is the column selector. It is usually a column or a **list** of columns (unquoted). You can modify the data in the table with this argument.
- `by` allows group-by operations. It should be a column or **list** of columns (unquoted).
These three arguments can be passed in order without their names
If you use SQL or dplyr, think of it like this:
+------------+------------+---------------------------+----------+
| package | rows | cols | grouping |
+============+============+===========================+==========+
| data.table | i | j | by |
+------------+------------+---------------------------+----------+
| dplyr | filter | select, mutate, summarise | group_by |
+------------+------------+---------------------------+----------+
| SQL | WHERE | SELECT | GROUP BY |
+------------+------------+---------------------------+----------+
Detailed help with arguments at `help(data.table)`.
### Selecting rows with `i`
```{r i select}
iris_dt[30:32]
```
Column names are evaluated, so you can use them with comparators to select rows
```{r i boolean select}
iris_dt[Petal.Length > 6.5]
iris_dt[Sepal.Length < 5 & Species == 'virginica']
```
### selecting columns with `j`
Selecting a single column returns a vector
```{r j select}
iris_dt[45:55, Petal.Length]
```
Selecting with a list of columns returns a data.table
```{r j select multiple}
iris_dt[, list(Petal.Width, Petal.Length)]
```
You can of course combine these:
```{r ij selection}
iris_dt[Species=='setosa' & Sepal.Length<4.5, list(Petal.Width, Petal.Length)]
```
## Aggregation
- the `j` argument can accept functions of columns, or lists of functions of columns:
```{r aggregate}
iris_dt[, mean(Petal.Length)]
iris_dt[, list(mean(Sepal.Length),mean(Petal.Length))]
```
With names:
```{r named aggregate}
iris_dt[, list(mean_SL=mean(Sepal.Length),mean_PL=mean(Petal.Length))]
```
### Group-by
Group-by Species, and then average:
```{r groupby}
iris_dt[, list(mean_SL=mean(Sepal.Length), mean_PL=mean(Petal.Length)), by=Species]
```
The above returns groups in the order that they first appear in the data. To sort by group, you can `setkey(Species)` or use `keyby=Species` instead.
### Assignment
You can add or modify columns in-place (e.g. without copying the entire data.table, as with `data.frame`) using the `:=` function:
```{r assignment}
{
iris_dt[, Group:=c('A','B')]
iris_dt
}
```
## Internal variables
There are a number of special internal variables that allow you to operate more neatly on the data:
- `.SD` contains all the columns to be acted on (e.g. excludes the Group-by columns):
```{r .SD}
iris_dt[, lapply(.SD, mean), by=list(Species, Group)]
```
- `.I` is a vector of the row indices of the group.
```{r .I}
iris_dt[, length(.I), by=cut(Sepal.Length, seq(4, 8, by=0.5))][order(cut)]
```
- `.N` is the number rows in the group
```{r .N}
# equivalent to above
iris_dt[, .N, by=cut(Sepal.Length, seq(4, 8, by=0.5))][order(cut)]
```
- `.BY` is the group name (e.g. 'setosa', in one of the groups of r`iris_dt[by=Species]`
- `.GRP` the number of the group
```{r .BY .GRP}
iris_dt[, list(by=copy(.BY), grp=.GRP), by=Species]
```
> **WARNING:** because data.table does a lot of in-place/by-reference modification, it may be necessary in some cases to `copy()` a variable, to avoid it getting over-written the next time that variable is assigned.
>
> ```{r in-place warning}
> iris_dt[, list(by=.BY, grp=.GRP), by=Species]
> ```
## `set*` functions
- `setDF`: Convert a data.table to data.frame by reference
- `setDT`: Convert lists and data.frames to data.table by reference
- `setattr`: Set attributes to objects by reference
- `setkey`: Create key on a data table
- `setorder`: Fast reordering of a data.table by reference
## Keys and joining
data.table does fast joining using the keys set on each table, by passing one data.table as the `i=` argument to the other table:
```{r}
df1 = data.frame(CustomerId=c(1:4),Product=c(rep(c("Toaster", "Radio"), 2)))
df2 = data.frame(CustomerId=c(1,4,3),State=c(rep("Alabama",2),rep("Ohio",1)))
setDT(df1)
setDT(df2)
setkey(df1, CustomerId)
setkey(df2, CustomerId)
str(df1)
str(df2)
df1[df2]
df2[df1]
# not join:
df1[!df2]
```
`data.table` optimises the R base `merge` function, so `df1[df2]` is the same as `merge(df1, df2)`, if you prefer that syntax, but it's *way* faster than the `data.frame` version of `merge`.
## Speed and memory
### Benchmarking split-apply-combine
Comparison using the diamonds dataset
- filter out "Fair" cuts,
- group by cut
- calculate average, median, and size of each group
- recombine into a summary table
```{r benchmark}
library(rbenchmark)
library(dplyr)
library(data.table)
diamonds_df <- ggplot2::diamonds
diamonds_dt <- data.table(diamonds_df)
benchmark(
# Base R by solution:
a_R_by = {df <- diamonds_df[diamonds_df$cut!="Fair", ]
results <- by(df, df$cut, function(grp) {
return(list(AvgPrice = mean(grp$price),
MedianPrice = as.numeric(median(grp$price)),
Count = nrow(grp)))
})
results <- data.frame(lapply(data.frame(do.call('rbind', results)), unlist))
results[order(results$Count),]
},
# dplyr solution
b_dplyr = diamonds_dt %>%
filter(cut != "Fair") %>%
group_by(cut) %>%
summarize(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = n()) %>%
arrange(desc(Count)),
# data.tables
c_data.table = diamonds_dt[cut != "Fair",
list(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = .N), by = cut][order(-Count)],
replications=20,
columns=c("test", "replications", "elapsed", "relative")
)
```
### Reading data
- Comparison of reading a medium sized dataset (42Mb):
```{r setwd, echo=FALSE}
setwd('~/phd/docs/presentations/data.table.tutorial/')
```
*This is just an arbitrary, large .csv file I had laying around.*
```{r benchmark load csv}
system2('file', 'loobos.csv', stdout=TRUE)
system2('ls', '-lah loobos.csv', stdout=TRUE)
benchmark(
read.csv = loobos_df <- read.csv('loobos.csv', stringsAsFactors=FALSE),
read.csv.classes = {
colclasses <- sapply(loobos_df, class)
loobos_df <- read.csv('loobos.csv', colClasses=colclasses, stringsAsFactors=FALSE)
},
data.table.fread = {
loobos_dt <- fread('loobos.csv')
setnames(loobos_dt, colnames(loobos_df))
},
replications=5,
columns=c("test", "replications", "elapsed", "relative")
)
```
### Memory profiling
Make some big dataframes/tables:
```{r mem prof setup}
reps = 20
diamonds_df <- rbind(diamonds_df, diamonds_df, diamonds_df, diamonds_df, diamonds_df,
diamonds_df, diamonds_df, diamonds_df, diamonds_df, diamonds_df)
diamonds_dt <- data.table(diamonds_df)
print(object.size(diamonds_df), units='Mb')
print(object.size(diamonds_dt), units='Mb')
```
#### base R profile
```{r mem prof base}
max_mem <- rep(NA, reps)
for (i in 1:reps) {
Rprof(filename = "baseR.Rprof.out", memory.profiling = TRUE, interval=0.001)
df <- diamonds_df[diamonds_df$cut!="Fair", ]
results <- by(df, df$cut, function(grp) {
return(list(AvgPrice = mean(grp$price),
MedianPrice = as.numeric(median(grp$price)),
Count = nrow(grp)))
})
results <- data.frame(lapply(data.frame(do.call('rbind', results)), unlist))
results[order(results$Count),]
Rprof(NULL)
max_mem[i] <- max(summaryRprof(filename = "baseR.Rprof.out", memory='both')$by.total$mem.total)
}
summary(max_mem)
```
Uses more than 4 times the memory of the data.frame it's manipulating
#### dplyr profile
```{r mem prof dplyr}
max_mem <- rep(NA, 20)
for (i in 1:20) {
Rprof(filename = "dplyr.Rprof.out", memory.profiling = TRUE, interval=0.001)
diamonds_df %>%
filter(cut != "Fair") %>%
group_by(cut) %>%
summarize(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = n()) %>%
arrange(desc(Count))
Rprof(NULL)
max_mem[i] <- max(summaryRprof(filename = "dplyr.Rprof.out", memory='both')$by.total$mem.total)
}
summary(max_mem)
```
Uses about double the memory of the data frame it's manipulating, because it makes a copy at each step (but df sizes get smaller, so not quite double in some cases).
#### data.table R profile
```{r mem prof data.table}
setDT(diamonds_df)
max_mem <- rep(NA, 20)
for (i in 1:20) {
Rprof(filename = "data.table.Rprof.out", memory.profiling = TRUE, interval=0.001)
diamonds_df[cut != "Fair",
list(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = .N),
by = cut][order(-Count)]
Rprof(NULL)
max_mem[i] <- max(summaryRprof(filename = "data.table.Rprof.out", memory='both')$by.total$mem.total)
}
summary(max_mem)
```
Only uses about 10% more memory than the data.table it's manipulating - never copies the whole data set.
This is true even with assignment, where it only ever copies at most one column at a time.
## Summary
- More concise syntax
- sometimes clearer, sometimes not as clean as `dplyr`
- Faster and more memory efficient
- stopped my computer from crashing daily when choking on 500Mb datasets (with 8Gb ram)