# Names and values
**Learning objectives:**
- To understand the distinction between an *object* and its *name*
- With this knowledge, to be able to write faster code using less memory
- To better understand R's functional programming tools
We use the lobstr package throughout this chapter.
```{r}
library(lobstr)
```
### Quiz {-}
##### 1. How do I create a new column called `3` that contains the sum of `1` and `2`? {-}
```{r}
df <- data.frame(runif(3), runif(3))
names(df) <- c(1, 2)
df
```
```{r}
df$`3` <- df$`1` + df$`2`
df
```
**What makes these names challenging?**
> You need to use backticks (`) when the name of an object doesn't start with
> a letter or '.', or when it starts with '.' followed by a number (non-syntactic names).
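
As a rough check of the rule above, the sketch below flags which names would need backticks. It relies on base R's `make.names()`, which repairs non-syntactic names; comparing each name with its repaired form is a shortcut chosen here for illustration, and the `candidates` vector is made up.

```{r}
# Hypothetical names to test; none of these come from the quiz itself
candidates <- c("x", ".x", "x_1", "1", ".2", "_x", "if", "my var")

# A name is treated as syntactic when make.names() has nothing to repair
is_syntactic <- candidates == make.names(candidates)

# Names flagged FALSE need backticks, e.g. df$`1`
setNames(is_syntactic, candidates)
```

Reserved words such as `if` are flagged too, because `make.names()` appends a dot to them.
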
##### 2. How much memory does `y` occupy? {-}
```{r}
x <- runif(1e6)
y <- list(x, x, x)
```
Need to use the lobstr package:
```{r}
lobstr::obj_size(y)
```
> Note that if you look in the RStudio Environment pane or use base R's `object.size()`,
> you actually get a value of 24 MB, because the shared vector is counted three times
```{r}
object.size(y)
```
##### 3. On which line does `a` get copied in the following example? {-}
```{r}
a <- c(1, 5, 3, 2)
b <- a
b[[1]] <- 10
```
> Not until `b` is modified, the third line
## Binding basics {-}
- Create values and *bind* a name to them
- Names have values (rather than values have names)
- Multiple names can refer to the same values
- We can look at an object's address to keep track of the values independent of their names
```{r}
x <- c(1, 2, 3)
y <- x
obj_addr(x)
obj_addr(y)
```
### Exercises {-}
##### 1. Explain the relationships {-}
```{r}
a <- 1:10
b <- a
c <- b
d <- 1:10
```
> `a`, `b`, and `c` are all names that refer to the first value `1:10`
>
> `d` is a name that refers to a *second*, separate value of `1:10`.
##### 2. Do the following all point to the same underlying function object? hint: `lobstr::obj_addr()` {-}
```{r}
obj_addr(mean)
obj_addr(base::mean)
obj_addr(get("mean"))
obj_addr(evalq(mean))
obj_addr(match.fun("mean"))
```
> Yes!
## Copy-on-modify {-}
- If you modify a value bound to multiple names, R copies it first ('copy-on-modify')
- If you modify a value bound to a single name, R changes it directly ('modify-in-place')
- Use `tracemem()` to see when an object is copied
```{r}
x <- c(1, 2, 3)
cat(tracemem(x), "\n")
```
```{r}
y <- x
y[[3]] <- 4L # Copied (copy-on-modify): `x` still points to the original
y[[3]] <- 5L # Not copied (modify-in-place): `y` is now the only binding
```
Turn off `tracemem()` with `untracemem()`
> Can also use `ref(x)` to get the address of the value bound to a given name
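
The sketch below makes the copy visible without reading the `tracemem()` messages by eye. It uses the address string that `tracemem()` returns, compared before and after the modification. It assumes an R build with memory profiling enabled (the default for CRAN binaries); the variable names are illustrative.

```{r}
x <- c(1, 2, 3)
addr_x <- tracemem(x)            # address of the value bound to x

y <- x                           # second name, same value
shared_before <- identical(tracemem(y), addr_x)

y[[3]] <- 4                      # x is still bound to the value, so R copies
shared_after <- identical(tracemem(y), addr_x)

untracemem(x)
untracemem(y)

c(shared_before = shared_before, shared_after = shared_after)
```

`x` keeps its original contents; only `y` ends up bound to a new object.
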
## Functions {-}
- Copying also applies within functions
- If you pass `x` to `f()` but don't modify it inside the function, no copy is made
```{r}
f <- function(a) {
a
}
x <- c(1, 2, 3)
z <- f(x) # No copy: `z` and `x` refer to the same object
ref(x)
ref(z)
```
<!-- ![](images/02-trace.png) -->
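
For contrast, here is a minimal sketch of a function that *does* modify its argument. `set_first()` is a hypothetical helper written for this example; the point is that the copy happens inside the function, so the caller's object is untouched.

```{r}
# Hypothetical helper: replaces the first element and returns the result
set_first <- function(v, value) {
  v[[1]] <- value   # copy-on-modify happens here, inside the function
  v
}

x <- c(1, 2, 3)
z <- set_first(x, 100)

x  # the caller's vector is unchanged
z  # the modified copy
```
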
## Lists {-}
- A list as a whole has its own reference (id)
- List *elements* also each point to other values
- A list doesn't store the values themselves; it *stores a reference to each value*
- As of R 3.1.0, modifying lists creates a *shallow copy*
- References (bindings) are copied, but *values are not*
```{r}
l1 <- list(1, 2, 3)
l2 <- l1
l2[[3]] <- 4
```
- We can use `ref()` to see how they compare
- See how the list reference is different
- But first two items in each list are the same
```{r}
ref(l1, l2)
```
![](images/02-l-modify-2.png){width=50%}
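
The shallow copy can also be checked programmatically rather than by reading the `ref()` tree. This sketch assumes lobstr's `obj_addr()` and `obj_addrs()` (the latter returns the addresses of a list's elements) and simply compares them element by element.

```{r}
library(lobstr)

l1 <- list(1, 2, 3)
l2 <- l1
l2[[3]] <- 4

# The two lists are different objects ...
list_shared <- obj_addr(l1) == obj_addr(l2)

# ... but untouched elements still point at the same values
elements_shared <- obj_addrs(l1) == obj_addrs(l2)

list_shared
elements_shared
```
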
## Data Frames {-}
- Data frames are lists of vectors
- So copying and modifying a column *only affects that column*
- **BUT** if you modify a *row*, every column must be copied
```{r}
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
d2 <- d1
d3 <- d1
```
Only the modified column changes
```{r}
d2[, 2] <- d2[, 2] * 2
ref(d1, d2)
```
All columns change
```{r}
d3[1, ] <- d3[1, ] * 3
ref(d1, d3)
```
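
The same column-versus-row difference can be summarised as logical vectors. This is a sketch that assumes lobstr's `obj_addrs()` accepts a data frame (it is a list underneath) and compares column addresses before and after each kind of modification.

```{r}
library(lobstr)

d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

# Modify one column: only that column should get a new address
d2 <- d1
d2[, 2] <- d2[, 2] * 2
shared_after_column_edit <- obj_addrs(d1) == obj_addrs(d2)

# Modify one row: every column has to be rewritten
d3 <- d1
d3[1, ] <- d3[1, ] * 3
shared_after_row_edit <- obj_addrs(d1) == obj_addrs(d3)

shared_after_column_edit
shared_after_row_edit
```

If the claims above hold, the first comparison shows column `x` still shared and column `y` not, while the second shows no shared columns.
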
## Character vectors {-}
- R has a **global string pool**
- Elements of character vectors point to unique strings in the pool
```{r}
x <- c("a", "a", "abc", "d")
```
![](images/02-character-2.png)
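
A small sketch of the string pool in action, assuming lobstr's `obj_addrs()` returns one address per element of a character vector: repeated strings should report the same address.

```{r}
library(lobstr)

x <- c("a", "a", "abc", "d")
addrs <- obj_addrs(x)

# Both "a" elements point at the same pooled string
first_two_shared <- addrs[1] == addrs[2]

# Four elements, but only three distinct strings in the pool
n_distinct <- length(unique(addrs))

first_two_shared
n_distinct
```

`ref(x, character = TRUE)` shows the same information as a tree.
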
## Exercises {-}
##### 1. Why is `tracemem(1:10)` not useful? {-}
> Because it tries to trace a value that is not bound to a name
##### 2. Why are there two copies? {-}
```{r}
x <- c(1L, 2L, 3L)
tracemem(x)
x[[3]] <- 4
```
> Because we convert an *integer* vector (using 1L, etc.) to a *double* vector (using just 4): one copy for the type conversion and one for the modification.
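
The coercion behind the extra copy is easy to confirm with `typeof()`. This sketch contrasts assigning a double with assigning an integer; the variable names are illustrative.

```{r}
x <- c(1L, 2L, 3L)
type_before <- typeof(x)

x[[3]] <- 4            # 4 is a double, so the whole vector is coerced
type_after <- typeof(x)

y <- c(1L, 2L, 3L)
y[[3]] <- 4L           # 4L is an integer, so no coercion is needed
type_y <- typeof(y)

c(before = type_before, after = type_after, integer_assign = type_y)
```
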
##### 3. What are the relationships among these objects? {-}
```{r}
a <- 1:10
b <- list(a, a)
c <- list(b, a, 1:10)
```
- `a` -> obj 1
- `b` -> list(obj 1, obj 1)
- `c` -> list(`b` [obj 1, obj 1], obj 1, a new `1:10`)
```{r}
ref(c)
```
##### 4. What happens here? {-}
```{r}
x <- list(1:10)
x[[2]] <- x
```
- `x` is a list
- `x[[2]] <- x` creates a new list, which in turn contains a reference to the
original list
- `x` is no longer bound to `list(1:10)`
```{r}
ref(x)
```
![](images/02-copy_on_modify_fig2.png){width=50%}
## Object Size {-}
- Use `lobstr::obj_size()`
- Lists may be smaller than expected because several elements can reference the same value
- Strings may be smaller than expected because of the global string pool
- Difficult to predict how big something will be
- Can only add sizes together if they share no references in common
### Alternative Representation {-}
- As of R 3.5.0 - ALTREP
- Represent some vectors compactly
- e.g., `1:1000` stores just the endpoints 1 and 1,000, not all 1,000 values
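
A quick way to see the compact representation, assuming R 3.5.0 or later and the lobstr package: a short and a long integer sequence created with `:` should report the same size, while an ordinary integer vector of the same length is far larger.

```{r}
library(lobstr)

short_seq <- as.numeric(obj_size(1:10))
long_seq  <- as.numeric(obj_size(1:1e6))

# Arithmetic produces an ordinary (non-compact) integer vector
expanded <- as.numeric(obj_size(seq_len(1e6) + 0L))

c(short_seq = short_seq, long_seq = long_seq, expanded = expanded)
```
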
### Exercises {-}
##### 1. Why are the sizes so different? {-}
```{r}
y <- rep(list(runif(1e4)), 100)
object.size(y) # ~8000 kB
obj_size(y) # ~80 kB
```
> From `?object.size()`:
>
> "This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example."
##### 2. Why is the size misleading? {-}
```{r}
funs <- list(mean, sd, var)
obj_size(funs)
```
> Because they reference functions from base and stats, which are always available.
> Why bother looking at the size? What use is that?
##### 3. Predict the sizes {-}
```{r}
a <- runif(1e6) # 8 MB
obj_size(a)
```
```{r}
b <- list(a, a)
```
- There is one value ~8MB
- `a` and `b[[1]]` and `b[[2]]` all point to the same value.
```{r}
obj_size(b)
obj_size(a, b)
```
```{r}
b[[1]][[1]] <- 10
```
- Now there are two values ~8MB each (16MB total)
- `a` and `b[[2]]` point to the same value (8MB)
- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed
```{r}
obj_size(b) # 16 MB (two values, two element references)
obj_size(a, b) # 16 MB (a & b[[2]] point to the same value)
```
```{r}
b[[2]][[1]] <- 10
```
- Finally, now there are three values ~8MB each (24MB total)
- Although `b[[1]]` and `b[[2]]` have the same contents,
they are not references to the same object.
```{r}
obj_size(b)
obj_size(a, b)
```
## Modify-in-place {-}
- Modifying usually creates a copy except for
- Objects with a single binding (performance optimization)
- Environments (special)
### Objects with a single binding {-}
- Hard to know if copy will occur
- R's reference count only distinguishes 0, 1, or many bindings, so once an object has had 2+ bindings, removing some doesn't bring the count back down (R will still think there is more than one)
- May make a copy even if there's only one binding left
- Calling most functions makes a reference to the object **unless the function is implemented in C**
- Best to use `tracemem()` to check rather than guess.
#### Example - lists vs. data frames in for loop {-}
**Setup**
Create the data to modify
```{r}
x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))
```
**Data frame - Copied every time!**
```{r}
cat(tracemem(x), "\n")
for (i in seq_along(medians)) {
x[[i]] <- x[[i]] - medians[[i]]
}
untracemem(x)
```
**List (uses internal C code) - Copied once!**
```{r}
y <- as.list(x)
cat(tracemem(y), "\n")
for (i in seq_along(medians)) {
y[[i]] <- y[[i]] - medians[[i]]
}
untracemem(y)
```
#### Benchmark this (Exercise #2) {-}
**First wrap in a function**
```{r}
med <- function(d, medians) {
for (i in seq_along(medians)) {
d[[i]] <- d[[i]] - medians[[i]]
}
}
```
**Try with 5 columns**
```{r}
x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)
bench::mark(
"data.frame" = med(x, medians),
"list" = med(y, medians)
)
```
**Try with 20 columns**
```{r}
x <- data.frame(matrix(runif(5 * 1e4), ncol = 20))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)
bench::mark(
"data.frame" = med(x, medians),
"list" = med(y, medians)
)
```
**WOW!**
### Environments {-}
- Always modified in place (**reference semantics**)
- Interesting because if you modify the environment, all existing bindings to it still refer to the same (now modified) object
- If two names point to the same environment, and you update one, you update both!
```{r}
e1 <- rlang::env(a = 1, b = 2, c = 3)
e2 <- e1
e1$c <- 4
e2$c
```
- This means that environments can contain themselves (!)
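
A minimal sketch of an environment containing itself. It uses base R's `new.env()` instead of `rlang::env()` so that it has no package dependencies; the binding names `self` and `value` are made up for the example.

```{r}
e <- new.env()
e$self <- e                      # the environment now holds a reference to itself

is_same <- identical(e$self, e)  # both names refer to the same environment

# Writing through the nested reference changes the one and only environment
e$self$self$value <- 1
e$value

is_same
```
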
### Exercises {-}
##### 1. Why isn't this circular? {-}
```{r}
x <- list()
x[[1]] <- x
```
> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second.
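
Inspecting the result shows where the nesting stops. This sketch just measures the lengths: the outer list has one element, and that element is the original empty list rather than `x` itself.

```{r}
x <- list()
x[[1]] <- x

outer_length <- length(x)        # the new list has one element
inner_length <- length(x[[1]])   # that element is the old, empty list

c(outer = outer_length, inner = inner_length)
```
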
##### 2. (see "Objects with a single binding") {-}
##### 3. What happens if you attempt to use tracemem() on an environment? {-}
```{r}
#| error: true
e1 <- rlang::env(a = 1, b = 2, c = 3)
tracemem(e1)
```
> It throws an error. Because environments are always modified in place, there's no point in tracing them
## Unbinding and the garbage collector {-}
- If you delete the 'name' bound to an object, the object still exists
- R runs a "garbage collector" (GC) to remove these objects when it needs more memory
- "Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try."
- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed
- You can force GC with `gc()`, but you never need to call it to make more memory available *within* R
- The only reasons to do so are to free memory for other system software, or to get the
message printed about how much memory is being used
```{r}
gc()
mem_used()
```
- These numbers will **not** match what your OS tells you because:
1. They include objects created by R, but not the R interpreter itself
2. R and the OS are lazy and don't reclaim/release memory until it's needed
3. R counts memory occupied by objects, but deleted objects leave gaps ->
*memory fragmentation* [less memory is actually available than you might think]
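
To see unbinding and collection together, the sketch below reads the "used" figure for vector memory from the matrix that `gc()` returns, before and after removing a large vector. The exact numbers vary by session; the 1e7-element vector (about 80 MB) is an arbitrary size picked so the difference stands out.

```{r}
x <- runif(1e7)                        # roughly 80 MB of doubles

used_with_x <- gc()["Vcells", 2]       # column 2: vector memory in use, in Mb
rm(x)                                  # unbind the name; the value is now unreachable
used_without_x <- gc()["Vcells", 2]    # a forced collection reclaims it

used_with_x - used_without_x           # approximate Mb released
```

Calling `gc()` here is only for measurement; R would have reclaimed the memory on its own when needed.
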
## Meeting Videos {-}
### Cohort 1 {-}
(no video recorded)
### Cohort 2 {-}
`r knitr::include_url("https://www.youtube.com/embed/pCiNj2JRK50")`
### Cohort 3 {-}
`r knitr::include_url("https://www.youtube.com/embed/-bEXdOoxO_E")`
### Cohort 4 {-}
`r knitr::include_url("https://www.youtube.com/embed/gcVU_F-L6zY")`
### Cohort 5 {-}
`r knitr::include_url("https://www.youtube.com/embed/aqcvKox9V0Q")`
### Cohort 6 {-}
`r knitr::include_url("https://www.youtube.com/embed/O4Oo_qO7SIY")`
<details>
<summary> Meeting chat log </summary>
```
00:16:57 Federica Gazzelloni: cohort 2 video: https://www.youtube.com/watch?v=pCiNj2JRK50
00:18:39 Federica Gazzelloni: cohort 2 presentation: https://r4ds.github.io/bookclub-Advanced_R/Presentations/Week02/Cohort2_America/Chapter2Slides.html#1
00:40:24 Arthur Shaw: Just the opposite, Ryan. Very clear presentation!
00:51:54 Trevin: parquet?
00:53:00 Arthur Shaw: We may all be right. {arrow} looks to deal with feather and parquet files: https://arrow.apache.org/docs/r/
01:00:04 Arthur Shaw: Some questions for future meetings. (1) I find Ryan's use of slides hugely effective in conveying information. Would it be OK if future sessions (optionally) used slides? If so, should/could we commit slides to some folder on the repo? (2) I think reusing the images from Hadley's books really helps understanding and discussion. Is that OK to do? Here I'm thinking about copyright concerns. (If possible, I would rather not redraw variants of Hadley's images.)
01:01:35 Federica Gazzelloni: It's all ok, you can use past presentation, you don't need to push them to the repo, you can use the images from the book
01:07:19 Federica Gazzelloni: Can I use: gc(reset = TRUE) safely?
```
</details>
### Cohort 7 {-}
`r knitr::include_url("https://www.youtube.com/embed/kpAUoGO6elE")`
<details>
<summary>Meeting chat log</summary>
```
00:09:40 Ryan Honomichl: https://drdoane.com/three-deep-truths-about-r/
00:12:51 Robert Hilly: Be right back
00:36:12 Ryan Honomichl: brb
00:41:18 Ron: I tried mapply and also got different answers
00:41:44 collinberke: Interesting, would like to know more what is going on.
00:49:57 Robert Hilly: simple_map <- function(x, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], ...)
}
out
}
```
</details>