# Use R {#chap-utiliseR}
\toc{1}
The literature devoted to learning R is flourishing.
The following books are an arbitrary but useful selection:
* [R for Data Science](https://r4ds.had.co.nz/) [@Wickham2016] presents a complete working method, consistent with the tidyverse.
* [Advanced R](http://adv-r.had.co.nz/) [@Wickham2014] is the reference for mastering the subtleties of the language and understanding how R works.
* Finally, [Efficient R programming](https://csgillespie.github.io/efficientR/) [@Gillespie2016] deals with code optimization.
This chapter covers some advanced aspects of coding.
Details on the different languages of R are useful for creating packages.
Environments are presented next, to explain how the objects called by the code are searched for.
Finally, the optimization of code performance is discussed in depth (loops, C++ code and parallelization) and illustrated by a case study.
## The languages of R
R includes several programming languages.
The most common is S3, but it is not the only one[^206].
[^206]: https://adv-r.had.co.nz/OO-essentials.html
### Base
The core of R is the primitive functions and basic data structures like the `sum` function and `matrix` data:
```{r Core}
pryr::otype(sum)
typeof(sum)
pryr::otype(matrix(1))
typeof(matrix(1))
```
The **pryr** package displays the language in which objects are defined.
The `typeof()` function displays the internal storage type of the objects:
- the `sum()` function belongs to the base language of R and is a built-in function.
- the elements of the numerical matrix containing a single 1 are double precision reals, and the matrix itself is defined in the basic language.
Primitive functions are coded in C and are very fast.
They are always available, whatever the packages loaded.
Their use is therefore to be preferred.
### S3 {#sec:S3}
S3 is the most used language, often the only one known by R users.
It is an object-oriented language in which classes, i.e. the type of objects, are declarative.
```{r S3-1}
MyFirstName <- "Eric"
class(MyFirstName) <- "FirstName"
```
The variable `MyFirstName` is here classed as `FirstName` by a simple declaration.
Unlike the way a classical object-oriented language works[^210], S3 methods are related to functions, not objects.
[^210]: https://www.troispointzero.fr/le-blog/introduction-a-la-programmation-orientee-objet-poo/
```{r S3-2, echo=TRUE}
# Default display
MyFirstName
print.FirstName <- function(x) cat("The first name is", x)
# Modified display
MyFirstName
```
In this example, the `print()` method applied to the "FirstName" class is modified.
In a classical object-oriented language, the method would be defined in the class `FirstName`.
In R, methods are defined from generic methods.
`print` is a generic method ("a generic") declared in **base**.
```{r print-otype}
pryr::otype(print)
```
Its code is just a `UseMethod("print")` declaration:
```{r print-code}
print
```
There are many S3 methods for `print`:
```{r printS3}
head(methods("print"))
```
Each applies to a class. `print.default` is used as a last resort and relies on the type of the object, not its S3 class.
```{r print.defaut, echo=TRUE}
typeof(MyFirstName)
pryr::otype(MyFirstName)
```
An object can belong to several classes, which allows a form of inheritance of methods.
In a classical object oriented language, inheritance allows to define more precise classes ("FrenchFirstName") which inherit from more general classes ("FirstName") and thus benefit from their methods without having to redefine them.
In R, inheritance is simply declaring a vector of increasingly broad classes for an object:
```{r S3-3, echo=TRUE}
# Definition of classes by a vector
class(MyFirstName) <- c("FrenchFirstName", "FirstName")
# Alternative code, with inherits()
inherits(MyFirstName, what = "FrenchFirstName")
inherits(MyFirstName, what = "FirstName")
```
The generic looks for a method for each class, in the order of their declaration.
```{r S3-4, echo=TRUE}
print.FrenchFirstName <- function(x) cat("French first name:", x)
MyFirstName
```
In summary, S3 is the common language of R.
Almost all packages are written in S3.
Generics are everywhere but go unnoticed, for example in packages:
```{r S3-5, echo=TRUE}
library("entropart")
.S3methods(class="SpeciesDistribution")
```
The `.S3methods()` function displays all available methods for a class, as opposed to `methods()` which displays all classes for which the method passed as an argument is defined.
Many primitive functions in R are generic methods.
To find out about them, use the `help(InternalMethods)` helper.
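The same functions work with base classes, without loading any package; a quick illustrative sketch:

```{r S3-methods-base}
# Methods available for the data.frame class
head(.S3methods(class = "data.frame"))
# Classes for which the generic summary() has a method
head(methods("summary"))
```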
### S4
S4 is an evolution of S3 that structures classes to get closer to a classical object-oriented language:
- Classes must be explicitly defined, not simply declared.
- Attributes (i.e. variables describing objects), called *slots*, are explicitly declared.
- The constructor, i.e. the method that creates a new instance of a class (i.e. a variable containing an object of the class), is explicit.
Using the previous example, the S4 syntax is as follows:
```{r S4-Class, tidy=FALSE}
# Definition of the class Person, with its slots
setClass("Person",
slots = list(LastName = "character", FirstName = "character"))
# Construction of an instance
Me <- new("Person", LastName = "Marcon", FirstName = "Eric")
# Language
pryr::otype(Me)
```
Methods still belong to functions, not to objects.
They are declared by the `setMethod()` function:
```{r S4-print}
setMethod("print",
signature="Person",
function(x, ...) {
cat("The person is:", x@FirstName, x@LastName)
}
)
print(Me)
```
The attributes are called by the syntax `variable@slot`.
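Slots can also be read with the functional syntax `slot()` and listed with `slotNames()`; a minimal sketch, reusing the `Person` class defined above:

```{r S4-slots}
# Direct slot access
Me@FirstName
# Equivalent functional syntax
slot(Me, "LastName")
# List the slots of the class
slotNames("Person")
```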
In summary, S4 is more rigorous than S3.
Some packages on CRAN (**Matrix**, **sp**, **odbc**...) and many on Bioconductor are written in S4, but the language is now clearly declining in favor of S3, notably because of the success of the **tidyverse**.
### RC
RC was introduced in R 2.12 (2010) with the **methods** package.
Methods belong to classes, as in C++: they are declared in the class and called from the objects.
```{r RC-Class, tidy=FALSE}
library("methods")
# Declaration of the class
PersonRC <- setRefClass("PersonRC",
fields = list(LastName = "character", FirstName = "character"),
methods = list(print = function() cat(FirstName, LastName)))
# Constructor
MeRC <- new("PersonRC", LastName = "Marcon", FirstName ="Eric")
# Language
pryr::otype(MeRC)
# Call the print method
MeRC$print()
```
RC remains a niche language, although it is the first "true" object-oriented language of R.
### R6
R6[^211] enhances RC but is not included in R: it requires installing its package.
[^211]: https://r6.r-lib.org/
Attributes and methods can be public or private.
An `initialize()` method is used as a constructor.
```{r S6-Class}
library("R6")
PersonR6 <- R6Class("PersonR6",
public = list(LastName="character", FirstName="character",
initialize = function(LastName=NA, FirstName=NA) {
self$LastName <- LastName ; self$FirstName <- FirstName},
print = function() cat(self$FirstName, self$LastName)))
MeR6 <- PersonR6$new(LastName = "Marcon", FirstName ="Eric")
MeR6$print()
```
R6 allows rigorous object-oriented programming but is very little used.
R6 performs much better than RC, but still worse than S3[^212].
[^212]: https://r6.r-lib.org/articles/Performance.html
That R6 is not included in R is shown by **pryr**:
```{r otype}
pryr::otype(MeR6)
```
### Tidyverse
The tidyverse is a set of coherent packages that have evolved the way R is programmed.
The set of essential packages can be loaded by the **tidyverse** package which has no other use:
```{r tidyverse}
library("tidyverse")
```
This is not a new language per se but rather an extension of S3, with deep technical modifications, notably the non-standard evaluation of expressions[^220], which it is not essential to master in detail.
[^220]: https://dplyr.tidyverse.org/articles/programming.html
Its principles are written in a manifesto[^221].
Its most visible contribution for the user is the sequence of commands in a flow (code pipeline).
[^221]: https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html
In standard programming, the sequence of functions is written by successive nesting, which makes it difficult to read, especially when arguments are needed:
```{r log_mean_runif}
# Base-2 logarithm of the mean of 100 random numbers in a uniform distribution
log(mean(runif(100)), base=2)
```
In the tidyverse, the functions are chained together, which often better matches the programmer's thinking about data processing:
```{r tidylog, tidy=FALSE}
# 100 random numbers in a uniform distribution
runif(100) %>%
# Mean
mean %>%
# Base-2 logarithm
log(base=2)
```
The pipe `%>%` is an operator that calls the next function by passing it as first argument the result of the previous function.
Additional arguments are passed normally: for the readability of the code, it is essential to name them.
Most of the R functions can be used without difficulty in the tidyverse, even though they were not designed for this purpose: it is sufficient that their first argument is the data to be processed.
The pipe passes only one value to the next function, which rules out functions of several separate data arguments, such as `f(x, y)`.
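When the piped value should not be the first argument, the `.` placeholder of **magrittr** (the package that provides `%>%`) inserts it elsewhere; a minimal sketch:

```{r pipe-dot}
library("magrittr")
# The dot stands for the piped value: here it becomes the base argument
2 %>% log(8, base = .)
```

Because `.` is used as a named argument, the piped value is not also inserted as the first argument.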
The preferred data structure is the *tibble*, which is an improved dataframe: its `print()` method is more readable, and it corrects some unintuitive dataframe behavior, such as the automatic conversion of single-column dataframes to vectors.
The columns of the dataframe or tibble allow to pass as much data as needed.
Finally, data visualization is supported by **ggplot2** which relies on a theoretically sound graph grammar [@Wickham2010].
Schematically, a graph is constructed according to the following model:
```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
```
- The data is necessarily a dataframe.
- The geometry is the type of graph chosen (points, lines, histograms or other).
- The aesthetics (function `aes()`) designates what is represented: it is the correspondence between the columns of the dataframe and the elements necessary for the geometry.
- Statistics is the treatment applied to the data before passing it to the geometry (often "identity", i.e. no transformation but "boxplot" for a boxplot).
The data can be transformed by a scale function, such as `scale_y_log10()`.
- The position is the location of the objects on the graph (often "identity"; "stack" for a stacked histogram, "jitter" to move the overlapping points slightly in a `geom_point`).
- The coordinates define the display of the graph (`coord_fixed()` to avoid distorting a map for example).
- Finally, facets offer the possibility to display several aspects of the same data by producing one graph per modality of a variable.
The set formed by the pipeline and **ggplot2** allows complex processing in a readable code.
Figure \@ref(fig:diamonds) shows the result of the following code:
(ref:diamonds) Price of diamonds according to their weight. Demonstration of the **ggplot2** code combined with tidyverse data processing.
```{r diamonds, fig.cap="(ref:diamonds)", echo=TRUE, tidy=FALSE, message=FALSE}
# Diamonds data provided by ggplot2
diamonds %>%
# Keep only diamonds larger than half a carat
filter(carat > 0.5) %>%
# Graph: price vs. weight
ggplot(aes(x = carat, y = price)) +
# Scatter plot
geom_point() +
# Logarithmic scale
scale_x_log10() +
scale_y_log10() +
# Linear regression
geom_smooth(method = "lm")
```
In this figure, two geometries (scatterplot and linear regression) share the same aesthetics (price vs. weight in carats) which is therefore declared upstream, in the `ggplot()` function.
The tidyverse is documented in detail in @Wickham2016 and **ggplot2** in @Wickham2017.
## Environments {#sec:environnements}
R's objects, data and functions, are named.
Since R is modular, with the ability to add any number of packages to it, it is very likely that name conflicts will arise.
To deal with them, R has a rigorous system of name precedence: code runs in a defined environment, inheriting from parent environments.
### Organization
R starts in an empty environment.
Each loaded package creates a child environment, forming a stack of environments: each new element is the "child" of the previous one, its "parent".
The console is in the global environment, the child of the last loaded package.
```{r search}
search()
```
The code of a function called from the console runs in a child environment of the global environment:
```{r environment}
# Current environment
environment()
# The function f displays its environment
f <- function() environment()
# Display the environment of the function
f()
# Parent environment of the function's environment
parent.env(f())
```
### Search
The search for an object starts in the local environment.
If it is not found, it is searched in the parent environment, then in the parent of the parent, until the environments are exhausted, which generates an error indicating that the object was not found.
Example:
```{r GlobalEnv}
# Variable q defined in the global environment
q <- "GlobalEnv"
# Function defining q in its environment
qLocalFunction <- function() {
q <- "Function"
return(q)
}
# The local variable is returned
qLocalFunction()
# Poorly written function using a variable it does not define
qGlobalEnv <- function() {
return(q)
}
# The global environment variable is returned
qGlobalEnv()
# Delete this variable
rm(q)
# The function base::q is returned
qGlobalEnv()
```
The variable `q` is defined in the global environment.
The function `qLocalFunction` defines its own variable `q`.
Calling the function returns its local value because `q` is found in the function's environment.
The `qGlobalEnv` function returns the `q` variable that it does not define locally.
So it looks for it in its parent environment and finds the variable defined in the global environment.
By removing the variable from the global environment with `rm(q)`, the `qGlobalEnv()` function scans the stack of environments until it finds an object named `q` in the **base** package, which is the function to exit R.
It could have found another object if a package containing a `q` object had been loaded.
To avoid this erratic behavior, a function should *never* call an object not defined in its own environment.
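The robust fix (a sketch, with a hypothetical function name) is to pass every needed value explicitly as an argument, possibly with a default:

```{r qArgument}
# The value is received as an argument instead of being searched
# for in parent environments
qArgument <- function(q = "default") {
  return(q)
}
# The default value is returned
qArgument()
# An explicit value is returned
qArgument(q = "GlobalEnv")
```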
### Package namespaces
It is time to define precisely what packages make visible.
Packages contain objects (functions and data) which they *export* or not.
They are usually called by the `library()` function, which does two things:
- It *loads* the package into memory, allowing access to all its objects with the syntax `package::object` for exported objects and `package:::object` for non-exported ones.
- It then *attaches* the package, which places its environment on top of the stack.
A package can be detached and unloaded with the `unloadNamespace()` function, which removes it from the environment stack.
Example:
```{r unloadNamespace}
# entropart loaded and attached
library("entropart")
# is it attached?
isNamespaceLoaded("entropart")
# stack of environments
search()
# Diversity(), a function exported by entropart is found
Diversity(1, CheckArguments=FALSE)
# Detach and unload entropart
unloadNamespace("entropart")
# Is it attached?
isNamespaceLoaded("entropart")
# Stack of environments, without entropart
search()
# Diversity() cannot be found
tryCatch(Diversity(1), error = function(e) print(e))
# but can be called with its full name
entropart::Diversity(1, CheckArguments=FALSE)
```
Calling `entropart::Diversity()` loads the package (i.e., implicitly executes `loadNamespace("entropart")`) but does not attach it.
In practice, one should limit the number of attached packages to limit the risk of calling an unwanted function, homonymous to the desired function.
In critical cases, the full name of the function should be used: `package::function()`.
A common issue occurs with the `filter()` function of **dplyr**, which has the same name as a function in the **stats** package.
The **stats** package is usually loaded before **dplyr**, a package in the tidyverse.
Thus, `stats::filter()` must be called explicitly.
However, the **dplyr** or **tidyverse** package (which attaches all the tidyverse packages) can be loaded systematically by creating an `.Rprofile` file at the root of the project containing the command:
```{r tidyverse2, eval=FALSE}
library("tidyverse")
```
In this case, **dplyr** is attached *before* **stats**, so its `filter()` function is masked and only accessible as `dplyr::filter()`.
## Measuring execution time
The execution time of long code can be measured very simply by the `system.time` command.
For very short execution times, it is necessary to repeat the measurement: this is the purpose of the **microbenchmark** package.
### system.time
The function returns the execution time of the code.
```{r system.time}
# Mean absolute deviation of 1000 values in a uniform distribution, repeated 100 times
system.time(for(i in 1:100) mad(runif(1000)))
```
### microbenchmark
The **microbenchmark** package is the most advanced.
The goal is to compare the speed of computing the square of a vector (or a number) by multiplying it by itself ($x \times x$) or by raising it to the power of 2 ($x^2$).
```{r microbenchmark}
# Functions to test
f1 <- function(x) x*x
f2 <- function(x) x^2
f3 <- function(x) x^2.1
f4 <- function(x) x^3
# Initialization
X <- rnorm(10000)
# Test
library("microbenchmark")
(mb <- microbenchmark(f1(X), f2(X), f3(X), f4(X)))
```
The returned table contains the minimum, median, mean, max and first and third quartile times, as well as the number of repetitions.
The median value is the one to compare.
The number of repetitions is 100 by default; it can be adjusted with the `times` argument according to the complexity of the calculation.
The test result, a `microbenchmark` object, is a raw table of execution times.
The statistical analysis is done by the `print` and `summary` methods.
To choose the columns to display, use the following syntax:
```{r mb_summary}
summary(mb)[, c("expr", "median")]
```
To make calculations on these results, we must store them in a variable.
To prevent the results from being displayed in the console, the simplest solution is to use the `capture.output` function by assigning its result to a variable.
```{r capture.output}
dummy <- capture.output(mbs <- summary(mb))
```
The previous test is displayed again.
```{r mb_summary2}
summary(mb)[, c("expr", "median")]
```
The computation time is about the same between $x \times x$ and $x^2$.
The power calculation is much longer, especially if the power is not integer, because it requires a logarithm calculation.
The computation of the power 2 is therefore optimized by R to avoid the use of log.
Two graphical representations are available: the violins represent the probability density of the execution time; the boxplots are classical.
```{r boxplot, message=FALSE}
library("ggplot2")
autoplot(mb)
boxplot(mb)
```
### Profiling
**profvis** is RStudio's profiling tool.
It tracks the execution time of each line of code and the memory used.
The goal is to detect slow code portions that need to be improved.
```{r profvis}
library(profvis)
p <- profvis({
# Cosine calculations
cos(runif(10^7))
# 1/2 second pause
pause(1/2)
})
htmlwidgets::saveWidget(p, "docs/profile.html")
```
The result is an HTML file containing the profiling report[^201].
It can be observed that the time to draw the random numbers is similar to that of the cosine calculation.
[^201]: https://EricMarcon.github.io/WorkingWithR/profile.html
Read the complete documentation[^202] on the RStudio website.
[^202]: https://rstudio.github.io/profvis/
## Loops
The most frequent case of long code to execute is loops: the same code is repeated a large number of times.
### Vector functions
Most of R's functions are vector functions: loops are processed internally, extremely fast.
Therefore, you should think in terms of vectors rather than scalars.
```{r vector}
# Draw two vectors of three random numbers between 0 and 1
x1 <- runif(3)
x2 <- runif(3)
# Square root of the three numbers in x1
sqrt(x1)
# Respective sums of the three numbers of x1 and x2
x1+x2
```
Functions should likewise be written to be vectorized over their first argument.
The function `lnq` of the package **entropart** returns the deformed logarithm of order $q$ of a number $x$.
```{r lnq}
# Code of the function
entropart::lnq
```
For a function to be vectorized, each line of its code must allow the first argument to be treated as a vector.
Here, `log(x)` and `x^` are a vectorized function and operator, and the condition `[x < 0]` also returns a vector.
### lapply
Code that cannot be written as a vector function requires loops.
`lapply()` applies a function to each element of a list.
There are several versions of this function:
- `lapply()` returns a list (and saves the time of rearranging the results in an array).
- `sapply()` simplifies the result to a vector or matrix when possible (this is done by the `simplify2array()` function).
- `vapply()` is almost identical to `sapply()` but requires that the data type of the result be declared.
```{r l_s_v_apply}
# Draw 1000 values in a uniform distribution
x1 <- runif(1000)
# The square root can be calculated for the vector or each value
identical(sqrt(x1), sapply(x1, FUN=sqrt))
mb <- microbenchmark(sqrt(x1), lapply(x1, FUN=sqrt), sapply(x1, FUN=sqrt), vapply(x1, FUN=sqrt, FUN.VALUE = 0))
summary(mb)[, c("expr", "median")]
```
`lapply()` is much slower than a vector function.
`sapply()` requires more time for `simplify2array()`, which must detect how to gather the results.
Finally, `vapply()` saves the time of determining the data type of the result and allows for faster computation with little effort.
### For loops
Loops are handled by the `for` function.
They have the reputation of being slow in R because the code inside the loop must be interpreted at each execution.
This is no longer the case since version 3.5 of R: loops are compiled systematically before execution.
The behavior of the just-in-time compiler is defined by the `enableJIT` function.
The default level is 3: all functions are compiled, and loops in the code are compiled too.
To evaluate the performance gain, the following code removes all automatic compilation, and compares the same loop compiled or not.
```{r compiler}
library("compiler")
# No automatic compilation
enableJIT(level=0)
# Loop to calculate the square root of a vector
Loop <- function(x) {
# Initialization of the result vector, essential
Root <- vector("numeric", length=length(x))
# Loop
for(i in 1:length(x)) Root[i] <- sqrt(x[i])
return(Root)
}
# Compiled version
Loop2 <- cmpfun(Loop)
# Comparison
mb <- microbenchmark(Loop(x1), Loop2(x1))
(mbs <- summary(mb)[, c("expr", "median")])
# Automatic compilation by default since version 3.5
enableJIT(level=3)
```
The gain is considerable: from 1 to `r format(mbs[1,2]/mbs[2,2], digits=1)`.
For loops are now much faster than `vapply`.
```{r for_mb}
# Test
mb <- microbenchmark(vapply(x1, FUN=sqrt, 0), Loop(x1))
summary(mb)[, c("expr", "median")]
```
Be careful, the performance test can be misleading:
```{r deceptive, tidy=FALSE}
# Preparing the result vector
Root <- vector("numeric", length = length(x1))
# Test
mb <- microbenchmark(vapply(x1, FUN = sqrt, 0),
for(i in 1:length(x1))
Root[i] <- sqrt(x1[i]))
summary(mb)[, c("expr", "median")]
```
In this code, the for loop is not compiled so it is much slower than in its normal use (in a function or at the top level of the code).
Long loops can display a text progress bar to track their progress, which is another advantage.
The following function executes pauses of one tenth of a second for the time passed in parameter (in seconds).
```{r BoucleSuivie}
MonitoredLoop <- function(duration=1) {
# Progress bar
pgb <- txtProgressBar(min = 0, max = duration*10)
# Loop
for(i in 1:(duration*10)) {
# Pause for a tenth of a second
Sys.sleep(1/10)
# Track progress
setTxtProgressBar(pgb, i)
}
}
MonitoredLoop()
```
### replicate
`replicate()` repeats a statement.
```{r replicate}
replicate(3, runif(1))
```
This code is equivalent to `runif(3)`, with performance similar to `vapply`: 50 to 100 times slower than a vector function.
```{r replicate_mb}
mb <- microbenchmark(replicate(1E3, runif(1)), runif(1E3))
summary(mb)[, c("expr", "median")]
```
### Vectorize
`Vectorize()` builds a vectorized version of a function that is not, by wrapping it in loops (`mapply()`), with no performance gain.
Write the loops explicitly instead.
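To illustrate (a sketch with hypothetical function names): `Vectorize()` makes a scalar-only function accept vectors, but the looping is merely hidden, not eliminated:

```{r vectorize-demo}
# A function that only accepts scalar arguments
ScalarOnly <- function(x, y) if (x > y) x else y
# Vectorized version, built on mapply()
PairwiseMax <- Vectorize(ScalarOnly)
# Element-wise maximum of two vectors
PairwiseMax(1:3, c(2, 1, 4))
```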
### Marginal statistics
`apply` applies a function to the rows or columns of a two-dimensional object.
`colSums` and similar functions (`rowSums`, `colMeans`, `rowMeans`) are optimized.
```{r BoucleSomme, tidy=FALSE}
# Sum of the numeric columns of the diamonds dataset of ggplot2
# Loop identical to the action of apply(, 2, )
SumLoop <- function(Table) {
Sum <- vector("numeric", length = ncol(Table))
for (i in 1:ncol(Table)) Sum[i] <- sum(Table[, i])
return(Sum)
}
mb <- microbenchmark(SumLoop(diamonds[-(2:4)]),
apply(diamonds[-(2:4)], 2, sum),
colSums(diamonds[-(2:4)]))
summary(mb)[, c("expr", "median")]
```
`apply` clarifies the code but is slower than the loop, which is only slightly slower than `colSums`.
## C++ code {#sec:cpp}
Integrating C++ code into R is greatly simplified by the **Rcpp** package but is still difficult to debug and therefore should be reserved for very simple code (to avoid errors) repeated a large number of times (to be worth the effort).
The preparation and verification of the data must be done in R, as well as the processing and presentation of the results.
The common practice is to include C++ code in a package, but running it outside a package is possible:
- C++ code can be included in a C++ document (file with extension `.cpp`): it is compiled by the `sourceCpp()` command, which creates the R functions to call the C++ code.
- In an RMarkdown document, Rcpp code snippets can be created to insert the C++ code: they are compiled and interfaced to R at the time of knitting.
The following example shows how to create a C++ function to calculate the double of a numerical vector.
```{Rcpp}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector timesTwo(NumericVector x) {
return x * 2;
}
```
An R function with the same name as the C++ function is now available.
```{r timesTwo}
timesTwo(1:5)
```
The performance is two orders of magnitude faster than the R code (see the case study, section \@ref(sec:cas)).
## Parallelizing R
When long computations can be split into independent tasks, the simultaneous (*parallel*) execution of these tasks reduces the total computation time to that of the longest task, to which is added the cost of setting up the parallelization (creation of the tasks, recovery of the results...).
Read Josh Errickson's excellent introduction[^203] which details the issues and constraints of parallelization.
[^203]: <http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html>
Two mechanisms are available for parallel code execution:
- _fork_: the running process is duplicated on several cores of the computer's processor.
This is the simplest method, but it does not work on Windows (a limitation of the operating system).
- _Socket_: a cluster is constituted, either physically (a set of computers running R is necessary) or logically (an instance of R on each core of the computer used).
The members of the cluster communicate through the network (the internal network of the computer is used in a logical cluster).
Several R packages implement these mechanisms.
### mclapply (fork)
The `mclapply` function of the **parallel** package has the same syntax as `lapply` but parallelizes the execution of loops.
On Windows, it has no effect since the system does not allow _fork_: it simply calls `lapply`.
However, a workaround exists to emulate `mclapply` on Windows by calling `parLapply`, which uses a cluster.
```{r, tidy=FALSE}
##
## mclapply.hack.R
##
## Nathan VanHoudnos
## nathanvan AT northwestern FULL STOP edu
## July 14, 2014
##
## A script to implement a hackish version of
## parallel::mclapply() on Windows machines.
## On Linux or Mac, the script has no effect
## beyond loading the parallel library.
require(parallel)

## Define the hack
# mc.cores argument added: Eric Marcon
mclapply.hack <- function(..., mc.cores=detectCores()) {
  ## Create a cluster
  size.of.list <- length(list(...)[[1]])
  cl <- makeCluster(min(size.of.list, mc.cores))
  ## Find out the names of the loaded packages
  loaded.package.names <- c(
    ## Base packages
    sessionInfo()$basePkgs,
    ## Additional packages
    names(sessionInfo()$otherPkgs))
  tryCatch({
    ## Copy over all of the objects within scope to
    ## all clusters.
    this.env <- environment()
    while (!identical(this.env, globalenv())) {
      clusterExport(cl,
                    ls(all.names=TRUE, env=this.env),
                    envir=this.env)
      this.env <- parent.env(environment())
    }
    clusterExport(cl,
                  ls(all.names=TRUE, env=globalenv()),
                  envir=globalenv())
    ## Load the libraries on all the clusters
    ## N.B. length(cl) returns the number of clusters
    parLapply(cl, 1:length(cl), function(xx) {
      lapply(loaded.package.names, function(yy) {
        require(yy, character.only=TRUE)})
    })
    ## Run the lapply in parallel
    return(parLapply(cl, ...))
  }, finally = {
    ## Stop the cluster
    stopCluster(cl)
  })
}

## Warn the user if they are using Windows
if (Sys.info()[['sysname']] == 'Windows') {
  message(paste(
    "\n",
    "  *** Microsoft Windows detected ***\n",
    "\n",
    "  For technical reasons, the MS Windows version of mclapply()\n",
    "  is implemented as a serial function instead of a parallel\n",
    "  function.",
    "\n\n",
    "  As a quick hack, we replace this serial version of mclapply()\n",
    "  with a wrapper to parLapply() for this R session. Please see\n\n",
    "  http://www.stat.cmu.edu/~nmv/2014/07/14/implementing-mclapply-on-windows\n\n",
    "  for details.\n\n"))
}

## If the OS is Windows, set mclapply to the
## hackish version. Otherwise, leave the
## definition alone.
mclapply <- switch(Sys.info()[['sysname']],
                   Windows = {mclapply.hack},
                   Linux   = {mclapply},
                   Darwin  = {mclapply})
## end mclapply.hack.R
```
The following code tests the parallelization of a function that returns its argument unchanged after a quarter-second pause.
This document is knitted on a machine with `r parallel::detectCores()` cores; all but one are used so as not to saturate the system.
```{r detectCores}
f <- function(x, time=.25) {
Sys.sleep(time)
return(x)
}
# Leave one core out for the system
nbCores <- detectCores() - 1
# Serial: theoretical time = nbCores/4 seconds
(tserie <- system.time(lapply(1:nbCores, f)))
# Parallel: theoretical time = 1/4 second
(tparallele <- system.time(mclapply(1:nbCores, f, mc.cores=nbCores)))
```
Setting up the parallelization costs about `r format(tparallele[3]-1/4, digits=2)` seconds here.
On Windows, execution is much slower in parallel because setting up the cluster takes more time than the parallelization saves.
Parallelization becomes worthwhile for longer tasks, such as a one-second pause.
```{r serie_parallele}
# Serial
system.time(lapply(1:nbCores, f, time=1))
# Parallel
system.time(mclapply(1:nbCores, f, time=1, mc.cores=nbCores))
```
The overhead of parallel execution is now relatively smaller: the fixed cost falls below the savings as the duration of each task increases.
If the number of parallel tasks exceeds the number of cores used, performance drops sharply because the extra task can only run after the first batch has finished.
```{r nbCoeurs}
system.time(mclapply(1:nbCores, f, time=1, mc.cores=nbCores))
system.time(mclapply(1:(nbCores+1), f, time=1, mc.cores=nbCores))
```
The time then remains stable until the number of cores is doubled.
Figure \@ref(fig:r-parallele) shows the evolution of the computation time according to the number of tasks.
(ref:r-parallele) Parallel execution time of tasks requiring one second each (each task is a one-second pause). The number of tasks varies from 1 to twice the number of cores used (equal to `r nbCores`) plus one.
```{r r-parallele, fig.cap="(ref:r-parallele)", fig.scap="Execution time in parallel", tidy=FALSE}
Tasks <- 1:(2 * nbCores+1)
Time <- sapply(Tasks, function(nbTasks) {
system.time(mclapply(1:nbTasks, f, time=1, mc.cores=nbCores))
})
library("tidyverse")
tibble(Tasks, Time=Time["elapsed", ]) %>%
ggplot +
geom_line(aes(x = Tasks, y = Time)) +
geom_vline(xintercept = nbCores, col = "red", lty = 2) +
geom_vline(xintercept = 2 * nbCores, col = "red", lty = 2)
```
The theoretical shape of this curve is as follows:
- For a single task, the time equals one second plus the parallelization setup time.
- The time should remain stable up to the number of cores used.
- When all the cores are busy (red dashed lines), the time should increase by one second, then remain stable until the next limit.

In practice, the computation time depends on other factors that are difficult to predict.
The best practice is to match the number of tasks to the number of cores used, otherwise performance is lost.
### parLapply (socket)
`parLapply` requires creating a cluster, exporting the useful variables to each node, loading the necessary packages on each node, executing the code and finally stopping the cluster.
The code for each step can be found in the `mclapply.hack` function above.
For everyday use, `mclapply` is faster (except on Windows) and simpler (including on Windows, thanks to the workaround above).
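The steps above can be sketched as follows; the variable `x0` and the values used are purely illustrative:
```{r parLapply-sketch}
library("parallel")
# Create a (logical) cluster of 2 workers
cl <- makeCluster(2)
# Export a variable needed by the workers
x0 <- 10
clusterExport(cl, "x0")
# Load a package on each node
clusterEvalQ(cl, library("stats"))
# Execute the code on the cluster
res <- parLapply(cl, 1:4, function(i) i + x0)
# Stop the cluster
stopCluster(cl)
unlist(res)
```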
### foreach
#### How it works
The **foreach** package allows advanced use of parallelization.
Read its vignettes.
```{r foreach_v, eval=FALSE}
# Manual
vignette("foreach", "foreach")
# Nested loops
vignette("nested", "foreach")
```
Independently of any parallelization, **foreach** redefines *for* loops.
```{r foreach}
for (i in 1:3) {
f(i)
}
# becomes
library("foreach")
foreach (i=1:3) %do% {
f(i)
}
```
The `foreach` function returns a list containing the results of each loop.
The elements of the list can be combined by any function, such as `c`.
```{r foreach2}
foreach (i=1:3, .combine="c") %do% {
f(i)
}
```
The `foreach` function is capable of using iterators, that is, functions that pass to the loop only the data it needs without loading the rest into memory.
Here, the `icount` iterator passes the values 1, 2 and 3 individually, without loading the 1:3 vector into memory.
```{r iterators}
library("iterators")
foreach (i=icount(3), .combine="c") %do% {
f(i)
}
```
Iterators are therefore very useful when each element of the loop uses a large amount of memory.
#### Parallelization
Replacing the `%do%` operator with `%dopar%` parallelizes the loops, provided that an adapter is loaded, i.e. an intermediate package between **foreach** and a package implementing parallelization.
**doParallel** is an adapter for using the **parallel** package that comes with R.
```{r doParallel}
library(doParallel)
registerDoParallel(cores = nbCores)
# Serial
system.time(foreach (i=icount(nbCores), .combine="c") %do% {f(i)})
# Parallel
system.time(foreach (i=icount(nbCores), .combine="c") %dopar% {f(i)})
```
The fixed cost of parallelization is low.
### future
The **future** package abstracts the code from the parallelization implementation.
It is at the centre of an ecosystem of packages that facilitate its use[^250].
[^250]: <https://www.futureverse.org/>
The parallelization strategy used is declared by the `plan()` function.
The default strategy is `sequential`, i.e. single-task.
The `multicore` and `multisession` strategies are based respectively on the _fork_ and _socket_ techniques seen above.
Other strategies are available for using physical clusters (several computers prepared to run R together): the **future** documentation details how to do this.
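As a minimal sketch of the basic API (assuming the **future** package is installed; the objects `f1` and `v1` are illustrative), a plan is declared once and futures are then created and resolved identically whatever the strategy:
```{r future-sketch}
library("future")
# Declare the strategy: socket-based, with 2 background R sessions
plan(multisession, workers = 2)
# Create a future: its expression is evaluated in a worker
f1 <- future(Sys.getpid())
# value() blocks until the result is available
v1 <- value(f1)
# The worker's PID differs from that of the main session
v1 != Sys.getpid()
# Return to single-task execution
plan(sequential)
```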