Merge branch 'master' of github.com:EricMarcon/WorkingWithR

EricMarcon · Jan 3, 2024 · e102265
2 parents c20c8e0 + f37d47a
commit e102265
Showing 10 changed files with 345 additions and 213 deletions.
diff --git a/.github/workflows/bookdown.yml b/.github/workflows/bookdown.yml
@@ -10,7 +10,7 @@ jobs:
     runs-on: macOS-latest
     steps:
       - name: Checkout repo
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
       - name: Setup R
         uses: r-lib/actions/setup-r@v2
       - name: Install pandoc
@@ -21,19 +21,17 @@ jobs:
         run: |
           options(pkgType = "binary")
           options(install.packages.check.source = "no")
-          install.packages(c("remotes", "bookdown", "tinytex", "webshot", "downlit"))
-          remotes::install_deps(dependencies = TRUE)
-          tinytex::install_tinytex()
+          install.packages(c("distill", "downlit", "memoiR", "rmdformats", "tinytex"))
+          tinytex::install_tinytex(bundle = "TinyTeX")
           tinytex::tlmgr_install(c("tex-gyre", "tex-gyre-math"))
-          webshot::install_phantomjs()
         shell: Rscript {0}
       - name: Render pdf book
         env:
           GITHUB_PAT: ${{ secrets.GH_PAT }}
         run: |
           bookdown::render_book("index.Rmd", "bookdown::pdf_book")
         shell: Rscript {0}
-      - name: Render gitbook
+      - name: Render bs4 book
         env:
           GITHUB_PAT: ${{ secrets.GH_PAT }}
         run: |

diff --git a/02-UseR.Rmd b/02-UseR.Rmd
@@ -202,7 +202,7 @@ Attributes and methods can be public or private.
 An `initialize()` method is used as a constructor.
 
 ```{r S6-Class}
-library(R6)
+library("R6")
 PersonR6 <- R6Class("PersonR6", 
                       public = list(LastName="character", FirstName="character",
                                     initialize = function(LastName=NA, FirstName=NA) {
@@ -987,10 +987,52 @@ system.time(foreach (i=icount(nbCores), .combine="c") %dopar% {f(i)})
 The fixed cost of parallelization is low.
 
 
+### future
+
+The **future** package abstracts the code from the implementation of parallelization.
+It is at the centre of an ecosystem of packages that facilitate its use[^250].
+
+[^250]: https://www.futureverse.org/
+
+The parallelization strategy used is declared by the `plan()` function.
+The default strategy is `sequential`, i.e. single-task.
+The `multicore` and `multisession` strategies are based respectively on the _fork_ and _socket_ techniques seen above.
+Other strategies are available for using physical clusters (several computers prepared to run R together): the **future** documentation details how to do this.
+
+Here we will use the `multisession` strategy, which works on the local computer, whatever its operating system.
+
+```{r future}
+library("future")
+# Socket strategy on all available cores except 1
+usedCores <- availableCores() - 1
+plan(multisession, workers = usedCores)
+```
+
+The **future.apply** package allows all `*apply()` and `replicate()` loops to be effortlessly parallelized by prefixing their names with `future_`.
+
+```{r future.apply}
+library("future.apply")
+system.time(future_replicate(usedCores - 1, f(usedCores)))
+```
+
+foreach loops can be parallelized with the **doFuture** package by simply replacing `%dopar%` with `%dofuture%`.
+
+```{r doFuture}
+library("doFuture")
+system.time(foreach (i = icount(nbCores), .combine="c") %dofuture% {f(i)})
+```
+
+The strategy is reset to `sequential` at the end.
+
+```{r sequential}
+plan(sequential)
+```
+
+
 ## Case study {#sec:cas}
 
 This case study tests the different techniques seen above to solve a concrete problem.
-The objective is to compute the average distance between two points of a random seed of 1000 points in a square window of side 1.
+The objective is to compute the average distance between two points of a random set of 1000 points in a square window of side 1.
 
 Its expectation is computable[^230].
 It is equal to $\frac{2+\sqrt{2}+5\ln{(1+\sqrt{2})}}{15} \approx 0.5214$.
@@ -1103,6 +1145,33 @@ d
 ```
 
 
+### future.apply
+
+The `fsapply4()` function optimised above can be parallelized directly by prefixing the `vapply()` function with `future_`.
+Only the main loop is parallelized: nesting `future_vapply()` would collapse performance.
+
+```{r}
+library("future.apply")
+# Socket strategy on all available cores except 1
+plan(multisession, workers = availableCores() - 1)
+future_fsapply4_ <- function() {
+  distances <- future_vapply(1:NbPoints, function(i) {
+    vapply(1:NbPoints, function(j) {
+      if (j>i) {
+        (X$x[i] - X$x[j])^2 + (X$y[i] - X$y[j])^2
+      } else {
+        0
+      }
+    }, 0.0)
+  }, numeric(NbPoints))
+  return(sum(sqrt(distances)) / NbPoints / (NbPoints - 1) * 2)
+}
+system.time(d <- future_fsapply4_())
+d
+plan(sequential)
+```
+
+
 ### for loop
 
 A for loop is faster and consumes less memory because it does not store the distance matrix.
@@ -1126,7 +1195,29 @@ This is the simplest and most efficient way to write this code with core R and n
 
 ### foreach loop
 
-Two nested foreach loops are needed here: they are extremely slow compared to a simple loop.
+Parallelization executes for loops inside a foreach loop, which is quite efficient.
+However, each distance is calculated twice, once for each ordering of the pair of points.
+
+```{r registerDoParallel, tidy=FALSE}
+registerDoParallel(cores = detectCores())
+fforeach3 <- function(Y) {
+  distances <- foreach(
+    i = icount(Y$n), 
+    .combine = '+') %dopar% {
+      distance <- 0
+      for (j in 1:Y$n) {
+        distance <- distance + 
+          sqrt((Y$x[i] - Y$x[j])^2 + (Y$y[i] - Y$y[j])^2)
+      }
+      distance
+    }
+  return(distances / Y$n / (Y$n - 1))
+}
+system.time(d <- fforeach3(X))
+d
+```
+
+It is possible to nest two foreach loops, but they are extremely slow compared with a simple loop.
 The test is run here with 10 times fewer points, so 100 times fewer distances to calculate.
 
 ```{r}
@@ -1149,32 +1240,6 @@ d
 
 Nested foreach loops should be reserved for very long tasks (several seconds at least) to compensate the fixed costs of setting them up.
 
-Parallelization is efficient in the code below, especially because it avoids nested foreach loops.
-On the other hand, distances are calculated twice.
-The performance is still much lower than a simple for loop (remember: 100 times less distances are computed).
-
-```{r registerDoParallel, tidy=FALSE}
-registerDoParallel(cores = detectCores())
-fforeach3 <- function(Y) {
-  distances <- 
-    foreach(i=icount(NbPointsReduit), 
-            .combine='+') %dopar% {
-      distance <- 0
-      for (j in 1:Y$n) {
-        distance <- distance + 
-          sqrt((Y$x[i]-Y$x[j])^2 + (Y$y[i]-Y$y[j])^2)
-      }
-      distance
-    }
-  return(distances/NbPointsReduit/(NbPointsReduit-1))
-}
-system.time(d <- fforeach3(Y))
-d
-```
-
-**foreach** has optimized adapters allowing to use physical clusters for example. 
-Its interest is limited with the **parallel** package.
-
 
 ### RCpp
 
@@ -1319,7 +1384,9 @@ system.time(d <- TotalDistance(X$x, X$y)/NbPoints/(NbPoints-1)*2)
 From this case study, several lessons can be learned:
 
 - A for loop is a good basis for repetitive calculations, faster than `vapply()`, simple to read and write.
+- **foreach** loops are extremely effective for parallelizing for loops.
 - Optimized functions may exist in R packages for common tasks (here, the `pairdist()` function of **spatstat** is two orders of magnitude faster than the for loop).
+- The **future.apply** package makes it very easy to parallelize code that has already been written with `*apply()` functions, regardless of the hardware used.
 - The use of C++ code allows to speed up the calculations significantly, by three orders of magnitude here.
 - Parallelization of the C++ code further divides the computation time by about half the number of cores for long computations.
 
@@ -1332,8 +1399,7 @@ Writing vector code, using `sapply()` is still justified for its readability.
 
 The choice of parallelizing the code must be evaluated according to the execution time of each parallelizable task.
 If it exceeds a few seconds, parallelization is justified.
-`mclapply()` replaces `lapply()` without any effort, but requires a hack (provided here) on Windows.
-`foreach()` does not replace `for()` as easily and is only justified for very memory and computationally heavy tasks, especially on computing clusters.
+
 
 ## Workflow {#sec:targets}
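
Note on the case study in `02-UseR.Rmd`: the expectation quoted in the text, and the effect of computing each distance twice, can be checked with a few lines of base R. This is a minimal sketch and not code from the commit; the seed and the names `n_points`, `x`, `y`, `mean_once` and `mean_twice` are assumptions made for illustration.

```r
# Mean distance between two points drawn uniformly in the unit square.
set.seed(42)  # hypothetical seed, for reproducibility only
n_points <- 1000
x <- runif(n_points)
y <- runif(n_points)

# Closed-form expectation quoted in the text, approximately 0.5214
expected <- (2 + sqrt(2) + 5 * log(1 + sqrt(2))) / 15

# Each pair counted once (j > i), averaged over n(n-1)/2 pairs
total_once <- 0
for (i in 1:(n_points - 1)) {
  j <- (i + 1):n_points
  total_once <- total_once + sum(sqrt((x[i] - x[j])^2 + (y[i] - y[j])^2))
}
mean_once <- total_once / (n_points * (n_points - 1) / 2)

# Each pair counted twice (all j for every i), averaged over n(n-1) terms
total_twice <- 0
for (i in 1:n_points) {
  total_twice <- total_twice + sum(sqrt((x[i] - x)^2 + (y[i] - y)^2))
}
mean_twice <- total_twice / (n_points * (n_points - 1))
```

Both estimators return the same value. The second does twice the arithmetic, which is the pattern followed by `fforeach3()`, where every iteration of the outer loop runs over all points.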
 

diff --git a/04-Writing.Rmd b/04-Writing.Rmd
@@ -5,6 +5,20 @@
 R and RStudio make it possible to efficiently write documents of all formats, from simple notepads to theses to slide shows.
 The tools to do this are the subject of this chapter, completed by the production of web sites (including a personal site).
 
+Two document production processes are available:
+
+- *R Markdown* with the **knitr** and **bookdown** packages.
+This is the classic method, presented here in detail.
+- *Quarto*, designed to be used with languages beyond R and in working environments beyond RStudio.
+Quarto is under active development but does not yet allow documents to be produced with the same quality as *R Markdown*: for example, punctuation in French documents is not handled correctly in PDF[^rediger-41], tables cannot include equations[^rediger-42] and the width of figures is inconsistent in PDF documents formatted with several columns[^rediger-43].
+The use of Quarto is well documented on its site[^rediger-40] and is not presented here.
+
+[^rediger-40]: <https://quarto.org/>
+[^rediger-41]: <https://github.com/jgm/pandoc/issues/8283/>
+[^rediger-42]: <https://github.com/quarto-dev/quarto-cli/issues/555>
+[^rediger-43]: <https://github.com/quarto-dev/quarto-cli/issues/855>
+
+
 ## Markdown notebook (R Notebook)
 
 In an `.R` file, the code should always be commented to make it easier to read.
@@ -385,6 +399,18 @@ The correspondence and the complete list of languages can be found in table 3 of
 
 [^403]: http://mirrors.ctan.org/macros/unicodetex/latex/polyglossia/polyglossia.pdf
 
+HTML formatting of punctuation in French documents is possible using a pandoc filter[^450].
+The `fr-nbsp.lua` file must be copied into the project directory from its GitHub repository and declared in the header of the Markdown document.
+
+```
+output:
+    pandoc_args:
+      --lua-filter=fr-nbsp.lua
+```
+
+The filter formats all the punctuation in the document, whatever the language: it should therefore only be used for documents written entirely in French.
+
+[^450]: https://github.com/InseeFrLab/pandoc-filter-fr-nbsp
 
 ### Simple Article template {#sec:memo}
 

diff --git a/05-Package.Rmd b/05-Package.Rmd
@@ -230,31 +230,27 @@ The development of the package is punctuated by many commits at each modificatio
 ### package.R
 
 The `package.R` file is intended to receive the R code and especially the comments for **roxygen2** which concern the whole package.
+This file can also be named `multiple-package.R`, prefixed with the package name, for compatibility with **usethis**.
+It can be created under this name with the command:
+```{r use_package_doc, eval=FALSE}
+usethis::use_package_doc()
+```
 
-The first comment block will produce the package help (`?multiple`).
+The first comment block will generate the package help (`?multiple`).
 
 ```
-#' multiple-package
-#'
-#' Multiples of numbers
-#' 
-#' This package allows simple computation of multiples 
-#' of numbers, including fast algorithms for integers.
-#'
-#' @name multiple
-#' @docType package
-NULL
+#' @keywords internal 
+"_PACKAGE"
 ```
 
-Its organization is identical to that of the function documentations, with two particular declarations for the package name and the documentation type.
-The `NULL` code after the comments tells **roxygen2** that there is no related R code.
+The `"_PACKAGE"` keyword indicates that package documentation must be produced.
+The documentation could be written in the block, with a syntax identical to that of functions, but its default content is that of the `Description` field in the `DESCRIPTION` file.
+The `internal` keyword hides the package documentation in the help summary.
 
 The documentation is updated by the `roxygen2::roxygenise()` command.
 After rebuilding the package, check that the help has appeared: `?multiple`.
 
 
-
-
 ## Package organization
 
 ### DESCRIPTION file {#sec:package-description}
@@ -270,9 +266,8 @@ Authors@R:
            role = c("aut", "cre"),
            email = "[email protected]",
            comment = c(ORCID = "0000-0002-5249-321X"))
-Description: This package allows simple computation
-  of multiples of numbers, including fast algorithms
-  for integers.
+Description: Simple computation of multiples of numbers, 
+  including fast algorithms for integers.
 License: GPL-3
 Encoding: UTF-8
 LazyData: true
@@ -301,7 +296,7 @@ When the development is stabilized, the new version, intended to be used in prod
 
 The description of the authors is rather heavy but simple to understand.
 The Orcid identifiers of academic authors can be used.
-If the package has several authors, they are placed in a `c()` function: `c(person(...), person())` for two authors.
+If the package has several authors, they are placed in a `c()` function: `c(person(...), person(...))` for two authors.
 In this case, the role of each must be specified:
 
 * "cre" for the creator of the package.
@@ -1003,20 +998,23 @@ References are cited by the command `\insertCite{key}{package}` in the documenta
 `package` is the name of the package in which the `REFERENCES.bib` file is to be searched: this will normally be the current package, but references to other packages are accessible, provided only that they use **Rdpack**.
 
 `key` is the identifier of the reference in the file.
-The following example[^507] is from the documentation of the **SpatDiv** package hosted on GitHub, in its `.R` file:
+The following example[^507] is from the documentation of the **divent** package hosted on GitHub, in its `.R` file:
 
 ```{r Citations}
-#' SpatDiv
+#' divent
 #'
-#' Spatially Explicit Measures of Diversity
+#' Measures of Diversity and Entropy
 #' 
-#' This package extends the **entropart** package
-#' \insertCite{Marcon2014c}{SpatDiv}.
-#' It provides spatially explicit measures of 
-#' diversity such as the mixing index.
+#' This package is a reboot of the **entropart** package \insertCite{Marcon2014c}{divent}.
+#'
+#' @importFrom Rdpack reprompt
+#' 
+#' @references
+#' \insertAllCited{}
+"_PACKAGE"
 ```
 
-[^507]: **SpatDiv** package on GitHub: https://github.com/EricMarcon/SpatDiv/blob/master/R/package.R
+[^507]: **divent** package on GitHub: https://github.com/EricMarcon/divent/blob/master/R/package.R
 
 The cited reference is in `inst/REFERENCES.bib`:
 
@@ -1037,7 +1035,7 @@ Citations are enclosed in parentheses.
 To place the author's name outside the parenthesis, add the statement `;textual`:
 
 ```
-\insertCite{Marcon2014c;textual}{SpatDiv}
+\insertCite{Marcon2014c;textual}{divent}
 ```
 To cite several references (necessarily from the same package), separate them with commas.
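
Note on the `Authors@R` field discussed in `05-Package.Rmd`: the `c(person(...), person(...))` construct can be tried directly in an R session before it is written to the `DESCRIPTION` file. This is a minimal sketch; the two authors, their roles and the e-mail address are hypothetical placeholders, not taken from the package.

```r
# Two hypothetical authors; names and e-mail address are placeholders
authors <- c(
  person("Jane", "Doe", role = c("aut", "cre"),
         email = "jane.doe@example.org"),
  person("John", "Smith", role = "aut")
)
# Number of authors declared
length(authors)
# One formatted string per author, with the roles in brackets
format(authors, include = c("given", "family", "role"))
```

Building the object interactively is a quick way to catch a misspelled argument before the package is rebuilt.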