Merge branch 'master' of github.com:EricMarcon/WorkingWithR
EricMarcon committed Jan 3, 2024
2 parents c20c8e0 + f37d47a commit e102265
Showing 10 changed files with 345 additions and 213 deletions.
10 changes: 4 additions & 6 deletions .github/workflows/bookdown.yml
Expand Up @@ -10,7 +10,7 @@ jobs:
runs-on: macOS-latest
steps:
- name: Checkout repo
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup R
uses: r-lib/actions/setup-r@v2
- name: Install pandoc
Expand All @@ -21,19 +21,17 @@ jobs:
run: |
options(pkgType = "binary")
options(install.packages.check.source = "no")
install.packages(c("remotes", "bookdown", "tinytex", "webshot", "downlit"))
remotes::install_deps(dependencies = TRUE)
tinytex::install_tinytex()
install.packages(c("distill", "downlit", "memoiR", "rmdformats", "tinytex"))
tinytex::install_tinytex(bundle = "TinyTeX")
tinytex::tlmgr_install(c("tex-gyre", "tex-gyre-math"))
webshot::install_phantomjs()
shell: Rscript {0}
- name: Render pdf book
env:
GITHUB_PAT: ${{ secrets.GH_PAT }}
run: |
bookdown::render_book("index.Rmd", "bookdown::pdf_book")
shell: Rscript {0}
- name: Render gitbook
- name: Render bs4 book
env:
GITHUB_PAT: ${{ secrets.GH_PAT }}
run: |
128 changes: 97 additions & 31 deletions 02-UseR.Rmd
Expand Up @@ -202,7 +202,7 @@ Attributes and methods can be public or private.
An `initialize()` method is used as a constructor.

```{r S6-Class}
library(R6)
library("R6")
PersonR6 <- R6Class("PersonR6",
public = list(LastName="character", FirstName="character",
initialize = function(LastName=NA, FirstName=NA) {
Expand Down Expand Up @@ -987,10 +987,52 @@ system.time(foreach (i=icount(nbCores), .combine="c") %dopar% {f(i)})
The fixed cost of parallelization is low.


### future

The **future** package is used to abstract the code of the parallelization implementation.
It is at the centre of an ecosystem of packages that facilitate its use[^250].

[^250]: https://www.futureverse.org/

The parallelization strategy used is declared by the `plan()` function.
The default strategy is `sequential`, i.e. single-task.
The `multicore` and `multisession` strategies are based respectively on the _fork_ and _socket_ techniques seen above.
Other strategies are available for using physical clusters (several computers prepared to run R together): the **future** documentation details how to do this.

Here we will use the `multisession` strategy, which works on the local computer, whatever its operating system.

```{r future}
library("future")
# Socket strategy on all available cores except 1
usedCores <- availableCores() - 1
plan(multisession, workers = usedCores)
```

The **future.apply** package allows all `*apply()` and `replicate()` loops to be effortlessly parallelized by prefixing their names with `future_`.

```{r future.apply}
library("future.apply")
system.time(future_replicate(usedCores - 1, f(usedCores)))
```
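
As a minimal, self-contained sketch of this drop-in replacement, the code below compares a `vapply()` call with its `future_vapply()` counterpart. The `slow_square()` function and the choice of two workers are assumptions made for illustration and are not part of the book's code.

```r
library("future")
library("future.apply")

# Hypothetical task: a short pause stands in for a real computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Declare a socket strategy, keeping the previous plan to restore it later
old_plan <- plan(multisession, workers = 2)

# Sequential version
seq_result <- vapply(1:4, slow_square, FUN.VALUE = 0.0)
# Parallel version: same arguments, function name prefixed with future_
par_result <- future_vapply(1:4, slow_square, FUN.VALUE = 0.0)

# Restore the strategy that was active before the sketch
plan(old_plan)

# Both calls are expected to return the same numeric vector
identical(seq_result, par_result)
```

Saving the value returned by `plan()` and passing it back at the end leaves the strategy declared earlier unchanged for the code that follows.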

foreach loops can be parallelized with the **doFuture** package by simply replacing `%dopar%` with `%dofuture%`.

```{r doFuture}
library("doFuture")
system.time(foreach (i = icount(nbCores), .combine="c") %dofuture% {f(i)})
```

The strategy is reset to `sequential` at the end.

```{r sequential}
plan(sequential)
```
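
When the strategy is changed inside a function, a possible pattern is to restore the previous plan on exit rather than resetting it to `sequential` unconditionally. The sketch below assumes a hypothetical helper, `with_workers()`, and uses the explicit `future()` and `value()` functions of the **future** package. It illustrates the idea and is not code taken from the book.

```r
library("future")

# Hypothetical helper: run a function under a temporary multisession plan
with_workers <- function(task, workers = 2) {
  # plan() returns the previous strategy, which is restored on exit
  old_plan <- plan(multisession, workers = workers)
  on.exit(plan(old_plan), add = TRUE)
  task()
}

squares <- with_workers(function() {
  # One future per task: each is evaluated in a background R session
  futures <- lapply(1:4, function(i) future(i^2))
  # value() blocks until the corresponding result is available
  vapply(futures, value, FUN.VALUE = 0.0)
})
squares
```

The strategy active before the call (whether `sequential` or another one) is in force again after it, even if the task fails.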


## Case study {#sec:cas}

This case study tests the different techniques seen above to solve a concrete problem.
The objective is to compute the average distance between two points of a random seed of 1000 points in a square window of side 1.
The objective is to compute the average distance between two points of a random set of 1000 points in a square window of side 1.

Its expectation is computable[^230].
It is equal to $\frac{2+\sqrt{2}+5\ln{(1+\sqrt{2})}}{15} \approx 0.5214$.
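
As a quick check of this value, the expression can be evaluated in R and compared with a simulation. The sketch below uses base R only: the `dist()` function, the seed and the variable names are choices made for illustration, not the book's implementation.

```r
# Theoretical expectation of the distance between two random points
# in the unit square
expected_distance <- (2 + sqrt(2) + 5 * log(1 + sqrt(2))) / 15
expected_distance

# Simulation: 1000 uniformly distributed points in the unit square
set.seed(42)
points_xy <- cbind(runif(1000), runif(1000))
# dist() returns each pairwise distance once; their mean is the estimate
simulated_distance <- mean(dist(points_xy))
simulated_distance
```

The simulated value fluctuates around the expectation from one random draw to another.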
Expand Down Expand Up @@ -1103,6 +1145,33 @@ d
```


### future.apply

The `fsapply4()` function optimized above can be parallelized directly by prefixing the `vapply` function with `future_`.
Only the main loop is parallelized: nesting `future_vapply()` would collapse performance.

```{r}
library("future.apply")
# Socket strategy on all available cores except 1
plan(multisession, workers = availableCores() - 1)
future_fsapply4_ <- function() {
distances <- future_vapply(1:NbPoints, function(i) {
vapply(1:NbPoints, function(j) {
if (j>i) {
(X$x[i] - X$x[j])^2 + (X$y[i] - X$y[j])^2
} else {
0
}
}, 0.0)
}, 1:1000+0.0)
return(sum(sqrt(distances)) / NbPoints / (NbPoints - 1) * 2)
}
system.time(d <- future_fsapply4_())
d
plan(sequential)
```


### for loop

A for loop is faster and consumes less memory because it does not store the distance matrix.
Expand All @@ -1126,7 +1195,29 @@ This is the simplest and most efficient way to write this code with core R and n

### foreach loop

Two nested foreach loops are needed here: they are extremely slow compared to a simple loop.
Parallelization executes for loops inside a foreach loop, which is quite efficient.
However, distances are calculated twice.

```{r registerDoParallel, tidy=FALSE}
registerDoParallel(cores = detectCores())
fforeach3 <- function(Y) {
distances <- foreach(
i = icount(Y$n),
.combine = '+') %dopar% {
distance <- 0
for (j in 1:Y$n) {
distance <- distance +
sqrt((Y$x[i] - Y$x[j])^2 + (Y$y[i] - Y$y[j])^2)
}
distance
}
return(distances / Y$n / (Y$n - 1))
}
system.time(d <- fforeach3(X))
d
```
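
The double computation can be avoided by restricting the inner computation to the pairs where `j > i` and multiplying the result by 2. The following sketch is a hypothetical variant, named `fforeach_half()` here. It assumes that the points are stored in a plain list with `x` and `y` elements, vectorizes the inner loop, and uses `%do%` so that it runs without a registered backend. `%dopar%` could be substituted once a backend is registered.

```r
library("foreach")

# Hypothetical variant of fforeach3(): each pair (i, j) is visited once (j > i)
fforeach_half <- function(Y) {
  n <- length(Y$x)
  total <- foreach(i = seq_len(n - 1), .combine = "+") %do% {
    # Indices of the points that come after point i
    j <- (i + 1):n
    sum(sqrt((Y$x[i] - Y$x[j])^2 + (Y$y[i] - Y$y[j])^2))
  }
  # Each distance was counted once, hence the factor 2
  total / n / (n - 1) * 2
}

# Small test set, checked against the mean of dist()
set.seed(1)
pts <- list(x = runif(200), y = runif(200))
d_half <- fforeach_half(pts)
d_reference <- mean(dist(cbind(pts$x, pts$y)))
all.equal(d_half, d_reference)
```

The iterations are no longer balanced (the first ones process more pairs than the last ones), which may reduce the gain when the loop is run in parallel.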

It is possible to nest two foreach loops, but they are extremely slow compared with a simple loop.
The test is run here with 10 times fewer points, so 100 times fewer distances to calculate.

```{r}
Expand All @@ -1149,32 +1240,6 @@ d

Nested foreach loops should be reserved for very long tasks (several seconds at least) to compensate for the fixed costs of setting them up.

Parallelization is efficient in the code below, especially because it avoids nested foreach loops.
On the other hand, distances are calculated twice.
The performance is still much lower than a simple for loop (remember: 100 times less distances are computed).

```{r registerDoParallel, tidy=FALSE}
registerDoParallel(cores = detectCores())
fforeach3 <- function(Y) {
distances <-
foreach(i=icount(NbPointsReduit),
.combine='+') %dopar% {
distance <- 0
for (j in 1:Y$n) {
distance <- distance +
sqrt((Y$x[i]-Y$x[j])^2 + (Y$y[i]-Y$y[j])^2)
}
distance
}
return(distances/NbPointsReduit/(NbPointsReduit-1))
}
system.time(d <- fforeach3(Y))
d
```

**foreach** has optimized adapters allowing to use physical clusters for example.
Its interest is limited with the **parallel** package.


### RCpp

Expand Down Expand Up @@ -1319,7 +1384,9 @@ system.time(d <- TotalDistance(X$x, X$y)/NbPoints/(NbPoints-1)*2)
From this case study, several lessons can be learned:

- A for loop is a good basis for repetitive calculations, faster than `vapply()`, simple to read and write.
- **foreach** loops are extremely effective for parallelizing for loops.
- Optimized functions may exist in R packages for common tasks (here, the `pairdist()` function of **spatstat** is two orders of magnitude faster than the for loop).
- The **future.apply** package makes it very easy to parallelize code that has already been written with `*apply()` functions, regardless of the hardware used.
- The use of C++ code speeds up the calculations significantly, by three orders of magnitude here.
- Parallelization of the C++ code further divides the computation time by about half the number of cores for long computations.

Expand All @@ -1332,8 +1399,7 @@ Writing vector code, using `sapply()` is still justified for its readability.

The choice of parallelizing the code must be evaluated according to the execution time of each parallelizable task.
If it exceeds a few seconds, parallelization is justified.
`mclapply()` replaces `lapply()` without any effort, but requires a hack (provided here) on Windows.
`foreach()` does not replace `for()` as easily and is only justified for very memory and computationally heavy tasks, especially on computing clusters.


## Workflow {#sec:targets}

26 changes: 26 additions & 0 deletions 04-Writing.Rmd
Expand Up @@ -5,6 +5,20 @@
R and RStudio make it possible to efficiently write documents of all formats, from simple notepads to theses to slide shows.
The tools for doing so are the subject of this chapter, which also covers the production of websites (including a personal site).

Two document production processes are available:

- *R Markdown* with the **knitr** and **bookdown** packages.
This is the classic method, presented here in detail.
- *Quarto*, designed to be used with languages beyond R and in working environments beyond RStudio.
Quarto is under active development but does not yet allow documents to be produced with the same quality as *R Markdown*: for example, punctuation in French documents is not handled correctly in PDF[^rediger-41], tables cannot include equations[^rediger-42] and the width of figures is inconsistent in PDF documents formatted with several columns[^rediger-43].
The use of Quarto is well documented on its site[^rediger-40] and is not presented here.

[^rediger-40]: <https://quarto.org/>
[^rediger-41]: <https://github.com/jgm/pandoc/issues/8283/>
[^rediger-42]: <https://github.com/quarto-dev/quarto-cli/issues/555>
[^rediger-43]: <https://github.com/quarto-dev/quarto-cli/issues/855>


## Markdown notebook (R Notebook)

In an `.R` file, the code should always be commented to make it easier to read.
Expand Down Expand Up @@ -385,6 +399,18 @@ The correspondence and the complete list of languages can be found in table 3 of

[^403]: http://mirrors.ctan.org/macros/unicodetex/latex/polyglossia/polyglossia.pdf

HTML formatting of punctuation in French documents is possible using a pandoc filter[^450].
The `fr-nbsp.lua` file must be copied into the project directory from its GitHub repository and declared in the header of the Markdown document.

```
output:
pandoc_args:
--lua-filter=fr-nbsp.lua
```

The filter formats all the punctuation in the document, whatever the language: it should therefore only be used for documents written entirely in French.

[^450]: https://github.com/InseeFrLab/pandoc-filter-fr-nbsp

### Simple Article template {#sec:memo}

54 changes: 26 additions & 28 deletions 05-Package.Rmd
Expand Up @@ -230,31 +230,27 @@ The development of the package is punctuated by many commits at each modificatio
### package.R

The `package.R` file is intended to receive the R code and especially the comments for **roxygen2** which concern the whole package.
This file can also be named `multiple-package.R`, prefixed with the package name, for compatibility with **usethis**.
It can be created under this name with the command:
```{r use_package_doc, eval=FALSE}
usethis::use_package_doc()
```

The first comment block will produce the package help (`?multiple`).
The first comment block will generate the package help (`?multiple`).

```
#' multiple-package
#'
#' Multiples of numbers
#'
#' This package allows simple computation of multiples
#' of numbers, including fast algorithms for integers.
#'
#' @name multiple
#' @docType package
NULL
#' @keywords internal
"_PACKAGE"
```

Its organization is identical to that of the function documentations, with two particular declarations for the package name and the documentation type.
The `NULL` code after the comments tells **roxygen2** that there is no related R code.
The `"_PACKAGE"` string tells **roxygen2** that package documentation must be produced.
It could be written in the block, with a syntax identical to that of functions, but its default content is that of the `Description` field in the `DESCRIPTION` file.
The `internal` keyword hides the package documentation in the help summary.

The documentation is updated by the `roxygen2::roxygenise()` command.
After rebuilding the package, check that the help has appeared: `?multiple`.




## Package organization

### DESCRIPTION file {#sec:package-description}
Expand All @@ -270,9 +266,8 @@ Authors@R:
role = c("aut", "cre"),
email = "[email protected]",
comment = c(ORCID = "0000-0002-5249-321X"))
Description: This package allows simple computation
of multiples of numbers, including fast algorithms
for integers.
Description: Simple computation of multiples of numbers,
including fast algorithms for integers.
License: GPL-3
Encoding: UTF-8
LazyData: true
Expand Down Expand Up @@ -301,7 +296,7 @@ When the development is stabilized, the new version, intended to be used in prod

The description of the authors is rather heavy but simple to understand.
The ORCID identifiers of academic authors can be used.
If the package has several authors, they are placed in a `c()` function: `c(person(...), person())` for two authors.
If the package has several authors, they are placed in a `c()` function: `c(person(...), person(...))` for two authors.
In this case, the role of each must be specified:

* "cre" for the creator of the package.
Expand Down Expand Up @@ -1003,20 +998,23 @@ References are cited by the command `\insertCite{key}{package}` in the documenta
`package` is the name of the package in which the `REFERENCES.bib` file is to be searched: this will normally be the current package, but references to other packages are accessible, provided only that they use **Rdpack**.

`key` is the identifier of the reference in the file.
The following example[^507] is from the documentation of the **SpatDiv** package hosted on GitHub, in its `.R` file:
The following example[^507] is from the documentation of the **divent** package hosted on GitHub, in its `.R` file:

```{r Citations}
#' SpatDiv
#' divent
#'
#' Spatially Explicit Measures of Diversity
#' Measures of Diversity and Entropy
#'
#' This package extends the **entropart** package
#' \insertCite{Marcon2014c}{SpatDiv}.
#' It provides spatially explicit measures of
#' diversity such as the mixing index.
#' This package is a reboot of the **entropart** package \insertCite{Marcon2014c}{divent}.
#'
#' @importFrom Rdpack reprompt
#'
#' @references
#' \insertAllCited{}
"_PACKAGE"
```

[^507]: **SpatDiv** package on GitHub: https://github.com/EricMarcon/SpatDiv/blob/master/R/package.R
[^507]: **divent** package on GitHub: https://github.com/EricMarcon/divent/blob/master/R/package.R

The cited reference is in `inst/REFERENCES.bib`:

Expand All @@ -1037,7 +1035,7 @@ Citations are enclosed in parentheses.
To place the author's name outside the parenthesis, add the statement `;textual`:

```
\insertCite{Marcon2014c;textual}{SpatDiv}
\insertCite{Marcon2014c;textual}{divent}
```
To cite several references (necessarily from the same package), separate them with commas.
