# Use R {#chap-utiliseR}
\toc{1}
The literature devoted to learning R is flourishing.
The following books are an arbitrary but useful selection:
* [R for Data Science](https://r4ds.had.co.nz/) [@Wickham2016] presents a complete working method, consistent with the tidyverse.
* [Advanced R](http://adv-r.had.co.nz/) [@Wickham2014] is the reference for mastering the subtleties of the language and understanding how R works.
* Finally, [Efficient R programming](https://csgillespie.github.io/efficientR/) [@Gillespie2016] deals with code optimization.
This chapter covers some advanced aspects of coding.
Details on the different languages of R are useful for creating packages.
Environments are presented next, to explain how the objects called by the code are searched for.
Finally, the optimization of code performance is discussed in depth (loops, C++ code and parallelization) and illustrated by a case study.
## The languages of R
R includes several programming languages.
The most common is S3, but it is not the only one[^206].
[^206]: https://adv-r.had.co.nz/OO-essentials.html
### Base
The core of R is the primitive functions and basic data structures like the `sum` function and `matrix` data:
```{r Core}
pryr::otype(sum)
typeof(sum)
pryr::otype(matrix(1))
typeof(matrix(1))
```
The **pryr** package displays the language in which objects are defined.
The `typeof()` function displays the internal storage type of the objects:
- the `sum()` function belongs to the base language of R and is a built-in function.
- the elements of the numerical matrix containing a single 1 are double precision reals, and the matrix itself is defined in the basic language.
Primitive functions are coded in C and are very fast.
They are always available, whatever the packages loaded.
Their use is therefore to be preferred.
### S3 {#sec:S3}
S3 is the most used language, often the only one known by R users.
It is an object-oriented language in which classes, i.e. the type of objects, are declarative.
```{r S3-1}
MyFirstName <- "Eric"
class(MyFirstName) <- "FirstName"
```
The variable `MyFirstName` is here classed as `FirstName` by a simple declaration.
Unlike the way a classical object-oriented language works[^210], S3 methods are related to functions, not objects.
[^210]: https://www.troispointzero.fr/le-blog/introduction-a-la-programmation-orientee-objet-poo/
```{r S3-2, echo=TRUE}
# Default display
MyFirstName
print.FirstName <- function(x) cat("The first name is", x)
# Modified display
MyFirstName
```
In this example, the `print()` method applied to the "FirstName" class is modified.
In a classical object-oriented language, the method would be defined in the class `FirstName`.
In R, methods are defined from generic methods.
`print` is a generic method ("a generic") declared in **base**.
```{r print-otype}
pryr::otype(print)
```
Its code is just a `UseMethod("print")` declaration:
```{r print-code}
print
```
There are many S3 methods for `print`:
```{r printS3}
head(methods("print"))
```
Each applies to a class. `print.default` is used as a last resort and relies on the type of the object, not its S3 class.
```{r print.defaut, echo=TRUE}
typeof(MyFirstName)
pryr::otype(MyFirstName)
```
An object can belong to several classes, which allows a form of inheritance of methods.
In a classical object oriented language, inheritance allows to define more precise classes ("FrenchFirstName") which inherit from more general classes ("FirstName") and thus benefit from their methods without having to redefine them.
In R, inheritance is simply declaring a vector of increasingly broad classes for an object:
```{r S3-3, echo=TRUE}
# Definition of classes by a vector
class(MyFirstName) <- c("FrenchFirstName", "FirstName")
# Alternative code, with inherits()
inherits(MyFirstName, what = "FrenchFirstName")
inherits(MyFirstName, what = "FirstName")
```
The generic looks for a method for each class, in the order of their declaration.
```{r S3-4, echo=TRUE}
print.FrenchFirstName <- function(x) cat("French first name:", x)
MyFirstName
```
In summary, S3 is the common language of R.
Almost all packages are written in S3.
Generics are everywhere but go unnoticed, for example in packages:
```{r S3-5, echo=TRUE}
library("entropart")
.S3methods(class="SpeciesDistribution")
```
The `.S3methods()` function displays all available methods for a class, as opposed to `methods()` which displays all classes for which the method passed as an argument is defined.
Many primitive functions in R are generic methods.
To find out about them, use the `help(InternalMethods)` helper.
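The same functions work with base classes, without loading any package; a quick illustrative sketch:

```{r S3-methods-base}
# Methods available for the data.frame class
head(.S3methods(class = "data.frame"))
# Classes for which the generic summary() has a method
head(methods("summary"))
```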
### S4
S4 is an evolution of S3 that structures classes to get closer to a classical object-oriented language:
- Classes must be explicitly defined, not simply declared.
- Attributes (i.e. variables describing objects), called *slots*, are explicitly declared.
- The constructor, i.e. the method that creates a new instance of a class (i.e. a variable containing an object of the class), is explicit.
Using the previous example, the S4 syntax is as follows:
```{r S4-Class, tidy=FALSE}
# Definition of the class Person, with its slots
setClass("Person",
slots = list(LastName = "character", FirstName = "character"))
# Construction of an instance
Me <- new("Person", LastName = "Marcon", FirstName = "Eric")
# Language
pryr::otype(Me)
```
Methods still belong to functions, not to objects.
They are declared by the `setMethod()` function:
```{r S4-print}
setMethod("print",
signature="Person",
function(x, ...) {
cat("The person is:", x@FirstName, x@LastName)
}
)
print(Me)
```
The attributes are called by the syntax `variable@slot`.
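Slots can also be read with the functional syntax `slot()` and listed with `slotNames()`; a minimal sketch, reusing the `Person` class defined above:

```{r S4-slots}
# Direct slot access
Me@FirstName
# Equivalent functional syntax
slot(Me, "LastName")
# List the slots of the class
slotNames("Person")
```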
In summary, S4 is more rigorous than S3.
Some packages on CRAN (**Matrix**, **sp**, **odbc**...) and many on Bioconductor are written in S4, but the language is now clearly declining in favor of S3, notably because of the success of the **tidyverse**.
### RC
RC was introduced in R 2.12 (2010) with the **methods** package.
Methods belong to classes, as in C++: they are declared in the class and called from the objects.
```{r RC-Class, tidy=FALSE}
library("methods")
# Declaration of the class
PersonRC <- setRefClass("PersonRC",
fields = list(LastName = "character", FirstName = "character"),
methods = list(print = function() cat(FirstName, LastName)))
# Constructor
MeRC <- new("PersonRC", LastName = "Marcon", FirstName ="Eric")
# Language
pryr::otype(MeRC)
# Call the print method
MeRC$print()
```
RC remains a niche language, although it is the first "true" object-oriented language of R.
### R6
R6[^211] enhances RC but is not included in R: it requires installing its package.
[^211]: https://r6.r-lib.org/
Attributes and methods can be public or private.
An `initialize()` method is used as a constructor.
```{r S6-Class}
library("R6")
PersonR6 <- R6Class("PersonR6",
public = list(LastName="character", FirstName="character",
initialize = function(LastName=NA, FirstName=NA) {
self$LastName <- LastName ; self$FirstName <- FirstName},
print = function() cat(self$FirstName, self$LastName)))
MeR6 <- PersonR6$new(LastName = "Marcon", FirstName ="Eric")
MeR6$print()
```
R6 allows rigorous object-oriented programming but is very little used.
R6 performs much better than RC, but still worse than S3[^212].
[^212]: https://r6.r-lib.org/articles/Performance.html
That R6 is not included in R is shown by **pryr**:
```{r otype}
pryr::otype(MeR6)
```
### Tidyverse
The tidyverse is a set of coherent packages that have evolved the way R is programmed.
The set of essential packages can be loaded by the **tidyverse** package which has no other use:
```{r tidyverse}
library("tidyverse")
```
This is not a new language per se but rather an extension of S3, with deep technical modifications, notably the non-standard evaluation of expressions[^220], which it is not essential to master in detail.
[^220]: https://dplyr.tidyverse.org/articles/programming.html
Its principles are written in a manifesto[^221].
Its most visible contribution for the user is the sequence of commands in a flow (code pipeline).
[^221]: https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html
In standard programming, the sequence of functions is written by successive nesting, which makes it difficult to read, especially when arguments are needed:
```{r log_mean_runif}
# Base-2 logarithm of the mean of 100 random numbers in a uniform distribution
log(mean(runif(100)), base=2)
```
In the tidyverse, the functions are chained together, which often better matches the programmer's thinking about data processing:
```{r tidylog, tidy=FALSE}
# 100 random numbers in a uniform distribution
runif(100) %>%
# Mean
mean %>%
# Base-2 logarithm
log(base=2)
```
The pipe `%>%` is an operator that calls the next function by passing it as first argument the result of the previous function.
Additional arguments are passed normally: for the readability of the code, it is essential to name them.
Most of the R functions can be used without difficulty in the tidyverse, even though they were not designed for this purpose: it is sufficient that their first argument is the data to be processed.
The pipe passes only one value to the next function, which rules out functions of several separate data arguments, such as `f(x, y)`.
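When the piped value should not be the first argument, the `.` placeholder of **magrittr** (the package that provides `%>%`) inserts it elsewhere; a minimal sketch:

```{r pipe-dot}
library("magrittr")
# The dot stands for the piped value: here it becomes the base argument
2 %>% log(8, base = .)
```

Because `.` is used as a named argument, the piped value is not also inserted as the first argument.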
The preferred data structure is the *tibble*, which is an improved dataframe: its `print()` method is more readable, and it corrects some unintuitive dataframe behavior, such as the automatic conversion of single-column dataframes to vectors.
The columns of the dataframe or tibble allow to pass as much data as needed.
Finally, data visualization is supported by **ggplot2** which relies on a theoretically sound graph grammar [@Wickham2010].
Schematically, a graph is constructed according to the following model:
```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
```
- The data is necessarily a dataframe.
- The geometry is the type of graph chosen (points, lines, histograms or other).
- The aesthetics (function `aes()`) designates what is represented: it is the correspondence between the columns of the dataframe and the elements necessary for the geometry.
- Statistics is the treatment applied to the data before passing it to the geometry (often "identity", i.e. no transformation but "boxplot" for a boxplot).
The data can be transformed by a scale function, such as `scale_y_log10()`.
- The position is the location of the objects on the graph (often "identity"; "stack" for a stacked histogram, "jitter" to move the overlapping points slightly in a `geom_point`).
- The coordinates define the display of the graph (`coord_fixed()` to avoid distorting a map for example).
- Finally, facets offer the possibility to display several aspects of the same data by producing one graph per modality of a variable.
The set formed by the pipeline and **ggplot2** allows complex processing in a readable code.
Figure \@ref(fig:diamonds) shows the result of the following code:
(ref:diamonds) Price of diamonds according to their weight. Demonstration of the **ggplot2** code combined with tidyverse data processing.
```{r diamonds, fig.cap="(ref:diamonds)", echo=TRUE, tidy=FALSE, message=FALSE}
# Diamonds data provided by ggplot2
diamonds %>%
# Keep only diamonds larger than half a carat
filter(carat > 0.5) %>%
# Graph: price vs. weight
ggplot(aes(x = carat, y = price)) +
# Scatter plot
geom_point() +
# Logarithmic scale
scale_x_log10() +
scale_y_log10() +
# Linear regression
geom_smooth(method = "lm")
```
In this figure, two geometries (scatterplot and linear regression) share the same aesthetics (price vs. weight in carats) which is therefore declared upstream, in the `ggplot()` function.
The tidyverse is documented in detail in @Wickham2016 and **ggplot2** in @Wickham2017.
## Environments {#sec:environnements}
R's objects, data and functions, are named.
Since R is modular, with the ability to add any number of packages to it, it is very likely that name conflicts will arise.
To deal with them, R has a rigorous system of name precedence: code runs in a defined environment, inheriting from parent environments.
### Organization
R starts in an empty environment.
Each loaded package creates a child environment, forming a stack of environments: each new element is the "child" of the previous one, its "parent".
The console is in the global environment, the child of the last loaded package.
```{r search}
search()
```
The code of a function called from the console runs in a child environment of the global environment:
```{r environment}
# Current environment
environment()
# The function f displays its environment
f <- function() environment()
# Display the environment of the function
f()
# Parent environment of the function's environment
parent.env(f())
```
### Search
The search for an object starts in the local environment.
If it is not found, it is searched in the parent environment, then in the parent of the parent, until the environments are exhausted, which generates an error indicating that the object was not found.
Example:
```{r GlobalEnv}
# Variable q defined in the global environment
q <- "GlobalEnv"
# Function defining q in its environment
qLocalFunction <- function() {
q <- "Function"
return(q)
}
# The local variable is returned
qLocalFunction()
# Poorly written function using a variable it does not define
qGlobalEnv <- function() {
return(q)
}
# The global environment variable is returned
qGlobalEnv()
# Delete this variable
rm(q)
# The function base::q is returned
qGlobalEnv()
```
The variable `q` is defined in the global environment.
The function `qLocalFunction` defines its own variable `q`.
Calling the function returns its local value because `q` is found in the function's environment.
The `qGlobalEnv` function returns the `q` variable that it does not define locally.
So it looks for it in its parent environment and finds the variable defined in the global environment.
By removing the variable from the global environment with `rm(q)`, the `qGlobalEnv()` function scans the stack of environments until it finds an object named `q` in the **base** package, which is the function to exit R.
It could have found another object if a package containing a `q` object had been loaded.
To avoid this erratic behavior, a function should *never* call an object not defined in its own environment.
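The robust fix (a sketch, with a hypothetical function name) is to pass every needed value explicitly as an argument, possibly with a default:

```{r qArgument}
# The value is received as an argument instead of being searched
# for in parent environments
qArgument <- function(q = "default") {
  return(q)
}
# The default value is returned
qArgument()
# An explicit value is returned
qArgument(q = "GlobalEnv")
```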
### Package namespaces
It is time to define precisely what packages make visible.
Packages contain objects (functions and data) which they *export* or not.
They are usually called by the `library()` function, which does two things:
- It *loads* the package into memory, allowing access to all its objects with the syntax `package::object` for exported objects and `package:::object` for non-exported ones.
- It then *attaches* the package, which places its environment on top of the stack.
A package can be detached and unloaded with the `unloadNamespace()` function, which removes it from the environment stack.
Example:
```{r unloadNamespace}
# entropart loaded and attached
library("entropart")
# is it attached?
isNamespaceLoaded("entropart")
# stack of environments
search()
# Diversity(), a function exported by entropart is found
Diversity(1, CheckArguments=FALSE)
# Detach and unload entropart
unloadNamespace("entropart")
# Is it attached?
isNamespaceLoaded("entropart")
# Stack of environments, without entropart
search()
# Diversity() cannot be found
tryCatch(Diversity(1), error = function(e) print(e))
# but can be called with its full name
entropart::Diversity(1, CheckArguments=FALSE)
```
Calling `entropart::Diversity()` loads the package (i.e., implicitly executes `loadNamespace("entropart")`) but does not attach it.
In practice, one should limit the number of attached packages to limit the risk of calling an unwanted function, homonymous to the desired function.
In critical cases, the full name of the function should be used: `package::function()`.
A common issue occurs with the `filter()` function of **dplyr**, which has the same name as a function in the **stats** package.
The **stats** package is usually loaded before **dplyr**, a package in the tidyverse.
Thus, `stats::filter()` must be called explicitly.
However, the **dplyr** or **tidyverse** package (which attaches all the tidyverse packages) can be loaded systematically by creating an `.Rprofile` file at the root of the project containing the command:
```{r tidyverse2, eval=FALSE}
library("tidyverse")
```
In this case, **dplyr** is attached *before* **stats**, so its `filter()` function is masked and only accessible as `dplyr::filter()`.
## Measuring execution time
The execution time of long code can be measured very simply by the `system.time` command.
For very short execution times, it is necessary to repeat the measurement: this is the purpose of the **microbenchmark** package.
### system.time
The function returns the execution time of the code.
```{r system.time}
# Mean absolute deviation of 1000 values in a uniform distribution, repeated 100 times
system.time(for(i in 1:100) mad(runif(1000)))
```
### microbenchmark
The **microbenchmark** package is the most advanced.
The goal is to compare the speed of computing the square of a vector (or a number) by multiplying it by itself ($x \times x$) or by raising it to the power of 2 ($x^2$).
```{r microbenchmark}
# Functions to test
f1 <- function(x) x*x
f2 <- function(x) x^2
f3 <- function(x) x^2.1
f4 <- function(x) x^3
# Initialization
X <- rnorm(10000)
# Test
library("microbenchmark")
(mb <- microbenchmark(f1(X), f2(X), f3(X), f4(X)))
```
The returned table contains the minimum, median, mean, max and first and third quartile times, as well as the number of repetitions.
The median value is the one to compare.
The number of repetitions is 100 by default; it can be adjusted with the `times` argument according to the complexity of the calculation.
The test result, a `microbenchmark` object, is a raw table of execution times.
The statistical analysis is done by the `print` and `summary` methods.
To choose the columns to display, use the following syntax:
```{r mb_summary}
summary(mb)[, c("expr", "median")]
```
To make calculations on these results, we must store them in a variable.
To prevent the results from being displayed in the console, the simplest solution is to use the `capture.output` function by assigning its result to a variable.
```{r capture.output}
dummy <- capture.output(mbs <- summary(mb))
```
The previous test is displayed again.
```{r mb_summary2}
summary(mb)[, c("expr", "median")]
```
The computation time is about the same between $x \times x$ and $x^2$.
The power calculation is much longer, especially if the power is not integer, because it requires a logarithm calculation.
The computation of the power 2 is therefore optimized by R to avoid the use of log.
Two graphical representations are available: the violins represent the probability density of the execution time; the boxplots are classical.
```{r boxplot, message=FALSE}
library("ggplot2")
autoplot(mb)
boxplot(mb)
```
### Profiling
**profvis** is RStudio's profiling tool.
It tracks the execution time of each line of code and the memory used.
The goal is to detect slow code portions that need to be improved.
```{r profvis}
library(profvis)
p <- profvis({
# Cosine calculations
cos(runif(10^7))
# 1/2 second pause
pause(1/2)
})
htmlwidgets::saveWidget(p, "docs/profile.html")
```
The result is an HTML file containing the profiling report[^201].
It can be observed that the time to draw the random numbers is similar to that of the cosine calculation.
[^201]: https://EricMarcon.github.io/WorkingWithR/profile.html
Read the complete documentation[^202] on the RStudio website.
[^202]: https://rstudio.github.io/profvis/
## Loops
The most frequent case of long code to execute is loops: the same code is repeated a large number of times.
### Vector functions
Most of R's functions are vector functions: loops are processed internally, extremely fast.
Therefore, you should think in terms of vectors rather than scalars.
```{r vector}
# Draw two vectors of three random numbers between 0 and 1
x1 <- runif(3)
x2 <- runif(3)
# Square root of the three numbers in x1
sqrt(x1)
# Respective sums of the three numbers of x1 and x2
x1+x2
```
Functions should likewise be written to be vectorized over their first argument.
The function `lnq` of the package **entropart** returns the deformed logarithm of order $q$ of a number $x$.
```{r lnq}
# Code of the function
entropart::lnq
```
For a function to be vectorized, each line of its code must allow the first argument to be treated as a vector.
Here, `log(x)` and `x^` are a vectorized function and operator, and the condition `[x < 0]` also returns a vector.
### lapply
Code that cannot be written as a vector function requires loops.
`lapply()` applies a function to each element of a list.
There are several versions of this function:
- `lapply()` returns a list (and saves the time of rearranging the results in an array).
- `sapply()` simplifies the result to a vector or matrix when possible (this is done by the `simplify2array()` function).
- `vapply()` is almost identical to `sapply()` but requires that the data type of the result be declared.
```{r l_s_v_apply}
# Draw 1000 values in a uniform distribution
x1 <- runif(1000)
# The square root can be calculated for the vector or each value
identical(sqrt(x1), sapply(x1, FUN=sqrt))
mb <- microbenchmark(sqrt(x1), lapply(x1, FUN=sqrt), sapply(x1, FUN=sqrt), vapply(x1, FUN=sqrt, FUN.VALUE = 0))
summary(mb)[, c("expr", "median")]
```
`lapply()` is much slower than a vector function.
`sapply()` requires more time for `simplify2array()`, which must detect how to gather the results.
Finally, `vapply()` saves the time of determining the data type of the result and allows for faster computation with little effort.
### For loops
Loops are handled by the `for` function.
They have the reputation of being slow in R because the code inside the loop must be interpreted at each execution.
This is no longer the case since version 3.5 of R: loops are compiled systematically before execution.
The behavior of the just-in-time compiler is defined by the `enableJIT` function.
The default level is 3: all functions are compiled, and loops in the code are compiled too.
To evaluate the performance gain, the following code removes all automatic compilation, and compares the same loop compiled or not.
```{r compiler}
library("compiler")
# No automatic compilation
enableJIT(level=0)
# Loop to calculate the square root of a vector
Loop <- function(x) {
# Initialization of the result vector, essential
Root <- vector("numeric", length=length(x))
# Loop
for(i in 1:length(x)) Root[i] <- sqrt(x[i])
return(Root)
}
# Compiled version
Loop2 <- cmpfun(Loop)
# Comparison
mb <- microbenchmark(Loop(x1), Loop2(x1))
(mbs <- summary(mb)[, c("expr", "median")])
# Automatic compilation by default since version 3.5
enableJIT(level=3)
```
The gain is considerable: from 1 to `r format(mbs[1,2]/mbs[2,2], digits=1)`.
For loops are now much faster than `vapply`.
```{r for_mb}
# Test
mb <- microbenchmark(vapply(x1, FUN=sqrt, 0), Loop(x1))
summary(mb)[, c("expr", "median")]
```
Be careful, the performance test can be misleading:
```{r deceptive, tidy=FALSE}
# Preparing the result vector
Root <- vector("numeric", length = length(x1))
# Test
mb <- microbenchmark(vapply(x1, FUN = sqrt, 0),
for(i in 1:length(x1))
Root[i] <- sqrt(x1[i]))
summary(mb)[, c("expr", "median")]
```
In this code, the for loop is not compiled so it is much slower than in its normal use (in a function or at the top level of the code).
Long loops can display a text progress bar to track their progress, which is another advantage.
The following function executes pauses of one tenth of a second for the time passed in parameter (in seconds).
```{r BoucleSuivie}
MonitoredLoop <- function(duration=1) {
# Progress bar
pgb <- txtProgressBar(min = 0, max = duration*10)
# Loop
for(i in 1:(duration*10)) {
# Pause for a tenth of a second
Sys.sleep(1/10)
# Track progress
setTxtProgressBar(pgb, i)
}
}
MonitoredLoop()
```
### replicate
`replicate()` repeats a statement.
```{r replicate}
replicate(3, runif(1))
```
This code is equivalent to `runif(3)`, with performance similar to `vapply`: 50 to 100 times slower than a vector function.
```{r replicate_mb}
mb <- microbenchmark(replicate(1E3, runif(1)), runif(1E3))
summary(mb)[, c("expr", "median")]
```
### Vectorize
`Vectorize()` builds a vectorized version of a function that is not, by wrapping it in loops (`mapply()`), with no performance gain.
Write the loops explicitly instead.
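To illustrate (a sketch with hypothetical function names): `Vectorize()` makes a scalar-only function accept vectors, but the looping is merely hidden, not eliminated:

```{r vectorize-demo}
# A function that only accepts scalar arguments
ScalarOnly <- function(x, y) if (x > y) x else y
# Vectorized version, built on mapply()
PairwiseMax <- Vectorize(ScalarOnly)
# Element-wise maximum of two vectors
PairwiseMax(1:3, c(2, 1, 4))
```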
### Marginal statistics
`apply` applies a function to the rows or columns of a two-dimensional object.
`colSums` and similar functions (`rowSums`, `colMeans`, `rowMeans`) are optimized.
```{r BoucleSomme, tidy=FALSE}
# Sum of the numeric columns of the diamonds dataset of ggplot2
# Loop identical to the action of apply(, 2, )
SumLoop <- function(Table) {
Sum <- vector("numeric", length = ncol(Table))
for (i in 1:ncol(Table)) Sum[i] <- sum(Table[, i])
return(Sum)
}
mb <- microbenchmark(SumLoop(diamonds[-(2:4)]),
apply(diamonds[-(2:4)], 2, sum),
colSums(diamonds[-(2:4)]))
summary(mb)[, c("expr", "median")]
```
`apply` clarifies the code but is slower than the loop, which is only slightly slower than `colSums`.
## C++ code {#sec:cpp}
Integrating C++ code into R is greatly simplified by the **Rcpp** package but is still difficult to debug and therefore should be reserved for very simple code (to avoid errors) repeated a large number of times (to be worth the effort).
The preparation and verification of the data must be done in R, as well as the processing and presentation of the results.
The common practice is to include C++ code in a package, but running it outside a package is possible:
- C++ code can be included in a C++ document (file with extension `.cpp`): it is compiled by the `sourceCpp()` command, which creates the R functions to call the C++ code.
- In an RMarkdown document, Rcpp code snippets can be created to insert the C++ code: they are compiled and interfaced to R at the time of knitting.
The following example shows how to create a C++ function to calculate the double of a numerical vector.
```{Rcpp}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector timesTwo(NumericVector x) {
return x * 2;
}
```
An R function with the same name as the C++ function is now available.
```{r timesTwo}
timesTwo(1:5)
```
The performance is two orders of magnitude faster than the R code (see the case study, section \@ref(sec:cas)).
## Parallelizing R
When long computations can be split into independent tasks, the simultaneous (*parallel*) execution of these tasks reduces the total computation time to that of the longest task, to which is added the cost of setting up the parallelization (creation of the tasks, recovery of the results...).
Read Josh Errickson's excellent introduction[^203] which details the issues and constraints of parallelization.
[^203]: <http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html>
Two mechanisms are available for parallel code execution:
- _fork_: the running process is duplicated on several cores of the computer's processor.
This is the simplest method, but it does not work on Windows (a limitation of the operating system).
- _Socket_: a cluster is constituted, either physically (a set of computers running R is necessary) or logically (an instance of R on each core of the computer used).
The members of the cluster communicate through the network (the internal network of the computer is used in a logical cluster).
Several R packages implement these mechanisms.
### mclapply (fork)
The `mclapply` function of the **parallel** package has the same syntax as `lapply` but parallelizes the execution of loops.
On Windows, it has no effect since the system does not allow _fork_: it simply calls `lapply`.
However, a workaround exists to emulate `mclapply` on Windows by calling `parLapply`, which uses a cluster.
```{r, tidy=FALSE}
##
## mclapply.hack.R
##
## Nathan VanHoudnos
## nathanvan AT northwestern FULL STOP edu
## July 14, 2014
##
## A script to implement a hackish version of
## parallel::mclapply() on Windows machines.
## On Linux or Mac, the script has no effect
## beyond loading the parallel library.
require(parallel)

## Define the hack
# mc.cores argument added: Eric Marcon
mclapply.hack <- function(..., mc.cores=detectCores()) {
  ## Create a cluster
  size.of.list <- length(list(...)[[1]])
  cl <- makeCluster(min(size.of.list, mc.cores))
  ## Find out the names of the loaded packages
  loaded.package.names <- c(
    ## Base packages
    sessionInfo()$basePkgs,
    ## Additional packages
    names(sessionInfo()$otherPkgs))
  tryCatch({
    ## Copy over all of the objects within scope to
    ## all clusters.
    this.env <- environment()
    while (!identical(this.env, globalenv())) {
      clusterExport(cl,
                    ls(all.names=TRUE, env=this.env),
                    envir=this.env)
      this.env <- parent.env(environment())
    }
    clusterExport(cl,
                  ls(all.names=TRUE, env=globalenv()),
                  envir=globalenv())
    ## Load the libraries on all the clusters
    ## N.B. length(cl) returns the number of clusters
    parLapply(cl, 1:length(cl), function(xx) {
      lapply(loaded.package.names, function(yy) {
        require(yy, character.only=TRUE)})
    })
    ## Run the lapply in parallel
    return(parLapply(cl, ...))
  }, finally = {
    ## Stop the cluster
    stopCluster(cl)
  })
}

## Warn the user if they are using Windows
if (Sys.info()[['sysname']] == 'Windows') {
  message(paste(
    "\n",
    "  *** Microsoft Windows detected ***\n",
    "\n",
    "  For technical reasons, the MS Windows version of mclapply()\n",
    "  is implemented as a serial function instead of a parallel\n",
    "  function.",
    "\n\n",
    "  As a quick hack, we replace this serial version of mclapply()\n",
    "  with a wrapper to parLapply() for this R session. Please see\n\n",
    "  http://www.stat.cmu.edu/~nmv/2014/07/14/implementing-mclapply-on-windows\n\n",
    "  for details.\n\n"))
}

## If the OS is Windows, set mclapply to the
## hackish version. Otherwise, leave the
## definition alone.
mclapply <- switch(Sys.info()[['sysname']],
                   Windows = {mclapply.hack},
                   Linux   = {mclapply},
                   Darwin  = {mclapply})
## end mclapply.hack.R
```
The following code tests the parallelization of a function that returns its argument unchanged after a quarter-second pause.
This document is knitted on a machine with `r parallel::detectCores()` cores; all but one are used so as not to saturate the system.
```{r detectCores}
f <- function(x, time=.25) {
Sys.sleep(time)
return(x)
}
# Leave one core out for the system
nbCores <- detectCores() - 1
# Serial: theoretical time = nbCores/4 seconds
(tserie <- system.time(lapply(1:nbCores, f)))
# Parallel: theoretical time = 1/4 second
(tparallele <- system.time(mclapply(1:nbCores, f, mc.cores=nbCores)))
```
Setting up the parallelization costs about `r format(tparallele[3]-1/4, digits=2)` seconds here.
On Windows, execution is much slower in parallel because setting up the cluster takes more time than the parallelization saves.
Parallelization becomes worthwhile for longer tasks, such as a one-second pause.
```{r serie_parallele}
# Serial
system.time(lapply(1:nbCores, f, time=1))
# Parallel
system.time(mclapply(1:nbCores, f, time=1, mc.cores=nbCores))
```
The overhead of parallel execution is now relatively smaller: the fixed cost falls below the savings as the duration of each task increases.
If the number of parallel tasks exceeds the number of cores used, performance drops sharply because the extra task can only run after the first batch has finished.
```{r nbCoeurs}
system.time(mclapply(1:nbCores, f, time=1, mc.cores=nbCores))
system.time(mclapply(1:(nbCores+1), f, time=1, mc.cores=nbCores))
```
The time then remains stable until the number of cores is doubled.
Figure \@ref(fig:r-parallele) shows the evolution of the computation time according to the number of tasks.
(ref:r-parallele) Parallel execution time of tasks requiring one second each (each task is a one-second pause). The number of tasks varies from 1 to twice the number of cores used (equal to `r nbCores`) plus one.
```{r r-parallele, fig.cap="(ref:r-parallele)", fig.scap="Execution time in parallel", tidy=FALSE}
Tasks <- 1:(2 * nbCores+1)
Time <- sapply(Tasks, function(nbTasks) {
system.time(mclapply(1:nbTasks, f, time=1, mc.cores=nbCores))
})
library("tidyverse")
tibble(Tasks, Time=Time["elapsed", ]) %>%
ggplot +
geom_line(aes(x = Tasks, y = Time)) +
geom_vline(xintercept = nbCores, col = "red", lty = 2) +
geom_vline(xintercept = 2 * nbCores, col = "red", lty = 2)
```
The theoretical shape of this curve is as follows:
- For a single task, the time equals one second plus the parallelization setup time.
- The time should remain stable up to the number of cores used.
- When all the cores are busy (red dashed lines), the time should increase by one second, then remain stable until the next limit.

In practice, the computation time depends on other factors that are difficult to predict.
The best practice is to match the number of tasks to the number of cores used, otherwise performance is lost.
### parLapply (socket)
`parLapply` requires creating a cluster, exporting the useful variables to each node, loading the necessary packages on each node, executing the code and finally stopping the cluster.
The code for each step can be found in the `mclapply.hack` function above.
For everyday use, `mclapply` is faster (except on Windows) and simpler (including on Windows, thanks to the workaround above).
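The steps above can be sketched as follows; the variable `x0` and the values used are purely illustrative:
```{r parLapply-sketch}
library("parallel")
# Create a (logical) cluster of 2 workers
cl <- makeCluster(2)
# Export a variable needed by the workers
x0 <- 10
clusterExport(cl, "x0")
# Load a package on each node
clusterEvalQ(cl, library("stats"))
# Execute the code on the cluster
res <- parLapply(cl, 1:4, function(i) i + x0)
# Stop the cluster
stopCluster(cl)
unlist(res)
```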
### foreach
#### How it works
The **foreach** package allows advanced use of parallelization.
Read its vignettes.
```{r foreach_v, eval=FALSE}
# Manual
vignette("foreach", "foreach")
# Nested loops
vignette("nested", "foreach")
```
Independently of any parallelization, **foreach** redefines *for* loops.
```{r foreach}
for (i in 1:3) {
f(i)
}
# becomes
library("foreach")
foreach (i=1:3) %do% {
f(i)
}
```
The `foreach` function returns a list containing the results of each loop.
The elements of the list can be combined by any function, such as `c`.
```{r foreach2}
foreach (i=1:3, .combine="c") %do% {
f(i)
}
```
The `foreach` function is capable of using iterators, that is, functions that pass to the loop only the data it needs without loading the rest into memory.
Here, the `icount` iterator passes the values 1, 2 and 3 individually, without loading the 1:3 vector into memory.
```{r iterators}
library("iterators")
foreach (i=icount(3), .combine="c") %do% {
f(i)
}
```
Iterators are therefore very useful when each element of the loop uses a large amount of memory.
#### Parallelization
Replacing the `%do%` operator with `%dopar%` parallelizes the loops, provided that an adapter is loaded, i.e. an intermediate package between **foreach** and a package implementing parallelization.
**doParallel** is an adapter for using the **parallel** package that comes with R.
```{r doParallel}
library(doParallel)
registerDoParallel(cores = nbCores)
# Serial
system.time(foreach (i=icount(nbCores), .combine="c") %do% {f(i)})
# Parallel
system.time(foreach (i=icount(nbCores), .combine="c") %dopar% {f(i)})
```
The fixed cost of parallelization is low.
### future
The **future** package abstracts the code from the parallelization implementation.
It is at the centre of an ecosystem of packages that facilitate its use[^250].
[^250]: <https://www.futureverse.org/>
The parallelization strategy used is declared by the `plan()` function.
The default strategy is `sequential`, i.e. single-task.
The `multicore` and `multisession` strategies are based respectively on the _fork_ and _socket_ techniques seen above.
Other strategies are available for using physical clusters (several computers prepared to run R together): the **future** documentation details how to do this.
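As a minimal sketch of the basic API (assuming the **future** package is installed; the objects `f1` and `v1` are illustrative), a plan is declared once and futures are then created and resolved identically whatever the strategy:
```{r future-sketch}
library("future")
# Declare the strategy: socket-based, with 2 background R sessions
plan(multisession, workers = 2)
# Create a future: its expression is evaluated in a worker
f1 <- future(Sys.getpid())
# value() blocks until the result is available
v1 <- value(f1)
# The worker's PID differs from that of the main session
v1 != Sys.getpid()
# Return to single-task execution
plan(sequential)
```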