---
title: Testing
layout: default
output: bookdown::html_chapter
---
```{r, echo = FALSE}
library(testthat)
```
# Testing {#tests}
Testing is a vital part of package development. It ensures that your code does what you want it to do. Testing, however, adds an additional step to your development workflow. The goal of this chapter is to show you how to make this task easier and more effective by doing formal automated testing using the testthat package.
Up until now, your workflow probably looks like this:
1. Write a function.
1. Load it with Ctrl/Cmd + Shift + L or `devtools::load_all()`.
1. Experiment with it in the console to see if it works.
1. Rinse and repeat.
While you _are_ testing your code in this workflow, you're only doing it informally. The problem with this approach is that when you come back to this code in 3 months' time to add a new feature, you've probably forgotten some of the informal tests you ran the first time around. This makes it very easy to break code that used to work.
I started using automated tests because I discovered I was spending too much time re-fixing bugs that I'd already fixed before. While writing code or fixing bugs, I'd perform interactive tests to make sure the code worked. But I never had a system which could store those tests so I could re-run them as needed. I think that this is a common practice among R programmers. It's not that you don't test your code, it's that you don't automate your tests.
In this chapter you'll learn how to graduate from using informal ad hoc testing, done at the command line, to formal automated testing (aka unit testing). While turning casual interactive tests into reproducible scripts requires a little more work up front, it pays off in four ways:
* Fewer bugs. Because you're explicit about how your code should behave
you will have fewer bugs. The reason why is a bit like the reason double
entry book-keeping works: because you describe the behaviour of your code in
two places, both in your code and in your tests, you are able to check one against
the other. By following this approach to testing, you can be sure that bugs
that you've fixed in the past will never come back to haunt you.
* Better code structure. Code that's easy to test is usually better designed.
This is because writing tests forces you to break up complicated parts of
your code into separate functions that can work in isolation. This reduces
duplication in your code. As a result, functions will be easier to test,
understand and work with (it'll be easier to combine them in new ways).
* Easier restarts. If you always finish a coding session by creating a failing
test (e.g. for the next feature you want to implement), testing makes it
easier for you to pick up where you left off: your tests will let you know
what to do next.
* Robust code. If you know that all the major functionality of your package has
an associated test, you can confidently make big changes without worrying
about accidentally breaking something. For me, this is particularly useful
when I think I have a simpler way to accomplish a task (usually the reason my
solution is simpler is that I've forgotten an important use case!).
If you're familiar with unit testing in other languages, you should note that there are some fundamental differences with testthat. This is because R is, at heart, more a functional programming language than an object oriented programming language. For instance, because R's main OO systems (S3 and S4) are based on generic functions (i.e., methods belong to functions not classes), testing approaches built around objects and methods don't make much sense.
## Test workflow {#test-workflow}
To set up your package to use testthat, run:
```{r, eval = FALSE}
devtools::use_testthat()
```
This will:
1. Creates a `tests/testthat` directory.
1. Adds testthat to the `Suggests` field in the `DESCRIPTION`.
1. Creates a file `tests/testthat.R` that runs all your tests when
`R CMD check` runs. (You'll learn more about that in
[automated checking](#check).) A typical version of this file is shown just below.
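For reference, the generated `tests/testthat.R` typically looks something like the following sketch (here assuming your package is called `yourpackage`):

```{r, eval = FALSE}
# Contents of tests/testthat.R for a hypothetical package "yourpackage"
library(testthat)
library(yourpackage)

test_check("yourpackage")
```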
Once you're set up, the workflow is simple:
1. Modify your code or tests.
2. Test your package with Ctrl/Cmd + Shift + T or `devtools::test()`.
3. Repeat until all tests pass.
The testing output looks like this:

    Expectation : ...........
    rv : ...
    Variance : ....123.45.

Each line represents a test file. Each `.` represents a passed test. Each number represents a failed test. The numbers index into a list of failures that provides more details:

    1. Failure(@test-variance.R#22): Variance correct for discrete uniform rvs -----
       VAR(dunif(0, 10)) not equal to var_dunif(0, 10)
       Mean relative difference: 3
    2. Failure(@test-variance.R#23): Variance correct for discrete uniform rvs -----
       VAR(dunif(0, 100)) not equal to var_dunif(0, 100)
       Mean relative difference: 3.882353

Each failure gives a description of the test (e.g., "Variance correct for discrete uniform rvs"), its location (e.g., "\@test-variance.R#22"), and the reason for the failure (e.g., "VAR(dunif(0, 10)) not equal to var_dunif(0, 10)"). The goal is to pass all the tests.
## Test structure {#test-structure}
A test file lives in `tests/testthat/`. Its name must start with `test`. Here's an example of a test file from the stringr package:
```{r}
library(stringr)
context("String length")
test_that("str_length is number of characters", {
expect_equal(str_length("a"), 1)
expect_equal(str_length("ab"), 2)
expect_equal(str_length("abc"), 3)
})
test_that("str_length of factor is length of level", {
expect_equal(str_length(factor("a")), 1)
expect_equal(str_length(factor("ab")), 2)
expect_equal(str_length(factor("abc")), 3)
})
test_that("str_length of missing is missing", {
expect_equal(str_length(NA), NA_integer_)
expect_equal(str_length(c(NA, 1)), c(NA, 1))
expect_equal(str_length("NA"), 2)
})
```
Tests are organised hierarchically: __expectations__ are grouped into __tests__ which are organised in __files__:
* An __expectation__ is the atom of testing. It describes the expected result
of a computation: Does it have the right value and right class? Does it
produce error messages when it should? An expectation automates visual
checking of results in the console. Expectations are functions that start
with `expect_`.
* A __test__ groups together multiple expectations to test the output
from a simple function, a range of possibilities for a single parameter
from a more complicated function, or tightly related functionality from
across multiple functions. This is why they are sometimes called __unit tests__,
as they test one unit of functionality. A test is created with `test_that()`.
* A __file__ groups together multiple related tests. Files are given a human
readable name with `context()`.
These are described in detail below.
### Expectations
An expectation is the finest level of testing. It makes a binary assertion about whether or not a function call does what you expect. All expectations have a similar structure:
* They start with `expect_`.
* They have two arguments: the first is the actual result, the second is what
you expect.
* If the actual and expected results don't agree, testthat throws an error.
While you'll normally put expectations inside tests inside files, you can also run them directly. This makes it easy to explore expectations interactively. There are almost 20 expectations in the testthat package. The most important are discussed below.
* There are two basic ways to test for equality: `expect_equal()`
and `expect_identical()`. `expect_equal()` is the most commonly used: it
uses `all.equal()` to check for equality within a numerical tolerance:
```{r, error = TRUE}
expect_equal(10, 10)
expect_equal(10, 10 + 1e-7)
expect_equal(10, 11)
```
If you want to test for exact equivalence, or need to compare a more
exotic object like an environment, use `expect_identical()`. It's built
on top of `identical()`:
```{r, error = TRUE}
expect_equal(10, 10 + 1e-7)
expect_identical(10, 10 + 1e-7)
```
* `expect_match()` matches a character vector against a regular expression.
The optional `all` argument controls whether all elements or just one
element needs to match. This is powered by `grepl()` (additional arguments
like `ignore.case = FALSE` or `fixed = TRUE` are passed on down).
```{r, error = TRUE}
string <- "Testing is fun!"
expect_match(string, "Testing")
# Fails, match is case-sensitive
expect_match(string, "testing")
# Additional arguments are passed to grepl:
expect_match(string, "testing", ignore.case = TRUE)
```
* Four variations of `expect_match()` let you check for other types of
result: `expect_output()` inspects printed output; `expect_message()`,
messages; `expect_warning()`, warnings; and `expect_error()`, errors.
```{r, error = TRUE}
a <- list(1:10, letters)
expect_output(str(a), "List of 2")
expect_output(str(a), "int [1:10]", fixed = TRUE)
expect_message(library(mgcv), "This is mgcv")
```
With `expect_message()`, `expect_warning()`, `expect_error()` you can
leave the second argument blank if you just want to see if a message,
warning or error is created. However, it's normally better to be explicit,
and provide some text from the message.
```{r, error = TRUE}
expect_warning(log(-1))
expect_error(1 / "a")
# But always better to be explicit
expect_warning(log(-1), "NaNs produced")
expect_error(1 / "a", "non-numeric argument")
# Failure to produce a warning or error when expected is an error
expect_warning(log(0))
expect_error(1 / 2)
```
* `expect_is()` checks that an object `inherits()` from a specified class.
```{r, error = TRUE}
model <- lm(mpg ~ wt, data = mtcars)
expect_is(model, "lm")
expect_is(model, "glm")
```
* `expect_true()` and `expect_false()` are useful catch-alls if none of the
other expectations do what you need (see the sketch just after this list).
* Sometimes you don't know exactly what the result should be, or it's too
complicated to easily recreate in code. In that case the best you can do is
check that the result is the same as last time. `expect_equal_to_reference()`
caches the result the first time it's run, and then compares it to subsequent
runs. If for some reason the result does change, just delete the cache
file and re-test.
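To illustrate the catch-all expectations, here's a minimal sketch using made-up data (the values and conditions below are purely illustrative):

```{r}
x <- c(1, 5, 9)

# All values are positive, and none are missing
expect_true(all(x > 0))
expect_false(any(is.na(x)))
```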
Running a sequence of expectations is useful because it ensures that your code behaves as expected. You could even use an expectation within a function to check that the inputs are what you expect. However, they're not so useful when something goes wrong. All you know is that something is not as expected. You don't know the goal of the expectation. Tests, described next, organise expectations into coherent blocks that describe the overall goal of a set of expectations.
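As a sketch of that idea, a function can use expectations to validate its inputs. The function below is hypothetical, purely for illustration:

```{r, error = TRUE}
# Hypothetical function that uses expectations as input checks
mean_age <- function(ages) {
  expect_is(ages, "numeric")
  expect_false(any(is.na(ages)))
  mean(ages)
}

mean_age(c(32, 41, 27))
mean_age(c(32, NA))  # fails the second expectation and stops with an error
```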
## Writing tests {#test-tests}
Each test should have an informative name and cover a single unit of functionality. The idea is that when a test fails, you'll know what's wrong and where in your code to look for the problem. You create a new test using `test_that()`, with test name and code block as arguments. The test name should complete the sentence "Test that ...". The code block should be a collection of expectations.
It's up to you how to organise your expectations into tests. The main thing is that the message associated with the test should be informative so that you can quickly narrow down the source of the problem. Try to avoid putting too many expectations in one test - it's better to have more smaller tests than fewer larger tests.
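For example, a small, focused test whose name completes the sentence "Test that ..." might look like this hypothetical test of stringr's `str_trim()`:

```{r}
library(stringr)

test_that("str_trim removes leading and trailing whitespace", {
  expect_equal(str_trim("  abc  "), "abc")
  expect_equal(str_trim("\tabc\n"), "abc")
})
```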
Each test is run in its own environment and is self-contained. However, testthat doesn't know how to clean up after actions that affect the R landscape:
* The filesystem: creating and deleting files, changing the working directory,
etc.
* The search path: `library()`, `attach()`.
* Global options, like `options()` and `par()`.
When you use these actions in tests, you'll need to clean up after yourself. While many other testing packages have set-up and teardown methods that are run automatically before and after each test, these are not so important with testthat because you can create objects outside of the tests and you can rely on R's copy-on-modify semantics to keep them unchanged between test runs. To clean up other actions you can use regular R functions.
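For example, here's a minimal sketch of one way to restore global state by hand, using `on.exit()` so the clean-up happens even if an expectation fails:

```{r}
test_that("printing respects the digits option", {
  old <- options(digits = 3)          # modify a global option for this test only
  on.exit(options(old), add = TRUE)   # restore the old value when the test finishes
  expect_equal(format(pi), "3.14")
})
```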
### What to test
> Whenever you are tempted to type something into a print statement or a
> debugger expression, write it as a test instead.
> --- Martin Fowler
There is a fine balance to writing tests. Each test that you write makes your code less likely to change inadvertently; but it also can make it harder to change your code on purpose. It's hard to give good general advice about writing tests, but you might find these points helpful:
* Focus on testing the external interface to your functions - if you test the
internal interface, then it's harder to change the implementation in the
future because as well as modifying the code, you'll also need to update all
the tests.
* Strive to test each behaviour in one and only one test. Then if that
behaviour later changes you only need to update a single test.
* Avoid testing simple code that you're confident will work. Instead focus your
time on code that you're not sure about, is fragile, or has complicated
interdependencies. That said, I often find I make the most mistakes when I
falsely assume that the problem is simple and doesn't need any tests.
* Always write a test when you discover a bug. You may find it helpful to adopt
the [test-first philosophy][tdd]. There you always start by writing the tests, and
then write the code that makes them pass. This reflects an important problem-solving
strategy: start by establishing your success criteria - how you
know if you've solved the problem.
### Skipping a test
Sometimes it's impossible to perform a test - you may not have an internet connection or you may be missing an important file. Unfortunately, another likely reason follows from this simple rule: the more machines you use to write your code, the more likely it is that you won't be able to run all of your tests. In short, there are times when, instead of getting a failure, you just want to skip a test. To do that, you can use the `skip()` function - rather than throwing an error it simply prints an `S` in the output.
```{r, eval = FALSE}
check_api <- function() {
  if (not_working()) {
    skip("API not available")
  }
}

test_that("foo api returns bar when given baz", {
  check_api()
  ...
})
```
### Building your own testing tools
As you start to write more tests, you might notice duplication in your code. For example, the following code shows one test of the `floor_date()` function from `library(lubridate)`. There are seven expectations that check the results of rounding a date down to the nearest second, minute, hour, etc. There's a lot of duplication (which increases the chance of bugs), so we might want to extract common behaviour into a new function.
```{r}
library(lubridate)
test_that("floor_date works for different units", {
base <- as.POSIXct("2009-08-03 12:01:59.23", tz = "UTC")
expect_equal(floor_date(base, "second"),
as.POSIXct("2009-08-03 12:01:59", tz = "UTC"))
expect_equal(floor_date(base, "minute"),
as.POSIXct("2009-08-03 12:01:00", tz = "UTC"))
expect_equal(floor_date(base, "hour"),
as.POSIXct("2009-08-03 12:00:00", tz = "UTC"))
expect_equal(floor_date(base, "day"),
as.POSIXct("2009-08-03 00:00:00", tz = "UTC"))
expect_equal(floor_date(base, "week"),
as.POSIXct("2009-08-02 00:00:00", tz = "UTC"))
expect_equal(floor_date(base, "month"),
as.POSIXct("2009-08-01 00:00:00", tz = "UTC"))
expect_equal(floor_date(base, "year"),
as.POSIXct("2009-01-01 00:00:00", tz = "UTC"))
})
```
I'd start by defining a couple of helper functions to make each expectation more concise. That allows each test to fit on one line, so you can line up actual and expected values to make it easier to see the differences:
```{r}
test_that("floor_date works for different units", {
  base <- as.POSIXct("2009-08-03 12:01:59.23", tz = "UTC")
  floor_base <- function(unit) floor_date(base, unit)
  as_time <- function(x) as.POSIXct(x, tz = "UTC")

  expect_equal(floor_base("second"), as_time("2009-08-03 12:01:59"))
  expect_equal(floor_base("minute"), as_time("2009-08-03 12:01:00"))
  expect_equal(floor_base("hour"),   as_time("2009-08-03 12:00:00"))
  expect_equal(floor_base("day"),    as_time("2009-08-03 00:00:00"))
  expect_equal(floor_base("week"),   as_time("2009-08-02 00:00:00"))
  expect_equal(floor_base("month"),  as_time("2009-08-01 00:00:00"))
  expect_equal(floor_base("year"),   as_time("2009-01-01 00:00:00"))
})
```
We could go a step further and create a custom expectation function:
```{r}
base <- as.POSIXct("2009-08-03 12:01:59.23", tz = "UTC")
expect_floor_equal <- function(unit, time) {
  expect_equal(floor_date(base, unit), as.POSIXct(time, tz = "UTC"))
}
expect_floor_equal("year", "2009-01-01 00:00:00")
```
However, if the expectation fails this doesn't give very informative output:
```{r, error = TRUE}
expect_floor_equal("year", "2008-01-01 00:00:00")
```
Instead you can use a little [non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html) to produce something more informative. The key is to use `bquote()` and `eval()`. In the `bquote()` call below, note the use of `.(x)` - the contents of `()` will be inserted into the call.
```{r, error = TRUE}
expect_floor_equal <- function(unit, time) {
  as_time <- function(x) as.POSIXct(x, tz = "UTC")
  eval(bquote(expect_equal(floor_date(base, .(unit)), as_time(.(time)))))
}
expect_floor_equal("year", "2008-01-01 00:00:00")
```
This sort of refactoring is often worthwhile because removing redundant code makes it easier to see what's changing. Readable tests give you more confidence that they're correct.
```{r}
test_that("floor_date works for different units", {
as_time <- function(x) as.POSIXct(x, tz = "UTC")
expect_floor_equal <- function(unit, time) {
eval(bquote(expect_equal(floor_date(base, .(unit)), as_time(.(time)))))
}
base <- as_time("2009-08-03 12:01:59.23")
expect_floor_equal("second", "2009-08-03 12:01:59")
expect_floor_equal("minute", "2009-08-03 12:01:00")
expect_floor_equal("hour", "2009-08-03 12:00:00")
expect_floor_equal("day", "2009-08-03 00:00:00")
expect_floor_equal("week", "2009-08-02 00:00:00")
expect_floor_equal("month", "2009-08-01 00:00:00")
expect_floor_equal("year", "2009-01-01 00:00:00")
})
```
## Test files {#test-files}
The highest-level structure of tests is the file. Each file should contain a single `context()` call that provides a brief description of its contents. Just like the files in the `R/` directory, you are free to organise your tests any way that you like. But again, the two extremes are clearly bad (all tests in one file, one file per test). You need to find a happy medium that works for you. A good starting place is to have one file of tests for each complicated function.
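To make that concrete, the test files for a small package might be laid out something like this (the file names here are made up for illustration):

    tests/
      testthat.R
      testthat/
        test-str_length.R
        test-str_trim.R
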
## CRAN notes {#test-cran}
CRAN will run your tests on all CRAN platforms: Windows, Mac, Linux and Solaris. There are a few things to bear in mind:
* Tests need to run relatively quickly - aim for under a minute. Place
`skip_on_cran()` at the beginning of long-running tests that shouldn't be run
on CRAN - they'll still be run locally, but not on CRAN (see the sketch after
this list).
* Note that tests are always run in the English language (`LANGUAGE=EN`) and
with C sort order (`LC_COLLATE=C`). This minimises spurious differences
between platforms.
* Be careful about testing things that are likely to be variable on CRAN
machines. It's risky to test how long something takes (because CRAN machines
are often heavily loaded) or to test parallel code (because CRAN runs multiple
package tests in parallel, multiple cores will not always be available).
Numerical precision can also vary across platforms (it's often less precise on
32-bit versions of R) so use `expect_equal()` rather than `expect_identical()`.
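For long-running tests, the `skip_on_cran()` pattern mentioned above might look like this sketch (the test content is hypothetical):

```{r, eval = FALSE}
test_that("a long-running simulation behaves as documented", {
  skip_on_cran()  # too slow for CRAN; still runs locally
  x <- runif(1e6)
  expect_equal(length(x), 1e6)
  expect_true(all(x >= 0 & x <= 1))
})
```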
[tdd]:http://en.wikipedia.org/wiki/Test-driven_development
[extreme-programming]:http://en.wikipedia.org/wiki/Extreme_programming