forked from Farewe/invacost
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathReadme.Rmd
1509 lines (1166 loc) · 65.3 KB
/
Readme.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "The invacost R Package: Global Costs of Biological Invasions"
author: "Leroy B, Kramer AM, Vaissière AC, Kourantidou M, Courchamp F & Diagne C"
date: "`r Sys.setlocale('LC_TIME', 'C');format(Sys.time(), '%d %B, %Y')`"
output:
github_document:
toc: true
toc_depth: 2
html_document:
df_print: kable
toc: yes
toc_float: true
---
```{r hiddenload, echo=FALSE, results="hide", message=FALSE, warning=FALSE}
# # knitr::opts_chunk$set(warning=FALSE,
# message=FALSE)
library(invacost)
data(invacost)
```
# Introduction
The invacost R package provides the InvaCost database in R with several
functions to analyse monetary costs of invasive species.
There are two main methods developed in this package to estimate the costs
associated
to biological invasions.
- *Approach based on available estimates*: the first approach consists in using
cost estimates, as they appear in the collected materials, that occurred over
specific intervals of time. From this approach you can obtain the
cumulative and
average annual costs of biological invasions for different time intervals.
This approach is generalisable to any subset of the InvaCost database,
regardless of data quantity. However, it does not account for the temporal dynamics
of costs, and may thus underestimate the costs of invasions, especially for
recent years for which we likely have incomplete data.
- *Modelling approach*: the second approach consists in **estimating** the
long-term trend in annual cost
values with a statistical modelling approach. Because it includes the dynamic
nature of cost values, it is probably the most robust approach to estimate
average annual costs. However, it requires more data and for data-poor cases it
will not work or provide inadequate results. We fit several models because the
trend of costs over time is not necessarily linear.
This approach requires decisions about models to keep or remove, on the basis
of the quality of model fit, and/or of our *a priori* assumptions. Note that this
modelling approach is not designed for extrapolations because there is no
certainty that the underlying factors of costs will have similar trends in the
future.
# Acknowledgements
Thanks a lot to Anna Turbelin for her help in correcting mistakes in this
tutorial, and to Phillip Haubrock for agreeing to be the "Guinea pig" of
various test versions of the invacost R package!
# Installation
**The package requires at least R 4.0.0.**
The package can now be officially installed from CRAN.
``` {r install, eval = FALSE}
# install.packages("devtools")
# devtools::install_github("Farewe/invacost")
install.packages("invacost")
```
# Recent changes
<details>
<summary><b>Click here</b> to see a summary of recent changes</summary>
*February 2022*
- Updated to invacost *version 4.1*
- Added new data integrity tests / checks in the package. These checks
were used to build invacost version 4.1.
*June 2021*
- Updated the database version in the package to *version 4.0*.
*December 2020*
- *The main functions of the package have changed names.* `calculateRawAvgCosts`
is now `summarizeCosts`, and `costTrendOverTime` is now `modelCosts`. Older
function names are still functional but will throw a warning, and will no longer
be updated. Please update your scripts accordingly.
- *Default values in `summarizeCosts` (formerly `calculateRawAvgCosts`) and
`modelCosts` (formerly `costTrendOverTime`) have changed.* First, the last
year of analysis will, by default, be the last year of your input data. Naturally,
you can still define it manually. Second, by default, `modelCosts` will no
longer exclude data from calibration (formerly, `costTrendOverTime` excluded
by default data from 2015 and above). If you wish to keep this threshold as
in former versions of the packages, you can still alter it by manually setting
`incomplete.year.threshold = 2015`.
- A new function is available (`getInvaCostVersion`) to rollback the database to
former major releases. Current options are: `1.0`, `2.0`, `2.1`, `3.0`, `4.0` &
`4.1`. This is
to ensure reproducibility of analyses performed on former versions of the
database. Naturally, any new analysis should be done on the the latest version
shipped with the package (`data(invacost)`).
</details>
For more detailed information about version of the invacost R package,
see the file "NEWS" in the package files.
# Basic steps to get started
## How to load the data in R?
```{r i2, cache = TRUE}
library(invacost)
data(invacost)
```
The database is now loaded in R memory under a very original name: `invacost`.
```{r i3, cache = TRUE}
# Number of rows (cost estimates)
nrow(invacost)
# Number of columns (database descriptors)
ncol(invacost)
```
There are `r ncol(invacost)` columns in the INVACOST database, and I strongly
recommend you get familiarised with them by inspecting them individually and
reading the [paper describing the database (and its fields) available here]
(https://doi.org/10.1038/s41597-020-00586-z). The updated versions of the
database (and its fields) are stored in a [figShare repository accessible here](https://doi.org/10.6084/m9.figshare.12668570).
## How to distinguish between invacost v1 and subsequent versions?
Since the first official release of InvaCost in 2019, there have been several
major releases, some of which were internal, some of which were public. To
ensure reproducibility of analyses performed on older releases, we provide a
function in the database to fetch older releases. we expect it will also help
to understand how and why results may change with new additions and corrections
to this living database.
As of now, we offer the possibility to roll back to version 1.0, 2.0 and 2.1.
<details>
<summary> <b>Click here</b> for an example on how to roll back to V1 or V2.1</summary>
```{r rollback, cache = TRUE}
# # Version 1.0
# invacost_1.0 <- getInvaCostVersion("1.0")
# dim(invacost_1.0)
#
# # Version 2.1
# invacost_2.1 <- getInvaCostVersion("2.1")
# dim(invacost_2.1)
```
</details>
If you want, the database can be automatically saved on your hard drive in csv
format, by specifying a file name with argument `destination_file`.
## Where are the monetary values I should use?
There are several columns that contain monetary values (they generally contain
`cost` in their names). We stored raw information from the papers that is often
difficult to compare because of
different currencies, different years and different periods of time over which
costs or expenditure occurred
(e.g. 1 year vs 10 years). Therefore, **to be able to choose which monetary values
to use, you need to understand some of the basic concepts about the conversion
of monetary values** (raw costs or costs per year) into a standard currency (i.e.
US$ for the year 2017). **Hereafter, we explain the process and provide
recommendations on which monetary values to use.**
The conversion is based on two main steps. The first includes conversion of the
cost estimates from local currencies to US$. The second includes conversion the
cost estimate (previously converted to US\$) from the year it was estimated to
2017 US\$.
### Step 1: Currency conversion
For the currency conversion step (from local currency to US\$), it is possible
to use two conversion factors:
- [The official market exchange rate](https://data.worldbank.org/indicator/PA.NUS.FCRF?end=2017&start=1960).
- The Purchasing Power Parity (PPP, local currency unit per US\$, calculated as
an annual average), that allows to take into account the differences in price
levels between countries. We used data provided [by the World Bank](https://data.worldbank.org/indicator/PA.NUS.PPP?end=2017&start=1990), or
[by the OECD](https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm)
when information was not retrievable through the World Bank database.
Occasionally, we had to deal with published costs that were expressed in currency
that was different from the country where the costs were estimated (e.g.
published cost in African countries expressed in US or Canadian \$). Thus, prior
to using PPP as a conversion index, we had to perform a retro-conversion by
multiplying the original cost estimate by the official market exchange rate
(local currency unit per currency unit used). Note that, it was not possible to
perform the PPP-based standardisation for all cost estimates as PPP data
do not exist for all countries and/or specific periods (we mentioned `NA` in the
database when such information was missing).
### Step 2: Conversion of the year of currency
For the **year of currency conversion step (from US\$ from a given year to 2017 US\$)**, we use an inflation factor that reflects the evolution of the value of
the US\$ since the year of cost estimation. The inflation factor is computed by
dividing the [Consumer Price Index](https://data.worldbank.org/indicator/FP.CPI.TOTL?end=2017&start=1960)
(CPI, which is a measure of the average change
over time in the prices paid by consumers for a market basket of consumer goods
and services) of the USA for 2017 by the CPI of the USA for the year of the cost
estimation.
The standardization of the cost estimates into 2017 \$US is thus a multi-step
procedure for which the following formula is used:
**C~e~ = (M~V~ /C~F~) x I~F~**
with **C~e~** = Converted cost estimate (to 2017 US dollars based on exchange rate or
Purchase Power Parity), **M~V~** = Cost estimate (either the
`Raw_cost_estimate_original_currency` extracted from analysed paper or the
`Cost_estimate_per_year_original_currency`
transformed by us), **C~F~** = Conversion factor (either the official market exchange
rate or the purchasing power parity, in US dollars), **I~F~** = Inflation factor since
the year of cost estimation, calculated as CPI~2017~ / CPI~y~ with CPI corresponding
to the Consumer Price Index and **y** corresponding to the year of the cost
estimation (`Applicable_year`).
### Choosing monetary values in InvaCost
In InvaCost, we provide several columns with monetary values expressed in
2017 USD based on the exchange rate or PPP - they contain either `exchange_rate`
or `PPP` in their names.
<table class="tg">
<thead>
<tr>
<th class="tg-0pky"></th>
<th class="tg-fymr">Purchase Power Parity (PPP)</th>
<th class="tg-fymr">Exchange Rate (ER)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-5pln">Definitions</td>
<td class="tg-y698"><a href="https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm" target="_blank" rel="noopener noreferrer">OECD definition of PPP</a></td>
<td class="tg-y698"><a href="https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm" target="_blank" rel="noopener noreferrer">OECD definition of ER</a></td>
</tr>
<tr>
<td class="tg-jq0u">Useful resources to choose between PPP and ER</td>
<td class="tg-c6of" colspan="2"><a href="https://doi.org/10.1017/CBO9780511754159.005" target="_blank" rel="noopener noreferrer">- Sarno, L., Taylor, M., & Frankel, J. (2003). Purchasing power parity and the real exchange rate</a>. In The Economics of Exchange Rates (pp. 51-96). Cambridge: Cambridge University Press<br><a href="https://doi.org/10.1257/0895330042632744" target="_blank" rel="noopener noreferrer">- Taylor, A. M., & Taylor, M. P. (2004). The purchasing power parity debate.</a> Journal of economic perspectives, 18(4), 135-158. <br>- <a href="https://doi.org/10.1787/oecd_papers-v6-art9-en" target="_blank" rel="noopener noreferrer">OECD. (2006). A Contribution to the "Purchase Power Parity vs. Market Exchange Rates for Use in Long-Term Scenarios" Discussion.</a> OECD papers, 19pp.<br><a href="https://www.imf.org/external/pubs/ft/fandd/2007/03/basics.htm" target="_blank" rel="noopener noreferrer">- Callen T. (2007) PPP Versus the Market: Which Weight Matters? Finance and Development 44:1 </a><br><a href="https://www.worldbank.org/en/programs/icp/brief/VC_Uses" target="_blank" rel="noopener noreferrer">- PPPs for policy making: a visual guide to using data from the ICP - Uses and limitations of PPPs</a><br><a href="https://doi.org/10.1596/978-1-4648-1530-0" target="_blank" rel="noopener noreferrer">- World Bank. (2020) Purchasing Power Parities and the Size of World Economies: Results from the 2017 International Comparison Program.</a>Washington, DC: World Bank.<br><a href="https://thedocs.worldbank.org/en/doc/872661487091786258-0050022017/original/19Chapter18.pdf" target="_blank" rel="noopener noreferrer">- McCarthy, P. (2013) Extrapolating PPPs and comparing ICP benchmark results</a>. Measuring the Real Size of the World Economy, Chapter 18, 473-506<br></td>
</tr>
<tr>
<td class="tg-5pln">Column to use in invacost</td>
<td class="tg-qoao"><span style="color:#905">`Cost_estimate_per_year_2017_USD_PPP`</span></td>
<td class="tg-qoao"><span style="color:#905">`Cost_estimate_per_year_2017_USD_exchange_rate`</span></td>
</tr>
<tr>
<td class="tg-jq0u">Examples</td>
<td class="tg-c6of">PPP are more robust when performing multiple-country comparisons, as long as all the investigated countries have available PPP data (e.g., <a href="https://doi.org/10.1038/s41467-020-15586-1" target="_blank" rel="noopener noreferrer">Schaffner et al. 2020</a>)</td>
<td class="tg-c6of">To analyse the economic impacts of invasive alien species at the global scale, the absence of PPP information for many countries and time periods requires to rely on exchange rates (e.g., <a href="https://doi.org/10.1038/s41586-021-03405-6" target="_blank" rel="noopener noreferrer">Diagne et al. 2020</a>)</td>
</tr>
</tbody>
</table>
Therefore, because our objective here is to illustrate the costs of biological
invasions at the global scale, we will, onwards, in the sections that
follow, use exclusively the column: `Cost_estimate_per_year_2017_USD_exchange_rate`.
Replace the code with `Cost_estimate_per_year_2017_USD_PPP` if your study
requires using Purchase Power Parity instead of Exchange Rates.
### Filtering out estimates with missing information
There is a small number of costs for which we could not derive these estimates
(e.g. studies that lacked useful information of the periods over which
costs occurred), so we have to exclude them of our analyses.
```{r z4, cache = TRUE}
if(any(is.na(invacost$Cost_estimate_per_year_2017_USD_exchange_rate)))
{
invacost <- invacost[-which(is.na(invacost$Cost_estimate_per_year_2017_USD_exchange_rate)), ]
}
# Number of rows (cost estimates)
nrow(invacost)
```
In addition, there can be costs for which we have **inadequate time
period information**: some studies omitted to provide time
periods, which can cause tremendous biases when analysing the temporal trends of
costs. Sometimes, this is not problematic as we can safely guess when costs
occurred (see section [How do I know when costs occurred](#costtime)).
However, other times, costs spread over 10+ and
sometimes 100+ years can be presented
as occurring over a single year or require highly speculative guesses about when
they occurred. As a general rule, it is much easier and safer to estimate the
ending year than the starting year.
Therefore, it is useful to verify if there are estimates that are known to occur
over more than 1 year with no information on
the starting date whatsoever in your subset of
InvaCost. **Note: we have progressively corrected the database such such entries
should no longer occur in InvaCost version 4.1 and later.**. If you find such
estimates, we recommend to remove them, as it is not possible to safely estimate
if they occurred over a single or multiple
years.
The following block of code does all of that automatically, but will do nothing
if there are no such cases ;)
```{r filter2}
# Uncertain starting periods
uncertain.starts <- invacost[which(invacost$Time_range == "Period" &
is.na(invacost$Probable_starting_year)), ]
# Number of estimates without adequate information about starting year
nrow(uncertain.starts)
# No info about whether cost was annual or over a period
unknown.periods <- invacost[which(is.na(invacost$Time_range)), ]
# Number of estimates without adequate information about period
nrow(unknown.periods)
# Applying the filter
if(nrow(uncertain.starts) + nrow(unknown.periods) > 0)
{
invacost <- invacost[-which(invacost$Cost_ID %in% c(uncertain.starts$Cost_ID,
unknown.periods$Cost_ID)), ]
}
# Number of rows after filtering
nrow(invacost)
```
## How do we filter out unreliable costs?
You have to be aware of the limitations of such a database. This database is
the most up-to-date and comprehensive compilation of both the published
and grey literature on the
monetary costs of biological invasions. Therefore, it includes sound estimations
of monetary costs of invasive species, with detailed sources and reproducible
methodology; but it also includes unsourced and irreproducible "*guestimates*",
as those are reported in the literature.
Therefore, we want to apply filters on the database to avoid as much as possible
any such unreliable cost estimates that can not easily be reproduced or
originate from non-peer reviewed sources. Furthermore, the database also includes
observed and potential costs, so depending on our objectives we may or may not
want to filter out potential costs.
- **Reliability**: There is a standard field in the database called `Method_reliability`,
which provides a simple yet objective evaluation of the reliability of cost
estimates. It uses the following decision tree:
`r knitr::include_graphics("./Readme_files/figure-html/reliability.png")`
Red means categorised as unreliable source, green means categorised as reliable
source. This
`Method_reliability` descriptor has some limitations. The most important one
is that we decided to not evaluate the methodology for peer-reviewed articles
and official reports, assuming that this was done at an earlier stage (i.e.
peer-review, prior to publication). We chose such an objective criterion, rather
than assessing the reliability of methods during the process of filling the
database, because we observed divergence in reliability decisions during
experiments among members of the
INVACOST team. We also identified that depending on the study objective,
different decisions about reliability could be made. Therefore, this
`Method_reliability` descriptor should be considered as a first approximation of
cost reliability, and you should decide whether or not you want to eliminate
papers on the basis of the lack of reproducibility of their methodologies.
To do that, take time to investigate the `Details` field (especially for cost
values that you deem suspiciously high) and potentially go back to the source to
make your decision. For an example on how to do that, take a look at the
"Determining cost estimate reproducibility" section in
[Bradshaw et al. 2016](https://www.nature.com/articles/ncomms12986#Sec8).
**Important update as of InvaCost version 3.0:** The database now also includes a
`Method_reliability_refined` descriptor (not exhaustive as of now) which
includes an expert-based assessment of reliability. It can be used as an
alternative or as a complementary approach to the standard `Method_reliability`
descriptor.
```{r filt1}
unique(invacost$Method_reliability)
```
- **Observed vs. Potential costs**: The `Implementation` field in the database
documents whether the costs correspond to *Observed* or *Potential* costs.
Choosing to keep one or both of them depends on your study objectives.
In addition, costs can also be based on
direct *reports* or *estimations*, or can be based on *extrapolations*: this is
documented in the `Acquisition_method` field. Extrapolation does
not necessarily mean *potential*: some Observed costs may have been extrapolated
from a reduced spatial extent. Below is a table showing the number of cases
of *extrapolations* and *reports/estimation* for *Observed* and *Potential* costs. As
you can see, the majority of Observed costs are based on *reports* /
*estimations*; yet a few are based on *extrapolations.*
```{r obs}
table(invacost$Acquisition_method, invacost$Implementation)
```
For the rest of this tutorial, we will be working only on costs categorised as
*High* in `Method_reliability` and "Observed" in `Implementation`:
```{r filter1}
invacost <- invacost[which(invacost$Method_reliability == "High"), ]
invacost <- invacost[which(invacost$Implementation == "Observed"), ]
# Number of rows after filtering
nrow(invacost)
```
## How do I know when costs occurred? {#costtime}
A crucial aspect in analysing cost values is to know the periods of time
over which costs have occurred. Indeed, knowing the period over which a cost occurred
allows to derive cumulative costs and to estimate average annual costs; it also
enables temporal analyses of costs.
We have stored information on the periods of time over which cost occurred
in two columns: `Probable_starting_year` and `Probable_ending_year`.
However, this information was not readily available in a
substantial portion of the papers we compiled in the database: for
`r nrow(invacost[which(is.na(invacost$Probable_starting_year) |
is.na(invacost$Probable_ending_year)), ])` out of
`r nrow(invacost)` papers (`r round(nrow(invacost[which(is.na(invacost$Probable_starting_year) |
is.na(invacost$Probable_ending_year)), ]) / nrow(invacost), 3) * 100`
% of papers), this information was not available for at
least one of the two columns.
Therefore, for papers for which it was not available, we made
educated guesses on the probable starting and ending years.
These educated guesses were based on conservative rules
(e.g., if no duration information was provided,
then the impact was reported of one year only). These estimated starting and
ending years are available in
two new columns in the database called `Probable_starting_year_adjusted` and
`Probable_ending_year_adjusted`.
As it stands now, each cost has a starting and an ending year, and schematically
it looks like this:
| Cost ID | Species | Annual Cost | Starting year | Ending year |
|------------|:--------------|--------:|-----------:|--------:|
| 1 | Sp 1 | 100 | 1998 | 2001 |
| 2 | Sp 2 | 15 | 2005 | 2007 |
| 3 | Sp 3 | 3 | 1981 | 1981 |
However, to properly analyse the temporal trends of costs, we need to *expand*
it, to make it look like this:
| Cost ID | Species | Annual Cost | Year |
|------------|:--------------|------------:|-----:|
| 1 | Sp 1 | 100 | 1998 |
| 1 | Sp 1 | 100 | 1999 |
| 1 | Sp 1 | 100 | 2000 |
| 1 | Sp 1 | 100 | 2001 |
| 2 | Sp 2 | 15 | 2005 |
| 2 | Sp 2 | 15 | 2006 |
| 2 | Sp 2 | 15 | 2007 |
| 3 | Sp 3 | 3 | 1981 |
To do this, we use the function `expandYearlyCosts`, to
which we provide the starting and ending year columns. It will store the
years over which monetary costs have occurred in a column named `Impact_year`.
In this example, expanded database is here stored in an object called
`db.over.time`.
```{r i2y, cache = TRUE}
# Expanding and formating the database
db.over.time <- expandYearlyCosts(invacost,
startcolumn = "Probable_starting_year_adjusted",
endcolumn = "Probable_ending_year_adjusted")
# Let's see some columns
head(db.over.time[, c("Cost_ID", "Species",
"Cost_estimate_per_year_2017_USD_exchange_rate",
"Impact_year")])
```
## How complete are our monetary data?
It is impossible to evaluate the absolute degree of completeness of our InvaCost
datbase - we know that we lack data for many taxonomic groups and so are data
from many places around the globe (most cost data available in InvaCost are
located in North America and Europe and are cost data primarily available in
English and to a lesser extent in other languages). You have to be aware of
these potential biases and remember them when interpreting data/analyses.
There is, however, a temporal bias that we can at least evaluate. Indeed, we
can expect that there is delay between the timing at which the economic impact
of an invasive species occurs, and the timing at which people will start
estimating the cost of the impact, which will later be followed by publication
in a report or journal.
We can grasp an idea of this delay by looking at the difference between
`Impact_year` and `Publication_year` in the expanded database, `db.over.time`.
```{r lag1, cache=TRUE, fig.height = 4, fig.width = 12}
# Calculating time lag
db.over.time$Publication_lag <- db.over.time$Publication_year - db.over.time$Impact_year
# Make a nice boxplot of the time lag
ggplot(db.over.time,
aes(y = Publication_lag)) +
geom_boxplot(outlier.alpha = .2) +
ylab("Publication lag (in years)") +
theme_minimal() +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
axis.text.y = element_blank(),
axis.text = element_text(size = 16),
axis.title = element_text(size = 18)) +
scale_y_continuous(breaks = c(-25, 0, 25, 50, 75, 100),
labels = c(-25, 0, 25, 50, 75, 100)) +
xlab("") +
coord_flip()
```
*Note that the few occurrences of publications before economic impacts (negative
lag) corresponded to planned budgets over specific periods expanding beyond
the publication year.*
Our suggestion is to use the quartiles of publication lag as an indication of
cost data completeness.
The first quartile indicates the delay to reach 25% completeness;
the median indicates the delay to reach 50% completeness; and the third quartile
indicates the delay to reach 75% completeness.
Let's see these delays in practice:
```{r lag4.1, cache=TRUE}
quantiles <- quantile(db.over.time$Publication_lag, probs = c(.25, .5, .75))
quantiles
```
The median delay between impact and publication of impact was
**`r median(db.over.time$Publication_lag)` years**. In other words, **for the last
`r median(db.over.time$Publication_lag)` years, we can expect to have less than 50%
of the economic impacts**.
Likewise, we can say it takes approximately `r quantiles[1]` year to have 25% of the
cost data for any year, and `r quantiles[3]` years to reach at least 75% of
cost data.
Hence, any analysis on recent years will be based on incomplete data and is
highly likely to provide an underestimation of actual costs.
It is up to you to determine how you desire to include this information in your
analyses; here, we will provide examples of how we suggest doing that.
One question that may arise for people working on specific subsets of the
database (e.g., only one taxon or one country) is whether you should evaluate
the time lag on your subset. *I would recommend to avoid that, because your
subset may be too incomplete to provide a decent estimation of the time lag.
**Therefore, we would suggest to evaluate the time lag only on the global dataset,
as we did here.***
Last, note that there may be a correlation between the time lag and the
magnitude of cost estimates. Indeed, direct reports of small costs tend to be
aired earlier than large estimates of invasion impacts. It is an important
issue to consider when investigating the time lag. For example, the quantiles
of time lag for all costs above US\$ 1 million annually are `r quantile(db.over.time$Publication_lag[db.over.time$Cost_estimate_per_year_2017_USD_exchange_rate > 1e6], probs = c(.25, .5, .75))` years.
# Approach based on available estimates: cumulative and average costs of invasions
## Basic usage
The first method to analyse monetary costs of invasive species consists in
calculating the observed cumulative and average costs - as they appear in the
collected materials - over a specific period
of time, or at different time intervals.
We implemented this method in the function `summarizeCosts`. It will
calculate the cumulative and average costs for the entire period, as well as
for different time intervals (by default, 10-year intervals).
```{r raw1, cache=TRUE}
obs.costs <- summarizeCosts(db.over.time,
maximum.year = 2020)
```
We can get a summary of the results by typing the name of the object in the
console
```{r raw2, cache=TRUE}
obs.costs
```
And we can have a graphical output with:
```{r raw3, cache=TRUE, fig.width=8, fig.height = 8}
plot(obs.costs)
```
This graph represents the observed annual costs of biological invasions over
time. Points are total annual costs for every year (i.e., all individual costs
for a specific year are summed). Horizontal bars represent the average annual
cost for a specific interval of time (here, 10-year intervals). Notice how
average annual costs can sometimes be driven by a limited number of high-cost
years. The dashed line connects average annual cost of each time interval at mid-years
(black dots). The horizontal dotted line represents the average cost over the
entire period.
The average annual cost seems to suspiciously drop for the most recent years -
most likely because of the time lag between the occurrence of a cost and its
reporting. Therefore, caution would be required in interpreting the values of
most recent years.
A good way to illustrate this issue would be to look at the number of estimates
per year.
```{r raw3.}
ggplot(obs.costs$cost.per.year,
aes(x = year, y = number_estimates)) +
geom_point() +
ylab("Number of estimates") +
xlab("Year") +
theme_minimal()
```
We can see that the number of estimates increases exponentially, but this
increase seems to slow down/drop after 2015.
We can refine this graph to add the relationship with annual costs:
```{r raw3..}
ggplot(obs.costs$cost.per.year,
aes(x = year, y = number_estimates,
size = cost)) +
geom_point() +
ylab("Number of estimates") +
xlab("Year") +
theme_minimal()
```
**Note, however, that the number of estimates in itself is not a meaningful
index to describe the sampling effort. Indeed, a single estimate can be
a complete synthesis over a country, and thus be complete in itself, whereas
many estimates can be direct reports of small localized costs from grey
litterature, and altogether still represent an incomplete picture of overall
costs.** Notably, recent years have been characterized by a high number of
direct reports from the grey literature, which can result in a false sense of
completeness when only looking at the number estimates (e.g.,
[Angulo et al. 2021](https://doi.org/10.1016/j.scitotenv.2020.144441),
[Renault et al. 2021](https://doi.org/10.3897/neobiota.67.59134)).
Use this information, in combination with the quantiles of
completeness calculated above, to decide how you want to interpret the last years,
and whether or not you want to omit them in further analyses. Remember that
no choice will be objective here, but also that objectivity does not emerge from
the absence of choice (keeping biased data is a choice in itself).
We would recommend a transparently explained choice, along with e.g. a
sensitivity analysis.
We can access the content of the output object with
```{r raw3.1, cache=TRUE}
str(obs.costs)
```
Notice that the expanded database used to calculate costs has been stored in
the object, in a slot called `cost.data`.
This is especially important for reproducibility: in case you
decide to publish your work, you can provide this R object which has the exact
copy of your specific version/filters of the database.
There are also some other important elements in this object:
- `parameters`: provides arguments you chose and basic information about your
dataset
- `year.breaks`: your time intervals
- `cost.per.year`: costs for every year in the data with number of estimates
- `average.total.cost`: contains cumulative and average annual costs for the
entire time period
- `average.cost.per.period`: contains cumulative and average annual costs
for each time interval
You can access each element with the `$` sign; for example for the costs
for all time intervals:
```{r raw3.2, cache=TRUE}
obs.costs$average.cost.per.period
```
## Customising parameters
There are two main parameters to customize:
- **beginning** (`minimum.year`) and **ending year** (`maximum.year`) of the entire
time period. For example in our
analyses for the Nature paper ([Diagne et al. 2021](https://doi.org/10.1038/s41586-021-03405-6))
we chose to start in 1970, because data for the
1960s are scarce and uncertain.
```{r raw4, cache=TRUE}
obs.costs2 <- summarizeCosts(db.over.time,
minimum.year = 1970,
maximum.year = 2017)
obs.costs2
```
The function tells you how many values were removed from the
dataset because they were outside the 1970-2017 time periods.
- **time intervals**: set them with the arguments `year.breaks`, where you specify
the starting year of each interval. For example, if your specify
`year.breaks = c(1970, 1980, 1990, 2000, 2010, 2017)`, then intervals will be
[1970-1979], [1980-1989], [1990-1999], [2000-2009], [2010-2017]
```{r raw4.1, cache=TRUE}
# let's plot 20-year intervals
obs.costs3 <- summarizeCosts(db.over.time,
minimum.year = 1960,
maximum.year = 2017,
year.breaks = seq(1960, 2017, by = 20))
plot(obs.costs3)
```
## Customising graphs
There are two methods to customise graphs.
- The first one is to use the
standard ggplot produced by the package and adding things or changing
parameters. This method is easy to implement but you cannot change everything
(e.g. adjust the colors/shapes of points is not possible). This is what we
will see here. See the help here: `?plot.invacost.costsummary`
- The second one is to make your own ggplot from the output object. It is
more difficult to implement if you are not familiar with graphs in R - but it
will be fully customisable.
[Take a look at the scripts from our main paper](http://borisleroy.com/invacost/global_invasion_costs_scripts.html)
to see how to that.
There are two base plots provided with the package; you have already seen the
default, and here is another one:
```{r raw5, cache=TRUE}
plot(obs.costs,
plot.type = "bars")
```
You can also remove the log10 scale:
```{r raw5.1, cache=TRUE}
plot(obs.costs,
plot.type = "bars",
cost.transf = NULL)
```
To customise parameters using the standard ggplot produced by the package,
you will have to set the argument `graphical.parameters = "manual"`.
```{r raw5.2, cache=TRUE}
# Store the plot in object p1 to customize it afterwards
p1 <- plot(obs.costs,
graphical.parameters = "manual")
# Show the graph in its initial state
p1
```
You see that when we specify `graphical.parameters = "manual"`, all the cosmetic choices
we made in the function are removed. You can now choose them by yourself; here
is a starting point:
```{r raw5.3, cache=TRUE}
# Customize p1 now
p1 <- p1 +
xlab("Year") +
ylab("Average annual cost of invasions in US$ millions") +
scale_x_continuous(breaks = obs.costs$year.breaks) + # X axis breaks
theme_bw() + # Minimal theme
scale_y_log10(breaks = 10^(-15:15), # y axis in log 10 with pretty labels
labels = scales::comma) +
annotation_logticks(sides = "l") # log10 tick marks on y axis
# Let's see how it goes
p1
```
Your turn to play with graphs now!
# Modelling approach: estimating the average annual cost of invasions
The second method we provide in the package consists in estimating the long-term
trend in annual cost with different modelling techniques. In other words, we fit
a model to predict costs as a function of years. Then, we can inspect the
different models and the shapes of cost trends over time, and grasp an idea
of dynamics of invasion costs.
This approach requires more data than the approach based on available estimates
and for data-poor cases it
will not work or provide inadequate results.
## Correction for data incompleteness due to publication lag
Because the average annual monetary cost of invasive species will be determined
by the trend over time, we should consider applying a correction to ‘incomplete’
years.
This is no simple decision and your choice will have to be justified.
There are two different methods we can apply here, either independently or in
combination.
- The first method consists in applying a threshold of incompleteness to remove
the most incomplete years. For example, remove
from calibration all years with < 75% of data; threshold = `r quantiles[3]` years.
- Another possibility includes weighting incomplete years to reduce their importance in
the estimation of average annual costs of invasions.
This approach can be justified if we do not want to underestimate the average annual
cost of invasions. However, be advised that reviewers may disagree with your
choices of weights, so justify them carefully, and consider analysing the
difference with / without weights.
### Example of a vector of incompleteness weights
An example
to reduce the negative impact of the incompleteness of recent years
would be to apply weights proportional to their degree of incompleteness. For
example, apply the following set of rules:
• completeness ≤ 25%: exclusion
• 25% < completeness ≤ 50%: weight = 0.25
• 50% < completeness ≤ 75%: weight = 0.50
• completeness > 75%: weight = 1
Remember that we stored quantiles in the beginning of this tutorial, so we can
access them now to know to what years they correspond:
```{r incomp}
quantiles
```
In the next lines of code we will create a vector of weights for each year, which we
can provide to the function later on.
```{r lag5, cache = TRUE}
# Creating the vector of weights
year_weights <- rep(1, length(1960:2017))
names(year_weights) <- 1960:2017
# Assigning weights
# Below 25% the weight does not matter because years will be removed
year_weights[names(year_weights) >= (2017 - quantiles["25%"])] <- 0
# Between 25 and 50%, assigning 0.25 weight
year_weights[names(year_weights) >= (2017 - quantiles["50%"]) &
names(year_weights) < (2017 - quantiles["25%"])] <- .25
# Between 50 and 75%, assigning 0.5 weight
year_weights[names(year_weights) >= (2017 - quantiles["75%"]) &
names(year_weights) < (2017 - quantiles["50%"])] <- .5
# Let's look at it
year_weights
```
## Assumptions
As we fit several models based on different techniques, we suggest to define
rules for deciding which model(s) should be finally considered.
Here are examples of rules:
- statistical information about the quality of the fit:
adequate error estimations for models, sensitivity to outliers, the Root Mean
Square Error (RMSE - lower is generally better but it may also
be indicative of model overfitting), inspection of model terms and residuals
- simplicity: for models with similar performance, we prefer the models with
less assumptions
- a qualitative rationale on the probable shape of trends over time: because
of the exponential increase in the number of invasive species globally (Seebens
et al. 2017), we expect the long-term temporal trend over time to be either
increasing or stabilising, but not decreasing. Hence, we assume that a model
describing a decreasing trend in recent years (i.e., for years lower than the
75% completeness threshold) would indicate an effect of the lack of data for
recent years.
## Models included in the ensemble modelling
There are several models included in the function. All models are calibrated
with cost data as the response variable and time as predictive variable. Note
that the monetary data have statistical issues typical to *econometric* data:
heteroskedasticity, temporal autocorrelation and outliers. Therefore, our choice of
modelling methods was driven by methods relatively robust to such issues.
**Note: when using the package, please cite the individual model packages.
Update citations with your package versions! To do that
type for example `citation("earth")` in R**
- **Ordinary least square regressions** (hereafter called OLS, R package `stats`,
function `lm`).
OLS regressions are the classical regression methods used in statistics.
Coefficient estimates should be relatively robust to heteroscedasticity &
temporal autocorrelation but not to outliers. However, error estimates need to
be corrected, so we used the Heteroskedasticity and Autocorrelation Consistent
covariance matrix estimations from the R package
`sandwich` (Andrews 1991, Zeileis 2004), and we
used the function `coeftest` from package `lmtest` (Zeileis & Hothorn 2004) to test for the significance
of estimates upon this corrected variance covariance matrix. We
implemented two OLS methods:
+ **linear OLS regressions**
+ **quadratic OLS regressions**
- **Robust regressions** (R package `robustbase`,
function `lmrob`, Maechler et al. 2020). We implemented MM-type regressions (hereafter called
robust regressions) based on iteratively reweighted least squares since these types
of regressions are less sensitive to outliers than ordinary least square
regressions (and we do have outliers in InvaCost!) (Yohai 1987, Koller and
Stahel 2011). In addition, this method estimates standard errors robust to heteroskedasticity and autocorrelation as described in Croux et al. 2003. *Thanks
Andrew Kramer for this addition*. Likewise to OLS regressions, we implemented
two methods:
+ **linear robust regressions**
+ **quadratic robust regressions**
- **Multivariate adaptive regression splines** (Multivariate Adaptive Regression
Splines, MARS, R package `earth`, function `earth`, Milborrow 2018). The MARS
model is a non-parametric regression method which automatically models
nonlinearities, using Generalized Cross-Validation to avoid overfitting
(Friedman 1991, Hastie et al. 2009). The default parameters are implemented in
order to follow Friedman's default, as described in
[Milborrow (2019a).](http://www.milbo.org/doc/earth-notes.pdf)
We account for heteroskedasticity by
fitting a linear model on the residuals. This linear model is fitted by
Iteratively Reweighting Least Squares. Note, however, that we have enough
data to only approximately model variance, as explained in
[Milborrow (2019b)](http://www.milbo.org/doc/earth-varmod.pdf). Therefore, MARS
models will have more uncertainty in the prediction intervals than in the
predictions themselves.
- **Generalized additive models** (Generalized Additive Models, GAM, R package
`mgcv`, function `gam`, Wood et al. 2016). GAM models are automatic flexible statistical methods
used to identify and characterize nonlinear regression effects
(Hastie et al. 2009).
The GAM model will also show non-linear patterns in cost trends over
time. To account for heteroskedasticity, we used a
location-scale method which consists in fitting two GAMs, one for the average
trend and one for the standard deviation. We used a simple Gaussian location
scale family because, likewise to GAMs, we have enough
data to only approximately model variance.
- **Quantile regressions** (R package `quantreg`, Koenker et al. 2020).
Contrary to previous models,
quantile regressions do not try to estimate the average value, they estimate
a specific quantile. In the package we chose quantiles 0.1 (lower boundary
of annual costs), 0.5 (median cost value) and 0.9 (upper boundary of annual costs).
**References:**
Andrews DWK (1991), Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. _Econometrica_, *59*, 817–858.
Croux, C., Dhaene, G. and Hoorelbeke, D. (2003) Robust standard errors for robust estimators, _Discussion Papers Series 03.16_, K.U. Leuven, CES.
Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines". _The Annals of Statistics._ *19* (1): 1–67.
Hastie T, Tibshirani R, Friedman J. (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York City, USA
Koenker, R. quantreg: Quantile Regression. R package version 5.61. Available at
http://CRAN.R-project.org/package=quantreg (2020).
Koller, M. and Stahel, W.A. (2011) Sharpening Wald-type inference in robust regression for small samples. Computational _Statistics & Data Analysis_ *55*(8), 2504–2515
Maechler M, Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Koller M, Eduardo LT, Conceicao and di Palma MA (2020). robustbase: Basic Robust Statistics R package version 0.93-6. URL http://CRAN.R-project.org/package=robustbase
Milborrow, S. earth: Multivariate Adaptive regression Splines, R package version 4.6. 3 (2018).
Milborrow S. (2019a) Notes on earth package. Vignettes of the R package ‘earth’, http://www.milbo.org/doc/earth-notes.pdf
Milborrow S. (2019b) Variance models in earth. Vignettes of the R package ‘earth’, http://www.milbo.org/doc/earth-varmod.pdf
Wood, S.N., N. Pya and B. Saefken (2016), Smoothing parameter and model selection for general smooth models. _Journal of the American Statistical Association_ *111*, 1548-1575
Yohai, V., Stahel, W.A. and Zamar, R. (1991) A procedure for robust estimation and inference in linear regression; in Stahel and Weisberg (eds), _Directions in Robust Statistics and Diagnostics, Part II_, Springer, New York, 365–374
Zeileis A (2004). “Econometric Computing with HC and HAC Covariance Matrix Estimators.” _Journal of Statistical Software_, *11*(10), 1-17.
Zeileis A, Hothorn T (2002). Diagnostic Checking in Regression Relationships. _R News_ *2*(3), 7-10. URL https://CRAN.R-project.org/doc/Rnews/
**A few remarks:**
Paying attention to the statistical robustness of the model is important.
However, remember that even if we had found a model that has an ideal fit to the