diff --git a/.gitignore b/.gitignore index 787433b..a082621 100644 --- a/.gitignore +++ b/.gitignore @@ -11,4 +11,5 @@ error_reproduction_scripts # To reduce repo size vignettes/vignette_files -vignettes/vignette_cache \ No newline at end of file +vignettes/vignette_cache +inst/doc/BRCindicators_files \ No newline at end of file diff --git a/R/bayesian_meta_analysis.R b/R/bayesian_meta_analysis.R index 7796e0b..4af7297 100644 --- a/R/bayesian_meta_analysis.R +++ b/R/bayesian_meta_analysis.R @@ -43,11 +43,9 @@ #' A generic method for estimating and smoothing multispecies biodiversity indices, robust to intermittent data. #' \emph{Journal of Agricultural Biological and Environmental Statistics}, in revision. #' @export -#' @examples -#' -#' # Only run if there is a JAGS installation -#' if(suppressWarnings(runjags::testjags(silent = TRUE)$JAGS.found)){ #' +#' @examples +#' \dontrun{ #' # Create some example data in the format required #' data <- data.frame(species = rep(letters, each = 50), #' year = rep(1:50, length(letters)), @@ -60,8 +58,7 @@ #' # Plot the resulting indicator #' plot_indicator(indicator = bma_indicator[,'Index.Mprime'], #' CIs = bma_indicator[,c(3,4)]) -#' -#' } +#' } bma <- function (data, plot = TRUE, diff --git a/benchmarking/package_notes_and_benchmarking.Rmd b/benchmarking/package_notes_and_benchmarking.Rmd index eb618ed..d0dcc99 100644 --- a/benchmarking/package_notes_and_benchmarking.Rmd +++ b/benchmarking/package_notes_and_benchmarking.Rmd @@ -7,9 +7,9 @@ output: keep_md: yes toc: yes --- -**This shows how us to use the indicator pipeline to create biodiversity indicators such as those for DEFRA Biodiversity Indicators in Your Pocke.** +*This shows how us to use the indicator pipeline to create biodiversity indicators such as those for DEFRA Biodiversity Indicators in Your Pocke.* -**BRCindicators works with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals.** +*BRCindicators works with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals.* First, we create some example data. NOTE, the format is the same as the data created within the sparta vignette. ```{r create_some_data} @@ -69,17 +69,360 @@ para_out <- sapply(unique(myData$taxa), occ_mod_function) ``` -**Now that we have some species trends data to work with, we can use the first function in BRCindicators. This function reads in all the output files from sparta and returns a simple summary table that we can use for calculating the indicator.** +*Now that we have some species trends data to work with, we can use the first function in BRCindicators. This function reads in all the output files from sparta and returns a simple summary table that we can use for calculating the indicator.* ```{r set_up_run} library(BRCindicators) # All we have to supply is the directory where out data is saved # You will note this is the 'output_dir' passed to sparta above. -trends_summary <- summarise_occDet(para_out) +trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe') # Lets see the summary head(trends_summary[,1:5]) ``` -**Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species.** \ No newline at end of file +*Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species.* + +We are now in a position to calculate an indictor. There are a number of options available for this + +## Geometric mean + +*We must first re-scale the data so the value for all species for the first year is the same. Afterwards we calculate the geometric mean across species for each year. This approach accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, thus stopping them dramatically changing the indicator value in the year they join the dataset. Likweise, it accounts for species that leave the dataset before the end, holding them at their last value. Finally limites to species values can be given, preventing extremely high or low values that could bias the indicator.* + +### Rescaling and calculating geometric mean + +To show the capability of the rescale_species function, we mess up the data + +```{r trends_summary} +trends_summary[1:3, 'a'] <- NA +trends_summary[1:5, 'b'] <- NA +trends_summary[2:4, 'c'] <- 1000 +trends_summary[45:50, 'd'] <- NA + +# Let's have a look at these changes +head(trends_summary[,1:5]) +tail(trends_summary[,1:5]) +``` + +*We now have two species with data missing at the beginning and one species with data missing at the end. We additionally have a species with some very high values.* + +We can now rescale + +```{r rescaled_indicator} +# Let's run this data through our scaling function (all defaults used) +rescaled_trends <- rescale_species(Data = trends_summary) + +# Here's the result +head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) +tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) +``` + +*species 'a' and 'b' enter the dataset at the geometric mean. All present species are indexed at 100 in the first year. 'c' is capped at 10000. 'd' has been held at it's end value.* + +### Confidence intervals + +*We can get confidence intervals for this indicator by bootstrapping across species.* + +```{r create_confidence_intervals} +# This function takes just the species columns +scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] +indicator_CIs <- bootstrap_indicator(Data = scaled_species) + +# Returned are the CIs for our indicator +head(indicator_CIs) +``` + +to create a smoothed indicator value (which is not always necessary), we fit a GAM (general additive model) to the indicator using a spline. The spline is a smoothed curve through the raw values. + +```{r smoothing_indicator} +# The smoothing function takes the indicator values +smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) + +# In this example there is little support for a non-linear trend and +# so the line almost linear +plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator']) +lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red') + +# But if our indicator did support a non-linear trend it might look +# like this +eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5) +eg_smoothed <- GAM_smoothing(eg_indicator) +plot(x = 1:50, y = eg_indicator) +lines(x = 1:50, y = eg_smoothed, col = 'red') +``` + +### Plotting + +*We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, or you could create your own.* + +```{r plot_indicator} +# Plot our indicator. +plot_indicator(indicator = rescaled_trends[,'indicator'], + smoothed_line = smoothed_indicator, + CIs = indicator_CIs) +``` + +You can see in this plot the high upper confidence interval in years 2-4, due to the artificially high values we gave to species 'c'. + +## Bayesian Meta-Analysis (BMA) + +*The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. We therefore require an additional error for each species-year estimate.* + +lets create this + +```{r BMAdata} +# Here is an example dataset for the BMA method +data <- data.frame(species = rep(letters, each = 50), + year = rep(1:50, length(letters)), + index = runif(n = 50 * length(letters), min = 0, max = 1), + se = runif(n = 50 * length(letters), min = 0.01, max = .1)) +head(data) +``` + +*Your data must be in the same format and your columns must have the same names* + +First run bma with the default settings + +```{r runBMA} +bma_indicator <- bma(data) +``` + +*function returns a plot to your screen which is a diagnostic of the model. When Converged the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting `plot` to `FALSE`. By default the method runs the chains in series. Running them in parallel makes the models run faster. You can change this with the `parallel` parameter. The number of iterations defaults to 10000. If possible you should run it for more. `m.scale` gives the scale your data is on, and it is important this is correct. Choose from 'loge' (natural log, sometimes simply called 'log'), 'log10' (log to the base 10), or 'logit' (output from models of proportions or probabilities).* + +We can implement these changes +```{r runBMAparameters} +bma_indicator2 <- bma(data, + parallel = TRUE, + n.iter = 500, + m.scale = 'log10') +``` + +*The model nolonger has a good convergence and the graphs on the right are no longer a smooth bell shape because we have reduced the number of interations.* + +*The object returned from `bma` is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function `write.csv`.* + +```{r BMAresults} +head(bma_indicator) +``` + +*use the plotting function in BRCindicators to plot the results of this analysis* + +```{r BMAplot} +plot_indicator(indicator = bma_indicator[,'Index.M'], + CIs = bma_indicator[,c(3,4)]) +``` + +## Multi-species Indicator + +*The multi-species indicator method was developed by Statistics Netherlands. Here is an example of how this method runs in `BRCindicators`.* + +First lets create some mock data of species population decreasing over the years. As part of this method we must set all species values to 100 in the first year with a standard error of 0. That is first done below +```{r msi1, fig.height=4} +# Create some example data in the format required +nyr = 20 +species = rep(letters, each = nyr) +year = rev(rep(1:nyr, length(letters))) + +# Create an index value that increases with time +index = rep(seq(50, 100, length.out = nyr), length(letters)) +# Add randomness to species +index = index * runif(n = length(index), 0.7, 1.3) +# Add correlated randomness across species, to years +index = index * rep(runif(0.8, 1.2, n = nyr), length(letters)) + +se = runif(n = nyr * length(letters), min = 10, max = 20) + +data <- data.frame(species, year, index, se) + +# Our species are decreasing +plot(data$year, data$index) + +# Species index values need to be 100 in the base year. Here I use +# the first year as my base year and rescale to 100. The standard error +# in the base year should be 0. +min_year <- min(data$year) + +for(sp in unique(data$species)){ + + subset_data <- data[data$species == sp, ] + multi_factor <- 100 / subset_data$index[subset_data$year == min_year] + data$index[data$species == sp] <- data$index[data$species == sp] * multi_factor + data$se[data$species == sp] <- data$se[data$species == sp] * multi_factor + data$se[data$species == sp][1] <- 0 + +} + +# Our first year is now indexed at 100 +plot(data$year, data$index) +``` + +Now we run the `msi` function. +```{r} +# Run the MSI function +msi_out <- msi(data) + +head(msi_out$CV) +``` + +*code returns two plots to the console. The first shows the coefficient of variation (CV) for each of the species. Species with high CV may adversly effect the relaibility of the trend estimation. Use this graph to identify the CV values of the species and use the `maxCV` parameter to set a threshold above which species will be excluded. The CV values are hard to assign to species from this plot as the species are coded to numbers. To see the raw values look at the CV component of msi_out (i.e. `msi_out$CV`). The second plot shows the smoothed trend and the MSI values.* + +Let's create a second example and set some more parameters. The parameters for `msi` get passed to `msi_tool`. + +```{r msi2, fig.height=4} +msi_out <- msi(data, + nsim = 500, # The number of Mote Carlo simulations + SEbaseyear = 10, # The year to index on + plotbaseyear = 15, # The year to set as 100 in plots + index_smoot = 'INDEX', # plotbaseyear uses MSI not trend + span = 0.7, # 'wigglyness' of line, between 0 and 1 + lastyears = 5, # last X years of time series for short-term trends + maxCV = 10, # maximum allowed Coefficient of Variation + changepoint = 10, # compare trends before and after this year + truncfac = 8, # max year-to-year index ratio + TRUNC = 5, #set all indices below TRUNC to this + plot = TRUE # should the plots be returned?) + ) +``` + +*These parameters are unrealistic but shows the options available. Note, in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years.* + +*The first of the two elements (`results`) returned the data* + +```{r msi3} +# The returned object has 2 elements +head(msi_out$results) +``` + +*The second element (`trends`) give a summary of various trend assessments* + +```{r msi4} +# The returned object has 2 elements +msi_out$trends +``` + +*We have means of exploring the effect of changing the span value in the analysis* + +```{r msi_span, fig.height=4} +for(i in c(0.3, 0.5, 0.7)){ # use a range of values for span + + msi_out <- msi(data, span = i, # span is set to i + nsim = 200, plot = FALSE) + print( # print makes the plot visible in the for loop + plot(msi_out, title = paste('MSI - span =', i)) # plot + ) + +} +``` + +## Lambda Indicator + +*The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages: First, the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable). Other indicators adopt a rescaling approach to ensure species entering the indicator after the first year do not bias the assessment. It thirdly creates a robust, although untestable, assumption about species that drop out of the indicator prior to the final year: It assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator.* + +*Input data is on the occupancy scale and ranges between 0 and 1.* + +```{r lambda_1} +# number of species +nsp = 50 + +# number of years +nyr = 40 + +#number of iterations +iter = 500 + +# Build a random set of data +myArray <- array(data = rnorm(n = nsp*nyr*iter, + mean = 0.5, + sd = 0.1), + dim = c(nsp, nyr, iter), + dimnames = list(paste0('SP',1:nsp), + 1:nyr, + 1:iter)) + +# Ensure values are bounded by 0 and 1 +myArray[myArray > 1] <- 1 +myArray[myArray < 0] <- 0 + +str(myArray) +``` + +*`lambda_indicator` takes in an array of data in a three-dimensional matrix. The dimensions of this array represent species, years, and iterations. Each row represents a species and each column a year. The third dimension of the array contains the iterations. Essentially each slice contains occupancy estimates for each species year combination for a single iteration and the overall array contains as many slices as there are iterations.* + +```{r lambda_2} +# Run the lambda_interpolation method on this data +myIndicator <- lambda_indicator(myArray) + +# Plot the indicator +plot_indicator(myIndicator$summary[,'indicator'], + myIndicator$summary[,c('lower' ,'upper')]) + +``` + +*There are a number of options available in the `lambda_indicator` function* + +```{r lambda_3} +myIndicator <- lambda_indicator(myArray, + index = 1, # Set the index value to 1 not 100 + year_range = c(30,40), # Set year range + threshold_yrs = 5) # set a threshold +plot_indicator(myIndicator$summary[,'indicator'], + myIndicator$summary[,c('lower' ,'upper')]) +``` + +*Note that there are a range of threshold functions. There are options to remove species year estimates based on their standard deviation, Rhat value and based on the number of years a species is present. Note the Rhat threshold can only be used if you are using a directory path as you input rather than an array.* + +# Creating a custom pipeline function + +*in a 'pipeline' we want data to flow through seamlessly. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above.* + +```{r,cache=TRUE} + +library(sparta) +library(BRCindicators) + +# create a pipeline function +run_pipeline <- function(input_dir){ + + # Create the trends summary + trends_summary <- summarise_occDet(input_dir = input_dir) + + # Rescale the values and get the indicator values + # Here I set the index to 1 and change the value limits + rescaled_trends <- rescale_species(Data = trends_summary, + index = 1, + max = 100, + min = 0.001) + + # Bootstrap the indicator to get CIs + scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] + # This time I set the iterations to twice the default and + # use custom confidence intervals + indicator_CIs <- bootstrap_indicator(Data = scaled_species, + CI_limits = c(0.25, 0.75), + iterations = 20000) + + # Get the smoothed indicator line + smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) + + # This time I specify the years and index value + plot_indicator(indicator = rescaled_trends[,'indicator'], + year = rescaled_trends[,'year'], + index = 1, + CIs = indicator_CIs, + smoothed_line = smoothed_indicator) + + ## I'll return all my data + return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary))) + } +``` + +Once we have created this function we can run this pipeline on a directory in one line + +```{r, in_one} +# Now we can run the pipeline in one line, like a boss +indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe') + +head(indicator_data) +``` + diff --git a/inst/doc/BRCindicators.html b/inst/doc/BRCindicators.html new file mode 100644 index 0000000..9888532 --- /dev/null +++ b/inst/doc/BRCindicators.html @@ -0,0 +1,840 @@ + + + + + + + + + + + + + + +BRCindicators + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
+ +
+ +
require(snowfall)
+
## Loading required package: snowfall
+
## Loading required package: snow
+
+

Introduction

+

This document shows how to use the indicator pipeline to create biodiversity indicators such as those for DEFRA’s Biodiversity Indicators in Your Pocket. The pipeline is shared in the form of an R package called ‘BRCindicators’ making it easy to share and maintain.

+

The functions in BRCindicators work with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals

+

This package has the ability to read in the output of occupancy models created in the R package sparta, a package for estimating species trends from occurrence data. This package can be installed from Github and details of how to use the package are given in the package vignette. There is no need to use sparta to create your yearly species estimates as BRCindicators can also work with other data.

+

To create an indicator we first need to have species trends, let’s create some using the sparta R package.

+
+
+

Creating yearly estimates of occurrence in sparta

+

If you already have yearly estimates of abundance or occurrence for your species you can skip this stage. Here we show how you can create these estimates from raw species observation data using sparta.

+
# Install the package from CRAN
+# THIS WILL WORK ONLY AFTER THE PACKAGE IS PUBLISHED
+install.packages('sparta')
+
+# Or install the development version from GitHub
+library(devtools)
+install_github('biologicalrecordscentre/sparta')
+

Let’s assume you have some raw data already, we can under take occupancy modelling like this

+
# Create data
+n <- 8000 # size of dataset
+nyr <- 50 # number of years in data
+nSamples <- 200 # set number of dates
+nSites <- 100 # set number of sites
+set.seed(125) # set a random seed
+
+# Create somes dates
+first <- as.Date(strptime("1950/01/01", "%Y/%m/%d")) 
+last <- as.Date(strptime(paste(1950+(nyr-1),"/12/31", sep=''), "%Y/%m/%d")) 
+dt <- last-first 
+rDates <- first + (runif(nSamples)*dt)
+
+# taxa are set semi-randomly
+taxa_probabilities <- seq(from = 0.1, to = 0.7, length.out = 26)
+taxa <- sample(letters, size = n, TRUE, prob = taxa_probabilities)
+
+# sites are visited semi-randomly
+site_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSites)
+site <- sample(paste('A', 1:nSites, sep=''), size = n, TRUE, prob = site_probabilities)
+
+# the date of visit is selected semi-randomly from those created earlier
+time_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSamples)
+time_period <- sample(rDates, size = n, TRUE, prob = time_probabilities)
+
+myData <- data.frame(taxa, site, time_period)
+
# Load the sparta package
+library(sparta)
+
## Loading required package: lme4
+
## Loading required package: Matrix
+

For demonstration purposes I have a faked dataset of 8000 species observations. In my dataset the species are named after the letters in the alphabet. Below I show how I can use the Bayesian occupancy models in sparta to create yearly estimates of occurrence. For more information please see the vignette for sparta

+
# Preview of my data
+head(myData)
+
##   taxa site time_period
+## 1    r  A51  1970-01-14
+## 2    v  A87  1980-09-29
+## 3    e  A56  1996-04-14
+## 4    z  A28  1959-01-16
+## 5    r  A77  1970-09-21
+## 6    x  A48  1990-02-25
+
# First format our data
+formattedOccData <- formatOccData(taxa = myData$taxa,
+                                  site = myData$site,
+                                  survey = myData$time_period)
+
## Warning in errorChecks(taxa = taxa, site = site, survey = survey,
+## closure_period = closure_period): 94 out of 8000 observations will be
+## removed as duplicates
+
# Here we are going to use the package snowfall to parallelise
+library(snowfall)
+
+# I have 4 cpus on my PC so I set cpus to 4
+# when I initialise the cluster
+sfInit(parallel = TRUE, cpus = 4)
+
## Warning in searchCommandline(parallel, cpus = cpus, type = type,
+## socketHosts = socketHosts, : Unknown option on commandline:
+## rmarkdown::render('W:/PYWELL_SHARED/Pywell Projects/BRC/Tom August/R
+## Packages/BRCindicators/vignettes/BRCindicators.Rmd', encoding
+
## R Version:  R version 3.5.2 (2018-12-20)
+
## snowfall 1.84-6.1 initialized (using snow 0.4-3): parallel execution on 4 CPUs.
+
# Export my data to the cluster
+sfExport('formattedOccData')
+
+# I create a function that takes a species name and runs my model
+occ_mod_function <- function(taxa_name){
+  
+  library(sparta)
+  
+  # Note that this will write you results to your computer
+  # the location is set to your user folder
+  occ_out <- occDetFunc(taxa_name = as.character(taxa_name),
+                        n_iterations = 200,
+                        burnin = 15, 
+                        occDetdata = formattedOccData$occDetdata,
+                        spp_vis = formattedOccData$spp_vis,
+                        write_results = TRUE,
+                        output_dir = '~/Testing_indicator_pipe',
+                        seed = 123)  
+} 
+
+# I then run this in parallel
+system.time({
+para_out <- sfClusterApplyLB(unique(myData$taxa), occ_mod_function)
+})
+
##    user  system elapsed 
+##    0.04    0.05  139.66
+
# Stop the cluster
+sfStop()
+
## 
+## Stopping cluster
+
# We can see all the files this has created
+list.files('~/Testing_indicator_pipe')
+
##  [1] "a.rdata" "b.rdata" "c.rdata" "d.rdata" "e.rdata" "f.rdata" "g.rdata"
+##  [8] "h.rdata" "i.rdata" "j.rdata" "k.rdata" "l.rdata" "m.rdata" "n.rdata"
+## [15] "o.rdata" "p.rdata" "q.rdata" "r.rdata" "s.rdata" "t.rdata" "u.rdata"
+## [22] "v.rdata" "w.rdata" "x.rdata" "y.rdata" "z.rdata"
+
+
+

Installing BRCindicators

+

Installing the package is easy and can be done in a couple of lines

+
library(devtools)
+install_github(repo = 'biologicalrecordscentre/BRCindicators')
+
+
+

Summarising sparta output for an indicator

+

Now that we have some species trends data to work with (no doubt you already have your own) we can use the first function in BRCindicators. This function reads in all the output files from sparta (which are quite large and complex) and returns a simple summary table that we can use for calculating the indicator. If you have done your analysis without using sparta you can skip to the next step.

+
library(BRCindicators)
+
+# All we have to supply is the directory where out data is saved
+# You will note this is the 'output_dir' passed to sparta above.
+trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe')
+
## Loading data...done
+
# Lets see the summary
+head(trends_summary[,1:5])
+
##      year         a         b         c         d
+## [1,] 1950 0.6936066 0.6412022 0.5689617 0.5595082
+## [2,] 1951 0.7638251 0.4861749 0.4021311 0.5510929
+## [3,] 1952 0.6248634 0.6434973 0.2830601 0.5293443
+## [4,] 1953 0.7024590 0.7724590 0.4861202 0.3753552
+## [5,] 1954 0.6174317 0.5121858 0.7556831 0.3475956
+## [6,] 1955 0.4302732 0.3584699 0.6068852 0.5328962
+

Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species. The values in the table are the mean of the posterior for the predicted proportion of sites occupied, a measure of occurrence.

+
+
+

Calculating indicator values

+

Once we have species-year indicies we are in a position to proceed to calculating an indictor. To do this there are a number of mehods available, some of which are presented here in ‘BRCindicators’

+
+

Geometric mean

+

The geometric mean method is often used with data that do not have errors associated with them.

+

The first step is to re-scale the data so that the value for all species in the first year is the same. Once this is done we calculate the geometric mean across species for each year creating the indicator value. This function also accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, this stops them dramatically changing the indicator value in the year they join the dataset. It also accounts for species that leave the dataset before the end by holding them at their last value. Finally limits to species values can be given, preventing extremely high or low values biasing the indicator.

+
+

Rescaling and calculating geometric mean

+

The data I have generated in ‘trends_summary’ is very easy to work with but to show off what this function can do I’m going to mess it up a bit.

+
trends_summary[1:3, 'a'] <- NA
+trends_summary[1:5, 'b'] <- NA
+trends_summary[2:4, 'c'] <- 1000
+trends_summary[45:50, 'd'] <- NA
+
+# Let's have a look at these changes
+head(trends_summary[,1:5])
+
##      year         a         b            c         d
+## [1,] 1950        NA        NA    0.5689617 0.5595082
+## [2,] 1951        NA        NA 1000.0000000 0.5510929
+## [3,] 1952        NA        NA 1000.0000000 0.5293443
+## [4,] 1953 0.7024590        NA 1000.0000000 0.3753552
+## [5,] 1954 0.6174317        NA    0.7556831 0.3475956
+## [6,] 1955 0.4302732 0.3584699    0.6068852 0.5328962
+
tail(trends_summary[,1:5])
+
##       year         a         b         c  d
+## [45,] 1994 0.1308743 0.6617486 0.3050273 NA
+## [46,] 1995 0.5324044 0.2455191 0.7130601 NA
+## [47,] 1996 0.7407650 0.4489617 0.5519672 NA
+## [48,] 1997 0.3051366 0.7181421 0.7871585 NA
+## [49,] 1998 0.7282514 0.2540984 0.5586339 NA
+## [50,] 1999 0.5635519 0.4161202 0.4713661 NA
+

Now that I have ‘messed up’ the data a bit we have two species with data missing at the beginning and one species with data missing at the end. We also have one species with some very high values.

+

Now lets run this through the re-scaling function.

+
# Let's run this data through our scaling function (all defaults used)
+rescaled_trends <- rescale_species(Data = trends_summary)
+
+# Here's the result
+head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
+
##      year indicator         a        b          c         d
+## [1,] 1950 100.00000        NA       NA   100.0000 100.00000
+## [2,] 1951 119.03406        NA       NA 10000.0000  98.49595
+## [3,] 1952 126.49450        NA       NA 10000.0000  94.60885
+## [4,] 1953 118.53382 118.53382       NA 10000.0000  67.08663
+## [5,] 1954  94.25122 104.18620       NA   132.8179  62.12521
+## [6,] 1955  99.15398  72.60485 99.15398   106.6654  95.24368
+
tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
+
##       year indicator         a         b         c        d
+## [45,] 1994  92.77695  22.08390 183.04187  53.61122 140.7169
+## [46,] 1995 102.48140  89.83859  67.91140 125.32655 140.7169
+## [47,] 1996 105.83425 124.99763 124.18431  97.01306 140.7169
+## [48,] 1997 106.16325  51.48914 198.64048 138.34998 140.7169
+## [49,] 1998  94.26861 122.88605  70.28445  98.18479 140.7169
+## [50,] 1999  95.96255  95.09446 115.10023  82.84672 140.7169
+

You can see that species ‘a’ and ‘b’ enter the dataset at the geometric mean (the indicator value), all species are indexed at 100 in the first year and the very high values in ‘c’ are capped at 10000 at the end ‘d’ has been held at it’s end value.

+

The ‘indicator’ column that is returned here is our indicator, calculated as the geometric mean of all the species in the data set.

+
+
+

Confidence intervals

+

We can get confidence intervals for this indicator by bootstrapping across species. We have a function for that too!

+
# This function takes just the species columns
+scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')]
+indicator_CIs <- bootstrap_indicator(Data = scaled_species)
+
## Running bootstrapping for 10000 iterations...done
+
# Returned are the CIs for our indicator
+head(indicator_CIs)
+
##      quant_025 quant_975
+## [1,] 100.00000  100.0000
+## [2,]  87.33579  190.2349
+## [3,]  95.60814  198.0757
+## [4,]  88.79716  183.7160
+## [5,]  80.60150  110.3239
+## [6,]  86.15162  115.1020
+
+
+

Smoothing

+

It is sometimes desirable to create a smoothed indicator value from the raw values. This can be achieved by fitting a GAM (general additive model) to the indicator using a spline. This spline is a smoothed curve that goes through the raw values for the indicator and is fitted using the function ‘gam’ in the ‘mgcv’ R package.

+
# The smoothing function takes the indicator values
+smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator'])
+
+# In this example there is little support for a non-linear trend and 
+# so the line almost linear
+plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator'])
+lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red')
+

+
# But if our indicator did support a non-linear trend it might look 
+# like this
+eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5)
+eg_smoothed <- GAM_smoothing(eg_indicator)
+plot(x = 1:50, y = eg_indicator)
+lines(x = 1:50, y = eg_smoothed, col = 'red')
+

+

Where there is little support for a non-linear trend a GAM smoothed line will tend towards linear. Where there is good support for a non-linear trend the smoothed line will become more ‘bendy’.

+
+
+

Plotting

+

We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, however you could easily create your own plots in R using the data.

+
# Plot our indicator.
+plot_indicator(indicator = rescaled_trends[,'indicator'],
+               smoothed_line = smoothed_indicator,
+               CIs = indicator_CIs)
+

+

In this plot you can see the high upper confidence interval in years 2-4, this is due to the artificially high values we gave to species ‘c’.

+
+
+
+

Bayesian Meta-Analysis (BMA)

+

The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. As with other methods we require data from more than one species, across a number of years, with an error for each species-year estimate.

+
# Here is an example dataset for the BMA method
+data <- data.frame(species = rep(letters, each = 50),
+                   year = rep(1:50, length(letters)), 
+                   index = runif(n = 50 * length(letters), min = 0, max = 1), 
+                   se = runif(n = 50 * length(letters), min = 0.01, max = .1))
+head(data)
+
##   species year     index         se
+## 1       a    1 0.5377357 0.01769631
+## 2       a    2 0.9644076 0.07268163
+## 3       a    3 0.3653818 0.08335768
+## 4       a    4 0.7266121 0.04494251
+## 5       a    5 0.8660005 0.09993223
+## 6       a    6 0.4728771 0.09887020
+

It is important that your data is in the same format and that your columns are in the same order and have the same names. Remember you can use the function read.csv() to read in the data from a .csv on your computer.

+

BMA is run using the function bma, here we will use the default settings and then see what we can change.

+
bma_indicator <- bma(data)
+
## 
+## Processing function input....... 
+## 
+## Done. 
+##  
+## Compiling model graph
+##    Resolving undeclared variables
+##    Allocating nodes
+## Graph information:
+##    Observed stochastic nodes: 2600
+##    Unobserved stochastic nodes: 1379
+##    Total graph size: 6587
+## 
+## Initializing model
+## 
+## Adaptive phase..... 
+## Adaptive phase complete 
+##  
+## 
+##  Burn-in phase, 5000 iterations x 3 chains 
+##  
+## 
+## Sampling from joint posterior, 5000 iterations x 3 chains 
+##  
+## 
+## Calculating statistics.......
+
## Warning in doTryCatch(return(expr), name, parentenv, handler): At least one
+## Rhat value could not be calculated.
+
## 
+## Done.
+

+

The function returns a plot to your screen which is a diagnostic plot of the model. When the model has converged (i.e. reached a point where the three chains agree on the answer) the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting plot to FALSE. By default the method runs the chains in series. Running them in parallel makes the models run faster (about half the time) but will slow down your computer more. We can change this with the parameter parallel. The number of iterations the model runs is controlled by n.iter and defaults to 10000. If you can it is better to run it for more iterations, though this will take longer. m.scale gives the scale your data is on. It is very important that this is correct, choose from ‘loge’ (natural log, sometimes simply called ‘log’), ‘log10’ (log to the base 10), or ‘logit’ (output from models of proportions or probabilities).

+

Let’s implement a few of these changes

+
bma_indicator2 <- bma(data,
+                     parallel = TRUE,
+                     n.iter = 500,
+                     m.scale = 'log10')
+
## 
+## Processing function input....... 
+## 
+## Done. 
+##  
+## Beginning parallel processing using 3 cores. Console output will be suppressed.
+## 
+## Parallel processing completed.
+## 
+## Calculating statistics.......
+
## Warning in doTryCatch(return(expr), name, parentenv, handler): At least one
+## Rhat value could not be calculated.
+
## 
+## Done.
+

+

Because we have reduced the number of interations the model no longer has a good convergence. The lines on the graphs on the left do not overlap and the graphs on the right are no longer a smooth bell shape.

+

The object that is returned is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function write.csv.

+
head(bma_indicator)
+
##   Year    Index  lower2.5 upper97.5
+## 1    1 100.0000 100.00000  100.0000
+## 2    2 100.1559  98.00631  102.3808
+## 3    3 100.3311  97.39855  103.5038
+## 4    4 100.5416  96.95652  104.3762
+## 5    5 100.5611  96.63643  104.8945
+## 6    6 100.4664  95.99929  105.0179
+

We can use the plotting function in BRCindicators to plot the results of this analysis, which in this case are not all that interesting!

+
plot_indicator(indicator = bma_indicator[,'Index'],
+               CIs = bma_indicator[,c(3,4)])
+

+
+
+

Multi-species Indicator

+

The multi-species indicator method was developed by Statistics Netherlands and the code is made available on their website. To find out more about the inner working of this method please read the detailed documentation on the authors website. Here is a simple example of how this method runs in BRCindicators.

+
# Create some example data in the format required
+nyr = 20
+species = rep(letters, each = nyr)
+year = rev(rep(1:nyr, length(letters)))
+
+# Create an index value that increases with time
+index = rep(seq(50, 100, length.out = nyr), length(letters))
+# Add randomness to species
+index = index * runif(n = length(index), 0.7, 1.3)
+# Add correlated randomness across species, to years
+index = index * rep(runif(0.8, 1.2, n = nyr), length(letters))
+
+se = runif(n = nyr * length(letters), min = 10, max = 20)
+
+data <- data.frame(species, year, index, se)
+
+# Our species are decreasing
+plot(data$year, data$index)
+

+
# Species index values need to be 100 in the base year. Here I use
+# the first year as my base year and rescale to 100. The standard error
+# in the base year should be 0.
+min_year <- min(data$year)
+
+for(sp in unique(data$species)){
+
+  subset_data <- data[data$species == sp, ]
+  multi_factor <- 100 / subset_data$index[subset_data$year == min_year]
+  data$index[data$species == sp] <- data$index[data$species == sp] * multi_factor
+  data$se[data$species == sp] <- data$se[data$species == sp] * multi_factor
+  data$se[data$species == sp][1] <- 0
+
+}
+
+# Our first year is now indexed at 100
+plot(data$year, data$index)
+

+
# Alternativly I could read in data from a csv
+# data <- read.csv('path/to/my/data.csv')
+
+# Run the MSI function
+msi_out <- msi(data)
+

+
head(msi_out$CV)
+
##   species   mean_CV
+## 1       a 0.1927516
+## 2       b 0.1918315
+## 3       c 0.2048356
+## 4       d 0.2195197
+## 5       e 0.2048833
+## 6       f 0.2096112
+
# I can capture the output figures too
+# pdf('test.pdf')
+#   msi_out <- msi(data)
+# dev.off()
+

The code returns two plots to the console, the first plot shows the coefficient of variation (CV) for each of the species. Species with high values of CV may adversly effect the relaibility of the trend estimation. Use this graph to identify the CV values of the species and use the maxCV parameter to set a threshold above which species will be excluded. The results of excluding species in this way can be tested by comparing trend plots. The CV values are hard to assign to species from this plot as the species are coded to numbers. To see the raw values look at the CV component of msi_out (i.e. msi_out$CV). The second plot shows the smoothed trend and the MSI values. These two figures can be captured in the usual way in R by using pdf() for example. In the example I create a dataset from random numbers but usually you would use read.csv() to read in data from a local file.

+

Here is a second example which sets some additional parameters. The parameters for msi get passed to msi_tool so to see a list of all the parameters you can change look at the help documentation in msi_tool usign ?msi_tool at the R console. I cover most of hte important ones here.

+
msi_out <- msi(data,
+               nsim = 500, # The number of Mote Carlo simulations
+               SEbaseyear = 10, # The year to index on
+               plotbaseyear = 15, # The year to set as 100 in plots
+               index_smoot = 'INDEX', # plotbaseyear uses MSI not trend
+               span = 0.7, # 'wigglyness' of line, between 0 and 1
+               lastyears = 5, # last X years of time series for short-term trends
+               maxCV = 10, # maximum allowed Coefficient of Variation 
+               changepoint = 10, # compare trends before and after this year
+               truncfac = 8, # max year-to-year index ratio
+               TRUNC = 5, #set all indices below TRUNC to this
+               plot = TRUE # should the plots be returned?)
+               )
+

+

This set of parameters is unrealistic but shows the options available. Note that in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years.

+

The analysis also returns data which provide more insights into the analysis and let you create your own plots if required.

+
# The returned object has 2 elements
+head(msi_out$results)
+
##   year    MSI sd_MSI lower_CL_MSI upper_CL_MSI  Trend lower_CL_trend
+## 1    1 164.57   7.50       150.50       179.96 162.23         151.14
+## 2    2 164.60   7.11       151.24       179.13 156.51         149.63
+## 3    3 124.16   6.20       112.59       136.94 151.27         145.21
+## 4    4 154.53   6.93       141.53       168.72 146.51         140.92
+## 5    5 170.74   7.36       156.90       185.78 142.41         136.76
+## 6    6 143.41   6.76       130.77       157.30 139.41         134.05
+##   upper_CL_trend      trend_class
+## 1         173.85 moderate_decline
+## 2         163.73 moderate_decline
+## 3         157.31 moderate_decline
+## 4         152.66 moderate_decline
+## 5         148.15 moderate_decline
+## 6         145.21 moderate_decline
+

The first of the two elements (results) returned gives all the data, and a little more, that is presented in the second figure.

+
# The returned object has 2 elements
+msi_out$trends
+
##                             Measure    value     significance
+## 1                     overall trend   0.9691 moderate decline
+## 2                  SE overall trend   0.0019                 
+## 3                trend last 5 years   0.9711        uncertain
+## 4             SE trend last 5 years   0.0154                 
+## 5                  changepoint (10)  10.0000                 
+## 6     trend before changepoint (10)   0.9789 moderate decline
+## 7  SE trend before changepoint (10)   0.0046                 
+## 8      trend after changepoint (10)   0.9487 moderate decline
+## 9   SE trend after changepoint (10)   0.0040                 
+## 10                         % change -46.1560           p<0.01
+## 11                      SE % change   2.5360                 
+## 12            % change last 5 years  -9.4120           p<0.01
+## 13         SE % change last 5 years   2.3900                 
+## 14                      changepoint       NA           p<0.01
+
# I could write this as a csv too
+# write.csv(msi_out$trends, file = 'path/to/my/output.csv')
+

The second element (trends) returned give a summary of various trend assessments across the time series.

+

We have also added a plot method for the MSI output which provides a plot similar to that of the second figure we have seen already. Lets use this plot method to explore the effect of changing the span value in the analysis

+
for(i in c(0.3, 0.5, 0.7)){ # use a range of values for span
+  
+  msi_out <- msi(data, span = i, # span is set to i
+                 nsim = 200, plot = FALSE)
+  print( # print makes the plot visible in the for loop
+    plot(msi_out, title = paste('MSI - span =', i)) # plot
+  )
+  
+}
+

+

As the value of span gets closer to 1 the trend line gets smoother.

+
+
+

Lambda Indicator

+

The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx

+

Very few species’ models produced reliable occupancy estimates for every year, so a majority of the time series contain missing values. This presents a problem for estimating growth rates for each species-year combination. Missing values of growth rate that would be equivalent to linear interpolation of the log odds between adjacent years with reliable estimates were therefore calculated. This indicator can therefore work with missing values.

+

Input data is on the occupancy scale, and is therefore bounded between 0 and 1.

+
# number of species
+nsp = 50
+
+# number of years
+nyr = 40
+
+#number of iterations
+iter = 500
+
+# Build a random set of data
+myArray <- array(data = rnorm(n = nsp*nyr*iter,
+                               mean = 0.5,
+                               sd = 0.1),
+                  dim = c(nsp, nyr, iter),
+                  dimnames = list(paste0('SP',1:nsp),
+                                  1:nyr,
+                                  1:iter))
+
+# Ensure values are bounded by 0 and 1
+myArray[myArray > 1] <- 1
+myArray[myArray < 0] <- 0
+
+str(myArray)
+
##  num [1:50, 1:40, 1:500] 0.555 0.578 0.489 0.562 0.55 ...
+##  - attr(*, "dimnames")=List of 3
+##   ..$ : chr [1:50] "SP1" "SP2" "SP3" "SP4" ...
+##   ..$ : chr [1:40] "1" "2" "3" "4" ...
+##   ..$ : chr [1:500] "1" "2" "3" "4" ...
+

lambda_indicator takes in an array of data, a three-dimensional matrix. The dimensions of this array represent species, years, and iterations.

+
# Run the lambda_interpolation method on this data
+myIndicator <- lambda_indicator(myArray)
+
+# Plot the indicator
+plot_indicator(myIndicator$summary[,'indicator'],
+               myIndicator$summary[,c('lower' ,'upper')])
+

+

There are a number of options available in the lambda_indicator function

+
myIndicator <- lambda_indicator(myArray,
+                                index = 1, # Set the index value to 1 not 100
+                                year_range = c(30,40), # Set year range
+                                threshold_yrs = 5) # set a threshold
+plot_indicator(myIndicator$summary[,'indicator'],
+               myIndicator$summary[,c('lower' ,'upper')])
+

+

Note that there are a range of threshold functions that allow you to adjust which data points are used in the indicator. There are options to remove species year estimates based on their standard deviaction, rHat value and based on the number of years a species is present in the dataset. Note that the rhat threshold can only be used if you are using a directory path as you input rather than an array.

+
+
+
+

Creating a custom pipeline function

+

We have demonstrated how you might run the indicator functions one at a time, however in a ‘pipeline’ we want data to flow through seamlessly. Additionally there are a number of parameters in the functions that we have not shown you that you might find useful. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above, setting the parameters to meet our needs. Once we have done this it will allow use to execute our pipeline in one line.

+
# I call my function 'run_pipeline' and the only arguement it
+# takes is the directory of sparta's output
+run_pipeline <- function(input_dir){
+
+  require(sparta)
+  require(BRCindicators)
+  
+  # Create the trends summary
+  trends_summary <- summarise_occDet(input_dir = input_dir)
+
+  # Rescale the values and get the indicator values
+  # Here I set the index to 1 and change the value limits
+  rescaled_trends <- rescale_species(Data = trends_summary,
+                                     index = 1,
+                                     max = 100,
+                                     min = 0.001)
+  
+  # Bootstrap the indicator to get CIs
+  scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')]
+  # This time I set the iterations to twice the default and 
+  # use custom confidence intervals
+  indicator_CIs <- bootstrap_indicator(Data = scaled_species,
+                                       CI_limits = c(0.25, 0.75),
+                                       iterations = 20000)
+  
+  # Get the smoothed indicator line
+  smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator'])
+  
+  # This time I specify the years and index value
+  plot_indicator(indicator = rescaled_trends[,'indicator'],
+                 year = rescaled_trends[,'year'],
+                 index = 1,
+                 CIs = indicator_CIs,
+                 smoothed_line = smoothed_indicator)
+  
+  ## I'll return all my data  
+  return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary)))
+ }
+

Once we have created this function we can run this pipeline on a directory in one line, or put it in a loop to run across many directories.

+
# Now we can run the pipeline in one line, like a boss
+indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe')
+
## Loading data...done
+## Running bootstrapping for 20000 iterations...done
+

+
head(indicator_data)
+
##   smoothed_indicator  quant_25 quant_75 year         a         b         c
+## 1          0.9410110 1.0000000 1.000000 1950 0.6936066 0.6412022 0.5689617
+## 2          0.9410241 0.9161441 1.013355 1951 0.7638251 0.4861749 0.4021311
+## 3          0.9410373 0.9654977 1.054758 1952 0.6248634 0.6434973 0.2830601
+## 4          0.9410504 0.9418938 1.025378 1953 0.7024590 0.7724590 0.4861202
+## 5          0.9410636 0.8829743 0.980575 1954 0.6174317 0.5121858 0.7556831
+## 6          0.9410767 0.9152991 1.016191 1955 0.4302732 0.3584699 0.6068852
+##           d         e         f         g         h         i         j
+## 1 0.5595082 0.6493443 0.6055738 0.6360656 0.6808197 0.5023497 0.7240984
+## 2 0.5510929 0.6752459 0.7466667 0.6198361 0.3294536 0.7777049 0.5445902
+## 3 0.5293443 0.7142623 0.4499454 0.8585246 0.6218033 0.7155738 0.4503825
+## 4 0.3753552 0.4351366 0.7007650 0.6812568 0.2843716 0.7617486 0.7204918
+## 5 0.3475956 0.7890710 0.9208743 0.4224590 0.3489617 0.6709290 0.6630055
+## 6 0.5328962 0.6496175 0.5071038 0.8983607 0.8183607 0.1828415 0.5910383
+##           k         l         m         n         o         p         q
+## 1 0.6745355 0.7474317 0.4280874 0.8637705 0.4754098 0.3063934 0.8561749
+## 2 0.5612568 0.8439891 0.7219672 0.7681421 0.8899454 0.8829508 0.6603279
+## 3 0.7230055 0.6057377 0.8093443 0.6449727 0.8291803 0.5590164 0.9317486
+## 4 0.6781421 0.3494536 0.6693443 0.8687432 0.7666667 0.3861202 0.8895628
+## 5 0.4267760 0.7104918 0.5197268 0.6080328 0.9333880 0.6397814 0.8000000
+## 6 0.4730601 0.7218579 0.7813115 0.8701093 0.7691803 0.9534973 0.7195628
+##           r         s         t         u         v         w         x
+## 1 0.4616940 0.8857377 0.9212568 0.9278142 0.7791803 0.8990164 0.8789617
+## 2 0.4956284 0.5505464 0.6045902 0.9059016 0.5865027 0.9578689 0.5522404
+## 3 0.8715847 0.8665574 0.7143169 0.9012022 0.7806557 0.8487978 0.6573224
+## 4 0.7289617 0.9055191 0.9448634 0.7619126 0.8996721 0.7424044 0.8715847
+## 5 0.7063934 0.4081967 0.8507650 0.6043716 0.8098361 0.8346448 0.4581421
+## 6 0.5015847 0.6086339 0.7546448 0.9357923 0.9027322 0.8691257 0.8631148
+##           y         z
+## 1 0.8712022 0.8991257
+## 2 0.8714208 0.6720765
+## 3 0.8427869 0.7418033
+## 4 0.8560656 0.7687432
+## 5 0.9773770 0.6566667
+## 6 0.8149180 0.6977049
+
+ + + + +
+ + + + + + + + diff --git a/inst/doc/BRCindicators.html.orig b/inst/doc/BRCindicators.html.orig new file mode 100644 index 0000000..9888532 --- /dev/null +++ b/inst/doc/BRCindicators.html.orig @@ -0,0 +1,840 @@ + + + + + + + + + + + + + + +BRCindicators + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
+ +
+ +
require(snowfall)
+
## Loading required package: snowfall
+
## Loading required package: snow
+
+

Introduction

+

This document shows how to use the indicator pipeline to create biodiversity indicators such as those for DEFRA’s Biodiversity Indicators in Your Pocket. The pipeline is shared in the form of an R package called ‘BRCindicators’ making it easy to share and maintain.

+

The functions in BRCindicators work with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals

+

This package has the ability to read in the output of occupancy models created in the R package sparta, a package for estimating species trends from occurrence data. This package can be installed from Github and details of how to use the package are given in the package vignette. There is no need to use sparta to create your yearly species estimates as BRCindicators can also work with other data.

+

To create an indicator we first need to have species trends, let’s create some using the sparta R package.

+
+
+

Creating yearly estimates of occurrence in sparta

+

If you already have yearly estimates of abundance or occurrence for your species you can skip this stage. Here we show how you can create these estimates from raw species observation data using sparta.

+
# Install the package from CRAN
+# THIS WILL WORK ONLY AFTER THE PACKAGE IS PUBLISHED
+install.packages('sparta')
+
+# Or install the development version from GitHub
+library(devtools)
+install_github('biologicalrecordscentre/sparta')
+

Let’s assume you have some raw data already, we can under take occupancy modelling like this

+
# Create data
+n <- 8000 # size of dataset
+nyr <- 50 # number of years in data
+nSamples <- 200 # set number of dates
+nSites <- 100 # set number of sites
+set.seed(125) # set a random seed
+
+# Create somes dates
+first <- as.Date(strptime("1950/01/01", "%Y/%m/%d")) 
+last <- as.Date(strptime(paste(1950+(nyr-1),"/12/31", sep=''), "%Y/%m/%d")) 
+dt <- last-first 
+rDates <- first + (runif(nSamples)*dt)
+
+# taxa are set semi-randomly
+taxa_probabilities <- seq(from = 0.1, to = 0.7, length.out = 26)
+taxa <- sample(letters, size = n, TRUE, prob = taxa_probabilities)
+
+# sites are visited semi-randomly
+site_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSites)
+site <- sample(paste('A', 1:nSites, sep=''), size = n, TRUE, prob = site_probabilities)
+
+# the date of visit is selected semi-randomly from those created earlier
+time_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSamples)
+time_period <- sample(rDates, size = n, TRUE, prob = time_probabilities)
+
+myData <- data.frame(taxa, site, time_period)
+
# Load the sparta package
+library(sparta)
+
## Loading required package: lme4
+
## Loading required package: Matrix
+

For demonstration purposes I have a faked dataset of 8000 species observations. In my dataset the species are named after the letters in the alphabet. Below I show how I can use the Bayesian occupancy models in sparta to create yearly estimates of occurrence. For more information please see the vignette for sparta

+
# Preview of my data
+head(myData)
+
##   taxa site time_period
+## 1    r  A51  1970-01-14
+## 2    v  A87  1980-09-29
+## 3    e  A56  1996-04-14
+## 4    z  A28  1959-01-16
+## 5    r  A77  1970-09-21
+## 6    x  A48  1990-02-25
+
# First format our data
+formattedOccData <- formatOccData(taxa = myData$taxa,
+                                  site = myData$site,
+                                  survey = myData$time_period)
+
## Warning in errorChecks(taxa = taxa, site = site, survey = survey,
+## closure_period = closure_period): 94 out of 8000 observations will be
+## removed as duplicates
+
# Here we are going to use the package snowfall to parallelise
+library(snowfall)
+
+# I have 4 cpus on my PC so I set cpus to 4
+# when I initialise the cluster
+sfInit(parallel = TRUE, cpus = 4)
+
## Warning in searchCommandline(parallel, cpus = cpus, type = type,
+## socketHosts = socketHosts, : Unknown option on commandline:
+## rmarkdown::render('W:/PYWELL_SHARED/Pywell Projects/BRC/Tom August/R
+## Packages/BRCindicators/vignettes/BRCindicators.Rmd', encoding
+
## R Version:  R version 3.5.2 (2018-12-20)
+
## snowfall 1.84-6.1 initialized (using snow 0.4-3): parallel execution on 4 CPUs.
+
# Export my data to the cluster
+sfExport('formattedOccData')
+
+# I create a function that takes a species name and runs my model
+occ_mod_function <- function(taxa_name){
+  
+  library(sparta)
+  
+  # Note that this will write you results to your computer
+  # the location is set to your user folder
+  occ_out <- occDetFunc(taxa_name = as.character(taxa_name),
+                        n_iterations = 200,
+                        burnin = 15, 
+                        occDetdata = formattedOccData$occDetdata,
+                        spp_vis = formattedOccData$spp_vis,
+                        write_results = TRUE,
+                        output_dir = '~/Testing_indicator_pipe',
+                        seed = 123)  
+} 
+
+# I then run this in parallel
+system.time({
+para_out <- sfClusterApplyLB(unique(myData$taxa), occ_mod_function)
+})
+
##    user  system elapsed 
+##    0.04    0.05  139.66
+
# Stop the cluster
+sfStop()
+
## 
+## Stopping cluster
+
# We can see all the files this has created
+list.files('~/Testing_indicator_pipe')
+
##  [1] "a.rdata" "b.rdata" "c.rdata" "d.rdata" "e.rdata" "f.rdata" "g.rdata"
+##  [8] "h.rdata" "i.rdata" "j.rdata" "k.rdata" "l.rdata" "m.rdata" "n.rdata"
+## [15] "o.rdata" "p.rdata" "q.rdata" "r.rdata" "s.rdata" "t.rdata" "u.rdata"
+## [22] "v.rdata" "w.rdata" "x.rdata" "y.rdata" "z.rdata"
+
+
+

Installing BRCindicators

+

Installing the package is easy and can be done in a couple of lines

+
library(devtools)
+install_github(repo = 'biologicalrecordscentre/BRCindicators')
+
+
+

Summarising sparta output for an indicator

+

Now that we have some species trends data to work with (no doubt you already have your own) we can use the first function in BRCindicators. This function reads in all the output files from sparta (which are quite large and complex) and returns a simple summary table that we can use for calculating the indicator. If you have done your analysis without using sparta you can skip to the next step.

+
library(BRCindicators)
+
+# All we have to supply is the directory where out data is saved
+# You will note this is the 'output_dir' passed to sparta above.
+trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe')
+
## Loading data...done
+
# Lets see the summary
+head(trends_summary[,1:5])
+
##      year         a         b         c         d
+## [1,] 1950 0.6936066 0.6412022 0.5689617 0.5595082
+## [2,] 1951 0.7638251 0.4861749 0.4021311 0.5510929
+## [3,] 1952 0.6248634 0.6434973 0.2830601 0.5293443
+## [4,] 1953 0.7024590 0.7724590 0.4861202 0.3753552
+## [5,] 1954 0.6174317 0.5121858 0.7556831 0.3475956
+## [6,] 1955 0.4302732 0.3584699 0.6068852 0.5328962
+

Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species. The values in the table are the mean of the posterior for the predicted proportion of sites occupied, a measure of occurrence.

+
+
+

Calculating indicator values

+

Once we have species-year indicies we are in a position to proceed to calculating an indictor. To do this there are a number of mehods available, some of which are presented here in ‘BRCindicators’

+
+

Geometric mean

+

The geometric mean method is often used with data that do not have errors associated with them.

+

The first step is to re-scale the data so that the value for all species in the first year is the same. Once this is done we calculate the geometric mean across species for each year creating the indicator value. This function also accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, this stops them dramatically changing the indicator value in the year they join the dataset. It also accounts for species that leave the dataset before the end by holding them at their last value. Finally limits to species values can be given, preventing extremely high or low values biasing the indicator.

+
+

Rescaling and calculating geometric mean

+

The data I have generated in ‘trends_summary’ is very easy to work with but to show off what this function can do I’m going to mess it up a bit.

+
trends_summary[1:3, 'a'] <- NA
+trends_summary[1:5, 'b'] <- NA
+trends_summary[2:4, 'c'] <- 1000
+trends_summary[45:50, 'd'] <- NA
+
+# Let's have a look at these changes
+head(trends_summary[,1:5])
+
##      year         a         b            c         d
+## [1,] 1950        NA        NA    0.5689617 0.5595082
+## [2,] 1951        NA        NA 1000.0000000 0.5510929
+## [3,] 1952        NA        NA 1000.0000000 0.5293443
+## [4,] 1953 0.7024590        NA 1000.0000000 0.3753552
+## [5,] 1954 0.6174317        NA    0.7556831 0.3475956
+## [6,] 1955 0.4302732 0.3584699    0.6068852 0.5328962
+
tail(trends_summary[,1:5])
+
##       year         a         b         c  d
+## [45,] 1994 0.1308743 0.6617486 0.3050273 NA
+## [46,] 1995 0.5324044 0.2455191 0.7130601 NA
+## [47,] 1996 0.7407650 0.4489617 0.5519672 NA
+## [48,] 1997 0.3051366 0.7181421 0.7871585 NA
+## [49,] 1998 0.7282514 0.2540984 0.5586339 NA
+## [50,] 1999 0.5635519 0.4161202 0.4713661 NA
+

Now that I have ‘messed up’ the data a bit we have two species with data missing at the beginning and one species with data missing at the end. We also have one species with some very high values.

+

Now lets run this through the re-scaling function.

+
# Let's run this data through our scaling function (all defaults used)
+rescaled_trends <- rescale_species(Data = trends_summary)
+
+# Here's the result
+head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
+
##      year indicator         a        b          c         d
+## [1,] 1950 100.00000        NA       NA   100.0000 100.00000
+## [2,] 1951 119.03406        NA       NA 10000.0000  98.49595
+## [3,] 1952 126.49450        NA       NA 10000.0000  94.60885
+## [4,] 1953 118.53382 118.53382       NA 10000.0000  67.08663
+## [5,] 1954  94.25122 104.18620       NA   132.8179  62.12521
+## [6,] 1955  99.15398  72.60485 99.15398   106.6654  95.24368
+
tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
+
##       year indicator         a         b         c        d
+## [45,] 1994  92.77695  22.08390 183.04187  53.61122 140.7169
+## [46,] 1995 102.48140  89.83859  67.91140 125.32655 140.7169
+## [47,] 1996 105.83425 124.99763 124.18431  97.01306 140.7169
+## [48,] 1997 106.16325  51.48914 198.64048 138.34998 140.7169
+## [49,] 1998  94.26861 122.88605  70.28445  98.18479 140.7169
+## [50,] 1999  95.96255  95.09446 115.10023  82.84672 140.7169
+

You can see that species ‘a’ and ‘b’ enter the dataset at the geometric mean (the indicator value), all species are indexed at 100 in the first year and the very high values in ‘c’ are capped at 10000 at the end ‘d’ has been held at it’s end value.

+

The ‘indicator’ column that is returned here is our indicator, calculated as the geometric mean of all the species in the data set.

+
+
+

Confidence intervals

+

We can get confidence intervals for this indicator by bootstrapping across species. We have a function for that too!

+
# This function takes just the species columns
+scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')]
+indicator_CIs <- bootstrap_indicator(Data = scaled_species)
+
## Running bootstrapping for 10000 iterations...done
+
# Returned are the CIs for our indicator
+head(indicator_CIs)
+
##      quant_025 quant_975
+## [1,] 100.00000  100.0000
+## [2,]  87.33579  190.2349
+## [3,]  95.60814  198.0757
+## [4,]  88.79716  183.7160
+## [5,]  80.60150  110.3239
+## [6,]  86.15162  115.1020
+
+
+

Smoothing

+

It is sometimes desirable to create a smoothed indicator value from the raw values. This can be achieved by fitting a GAM (general additive model) to the indicator using a spline. This spline is a smoothed curve that goes through the raw values for the indicator and is fitted using the function ‘gam’ in the ‘mgcv’ R package.

+
# The smoothing function takes the indicator values
+smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator'])
+
+# In this example there is little support for a non-linear trend and 
+# so the line almost linear
+plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator'])
+lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red')
+

+
# But if our indicator did support a non-linear trend it might look 
+# like this
+eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5)
+eg_smoothed <- GAM_smoothing(eg_indicator)
+plot(x = 1:50, y = eg_indicator)
+lines(x = 1:50, y = eg_smoothed, col = 'red')
+

+

Where there is little support for a non-linear trend a GAM smoothed line will tend towards linear. Where there is good support for a non-linear trend the smoothed line will become more ‘bendy’.

+
+
+

Plotting

+

We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, however you could easily create your own plots in R using the data.

+
# Plot our indicator.
+plot_indicator(indicator = rescaled_trends[,'indicator'],
+               smoothed_line = smoothed_indicator,
+               CIs = indicator_CIs)
+

+

In this plot you can see the high upper confidence interval in years 2-4, this is due to the artificially high values we gave to species ‘c’.

+
+
+
+

Bayesian Meta-Analysis (BMA)

+

The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. As with other methods we require data from more than one species, across a number of years, with an error for each species-year estimate.

+
# Here is an example dataset for the BMA method
+data <- data.frame(species = rep(letters, each = 50),
+                   year = rep(1:50, length(letters)), 
+                   index = runif(n = 50 * length(letters), min = 0, max = 1), 
+                   se = runif(n = 50 * length(letters), min = 0.01, max = .1))
+head(data)
+
##   species year     index         se
+## 1       a    1 0.5377357 0.01769631
+## 2       a    2 0.9644076 0.07268163
+## 3       a    3 0.3653818 0.08335768
+## 4       a    4 0.7266121 0.04494251
+## 5       a    5 0.8660005 0.09993223
+## 6       a    6 0.4728771 0.09887020
+

It is important that your data is in the same format and that your columns are in the same order and have the same names. Remember you can use the function read.csv() to read in the data from a .csv on your computer.

+

BMA is run using the function bma, here we will use the default settings and then see what we can change.

+
bma_indicator <- bma(data)
+
## 
+## Processing function input....... 
+## 
+## Done. 
+##  
+## Compiling model graph
+##    Resolving undeclared variables
+##    Allocating nodes
+## Graph information:
+##    Observed stochastic nodes: 2600
+##    Unobserved stochastic nodes: 1379
+##    Total graph size: 6587
+## 
+## Initializing model
+## 
+## Adaptive phase..... 
+## Adaptive phase complete 
+##  
+## 
+##  Burn-in phase, 5000 iterations x 3 chains 
+##  
+## 
+## Sampling from joint posterior, 5000 iterations x 3 chains 
+##  
+## 
+## Calculating statistics.......
+
## Warning in doTryCatch(return(expr), name, parentenv, handler): At least one
+## Rhat value could not be calculated.
+
## 
+## Done.
+

+

The function returns a plot to your screen which is a diagnostic plot of the model. When the model has converged (i.e. reached a point where the three chains agree on the answer) the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting plot to FALSE. By default the method runs the chains in series. Running them in parallel makes the models run faster (about half the time) but will slow down your computer more. We can change this with the parameter parallel. The number of iterations the model runs is controlled by n.iter and defaults to 10000. If you can it is better to run it for more iterations, though this will take longer. m.scale gives the scale your data is on. It is very important that this is correct, choose from ‘loge’ (natural log, sometimes simply called ‘log’), ‘log10’ (log to the base 10), or ‘logit’ (output from models of proportions or probabilities).

+

Let’s implement a few of these changes

+
bma_indicator2 <- bma(data,
+                     parallel = TRUE,
+                     n.iter = 500,
+                     m.scale = 'log10')
+
## 
+## Processing function input....... 
+## 
+## Done. 
+##  
+## Beginning parallel processing using 3 cores. Console output will be suppressed.
+## 
+## Parallel processing completed.
+## 
+## Calculating statistics.......
+
## Warning in doTryCatch(return(expr), name, parentenv, handler): At least one
+## Rhat value could not be calculated.
+
## 
+## Done.
+

+

Because we have reduced the number of interations the model no longer has a good convergence. The lines on the graphs on the left do not overlap and the graphs on the right are no longer a smooth bell shape.

+

The object that is returned is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function write.csv.

+
head(bma_indicator)
+
##   Year    Index  lower2.5 upper97.5
+## 1    1 100.0000 100.00000  100.0000
+## 2    2 100.1559  98.00631  102.3808
+## 3    3 100.3311  97.39855  103.5038
+## 4    4 100.5416  96.95652  104.3762
+## 5    5 100.5611  96.63643  104.8945
+## 6    6 100.4664  95.99929  105.0179
+

We can use the plotting function in BRCindicators to plot the results of this analysis, which in this case are not all that interesting!

+
plot_indicator(indicator = bma_indicator[,'Index'],
+               CIs = bma_indicator[,c(3,4)])
+

+
+
+

Multi-species Indicator

+

The multi-species indicator method was developed by Statistics Netherlands and the code is made available on their website. To find out more about the inner working of this method please read the detailed documentation on the authors website. Here is a simple example of how this method runs in BRCindicators.

+
# Create some example data in the format required
+nyr = 20
+species = rep(letters, each = nyr)
+year = rev(rep(1:nyr, length(letters)))
+
+# Create an index value that increases with time
+index = rep(seq(50, 100, length.out = nyr), length(letters))
+# Add randomness to species
+index = index * runif(n = length(index), 0.7, 1.3)
+# Add correlated randomness across species, to years
+index = index * rep(runif(0.8, 1.2, n = nyr), length(letters))
+
+se = runif(n = nyr * length(letters), min = 10, max = 20)
+
+data <- data.frame(species, year, index, se)
+
+# Our species are decreasing
+plot(data$year, data$index)
+

+
# Species index values need to be 100 in the base year. Here I use
+# the first year as my base year and rescale to 100. The standard error
+# in the base year should be 0.
+min_year <- min(data$year)
+
+for(sp in unique(data$species)){
+
+  subset_data <- data[data$species == sp, ]
+  multi_factor <- 100 / subset_data$index[subset_data$year == min_year]
+  data$index[data$species == sp] <- data$index[data$species == sp] * multi_factor
+  data$se[data$species == sp] <- data$se[data$species == sp] * multi_factor
+  data$se[data$species == sp][1] <- 0
+
+}
+
+# Our first year is now indexed at 100
+plot(data$year, data$index)
+

+
# Alternativly I could read in data from a csv
+# data <- read.csv('path/to/my/data.csv')
+
+# Run the MSI function
+msi_out <- msi(data)
+

+
head(msi_out$CV)
+
##   species   mean_CV
+## 1       a 0.1927516
+## 2       b 0.1918315
+## 3       c 0.2048356
+## 4       d 0.2195197
+## 5       e 0.2048833
+## 6       f 0.2096112
+
# I can capture the output figures too
+# pdf('test.pdf')
+#   msi_out <- msi(data)
+# dev.off()
+

The code returns two plots to the console, the first plot shows the coefficient of variation (CV) for each of the species. Species with high values of CV may adversly effect the relaibility of the trend estimation. Use this graph to identify the CV values of the species and use the maxCV parameter to set a threshold above which species will be excluded. The results of excluding species in this way can be tested by comparing trend plots. The CV values are hard to assign to species from this plot as the species are coded to numbers. To see the raw values look at the CV component of msi_out (i.e. msi_out$CV). The second plot shows the smoothed trend and the MSI values. These two figures can be captured in the usual way in R by using pdf() for example. In the example I create a dataset from random numbers but usually you would use read.csv() to read in data from a local file.

+

Here is a second example which sets some additional parameters. The parameters for msi get passed to msi_tool so to see a list of all the parameters you can change look at the help documentation in msi_tool usign ?msi_tool at the R console. I cover most of hte important ones here.

+
msi_out <- msi(data,
+               nsim = 500, # The number of Mote Carlo simulations
+               SEbaseyear = 10, # The year to index on
+               plotbaseyear = 15, # The year to set as 100 in plots
+               index_smoot = 'INDEX', # plotbaseyear uses MSI not trend
+               span = 0.7, # 'wigglyness' of line, between 0 and 1
+               lastyears = 5, # last X years of time series for short-term trends
+               maxCV = 10, # maximum allowed Coefficient of Variation 
+               changepoint = 10, # compare trends before and after this year
+               truncfac = 8, # max year-to-year index ratio
+               TRUNC = 5, #set all indices below TRUNC to this
+               plot = TRUE # should the plots be returned?)
+               )
+

+

This set of parameters is unrealistic but shows the options available. Note that in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years.

+

The analysis also returns data which provide more insights into the analysis and let you create your own plots if required.

+
# The returned object has 2 elements
+head(msi_out$results)
+
##   year    MSI sd_MSI lower_CL_MSI upper_CL_MSI  Trend lower_CL_trend
+## 1    1 164.57   7.50       150.50       179.96 162.23         151.14
+## 2    2 164.60   7.11       151.24       179.13 156.51         149.63
+## 3    3 124.16   6.20       112.59       136.94 151.27         145.21
+## 4    4 154.53   6.93       141.53       168.72 146.51         140.92
+## 5    5 170.74   7.36       156.90       185.78 142.41         136.76
+## 6    6 143.41   6.76       130.77       157.30 139.41         134.05
+##   upper_CL_trend      trend_class
+## 1         173.85 moderate_decline
+## 2         163.73 moderate_decline
+## 3         157.31 moderate_decline
+## 4         152.66 moderate_decline
+## 5         148.15 moderate_decline
+## 6         145.21 moderate_decline
+

The first of the two elements (results) returned gives all the data, and a little more, that is presented in the second figure.

+
# The returned object has 2 elements
+msi_out$trends
+
##                             Measure    value     significance
+## 1                     overall trend   0.9691 moderate decline
+## 2                  SE overall trend   0.0019                 
+## 3                trend last 5 years   0.9711        uncertain
+## 4             SE trend last 5 years   0.0154                 
+## 5                  changepoint (10)  10.0000                 
+## 6     trend before changepoint (10)   0.9789 moderate decline
+## 7  SE trend before changepoint (10)   0.0046                 
+## 8      trend after changepoint (10)   0.9487 moderate decline
+## 9   SE trend after changepoint (10)   0.0040                 
+## 10                         % change -46.1560           p<0.01
+## 11                      SE % change   2.5360                 
+## 12            % change last 5 years  -9.4120           p<0.01
+## 13         SE % change last 5 years   2.3900                 
+## 14                      changepoint       NA           p<0.01
+
# I could write this as a csv too
+# write.csv(msi_out$trends, file = 'path/to/my/output.csv')
+

The second element (trends) returned give a summary of various trend assessments across the time series.

+

We have also added a plot method for the MSI output which provides a plot similar to that of the second figure we have seen already. Lets use this plot method to explore the effect of changing the span value in the analysis

+
for(i in c(0.3, 0.5, 0.7)){ # use a range of values for span
+  
+  msi_out <- msi(data, span = i, # span is set to i
+                 nsim = 200, plot = FALSE)
+  print( # print makes the plot visible in the for loop
+    plot(msi_out, title = paste('MSI - span =', i)) # plot
+  )
+  
+}
+

+

As the value of span gets closer to 1 the trend line gets smoother.

+
+
+

Lambda Indicator

+

The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx

+

Very few species’ models produced reliable occupancy estimates for every year, so a majority of the time series contain missing values. This presents a problem for estimating growth rates for each species-year combination. Missing values of growth rate that would be equivalent to linear interpolation of the log odds between adjacent years with reliable estimates were therefore calculated. This indicator can therefore work with missing values.

+

Input data is on the occupancy scale, and is therefore bounded between 0 and 1.

+
# number of species
+nsp = 50
+
+# number of years
+nyr = 40
+
+#number of iterations
+iter = 500
+
+# Build a random set of data
+myArray <- array(data = rnorm(n = nsp*nyr*iter,
+                               mean = 0.5,
+                               sd = 0.1),
+                  dim = c(nsp, nyr, iter),
+                  dimnames = list(paste0('SP',1:nsp),
+                                  1:nyr,
+                                  1:iter))
+
+# Ensure values are bounded by 0 and 1
+myArray[myArray > 1] <- 1
+myArray[myArray < 0] <- 0
+
+str(myArray)
+
##  num [1:50, 1:40, 1:500] 0.555 0.578 0.489 0.562 0.55 ...
+##  - attr(*, "dimnames")=List of 3
+##   ..$ : chr [1:50] "SP1" "SP2" "SP3" "SP4" ...
+##   ..$ : chr [1:40] "1" "2" "3" "4" ...
+##   ..$ : chr [1:500] "1" "2" "3" "4" ...
+

lambda_indicator takes in an array of data, a three-dimensional matrix. The dimensions of this array represent species, years, and iterations.

+
# Run the lambda_interpolation method on this data
+myIndicator <- lambda_indicator(myArray)
+
+# Plot the indicator
+plot_indicator(myIndicator$summary[,'indicator'],
+               myIndicator$summary[,c('lower' ,'upper')])
+

+

There are a number of options available in the lambda_indicator function

+
myIndicator <- lambda_indicator(myArray,
+                                index = 1, # Set the index value to 1 not 100
+                                year_range = c(30,40), # Set year range
+                                threshold_yrs = 5) # set a threshold
+plot_indicator(myIndicator$summary[,'indicator'],
+               myIndicator$summary[,c('lower' ,'upper')])
+

+

Note that there are a range of threshold functions that allow you to adjust which data points are used in the indicator. There are options to remove species year estimates based on their standard deviaction, rHat value and based on the number of years a species is present in the dataset. Note that the rhat threshold can only be used if you are using a directory path as you input rather than an array.

+
+
+
+

Creating a custom pipeline function

+

We have demonstrated how you might run the indicator functions one at a time, however in a ‘pipeline’ we want data to flow through seamlessly. Additionally there are a number of parameters in the functions that we have not shown you that you might find useful. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above, setting the parameters to meet our needs. Once we have done this it will allow use to execute our pipeline in one line.

+
# I call my function 'run_pipeline' and the only arguement it
+# takes is the directory of sparta's output
+run_pipeline <- function(input_dir){
+
+  require(sparta)
+  require(BRCindicators)
+  
+  # Create the trends summary
+  trends_summary <- summarise_occDet(input_dir = input_dir)
+
+  # Rescale the values and get the indicator values
+  # Here I set the index to 1 and change the value limits
+  rescaled_trends <- rescale_species(Data = trends_summary,
+                                     index = 1,
+                                     max = 100,
+                                     min = 0.001)
+  
+  # Bootstrap the indicator to get CIs
+  scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')]
+  # This time I set the iterations to twice the default and 
+  # use custom confidence intervals
+  indicator_CIs <- bootstrap_indicator(Data = scaled_species,
+                                       CI_limits = c(0.25, 0.75),
+                                       iterations = 20000)
+  
+  # Get the smoothed indicator line
+  smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator'])
+  
+  # This time I specify the years and index value
+  plot_indicator(indicator = rescaled_trends[,'indicator'],
+                 year = rescaled_trends[,'year'],
+                 index = 1,
+                 CIs = indicator_CIs,
+                 smoothed_line = smoothed_indicator)
+  
+  ## I'll return all my data  
+  return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary)))
+ }
+

Once we have created this function we can run this pipeline on a directory in one line, or put it in a loop to run across many directories.

+
# Now we can run the pipeline in one line, like a boss
+indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe')
+
## Loading data...done
+## Running bootstrapping for 20000 iterations...done
+

+
head(indicator_data)
+
##   smoothed_indicator  quant_25 quant_75 year         a         b         c
+## 1          0.9410110 1.0000000 1.000000 1950 0.6936066 0.6412022 0.5689617
+## 2          0.9410241 0.9161441 1.013355 1951 0.7638251 0.4861749 0.4021311
+## 3          0.9410373 0.9654977 1.054758 1952 0.6248634 0.6434973 0.2830601
+## 4          0.9410504 0.9418938 1.025378 1953 0.7024590 0.7724590 0.4861202
+## 5          0.9410636 0.8829743 0.980575 1954 0.6174317 0.5121858 0.7556831
+## 6          0.9410767 0.9152991 1.016191 1955 0.4302732 0.3584699 0.6068852
+##           d         e         f         g         h         i         j
+## 1 0.5595082 0.6493443 0.6055738 0.6360656 0.6808197 0.5023497 0.7240984
+## 2 0.5510929 0.6752459 0.7466667 0.6198361 0.3294536 0.7777049 0.5445902
+## 3 0.5293443 0.7142623 0.4499454 0.8585246 0.6218033 0.7155738 0.4503825
+## 4 0.3753552 0.4351366 0.7007650 0.6812568 0.2843716 0.7617486 0.7204918
+## 5 0.3475956 0.7890710 0.9208743 0.4224590 0.3489617 0.6709290 0.6630055
+## 6 0.5328962 0.6496175 0.5071038 0.8983607 0.8183607 0.1828415 0.5910383
+##           k         l         m         n         o         p         q
+## 1 0.6745355 0.7474317 0.4280874 0.8637705 0.4754098 0.3063934 0.8561749
+## 2 0.5612568 0.8439891 0.7219672 0.7681421 0.8899454 0.8829508 0.6603279
+## 3 0.7230055 0.6057377 0.8093443 0.6449727 0.8291803 0.5590164 0.9317486
+## 4 0.6781421 0.3494536 0.6693443 0.8687432 0.7666667 0.3861202 0.8895628
+## 5 0.4267760 0.7104918 0.5197268 0.6080328 0.9333880 0.6397814 0.8000000
+## 6 0.4730601 0.7218579 0.7813115 0.8701093 0.7691803 0.9534973 0.7195628
+##           r         s         t         u         v         w         x
+## 1 0.4616940 0.8857377 0.9212568 0.9278142 0.7791803 0.8990164 0.8789617
+## 2 0.4956284 0.5505464 0.6045902 0.9059016 0.5865027 0.9578689 0.5522404
+## 3 0.8715847 0.8665574 0.7143169 0.9012022 0.7806557 0.8487978 0.6573224
+## 4 0.7289617 0.9055191 0.9448634 0.7619126 0.8996721 0.7424044 0.8715847
+## 5 0.7063934 0.4081967 0.8507650 0.6043716 0.8098361 0.8346448 0.4581421
+## 6 0.5015847 0.6086339 0.7546448 0.9357923 0.9027322 0.8691257 0.8631148
+##           y         z
+## 1 0.8712022 0.8991257
+## 2 0.8714208 0.6720765
+## 3 0.8427869 0.7418033
+## 4 0.8560656 0.7687432
+## 5 0.9773770 0.6566667
+## 6 0.8149180 0.6977049
+
+ + + + +
+ + + + + + + + diff --git a/man/bma.Rd b/man/bma.Rd index 50d04e3..f9b61c2 100644 --- a/man/bma.Rd +++ b/man/bma.Rd @@ -93,10 +93,7 @@ There are a number of model to choose from: } } \examples{ - -# Only run if there is a JAGS installation -if(suppressWarnings(runjags::testjags(silent = TRUE)$JAGS.found)){ - +\dontrun{ # Create some example data in the format required data <- data.frame(species = rep(letters, each = 50), year = rep(1:50, length(letters)), @@ -109,8 +106,7 @@ bma_indicator <- bma(data, model="smooth", m.scale="logit", n.iter=100) # Plot the resulting indicator plot_indicator(indicator = bma_indicator[,'Index.Mprime'], CIs = bma_indicator[,c(3,4)]) - - } +} } \references{ Freeman, S.N., Isaac, N.J.B., Besbeas, P.T., Dennis, E.B. & Morgan, B.J.T. (2020) diff --git a/vignettes/BRCindicators.Rmd b/vignettes/BRCindicators.Rmd new file mode 100644 index 0000000..9221a6c --- /dev/null +++ b/vignettes/BRCindicators.Rmd @@ -0,0 +1,521 @@ +--- +title: "BRCindicators" +author: "Tom August" +date: "`r format(Sys.time(), '%d %B, %Y')`" +output: + html_document: + keep_md: yes + toc: yes +vignette: > + %\VignetteIndexEntry{BRCindicators} + %\usepackage[utf8]{inputenc} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r knitr_setup, echo=FALSE} +knitr::opts_chunk$set(echo = TRUE, + cache = TRUE, + tidy.opts = list(width.cutoff = 60), + tidy = FALSE) +``` + +```{r libraries} +require(snowfall) +``` + +# Introduction + +This document shows how to use the indicator pipeline to create biodiversity indicators such as those for DEFRA's [Biodiversity Indicators in Your Pocket](http://jncc.defra.gov.uk/page-1824). The pipeline is shared in the form of an R package called 'BRCindicators' making it easy to share and maintain. + +The functions in BRCindicators work with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals + +This package has the ability to read in the output of occupancy models created in the R package sparta, a package for estimating species trends from occurrence data. This package can be installed [from Github](https://github.com/BiologicalRecordsCentre/sparta) and details of how to use the package are given in [the package vignette](https://github.com/BiologicalRecordsCentre/sparta/raw/master/vignette/sparta_vignette.pdf). There is no need to use sparta to create your yearly species estimates as BRCindicators can also work with other data. + +To create an indicator we first need to have species trends, let's create some using the sparta R package. + +# Creating yearly estimates of occurrence in sparta + +If you already have yearly estimates of abundance or occurrence for your species you can skip this stage. Here we show how you can create these estimates from raw species observation data using sparta. + +```{r set_up, eval=FALSE} +library(devtools) +install_github('biologicalrecordscentre/sparta') +``` + +Let's assume you have some raw data already, we can under take occupancy modelling like this + +```{r create_some_data} +# Create data +n <- 8000 # size of dataset +nyr <- 50 # number of years in data +nSamples <- 200 # set number of dates +nSites <- 100 # set number of sites +set.seed(125) # set a random seed + +# Create somes dates +first <- as.Date(strptime("1950/01/01", "%Y/%m/%d")) +last <- as.Date(strptime(paste(1950+(nyr-1),"/12/31", sep=''), "%Y/%m/%d")) +dt <- last-first +rDates <- first + (runif(nSamples)*dt) + +# taxa are set semi-randomly +taxa_probabilities <- seq(from = 0.1, to = 0.7, length.out = 26) +taxa <- sample(letters, size = n, TRUE, prob = taxa_probabilities) + +# sites are visited semi-randomly +site_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSites) +site <- sample(paste('A', 1:nSites, sep=''), size = n, TRUE, prob = site_probabilities) + +# the date of visit is selected semi-randomly from those created earlier +time_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSamples) +time_period <- sample(rDates, size = n, TRUE, prob = time_probabilities) + +myData <- data.frame(taxa, site, time_period) +``` + +```{r library_sparta} +# Load the sparta package +library(sparta) +``` + +For demonstration purposes I have a faked dataset of 8000 species observations. In my dataset the species are named after the letters in the alphabet. Below I show how I can use the Bayesian occupancy models in sparta to create yearly estimates of occurrence. For more information please see the [vignette for sparta](https://github.com/BiologicalRecordsCentre/sparta/raw/master/vignette/sparta_vignette.pdf) + +```{r format_data} +# Preview of my data +head(myData) + +# First format our data +formattedOccData <- formatOccData(taxa = myData$taxa, + site = myData$site, + survey = myData$time_period) + +# Here we are going to use the package snowfall to parallelise +library(snowfall) + +# I have 4 cpus on my PC so I set cpus to 4 +# when I initialise the cluster +sfInit(parallel = TRUE, cpus = 4) + +# Export my data to the cluster +sfExport('formattedOccData') + +# I create a function that takes a species name and runs my model +occ_mod_function <- function(taxa_name){ + + library(sparta) + + # Note that this will write you results to your computer + # the location is set to your user folder + occ_out <- occDetFunc(taxa_name = as.character(taxa_name), + n_iterations = 200, + burnin = 15, + occDetdata = formattedOccData$occDetdata, + spp_vis = formattedOccData$spp_vis, + write_results = TRUE, + output_dir = '~/Testing_indicator_pipe', + seed = 123) +} +# I then run this in parallel +system.time({ +para_out <- sfClusterApplyLB(unique(myData$taxa), occ_mod_function) +}) + +# Stop the cluster +sfStop() + +# We can see all the files this has created +list.files('~/Testing_indicator_pipe') +``` + +# Installing BRCindicators + +Installing the package is easy and can be done in a couple of lines + +```{r eval = FALSE} +library(devtools) + +install_github('biologicalrecordscentre/BRCindicators') +``` + +# Summarising sparta output for an indicator + +Now that we have some species trends data to work with (no doubt you already have your own) we can use the first function in BRCindicators. This function reads in all the output files from sparta (which are quite large and complex) and returns a simple summary table that we can use for calculating the indicator. If you have done your analysis without using sparta you can skip to the next step. + +```{r set_up_run} +library(BRCindicators) + +# All we have to supply is the directory where out data is saved +# You will note this is the 'output_dir' passed to sparta above. +trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe') + +# Lets see the summary +head(trends_summary[,1:5]) +``` + +Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species. The values in the table are the mean of the posterior for the predicted proportion of sites occupied, a measure of occurrence. + +# Calculating indicator values + +Once we have species-year indicies we are in a position to proceed to calculating an indictor. To do this there are a number of mehods available, some of which are presented here in 'BRCindicators' + +## Geometric mean + +The geometric mean method is often used with data that do not have errors associated with them. + +The first step is to re-scale the data so that the value for all species in the first year is the same. Once this is done we calculate the geometric mean across species for each year creating the indicator value. This function also accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, this stops them dramatically changing the indicator value in the year they join the dataset. It also accounts for species that leave the dataset before the end by holding them at their last value. Finally limits to species values can be given, preventing extremely high or low values biasing the indicator. + +### Rescaling and calculating geometric mean + +The data I have generated in 'trends_summary' is very easy to work with but to show off what this function can do I'm going to mess it up a bit. + +```{r trends_summary} +trends_summary[1:3, 'a'] <- NA +trends_summary[1:5, 'b'] <- NA +trends_summary[2:4, 'c'] <- 1000 +trends_summary[45:50, 'd'] <- NA + +# Let's have a look at these changes +head(trends_summary[,1:5]) +tail(trends_summary[,1:5]) +``` + +Now that I have 'messed up' the data a bit we have two species with data missing at the beginning and one species with data missing at the end. We also have one species with some very high values. + +Now lets run this through the re-scaling function. + +```{r rescaled_indicator} +# Let's run this data through our scaling function (all defaults used) +rescaled_trends <- rescale_species(Data = trends_summary) + +# Here's the result +head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) +tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) +``` + +You can see that species 'a' and 'b' enter the dataset at the geometric mean (the indicator value), all species are indexed at 100 in the first year and the very high values in 'c' are capped at 10000 at the end 'd' has been held at it's end value. + +The 'indicator' column that is returned here is our indicator, calculated as the geometric mean of all the species in the data set. + +### Confidence intervals + +We can get confidence intervals for this indicator by bootstrapping across species. We have a function for that too! + +```{r create_confidence_intervals} +# This function takes just the species columns +scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] +indicator_CIs <- bootstrap_indicator(Data = scaled_species) + +# Returned are the CIs for our indicator +head(indicator_CIs) +``` + +### Smoothing + +It is sometimes desirable to create a smoothed indicator value from the raw values. This can be achieved by fitting a GAM (general additive model) to the indicator using a spline. This spline is a smoothed curve that goes through the raw values for the indicator and is fitted using the function 'gam' in the 'mgcv' R package. + +```{r smoothing_indicator} +# The smoothing function takes the indicator values +smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) + +# In this example there is little support for a non-linear trend and +# so the line almost linear +plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator']) +lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red') + +# But if our indicator did support a non-linear trend it might look +# like this +eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5) +eg_smoothed <- GAM_smoothing(eg_indicator) +plot(x = 1:50, y = eg_indicator) +lines(x = 1:50, y = eg_smoothed, col = 'red') +``` + +Where there is little support for a non-linear trend a GAM smoothed line will tend towards linear. Where there is good support for a non-linear trend the smoothed line will become more 'bendy'. + +### Plotting + +We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, however you could easily create your own plots in R using the data. + +```{r plot_indicator} +# Plot our indicator. +plot_indicator(indicator = rescaled_trends[,'indicator'], + smoothed_line = smoothed_indicator, + CIs = indicator_CIs) +``` + +In this plot you can see the high upper confidence interval in years 2-4, this is due to the artificially high values we gave to species 'c'. + +## Bayesian Meta-Analysis (BMA) + +The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. As with other methods we require data from more than one species, across a number of years, with an error for each species-year estimate. + +```{r BMAdata} +# Here is an example dataset for the BMA method +data <- data.frame(species = rep(letters, each = 50), + year = rep(1:50, length(letters)), + index = runif(n = 50 * length(letters), min = 0, max = 1), + se = runif(n = 50 * length(letters), min = 0.01, max = .1)) +head(data) +``` + +It is important that your data is in the same format and that your columns are in the same order and have the same names. Remember you can use the function `read.csv()` to read in the data from a .csv on your computer. + +BMA is run using the function `bma`, here we will use the default settings and then see what we can change. + +```{r runBMA} +bma_indicator <- bma(data) +``` + +The function returns a plot to your screen which is a diagnostic plot of the model. When the model has converged (i.e. reached a point where the three chains agree on the answer) the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting `plot` to `FALSE`. By default the method runs the chains in series. Running them in parallel makes the models run faster (about half the time) but will slow down your computer more. We can change this with the parameter `parallel`. The number of iterations the model runs is controlled by `n.iter` and defaults to 10000. If you can it is better to run it for more iterations, though this will take longer. `m.scale` gives the scale your data is on. It is very important that this is correct, choose from 'loge' (natural log, sometimes simply called 'log'), 'log10' (log to the base 10), or 'logit' (output from models of proportions or probabilities). + +Let's implement a few of these changes + +```{r runBMAparameters} +bma_indicator2 <- bma(data, + parallel = TRUE, + n.iter = 500, + m.scale = 'log10') +``` + +Because we have reduced the number of interations the model no longer has a good convergence. The lines on the graphs on the left do not overlap and the graphs on the right are no longer a smooth bell shape. + +The object that is returned is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function `write.csv`. + +```{r BMAresults} +head(bma_indicator) +``` + +We can use the plotting function in BRCindicators to plot the results of this analysis, which in this case are not all that interesting! + +```{r BMAplot} +plot_indicator(indicator = bma_indicator[,'Index.M'], + CIs = bma_indicator[,c(3,4)]) +``` + + +## Multi-species Indicator + +The multi-species indicator method was developed by Statistics Netherlands and the code is made available on [their website](https://www.cbs.nl/en-gb/society/nature-and-environment/indices-and-trends--trim--/msi-tool). To find out more about the inner working of this method please read the [detailed documentation](https://www.cbs.nl/-/media/_pdf/2017/22/msi_manual.pdf) on the authors website. Here is a simple example of how this method runs in `BRCindicators`. + +```{r msi1, fig.height=4} +# Create some example data in the format required +nyr = 20 +species = rep(letters, each = nyr) +year = rev(rep(1:nyr, length(letters))) + +# Create an index value that increases with time +index = rep(seq(50, 100, length.out = nyr), length(letters)) +# Add randomness to species +index = index * runif(n = length(index), 0.7, 1.3) +# Add correlated randomness across species, to years +index = index * rep(runif(0.8, 1.2, n = nyr), length(letters)) + +se = runif(n = nyr * length(letters), min = 10, max = 20) + +data <- data.frame(species, year, index, se) + +# Our species are decreasing +plot(data$year, data$index) + +# Species index values need to be 100 in the base year. Here I use +# the first year as my base year and rescale to 100. The standard error +# in the base year should be 0. +min_year <- min(data$year) + +for(sp in unique(data$species)){ + + subset_data <- data[data$species == sp, ] + multi_factor <- 100 / subset_data$index[subset_data$year == min_year] + data$index[data$species == sp] <- data$index[data$species == sp] * multi_factor + data$se[data$species == sp] <- data$se[data$species == sp] * multi_factor + data$se[data$species == sp][1] <- 0 + +} + +# Our first year is now indexed at 100 +plot(data$year, data$index) + +# Alternativly I could read in data from a csv +# data <- read.csv('path/to/my/data.csv') + +# Run the MSI function +msi_out <- msi(data) + +head(msi_out$CV) + +# I can capture the output figures too +# pdf('test.pdf') +# msi_out <- msi(data) +# dev.off() +``` + +The code returns two plots to the console, the first plot shows the coefficient of variation (CV) for each of the species. Species with high values of CV may adversly effect the relaibility of the trend estimation. Use this graph to identify the CV values of the species and use the `maxCV` parameter to set a threshold above which species will be excluded. The results of excluding species in this way can be tested by comparing trend plots. The CV values are hard to assign to species from this plot as the species are coded to numbers. To see the raw values look at the CV component of msi_out (i.e. `msi_out$CV`). The second plot shows the smoothed trend and the MSI values. These two figures can be captured in the usual way in R by using `pdf()` for example. In the example I create a dataset from random numbers but usually you would use `read.csv()` to read in data from a local file. + +Here is a second example which sets some additional parameters. The parameters for `msi` get passed to `msi_tool` so to see a list of all the parameters you can change look at the help documentation in `msi_tool` usign `?msi_tool` at the R console. I cover most of hte important ones here. + +```{r msi2, fig.height=4} +msi_out <- msi(data, + nsim = 500, # The number of Mote Carlo simulations + SEbaseyear = 10, # The year to index on + plotbaseyear = 15, # The year to set as 100 in plots + index_smoot = 'INDEX', # plotbaseyear uses MSI not trend + span = 0.7, # 'wigglyness' of line, between 0 and 1 + lastyears = 5, # last X years of time series for short-term trends + maxCV = 10, # maximum allowed Coefficient of Variation + changepoint = 10, # compare trends before and after this year + truncfac = 8, # max year-to-year index ratio + TRUNC = 5, #set all indices below TRUNC to this + plot = TRUE # should the plots be returned?) + ) +``` + +This set of parameters is unrealistic but shows the options available. Note that in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years. + +The analysis also returns data which provide more insights into the analysis and let you create your own plots if required. + +```{r msi3} +# The returned object has 2 elements +head(msi_out$results) +``` + +The first of the two elements (`results`) returned gives all the data, and a little more, that is presented in the second figure. + +```{r msi4} +# The returned object has 2 elements +msi_out$trends + +# I could write this as a csv too +# write.csv(msi_out$trends, file = 'path/to/my/output.csv') +``` + +The second element (`trends`) returned give a summary of various trend assessments across the time series. + +We have also added a plot method for the MSI output which provides a plot similar to that of the second figure we have seen already. Lets use this plot method to explore the effect of changing the span value in the analysis + +```{r msi_span, fig.height=4} +for(i in c(0.3, 0.5, 0.7)){ # use a range of values for span + + msi_out <- msi(data, span = i, # span is set to i + nsim = 200, plot = FALSE) + print( # print makes the plot visible in the for loop + plot(msi_out, title = paste('MSI - span =', i)) # plot + ) + +} +``` + +As the value of span gets closer to 1 the trend line gets smoother. + +## Lambda Indicator + +The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx + +Very few species’ models produced reliable occupancy estimates for every year, so a majority of the time series contain missing values. This presents a problem for estimating growth rates for each species-year combination. Missing values of growth rate that would be equivalent to linear interpolation of the log odds between adjacent years with reliable estimates were therefore calculated. This indicator can therefore work with missing values. + +Input data is on the occupancy scale, and is therefore bounded between 0 and 1. + +```{r lambda_1} +# number of species +nsp = 50 + +# number of years +nyr = 40 + +#number of iterations +iter = 500 + +# Build a random set of data +myArray <- array(data = rnorm(n = nsp*nyr*iter, + mean = 0.5, + sd = 0.1), + dim = c(nsp, nyr, iter), + dimnames = list(paste0('SP',1:nsp), + 1:nyr, + 1:iter)) + +# Ensure values are bounded by 0 and 1 +myArray[myArray > 1] <- 1 +myArray[myArray < 0] <- 0 + +str(myArray) +``` + +`lambda_indicator` takes in an array of data, a three-dimensional matrix. The dimensions of this array represent species, years, and iterations. Each row represents a species and each column a year. The third dimension of the array contains the iterations. Essentially each slice contains occupancy estimates for each species year combination for a single iteration and the overall array contains as many slices as there are iterations. + +```{r lambda_2} +# Run the lambda_interpolation method on this data +myIndicator <- lambda_indicator(myArray) + +# Plot the indicator +plot_indicator(myIndicator$summary[,'indicator'], + myIndicator$summary[,c('lower' ,'upper')]) + +``` + +There are a number of options available in the `lambda_indicator` function + +```{r lambda_3} +myIndicator <- lambda_indicator(myArray, + index = 1, # Set the index value to 1 not 100 + year_range = c(30,40), # Set year range + threshold_yrs = 5) # set a threshold +plot_indicator(myIndicator$summary[,'indicator'], + myIndicator$summary[,c('lower' ,'upper')]) +``` + +Note that there are a range of threshold functions that allow you to adjust which data points are used in the indicator. There are options to remove species year estimates based on their standard deviaction, Rhat value and based on the number of years a species is present in the dataset. Note that the Rhat threshold can only be used if you are using a directory path as you input rather than an array. + +# Creating a custom pipeline function + +We have demonstrated how you might run the indicator functions one at a time, however in a 'pipeline' we want data to flow through seamlessly. Additionally there are a number of parameters in the functions that we have not shown you that you might find useful. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above, setting the parameters to meet our needs. Once we have done this it will allow use to execute our pipeline in one line. + +```{r,cache=TRUE} +# I call my function 'run_pipeline' and the only arguement it +# takes is the directory of sparta's output +run_pipeline <- function(input_dir){ + + require(sparta) + require(BRCindicators) + + # Create the trends summary + trends_summary <- summarise_occDet(input_dir = input_dir) + + # Rescale the values and get the indicator values + # Here I set the index to 1 and change the value limits + rescaled_trends <- rescale_species(Data = trends_summary, + index = 1, + max = 100, + min = 0.001) + + # Bootstrap the indicator to get CIs + scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] + # This time I set the iterations to twice the default and + # use custom confidence intervals + indicator_CIs <- bootstrap_indicator(Data = scaled_species, + CI_limits = c(0.25, 0.75), + iterations = 20000) + + # Get the smoothed indicator line + smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) + + # This time I specify the years and index value + plot_indicator(indicator = rescaled_trends[,'indicator'], + year = rescaled_trends[,'year'], + index = 1, + CIs = indicator_CIs, + smoothed_line = smoothed_indicator) + + ## I'll return all my data + return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary))) + } +``` + +Once we have created this function we can run this pipeline on a directory in one line, or put it in a loop to run across many directories. + +```{r, in_one} +# Now we can run the pipeline in one line, like a boss +indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe') + +head(indicator_data) +``` \ No newline at end of file diff --git a/vignettes/BRCindicators.html b/vignettes/BRCindicators.html new file mode 100644 index 0000000..ba03b48 --- /dev/null +++ b/vignettes/BRCindicators.html @@ -0,0 +1,645 @@ + + + + + + + + + + + + + + +BRCindicators + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + +
+ +
+ +
+

Introduction

+

This document shows how to use the indicator pipeline to create biodiversity indicators such as those for DEFRA’s Biodiversity Indicators in Your Pocket. The pipeline is shared in the form of an R package called ‘BRCindicators’ making it easy to share and maintain.

+

The functions in BRCindicators work with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals

+

This package has the ability to read in the output of occupancy models created in the R package sparta, a package for estimating species trends from occurrence data. This package can be installed from Github and details of how to use the package are given in the package vignette. There is no need to use sparta to create your yearly species estimates as BRCindicators can also work with other data.

+

To create an indicator we first need to have species trends, let’s create some using the sparta R package.

+
+
+

Creating yearly estimates of occurrence in sparta

+

If you already have yearly estimates of abundance or occurrence for your species you can skip this stage. Here we show how you can create these estimates from raw species observation data using sparta.

+
# Install the package from CRAN
+# THIS WILL WORK ONLY AFTER THE PACKAGE IS PUBLISHED
+install.packages('sparta')
+
+# Or install the development version from GitHub
+library(devtools)
+install_github('biologicalrecordscentre/sparta')
+

Let’s assume you have some raw data already, we can under take occupancy modelling like this

+
# Load the sparta package
+library(sparta)
+

For demonstration purposes I have a faked dataset of 8000 species observations. In my dataset the species are named after the letters in the alphabet. Below I show how I can use the Bayesian occupancy models in sparta to create yearly estimates of occurrence. For more information please see the vignette for sparta

+
# Preview of my data
+head(myData)
+
##   taxa site time_period
+## 1    r  A51  1970-01-14
+## 2    v  A87  1980-09-29
+## 3    e  A56  1996-04-14
+## 4    z  A28  1959-01-16
+## 5    r  A77  1970-09-21
+## 6    x  A48  1990-02-25
+
# First format our data
+formattedOccData <- formatOccData(taxa = myData$taxa,
+                                  site = myData$site,
+                                  time_period = myData$time_period)
+
## Warning in errorChecks(taxa = taxa, site = site, time_period =
+## time_period): 94 out of 8000 observations will be removed as duplicates
+
# Here we are going to use the package snowfall to parallelise
+library(snowfall)
+
+# I have 4 cpus on my PC so I set cpus to 4
+# when I initialise the cluster
+sfInit(parallel = TRUE, cpus = 4)
+
## Warning in searchCommandline(parallel, cpus = cpus, type = type,
+## socketHosts = socketHosts, : Unknown option on commandline:
+## rmarkdown::render('W:/PYWELL_SHARED/Pywell Projects/BRC/Tom August/R
+## Packages/BRCindicators/vignettes/BRCindicators.Rmd', encoding
+
## R Version:  R version 3.3.2 (2016-10-31)
+
## snowfall 1.84-6.1 initialized (using snow 0.4-2): parallel execution on 4 CPUs.
+
# Export my data to the cluster
+sfExport('formattedOccData')
+
+# I create a function that takes a species name and runs my model
+occ_mod_function <- function(taxa_name){
+  
+  library(sparta)
+  
+  # Note that this will write you results to your computer
+  # the location is set to your user folder
+  occ_out <- occDetFunc(taxa_name = as.character(taxa_name),
+                        n_iterations = 200,
+                        burnin = 15, 
+                        occDetdata = formattedOccData$occDetdata,
+                        spp_vis = formattedOccData$spp_vis,
+                        write_results = TRUE,
+                        output_dir = '~/Testing_indicator_pipe',
+                        seed = 123)  
+} 
+
+# I then run this in parallel
+system.time({
+para_out <- sfClusterApplyLB(unique(myData$taxa), occ_mod_function)
+})
+
##    user  system elapsed 
+##    0.52    0.08  149.15
+
# Stop the cluster
+sfStop()
+
## 
+## Stopping cluster
+
# We can see all the files this has created
+list.files('~/Testing_indicator_pipe')
+
##  [1] "a.rdata" "b.rdata" "c.rdata" "d.rdata" "e.rdata" "f.rdata" "g.rdata"
+##  [8] "h.rdata" "i.rdata" "j.rdata" "k.rdata" "l.rdata" "m.rdata" "n.rdata"
+## [15] "o.rdata" "p.rdata" "q.rdata" "r.rdata" "s.rdata" "t.rdata" "u.rdata"
+## [22] "v.rdata" "w.rdata" "x.rdata" "y.rdata" "z.rdata"
+
+
+

Installing BRCindicators

+

Installing the package is easy and can be done in a couple of lines

+
library(devtools)
+install_github(repo = 'biologicalrecordscentre/BRCindicators')
+
+
+

Summarising sparta output for an indicator

+

Now that we have some species trends data to work with (no doubt you already have your own) we can use the first function in BRCindicators. This function reads in all the output files from sparta (which are quite large and complex) and returns a simple summary table that we can use for calculating the indicator. If you have done your analysis without using sparta you can skip to the next step.

+
library(BRCindicators)
+
+# All we have to supply is the directory where out data is saved
+# You will note this is the 'output_dir' passed to sparta above.
+trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe')
+
## Loading data...done
+
# Lets see the summary
+head(trends_summary[,1:5])
+
##      year         a         b         c         d
+## [1,] 1950 0.6745699 0.7173656 0.4802151 0.5568280
+## [2,] 1951 0.6675806 0.6460215 0.5637097 0.6809677
+## [3,] 1952 0.4059677 0.6024731 0.5306989 0.5035484
+## [4,] 1953 0.1990860 0.5976344 0.4347312 0.5723656
+## [5,] 1954 0.5780108 0.5145699 0.6909677 0.5766667
+## [6,] 1955 0.2000000 0.4475806 0.5319892 0.4764516
+

Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species. The values in the table are the mean of the posterior for the predicted proportion of sites occupied, a measure of occurrence.

+
+
+

Calculating indicator values

+

Once we have species-year indicies we are in a position to proceed to calculating an indictor. To do this there are a number of mehods available, some of which are presented here in ‘BRCindicators’

+
+

Geometric mean

+

The geometric mean method is often used with data that do not have errors associated with them.

+

The first step is to re-scale the data so that the value for all species in the first year is the same. Once this is done we calculate the geometric mean across species for each year creating the indicator value. This function also accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, this stops them dramatically changing the indicator value in the year they join the dataset. It also accounts for species that leave the dataset before the end by holding them at their last value. Finally limits to species values can be given, preventing extremely high or low values biasing the indicator.

+
+

Rescaling and calculating geometric mean

+

The data I have generated in ‘trends_summary’ is very easy to work with but to show off what this function can do I’m going to mess it up a bit.

+
trends_summary[1:3, 'a'] <- NA
+trends_summary[1:5, 'b'] <- NA
+trends_summary[2:4, 'c'] <- 1000
+trends_summary[45:50, 'd'] <- NA
+
+# Let's have a look at these changes
+head(trends_summary[,1:5])
+
##      year         a         b            c         d
+## [1,] 1950        NA        NA    0.4802151 0.5568280
+## [2,] 1951        NA        NA 1000.0000000 0.6809677
+## [3,] 1952        NA        NA 1000.0000000 0.5035484
+## [4,] 1953 0.1990860        NA 1000.0000000 0.5723656
+## [5,] 1954 0.5780108        NA    0.6909677 0.5766667
+## [6,] 1955 0.2000000 0.4475806    0.5319892 0.4764516
+
tail(trends_summary[,1:5])
+
##       year         a         b          c  d
+## [45,] 1994 0.3546237 0.5100000 0.06005376 NA
+## [46,] 1995 0.6591935 0.5248925 0.77193548 NA
+## [47,] 1996 0.4116129 0.6036022 0.33881720 NA
+## [48,] 1997 0.3696237 0.6756989 0.76268817 NA
+## [49,] 1998 0.6983333 0.4795161 0.44989247 NA
+## [50,] 1999 0.3657527 0.4788172 0.56693548 NA
+

Now that I have ‘messed up’ the data a bit we have two species with data missing at the beginning and one species with data missing at the end. We also have one species with some very high values.

+

Now lets run this through the re-scaling function.

+
# Let's run this data through our scaling function (all defaults used)
+rescaled_trends <- rescale_species(Data = trends_summary)
+
+# Here's the result
+head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
+
##      year indicator        a       b          c         d
+## [1,] 1950 100.00000       NA      NA   100.0000 100.00000
+## [2,] 1951 116.64823       NA      NA 10000.0000 122.29410
+## [3,] 1952 126.98409       NA      NA 10000.0000  90.43159
+## [4,] 1953 112.18622 112.1862      NA 10000.0000 102.79038
+## [5,] 1954  98.31502 325.7127      NA   143.8871 103.56281
+## [6,] 1955 102.34902 112.7013 102.349   110.7815  85.56532
+
tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
+
##       year indicator        a        b         c        d
+## [45,] 1994  93.14009 199.8327 116.6226  12.50560 121.6472
+## [46,] 1995 109.22845 371.4597 120.0281 160.74787 121.6472
+## [47,] 1996 108.59012 231.9465 138.0267  70.55531 121.6472
+## [48,] 1997 115.28010 208.2852 154.5132 158.82221 121.6472
+## [49,] 1998 101.41978 393.5152 109.6518  93.68562 121.6472
+## [50,] 1999 100.58702 206.1039 109.4919 118.05867 121.6472
+

You can see that species ‘a’ and ‘b’ enter the dataset at the geometric mean (the indicator value), all species are indexed at 100 in the first year and the very high values in ‘c’ are capped at 10000 at the end ‘d’ has been held at it’s end value.

+

The ‘indicator’ column that is returned here is our indicator, calculated as the geometric mean of all the species in the data set.

+
+
+

Confidence intervals

+

We can get confidence intervals for this indicator by bootstrapping across species. We have a function for that too!

+
# This function takes just the species columns
+scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')]
+indicator_CIs <- bootstrap_indicator(Data = scaled_species)
+
## Running bootstrapping for 10000 iterations...done
+
# Returned are the CIs for our indicator
+head(indicator_CIs)
+
##      quant_025 quant_975
+## [1,] 100.00000  100.0000
+## [2,]  86.41204  185.8857
+## [3,]  94.34784  199.9593
+## [4,]  78.12008  179.0048
+## [5,]  82.79171  117.4378
+## [6,]  90.77336  114.8024
+
+
+

Smoothing

+

It is sometimes desirable to create a smoothed indicator value from the raw values. This can be achieved by fitting a GAM (general additive model) to the indicator using a spline. This spline is a smoothed curve that goes through the raw values for the indicator and is fitted using the function ‘gam’ in the ‘mgcv’ R package.

+
# The smoothing function takes the indicator values
+smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator'])
+
+# In this example there is little support for a non-linear trend and 
+# so the line almost linear
+plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator'])
+lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red')
+

+
# But if our indicator did support a non-linear trend it might look 
+# like this
+eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5)
+eg_smoothed <- GAM_smoothing(eg_indicator)
+plot(x = 1:50, y = eg_indicator)
+lines(x = 1:50, y = eg_smoothed, col = 'red')
+

+

Where there is little support for a non-linear trend a GAM smoothed line will tend towards linear. Where there is good support for a non-linear trend the smoothed line will become more ‘bendy’.

+
+
+

Plotting

+

We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, however you could easily create your own plots in R using the data.

+
# Plot our indicator.
+plot_indicator(indicator = rescaled_trends[,'indicator'],
+               smoothed_line = smoothed_indicator,
+               CIs = indicator_CIs)
+

+

In this plot you can see the high upper confidence interval in years 2-4, this is due to the artificially high values we gave to species ‘c’.

+
+
+
+

Bayesian Meta-Analysis (BMA)

+

The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. As with other methods we require data from more than one species, across a number of years, with an error for each species-year estimate.

+
# Here is an example dataset for the BMA method
+data <- data.frame(species = rep(letters, each = 50),
+                   year = rep(1:50, length(letters)), 
+                   index = runif(n = 50 * length(letters), min = 0, max = 1), 
+                   se = runif(n = 50 * length(letters), min = 0.01, max = .1))
+head(data)
+
##   species year     index         se
+## 1       a    1 0.5377357 0.01769631
+## 2       a    2 0.9644076 0.07268163
+## 3       a    3 0.3653818 0.08335768
+## 4       a    4 0.7266121 0.04494251
+## 5       a    5 0.8660005 0.09993223
+## 6       a    6 0.4728771 0.09887020
+

It is important that your data is in the same format and that your columns are in the same order and have the same names. Remember you can use the function read.csv() to read in the data from a .csv on your computer.

+

BMA is run using the function bma, here we will use the default settings and then see what we can change.

+
bma_indicator <- bma(data)
+
## 
+## Processing function input....... 
+## 
+## Done. 
+##  
+## Compiling model graph
+##    Resolving undeclared variables
+##    Allocating nodes
+## Graph information:
+##    Observed stochastic nodes: 2600
+##    Unobserved stochastic nodes: 1380
+##    Total graph size: 13125
+## 
+## Initializing model
+## 
+## Adaptive phase, 100 iterations x 3 chains 
+## If no progress bar appears JAGS has decided not to adapt 
+##  
+## 
+##  Burn-in phase, 5000 iterations x 3 chains 
+##  
+## 
+## Sampling from joint posterior, 5000 iterations x 3 chains 
+##  
+## 
+## Calculating statistics....... 
+## 
+## Done.
+

+

The function returns a plot to your screen which is a diagnostic plot of the model. When the model has converged (i.e. reached a point where the three chains agree on the answer) the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting plot to FALSE. By default the method runs the chains in series. Running them in parallel makes the models run faster (about half the time) but will slow down your computer more. We can change this with the parameter parallel. The number of iterations the model runs is controlled by n.iter and defaults to 10000. If you can it is better to run it for more iterations, though this will take longer. m.scale gives the scale your data is on. It is very important that this is correct, choose from ‘loge’ (natural log, sometimes simply called ‘log’), ‘log10’ (log to the base 10), or ‘logit’ (output from models of proportions or probabilities).

+

Let’s implement a few of these changes

+
bma_indicator2 <- bma(data,
+                     parallel = TRUE,
+                     n.iter = 500,
+                     m.scale = 'log10')
+
## 
+## Processing function input....... 
+## 
+## Done. 
+##  
+## Beginning parallel processing using 3 cores. Console output will be suppressed.
+## 
+## Parallel processing completed.
+## 
+## Calculating statistics....... 
+## 
+## Done.
+

+

Because we have reduced the number of interations the model no longer has a good convergence. The lines on the graphs on the left do not overlap and the graphs on the right are no longer a smooth bell shape.

+

The object that is returned is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function write.csv.

+
head(bma_indicator)
+
##   Year    Index lower2.5 upper97.5
+## 1    1 100.0000 93.93207  105.9292
+## 2    2 100.3423 94.57618  105.7631
+## 3    3 100.5838 95.11727  105.9600
+## 4    4 100.9536 95.70288  106.0434
+## 5    5 101.0323 96.06185  105.9216
+## 6    6 100.9373 96.37739  105.6126
+

We can use the plotting function in BRCindicators to plot the results of this analysis, which in this case are not all that interesting!

+
plot_indicator(indicator = bma_indicator[,'Index'],
+               CIs = bma_indicator[,c(3,4)])
+

+
+
+

Multi-species Indicator

+

The multi-species indicator method was developed by Statistics Netherlands and the code is made available on their website (https://www.cbs.nl/en-gb/society/nature-and-environment/indices-and-trends--trim--/msi-tool). To find out more about the inner working of this method please read the detailed documentation on the authors website. Here is a simple example of how this method runs in BRCindicators.

+
# Create some example data in the format required
+data <- data.frame(species = rep(letters, each = 50),
+                   year = rep(1:50, length(letters)),
+                   index = runif(n = 50 * length(letters), min = 1, max = 10),
+                   se = runif(n = 50 * length(letters), min = 0.01, max = .8))
+
+# Alternativly I could read in data from a csv
+# read.csv('path/to/my/data.csv')
+
+# Run the MSI function
+msi_out <- msi(data)
+

+
# I can capture the output figures too
+# pdf('test.pdf')
+#   msi_out <- msi(data)
+# dev.off()
+

The code returns two plots to the console, the first plot shows the coefficient of variation (CV) for each of the species. Species with high values of CV may adversly effect the relaibility of hte trend estimation. Use this graph to identify the CV values of the species and use the maxCV parameter to set a threshold above which species will be excluded. The results of excluding species in this way can be tested by comparing trend plots. The second plot shows the smoothed trend and the MSI values. These two figures can be captured in the usual way in R by using pdf() for example. In the example I create a dataset from random numbers but usually you would use read.csv() to read in data from a local file.

+

Here is a second example which sets some additional parameters. The parameters for msi get passed to msi_tool so to see a list of all the parameters you can change look at the help documentation in msi_tool usign ?msi_tool at the R console. I cover most of hte important ones here.

+
msi_out <- msi(data,
+               nsim = 500, # The number of Mote Carlo simulations
+               SEbaseyear = 10, # The year to index on
+               plotbaseyear = 15, # The year to set as 100 in plots
+               index_smoot = 'INDEX', # plotbaseyear uses MSI not trend
+               span = 0.7, # 'wigglyness' of line, between 0 and 1
+               lastyears = 5, # last X years of time series for short-term trends
+               maxCV = 10, # maximum allowed Coefficient of Variation 
+               changepoint = 25, # compare trends before and after this year
+               truncfac = 8, # max year-to-year index ratio
+               TRUNC = 5, #set all indices below TRUNC to this
+               plot = TRUE # should the plots be returned?)
+               )
+

+

This set of parameters is unrealistic but shows the options available. Note that in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years.

+

The analysis also returns data which provide more insights into the analysis and let you create your own plots if required.

+
# The returned object has 2 elements
+head(msi_out$results)
+
##   year    MSI sd_MSI lower_CL_MSI upper_CL_MSI  Trend lower_CL_trend
+## 1    1 143.77   6.59       131.43       157.28 149.93         143.78
+## 2    2 129.36   4.78       120.33       139.08 149.90         145.22
+## 3    3 181.33   7.11       167.92       195.82 149.86         145.22
+## 4    4 149.92   5.20       140.08       160.47 149.80         145.22
+## 5    5 155.17   6.42       143.08       168.29 149.74         146.68
+## 6    6 135.79   4.62       127.02       145.14 149.65         146.68
+##   upper_CL_trend trend_class
+## 1         155.75      stable
+## 2         155.75      stable
+## 3         154.20      stable
+## 4         154.20      stable
+## 5         152.67      stable
+## 6         152.67      stable
+

The first of the two elements (results) returned gives all the data, and a little more, that is presented in the second figure.

+
# The returned object has 2 elements
+msi_out$trends
+
##                             Measure   value     significance
+## 1                     overall trend  0.9994           stable
+## 2                  SE overall trend  0.0004                 
+## 3                trend last 5 years  0.9239 moderate decline
+## 4             SE trend last 5 years  0.0147                 
+## 5                  changepoint (25) 25.0000                 
+## 6     trend before changepoint (25)  1.0002           stable
+## 7  SE trend before changepoint (25)  0.0012                 
+## 8      trend after changepoint (25)  1.0021           stable
+## 9   SE trend after changepoint (25)  0.0013                 
+## 10                         % change -5.7420             n.s.
+## 11                      SE % change  3.3470                 
+## 12            % change last 5 years -2.9240             n.s.
+## 13         SE % change last 5 years  1.5170                 
+## 14                      changepoint      NA             n.s.
+
# I could write this as a csv too
+# write.csv(msi_out$trends, file = 'path/to/my/output.csv')
+

The second element (trends) returned give a summary of various trend assessments across the time series.

+

We have also added a plot method for the MSI output which provides a plot similar to that of the second figure we have seen already. Lets use this plot method to explore the effect of changing the span value in the analysis

+
for(i in c(0.2, 0.3, 0.5, 0.7)){ # use a range of values for span
+  
+  msi_out <- msi(data, span = i, # span is set to i
+                 nsim = 200, plot = FALSE)
+  print( # print makes the plot visible in the for loop
+    plot(msi_out, title = paste('MSI - span =', i)) # plot
+  )
+  
+}
+

+

As the value of span gets closer to 1 the trend line gets smoother.

+
+
+
+

Creating a custom pipeline function

+

We have demonstrated how you might run the indicator functions one at a time, however in a ‘pipeline’ we want data to flow through seamlessly. Additionally there are a number of parameters in the functions that we have not shown you that you might find useful. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above, setting the parameters to meet our needs. Once we have done this it will allow use to execute our pipeline in one line.

+
# I call my function 'run_pipeline' and the only arguement it
+# takes is the directory of sparta's output
+run_pipeline <- function(input_dir){
+
+  require(sparta)
+  require(BRCindicators)
+  
+  # Create the trends summary
+  trends_summary <- summarise_occDet(input_dir = input_dir)
+
+  # Rescale the values and get the indicator values
+  # Here I set the index to 1 and change the value limits
+  rescaled_trends <- rescale_species(Data = trends_summary,
+                                     index = 1,
+                                     max = 100,
+                                     min = 0.001)
+  
+  # Bootstrap the indicator to get CIs
+  scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')]
+  # This time I set the iterations to twice the default and 
+  # use custom confidence intervals
+  indicator_CIs <- bootstrap_indicator(Data = scaled_species,
+                                       CI_limits = c(0.25, 0.75),
+                                       iterations = 20000)
+  
+  # Get the smoothed indicator line
+  smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator'])
+  
+  # This time I specify the years and index value
+  plot_indicator(indicator = rescaled_trends[,'indicator'],
+                 year = rescaled_trends[,'year'],
+                 index = 1,
+                 CIs = indicator_CIs,
+                 smoothed_line = smoothed_indicator)
+  
+  ## I'll return all my data  
+  return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary)))
+ }
+

Once we have created this function we can run this pipeline on a directory in one line, or put it in a loop to run across many directories.

+
# Now we can run the pipeline in one line, like a boss
+indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe')
+
## Loading data...done
+## Running bootstrapping for 20000 iterations...done
+

+
head(indicator_data)
+
##   smoothed_indicator  quant_25  quant_75 year         a         b
+## 1          0.9138432 1.0000000 1.0000000 1950 0.6745699 0.7173656
+## 2          0.9149464 0.9253329 1.0122398 1951 0.6675806 0.6460215
+## 3          0.9160497 0.9734023 1.0700867 1952 0.4059677 0.6024731
+## 4          0.9171529 0.8201399 0.9500472 1953 0.1990860 0.5976344
+## 5          0.9182562 0.8786610 0.9687585 1954 0.5780108 0.5145699
+## 6          0.9193594 0.9060461 1.0056549 1955 0.2000000 0.4475806
+##           c         d         e         f         g          h         i
+## 1 0.4802151 0.5568280 0.5884946 0.5057527 0.7402151 0.52607527 0.3404301
+## 2 0.5637097 0.6809677 0.4640860 0.4447312 0.7330645 0.28741935 0.6885484
+## 3 0.5306989 0.5035484 0.6218280 0.3743548 0.8120430 0.68682796 0.8302688
+## 4 0.4347312 0.5723656 0.6150538 0.6258602 0.7101075 0.05032258 0.5569892
+## 5 0.6909677 0.5766667 0.7106452 0.8705376 0.6783333 0.21607527 0.5248387
+## 6 0.5319892 0.4764516 0.8084946 0.5811290 0.8016129 0.79473118 0.1360215
+##           j         k         l         m         n         o         p
+## 1 0.7691935 0.7781720 0.7627419 0.5490860 0.9055914 0.5089247 0.5126344
+## 2 0.5700000 0.3858065 0.8473656 0.6853763 0.7496237 0.9110753 0.8044086
+## 3 0.4464516 0.9057527 0.7373118 0.7177419 0.8656452 0.8820968 0.5419355
+## 4 0.8590860 0.5333871 0.5558602 0.7541935 0.8749462 0.8529032 0.3788710
+## 5 0.6415591 0.4744624 0.7397312 0.3079032 0.5227419 0.8681183 0.5830108
+## 6 0.5844086 0.5594624 0.7416129 0.8600000 0.8168817 0.8223656 0.9411828
+##           q         r         s         t         u         v         w
+## 1 0.8418280 0.3869355 0.8094086 0.9240323 0.8798925 0.7795161 0.9172581
+## 2 0.7586022 0.5087097 0.7891935 0.5512366 0.9248925 0.4275806 0.8763441
+## 3 0.9354301 0.8917204 0.9319892 0.4596237 0.9089247 0.7392473 0.7803763
+## 4 0.9145161 0.6189247 0.7883333 0.8891398 0.6998387 0.8875269 0.7385484
+## 5 0.8787634 0.6071505 0.6269355 0.8125806 0.5819892 0.9118280 0.7530108
+## 6 0.7818280 0.5579570 0.7590860 0.7274194 0.9422581 0.8439785 0.8212903
+##           x         y         z
+## 1 0.8746774 0.8625806 0.9073118
+## 2 0.7978495 0.9104301 0.9095161
+## 3 0.6747849 0.9007527 0.7707527
+## 4 0.8668817 0.8252688 0.6954301
+## 5 0.5406452 0.9445161 0.6822581
+## 6 0.8452688 0.8245699 0.6967742
+
+ + + + +
+ + + + + + + + diff --git a/vignettes/BRCindicators.md b/vignettes/BRCindicators.md new file mode 100644 index 0000000..0fc076b --- /dev/null +++ b/vignettes/BRCindicators.md @@ -0,0 +1,919 @@ +--- +title: "BRCindicators" +author: "Tom August" +date: "04 March, 2019" +output: + html_document: + keep_md: yes + toc: yes +vignette: > + %\VignetteIndexEntry{BRCindicators} + %\usepackage[utf8]{inputenc} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + + + +```r +require(snowfall) +``` + +``` +## Loading required package: snowfall +``` + +``` +## Loading required package: snow +``` + +# Introduction + +This document shows how to use the indicator pipeline to create biodiversity indicators such as those for DEFRA's [Biodiversity Indicators in Your Pocket](http://jncc.defra.gov.uk/page-1824). The pipeline is shared in the form of an R package called 'BRCindicators' making it easy to share and maintain. + +The functions in BRCindicators work with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals + +This package has the ability to read in the output of occupancy models created in the R package sparta, a package for estimating species trends from occurrence data. This package can be installed [from Github](https://github.com/BiologicalRecordsCentre/sparta) and details of how to use the package are given in [the package vignette](https://github.com/BiologicalRecordsCentre/sparta/raw/master/vignette/sparta_vignette.pdf). There is no need to use sparta to create your yearly species estimates as BRCindicators can also work with other data. + +To create an indicator we first need to have species trends, let's create some using the sparta R package. + +# Creating yearly estimates of occurrence in sparta + +If you already have yearly estimates of abundance or occurrence for your species you can skip this stage. Here we show how you can create these estimates from raw species observation data using sparta. + + +```r +# Install the package from CRAN +# THIS WILL WORK ONLY AFTER THE PACKAGE IS PUBLISHED +install.packages('sparta') + +# Or install the development version from GitHub +library(devtools) +install_github('biologicalrecordscentre/sparta') +``` + +Let's assume you have some raw data already, we can under take occupancy modelling like this + + +```r +# Create data +n <- 8000 # size of dataset +nyr <- 50 # number of years in data +nSamples <- 200 # set number of dates +nSites <- 100 # set number of sites +set.seed(125) # set a random seed + +# Create somes dates +first <- as.Date(strptime("1950/01/01", "%Y/%m/%d")) +last <- as.Date(strptime(paste(1950+(nyr-1),"/12/31", sep=''), "%Y/%m/%d")) +dt <- last-first +rDates <- first + (runif(nSamples)*dt) + +# taxa are set semi-randomly +taxa_probabilities <- seq(from = 0.1, to = 0.7, length.out = 26) +taxa <- sample(letters, size = n, TRUE, prob = taxa_probabilities) + +# sites are visited semi-randomly +site_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSites) +site <- sample(paste('A', 1:nSites, sep=''), size = n, TRUE, prob = site_probabilities) + +# the date of visit is selected semi-randomly from those created earlier +time_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSamples) +time_period <- sample(rDates, size = n, TRUE, prob = time_probabilities) + +myData <- data.frame(taxa, site, time_period) +``` + + +```r +# Load the sparta package +library(sparta) +``` + +``` +## Loading required package: lme4 +``` + +``` +## Loading required package: Matrix +``` + +For demonstration purposes I have a faked dataset of 8000 species observations. In my dataset the species are named after the letters in the alphabet. Below I show how I can use the Bayesian occupancy models in sparta to create yearly estimates of occurrence. For more information please see the [vignette for sparta](https://github.com/BiologicalRecordsCentre/sparta/raw/master/vignette/sparta_vignette.pdf) + + +```r +# Preview of my data +head(myData) +``` + +``` +## taxa site time_period +## 1 r A51 1970-01-14 +## 2 v A87 1980-09-29 +## 3 e A56 1996-04-14 +## 4 z A28 1959-01-16 +## 5 r A77 1970-09-21 +## 6 x A48 1990-02-25 +``` + +```r +# First format our data +formattedOccData <- formatOccData(taxa = myData$taxa, + site = myData$site, + survey = myData$time_period) +``` + +``` +## Warning in errorChecks(taxa = taxa, site = site, survey = survey, +## closure_period = closure_period): 94 out of 8000 observations will be +## removed as duplicates +``` + +```r +# Here we are going to use the package snowfall to parallelise +library(snowfall) + +# I have 4 cpus on my PC so I set cpus to 4 +# when I initialise the cluster +sfInit(parallel = TRUE, cpus = 4) +``` + +``` +## Warning in searchCommandline(parallel, cpus = cpus, type = type, +## socketHosts = socketHosts, : Unknown option on commandline: +## rmarkdown::render('W:/PYWELL_SHARED/Pywell Projects/BRC/Tom August/R +## Packages/BRCindicators/vignettes/BRCindicators.Rmd', encoding +``` + +``` +## R Version: R version 3.5.2 (2018-12-20) +``` + +``` +## snowfall 1.84-6.1 initialized (using snow 0.4-3): parallel execution on 4 CPUs. +``` + +```r +# Export my data to the cluster +sfExport('formattedOccData') + +# I create a function that takes a species name and runs my model +occ_mod_function <- function(taxa_name){ + + library(sparta) + + # Note that this will write you results to your computer + # the location is set to your user folder + occ_out <- occDetFunc(taxa_name = as.character(taxa_name), + n_iterations = 200, + burnin = 15, + occDetdata = formattedOccData$occDetdata, + spp_vis = formattedOccData$spp_vis, + write_results = TRUE, + output_dir = '~/Testing_indicator_pipe', + seed = 123) +} + +# I then run this in parallel +system.time({ +para_out <- sfClusterApplyLB(unique(myData$taxa), occ_mod_function) +}) +``` + +``` +## user system elapsed +## 0.04 0.05 139.66 +``` + +```r +# Stop the cluster +sfStop() +``` + +``` +## +## Stopping cluster +``` + +```r +# We can see all the files this has created +list.files('~/Testing_indicator_pipe') +``` + +``` +## [1] "a.rdata" "b.rdata" "c.rdata" "d.rdata" "e.rdata" "f.rdata" "g.rdata" +## [8] "h.rdata" "i.rdata" "j.rdata" "k.rdata" "l.rdata" "m.rdata" "n.rdata" +## [15] "o.rdata" "p.rdata" "q.rdata" "r.rdata" "s.rdata" "t.rdata" "u.rdata" +## [22] "v.rdata" "w.rdata" "x.rdata" "y.rdata" "z.rdata" +``` + +# Installing BRCindicators + +Installing the package is easy and can be done in a couple of lines + + +```r +library(devtools) +install_github(repo = 'biologicalrecordscentre/BRCindicators') +``` + +# Summarising sparta output for an indicator + +Now that we have some species trends data to work with (no doubt you already have your own) we can use the first function in BRCindicators. This function reads in all the output files from sparta (which are quite large and complex) and returns a simple summary table that we can use for calculating the indicator. If you have done your analysis without using sparta you can skip to the next step. + + +```r +library(BRCindicators) + +# All we have to supply is the directory where out data is saved +# You will note this is the 'output_dir' passed to sparta above. +trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe') +``` + +``` +## Loading data...done +``` + +```r +# Lets see the summary +head(trends_summary[,1:5]) +``` + +``` +## year a b c d +## [1,] 1950 0.6936066 0.6412022 0.5689617 0.5595082 +## [2,] 1951 0.7638251 0.4861749 0.4021311 0.5510929 +## [3,] 1952 0.6248634 0.6434973 0.2830601 0.5293443 +## [4,] 1953 0.7024590 0.7724590 0.4861202 0.3753552 +## [5,] 1954 0.6174317 0.5121858 0.7556831 0.3475956 +## [6,] 1955 0.4302732 0.3584699 0.6068852 0.5328962 +``` + +Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species. The values in the table are the mean of the posterior for the predicted proportion of sites occupied, a measure of occurrence. + +# Calculating indicator values + +Once we have species-year indicies we are in a position to proceed to calculating an indictor. To do this there are a number of mehods available, some of which are presented here in 'BRCindicators' + +## Geometric mean + +The geometric mean method is often used with data that do not have errors associated with them. + +The first step is to re-scale the data so that the value for all species in the first year is the same. Once this is done we calculate the geometric mean across species for each year creating the indicator value. This function also accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, this stops them dramatically changing the indicator value in the year they join the dataset. It also accounts for species that leave the dataset before the end by holding them at their last value. Finally limits to species values can be given, preventing extremely high or low values biasing the indicator. + +### Rescaling and calculating geometric mean + +The data I have generated in 'trends_summary' is very easy to work with but to show off what this function can do I'm going to mess it up a bit. + + +```r +trends_summary[1:3, 'a'] <- NA +trends_summary[1:5, 'b'] <- NA +trends_summary[2:4, 'c'] <- 1000 +trends_summary[45:50, 'd'] <- NA + +# Let's have a look at these changes +head(trends_summary[,1:5]) +``` + +``` +## year a b c d +## [1,] 1950 NA NA 0.5689617 0.5595082 +## [2,] 1951 NA NA 1000.0000000 0.5510929 +## [3,] 1952 NA NA 1000.0000000 0.5293443 +## [4,] 1953 0.7024590 NA 1000.0000000 0.3753552 +## [5,] 1954 0.6174317 NA 0.7556831 0.3475956 +## [6,] 1955 0.4302732 0.3584699 0.6068852 0.5328962 +``` + +```r +tail(trends_summary[,1:5]) +``` + +``` +## year a b c d +## [45,] 1994 0.1308743 0.6617486 0.3050273 NA +## [46,] 1995 0.5324044 0.2455191 0.7130601 NA +## [47,] 1996 0.7407650 0.4489617 0.5519672 NA +## [48,] 1997 0.3051366 0.7181421 0.7871585 NA +## [49,] 1998 0.7282514 0.2540984 0.5586339 NA +## [50,] 1999 0.5635519 0.4161202 0.4713661 NA +``` + +Now that I have 'messed up' the data a bit we have two species with data missing at the beginning and one species with data missing at the end. We also have one species with some very high values. + +Now lets run this through the re-scaling function. + + +```r +# Let's run this data through our scaling function (all defaults used) +rescaled_trends <- rescale_species(Data = trends_summary) + +# Here's the result +head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) +``` + +``` +## year indicator a b c d +## [1,] 1950 100.00000 NA NA 100.0000 100.00000 +## [2,] 1951 119.03406 NA NA 10000.0000 98.49595 +## [3,] 1952 126.49450 NA NA 10000.0000 94.60885 +## [4,] 1953 118.53382 118.53382 NA 10000.0000 67.08663 +## [5,] 1954 94.25122 104.18620 NA 132.8179 62.12521 +## [6,] 1955 99.15398 72.60485 99.15398 106.6654 95.24368 +``` + +```r +tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) +``` + +``` +## year indicator a b c d +## [45,] 1994 92.77695 22.08390 183.04187 53.61122 140.7169 +## [46,] 1995 102.48140 89.83859 67.91140 125.32655 140.7169 +## [47,] 1996 105.83425 124.99763 124.18431 97.01306 140.7169 +## [48,] 1997 106.16325 51.48914 198.64048 138.34998 140.7169 +## [49,] 1998 94.26861 122.88605 70.28445 98.18479 140.7169 +## [50,] 1999 95.96255 95.09446 115.10023 82.84672 140.7169 +``` + +You can see that species 'a' and 'b' enter the dataset at the geometric mean (the indicator value), all species are indexed at 100 in the first year and the very high values in 'c' are capped at 10000 at the end 'd' has been held at it's end value. + +The 'indicator' column that is returned here is our indicator, calculated as the geometric mean of all the species in the data set. + +### Confidence intervals + +We can get confidence intervals for this indicator by bootstrapping across species. We have a function for that too! + + +```r +# This function takes just the species columns +scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] +indicator_CIs <- bootstrap_indicator(Data = scaled_species) +``` + +``` +## Running bootstrapping for 10000 iterations...done +``` + +```r +# Returned are the CIs for our indicator +head(indicator_CIs) +``` + +``` +## quant_025 quant_975 +## [1,] 100.00000 100.0000 +## [2,] 87.33579 190.2349 +## [3,] 95.60814 198.0757 +## [4,] 88.79716 183.7160 +## [5,] 80.60150 110.3239 +## [6,] 86.15162 115.1020 +``` + +### Smoothing + +It is sometimes desirable to create a smoothed indicator value from the raw values. This can be achieved by fitting a GAM (general additive model) to the indicator using a spline. This spline is a smoothed curve that goes through the raw values for the indicator and is fitted using the function 'gam' in the 'mgcv' R package. + + +```r +# The smoothing function takes the indicator values +smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) + +# In this example there is little support for a non-linear trend and +# so the line almost linear +plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator']) +lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red') +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/smoothing_indicator-1.png) + +```r +# But if our indicator did support a non-linear trend it might look +# like this +eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5) +eg_smoothed <- GAM_smoothing(eg_indicator) +plot(x = 1:50, y = eg_indicator) +lines(x = 1:50, y = eg_smoothed, col = 'red') +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/smoothing_indicator-2.png) + +Where there is little support for a non-linear trend a GAM smoothed line will tend towards linear. Where there is good support for a non-linear trend the smoothed line will become more 'bendy'. + +### Plotting + +We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, however you could easily create your own plots in R using the data. + + +```r +# Plot our indicator. +plot_indicator(indicator = rescaled_trends[,'indicator'], + smoothed_line = smoothed_indicator, + CIs = indicator_CIs) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/plot_indicator-1.png) + +In this plot you can see the high upper confidence interval in years 2-4, this is due to the artificially high values we gave to species 'c'. + +## Bayesian Meta-Analysis (BMA) + +The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. As with other methods we require data from more than one species, across a number of years, with an error for each species-year estimate. + + +```r +# Here is an example dataset for the BMA method +data <- data.frame(species = rep(letters, each = 50), + year = rep(1:50, length(letters)), + index = runif(n = 50 * length(letters), min = 0, max = 1), + se = runif(n = 50 * length(letters), min = 0.01, max = .1)) +head(data) +``` + +``` +## species year index se +## 1 a 1 0.5377357 0.01769631 +## 2 a 2 0.9644076 0.07268163 +## 3 a 3 0.3653818 0.08335768 +## 4 a 4 0.7266121 0.04494251 +## 5 a 5 0.8660005 0.09993223 +## 6 a 6 0.4728771 0.09887020 +``` + +It is important that your data is in the same format and that your columns are in the same order and have the same names. Remember you can use the function `read.csv()` to read in the data from a .csv on your computer. + +BMA is run using the function `bma`, here we will use the default settings and then see what we can change. + + +```r +bma_indicator <- bma(data) +``` + +``` +## +## Processing function input....... +## +## Done. +## +## Compiling model graph +## Resolving undeclared variables +## Allocating nodes +## Graph information: +## Observed stochastic nodes: 2600 +## Unobserved stochastic nodes: 1379 +## Total graph size: 6587 +## +## Initializing model +## +## Adaptive phase..... +## Adaptive phase complete +## +## +## Burn-in phase, 5000 iterations x 3 chains +## +## +## Sampling from joint posterior, 5000 iterations x 3 chains +## +## +## Calculating statistics....... +``` + +``` +## Warning in doTryCatch(return(expr), name, parentenv, handler): At least one +## Rhat value could not be calculated. +``` + +``` +## +## Done. +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/runBMA-1.png) + +The function returns a plot to your screen which is a diagnostic plot of the model. When the model has converged (i.e. reached a point where the three chains agree on the answer) the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting `plot` to `FALSE`. By default the method runs the chains in series. Running them in parallel makes the models run faster (about half the time) but will slow down your computer more. We can change this with the parameter `parallel`. The number of iterations the model runs is controlled by `n.iter` and defaults to 10000. If you can it is better to run it for more iterations, though this will take longer. `m.scale` gives the scale your data is on. It is very important that this is correct, choose from 'loge' (natural log, sometimes simply called 'log'), 'log10' (log to the base 10), or 'logit' (output from models of proportions or probabilities). + +Let's implement a few of these changes + + +```r +bma_indicator2 <- bma(data, + parallel = TRUE, + n.iter = 500, + m.scale = 'log10') +``` + +``` +## +## Processing function input....... +## +## Done. +## +## Beginning parallel processing using 3 cores. Console output will be suppressed. +## +## Parallel processing completed. +## +## Calculating statistics....... +``` + +``` +## Warning in doTryCatch(return(expr), name, parentenv, handler): At least one +## Rhat value could not be calculated. +``` + +``` +## +## Done. +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/runBMAparameters-1.png) + +Because we have reduced the number of interations the model no longer has a good convergence. The lines on the graphs on the left do not overlap and the graphs on the right are no longer a smooth bell shape. + +The object that is returned is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function `write.csv`. + + +```r +head(bma_indicator) +``` + +``` +## Year Index lower2.5 upper97.5 +## 1 1 100.0000 100.00000 100.0000 +## 2 2 100.1559 98.00631 102.3808 +## 3 3 100.3311 97.39855 103.5038 +## 4 4 100.5416 96.95652 104.3762 +## 5 5 100.5611 96.63643 104.8945 +## 6 6 100.4664 95.99929 105.0179 +``` + +We can use the plotting function in BRCindicators to plot the results of this analysis, which in this case are not all that interesting! + + +```r +plot_indicator(indicator = bma_indicator[,'Index'], + CIs = bma_indicator[,c(3,4)]) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/BMAplot-1.png) + + +## Multi-species Indicator + +The multi-species indicator method was developed by Statistics Netherlands and the code is made available on [their website](https://www.cbs.nl/en-gb/society/nature-and-environment/indices-and-trends--trim--/msi-tool). To find out more about the inner working of this method please read the [detailed documentation](https://www.cbs.nl/-/media/_pdf/2017/22/msi_manual.pdf) on the authors website. Here is a simple example of how this method runs in `BRCindicators`. + + +```r +# Create some example data in the format required +nyr = 20 +species = rep(letters, each = nyr) +year = rev(rep(1:nyr, length(letters))) + +# Create an index value that increases with time +index = rep(seq(50, 100, length.out = nyr), length(letters)) +# Add randomness to species +index = index * runif(n = length(index), 0.7, 1.3) +# Add correlated randomness across species, to years +index = index * rep(runif(0.8, 1.2, n = nyr), length(letters)) + +se = runif(n = nyr * length(letters), min = 10, max = 20) + +data <- data.frame(species, year, index, se) + +# Our species are decreasing +plot(data$year, data$index) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi1-1.png) + +```r +# Species index values need to be 100 in the base year. Here I use +# the first year as my base year and rescale to 100. The standard error +# in the base year should be 0. +min_year <- min(data$year) + +for(sp in unique(data$species)){ + + subset_data <- data[data$species == sp, ] + multi_factor <- 100 / subset_data$index[subset_data$year == min_year] + data$index[data$species == sp] <- data$index[data$species == sp] * multi_factor + data$se[data$species == sp] <- data$se[data$species == sp] * multi_factor + data$se[data$species == sp][1] <- 0 + +} + +# Our first year is now indexed at 100 +plot(data$year, data$index) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi1-2.png) + +```r +# Alternativly I could read in data from a csv +# data <- read.csv('path/to/my/data.csv') + +# Run the MSI function +msi_out <- msi(data) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi1-3.png)![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi1-4.png) + +```r +head(msi_out$CV) +``` + +``` +## species mean_CV +## 1 a 0.1927516 +## 2 b 0.1918315 +## 3 c 0.2048356 +## 4 d 0.2195197 +## 5 e 0.2048833 +## 6 f 0.2096112 +``` + +```r +# I can capture the output figures too +# pdf('test.pdf') +# msi_out <- msi(data) +# dev.off() +``` + +The code returns two plots to the console, the first plot shows the coefficient of variation (CV) for each of the species. Species with high values of CV may adversly effect the relaibility of the trend estimation. Use this graph to identify the CV values of the species and use the `maxCV` parameter to set a threshold above which species will be excluded. The results of excluding species in this way can be tested by comparing trend plots. The CV values are hard to assign to species from this plot as the species are coded to numbers. To see the raw values look at the CV component of msi_out (i.e. `msi_out$CV`). The second plot shows the smoothed trend and the MSI values. These two figures can be captured in the usual way in R by using `pdf()` for example. In the example I create a dataset from random numbers but usually you would use `read.csv()` to read in data from a local file. + +Here is a second example which sets some additional parameters. The parameters for `msi` get passed to `msi_tool` so to see a list of all the parameters you can change look at the help documentation in `msi_tool` usign `?msi_tool` at the R console. I cover most of hte important ones here. + + +```r +msi_out <- msi(data, + nsim = 500, # The number of Mote Carlo simulations + SEbaseyear = 10, # The year to index on + plotbaseyear = 15, # The year to set as 100 in plots + index_smoot = 'INDEX', # plotbaseyear uses MSI not trend + span = 0.7, # 'wigglyness' of line, between 0 and 1 + lastyears = 5, # last X years of time series for short-term trends + maxCV = 10, # maximum allowed Coefficient of Variation + changepoint = 10, # compare trends before and after this year + truncfac = 8, # max year-to-year index ratio + TRUNC = 5, #set all indices below TRUNC to this + plot = TRUE # should the plots be returned?) + ) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi2-1.png)![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi2-2.png) + +This set of parameters is unrealistic but shows the options available. Note that in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years. + +The analysis also returns data which provide more insights into the analysis and let you create your own plots if required. + + +```r +# The returned object has 2 elements +head(msi_out$results) +``` + +``` +## year MSI sd_MSI lower_CL_MSI upper_CL_MSI Trend lower_CL_trend +## 1 1 164.57 7.50 150.50 179.96 162.23 151.14 +## 2 2 164.60 7.11 151.24 179.13 156.51 149.63 +## 3 3 124.16 6.20 112.59 136.94 151.27 145.21 +## 4 4 154.53 6.93 141.53 168.72 146.51 140.92 +## 5 5 170.74 7.36 156.90 185.78 142.41 136.76 +## 6 6 143.41 6.76 130.77 157.30 139.41 134.05 +## upper_CL_trend trend_class +## 1 173.85 moderate_decline +## 2 163.73 moderate_decline +## 3 157.31 moderate_decline +## 4 152.66 moderate_decline +## 5 148.15 moderate_decline +## 6 145.21 moderate_decline +``` + +The first of the two elements (`results`) returned gives all the data, and a little more, that is presented in the second figure. + + +```r +# The returned object has 2 elements +msi_out$trends +``` + +``` +## Measure value significance +## 1 overall trend 0.9691 moderate decline +## 2 SE overall trend 0.0019 +## 3 trend last 5 years 0.9711 uncertain +## 4 SE trend last 5 years 0.0154 +## 5 changepoint (10) 10.0000 +## 6 trend before changepoint (10) 0.9789 moderate decline +## 7 SE trend before changepoint (10) 0.0046 +## 8 trend after changepoint (10) 0.9487 moderate decline +## 9 SE trend after changepoint (10) 0.0040 +## 10 % change -46.1560 p<0.01 +## 11 SE % change 2.5360 +## 12 % change last 5 years -9.4120 p<0.01 +## 13 SE % change last 5 years 2.3900 +## 14 changepoint NA p<0.01 +``` + +```r +# I could write this as a csv too +# write.csv(msi_out$trends, file = 'path/to/my/output.csv') +``` + +The second element (`trends`) returned give a summary of various trend assessments across the time series. + +We have also added a plot method for the MSI output which provides a plot similar to that of the second figure we have seen already. Lets use this plot method to explore the effect of changing the span value in the analysis + + +```r +for(i in c(0.3, 0.5, 0.7)){ # use a range of values for span + + msi_out <- msi(data, span = i, # span is set to i + nsim = 200, plot = FALSE) + print( # print makes the plot visible in the for loop + plot(msi_out, title = paste('MSI - span =', i)) # plot + ) + +} +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi_span-1.png)![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi_span-2.png)![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/msi_span-3.png) + +As the value of span gets closer to 1 the trend line gets smoother. + +## Lambda Indicator + +The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx + +Very few species’ models produced reliable occupancy estimates for every year, so a majority of the time series contain missing values. This presents a problem for estimating growth rates for each species-year combination. Missing values of growth rate that would be equivalent to linear interpolation of the log odds between adjacent years with reliable estimates were therefore calculated. This indicator can therefore work with missing values. + +Input data is on the occupancy scale, and is therefore bounded between 0 and 1. + + +```r +# number of species +nsp = 50 + +# number of years +nyr = 40 + +#number of iterations +iter = 500 + +# Build a random set of data +myArray <- array(data = rnorm(n = nsp*nyr*iter, + mean = 0.5, + sd = 0.1), + dim = c(nsp, nyr, iter), + dimnames = list(paste0('SP',1:nsp), + 1:nyr, + 1:iter)) + +# Ensure values are bounded by 0 and 1 +myArray[myArray > 1] <- 1 +myArray[myArray < 0] <- 0 + +str(myArray) +``` + +``` +## num [1:50, 1:40, 1:500] 0.555 0.578 0.489 0.562 0.55 ... +## - attr(*, "dimnames")=List of 3 +## ..$ : chr [1:50] "SP1" "SP2" "SP3" "SP4" ... +## ..$ : chr [1:40] "1" "2" "3" "4" ... +## ..$ : chr [1:500] "1" "2" "3" "4" ... +``` + +`lambda_indicator` takes in an array of data, a three-dimensional matrix. The dimensions of this array represent species, years, and iterations. + + + +```r +# Run the lambda_interpolation method on this data +myIndicator <- lambda_indicator(myArray) + +# Plot the indicator +plot_indicator(myIndicator$summary[,'indicator'], + myIndicator$summary[,c('lower' ,'upper')]) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/lambda_2-1.png) + +There are a number of options available in the `lambda_indicator` function + + +```r +myIndicator <- lambda_indicator(myArray, + index = 1, # Set the index value to 1 not 100 + year_range = c(30,40), # Set year range + threshold_yrs = 5) # set a threshold +plot_indicator(myIndicator$summary[,'indicator'], + myIndicator$summary[,c('lower' ,'upper')]) +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/lambda_3-1.png) + +Note that there are a range of threshold functions that allow you to adjust which data points are used in the indicator. There are options to remove species year estimates based on their standard deviaction, rHat value and based on the number of years a species is present in the dataset. Note that the rhat threshold can only be used if you are using a directory path as you input rather than an array. + +# Creating a custom pipeline function + +We have demonstrated how you might run the indicator functions one at a time, however in a 'pipeline' we want data to flow through seamlessly. Additionally there are a number of parameters in the functions that we have not shown you that you might find useful. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above, setting the parameters to meet our needs. Once we have done this it will allow use to execute our pipeline in one line. + + +```r +# I call my function 'run_pipeline' and the only arguement it +# takes is the directory of sparta's output +run_pipeline <- function(input_dir){ + + require(sparta) + require(BRCindicators) + + # Create the trends summary + trends_summary <- summarise_occDet(input_dir = input_dir) + + # Rescale the values and get the indicator values + # Here I set the index to 1 and change the value limits + rescaled_trends <- rescale_species(Data = trends_summary, + index = 1, + max = 100, + min = 0.001) + + # Bootstrap the indicator to get CIs + scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] + # This time I set the iterations to twice the default and + # use custom confidence intervals + indicator_CIs <- bootstrap_indicator(Data = scaled_species, + CI_limits = c(0.25, 0.75), + iterations = 20000) + + # Get the smoothed indicator line + smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) + + # This time I specify the years and index value + plot_indicator(indicator = rescaled_trends[,'indicator'], + year = rescaled_trends[,'year'], + index = 1, + CIs = indicator_CIs, + smoothed_line = smoothed_indicator) + + ## I'll return all my data + return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary))) + } +``` + +Once we have created this function we can run this pipeline on a directory in one line, or put it in a loop to run across many directories. + + +```r +# Now we can run the pipeline in one line, like a boss +indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe') +``` + +``` +## Loading data...done +## Running bootstrapping for 20000 iterations...done +``` + +![](C:/Users/tomaug/AppData/Local/Temp/RtmpwDxzMK/preview-28d07a093333.dir/BRCindicators_files/figure-html/in_one-1.png) + +```r +head(indicator_data) +``` + +``` +## smoothed_indicator quant_25 quant_75 year a b c +## 1 0.9410110 1.0000000 1.000000 1950 0.6936066 0.6412022 0.5689617 +## 2 0.9410241 0.9161441 1.013355 1951 0.7638251 0.4861749 0.4021311 +## 3 0.9410373 0.9654977 1.054758 1952 0.6248634 0.6434973 0.2830601 +## 4 0.9410504 0.9418938 1.025378 1953 0.7024590 0.7724590 0.4861202 +## 5 0.9410636 0.8829743 0.980575 1954 0.6174317 0.5121858 0.7556831 +## 6 0.9410767 0.9152991 1.016191 1955 0.4302732 0.3584699 0.6068852 +## d e f g h i j +## 1 0.5595082 0.6493443 0.6055738 0.6360656 0.6808197 0.5023497 0.7240984 +## 2 0.5510929 0.6752459 0.7466667 0.6198361 0.3294536 0.7777049 0.5445902 +## 3 0.5293443 0.7142623 0.4499454 0.8585246 0.6218033 0.7155738 0.4503825 +## 4 0.3753552 0.4351366 0.7007650 0.6812568 0.2843716 0.7617486 0.7204918 +## 5 0.3475956 0.7890710 0.9208743 0.4224590 0.3489617 0.6709290 0.6630055 +## 6 0.5328962 0.6496175 0.5071038 0.8983607 0.8183607 0.1828415 0.5910383 +## k l m n o p q +## 1 0.6745355 0.7474317 0.4280874 0.8637705 0.4754098 0.3063934 0.8561749 +## 2 0.5612568 0.8439891 0.7219672 0.7681421 0.8899454 0.8829508 0.6603279 +## 3 0.7230055 0.6057377 0.8093443 0.6449727 0.8291803 0.5590164 0.9317486 +## 4 0.6781421 0.3494536 0.6693443 0.8687432 0.7666667 0.3861202 0.8895628 +## 5 0.4267760 0.7104918 0.5197268 0.6080328 0.9333880 0.6397814 0.8000000 +## 6 0.4730601 0.7218579 0.7813115 0.8701093 0.7691803 0.9534973 0.7195628 +## r s t u v w x +## 1 0.4616940 0.8857377 0.9212568 0.9278142 0.7791803 0.8990164 0.8789617 +## 2 0.4956284 0.5505464 0.6045902 0.9059016 0.5865027 0.9578689 0.5522404 +## 3 0.8715847 0.8665574 0.7143169 0.9012022 0.7806557 0.8487978 0.6573224 +## 4 0.7289617 0.9055191 0.9448634 0.7619126 0.8996721 0.7424044 0.8715847 +## 5 0.7063934 0.4081967 0.8507650 0.6043716 0.8098361 0.8346448 0.4581421 +## 6 0.5015847 0.6086339 0.7546448 0.9357923 0.9027322 0.8691257 0.8631148 +## y z +## 1 0.8712022 0.8991257 +## 2 0.8714208 0.6720765 +## 3 0.8427869 0.7418033 +## 4 0.8560656 0.7687432 +## 5 0.9773770 0.6566667 +## 6 0.8149180 0.6977049 +``` diff --git a/vignettes/vignette.Rmd b/vignettes/vignette.Rmd index ef12169..2a6e70a 100644 --- a/vignettes/vignette.Rmd +++ b/vignettes/vignette.Rmd @@ -406,7 +406,7 @@ As the value of span gets closer to 1 the trend line gets smoother. ## Lambda Indicator -The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx +The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): Other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx Very few species’ models produced reliable occupancy estimates for every year, so a majority of the time series contain missing values. This presents a problem for estimating growth rates for each species-year combination. Missing values of growth rate that would be equivalent to linear interpolation of the log odds between adjacent years with reliable estimates were therefore calculated. This indicator can therefore work with missing values.