Modeling Crown Bulk Density via Ridge Regression

Johannes Heisig

This analysis predicts Crown Bulk Density, a canopy fuel variable relevant to crown fire spread. It is difficult to sample in the field and therefore estimated using measurements of other forest structure variables and allometric equations. Predictor data for this regression analysis comes from airborne LiDAR and Sentinel-1 & -2.

NOTE: Binder is great for reproducing analysis in R. However, it has RAM restrictions, which can be a problem for large remote sensing data stacks. Below we provide switches for this program, which allow you to decide, whether you want to run all analysis steps or skip some of the computationally expensive ones (e.g. extracting raster values). Be prepared that Binder may crash when running complex tasks. If you run this script on Binder we suggest you stick with the switch settings below (skip.. = TRUE), which allows intermediate results to be loaded from existing files. Alternatively, you may choose to download this repository and run operations locally instead. In this case feel free to change switches below TRUE to FALSE.

Analysis switches:

skip_training_data_extraction = TRUE
skip_model_prediction = TRUE

suppressPackageStartupMessages({
library(glmnet)
library(caret)
library(CAST)
library(dplyr)
select = dplyr::select
library(tidyr)
library(ggplot2)
library(stars)
library(sf)
library(parallel)
library(doParallel)
library(patchwork)
})
knitr::opts_chunk$set(cache=T, warning = F, message = F)

Prepare training and predictor data

Read

if (skip_training_data_extraction){
  haard = readRDS(file.path(getwd(),"data",
                            "cbd_training_data.rds")) |> 
    st_sf() |> 
    select(-c(PlotID, FC_CBH, FC_SH, FC_CC))
} else {
  
  predictors_path = file.path(getwd(),"data","predictors_haard_10m.rds")
  if (! file.exists(predictors_path)){
    download.file("https://uni-muenster.sciebo.de/s/XPEk2uBClq2v3ob/download",
                  predictors_path)
  } else print("Predictor data already on disk.")
  
  p = readRDS(predictors_path) 
  dim(p)
  
  haard = st_read(file.path(getwd(),"data",
                            "haard_field_plot_locations.csv"), 
                  crs=4326, quiet=T,
                  options=c("X_POSSIBLE_NAMES=lon",
                            "Y_POSSIBLE_NAMES=lat")) |>
    select(plot_id, dom_spp)
  
  FC = read.csv(file.path(getwd(),"data","haard_cbd_fuelcalc.csv"))[-1,] |> 
    select(PlotID, preCBD, preCBH, preSH, preCC) |> 
    mutate(PlotID = sub(x=PlotID, "-Inventory", "") |> as.numeric(),
           across(2:5, as.numeric)) |> 
    rename_with(~sub("pre","FC_",.x)) |> 
    drop_na() |> 
    glimpse()
}

ft = read_stars(file.path(getwd(),"results","haard_surface_fuel_map.tif")) |> 
  setNames("FuelType")

Extract

if (!skip_training_data_extraction){
  # Create buffer around visited field plots according to survey protocol
  haard = haard[haard$plot_id %in% FC$PlotID,] |> 
    st_transform(st_crs(p)) |> 
    st_buffer(10)
  
  #' Convert raster data to polygon geometries to enable
  #' intersection rather than pixel-based extraction.
  #' Extract area-weighted values from predictors at plot
  #' locations. First intersect buffers with predictors,
  #' then weight each fragment of a field plot by its 
  #' area in relation to plot size (pi*(10m)² = 314 m²)
  
  haard = st_intersection(st_as_sf(p), haard)
  haard = haard |> 
    mutate(across(-c(plot_id, dom_spp, geometry), 
                  ~as.numeric(.x * st_area(haard)/314))) |> 
    group_by(plot_id) |> 
    summarise(across(where(is.numeric), 
                     function(s) round(sum(s),2)), 
              dom_spp = unique(dom_spp))
  
  # join with target variable (CBD) based on field plot ID
  haard = FC |> 
    mutate(plot_id = as.character(PlotID)) |> 
    select(plot_id, FC_CBD) |> 
    inner_join(haard, by = "plot_id") |> 
    select(-plot_id, -dom_spp) |> 
    st_sf()
}

haard = st_drop_geometry(haard)

# CBD field samples
plot(1:nrow(haard), haard$FC_CBD, col="red")

Fill NAs

any(colSums(is.na(haard)) > 0) # no NAs in training point data

## [1] FALSE

if (!skip_training_data_extraction){
  # how many NAs in which predictor?
  p.nas = colSums(is.na(as.data.frame(split(p))))
  p.nas[p.nas > 0] 
  
  # ==> NAs mainly in Z-related variables --> can be filled with zeros 
  
  # where in space are NAs?
  p.nas.st = st_apply(p, 1:2, function(x) sum(is.na(x)))
  plot(p.nas.st, breaks = "equal", 
       col = rev(terrain.colors(10)),
       main = "\nNumber of\nmissing predictor values per pixel")
  
  # ==> almost exclusively in non-forest areas --> less of a problem
  
  p = p %>% replace(is.na(.), 0)
}

Split into training and test sets

set.seed(111)
training = haard |> sample_frac(0.7)
testing = haard |> setdiff(training)

x_train = model.matrix(FC_CBD ~ ., training)[,-1]   # predictors
x_test = model.matrix(FC_CBD ~ ., testing)[,-1]

y_train = training$FC_CBD                           # target
y_test = testing$FC_CBD

Remove zero variance predictors

(zerovar = which(apply(scale(x_train), 2, FUN = function(x) {all(is.na(x))})))

##   zq15 B1_p10 B1_p50 
##     45     84     85

x_train = x_train[,-zerovar]
x_test = x_test[,-zerovar]
if (!skip_training_data_extraction) p = p[,,,-zerovar]

Ridge regression with `caret` and `glmnet`

Prepare model training components:

A simple 5-fold random CV is used for assessment of model performance
Parameter lambda needs to be tuned through cross-validation. We use a tune grid with 100 possible values between 100 and 0.1.
Alpha (a) determines whether the model performs Ridge regression (a = 0), a Lasso regression (a = 1), or an Elasticnet regression (0 < a > 1).
Predictor variables (p) are translated from a stars-object to a matrix for model prediction.

tc = trainControl("cv", 5)
grid = 10^seq(2, -1, length = 100)
a = 0
if (!skip_training_data_extraction){
  newx = p |> 
    split(3) |> 
    as.data.frame() |> 
    select(-x,-y) |> 
    as.matrix()
}

Ridge regression log(CBD)

The caret modeling framework will be used as a wrapper for glmnet.

set.seed(111)
glm_ridge = train(
  x_train,
  log(y_train),
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = data.frame(lambda=grid, alpha=0), 
  trControl = tc,
  importance = T)

#saveRDS(glm_ridge, "results/models/model_log_cbd_ridge.rds")

Validation log(CBD)

Performance metrics

best_lambda = glm_ridge$bestTune$lambda

modelpred = glm_ridge |> 
  predict(s = best_lambda) |> 
  exp()
testpred = glm_ridge |> 
  predict(s = best_lambda, newdata = x_test) |> 
  exp() 

print(paste("R2 (model prediction vs training data) =", R2(modelpred, y_train)))

## [1] "R2 (model prediction vs training data) = 0.597169910562399"

print(paste("R2 (prediction vs validation data) =", R2(testpred, y_test)))

## [1] "R2 (prediction vs validation data) = 0.734889751072626"

print(paste("RMSE (model prediction vs training data) =", RMSE(modelpred, y_train)))

## [1] "RMSE (model prediction vs training data) = 0.0538830618265586"

print(paste("RMSE (prediction vs validation data) =", RMSE(testpred, y_test)))

## [1] "RMSE (prediction vs validation data) = 0.0690322554130355"

print(paste("RMSE (intercept only model) =",RMSE(mean(y_train), y_test)))

## [1] "RMSE (intercept only model) = 0.0821053783796247"

Variable importance

plot(varImp(glm_ridge), top=20)

Residuals

shapiro.test(residuals(glm_ridge))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(glm_ridge)
## W = 0.97532, p-value = 0.8295

hist(residuals(glm_ridge))

Scatter plot

plot(log(modelpred), log(y_train), 
     xlim=c(-4,-1), ylim=c(-4,-1))
points(log(testpred),log(y_test), col="red")
abline(0,1)

Coefficients

lambda_index = which(glm_ridge$results$lambda == best_lambda)
coef(glm_ridge$finalModel)[,lambda_index] |> 
  sort(decreasing = T)

##        cov_25m       NDVI_p90         zpcum8        cov_20m         B5_p10 
##   8.589786e-03   7.765810e-03   7.551110e-03   7.065261e-03   6.598638e-03 
##         B9_p10        cov_15m        cov_30m          cover   cov_30m_plus 
##   6.027070e-03   5.650038e-03   5.615321e-03   5.407282e-03   5.405789e-03 
##         B7_p10          zkurt         B8_p10         cov_8m         cov_7m 
##   5.008443e-03   4.911439e-03   4.719062e-03   4.469104e-03   4.374511e-03 
##      zcov_tree         zpcum7         cov_9m   pzabovezmean        cov_10m 
##   4.366869e-03   3.996367e-03   3.951544e-03   3.924434e-03   3.760507e-03 
##         B6_p10    zmean_grass    zmean_shrub         VH_p50         VH_p10 
##   3.540731e-03   3.396566e-03   3.346739e-03   3.321214e-03   3.314947e-03 
##         cov_6m        B12_p50        B12_p10         B3_p10         VH_p90 
##   2.955332e-03   2.421877e-03   2.417216e-03   2.227210e-03   2.185131e-03 
##         B4_p10         cov_5m         B9_p90         zpcum6           zq25 
##   2.165452e-03   1.925761e-03   1.562989e-03   1.416798e-03   1.311937e-03 
##     zcov_shrub         cov_4m           zq30         zpcum5         B6_p50 
##   1.235259e-03   1.127762e-03   7.244414e-04   7.021985e-04   6.890991e-04 
##       cov_3.5m         B8_p90         B7_p90              D         cov_3m 
##   6.470897e-04   4.640204e-04   4.211512e-04   3.981953e-04   3.179441e-04 
##         zpcum2         B8_p50        B11_p10        B11_p50         zpcum4 
##   2.491043e-04   1.227398e-04   1.004822e-04  -1.334253e-05  -5.719775e-05 
##     zcov_grass       cov_0.4m         B4_p50       cov_0.5m         zpcum3 
##  -1.261444e-04  -1.363407e-04  -1.408502e-04  -1.481716e-04  -1.845071e-04 
##       pzabove2       NDVI_p10       cov_1.5m       cov_2.5m         VV_p10 
##  -1.854629e-04  -2.829398e-04  -2.877925e-04  -3.179764e-04  -3.332955e-04 
##         B3_p50         cov_1m         B9_p50         zpcum1         B6_p90 
##  -4.000011e-04  -4.278860e-04  -4.628147e-04  -4.651159e-04  -4.832626e-04 
##         VV_p50         cov_2m            cbh         aspect         B7_p50 
##  -6.363807e-04  -6.617142e-04  -1.026951e-03  -1.090726e-03  -1.119400e-03 
##           zq35            zcv         VV_p90         cov_0m          zskew 
##  -1.149858e-03  -1.178821e-03  -1.301041e-03  -1.365768e-03  -1.756174e-03 
##         B2_p10       NDVI_p50              N         B5_p50         B3_p90 
##  -1.791889e-03  -1.897309e-03  -2.271493e-03  -3.443341e-03  -3.521321e-03 
## VVVH_ratio_p10          slope       zentropy           zq40         B2_p90 
##  -3.845969e-03  -4.063651e-03  -4.090790e-03  -4.133946e-03  -4.598607e-03 
## VVVH_ratio_p90        pground           zq45         B5_p90        B11_p90 
##  -5.248388e-03  -5.405935e-03  -5.595980e-03  -5.691376e-03  -6.026466e-03 
##         B2_p50        B12_p90 VVVH_ratio_p50         B1_p90            dem 
##  -6.138913e-03  -6.151091e-03  -6.154383e-03  -6.223940e-03  -6.236798e-03 
##           zq20           zq50           zq55       vert_gap     zmean_tree 
##  -6.529452e-03  -6.661062e-03  -6.695292e-03  -6.752985e-03  -6.797544e-03 
##         rumple          zmean           zmax              Z           zq60 
##  -6.832287e-03  -6.847241e-03  -6.916992e-03  -6.932715e-03  -6.938348e-03 
##           zq99            zsd         B4_p90           zq65           zq95 
##  -7.028331e-03  -7.122703e-03  -7.135953e-03  -7.177020e-03  -7.191010e-03 
##           zq90           zq80           zq85           zq75           zq70 
##  -7.321122e-03  -7.342597e-03  -7.348164e-03  -7.358602e-03  -7.374379e-03 
##           ziqr    (Intercept) 
##  -7.502296e-03  -2.578066e+00

Prediction log(CBD)

Predict and back-transform to normal scale.
Set CBD of non-burnable areas to NA.
Calculate the Area of Applicability (AOA).
Finally, 117 values (0.013 %) ranging between 0.3 and 10^13 remain. They are considered to be unreasonable for CBD and are also set to NA.

if(skip_model_prediction){
  (cbd_glm_st = read_stars(file.path(getwd(),"results","haard_CBD.tif")) |> 
     setNames("CBD"))
} else {
  cbd_glm = predict(glm_ridge, newdata = newx) |> 
    exp() |> 
    setNames("CBD")
  
  cbd_glm_st = ft |> 
    mutate(CBD = cbd_glm,   
           CBD = case_when(FuelType == 4 ~ NaN,   
                           CBD > 0.3 ~ NaN,
                           TRUE ~ CBD)) |> 
    select(CBD)
}

## stars object with 2 dimensions and 1 attribute
## attribute(s):
##              Min.    1st Qu.     Median       Mean    3rd Qu.      Max.   NA's
## CBD  7.414872e-14 0.07014109 0.08498263 0.08649818 0.09911616 0.2981278 362132
## dimension(s):
##   from   to  offset delta                       refsys point values x/y
## x    1 1000  372000    10 +proj=utm +zone=32 +ellps... FALSE   NULL [x]
## y    1  900 5733000   -10 +proj=utm +zone=32 +ellps... FALSE   NULL [y]

# Distribution of predicted CBD
hist(cbd_glm_st, breaks=100)

AOA

if(skip_model_prediction){
  (AOA = read_stars(file.path(getwd(),"results","AOA_haard_CBD.tif")) |> 
     setNames("AOA"))
} else {
  cl <- makeCluster(6)
  registerDoParallel(cl)
  
  AOA = aoa(model=glm_ridge, newdata=p, cl = cl)
  
  stopCluster(cl)
}

## stars object with 2 dimensions and 1 attribute
## attribute(s):
##      Min. 1st Qu. Median      Mean 3rd Qu. Max.
## AOA     0       0      0 0.3960178       1    1
## dimension(s):
##   from   to  offset delta                       refsys point values x/y
## x    1 1000  372000    10 +proj=utm +zone=32 +ellps... FALSE   NULL [x]
## y    1  900 5733000   -10 +proj=utm +zone=32 +ellps... FALSE   NULL [y]

Plot

pcbd = ggplot() + 
  geom_stars(data = cbd_glm_st, downsample = 1) +
  coord_fixed() + 
  scale_x_discrete(expand = c(0, 0), name="") +
  scale_y_discrete(expand = c(0, 0), name="") +
  scale_fill_viridis_c(direction = -1) +
  theme(legend.position = "bottom")

AOA = c(AOA, cbd_glm_st) |> 
  mutate(AOA = ifelse(is.na(CBD), 99, AOA) |> 
           factor(levels = c(1,0,99),
                  labels = c("inside","outside","NB")))

paoa = ggplot() + 
  geom_stars(data = AOA["AOA"], downsample = 2) +
  coord_equal() + 
  scale_x_discrete(expand = c(0, 0), name="") +
  scale_y_discrete(expand = c(0, 0), name="") +
  scale_fill_manual(values=c("lightgreen","firebrick","grey80"), name = "AOA") +
  theme(legend.position = "bottom")

pcbd + paoa

This plot refers to Figure 8 in our paper.

# Which percentage of pixels fall outside the AOA?
round(sum(AOA$AOA == 'outside') / length(AOA$AOA),4)*100

## [1] 20.19

About 20.19% of cells fall outside the AOA.

From section 4.2 in Heisig et al. 2022:

Ridge regression analysis was applied to predict CBD for the Haard forest. CV selected an optimal regularization parameter lambda of 10.7. Overall, model performance was poor which was expected, considering the small number of training samples. A model-R² of 0.59 with a RMSE of 0.054 was reported. Independent validation samples produced a higher R² of 0.73, while RMSE degraded to 0.069. Although R² is acceptable, RMSEs are large, considering a CBD training sample mean of 0.095. Differences in model performance and validation scores indicate the introduction of bias by ridge regression. Variable importance scores indicated strong dependence on LiDAR-derived vertical structure metrics and optical predictor data. The most relevant predictors included C₂₅, B05_p10, Z_p20, NDV**I_p90, B04_p90, and DEM. Further, the nine next relevant variables in the ranking, Z_iqr, Z_pcu**m80, and Z_{p65, ..., 95}, all describe vegetation structure in the upper third of the tree. This coincides with relative heights at which CBD can be found at the maximum. We tested adding training samples from NFI plots (n = 15) but were not able to improve the model. On average, their derived CBD values were significantly smaller than the existing values based on field sampling. NFI surveys include records of species and CH among many other observations. However, they do not include CBH. Supplementing NFI tree lists, for example, with LiDAR-derived CBH at 10 m spatial resolution, is rather inaccurate, especially when considering plots with heterogeneous species composition, age, and vertical structure. Spatial prediction and AOA for CBD are shown in Figure 8. CBD values range from 0 to 0.3. They roughly follow the tree species classification, while higher densities can be observed for pine than for beech and red oak. Anomalies in CBD within homogeneous patches dominated by a single species are related to structural differences. Considering only forested pixels, 20% fall outside the AOA. This may again be explained by the low number of training samples. A significant portion is located in areas with steeper slopes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

04_spatial_prediction_crown_bulk_density.md

04_spatial_prediction_crown_bulk_density.md

Modeling Crown Bulk Density via Ridge Regression

Prepare training and predictor data

Read

Extract

Fill NAs

Split into training and test sets

Remove zero variance predictors

Ridge regression with `caret` and `glmnet`

Ridge regression log(CBD)

Validation log(CBD)

Prediction log(CBD)

Files

04_spatial_prediction_crown_bulk_density.md

Latest commit

History

04_spatial_prediction_crown_bulk_density.md

File metadata and controls

Modeling Crown Bulk Density via Ridge Regression

Prepare training and predictor data

Read

Extract

Fill NAs

Split into training and test sets

Remove zero variance predictors

Ridge regression with caret and glmnet

Ridge regression log(CBD)

Validation log(CBD)

Prediction log(CBD)

Ridge regression with `caret` and `glmnet`