In 2020 alone, 10.1 million acres burnt to ashes across about 58,950 wildfires, a 214% increase in area burnt due to wildfires, according to the Insurance Information Institute. Although wildfires are a natural occurrence in various ecological cycles, fire seasons are undeniably becoming more extreme and widespread, destabilising ecosystems, economies and human lives. Hotter and drier weather caused by climate change, along with poor land management, has been creating favorable conditions for more frequent, larger and stronger wildfires. While it is currently estimated that, in rural settings, fires can attain full growth (decay stage, see figure below for plot of fire stages) in as little as 5 minutes, engulfing entire houses in flames, wildfires in natural environments spread even faster, travelling at speeds of 10.8 kilometers per hour in forests and 22 kilometers per hour in grasslands.
Thus, rapid detection has emerged as a crucial means of ensuring precise and measured control of such phenomena. Various alternatives exist: traditional ones such as fire watchtowers, run by fire departments, or more recent ones such as fire detection using satellite AVHRR (Advanced Very High Resolution Radiometer) images, heat detection using thermal cameras or chemical analysis powered by CO2 sensors, all varying in cost. Yet, with increasing data mining and statistical capabilities, cheaper solutions have appeared. These solutions take advantage of automatic tools such as local sensors (e.g. existing meteorological stations) and/or low-cost equipment (e.g. Raspberry Pi-powered circuits). Their approach relies on powerful models to classify images, meteorological data and sensor feeds in order to predict the presence of a fire and/or its potential severity.
Keywords: Data Science applied to Environment, Fire Detection, Classification Models
In 2007, Paulo Cortez and Anibal Morais conducted an experiment in which they studied the extent to which local meteorological sensor data could help predict the severity of wildfires [1]. In their work, they propose a data mining approach that uses data detected by local sensors in weather stations. Such data has the advantage of being collectable in real time at low cost, especially when compared with satellite and infrared scanner approaches. The objective of this lab is therefore to understand the 2007 research and approach, and to reflect on how to apply its base approach to current projects. More precisely, I will focus on exploring the performance of classification models using local sensor data, including information on weather conditions and related indexes (e.g. temperature, humidity, wind), to predict the potential level of severity of a forest fire (in terms of burnt area). This study will develop the following classification models: Regression Trees, Random Forest and Gradient Boosting, using feature selection techniques for improvement, powered by real-world meteorological and historical fire data from the north-eastern regions of Portugal. On the business side, I will develop a rationale on why this approach is a relevant alternative, and envision how to apply it to Pyronear (pyronear.org), a French NGO I am part of, which is developing an affordable computer vision based solution to automatically detect wildfires and manage related alerts with an in-house platform (set to be compatible with Fire Management Systems) destined to support local fire departments (also referred to as "SDIS").
The data-set studied contains forest fire data collected in the Montesinho natural park, located in the Tras-os-Montes northeast region of Portugal. The park, well known for its diverse flora and fauna, experiences a supra-Mediterranean climate with annual temperatures within the range of 0 to 28°C. Our data-set, collected between January 2000 and December 2003, includes spatial and temporal attributes, components of the Canadian Fire Weather Index (FWI) as well as four weather condition measures for 517 fire occurrences. A data table describing each attribute can be found below:
Weather data is currently used in many Fire Weather Indexes (FWI) to create unified measures of various weather conditions (e.g. rain and temperature in the Drought Code index). Using such combined data can facilitate feature selection; however, certain indexes should be used with caution when measured in different climates (e.g. the Canadian Fire Weather Index can help to measure weather conditions in Portugal, but it would be more appropriate to re-calibrate the index to suit local conditions). Nonetheless, each index used here, along with the Buildup Index (BUI), is a component of the Fire Weather Index (FWI), which is an indicator of fire intensity. By combining the Initial Spread Index (ISI) and the Buildup Index (BUI), it gives an overall rating of fire line intensity in a reference fuel type (BUI) and level terrain (ISI).
Regarding data collection, the data comes from two sources: the inspector responsible for the Montesinho fire events, who recorded various features on a daily basis whenever a fire occurred, and the Bragança Polytechnic Institute, which recorded weather observations at 30-minute intervals at the park's station. From a general observation, we can see that fires occur mostly at the end of summer, in August and September (see monthly distribution of fire occurrences below).
Looking at our predictors now, we notice a few elements in their distribution:
• First, most of our predictors follow a roughly normal distribution, as shown by the red dotted normalised curve (thanks to the pre-processed nature of our data).
• Secondly, three predictors have skewness issues: FFMC, ISI and rain. To address this, we will apply a square root transformation: √(values of predictors)
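The square root transformation mentioned above can be illustrated with a minimal Python sketch. It assumes a non-negative, right-skewed predictor; the values in `toy` are made up for illustration and are not taken from the Montesinho data, and the helper names are hypothetical.

```python
import math

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

def sqrt_transform(values):
    """Square-root transform; only valid for non-negative values."""
    return [math.sqrt(v) for v in values]

# Made-up right-skewed values standing in for a predictor such as rain
toy = [0.0, 1.0, 1.0, 4.0, 4.0, 9.0, 16.0, 49.0]
transformed = sqrt_transform(toy)

# Skewness before and after the transform; a value closer to 0 is more symmetric
print(round(skewness(toy), 2), round(skewness(transformed), 2))
```

On a right-skewed variable, the transform compresses the long right tail, so the skewness moves towards zero without changing the ordering of the observations.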
Lastly, in terms of correlations, a correlation plot of our variables can be found below.
It shows a clear strong positive relation between Drought Code and Month, and a strong negative relation between Relative Humidity and temperature. Both correlations seem logical (a drop in temperature tends to raise relative humidity). For that same reason, colder months (end of the year) tend to be higher in humidity, which can be seen in the relationship between Drought Code (calculated from rain and temperature) and Month. In terms of lower correlations, we see that most indexes are positively correlated with each other; from the data table, we understand that this comes from the common variables used in each index: temperature and rain. This is insightful: too many correlated variables may lead to over-fitting, so our tree-based models might need only a subset of them.
In order to build our models, we have to solve the issues in our data and transform certain columns to suit the reality of our data-set. Regarding skewness, the issue was resolved as explained above, using a square-root transformation.
To transform our dependent variable, area, into a categorical one, we use the National Wildfire Coordinating Group's Size Class of fires (see Figure 6) to create 3 categories: A, B and C. A denotes small fires (between 0 and 2.5 acres), B medium-sized fires (between 2.5 and 10 acres) and C large fires (over 10 acres). To do so, we convert our area variable, initially in hectares, into acres using the following relation: acres = 2.471 × hectares. Once done, we evaluate the distribution of our data-set: 48% small-sized fires, 21% medium-sized fires and 31% large-sized fires.
Regarding our categorical predictors, we find it justified to dummify our month and day variables. As seen in our data visualisation process, fire occurrences are not equally distributed across months, so certain months seem to matter more than others when it comes to fires. Using the fastDummies library, we split our 2 variables into 19 variables, enabling us to explore the individual importance of each level of the original variables.
Having now 30 variables (including the dependent variable), we explore individual importance to perform our feature selection and retain the most important features. To do so, we use an algorithmic approach: Random Forest feature selection. This is a very popular approach that enables rapid and efficient testing of features. When we run the process, we can highlight the following:
• First, temperature is an influential feature (a 2.6% decrease in accuracy and a decrease of 18.1 in the Gini index without the feature). This again is a logical observation, as temperature is known to be an important factor in a fire's development. Related to temperature, we see that Drought Code (combining temperature and rain) is comparably important in evaluating fire risk.
• Secondly, certain months are more relevant to our classification approach: December and March. Going back to our monthly distribution plot, we can see that these are months with few fire occurrences. Yet, since that plot ignores the intensity of fires, it hides the distribution of fires by class: December, for example, has only large-sized fires recorded, giving it strong predictive power for that class of fire severity.
• Thirdly, exploring our least important variables, we notice that the months of May and June and the weekdays Thursday, Friday and Sunday perform poorly in terms of accuracy decrease. Our analysis reveals that our model would actually perform, on average, about 4% better (in terms of accuracy) without those variables. Thus we decide to exclude them from our model's data-set.
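Before moving on to the models, the data preparation described earlier in this section (hectares to acres, then classes A, B and C, plus dummy variables for month and day) can be sketched in a few lines of Python. This is an illustrative sketch, not the original R code: the function names are hypothetical, the toy areas and months are made up, and treating the upper bound of each class as inclusive is an assumption, since the text does not say how boundary values are handled.

```python
from collections import Counter

HECTARES_TO_ACRES = 2.471  # conversion factor quoted in the text

def size_class(area_hectares):
    """Map a burnt area in hectares to the report's classes A, B or C."""
    acres = area_hectares * HECTARES_TO_ACRES
    if acres <= 2.5:   # assumption: upper bounds are inclusive
        return "A"     # small fire
    if acres <= 10:
        return "B"     # medium-sized fire
    return "C"         # large fire

def dummify(values, levels):
    """One 0/1 indicator per level for each value (one dict per row)."""
    return [{level: int(value == level) for level in levels} for value in values]

# Made-up burnt areas in hectares, for illustration only
areas_ha = [0.0, 0.5, 1.0, 2.0, 4.0, 6.0, 50.0]
classes = [size_class(a) for a in areas_ha]
shares = {c: n / len(classes) for c, n in Counter(classes).items()}

# Made-up month column turned into indicator variables
month_dummies = dummify(["aug", "sep", "dec"], levels=["aug", "sep", "dec"])
print(classes, shares, month_dummies[0])
```

The `shares` dictionary plays the role of the class distribution reported above (48% / 21% / 31% on the real data), and each dictionary in `month_dummies` corresponds to one row of the indicator columns produced by dummy encoding.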
We now have 27 predictors and our dependent variable: fire category (A, B and C). To perform a robust classification process, we have to choose adequate models for the task. This requires us to understand tree-based models, and more precisely classification trees and random forest, the two models we will be exploring here. Tree-based models are methods that segment the predictor space into simple regions; a new data point is predicted using the mean or mode of the training observations in the region to which it belongs. Since the set of splitting rules used for the segmentation can be summarised in a tree-shaped diagram, these models are referred to as decision tree methods. Tree-based models are therefore extremely helpful for interpretation. While classification trees build a single tree, random forest builds several decision trees and combines them into a unified prediction. In this manner, random forest belongs to a category of tree-based models called "Ensemble Methods". The way the trees are combined differs between members of the category.
Classification Trees. Building a classification tree is the most natural approach when exploring tree-based models. Its concept is simple: it constructs a tree-like structure starting from the top (referred to as "the root"). It chooses the most influential predictor (the one that best separates the classes of our dependent variable). At each segmentation of the tree, the model determines the best branching criterion by looking at the available predictors and their values to find the one that reduces the classification Error Rate the most.
Once the branching is done, it results in a node at which a decision is made using the predictor (e.g. does the area have a relative humidity higher than 50?). To confirm the next node, the model compares the improvement in classification Error Rate brought by the new criterion with a complexity parameter, set by the user and also referred to as the cost of adding another variable or extending another branch. With a default value of around 0.01 (the default in R's rpart package, for instance), the model compares the Error Rate of the new criterion with that of the previous branching, evaluating whether the improvement is greater than the complexity parameter. While the improvement is greater, the tree keeps building branches, eventually stopping at nodes that fail the test, referred to as leaf nodes, resulting in a fully grown tree (a method called recursive splitting). Classification trees are a powerful model with clear advantages:
• They are simple to interpret and easy to display graphically: a plotted classification tree is very easy to understand and analyse.
• They can handle both numerical and categorical variables, a strong advantage over certain non tree-based models that greatly reduces the need for data preparation.
• They are better at mirroring human decision-making processes.
Yet, they have equally clear limitations:
• They tend to have lower prediction accuracy than some other regression and classification approaches.
• They have high variance: a small change in the data can cause a tree's structure and performance to change substantially.
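The splitting logic described above (pick the threshold that reduces the classification Error Rate the most, then keep the split only if the improvement exceeds the complexity parameter) can be sketched as follows for a single numeric predictor. This is a simplified illustration with hypothetical function names and made-up data; actual implementations scale and apply the complexity parameter differently and search over all predictors.

```python
from collections import Counter

def error_rate(labels):
    """Share of observations outside the majority class of a node."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1 - majority_count / len(labels)

def best_split(values, labels):
    """Best 'value <= threshold' split as (threshold, weighted error rate)."""
    n = len(labels)
    best = (None, error_rate(labels))  # no split: error of the parent node
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * error_rate(left)
                    + len(right) * error_rate(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

def should_split(values, labels, cp=0.01):
    """Accept the best split only if it improves the error rate by more than cp."""
    threshold, split_error = best_split(values, labels)
    improvement = error_rate(labels) - split_error
    accepted = threshold is not None and improvement > cp
    return accepted, threshold

# Made-up relative humidity readings and fire classes, for illustration only
rh = [20, 30, 40, 55, 60, 70]
fire_class = ["C", "C", "C", "A", "A", "A"]
print(best_split(rh, fire_class), should_split(rh, fire_class))
```

Applying `should_split` recursively to the left and right subsets, until it returns `False` everywhere, gives the recursive splitting procedure described in the text; nodes where the test fails become leaf nodes.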
By aggregating many classification trees, using ensemble techniques such as Random Forest, we reduce this sensitivity and substantially improve on their individual predictive performance.
Random Forests. In the Random Forest approach, we build a number of decision trees on bootstrapped training samples: samples of the same size as the data-set, drawn with replacement. Compared to simple decision trees, random forest performs each branch split considering only a random set of m predictors (usually around m = √p, where p is the number of predictors), drawing a fresh set at each split. In this manner, trees in the forest are not allowed to consider a majority of the available predictors but must work with what they are given. The rationale behind this is simple: with strong predictors, trees tend to always place the most influential predictors at the top branches, making the trees similar to each other. Here, trees are constantly formed at random, which improves the overall forest's performance (reducing the correlation between trees that would be found in other ensemble techniques such as bagged trees). On average, about (p − m)/p of the splits will not consider a given strong predictor, allowing other, less influential ones to be tested for segmentation, ultimately making the ensemble of trees less variable and thus more reliable. While random forests are more complex and harder to interpret (no clear graphical visualisations), their differences create clear advantages over classification trees:
• Lower sensitivity to variance in data thanks to the collection of various decision trees
• Lower sensitivity to collinearity between predictors
• Reduced over-fitting
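The two sources of randomness in a random forest (bootstrapped rows and a random subset of about √p candidate predictors at each split) and the final majority vote can be sketched as below. The helper names, the predictor labels `x0` to `x26` and the votes are hypothetical; the sketch also computes the share of splits at which a given predictor is not among the m candidates, which works out to (p − m)/p.

```python
import math
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def candidate_predictors(predictors, rng):
    """Random subset of about sqrt(p) predictors considered at one split."""
    m = max(1, round(math.sqrt(len(predictors))))
    return rng.sample(predictors, m)

def share_of_splits_without(p, m):
    """Probability that a given predictor is not among the m candidates."""
    return (p - m) / p

def majority_vote(predictions):
    """Combine the class predicted by each tree into one forest prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)                     # fixed seed for reproducibility
predictors = [f"x{i}" for i in range(27)]   # hypothetical names, p = 27
rows = list(range(10))                      # stand-in for training rows

sample = bootstrap_sample(rows, rng)
candidates = candidate_predictors(predictors, rng)
print(len(candidates), round(share_of_splits_without(27, 5), 3))
print(majority_vote(["A", "C", "A", "B", "A"]))
```

With p = 27 and m = 5, roughly four splits out of five do not even see a given strong predictor, which is what gives weaker predictors a chance and decorrelates the trees.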
Therefore, to build our models, we will go through three steps for each tree-based approach:
• Building a simple model: running a basic model of our classification technique
• Improving that model with hyper-parameter tuning
• Measuring the improvement and building the model with the best set of parameters
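The report does not detail how the tuning step is carried out; one common option is an exhaustive grid search, sketched below under that assumption. The parameter names mirror those reported in the results (minsplit, maxdepth, cp), but the grid values and the `toy_evaluate` scoring function are made up: in practice `evaluate` would fit a model with the given parameters and return its validation accuracy.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    """Made-up stand-in for validation accuracy (peaks at maxdepth=3, cp=0.02)."""
    return 0.5 - 0.01 * abs(params["maxdepth"] - 3) - abs(params["cp"] - 0.02)

# Hypothetical grid, for illustration only
grid = {"minsplit": [5, 10, 20], "maxdepth": [2, 3, 5], "cp": [0.01, 0.02, 0.05]}
best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params, best_score)
```

Comparing `best_score` with the score of the default parameters corresponds to the third step above, measuring the improvement before refitting the model with the best set of parameters.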
Through the feature selection process, we confirmed that the best set of predictors would be: X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain, January, February, March, June, July, August, September, November, December, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday. Our models will therefore try to predict the category of fire severity based on those 27 variables. The results obtained for each model are as follows.
For the Classification tree, we found the following parameters to be the best set: minsplit = 5, maxdepth = 3 and cp = 0.0213. This means that our tree needs at least 5 observations to consider splitting a node, can go up to 3 levels below the root (so any leaf node is at most three splits away from the root) and requires a minimum improvement of 0.0213 to add a split. The improved model gave us a 48.84% accuracy, meaning it correctly classified about 1 in every 2 fires it was shown. While this is a promising result for a simple model, it is worth noting that it was trained on a relatively small data-set (only about 380 observations in the training sub-sample, generated to compare the models on true training/test data before showing our final test data).
For our Random Forest approach, while the default model produced a poorer accuracy than its classification tree counterpart (41.29%), the improved model, built with the following parameters: number of trees built = 50, number of candidate predictors randomly sampled = 9, results in a model accuracy of 47.74% on our test data. While this is a poorer score than the tuned classification tree, it is crucial to highlight the mentioned advantages of random forest over classification trees, especially regarding data variability. For example, after running the models several times, I noticed that the tuned CT model did not stay at 48% but actually varied around 42%, whereas Random Forest produced a robust average accuracy of around 46%.
Therefore, the best configuration found uses Random Forest along with four meteorological inputs (temperature, relative humidity, rain and wind), four weather indexes (FFMC, DMC, DC and ISI), two spatial features (X and Y positions in the park) and temporal features (January, February, March, April, July, August, September, October, November and December), and it is capable of predicting the burned area of small fires, which are more frequent.
Such knowledge is particularly useful for improving firefighting resource management (e.g. prioritizing targets for air tankers and ground crews).
In 2020, fires in the US burned on average over 160 acres each (classifying as D class fires). From 2011 to 2020, American authorities spent over $1.3 billion fighting wildfires, an average of $32,786 per fire. Detection being one main driver of a fire's severity, solutions exploring rapid detection and/or risk evaluation are being adopted at an accelerated pace across the world.
This study has enabled me to understand more deeply the factors influencing the severity of a wildfire. Having now a clearer idea of the main indicators (cf. our feature selection), we are capable not only of predicting risk scores based on live conditions (using up-to-date sensor data and updating the other variables we have), but also of supporting fire forces confronted with an on-going fire, using our model to predict its potential future severity. In that manner, we can come in as support after rapid detection of a fire (determining its potential severity) as well as support detection teams (supplying live risk scores, using our predictions to classify areas of the park by favorable conditions for small, medium or large fires).
The proposed solution, which is based on a Random Forest algorithm, requires only four direct weather inputs (temperature, rain, relative humidity and wind speed), four easily calculated and well-documented fire weather indexes, along with two temporal data points (month and day of the week). This first approach is capable of predicting small fires with strong accuracy (over 75% for class A). This is valuable knowing that small-sized fires constitute the largest share of fire occurrences in the park (over 45% of fires between Jan 2000 and Dec 2003). Yet an important drawback remains: the poor predictive power for medium and large sized fires. While it is true that predicting the class of a forest fire is challenging, a few paths to improvement emerge. Additional and readily available information could be added to our data-set.
Type of vegetation, number of people crossing the x,y coordinates and vegetation density (harder to collect) would be great additions to our model. They would not only improve our model's understanding of fire occurrences but also increase its interpretability. Indeed, being able to focus on fires that occurred without any human passage would give us a better understanding of the link between vegetation, weather and fire risks (assuming the fires excluded are mostly human-caused). Nonetheless, our current model is still capable of improving existing firefighting efforts, helping teams on the ground evaluate the potential development of on-going fires and send the appropriate response (avoiding sending heavy equipment to combat small fires). Such flexibility would be extremely advantageous in dramatic situations, such as fire seasons during which fires occur simultaneously at distinct locations. To develop our approach, let's dive into an example of business application.
Potential Business Case. Suppose we are part of the Montesinho fire protection team. The park has known over 500 fire occurrences in the span of three years, with its most damaging fire affecting over 2,400 acres. Based on the study previously quoted, for an average wildfire size of 74 acres, the US has seen an average suppression cost of about $32,786 per fire. Based on our data, wildfires in the park burnt 12.5 acres on average between 2000 and 2003. Thus, assuming that suppression cost scales with burnt area, suppose an average cost per fire of about (12.5/74) × 32,786 ≈ $5,538. Over 500 fires, the park would then have incurred a total fire suppression cost, over three years, of over $2,769,000. Knowing our model can accurately predict small-sized fires, it could efficiently support our teams in avoiding or decreasing the impact of 186 fires (75% of the 248 small fire occurrences), representing a cost of just over $1 million.
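The arithmetic of this business case can be reproduced with the short Python sketch below. All figures are the ones quoted in the text; the variable names are hypothetical, and the proportional scaling of cost with burnt area is the report's simplifying assumption rather than an established cost model.

```python
US_COST_PER_FIRE = 32_786   # USD per fire, figure quoted in the text
US_AVG_ACRES = 74           # average US wildfire size used in the text
PARK_AVG_ACRES = 12.5       # average burnt area per fire in the park
PARK_FIRES = 500            # "over 500 fire occurrences"
SMALL_FIRES = 248           # class A fires in the data-set
CLASS_A_ACCURACY = 0.75     # share of small fires predicted correctly

# Assumption from the text: cost scales linearly with burnt area
cost_per_fire = PARK_AVG_ACRES / US_AVG_ACRES * US_COST_PER_FIRE
total_cost = PARK_FIRES * cost_per_fire

# Small fires the model could help with, and the cost they represent
fires_supported = int(SMALL_FIRES * CLASS_A_ACCURACY)
cost_supported = fires_supported * cost_per_fire

print(f"cost per fire:   ${cost_per_fire:,.0f}")
print(f"total cost:      ${total_cost:,.0f}")
print(f"fires supported: {fires_supported}")
print(f"cost supported:  ${cost_supported:,.0f}")
```

Changing any of the constants (for instance the class A accuracy) immediately shows how sensitive the estimated savings are to the assumptions behind them.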
While we know, according to the Congressional Research Service, that most wildfires are human-caused (88% on average from 2016 to 2020) [2], being able to quickly evaluate areas at risk (with favorable conditions) using weather data, and to quickly visualise those areas thanks to spatial data, would support teams in dispatching crews to the relevant areas and putting in place measures to prevent fire occurrences. While this remains a theoretical study on the potential of classification models to predict fire severity, it is a promising start toward developing more cost-efficient and reliable solutions to prevent wildfires and protect forests. Another brick to add to the Pyronear Project.
• Special thanks to Paulo Cortez and Aníbal Morais (University of Minho, Portugal) for providing the cleaned data set as part of their research on forest fire prediction (raw data set initially provided by Manuel Rainha and the Bragança Polytechnic Institute).
[1] Cortez, P., Morais, A. (2007). A Data Mining Approach to Predict Forest Fires using Meteorological Data. Retrieved December 2021 from ResearchGate.
[2] US Congressional Research Service. (2021, October). Wildfire Statistics (No. IF10244).