Understanding the types, quantities, and qualities of agricultural crops grown in a region is crucial for predicting yield, estimating import/export needs,
and assessing factors like soil organic carbon or greenhouse gas emissions [].
Government agencies and private companies have attempted to create models for determining crop types grown in a country.
Utilizing remote sensing of agricultural land through publicly available imagery from Earth-observing satellites proves to be a more economical approach compared to
methods like nationwide surveys or direct field visits. For instance, the U.S. Department of Agriculture (USDA) annually produces a Crop Data Layer (CDL) with a
resolution of
The CDL is typically released towards the end of January in the following crop calendar, and its production takes approximately 13 months []. However, there is a demand for a faster turnaround, and considering that not all countries produce crop maps, this remains a relevant issue.
For our educational project, we simplify the problem by focusing on mapping only three crop classes: Corn, Soybeans, and "other." Furthermore, we limit our spatial analysis to Green County, IA.
Here's a satellite base map of the county. Notice the fields, roads, and the river through the middle:
Once our model is deployed, we have the flexibility to experiment with the spatial extent. We can test it in a neighboring county where the model was not initially trained, or be more adventurous and evaluate its performance in a different state like Indiana or a southern state.
The main purpose of this project is to give an example of a pixel level supervised classification task using classical machine learning models under the Google Earth Engine (GEE) platform. We are not aiming for the state of art performance (just yet!). To summarize, given a set of input features (to be defined soon), we would like to classify each pixel as Corn, Soy or Other.
As with any supervised classification model, training labels are needed to train our models. Ideally, real ground truth data from farmers of the Greene county would be available to us, but we suffice to use CDL to extract the labels for the trainingg pixels for our task (see the Introduction section).
Here's a visual of the study area, the original CDL classification, and the reclassified CDL:
Note that the original CDL map contains many more classes than the three classes we are interested in. Those classes are are lumped into one "other" class and colored blue.
A timeseries of the Red and Near Infra Red (NIR) band of Sentinel2 over a "typical" corn/soy growing season in midwest US is used as model inputs.
- why Red and NIR?
- why use a timeseries and not just a single image
- issues that arise from mean weekly composites of the imagery
- Cloudmasking is currently not done, for implementation ease, to be revisited
This is just the Input features together with the target labels grouped into a single multiband image. However the input features are in the different coordinate system than the label image.
- were there any problems related to projection differences?
A stratified sampling function implemented in GEE is used to extract
- why was this sampling used as oppsed to say, random sample of
$n$ pixels. - why sample to being with? Why not use all (or say 80% of the county pixels).
- are there better sampling procedures for this problem
- are they easy to implement in GEE
- limitations
Here's a visual of the training sample overlayed on a satellite basemap of part of the county:
Here's a visual of the training sample overlayed on CDL:
The notebook is written in a way that the model can be configured and supplied to the train function. Any of the classical machine learning models available in GEE can be used. This makes it easy to experiment with different models and compare their prediction power on this toy problem. By default, a Random Forest using 30 trees is configured as the baseline.
Here's the original CDL map together with the predicted map using only 300 samples of each class:
- update with better result
- Add a confusion matrix for the best model