
Train a starter model for Sentinel in Mexico #34

Open
geohacker opened this issue Dec 16, 2022 · 3 comments

@geohacker
Member

For our Sentinel release, we'll create a starter model based on priority AOIs for Reforestamos.

@srmsoumya
Member

srmsoumya commented Jan 30, 2023

Model

Training Strategy

We are labeling the segmentation masks from scratch, and given the complexity of differentiating between the classes of interest to RM (Reforestamos), it is taking us quite some time to generate the chips.

In the allocated budget of ~125 hours, we can generate approximately 1000 chips of size 256x256 as ground truth for our model. This is not sufficient to build a decent segmentation model for all eight corridors.

As a workaround, we are trying a weakly supervised training approach:

  • We are using the LULC ground truth masks shared by the RM team
  • This is weakly supervised because the labels were generated for Landsat & don't have a perfect 1:1 match with Sentinel imagery. The labels are also not very accurate for all classes
  • Pretraining the model with this weakly supervised approach gives it some understanding of what the landscapes look like in all eight corridors & which features represent each class.

Once we have a model pre-trained on these weakly supervised labels, we can fine-tune it on the chips generated by our data team, i.e. labels that are more precise & designed for Sentinel imagery.
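A rough sketch of this two-stage flow (plain PyTorch; `SegModel`, `fit`, `weak_loader` & `precise_loader` are hypothetical placeholders, not actual project code):

```python
import torch

# NOTE: SegModel, fit, weak_loader & precise_loader are hypothetical
# placeholders for the project's model class, training loop & datasets.

# Stage 1: pretrain on the weak Landsat-derived LULC masks.
model = SegModel(num_classes=13)
fit(model, weak_loader, epochs=10, lr=1e-3)
torch.save(model.state_dict(), "weak_pretrained.pt")

# Stage 2: fine-tune on the precise hand-labeled Sentinel chips,
# starting from the weakly supervised weights and a smaller learning rate.
model.load_state_dict(torch.load("weak_pretrained.pt"))
fit(model, precise_loader, epochs=10, lr=1e-4)
```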

Data Distribution

0: "other",
1: "Bosque",
2: "Selvas",
3: "Pastos",
4: "Agricultura",
5: "Urbano",
6: "Sin vegetación aparente",
7: "Agua",
8: "Matorral",
9: "Suelo desnudo",
10: "Plantaciones",
11: "Otras coberturas",
12: "Vegetación caducifolia",

[Figure: per-corridor LULC class pixel counts]

The numbers in the diagram represent the number of pixels for each LULC class in that particular corridor. As the figure shows, there is severe class imbalance across all the corridors, with Bosque, Selvas & Agricultura dominating in most cases.
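As an aside, a minimal sketch of how such per-class pixel counts can be computed from the label rasters (the directory layout and single-band GeoTIFF format are assumptions):

```python
import glob

import numpy as np
import rasterio  # assumption: masks are stored as single-band GeoTIFFs

NUM_CLASSES = 13  # class ids 0-12 from the mapping above
counts = np.zeros(NUM_CLASSES, dtype=np.int64)

for path in glob.glob("corridor_01/masks/*.tif"):  # hypothetical layout
    with rasterio.open(path) as src:
        mask = src.read(1)
    # Truncate to NUM_CLASSES in case of stray nodata values.
    counts += np.bincount(mask.ravel(), minlength=NUM_CLASSES)[:NUM_CLASSES]

for cls_id, n_pixels in enumerate(counts):
    print(f"class {cls_id}: {n_pixels} pixels")
```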

A few things to consider while training:

  • Build models specific to corridors rather than one big model for all
  • Apply heavy image augmentation & regularization techniques
  • Use loss functions that can handle class imbalance, like dice loss or focal loss (see the sketch after this list)
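For instance, both losses ship with segmentation_models_pytorch and can be blended; the 0.5/0.5 weighting below is an illustrative assumption, not a tuned value:

```python
import torch
from segmentation_models_pytorch.losses import DiceLoss, FocalLoss

# Dummy shapes: logits (N, C, H, W), integer masks (N, H, W).
logits = torch.randn(4, 13, 256, 256)
target = torch.randint(0, 13, (4, 256, 256))

dice = DiceLoss(mode="multiclass")
focal = FocalLoss(mode="multiclass")

# Blend the two losses; the equal weighting is an untuned assumption.
loss = 0.5 * dice(logits, target) + 0.5 * focal(logits, target)
```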

Initial PEARL Model for Reforestamos

The PEARL models for NAIP imagery were built on top of PyTorch & used segmentation architectures like UNet, FCN & DeepLab.

I am building the baseline model using PyTorch & PyTorch Lightning, which takes care of both the science & engineering sides of things. We have to write less boilerplate code, and things like storing model checkpoints, logging loss curves, and tracking metrics come for free. We can also easily scale the model to run on single or multiple CPUs/GPUs/TPUs without any additional effort.
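A minimal sketch of what that LightningModule can look like, paired with a segmentation_models_pytorch UNet (hyperparameter values here are illustrative):

```python
import pytorch_lightning as pl
import segmentation_models_pytorch as smp
import torch
from segmentation_models_pytorch.losses import DiceLoss

class LULCSegmenter(pl.LightningModule):
    """Minimal UNet segmenter; hyperparameters are illustrative."""

    def __init__(self, num_classes: int = 13, lr: float = 1e-3):
        super().__init__()
        self.model = smp.Unet(
            encoder_name="efficientnet-b0",  # ImageNet-pretrained backbone
            encoder_weights="imagenet",
            in_channels=3,
            classes=num_classes,
        )
        self.loss_fn = DiceLoss(mode="multiclass")
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        image, mask = batch
        loss = self.loss_fn(self(image), mask)
        self.log("train_loss", loss)  # loss curves logged for free
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# Checkpointing, logging & CPU/GPU/TPU scaling are handled by the Trainer:
# pl.Trainer(max_epochs=10, accelerator="auto").fit(LULCSegmenter(), train_loader)
```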

Update as of 30 Jan 2023

We have a segmentation model trained on a single corridor with weakly supervised labels from the RM team.

Architecture - Unet
Backbone - EfficientNet-B0 pre-trained on ImageNet
Epochs - 10
Dataset - 1700 chips for training & ~400 chips for testing (with LULC labels from RM)
Loss - Dice Loss 0.47
Score - Jaccard Index 0.6

Here are some sample results (panels: color-corrected image, ground truth mask, predicted mask, image overlaid with mask):

[Figure: sample prediction results]

@srmsoumya
Member

srmsoumya commented Mar 13, 2023

Model Update - 13-03-23

We have a baseline model, a DeepLabv3+ with a timm-efficientnet-b5 backbone, which has a weighted F1 score of 0.78 and is currently deployed as Mexico LULC pre alpha in the PEARL backend. This model also handles the issues mentioned in #47 by using color-based augmentations.
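For reference, a sketch of instantiating that architecture plus a pixel-wise weighted F1 metric via torchmetrics; apart from the architecture & backbone named above, every value is an assumption:

```python
import segmentation_models_pytorch as smp
import torch
from torchmetrics import F1Score

# Architecture & backbone match the text above; other values are assumptions.
model = smp.DeepLabV3Plus(
    encoder_name="timm-efficientnet-b5",
    encoder_weights="imagenet",
    in_channels=3,
    classes=13,
).eval()

f1 = F1Score(task="multiclass", num_classes=13, average="weighted")

image = torch.randn(1, 3, 256, 256)          # dummy Sentinel chip
mask = torch.randint(0, 13, (1, 256, 256))   # dummy ground truth
with torch.no_grad():
    pred = model(image).argmax(dim=1)        # per-pixel class ids
print(f1(pred.flatten(), mask.flatten()))    # pixel-wise weighted F1
```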

Issues with the current baseline model

  1. Clouds are creating confusion for the model. This is understandable: we filtered for mosaics with no cloud cover & trained our model on those. This can be fixed by:

    • Using a different search ID to extract mosaic tiles with a less restrictive cloud-cover filter
    • Using a custom augmentation that adds clouds, fog, or snowflakes to the image
  2. Edge effects are getting introduced because the model is looking at a very small patch of the imagery. Our ground truths are representations of what an area looks like & not an exact pixel match for the classes, so the model learns from surrounding pixels & infers the result. When we constrain that to just 256x256 tiles, it sometimes doesn't have enough information & thus creates edge effects; look at the red pixels at the bottom left of the model prediction mask.

A few ways to handle this:

  • Infer on larger tiles: Instead of inferring on 256x256 tiles, infer on 1024x1024 (~100 sq km) or 2048x2048 tiles. I tried doing this locally & got results with no edge effects - it takes 1.8 seconds to run on a CPU with 20 GB of RAM for a ~100 sq km area. Check the image attached, and see the sketch after this list. We can also explore large-image inferencing by splitting data & models into chunks using accelerate: https://huggingface.co/docs/accelerate/usage_guides/big_modeling.
  • [Image: prediction on a larger tile with no edge effects]
  • Train & infer on larger tiles, i.e. 512x512 (again, we have a data scarcity problem here)
  • Use window-based inferencing like SAHI: https://github.com/obss/sahi (I am yet to try this approach)
  3. Model retraining takes a few iterations. I tried it for a few classes & it works fine - I just had to iterate twice for the model to learn. This is mainly because we have 16,000 embeddings per model & are adding just 100 new pixels for new or modified classes; we can either pass more pixels or reduce the seed data size.
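On the larger-tile point above, a minimal sketch of why it works: the network is fully convolutional, so the same weights accept bigger inputs directly (ImageNet-only weights here, so this demonstrates the mechanics rather than real predictions):

```python
import segmentation_models_pytorch as smp
import torch

# Same architecture/backbone as the deployed baseline; ImageNet-only
# weights, so this shows the mechanics rather than real predictions.
model = smp.DeepLabV3Plus(
    encoder_name="timm-efficientnet-b5",
    encoder_weights="imagenet",
    in_channels=3,
    classes=13,
).eval()

# A 1024x1024 tile runs through the same weights used for 256x256 chips -
# one forward pass, no internal tile seams to cause edge effects.
tile = torch.randn(1, 3, 1024, 1024)
with torch.no_grad():
    pred = model(tile).argmax(dim=1)  # (1, 1024, 1024) class ids
```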

Next steps in order of priority

  1. Improve model retraining workflow

    • Increase the number of points used in the retraining workflow @ingalls
    • Reduce the number of embedding pixels inside seed dataset (check what works best between the two)
  2. Infer on larger tiles (should be easy to implement)

    • Pass larger tiles for inference to the model. Try tiles of size 2048x2048, 1024x1024, 512x512 - find the sweet spot between speed & model accuracy to prevent edge effects @ingalls
    • SAHI approach for model inference (try later)
    • Model sharding using accelerate (try later)
  3. Retrain model to improve accuracy

    • Add artificial clouds, fog & snowflakes to the augmentation pipeline (see the sketch after this list)
    • Add more ground truth data. @Rub21 can we have the data team label more chips for Reforestamos?
    • Curate the dataset to include different seasonality & cloud cover for model training
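For the cloud/fog/snow augmentations, a sketch with albumentations; it has no dedicated cloud transform, so RandomFog stands in for thin cloud & haze, and the probabilities are illustrative:

```python
import albumentations as A
import numpy as np

# Weather-style augmentations as stand-ins for clouds, fog & snow;
# probabilities are illustrative, not tuned values from the project.
weather = A.Compose([
    A.RandomFog(p=0.3),     # approximates thin cloud / haze
    A.RandomSnow(p=0.2),
    A.RandomShadow(p=0.2),  # e.g. cloud shadows on the ground
])

image = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)  # dummy chip
mask = np.random.randint(0, 13, (256, 256), dtype=np.uint8)   # dummy labels

out = weather(image=image, mask=mask)  # geometry unchanged, mask passes through
aug_image, aug_mask = out["image"], out["mask"]
```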

@developmentseed/pearl

@geohacker
Member Author

@srmsoumya What are your thoughts about closing this ticket? I think we managed to achieve most of what you outlined as improvements. We can revise/reopen based on feedback.
