This is the official repository of the TGRS 2024 paper: Contrastive Tokens and Label Activation for Remote Sensing Weakly Supervised Semantic Segmentation.
In recent years, there has been remarkable progress in Weakly Supervised Semantic Segmentation (WSSS), with Vision Transformer (ViT) architectures emerging as a natural fit for such tasks due to their inherent ability to leverage global attention for comprehensive object information perception. However, directly applying ViT to WSSS tasks can introduce challenges. The characteristics of ViT can lead to an over-smoothing problem, particularly in dense scenes of remote sensing images, significantly compromising the effectiveness of Class Activation Maps (CAM) and posing challenges for segmentation. Moreover, existing methods often adopt multi-stage strategies, adding complexity and reducing training efficiency.
To overcome these challenges, a comprehensive framework CTFA (Contrastive Token and Foreground Activation) based on the ViT architecture for WSSS of remote sensing images is presented. Our proposed method includes a Contrastive Token Learning Module (CTLM), incorporating both patch-wise and class-wise token learning to enhance model performance. In patch-wise learning, we leverage the semantic diversity preserved in intermediate layers of ViT and derive a relation matrix from these layers and employ it to supervise the final output tokens, thereby improving the quality of CAM. In class-wise learning, we ensure the consistency of representation between global and local tokens, revealing more entire object regions. Additionally, by activating foreground features in the generated pseudo label using a dual-branch decoder, we further promote the improvement of CAM generation. Our approach demonstrates outstanding results across three well-established datasets, providing a more efficient and streamlined solution for WSSS.
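As a rough illustration of the patch-wise idea described above, the sketch below builds a pairwise relation matrix from intermediate-layer patch tokens and measures how far the final-layer tokens deviate from it. This is not the repository's implementation: the function names, the thresholded cosine similarity, and the mean-squared penalty are all assumptions chosen to keep the example small, and the paper's actual formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relation_matrix(tokens, threshold=0.5):
    """Hypothetical relation target: 1 where two patch tokens are similar, else 0."""
    n = len(tokens)
    return [
        [1 if cosine(tokens[i], tokens[j]) >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def disagreement(final_tokens, relation):
    """Mean squared gap between final-token similarities and the relation target."""
    n = len(final_tokens)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (cosine(final_tokens[i], final_tokens[j]) - relation[i][j]) ** 2
    return total / (n * n)


# Toy data: three patches, where the first two belong to the same object.
mid_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
relation = relation_matrix(mid_tokens)

# Over-smoothed final tokens (all identical) versus tokens that stay diverse.
smoothed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
diverse = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

loss_smoothed = disagreement(smoothed, relation)
loss_diverse = disagreement(diverse, relation)
```

In this toy case the over-smoothed tokens receive a larger penalty than the diverse ones, which is the behaviour such a supervision signal is meant to encourage.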
At present, the code repository seems to have the following issues that need to be addressed:
1) When training the Deepglobe model, the background was not used. Therefore, the following modifications need to be made:
① Modify the model section, model_seg_neg_fp.py:

Lines 71-73: The classifier output dimension does not need to be further reduced, because there is no background class left.

    self.classifier = nn.Conv2d(in_channels=self.in_channels[-1], out_channels=self.num_classes, kernel_size=1, bias=False,)
    self.aux_classifier = nn.Conv2d(in_channels=self.in_channels[-1], out_channels=self.num_classes, kernel_size=1, bias=False,)

Lines 170-171: Similarly, the classification outputs need to be reshaped accordingly:

    cls_x4 = cls_x4.view(-1, self.num_classes)
    cls_aux = cls_aux.view(-1, self.num_classes)

② Modify the CAM section, camutils_ori.py:

Line 13: The pseudo label no longer needs += 1; otherwise it will be out of bounds.

Line 377, refine_camb_with_bkg_v2(): Background-related information is no longer needed.
'''
b, _, h, w = images.shape
_images = F.interpolate(images, size=[h // down_scale, w // down_scale], mode="bilinear", align_corners=False)

refined_label = torch.ones(size=(b, h, w)) * ignore_index
refined_label = refined_label.to(cams.device)
refined_label_h = refined_label.clone()
refined_label_l = refined_label.clone()

cams_with_bkg_h = cams
_cams_with_bkg_h = F.interpolate(cams_with_bkg_h, size=[h // down_scale, w // down_scale], mode="bilinear", align_corners=False)  # .softmax(dim=1)
cams_with_bkg_l = cams
_cams_with_bkg_l = F.interpolate(cams_with_bkg_l, size=[h // down_scale, w // down_scale], mode="bilinear", align_corners=False)
'''
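The out-of-bounds problem mentioned for line 13 of camutils_ori.py can be illustrated with a small, framework-free sketch. It is only a toy model of the idea, not the repository's code: `pseudo_labels`, `has_background`, and the score values are hypothetical. When a background class exists, foreground indices are shifted by one so that 0 can mean background; without a background class, the same shift produces a label equal to the number of classes, which no longer indexes a valid class.

```python
def pseudo_labels(cam_scores, has_background):
    """Pick the strongest class per pixel from per-class CAM scores.

    cam_scores: list of pixels, each a list with one score per CAM channel.
    has_background: if True, shift indices by +1 so that label 0 is reserved
    for background (hypothetical flag used only in this sketch).
    """
    labels = []
    for scores in cam_scores:
        idx = max(range(len(scores)), key=scores.__getitem__)  # argmax
        labels.append(idx + 1 if has_background else idx)
    return labels


def in_bounds(labels, num_classes):
    """True if every label is a valid index for a head with num_classes outputs."""
    return all(0 <= label < num_classes for label in labels)


# No background class: the segmentation head has exactly as many outputs
# as there are CAM channels.
num_classes = 3
cams = [[0.1, 0.7, 0.2], [0.2, 0.1, 0.9]]

shifted = pseudo_labels(cams, has_background=True)     # keeps the +1 shift
unshifted = pseudo_labels(cams, has_background=False)  # shift removed

shifted_ok = in_bounds(shifted, num_classes)      # False: label 3 is invalid
unshifted_ok = in_bounds(unshifted, num_classes)  # True
```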
iSAID dataset
You may download the iSAID dataset from its official website: https://captain-whu.github.io/iSAID/dataset.html.
After downloading, you may craft your own dataset. Please refer to datasets/iSAID/make_data.py.
ISPRS Potsdam dataset
Datasets for ISPRS Potsdam are widely accessible on the Internet. You may find the original content on: https://www.isprs.org/education/benchmarks/UrbanSemLab/default.aspx.
You may refer to datasets/potsdam/potsdam_clip_dataset.py, provided by OME. Many thanks to them for their excellent work.
Deepglobe Land Cover Classification Dataset
You may find the original content on: http://deepglobe.org/challenge.html.
Please refer to datasets/deepglobe/deepglobe_clip_dataset.py.
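The clipping scripts referenced above are not reproduced here. As a hedged sketch of what clipping a large remote sensing image into fixed-size patches typically involves, the helper below computes the crop windows for one image. The function name `tile_windows`, the 512-pixel tile size, and the choice to add a final window flush with the border are assumptions for illustration; the repository's scripts may use different sizes or overlap rules.

```python
def tile_windows(height, width, tile=512, stride=512):
    """Return (top, left, bottom, right) crop boxes that cover the whole image.

    If the image size is not a multiple of the stride, one extra window is
    placed flush with the bottom or right border so no pixels are dropped.
    """
    def starts(length):
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile + 1, stride))
        if positions[-1] + tile < length:
            positions.append(length - tile)  # last window touches the border
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]


# Example: windows for a 1000 x 600 image with the default settings.
windows = tile_windows(1000, 600)
```

Each box can then be used to crop the image and its label map with the same coordinates, so the pair stays aligned.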
We also provide a Baidu Netdisk download link for the processed datasets Here. Code: CTFA
For checkpoints, you may try this. Code: r1k6
We provide our requirements file for building the environment. Note that extra packages may be downloaded.
## Download Dependencies.
pip install -r requirements.txt
To use the regularized loss, download and compile the Python extension; see Here.
Please refer to the scripts folder, where all scripts are clearly named. You can also modify them for distributed training, which requires more GPUs. A simple startup looks like this:
## for iSAID
python dist_train_iSAID_seg_neg_fp.py
## for potsdam
python dist_train_postdam_seg_neg_fp.py
## for deepglobe
python dist_train_deepglobe_seg_neg_fp.py
For the Deepglobe dataset, which lacks a clear background partition, some modifications may be necessary. Please check the comments in camutils_ori.py.
Remember to change the data path to your own and make sure all settings match.
I will try my best to reorganize the code to minimize issues. I apologize for any inconvenience caused by the code issues, and thank you for your understanding.
To evaluate:
## for iSAID
python infer_seg_iSAID.py
...
Our work is built on the codebases of ToCo and Factseg. We sincerely thank the authors for their exceptional work.