This is our official implementation of PixelCLIP!
[arXiv] [Project]
by Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
In contrast to existing methods utilizing (a) pixel-level semantic labels or (b) image-level semantic labels, we leverage unlabeled masks as supervision, which can be freely generated from vision foundation models such as SAM and DINO.
For further details and visualization results, please check out our paper and our project page.
Please follow the installation instructions.
Please follow the dataset preparation instructions.
We provide shell scripts for training and evaluation. run.sh trains the model in the default configuration and evaluates it after training.
To train or evaluate the model in different environments, modify the given shell script and config files accordingly.
sh run.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]
# With SA-1B Masks
sh run.sh configs/pixelclip_vit_base.yaml 4 output/
# With DINO Masks
sh run.sh configs/pixelclip_vit_base.yaml 4 output/ MODEL.DINO True
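To train both mask variants back to back without overwriting results, the two commands above can be wrapped in a small script. The following is only a sketch: the `output/sa1b/` and `output/dino/` directory names, the `run_variant` helper, and the `DRY_RUN` switch are assumptions made for illustration and are not part of the repository. Only the `run.sh` argument order and the `MODEL.DINO True` option come from the usage shown above.

```shell
#!/bin/sh
# Sketch: train with SA-1B masks and with DINO masks into separate output dirs.
# Assumptions (not from the repository): the output/sa1b/ and output/dino/
# directory names, the run_variant helper, and the DRY_RUN switch.
CONFIG=configs/pixelclip_vit_base.yaml
NUM_GPUS=4
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = actually run them

run_variant() {
    # $1 = output directory, remaining arguments = extra config options ([OPTS])
    out_dir=$1
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo sh run.sh "$CONFIG" "$NUM_GPUS" "$out_dir" "$@"
    else
        sh run.sh "$CONFIG" "$NUM_GPUS" "$out_dir" "$@"
    fi
}

run_variant output/sa1b/                   # SA-1B masks (default configuration)
run_variant output/dino/ MODEL.DINO True   # DINO masks
```

By default the script only prints the two commands. Setting `DRY_RUN=0` launches them from the repository root, one after the other.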
eval.sh automatically evaluates the model following our evaluation protocol, using the weights in the output directory unless specified otherwise.
To run the model on individual datasets, please refer to the commands in eval.sh.
sh eval.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]
sh eval.sh configs/pixelclip_vit_base.yaml 4 output/ MODEL.WEIGHTS path/to/weights.pth
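When evaluating a downloaded checkpoint, it can be useful to confirm that the weights file exists before starting a multi-GPU job. The sketch below builds the evaluation command only if the file is present. The `build_eval_cmd` helper is hypothetical and not part of the repository, and `path/to/weights.pth` is the same placeholder as above, to be replaced with the actual checkpoint location.

```shell
#!/bin/sh
# Sketch: check that a checkpoint exists before building the eval.sh command.
# Assumption: build_eval_cmd is an illustrative helper, not part of the repository.
CONFIG=configs/pixelclip_vit_base.yaml
NUM_GPUS=4
OUTPUT_DIR=output/

build_eval_cmd() {
    # $1 = path to a checkpoint; prints the eval command, or fails if it is missing
    weights=$1
    if [ ! -f "$weights" ]; then
        echo "weights not found: $weights" >&2
        return 1
    fi
    echo sh eval.sh "$CONFIG" "$NUM_GPUS" "$OUTPUT_DIR" MODEL.WEIGHTS "$weights"
}

# Print the command for the placeholder path; run the printed line from the
# repository root to start the evaluation.
build_eval_cmd path/to/weights.pth || echo "download a checkpoint first"
```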
We provide pretrained weights for our models reported in the paper. All of the models were trained and evaluated with 4 NVIDIA A6000 GPUs, and can be reproduced with the evaluation script above.
Backbone | Masks | COCO-Stuff | ADE-150 | Pascal-Context | CityScapes | Pascal-VOC | Download |
---|---|---|---|---|---|---|---|
CLIP ViT-B/16 | DINO | 22.2 | 17.4 | 34.3 | 22.9 | 83.8 | ckpt |
CLIP ViT-B/16 | SA-1B | 23.6 | 18.7 | 37.9 | 27.2 | 85.9 | ckpt |
OpenCLIP ConvNeXt-B | DINO | 20.2 | 19.4 | 32.7 | 30.0 | 62.9 | ckpt |
OpenCLIP ConvNeXt-B | SA-1B | 21.4 | 20.3 | 35.4 | 34.8 | 67.2 | ckpt |
@misc{shin2024openvocabularysemanticsegmentationsemantic,
title={Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels},
author={Heeseong Shin and Chaehyun Kim and Sunghwan Hong and Seokju Cho and Anurag Arnab and Paul Hongsuck Seo and Seungryong Kim},
year={2024},
eprint={2409.19846},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.19846},
}