Towards Open-Vocabulary Semantic Segmentation without Semantic Labels [NeurIPS 2024]

This is our official implementation of PixelCLIP!

[arXiv] [Project]
by Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab†, Paul Hongsuck Seo†, Seungryong Kim†
(†: Corresponding authors)

Introduction

In contrast to existing methods utilizing (a) pixel-level semantic labels or (b) image-level semantic labels, we leverage unlabeled masks as supervision, which can be freely generated from vision foundation models such as SAM and DINO.
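As a concrete illustration of where such unlabeled masks can come from, the sketch below uses SAM's automatic mask generator to produce class-agnostic masks for a single image. This is only a minimal example built on the segment-anything API, not necessarily the exact mask-generation pipeline used for the paper; the image path and checkpoint file name are placeholders.

# Minimal sketch: generating class-agnostic (unlabeled) masks with SAM.
# Assumes the segment-anything package and a downloaded SAM ViT-B checkpoint;
# file names below are placeholders, not part of this repository.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

image = np.array(Image.open("example.jpg").convert("RGB"))  # HxWx3 uint8 RGB

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Each result is a dict containing a boolean "segmentation" map plus metadata
# such as "area" and "bbox"; no semantic labels are involved.
masks = mask_generator.generate(image)
print(f"Generated {len(masks)} class-agnostic masks")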

For further details and visualization results, please check out our paper and our project page.

Installation

Please follow the installation instructions.

Data Preparation

Please follow the dataset preparation instructions.

Training

We provide shell scripts for training and evaluation. run.sh trains the model with the default configuration and evaluates it after training.

To train or evaluate the model in different environments, modify the given shell script and config files accordingly.

Training script

sh run.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]

# With SA-1B Masks
sh run.sh configs/pixelclip_vit_base.yaml 4 output/
# With DINO Masks
sh run.sh configs/pixelclip_vit_base.yaml 4 output/ MODEL.DINO True

Evaluation

eval.sh automatically evaluates the model following our evaluation protocol, using the weights in the output directory if none are specified. To run the model on individual datasets, please refer to the commands in eval.sh.

Evaluation script

sh eval.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]

sh eval.sh configs/pixelclip_vit_base.yaml 4 output/ MODEL.WEIGHTS path/to/weights.pth

Pretrained Models

We provide pretrained weights for the models reported in the paper. All models were trained and evaluated on 4 NVIDIA A6000 GPUs, and their results can be reproduced with the evaluation script above.

| Backbone | Masks | COCO-Stuff | ADE-150 | Pascal-Context | CityScapes | Pascal-VOC | Download |
|---|---|---|---|---|---|---|---|
| CLIP ViT-B/16 | DINO | 22.2 | 17.4 | 34.3 | 22.9 | 83.8 | ckpt |
| CLIP ViT-B/16 | SA-1B | 23.6 | 18.7 | 37.9 | 27.2 | 85.9 | ckpt |
| OpenCLIP ConvNeXt-B | DINO | 20.2 | 19.4 | 32.7 | 30.0 | 62.9 | ckpt |
| OpenCLIP ConvNeXt-B | SA-1B | 21.4 | 20.3 | 35.4 | 34.8 | 67.2 | ckpt |
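
To sanity-check a downloaded checkpoint before passing it to eval.sh via MODEL.WEIGHTS, a minimal sketch is given below. It assumes the checkpoints are ordinary PyTorch .pth files with a Detectron2-style layout; that layout is an assumption, and the path is the same placeholder as in the evaluation command above.

# Minimal sketch: inspecting a downloaded checkpoint before evaluation.
# Assumes a standard PyTorch .pth file; the path is a placeholder.
import torch

ckpt = torch.load("path/to/weights.pth", map_location="cpu")

# Detectron2-style checkpoints usually nest weights under a "model" key;
# fall back to the raw object otherwise (assumption, not verified here).
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

print(f"{len(state_dict)} entries")
for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)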

Citing PixelCLIP

@misc{shin2024openvocabularysemanticsegmentationsemantic,
      title={Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels}, 
      author={Heeseong Shin and Chaehyun Kim and Sunghwan Hong and Seokju Cho and Anurag Arnab and Paul Hongsuck Seo and Seungryong Kim},
      year={2024},
      eprint={2409.19846},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.19846}, 
}
