[NeurIPS 2024] XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

wangzy22/XMask3D

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Created by Ziyi Wang*, Yanbo Wang*, Xumin Yu, Jie Zhou, Jiwen Lu.

This repository is a PyTorch implementation of our NeurIPS 2024 paper XMask3D.

XMask3D is a framework for open vocabulary 3D semantic segmentation that improves fine-grained boundary delineation by aligning 3D features with a 2D-text embedding space at the mask level. Using a mask generator based on a pre-trained diffusion model, it enables precise textual control over dense pixel representations, enhancing the versatility of generated masks. By integrating 3D global features into a 2D denoising UNet, XMask3D adds 3D geometry awareness to mask generation. The resulting 2D masks align 3D representations with vision-language features, yielding competitive segmentation performance across benchmarks.
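To make "alignment at the mask level" concrete, here is a minimal toy sketch, not the authors' implementation. It assumes per-point 3D features, binary masks over those points, and text embeddings already live in one shared space. The function names (`mask_pool`, `classify_masks`) and the toy data are hypothetical. Each mask's point features are averaged into one embedding, which is then matched to the closest text embedding by cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def mask_pool(point_feats, mask):
    """Average the features of the points selected by a binary mask."""
    selected = [f for f, m in zip(point_feats, mask) if m]
    if not selected:
        return [0.0] * len(point_feats[0])
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def classify_masks(point_feats, masks, text_embeds):
    """Assign each mask the index of the most similar text embedding."""
    text_unit = [l2_normalize(t) for t in text_embeds]
    labels = []
    for mask in masks:
        emb = l2_normalize(mask_pool(point_feats, mask))
        sims = [sum(a * b for a, b in zip(emb, t)) for t in text_unit]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy data (hypothetical): 4 points with 2-D features, 2 masks, 2 class embeddings.
point_feats = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
print(classify_masks(point_feats, masks, text_embeds))  # prints [0, 1]
```

Pooling before comparison means every point inside a mask receives the same label, which is the intuition behind sharper region boundaries than per-point matching; the actual model is considerably more involved.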

[arXiv]

Installation

  • Follow installation.md to install all required packages before training or evaluation.

Data Preparation

  • For convenience, a download link for the processed dataset is provided here. You can also download the dataset by running:
sh scripts/download_datasets.sh

Pre-trained Model Preparation

  • This project requires the pre-trained CLIP model and the Stable Diffusion model. Because the official download links can be unstable, we provide alternative download options below:
# CLIP ViT-Large Patch14
cd /path/to/your/workspace
# Quote the URL so the shell does not treat "?" as a glob character
wget -O openai.tar.gz "https://cloud.tsinghua.edu.cn/f/3890f1df1c5248a7a6e8/?dl=1"
tar -xzvf openai.tar.gz
# Stable Diffusion v1.3 Checkpoint
wget -O sd_model.tar.gz "https://cloud.tsinghua.edu.cn/f/8dce9b137f574e6eb57c/?dl=1"
tar -xzvf sd_model.tar.gz
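
If a download is interrupted or the mirror returns an error page, `wget -O` still writes a file, and `tar` then fails with a less obvious message. The optional sketch below uses only Python's standard `tarfile` module to check that a downloaded file really is a tar archive and to preview its first few entries before extraction. The helper name `list_archive` is hypothetical and is not part of this repository.

```python
import tarfile


def list_archive(path, limit=10):
    """Return up to `limit` member names from a tar archive.

    Raises ValueError if the file is not a readable tar archive,
    which usually means the download failed or was truncated.
    """
    if not tarfile.is_tarfile(path):
        raise ValueError(f"{path} is not a readable tar archive; re-download it")
    names = []
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For example, calling `list_archive("openai.tar.gz")` after the first `wget` shows what the archive will unpack into the workspace.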

Usage

Training

sh run/train.sh --exp_dir=<EXPERIMENT_DIRECTORY> --config=<CONFIG_FILE>
  • For example, to train on the ScanNet B15N4 benchmark, run:
sh run/train.sh --exp_dir=out/exp_b15n4 --config=config/scannet/xmask3d_scannet_B15N4.yaml

Resume

sh run/resume.sh --exp_dir=<EXPERIMENT_DIRECTORY> --config=<CONFIG_FILE>
  • For example, to resume training from the last checkpoint on the ScanNet B15N4 benchmark, run:
sh run/resume.sh --exp_dir=out/exp_b15n4 --config=config/scannet/xmask3d_scannet_B15N4.yaml

Inference

sh run/infer.sh --exp_dir=<EXPERIMENT_DIRECTORY> --config=<CONFIG_FILE> --ckpt_name=<CKPT_NAME>
  • For example, to run inference using the checkpoint b15n4.pth.tar on the ScanNet B15N4 benchmark, execute the following command:
sh run/infer.sh --exp_dir=out/exp_b15n4 --config=config/scannet/xmask3d_scannet_B15N4.yaml --ckpt_name=b15n4.pth.tar
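
The three scripts above share the same `--exp_dir` and `--config` flags, and inference adds `--ckpt_name`. When launching several runs from Python, a small helper can assemble the argument list. This is a convenience sketch only: `build_command` is a hypothetical name, not something the repository provides, and it uses just the flags shown in the commands above.

```python
import shlex


def build_command(script, exp_dir, config, ckpt_name=None):
    """Assemble the argument list for run/train.sh, run/resume.sh or run/infer.sh."""
    cmd = ["sh", script, f"--exp_dir={exp_dir}", f"--config={config}"]
    if ckpt_name is not None:
        # Only the inference script takes a checkpoint name in the commands above.
        cmd.append(f"--ckpt_name={ckpt_name}")
    return cmd


cmd = build_command(
    "run/infer.sh",
    "out/exp_b15n4",
    "config/scannet/xmask3d_scannet_B15N4.yaml",
    ckpt_name="b15n4.pth.tar",
)
print(shlex.join(cmd))
```

The printed line matches the inference example above; passing the list to `subprocess.run(cmd, check=True)` from the repository root would execute it.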

Checkpoint

Benchmark       | hIoU / mIoU (base) / mIoU (novel) | Download Link
ScanNet B15N4   | 70.0 / 69.8 / 70.2                | [Tsinghua Cloud] [Google]
ScanNet B12N7   | 61.7 / 70.2 / 55.1                | [Tsinghua Cloud] [Google]
ScanNet B10N9   | 55.7 / 76.5 / 43.8                | [Tsinghua Cloud] [Google]
ScanNet B170N30 | 18.0 / 27.8 / 13.3                | [Tsinghua Cloud] [Google]
ScanNet B150N50 | 15.5 / 24.4 / 11.4                | [Tsinghua Cloud] [Google]
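
The table does not define hIoU. The sketch below assumes it is the harmonic mean of the base-class and novel-class mIoU, a common convention in open-vocabulary segmentation; this is an assumption, not a statement from this repository. Under that assumption, the formula reproduces every hIoU value in the table from the other two columns after rounding to one decimal place.

```python
def harmonic_iou(miou_base, miou_novel):
    """Harmonic mean of base-class and novel-class mIoU (assumed definition of hIoU)."""
    if miou_base + miou_novel == 0:
        return 0.0
    return 2 * miou_base * miou_novel / (miou_base + miou_novel)


# (base mIoU, novel mIoU, reported hIoU), copied from the table above.
reported = {
    "B15N4": (69.8, 70.2, 70.0),
    "B12N7": (70.2, 55.1, 61.7),
    "B10N9": (76.5, 43.8, 55.7),
    "B170N30": (27.8, 13.3, 18.0),
    "B150N50": (24.4, 11.4, 15.5),
}

for name, (base, novel, hiou) in reported.items():
    computed = round(harmonic_iou(base, novel), 1)
    print(f"{name}: computed {computed}, reported {hiou}")
```

Because the harmonic mean is pulled toward the smaller value, a model scores well on hIoU only if it does well on both base and novel classes.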

Citation

If you find our work useful in your research, please consider citing:

@article{wang2024xmask3d,
  title={XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation},
  author={Wang, Ziyi and Wang, Yanbo and Yu, Xumin and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2411.13243},
  year={2024}
}
