GitHub - scene-verse/SceneVerse: Official implementation of ECCV24 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding"

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia^✶, Yixin Chen^✶, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

We propose SceneVerse, the first million-scale 3D vision-language dataset, with 68K 3D indoor scenes and 2.5M vision-language pairs. We demonstrate the scaling effect by (i) achieving state-of-the-art results on all existing 3D visual grounding benchmarks and (ii) showcasing zero-shot transfer capabilities with our GPS (Grounded Pre-training for Scenes) model.

News

[2024-09] The scripts for scene graph generation are released.
[2024-07] Training and inference code, as well as preprocessing code, is released; checkpoints and logs are on the way!
[2024-07] Preprocessing code for the scenes used in SceneVerse is released.
[2024-07] SceneVerse is accepted by ECCV 2024! Training and inference code and checkpoints will come shortly, stay tuned!
[2024-03] We release the data used in SceneVerse. Fill out the form for the download link!
[2024-01] We release SceneVerse on arXiv. Check out our paper and website.

Getting Started

For data browsing, we tested with NVIDIA CUDA 12.1 on Ubuntu 22.04. Set up the environment with the following steps:

$ conda create -n sceneverse python=3.9
$ conda activate sceneverse
$ pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
$ pip install numpy open3d
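
After installation, a quick check can confirm that the packages are visible in the active environment. The sketch below is not part of the SceneVerse code base: the helper name `check_environment` is hypothetical, and it only reads installed package metadata through the standard library rather than importing the packages themselves.

```python
from importlib import metadata

# Packages installed by the commands above (names as passed to pip).
REQUIRED = ["torch", "torchvision", "torchaudio", "numpy", "open3d"]


def check_environment(packages=REQUIRED):
    """Return {package: installed version, or None if it is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

Run it inside the `sceneverse` environment; any package reported as missing was most likely installed into a different environment.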

For training and inference, we provide instructions for using our code base here. It also contains the steps for creating the virtual environment, so if you have already created one for visualization, you can directly install all requirements via requirements.txt.

Data

Note: As some users requested the mapping from HM3D object ids in SceneVerse to HM3D-semantics, we have added an additional file (HM3D_tgtID2objID.zip) that provides this mapping. The JSON file for each scene contains a dictionary of the form {<sceneverse_objid>: [hm3d_objid, hm3d_label]}.
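
As a minimal sketch of how such a per-scene file could be read, assuming only the dictionary layout described above: the helper names `load_hm3d_mapping` and `lookup` are hypothetical, and the path of each JSON file inside the archive is not specified here, so it is left as a parameter.

```python
import json
from pathlib import Path


def load_hm3d_mapping(json_path):
    """Load one scene's {sceneverse_objid: [hm3d_objid, hm3d_label]} dictionary.

    JSON object keys are always strings, so the SceneVerse ids come back as str.
    """
    with Path(json_path).open("r", encoding="utf-8") as f:
        raw = json.load(f)
    # Unpack each [hm3d_objid, hm3d_label] pair into a tuple for readability.
    return {sv_id: (hm3d_id, label) for sv_id, (hm3d_id, label) in raw.items()}


def lookup(mapping, sceneverse_objid):
    """Return (hm3d_objid, hm3d_label) for a SceneVerse object id, or None."""
    return mapping.get(str(sceneverse_objid))
```

For example, `lookup(load_hm3d_mapping("<PATH_TO_SCENE_JSON>"), 3)` would return the HM3D id and label recorded for SceneVerse object 3, or `None` if that id is absent.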

Data Processing

We have released the data preprocessing scripts for 3RScan, MultiScan and ARKitScenes, with more details here.

Scene Graph Generation

We have released the scripts to generate 3D scene graphs for the datasets released in SceneVerse, with more details here.

Data Download

We currently host our data on Google Drive and ask all applicants to fill out the form here.

You should see one or more zip file segments for each dataset we provide. For datasets with multiple segments (e.g., ARKitScenes), you can unzip the files with:

# Directories with multiple zip segments
$ ls ARKitScenes/
  -> ARKitScenes.zip  ARKitScenes.z01

# Unzip from all zip segments
$ cd ARKitScenes/
$ zip -F ARKitScenes.zip --out combined.zip
$ unzip combined.zip
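
If several datasets were downloaded, the same steps can be scripted. The sketch below assumes the `zip` and `unzip` command-line tools are on the PATH and that each dataset directory holds one main `.zip` file plus optional `.z01`, `.z02`, ... segments; the helper names are hypothetical and not part of the SceneVerse code base.

```python
import subprocess
from pathlib import Path


def has_split_segments(dataset_dir):
    """True if the directory holds split segments such as ARKitScenes.z01."""
    return any(Path(dataset_dir).glob("*.z[0-9][0-9]"))


def build_unzip_commands(dataset_dir):
    """Return the command lists needed to extract one dataset directory."""
    dataset_dir = Path(dataset_dir)
    # Ignore a combined.zip left over from an earlier run.
    archives = sorted(
        p for p in dataset_dir.glob("*.zip") if p.name != "combined.zip"
    )
    if not archives:
        raise FileNotFoundError(f"no .zip archive found in {dataset_dir}")
    if has_split_segments(dataset_dir):
        # Same two steps as above: merge the segments, then extract.
        return [
            ["zip", "-F", archives[0].name, "--out", "combined.zip"],
            ["unzip", "combined.zip"],
        ]
    return [["unzip", archive.name] for archive in archives]


def extract_dataset(dataset_dir):
    """Run the commands inside the dataset directory (needs zip and unzip)."""
    for cmd in build_unzip_commands(dataset_dir):
        subprocess.run(cmd, cwd=dataset_dir, check=True)
```

`build_unzip_commands` is kept separate from `extract_dataset` so the commands can be inspected before anything is executed.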

After unzipping, the files are organized as:

ARKitScenes/
|-- scan_data                   # Point cloud data
  |-- instance_id_to_label      # Reorganized instance id to label mapping
  |-- pcd_with_global_alignment # Aligned scene point clouds
|-- annotations                 # Language annotations
  |-- splits
    |-- train_split.txt         # For all datasets, we provide training split
    |-- val_split.txt           # For datasets with evaluation sets
  |-- <language_type>.json      # For all datasets except ScanNet; ScanNet language is located at annotations/refer
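
The layout above can be traversed with a few lines of Python. This is a sketch under stated assumptions: the split files are assumed to list one scan id per line (the format is not documented here), and the helper names `read_split` and `list_language_files` are hypothetical.

```python
from pathlib import Path


def read_split(dataset_root, split="train"):
    """Read annotations/splits/<split>_split.txt, assuming one scan id per line."""
    split_file = Path(dataset_root) / "annotations" / "splits" / f"{split}_split.txt"
    if not split_file.exists():
        # val_split.txt only exists for datasets with evaluation sets.
        return []
    lines = split_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def list_language_files(dataset_root):
    """List the <language_type>.json files directly under annotations/."""
    return sorted((Path(dataset_root) / "annotations").glob("*.json"))
```

For example, `read_split("<PATH_TO_DOWNLOAD>/ARKitScenes")` would return the training scan ids, and `read_split(..., "val")` returns an empty list for datasets without an evaluation set.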

Data Visualization

We also provide a short script for visualizing scene and language data. You can use it with:

# Visualize scene and instance data
$ python visualize_data.py --root <PATH_TO_DOWNLOAD> --dataset <DATASET>
# Visualize language data
$ python visualize_data.py --root <PATH_TO_DOWNLOAD> --dataset <DATASET> --vis_refer

As our data contains scenes from existing datasets, please carefully read the terms of use for each dataset provided in the form.

Provided Language Types

We list the available data in the current version of SceneVerse in the table below:

Dataset	Object Caption	Scene Caption	Ref-Annotation	Ref-Pairwise `rel2`	Ref-MultiObject `relm`	Ref-Star `star`	Ref-Chain (Optional) `chain`
ScanNet	✅	✅	ScanRefer, Nr3D	✅	✅	✅	✅
MultiScan	✅	✅	✅	✅	✅	✅	✅
ARKitScenes	✅	✅	✅	✅	✅	✅	✅
HM3D	`template`	✅	✅	✅	✅	✅	✅
3RScan	✅	✅	❌	✅	✅	✅	✅
Structured3D	`template`	✅	❌	✅	✅	✅	❌
ProcTHOR	`template`	❌	❌	`template`	`template`	`template`	❌

For the generated object referrals, we provide both the direct template-based generations (`template`) and the LLM-refined versions (`gpt`). Please refer to our supplementary material for the description of the selected pair-wise / multi-object / star types. We also provide the `chain` type, which contains language that uses object A to refer to B and then B to refer to the target object C. As we found that the chain type could sometimes lead to unnatural descriptions, we did not discuss it in the main paper. Feel free to inspect and use it in your projects.
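
When scripting over several datasets, it can help to encode the availability table so that missing annotation types are skipped rather than failing at load time. The sketch below transcribes the table by hand for the current version; the names `COLUMNS`, `AVAILABILITY`, and `datasets_with` are hypothetical, and the values would need updating if the release changes.

```python
# Column order follows the table above.
COLUMNS = ("object_caption", "scene_caption", "ref_annotation",
           "rel2", "relm", "star", "chain")

# "yes" = checked, "no" = crossed out, "template" = marked `template` in the table.
AVAILABILITY = {
    "ScanNet":      ("yes", "yes", "ScanRefer, Nr3D", "yes", "yes", "yes", "yes"),
    "MultiScan":    ("yes", "yes", "yes", "yes", "yes", "yes", "yes"),
    "ARKitScenes":  ("yes", "yes", "yes", "yes", "yes", "yes", "yes"),
    "HM3D":         ("template", "yes", "yes", "yes", "yes", "yes", "yes"),
    "3RScan":       ("yes", "yes", "no", "yes", "yes", "yes", "yes"),
    "Structured3D": ("template", "yes", "no", "yes", "yes", "yes", "no"),
    "ProcTHOR":     ("template", "no", "no", "template", "template", "template", "no"),
}


def datasets_with(column):
    """Return the datasets whose entry for `column` is anything other than "no"."""
    idx = COLUMNS.index(column)
    return [name for name, row in AVAILABILITY.items() if row[idx] != "no"]


if __name__ == "__main__":
    for column in COLUMNS:
        print(f"{column:15s} {', '.join(datasets_with(column))}")
```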

We hope to further refine and update the remaining data in the coming weeks. Stay tuned!

BibTex

@article{jia2024sceneverse,
  title={SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding},
  author={Jia, Baoxiong and Chen, Yixin and Yu, Huangyue and Wang, Yan and Niu, Xuesong and Liu, Tengyu and Li, Qing and Huang, Siyuan},
  journal={arXiv preprint arXiv:2401.09340},
  year={2024}
}
