Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang*
Shanghai AI Laboratory, Nanyang Technological University, Carnegie Mellon University

🏠 About

This paper presents UniHSI, a UNIfied Human-Scene Interaction (HSI) framework that supports unified control of diverse interactions through language commands. The framework builds on a definition of interaction as a Chain of Contacts (CoC): a sequence of steps, each made up of human joint-object part pairs. This definition is inspired by the strong correlation between interaction types and human-object contact regions. Based on it, UniHSI comprises a Large Language Model (LLM) Planner, which translates language prompts into task plans in the form of CoC, and a Unified Controller, which turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and its generalizability to real scanned scenes.

🔥 News

  • [2024-04] The data is released.
  • [2024-03] The code is released.
  • [2024-01] UniHSI is accepted as ICLR 2024 spotlight. Thanks for the recognition!
  • [2023-09] We release the UniHSI paper. Please check the 👉 webpage 👈 and view our demos! 🎇

🔍 Overview

The pipeline consists of two major components: the LLM Planner and the Unified Controller. The LLM Planner takes language inputs and background scenario information and outputs a multi-step plan in the form of a Chain of Contacts. The Unified Controller then executes the task plan step by step and outputs interaction movements.
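To make the plan format concrete, here is a minimal sketch of how a Chain of Contacts could be held in memory as an ordered list of steps, each a set of joint-object part pairs. The class names, field names, and the joint and part labels are all hypothetical illustrations, not the data format used by this repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContactPair:
    """One human joint paired with one object part (names are illustrative)."""
    joint: str
    object_part: str


@dataclass
class ChainOfContacts:
    """An ordered list of steps; each step is a list of contact pairs."""
    steps: List[List[ContactPair]]

    def describe(self) -> List[str]:
        # Render each step as a readable line, numbered from 1.
        lines = []
        for i, step in enumerate(self.steps, start=1):
            pairs = ", ".join(f"{p.joint} -> {p.object_part}" for p in step)
            lines.append(f"step {i}: {pairs}")
        return lines


# Hypothetical two-step plan: sit on a chair, then lean back.
plan = ChainOfContacts(steps=[
    [ContactPair("pelvis", "chair_seat")],
    [ContactPair("pelvis", "chair_seat"), ContactPair("torso", "chair_back")],
])

for line in plan.describe():
    print(line)
```

Under this reading, the planner would emit such a structure and the controller would consume it one step at a time.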

Installation

Download Isaac Gym from the website, then follow the installation instructions.

Once Isaac Gym is installed, install the external dependencies for this repo:

pip install -r requirements.txt

Data Preparation

PartNet

  1. Download PartNet and ShapeNet V2.

  2. Organize them in the following layout:

data/
├── partnet_origin
│   ├── obj_id1
│   ├── obj_id2
│   ├── ...
├── shapenet_origin
│   ├── class_id1
│   │    ├── obj_id1
│   │    ├── ...
│   ├── class_id2
│   │    ├── obj_id1
│   │    ├── ...
│   ├── ...
  3. Extract the objects used in ScenePlan by running:
python cp_partnet_train.py
python cp_partnet_test.py
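Before running the extraction scripts, it can help to confirm that the expected top-level folders exist. The following is a small sketch of such a check, not part of the repository; the helper name `missing_dirs` is hypothetical, and the folder lists are taken only from the directory trees shown in this README.

```python
from pathlib import Path
from typing import Iterable, List

# Top-level folders taken from the layouts in this README.
PARTNET_DIRS = ["partnet_origin", "shapenet_origin"]
SCANNET_DIRS = ["scan_origin/scans"]


def missing_dirs(root: Path, required: Iterable[str]) -> List[str]:
    """Return the required sub-directories that do not exist under root."""
    return [rel for rel in required if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root, next to data/.
    missing = missing_dirs(Path("data"), PARTNET_DIRS)
    if missing:
        print("Missing folders under data/:", ", ".join(missing))
    else:
        print("PartNet and ShapeNet folders found.")
```

The same helper can be called with `SCANNET_DIRS` to check the ScanNet layout.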

ScanNet

  1. Download ScanNet.

  2. Organize them in the following layout:

data/
├── scan_origin
│   ├── scans
│   │   ├── scans_1
│   │   ├── scans_2
│   │   ├── ...
  3. Extract the objects used in ScenePlan by running:
python cp_scannet_test.py

Motion Clips

We select and process motion clips from SAMP and CIRCLE.

Training

We adopt step-by-step training. Run the three scripts in order, from simple to hard:

sh train_partnet_simple.sh
sh train_partnet_mid.sh
sh train_partnet_hard.sh
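If you prefer to launch the stages unattended, the sketch below runs the three scripts in sequence and stops at the first failure. It is an illustration rather than a tool shipped with the repository: the helper names `run_stages` and `shell_runner` are hypothetical, and it assumes that each stage should only start after the previous one exits successfully.

```python
import subprocess
from typing import Callable, List, Sequence

# Script names as listed in this README, in training order.
STAGES = [
    "train_partnet_simple.sh",
    "train_partnet_mid.sh",
    "train_partnet_hard.sh",
]


def run_stages(scripts: Sequence[str],
               runner: Callable[[List[str]], int]) -> List[str]:
    """Run scripts in order; stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    """
    completed: List[str] = []
    for script in scripts:
        if runner(["sh", script]) != 0:
            break
        completed.append(script)
    return completed


def shell_runner(cmd: List[str]) -> int:
    """Execute a command and return its exit code."""
    return subprocess.run(cmd).returncode


# Usage (from the repository root):
#   done = run_stages(STAGES, shell_runner)
#   print("Completed stages:", done)
```

Passing the runner as a parameter keeps the sequencing logic testable without launching a real training job.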

Demo

sh demo_scannet.sh

Evaluation

sh test_partnet_simple.sh
sh test_partnet_mid.sh
sh test_partnet_hard.sh
sh test_scannet_simple.sh
sh test_scannet_mid.sh
sh test_scannet_hard.sh
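
| Source | Success Rate (%), Simple | Success Rate (%), Mid | Success Rate (%), Hard | Contact Error, Simple | Contact Error, Mid | Contact Error, Hard | Success Steps, Simple | Success Steps, Mid | Success Steps, Hard |
|---|---|---|---|---|---|---|---|---|---|
| PartNet | 85.5 | 67.9 | 40.5 | 0.035 | 0.037 | 0.040 | 2.13 | 4.11 | 4.84 |
| ScanNet | 73.2 | 43.1 | 22.3 | 0.061 | 0.072 | 0.062 | 2.21 | 3.47 | 4.78 |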

The results will be saved in the "output" folder.

  • Expect roughly 10% variance in the results across runs due to randomness in sampling.
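When comparing your own numbers against the reported success rates, a simple tolerance check may be useful. The sketch below assumes the ~10% figure is read as a relative tolerance around the reported value, which is one interpretation and is not stated in this README. The helper name `within_tolerance` and the measured values in the usage lines are hypothetical.

```python
# Reported success rates (%) copied from the table in this README.
REPORTED_SUCCESS = {
    ("PartNet", "simple"): 85.5,
    ("PartNet", "mid"): 67.9,
    ("PartNet", "hard"): 40.5,
    ("ScanNet", "simple"): 73.2,
    ("ScanNet", "mid"): 43.1,
    ("ScanNet", "hard"): 22.3,
}


def within_tolerance(reported: float, measured: float,
                     rel_tol: float = 0.10) -> bool:
    """True if measured differs from reported by at most rel_tol * reported.

    Assumption: the ~10% variance is treated as a relative tolerance.
    """
    return abs(measured - reported) <= rel_tol * reported


# Hypothetical measured values, for illustration only.
reported = REPORTED_SUCCESS[("PartNet", "simple")]
print(within_tolerance(reported, 80.0))
print(within_tolerance(reported, 70.0))
```

If the variance is instead meant in absolute percentage points, replace the right-hand side of the comparison with a fixed threshold.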

🔗 Citation

If you find our work helpful, please cite:

@inproceedings{
  xiao2024unified,
  title={Unified Human-Scene Interaction via Prompted Chain-of-Contacts},
  author={Zeqi Xiao and Tai Wang and Jingbo Wang and Jinkun Cao and Wenwei Zhang and Bo Dai and Dahua Lin and Jiangmiao Pang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=1vCnDyQkjg}
}

📄 License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

👏 Acknowledgements

  • ASE: Our codebase is built upon the AMP implementation in ASE.
  • PartNet and ShapeNet: We use objects from PartNet for training and evaluation.
  • ScanNet: We use scenarios from ScanNet for evaluation.
  • SAMP: We use motion clips from SAMP for training.
  • CIRCLE: We use motion clips from CIRCLE for training.