T3VIP: Transformation-based 3D Video Prediction
Iman Nematollahi, Erick Rosete Beas, Seyed Mahdi B. Azad, Raghu Rajan, Frank Hutter, Wolfram Burgard
We present T3VIP, a transformation-based 3D video prediction approach that explicitly models 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and uses observational cues in the image and point cloud domains as its learning signals. To the best of our knowledge, it is the first generative model to provide RGB-D video predictions of the future for a static camera. Given its ability to learn long-range RGB-D predictions of the future from unlabeled experience, we hope that this repo paves the way for employing 3D world models for robots in the real world.
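To illustrate the core idea, here is a minimal, hypothetical sketch (these names are not the repo's API): the next point cloud is predicted by applying per-object rigid transforms and blending the results with soft object masks.

import torch

def warp_point_cloud(points, masks, rotations, translations):
    # points:       (N, 3)    current point cloud
    # masks:        (K, N)    soft assignment of each point to K object parts
    # rotations:    (K, 3, 3) predicted rotation per object part
    # translations: (K, 3)    predicted translation per object part
    # returns:      (N, 3)    predicted point cloud at the next time step
    transformed = torch.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    return (masks[:, :, None] * transformed).sum(dim=0)

# Toy example: a static background part (identity) and one part translated along x.
points = torch.rand(1024, 3)
masks = torch.softmax(torch.rand(2, 1024), dim=0)
rotations = torch.eye(3).repeat(2, 1, 1)
translations = torch.tensor([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0]])
next_points = warp_point_cloud(points, masks, rotations, translations)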
To begin, clone this repository locally:
git clone https://github.com/nematoli/t3vip.git
export T3VIP_ROOT=$(pwd)/t3vip
Install requirements:
cd $T3VIP_ROOT
conda create -n t3vip_venv python=3.7
conda activate t3vip_venv
sh install.sh
To download datasets used in this work, please follow this guide.
To download our final T3VIP models on different datasets, please follow this guide.
To train T3VIP models on each dataset, you need to specify the corresponding datamodule and model configs:
python train.py datamodule=dexhand model=T3VIP_dex
python train.py datamodule=omnipush model=T3VIP_omni
python train.py datamodule=calvin datamodule.dataset.env=env_c model=T3VIP_calvin
If you want to skip frames during data loading (e.g., skip 2 frames), set datamodule.skip_frames=2.
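For example, to train on CALVIN while skipping 2 frames (combining the overrides shown above):

python train.py datamodule=calvin datamodule.dataset.env=env_c model=T3VIP_calvin datamodule.skip_frames=2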
If you have access to a SLURM cluster, please follow this guide.
We compare our stochastic RGB-D video prediction model to the following well-established RGB video prediction baselines:
CDNA: Unsupervised Learning for Physical Interaction through Video Prediction, (Finn et al., 2016):
python train.py datamodule=dexhand model=CDNA
SNA: Self-supervised Visual Planning with Temporal Skip Connections, (Ebert et al., 2017):
python train.py datamodule=dexhand model=SNA
SV2P: Stochastic Variational Video Prediction, (Babaeizadeh et al., 2018):
python train.py datamodule=dexhand model=SV2P
You can change the datamodule variable to be either dexhand, calvin or omnipush.
You can evaluate your final trained model on the test sets. The evaluate.py script will fetch the latest checkpoint and run the evaluation. You need to specify the corresponding datamodule and model for your run. For example, to evaluate T3VIP on the omnipush test set, we run:
python evaluate.py datamodule=omnipush model=T3VIP_omni
If you want to change the length of the test video sequences, you can modify eval_seq_len.
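For example (assuming eval_seq_len is exposed as a top-level Hydra override; 16 is just an illustrative length):

python evaluate.py datamodule=omnipush model=T3VIP_omni eval_seq_len=16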
In order to automatically find the best set of hyperparameters of T3VIP for your own dataset, you need to set the Ray scheduler parameters to suit your resources and dataset. We used the following configs for finding hyperparameters on our datasets, with access to 8 parallel GPUs:
For the DexHand dataset:
python train_hpo.py datamodule=dexhand model=T3VIP ray.reports_per_epoch=1 ray.scheduler.max_t=63 ray.scheduler.grace_period=1 ray.num_samples=200 trainer.limit_val_batches=0.5
For the CALVIN dataset:
python train_hpo.py datamodule=calvin datamodule.dataset.env=env_c model=T3VIP ray.reports_per_epoch=27 ray.scheduler.max_t=27 ray.scheduler.grace_period=1 ray.num_samples=200 trainer.limit_val_batches=0.1
For the Omnipush dataset:
python train_hpo.py datamodule=omnipush model=T3VIP ray.reports_per_epoch=1 ray.scheduler.max_t=252 ray.scheduler.grace_period=1 ray.num_samples=200 trainer.limit_val_batches=1.0
If you find the code useful, please cite:
T3VIP
@inproceedings{nematollahi22iros,
  author    = {Iman Nematollahi and Erick Rosete-Beas and Seyed Mahdi B. Azad and Raghu Rajan and Frank Hutter and Wolfram Burgard},
  title     = {T3VIP: Transformation-based 3D Video Prediction},
  booktitle = {Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year      = {2022},
  address   = {Kyoto, Japan},
  url       = {http://ais.informatik.uni-freiburg.de/publications/papers/nematollahi22iros.pdf}
}
MIT License