This repository contains Pytorch code for the paper Zero-Shot Category-Level Object Pose Estimation (Goodwin et al., ECCV 2022) [arxiv].
-
Make environment:
conda env create -f environment.yml
-
Install Pillow < 7.0 with
pip
to overcome atorchvision
bug:pip install 'pillow<7'
-
Install Pytorch3D from Github:
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
Install the zsp
python package implemented in this repo with pip install -e .
⚠️ This repo uses CO3D Version 1: Use the correctv1
branch of the CO3D repo, available here! Meta has since released a v2 of the CO3D dataset, which is not currently supported but likely could be if somebody wanted to put in the effort!
This work uses the Common Objects in 3D (CO3D) dataset. The repo for this dataset, with download instructions, is here.
This dataset contains 18,619 multi-frame sequences capturing different instances of 50 object categories. For full dataset is around 1.4TB. For evaluation in this work, we manually annotated 10 sequences from each of 20 categories with ground-truth poses (these annotations are found under data/class_labels
). The relevant subset of the dataset is thus smaller at around ~15GB. If you are struggling to download the entire CO3D dataset, please contact me and I will try to share this subset with you.
This code uses DINO ViTs for feature extraction. Links to pre-trained weights can be found in this file. However, to just download the main model considered in this work:
wget https://dl.fbaipublicfiles.com/dino/dino_deitsmall8_pretrain/dino_deitsmall8_pretrain.pth
The directory to which you save this model can be passed as an argument to the main script.
cd zsp
python method/evaluate_ref_to_target_pose.py \
--co3d_root /path/to/co3d/dataset \
--hub_dir /path/to/saved/dino/weights/ \
--kmeans
By default, this will loop over the 20 categories in the labelled subset developed in this work, and draw 100 reference-target pairings from the 10 labelled sequences in each of these categories. To vary the number of target frames used (default = 5), change the --n_target
argument.
To plot results (correspondences, the closest matching frame, and renders of the aligned point clouds), pass --plot_results
.
If you use this code in your research, please consider citing our paper:
@InProceedings{goodwin2022,
author = {Walter Goodwin and Sagar Vaze and Ioannis Havoutis and Ingmar Posner},
title = {Zero-Shot Category-Level Object Pose Estimation},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
}