This is the official implementation of the paper "Monocular depth estimation using cues inspired by biological vision systems", accepted for publication at the International Conference on Pattern Recognition (ICPR) 2022.
It is a fork of the original AdaBins repository. The original README for that repository is included at the end of this one, for completeness.
This repository comprises several parts:
- The instance segmentation module,
- The semantic segmentation module,
- The depth estimation module.
The instance and semantic segmentation modules are off-the-shelf models. In some cases these have been modified to work with the required datasets, so the versions used in this work are included here for reference.
Instance and semantic segmentation are run offline (an end-to-end inference pipeline could be set up with some modification, but we have not done so due to computational constraints).
The best model in our work uses semantic segmentation from HRNetV2 pretrained on ADE20K, instance segmentation from Cascade Mask-RCNN with Swin-B backbone, pretrained on ADE20K, and the AdaBins pipeline with an EfficientNet-B1 instead of EfficientNet-B5 as the backbone.
The basic steps to get the best model running are:
- Get the NYUD2 and ADE20K datasets,
- Train the Cascade Mask-RCNN w/ Swin-B backbone on ADE20K,
- Run instance segmentation inference on NYUD2 with this model,
- Acquire the HRNetV2 ADE20K checkpoint,
- Run semantic segmentation inference on NYUD2 using the HRNetV2 checkpoint,
- Train the depth estimation module using the provided parameter files, as detailed below.
For other models that use different instance or semantic segmentation pipelines, a similar offline training/inference process must be used.
Running training or evaluation requires the NYU Depth V2 Dataset by Silberman et al. (Indoor Segmentation and Support Inference from RGBD Images, ECCV 2012). The specific version used for this work was acquired using the instructions from the official implementation of BTS (Lee et al. 2019). This version of the dataset has been converted to numpy format, and is in the folder structure that this repository requires.
The downloaded dataset should consist of a folder named `nyu`, containing two folders: `official_splits` and `sync`.
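As a quick sanity check, the following is a minimal sketch for loading one RGB/depth pair from the converted dataset. The sample paths are hypothetical, and the assumption that depth PNGs hold 16-bit millimetre values (the BTS convention) should be verified against your copy:

```python
import numpy as np
from PIL import Image

# Hypothetical sample paths inside the nyu/ folder; adjust to files that exist in your copy.
rgb_path = "nyu/official_splits/test/bathroom/rgb_00045.jpg"
depth_path = "nyu/official_splits/test/bathroom/sync_depth_00045.png"

rgb = np.asarray(Image.open(rgb_path))          # (H, W, 3) uint8 RGB image
depth_png = np.asarray(Image.open(depth_path))  # 16-bit depth map

# Assumption: depth is stored in millimetres (BTS convention), so divide by 1000 for metres.
depth_m = depth_png.astype(np.float32) / 1000.0

print(rgb.shape, depth_m.shape, float(depth_m.min()), float(depth_m.max()))
```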
Training some of the instance and semantic segmentation models used in this work requires the ADE20K Places Challenge subset.
1. Download the images.
2. Download the annotations.
3. Uncompress `images.tar` and `annotations_instance.tar` into a common folder, giving:

   ```
   /path/to/ADE20K
   +-- annotations_instance
   \-- images
   ```

4. Clone the Places Challenge Toolkit repo.
5. Run `instancesegmentation/evaluation/convert_anns_to_json_dataset.py` on the `training` and `validation` folders inside the `annotations_instance` folder, changing the output name to match. Place the resulting `instance_training_<splitname>_gts.json` files inside the `annotations_instance` folder.
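To confirm that step 5 produced usable files, the snippet below is a minimal sketch that loads one of the generated ground-truth JSON files and prints its top-level structure; the path is a placeholder and no particular schema is assumed:

```python
import json

# Placeholder path: substitute the actual <splitname> and your ADE20K location.
gts_path = "/path/to/ADE20K/annotations_instance/instance_training_<splitname>_gts.json"

with open(gts_path) as f:
    gts = json.load(f)

# Print the top-level keys and the number of records under each, without assuming a schema.
if isinstance(gts, dict):
    for key, value in gts.items():
        size = len(value) if hasattr(value, "__len__") else value
        print(key, size)
else:
    print(type(gts).__name__, len(gts))
```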
We provide a modified version of the official Swin repository, in the folder `Swin-Transformer-Object-Detection`. The key difference is the inclusion of new config files to allow training on ADE20K.
To train the Swin-B backbone Cascade Mask-RCNN on the ADE20K Places Challenge subset:
- Fetch the required pretraining checkpoint, `swin_base_patch4_window12_384_22kto1k.pth`, from this URL.
- Configure the Swin environment (we used Docker) according to the instructions in the README in that folder.
- Inside the Swin folder, open `./configs/_base_/datasets/ade20k_instance.py` in a text editor and modify:
  - Line 2: change `data_root` to point to the ADE20K folder,
  - Lines 38, 44, and 50: point to the training, validation, and validation (again) ground-truth JSON files. If you placed them according to step 5 in the ADE20K section above, these shouldn't need modification.
- Run:

  ```
  tools/dist_train.sh configs/swin/cascade_mask_rcnn_swin_base_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_ade20k.py 1 --cfg-options model.pretrained=checkpoints/swin_base_patch4_window12_384_22kto1k.pth
  ```

Note that something causes RAM usage to increase consistently every time evaluation is run. In the interests of transparency, we include the aptly named `keep_resuming_until_success.sh`, which restarts training if something causes it to fail.
This is an instance segmentation model, outputting semantic labels and, for each pixel, the area of the instance containing that pixel. Running this section generates the `instance_areas_ade20k_swin_<number>.npz` and `instance_labels_ade20k_swin_<number>.npz` files corresponding to the `rgb_<number>.jpg` and `sync_depth_<number>.png` files already present in the dataset.
- `cd` to the Swin folder in this repository.
- Run the `tools/nyud2_inference.py` script, using the following command:

  ```
  python tools/nyud2_inference.py --config configs/swin/cascade_mask_rcnn_swin_base_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_ade20k.py --checkpoint work_dirs/cascade_mask_rcnn_swin_base_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_ade20k/epoch_36.pth --images data/nyu
  ```

  Modify the path under `--images` to point to your NYUD2 folder.
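To inspect the generated instance files for a given sample, here is a minimal sketch that lists whatever arrays each `.npz` archive contains (the sample index is hypothetical, and no array names are assumed):

```python
import numpy as np

# Hypothetical sample index; use any <number> present in your NYUD2 folder.
areas = np.load("instance_areas_ade20k_swin_00001.npz")
labels = np.load("instance_labels_ade20k_swin_00001.npz")

# List the arrays stored in each archive, with their shapes and dtypes.
for name, archive in [("areas", areas), ("labels", labels)]:
    for key in archive.files:
        arr = archive[key]
        print(name, key, arr.shape, arr.dtype)
```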
This is a semantic-segmentation-only model. Running this section generates the `semantic_seg_<number>.npy` files corresponding to the `rgb_<number>.jpg` and `sync_depth_<number>.png` files already present in the dataset.
- `cd` into the `semantic-segmentation-pytorch` folder, which is a fork of the official ADE20K semantic segmentation toolkit.
- Download the HRNetV2 checkpoints from this link, and place them in the folder `ckpt/ade20k-hrnetv2-c1`.
- Follow any setup/installation instructions in the README in the `semantic-segmentation-pytorch` folder.
- Modify `demo_test.sh` on line 5 to point to your NYUD2 folder. This script will search the directory for images, run inference on them, and save the results back to the same directory with different filenames.
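Similarly, a minimal sketch for checking one of the generated semantic maps. The sample index is hypothetical, and the assumption that the array holds per-pixel class indices should be checked against your output:

```python
import numpy as np

# Hypothetical sample index; use any <number> present in your NYUD2 folder.
seg = np.load("semantic_seg_00001.npy")

# Assumption: the array is a per-pixel map of ADE20K class indices.
print(seg.shape, seg.dtype)
print("distinct labels in this image:", np.unique(seg))
```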
Evaluation is performed using `evaluate.py /path/to/params/file.txt`. To do so, the datasets must be set up as outlined in the prerequisites section.
A base parameter file for evaluation, `params/args_test_nyu_BASE.txt`, is included. Checkpoints are not provided with the supplementary material due to the file size limit.
First, modify the following arguments to point to the `nyu/sync/` and `nyu/official_splits/test/` directories respectively:

- `--data_path` and `--gt_path` should both have the absolute path of `nyu/sync/`,
- `--data_path_eval` and `--gt_path_eval` should both have the absolute path of `nyu/official_splits/test/`.
To run evaluation on a checkpoint, ensure that the following arguments match in both the evaluation parameter file, and in the training parameter file for the run that produced the checkpoint to be used:
- `--use_semantics`
- `--use_instance_segmentation`
- `--encoder_name`
- `--insertion_point`
If one of these is not present in the training params file, it should not be present in the evaluation file either.
You must also ensure that `--checkpoint_path` points to the checkpoint you wish to run evaluation on (and the checkpoint must have been produced with training parameters matching the above list).
All other parameters should match those in any of the provided evaluation parameter files.
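To catch mismatches before launching a run, the sketch below compares the flags listed above between a training and an evaluation parameter file. The file names are placeholders, and it assumes flags and values appear as whitespace-separated tokens in the parameter files; adjust the parsing if your files are laid out differently:

```python
from pathlib import Path

# Flags whose values must agree between the training and evaluation parameter files.
FLAGS_TO_MATCH = ["--use_semantics", "--use_instance_segmentation",
                  "--encoder_name", "--insertion_point"]

def read_flags(path):
    """Return {flag: value} for the flags of interest; absent flags are simply missing."""
    tokens = Path(path).read_text().split()
    found = {}
    for i, tok in enumerate(tokens):
        if tok in FLAGS_TO_MATCH:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # Treat a flag followed by another flag (or nothing) as a bare switch.
            found[tok] = None if (nxt is None or nxt.startswith("--")) else nxt
    return found

# Placeholder file names; point these at your own parameter files.
train_flags = read_flags("params/args_train_nyu_EXAMPLE.txt")
eval_flags = read_flags("params/args_test_nyu_BASE.txt")

for flag in FLAGS_TO_MATCH:
    t, e = train_flags.get(flag, "<absent>"), eval_flags.get(flag, "<absent>")
    status = "OK" if t == e else "MISMATCH"
    print(f"{status:8s} {flag}: train={t} eval={e}")
```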
Training is performed using `train.py /path/to/training/params.txt`. Training parameters as used for our experiments are provided in the `params` folder. We note that the system used for this work comprised 2x NVIDIA GeForce GTX 1080 graphics cards, with 8GB VRAM each.
To reproduce an experiment, modify the following arguments to point to the `nyu/sync/` and `nyu/official_splits/test/` directories respectively:

- `--data_path` and `--gt_path` should both have the absolute path of `nyu/sync/`,
- `--data_path_eval` and `--gt_path_eval` should both have the absolute path of `nyu/official_splits/test/`.
Also, modify the `--root` argument to point to the folder in which the experiment folder will be created. The experiment folder that is created will also contain the tensorboard run files, so setting your tensorboard `logdir` to match `--root` works well.
All other arguments should be left the same, and training should be performed on a system with similar hardware capabilities.
If running a new experiment that has not been run before, ensure that you modify the `--name` attribute, or you may end up overwriting a previous experiment.
All experiment names (parameter file names, the `--name` parameter in training parameters, and therefore the generated checkpoint names) detail the values for the following train/eval parameters:
- `--dataset`: Always `nyu`.
- `--encoder_name`: Always `efficientnet-b1`, but can be `efficientnet-b5` if sufficient compute is available.
- `--use_semantics` (sometimes prefaced with `sem_` in param filenames):
  - `glove-25d` uses results from running inference on NYUD2 with HRNetV2 trained on ADE20K, embedded with 25-dimensional GloVe embeddings (see the sketch below).
  - `glove-25d-ade20k-places` uses only the semantic labels generated by the Cascade Mask-RCNN with Swin-B backbone trained on the ADE20K Places Challenge subset. Encoded as `glove-25d`.
  - `glove-25d-ade20k-places-human-sizes` is as the previous, except that per-class absolute dimensions in metres are provided to the network. These are embedded as described in the paper.
- `--use_instance_segmentation` (sometimes prefaced with `inst_` in param file names):
  - `coco` uses instance segmentation and mask areas from running Mask-RCNN trained on MSCOCO on NYUD2. Class label names are embedded as in `glove-25d` semantics.
  - `ade20k_swin` uses labels and mask areas from the Cascade Mask-RCNN w/ Swin-B backbone trained on the ADE20K Places Challenge subset, run on NYUD2.
  - `ade20k_swin_human_sizes` is as the previous, except that it includes absolute per-label dimensions in metres. Full details are in the paper.
  - `ade20k_swin_bbox` is the same as `ade20k_swin` but uses bounding box areas instead of mask areas for the instance areas.
  - `ade20k_swin_bbox_human_sizes` is the same as `ade20k_swin_human_sizes` but uses bounding box instead of mask instance areas.
- `--insertion_point`: Always `input` (can also be `before-attn` to attach all added information after the encoder/decoder but before the AdaBins module, but this performs worse in all cases).
Our parameter file names match our checkpoint names.
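For reference, the sketch below shows one way class label names can be mapped to 25-dimensional GloVe vectors, as referenced in the `glove-25d` item above. The `glove-twitter-25` vectors from `gensim` are used purely for illustration; the GloVe release and label preprocessing used in the paper may differ:

```python
import numpy as np
import gensim.downloader as api

# Illustrative choice of pretrained 25-d GloVe vectors (downloaded on first use);
# the embeddings used in the paper may come from a different GloVe release.
glove = api.load("glove-twitter-25")

def embed_label(label):
    """Average the 25-d GloVe vectors of a label's words, skipping unknown words."""
    words = label.lower().replace("-", " ").split()
    vectors = [glove[w] for w in words if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(25, dtype=np.float32)

# Hypothetical class names of the kind produced by the segmentation models.
for label in ["chair", "coffee table", "wall"]:
    print(label, embed_label(label).shape)
```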
Official implementation of Adabins: Depth Estimation using adaptive bins
- You can download the pretrained models "AdaBins_nyu.pt" and "AdaBins_kitti.pt" from here
- You can download the predicted depths in 16-bit format for NYU-Depth-v2 official test set and KITTI Eigen split test set here
Move the downloaded weights to a directory of your choice (we will use "./pretrained/" here). You can then use the pretrained models like so:
```python
from models import UnetAdaptiveBins
import model_io
from PIL import Image

MIN_DEPTH = 1e-3
MAX_DEPTH_NYU = 10
MAX_DEPTH_KITTI = 80
N_BINS = 256

# NYU
model = UnetAdaptiveBins.build(n_bins=N_BINS, min_val=MIN_DEPTH, max_val=MAX_DEPTH_NYU)
pretrained_path = "./pretrained/AdaBins_nyu.pt"
model, _, _ = model_io.load_checkpoint(pretrained_path, model)

# example_rgb_batch: your preprocessed RGB batch tensor of shape (N, 3, H, W)
bin_edges, predicted_depth = model(example_rgb_batch)

# KITTI
model = UnetAdaptiveBins.build(n_bins=N_BINS, min_val=MIN_DEPTH, max_val=MAX_DEPTH_KITTI)
pretrained_path = "./pretrained/AdaBins_kitti.pt"
model, _, _ = model_io.load_checkpoint(pretrained_path, model)

bin_edges, predicted_depth = model(example_rgb_batch)
```
Note that the model returns bin-edges (instead of bin-centers).
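If you call the model directly and need bin centers, a minimal sketch (assuming `bin_edges` has shape `(N, n_bins + 1)`, with one more edge than there are bins):

```python
# Midpoints of adjacent edges give the bin centers, shape (N, n_bins).
bin_centers = 0.5 * (bin_edges[:, :-1] + bin_edges[:, 1:])
```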
Recommended way: the `InferenceHelper` class in `infer.py` provides an easy interface for inference and handles various types of inputs (with any preprocessing required). It uses Test-Time Augmentation (horizontal flips) and also calculates bin centers for you:
```python
from infer import InferenceHelper
from PIL import Image

infer_helper = InferenceHelper(dataset='nyu')

# predict depth of a batched rgb tensor
example_rgb_batch = ...
bin_centers, predicted_depth = infer_helper.predict(example_rgb_batch)

# predict depth of a single pillow image
img = Image.open("test_imgs/classroom__rgb_00283.jpg")  # any rgb pillow image
bin_centers, predicted_depth = infer_helper.predict_pil(img)

# predict depths of images stored in a directory and store the predictions in 16-bit format in a separate dir
infer_helper.predict_dir("/path/to/input/dir/containing_only_images/", "path/to/output/dir/")
```
- Add instructions for Evaluation and Training.
- Add Colab demo
- Add UI demo
- Remove unnecessary dependencies