Depth Anything V2 for Metric Depth Estimation

Here we provide a simple codebase for fine-tuning our Depth Anything V2 pre-trained encoder for metric depth estimation. On top of our powerful encoder, we use a simple DPT head to regress the depth. We fine-tune the pre-trained encoder on the synthetic Hypersim and Virtual KITTI 2 datasets for indoor and outdoor metric depth estimation, respectively.

Pre-trained Models

We provide six metric depth models: three scales, each in an indoor and an outdoor version.

| Base Model | Params | Indoor (Hypersim) | Outdoor (Virtual KITTI 2) |
|---|---|---|---|
| Depth-Anything-V2-Small | 24.8M | Download | Download |
| Depth-Anything-V2-Base | 97.5M | Download | Download |
| Depth-Anything-V2-Large | 335.3M | Download | Download |

We recommend first trying our larger models (if the computational cost is affordable) and the indoor version.

Usage

Preparation

git clone https://github.com/DepthAnything/Depth-Anything-V2
cd Depth-Anything-V2/metric_depth
pip install -r requirements.txt

Download the checkpoints listed under Pre-trained Models above and put them under the checkpoints directory.

Use our models

import cv2
import torch

from depth_anything_v2.dpt import DepthAnythingV2

model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]}
}

encoder = 'vitl' # or 'vits', 'vitb'
dataset = 'hypersim' # 'hypersim' for indoor model, 'vkitti' for outdoor model
max_depth = 20 # 20 for indoor model, 80 for outdoor model

model = DepthAnythingV2(**{**model_configs[encoder], 'max_depth': max_depth})
model.load_state_dict(torch.load(f'checkpoints/depth_anything_v2_metric_{dataset}_{encoder}.pth', map_location='cpu'))
model.eval()

raw_img = cv2.imread('your/image/path')
depth = model.infer_image(raw_img) # HxW depth map in meters in numpy
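
Since `infer_image` returns a float depth map in meters, you may want to store it losslessly or preview it as an image. The following is a minimal sketch using only NumPy. The function names and the scale of 256 units per meter are illustrative assumptions, not part of this codebase; choose whatever scale your downstream tools expect.

```python
import numpy as np

def depth_to_uint16(depth_m, max_depth_m, scale=256.0):
    """Encode a float depth map (meters) as uint16 using `scale` units per meter.

    `scale` is an illustrative assumption; decode with value / scale.
    """
    if max_depth_m * scale > np.iinfo(np.uint16).max:
        raise ValueError("max_depth_m * scale does not fit into uint16")
    depth = np.asarray(depth_m, dtype=np.float32)
    # Replace non-finite values, then clip to the valid metric range.
    depth = np.nan_to_num(depth, nan=0.0, posinf=max_depth_m, neginf=0.0)
    depth = np.clip(depth, 0.0, max_depth_m)
    return np.round(depth * scale).astype(np.uint16)

def depth_to_uint8_preview(depth_m):
    """Min-max normalize a depth map to 0..255 for quick visual inspection only."""
    depth = np.asarray(depth_m, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        # A constant depth map has no contrast to show.
        return np.zeros(depth.shape, dtype=np.uint8)
    return np.round((depth - lo) / (hi - lo) * 255.0).astype(np.uint8)
```

The uint16 array can be written as a 16-bit PNG, for example with `cv2.imwrite`. The 8-bit preview discards the metric scale and is only meant for viewing.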

Running the script on images

Here, we take the vitl encoder as an example. You can also use vitb or vits encoders.

# indoor scenes
python run.py \
  --encoder vitl \
  --load-from checkpoints/depth_anything_v2_metric_hypersim_vitl.pth \
  --max-depth 20 \
  --img-path <path> --outdir <outdir> [--input-size <size>] [--save-numpy]

# outdoor scenes
python run.py \
  --encoder vitl \
  --load-from checkpoints/depth_anything_v2_metric_vkitti_vitl.pth \
  --max-depth 80 \
  --img-path <path> --outdir <outdir> [--input-size <size>] [--save-numpy]
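
The two invocations above differ only in the checkpoint and the maximum depth. As a small sketch, the pairing can be captured in a helper so that the dataset name, checkpoint path, and `--max-depth` value never drift apart. The helper names are hypothetical; the checkpoint naming pattern and the 20 / 80 values are taken from the examples in this README.

```python
# Maximum depth in meters for each fine-tuning dataset, as used in this README.
MAX_DEPTH_M = {'hypersim': 20, 'vkitti': 80}
ENCODERS = ('vits', 'vitb', 'vitl')

def metric_model_settings(dataset, encoder, checkpoint_dir='checkpoints'):
    """Return (checkpoint_path, max_depth) for a dataset / encoder pair."""
    if dataset not in MAX_DEPTH_M:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if encoder not in ENCODERS:
        raise ValueError(f"unknown encoder: {encoder!r}")
    path = f'{checkpoint_dir}/depth_anything_v2_metric_{dataset}_{encoder}.pth'
    return path, MAX_DEPTH_M[dataset]

def run_command(dataset, encoder, img_path, outdir):
    """Build the run.py argument list for the chosen model (hypothetical helper)."""
    path, max_depth = metric_model_settings(dataset, encoder)
    return ['python', 'run.py',
            '--encoder', encoder,
            '--load-from', path,
            '--max-depth', str(max_depth),
            '--img-path', img_path,
            '--outdir', outdir]
```

The returned list can be passed to `subprocess.run` from the `metric_depth` directory.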

Project 2D images to point clouds:

python depth_to_pointcloud.py \
  --encoder vitl \
  --load-from checkpoints/depth_anything_v2_metric_hypersim_vitl.pth \
  --max-depth 20 \
  --img-path <path> --outdir <outdir>
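
For intuition, projecting a metric depth map to a point cloud amounts to back-projecting each pixel through a pinhole camera model. The sketch below is not the code of `depth_to_pointcloud.py`, which may differ. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions that you would replace with the values for your camera; the principal point defaults to the image center.

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx=None, cy=None):
    """Back-project an HxW depth map (meters) to an (H*W, 3) array of XYZ points.

    Assumes an ideal pinhole camera without lens distortion.
    """
    depth = np.asarray(depth_m, dtype=np.float64)
    h, w = depth.shape
    # Default the principal point to the image center (an assumption).
    cx = (w - 1) / 2.0 if cx is None else cx
    cy = (h - 1) / 2.0 if cy is None else cy
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

Because the depth is metric, the resulting coordinates are in meters as long as the focal lengths are given in pixels.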

Reproduce training

Please first prepare the Hypersim and Virtual KITTI 2 datasets. Then:

bash dist_train.sh

Citation

If you find this project useful, please consider citing:

@article{depth_anything_v2,
  title={Depth Anything V2},
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Zhao, Zhen and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  journal={arXiv:2406.09414},
  year={2024}
}

@inproceedings{depth_anything_v1,
  title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data}, 
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  booktitle={CVPR},
  year={2024}
}