Add ZoeDepth #30136

Merged 108 commits on Jul 8, 2024. Diff shown from the first 100 commits.

Commits
c074ce8
First draft
NielsRogge Jan 29, 2024
ca4d141
Fix merge
NielsRogge Mar 28, 2024
420397f
Add docs
NielsRogge Mar 28, 2024
02d775a
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge Mar 28, 2024
de9d51e
Clean up code
NielsRogge Apr 1, 2024
705f4c6
Convert model
NielsRogge Apr 1, 2024
8080b35
Add image processor
NielsRogge Apr 1, 2024
7e511c2
Convert Zoe_K
NielsRogge Apr 1, 2024
ceb079b
More improvements
NielsRogge Apr 1, 2024
331b48d
Improve variable names and docstrings
NielsRogge Apr 3, 2024
be57cc6
Improve variable names
NielsRogge Apr 3, 2024
27c013c
Improve variable names
NielsRogge Apr 3, 2024
090bb82
Replace nn.sequential
NielsRogge Apr 4, 2024
74088b3
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge Apr 8, 2024
712b483
More improvements
NielsRogge Apr 8, 2024
a1f9520
Convert ZoeD_NK
NielsRogge Apr 8, 2024
04cd658
Fix most tests
NielsRogge Apr 8, 2024
73bd15e
Verify pixel values
NielsRogge Apr 8, 2024
2198070
Verify pixel values
NielsRogge Apr 8, 2024
470856b
Add squeeze
NielsRogge Apr 8, 2024
ad188e5
Update beit to support arbitrary window sizes
NielsRogge Apr 8, 2024
f422f24
Improve image processor
NielsRogge Apr 8, 2024
8c611c3
Improve docstring
NielsRogge Apr 8, 2024
69b3593
Improve beit
NielsRogge Apr 8, 2024
35f86df
Improve model outputs
NielsRogge Apr 8, 2024
0146011
Add figure
NielsRogge Apr 9, 2024
a8d7739
Fix beit
NielsRogge Apr 9, 2024
46a6479
Update checkpoint
NielsRogge Apr 11, 2024
b693dd4
Fix repo id
NielsRogge Apr 11, 2024
bb16289
Add _keys_to_ignore_on_load_unexpected
NielsRogge Apr 11, 2024
a9b3070
More improvements
NielsRogge Apr 11, 2024
47ac850
Address comments
NielsRogge Apr 20, 2024
ba496e5
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge Apr 20, 2024
ef8412e
Address comments
NielsRogge Apr 20, 2024
40c8f3d
Address comments
NielsRogge Apr 20, 2024
d52b425
Address comments
NielsRogge Apr 20, 2024
ff70bff
Rename variable name
NielsRogge Apr 20, 2024
75a8ccb
Add backbone_hidden_size
NielsRogge Apr 20, 2024
94afdbd
Vectorize
NielsRogge Apr 20, 2024
66f3b21
Vectorize more
NielsRogge Apr 20, 2024
da53475
Address comments
NielsRogge Apr 21, 2024
76cc537
Clarify docstring
NielsRogge Apr 21, 2024
b9f59e2
Remove backbone_hidden_size
NielsRogge Apr 21, 2024
c7020b1
Fix image processor
NielsRogge Apr 22, 2024
313e5f1
Remove print statements
NielsRogge Apr 22, 2024
2ec7f1a
Remove print statement
NielsRogge Apr 22, 2024
070aed2
Fix merge
NielsRogge Apr 24, 2024
c1d2954
Add integration test
NielsRogge Apr 24, 2024
3d1cc8f
Address comments
NielsRogge Apr 27, 2024
973b538
Address comments
NielsRogge Apr 27, 2024
dcdcc7f
Address comments
NielsRogge Apr 27, 2024
5f3cdf8
Address comments
NielsRogge Apr 27, 2024
0f5b499
Add requires_backends
NielsRogge Apr 27, 2024
f9bd852
Clean up
NielsRogge Apr 27, 2024
5febc7f
Simplify conversion script
NielsRogge Apr 27, 2024
fd97f14
Simplify more
NielsRogge Apr 27, 2024
e69fbb8
Simplify more
NielsRogge Apr 27, 2024
e097e1f
Simplify more
NielsRogge Apr 27, 2024
53a1992
Clean up
NielsRogge Apr 27, 2024
f06d878
Make sure beit is loaded correctly
NielsRogge Apr 27, 2024
d044b2b
Address comment
NielsRogge Apr 29, 2024
373ff0f
Address bin_configurations
NielsRogge Apr 30, 2024
ccc8d46
Use bin_configurations
NielsRogge Apr 30, 2024
46bf21c
Convert models, add integration tests
NielsRogge Apr 30, 2024
2331935
Fix doc test
NielsRogge May 1, 2024
57af8d7
Address comments
NielsRogge May 1, 2024
bb8e7b2
Unify regressor classes
NielsRogge May 2, 2024
96777f8
Clarify arguments
NielsRogge May 2, 2024
30a3ac9
Improve resize_image
NielsRogge May 2, 2024
fb727f1
Add num_relative_features
NielsRogge May 2, 2024
7fc2b3b
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge May 2, 2024
a558e02
Address comment
NielsRogge May 2, 2024
25e4319
[run-slow]beit,data2vec,zoedepth
NielsRogge May 3, 2024
a592c8d
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge May 4, 2024
1397644
[run-slow]beit,data2vec,zoedepth
NielsRogge May 5, 2024
d77929f
Address comments
NielsRogge May 15, 2024
db1e4ad
Address comment
NielsRogge May 15, 2024
5c80182
Fix merge
NielsRogge May 17, 2024
979a4bb
Address comment
NielsRogge May 19, 2024
234cbf2
Replace nn.TransformerEncoderLayer and nn.TransformerEncoder
NielsRogge May 20, 2024
1244d59
Replace nn.MultiheadAttention
NielsRogge May 20, 2024
7d0e82b
Add attributes for patch transformer to config
NielsRogge May 20, 2024
d194245
Add tests for ensure_multiple_of
NielsRogge May 20, 2024
7415bd4
Update organization
NielsRogge May 20, 2024
e19921b
Add tests
NielsRogge May 24, 2024
fbcffcc
[run-slow] beit data2vec
NielsRogge May 24, 2024
bcc3108
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge May 24, 2024
3ece046
Update ruff
NielsRogge May 24, 2024
4893ec9
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge May 27, 2024
9d82897
[run-slow] beit data2vec
NielsRogge May 27, 2024
dcdf9b6
Add comment
NielsRogge May 27, 2024
344a565
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge May 30, 2024
525869e
Improve docstrings, add test
NielsRogge Jun 6, 2024
5aad6c7
Fix merge
NielsRogge Jun 7, 2024
bcd7ae1
Fix interpolate_pos_encoding
NielsRogge Jun 7, 2024
39e2ca3
Fix slow tests
NielsRogge Jun 7, 2024
bde8dda
Add docstring
NielsRogge Jun 8, 2024
80bb3ed
Update src/transformers/models/zoedepth/image_processing_zoedepth.py
NielsRogge Jun 16, 2024
6137387
Update src/transformers/models/zoedepth/image_processing_zoedepth.py
NielsRogge Jun 16, 2024
e6d8aac
Improve tests and docstrings
NielsRogge Jun 16, 2024
dffbfea
Fix merge
NielsRogge Jun 28, 2024
34a1abb
Use run_common_tests
NielsRogge Jun 28, 2024
1bcd19f
Improve docstrings
NielsRogge Jun 28, 2024
c6e5d6f
Improve docstrings
NielsRogge Jun 28, 2024
8ac163f
Improve tests
NielsRogge Jun 28, 2024
6cb3c56
Improve tests
NielsRogge Jun 28, 2024
b2534b2
Remove print statements
NielsRogge Jun 28, 2024
617487d
Merge remote-tracking branch 'upstream/main' into add_zoedepth
NielsRogge Jul 4, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
Original file line number Diff line number Diff line change
@@ -661,6 +661,8 @@
title: ViTMSN
- local: model_doc/yolos
title: YOLOS
- local: model_doc/zoedepth
title: ZoeDepth
title: Vision models
- isExpanded: false
sections:
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -338,5 +338,6 @@ Flax), PyTorch, and/or TensorFlow.
| [XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2) | ✅ | ✅ | ✅ |
| [YOLOS](model_doc/yolos) | ✅ | ❌ | ❌ |
| [YOSO](model_doc/yoso) | ✅ | ❌ | ❌ |
| [ZoeDepth](model_doc/zoedepth) | ✅ | ❌ | ❌ |

<!-- End table-->
108 changes: 108 additions & 0 deletions docs/source/en/model_doc/zoedepth.md
@@ -0,0 +1,108 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# ZoeDepth

## Overview

The ZoeDepth model was proposed in [ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth](https://arxiv.org/abs/2302.12288) by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the [DPT](dpt) framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.

The abstract from the paper is the following:

*This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/zoedepth_architecture_bis.png"
alt="drawing" width="600"/>

<small> ZoeDepth architecture. Taken from the <a href="https://arxiv.org/abs/2302.12288">original paper.</a> </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/isl-org/ZoeDepth).

## Usage tips

- ZoeDepth is an absolute (also called metric) depth estimation model, unlike DPT which is a relative depth estimation model. This means that ZoeDepth is able to estimate depth in metric units like meters.

The easiest way to perform inference with ZoeDepth is by leveraging the [pipeline API](../main_classes/pipelines.md):

```python
from transformers import pipeline
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

pipe = pipeline(task="depth-estimation", model="Intel/zoedepth-nyu-kitti")
result = pipe(image)
depth = result["depth"]
```

Alternatively, one can also perform inference using the image processor and model classes directly:

```python
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu-kitti")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu-kitti")

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

predicted_depth = outputs.predicted_depth

# interpolate to original size
```

> **Collaborator:** This should be part of a post processing method in the image processor.
>
> **Collaborator:** If this is to be done in a follow-up, an issue should be made to make sure it's actually done.
>
> **Contributor Author:** Issue has been opened: #30917

```python
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
```
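As the reviewers note above, the interpolation step is meant to move into a post-processing method on the image processor (tracked in issue #30917). A minimal sketch of what such a helper could look like — the name `post_process_depth_estimation` and its signature are assumptions for illustration, not the merged API:

```python
import torch

def post_process_depth_estimation(predicted_depth, target_sizes):
    """Hypothetical helper: resize raw depth maps to the original image sizes.

    predicted_depth: tensor of shape (batch, height, width) as returned by the model.
    target_sizes: list of (height, width) tuples, one per image.
    """
    results = []
    for depth, (height, width) in zip(predicted_depth, target_sizes):
        # (H, W) -> (1, 1, H, W) so interpolate can operate, then squeeze back.
        resized = torch.nn.functional.interpolate(
            depth[None, None, ...], size=(height, width), mode="bicubic", align_corners=False
        )
        results.append({"predicted_depth": resized.squeeze(0).squeeze(0)})
    return results
```

This mirrors the manual interpolation shown above, just batched and wrapped so callers no longer need to touch `torch.nn.functional` themselves.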

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ZoeDepth.

- A demo notebook regarding inference with ZoeDepth models can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ZoeDepth). 🌎

## ZoeDepthConfig

[[autodoc]] ZoeDepthConfig

## ZoeDepthImageProcessor

[[autodoc]] ZoeDepthImageProcessor
- preprocess

## ZoeDepthForDepthEstimation

[[autodoc]] ZoeDepthForDepthEstimation
- forward
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -795,6 +795,7 @@
"models.xmod": ["XmodConfig"],
"models.yolos": ["YolosConfig"],
"models.yoso": ["YosoConfig"],
"models.zoedepth": ["ZoeDepthConfig"],
"onnx": [],
"pipelines": [
"AudioClassificationPipeline",
@@ -1168,6 +1169,7 @@
_import_structure["models.vitmatte"].append("VitMatteImageProcessor")
_import_structure["models.vivit"].append("VivitImageProcessor")
_import_structure["models.yolos"].extend(["YolosFeatureExtractor", "YolosImageProcessor"])
_import_structure["models.zoedepth"].append("ZoeDepthImageProcessor")


# PyTorch-backed objects
@@ -3527,6 +3529,12 @@
"YosoPreTrainedModel",
]
)
_import_structure["models.zoedepth"].extend(
    [
        "ZoeDepthForDepthEstimation",
        "ZoeDepthPreTrainedModel",
    ]
)
_import_structure["optimization"] = [
"Adafactor",
"AdamW",
@@ -5423,6 +5431,7 @@
from .models.xmod import XmodConfig
from .models.yolos import YolosConfig
from .models.yoso import YosoConfig
from .models.zoedepth import ZoeDepthConfig

# Pipelines
from .pipelines import (
@@ -5796,6 +5805,7 @@
from .models.vitmatte import VitMatteImageProcessor
from .models.vivit import VivitImageProcessor
from .models.yolos import YolosFeatureExtractor, YolosImageProcessor
from .models.zoedepth import ZoeDepthImageProcessor

# Modeling
try:
@@ -7688,6 +7698,10 @@
YosoModel,
YosoPreTrainedModel,
)
from .models.zoedepth import (
    ZoeDepthForDepthEstimation,
    ZoeDepthPreTrainedModel,
)

# Optimization
from .optimization import (
10 changes: 5 additions & 5 deletions src/transformers/image_utils.py
@@ -363,22 +363,22 @@ def validate_preprocess_arguments(

"""
 if do_rescale and rescale_factor is None:
-    raise ValueError("rescale_factor must be specified if do_rescale is True.")
+    raise ValueError("`rescale_factor` must be specified if `do_rescale` is `True`.")

 if do_pad and size_divisibility is None:
     # Here, size_divisor might be passed as the value of size
     raise ValueError(
-        "Depending on moel, size_divisibility, size_divisor, pad_size or size must be specified if do_pad is True."
+        "Depending on the model, `size_divisibility`, `size_divisor`, `pad_size` or `size` must be specified if `do_pad` is `True`."
     )

 if do_normalize and (image_mean is None or image_std is None):
-    raise ValueError("image_mean and image_std must both be specified if do_normalize is True.")
+    raise ValueError("`image_mean` and `image_std` must both be specified if `do_normalize` is `True`.")

 if do_center_crop and crop_size is None:
-    raise ValueError("crop_size must be specified if do_center_crop is True.")
+    raise ValueError("`crop_size` must be specified if `do_center_crop` is `True`.")

 if do_resize and (size is None or resample is None):
-    raise ValueError("size and resample must be specified if do_resize is True.")
+    raise ValueError("`size` and `resample` must be specified if `do_resize` is `True`.")


# In the future we can add a TF implementation here when we have TF models.
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -259,4 +259,5 @@
xmod,
yolos,
yoso,
zoedepth,
)
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -286,6 +286,7 @@
("xmod", "XmodConfig"),
("yolos", "YolosConfig"),
("yoso", "YosoConfig"),
("zoedepth", "ZoeDepthConfig"),
]
)

@@ -578,6 +579,7 @@
("xmod", "X-MOD"),
("yolos", "YOLOS"),
("yoso", "YOSO"),
("zoedepth", "ZoeDepth"),
]
)

1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -128,6 +128,7 @@
("vitmatte", "VitMatteImageProcessor"),
("xclip", "CLIPImageProcessor"),
("yolos", "YolosImageProcessor"),
("zoedepth", "ZoeDepthImageProcessor"),
]
)

1 change: 1 addition & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -785,6 +785,7 @@
("depth_anything", "DepthAnythingForDepthEstimation"),
("dpt", "DPTForDepthEstimation"),
("glpn", "GLPNForDepthEstimation"),
("zoedepth", "ZoeDepthForDepthEstimation"),
]
)
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
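The auto-mapping entries added across these files follow one pattern: each `model_type` string maps to a class name in an `OrderedDict`, which the Auto classes consult when loading a checkpoint. A simplified sketch of that lookup (the real implementation is lazy and handles many more cases; `resolve_depth_model_class` is an illustrative name, not a transformers function):

```python
from collections import OrderedDict

# Simplified stand-in for the registry shown in the modeling_auto.py diff above.
MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = OrderedDict(
    [
        ("depth_anything", "DepthAnythingForDepthEstimation"),
        ("dpt", "DPTForDepthEstimation"),
        ("glpn", "GLPNForDepthEstimation"),
        ("zoedepth", "ZoeDepthForDepthEstimation"),
    ]
)

def resolve_depth_model_class(model_type: str) -> str:
    # An Auto class reads `model_type` from the checkpoint's config.json
    # and instantiates the matching concrete class.
    try:
        return MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type for depth estimation: {model_type}")

print(resolve_depth_model_class("zoedepth"))  # ZoeDepthForDepthEstimation
```

This is why the one-line `("zoedepth", "ZoeDepthForDepthEstimation")` addition is enough for `AutoModelForDepthEstimation.from_pretrained("Intel/zoedepth-nyu-kitti")` to resolve the new model.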