
fix(DPT,Depth-Anything) torch.export #34103

Merged · 7 commits merged into huggingface:main on Nov 20, 2024

Conversation

@philkuz (Contributor) commented Oct 12, 2024

What does this PR do?

Small modification of the DPT modeling code to remove a new-object creation in a forward() method of a Module. This object creation makes the model incompatible with torch.export, which is a key part of preparing a model to run on a variety of hardware backends through projects such as ExecuTorch (related issue: #32253).

Motivation

torch.export allows you to export PyTorch models into standardized model representations, intended to be optimized and run efficiently using frameworks such as TensorRT or ExecuTorch.

The Bug

The key issue was the slice on self.layers:

for hidden_state, layer in zip(hidden_states[1:], self.layers[1:]):

self.layers[1:] creates a new ModuleList() each time this line is executed.

https://github.com/pytorch/pytorch/blob/69bcf1035e7f06f2eefd8986d000cc980e9ebd37/torch/nn/modules/container.py#L330
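A quick way to see this behavior (a standalone sketch, not the DPT code):

```python
import torch.nn as nn

layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))
tail = layers[1:]

# Slicing a ModuleList returns a brand-new ModuleList...
assert isinstance(tail, nn.ModuleList)
assert tail is not layers
assert len(tail) == 2

# ...and a fresh one is constructed on every slice, so doing this inside
# forward() means constructing a new Module on every call.
assert layers[1:] is not layers[1:]
```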

The model tracer in torch.export monkey-patches nn.Module constructors during evaluation of the forward() pass, so the original DPT modeling code raises the following error:

  File "/home/philkuz/.pyenv/versions/gml311/lib/python3.11/site-packages/torch/nn/modules/container.py", line 293, in __getitem__
    return self.__class__(list(self._modules.values())[idx])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  TypeError: _ModuleStackTracer.__init__.<locals>.AttrProxy.__init__() missing 1 required positional argument: 'path'

The Solution

PyTorch recommends that users update the modeling code. My team and I figured this could be helpful to the broader community, especially in a future where export to ExecuTorch becomes more widely available: #32253

As a bonus, this also removes the unnecessary creation of a new ModuleList on every forward pass.

Tests

I ensured that tests/models/dpt/test_modeling_dpt.py passes, which appears to test a portion of the outputs. I also verified that the entire output of the model matched before and after my changes, using the following script:

import os
import sys

import numpy as np
import requests
import torch
from PIL import Image
from transformers import pipeline

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)


model = pipeline("depth-estimation", "facebook/dpt-dinov2-base-kitti")
result = model(image)


output_file = "depth_estimation_output.npy"

if not os.path.exists(output_file):
    # Save the current output
    np.save(output_file, result["predicted_depth"])
    print(f"Depth estimation output saved to {output_file}")
    print("Rerun the script to compare the output")
    sys.exit(0)
# Load existing output and compare
expected_output = np.load(output_file)
np.testing.assert_allclose(
    result["predicted_depth"],
    expected_output,
    rtol=1e-5,
    atol=1e-5,
    err_msg="Depth estimation output has changed",
)
print("Depth estimation output matches the saved version.")

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts, @qubvel

@philkuz philkuz force-pushed the torch_export_dpt_based_models branch from aa7d562 to 2e15b57 Compare October 12, 2024 00:36
@philkuz philkuz changed the title Fix torch.export issue in dpt based models Fix torch.export issue in DPT based models Oct 14, 2024
@philkuz philkuz changed the title Fix torch.export issue in DPT based models Support torch.export/ExecuTorch for DPT-based models Oct 14, 2024
@philkuz philkuz changed the title Support torch.export/ExecuTorch for DPT-based models Make DPT-based models compatible with torch.export Oct 14, 2024
@qubvel (Member) left a comment

Very nice, thanks for unlocking more models for torch export, this is very valuable!

The same comment as for the Mask2Former PR: it would be great to have this PR tested, and please push a run-slow commit to trigger all tests at the end!

src/transformers/models/dpt/modeling_dpt.py (outdated review thread, resolved)
@philkuz philkuz changed the title Make DPT-based models compatible with torch.export fix(DPT,Depth-Anything) torch.export Oct 29, 2024
@philkuz philkuz force-pushed the torch_export_dpt_based_models branch from 2e15b57 to 4c65b67 Compare October 29, 2024 23:09
@philkuz (Contributor, Author) commented Oct 29, 2024

> Very nice, thanks for unlocking more models for torch export, this is very valuable!
>
> The same comment as for the Mask2Former PR: it would be great to have this PR tested, and please push a run-slow commit to trigger all tests at the end!

I've added Depth Anything to this PR; I'm not entirely sure whether I've triggered the run_slow test for it and DPT correctly. Happy to split it off into a separate PR.

I also ran into an issue with ZoeDepth not working because of the BEiT backbone. I suspect that will take more time to properly address. I added a skipped test to ZoeDepth, but I can also remove that test entirely and add it in a WIP PR.

@qubvel qubvel self-requested a review October 30, 2024 07:50
@qubvel (Member) left a comment

Thanks for the update! Regarding ZoeDepth, I suggest either fixing the model export or excluding this model from the PR. A skipped test is not the best solution, because it might stay stuck in this state for a very long time 😄

To trigger multiple models' slow tests you can list them as follows: [run_slow] depth_anything, dpt, zoedepth

@philkuz (Contributor, Author) commented Oct 30, 2024

> Thanks for the update! Regarding ZoeDepth, I suggest either fixing the model export or excluding this model from the PR. A skipped test is not the best solution, because it might stay stuck in this state for a very long time 😄
>
> To trigger multiple models' slow tests you can list them as follows: [run_slow] depth_anything, dpt, zoedepth

I have to add some of the model changes because of the copy-consistency check, but I'll remove the ReLU change and the torch.export test!

Thanks for the heads up on slow tests.

@philkuz philkuz force-pushed the torch_export_dpt_based_models branch from 4c65b67 to bc0633a Compare October 30, 2024 16:33
@philkuz (Contributor, Author) commented Oct 30, 2024

@qubvel could you approve the slow workflow?

@qubvel (Member) commented Oct 30, 2024

> I have to add some of the model changes because of the copy-consistency check

Can you provide a bit more detail on this? Can I somehow help enable torch.export for ZoeDepth?

@philkuz (Contributor, Author) commented Oct 30, 2024

> I have to add some of the model changes because of the copy-consistency check
>
> Can you provide a bit more detail on this? Can I somehow help enable torch.export for ZoeDepth?

I'm not 100% sure this is part of the CI, but the contributing guide asks you to run the repo-consistency check:

make repo-consistency

which throws an error in python utils/check_copies.py if you don't update ZoeDepth to match DPT (ZoeDepth copied many layers from DPT: https://github.com/philkuz/transformers/blob/bc0633a82cbfe8d828fa2d3b432dfde4fbd2f0e5/src/transformers/models/zoedepth/modeling_zoedepth.py#L175).

So basically I have to include those shared changes: https://github.com/huggingface/transformers/pull/34103/files#diff-02337c86e3fba49173cf2cb6fa1595ed168db19726938aec925b8b010a3b6a8c

The current crux of ZoeDepth is that BEiT, the backbone of all the HF Hub checkpoints for ZoeDepth, isn't compatible. That issue has to be addressed first, and I haven't had time to address it yet.


@philkuz (Contributor, Author) commented Oct 30, 2024

The slow tests are failing, but I think they're broken on main as well. Here's a repro:

git checkout main
# DPT Failures
CUDA_VISIBLE_DEVICES="" RUN_SLOW=true pytest tests/models/dpt/test_modeling_dpt_auto_backbone.py -v  -k=test_inference_depth_estimation_dinov2
# Depth-anything failures
CUDA_VISIBLE_DEVICES="" RUN_SLOW=true pytest tests/models/depth_anything/test_modeling_depth_anything.py -v  -k test_inference

The following also checks the output of slices, and those checks seem to pass:

CUDA_VISIBLE_DEVICES="" RUN_SLOW=true pytest tests/models/dpt/test_modeling_dpt.py -v  -k=test_inference

I went ahead and made a PR to try and address this issue: #34518

@qubvel (Member) commented Oct 30, 2024

> I'm not 100% sure this is part of the CI, but the contributing guide asks you to run the repo-consistency check

Ahh, ok, it's because the model modules contain "Copied from" statements and these parts are synced across models. No worries then!

@philkuz philkuz force-pushed the torch_export_dpt_based_models branch from fd1b352 to c7ddd0f Compare October 31, 2024 16:49
@qubvel qubvel added the Vision label Oct 31, 2024
@qubvel (Member) left a comment

Thanks for updating it!

cc @guangy10

Comment on lines +188 to +194
fused_hidden_state = None
for hidden_state, layer in zip(hidden_states, self.layers):
    if fused_hidden_state is None:
        # first layer only uses the last hidden_state
        fused_hidden_state = layer(hidden_state)
    else:
        fused_hidden_state = layer(fused_hidden_state, hidden_state)

Comment for final review:

This change is included in the ZoeDepth model because of the "Copied from" statement. It doesn't unlock torch.export for that model, but it will be useful if we decide to enable it.

@guangy10 (Contributor) left a comment

LGTM! Thanks for the contribution.

@guangy10 (Contributor) commented Nov 1, 2024

I extended the script in pytorch/executorch#6509 to support lowering (with the simplest recipe) the DepthEstimation and SemanticSegmentation models enabled in this PR.

The dpt model works as expected. However, the depth-anything model fails due to an unsupported dim order: ExecuTorch supports dim_order (0, 1, 2, 3) but got dim_order (0, 2, 3, 1) for a placeholder node aten_clone_default. There seems to be a way to insert a compiler pass to fix it without changing the source code. I will give it a try.

@philkuz (Contributor, Author) commented Nov 4, 2024

> I extended the script in pytorch/executorch#6509 to support lowering (with the simplest recipe) the DepthEstimation and SemanticSegmentation models enabled in this PR.
>
> The dpt model works as expected. However, the depth-anything model fails due to an unsupported dim order: ExecuTorch supports dim_order (0, 1, 2, 3) but got dim_order (0, 2, 3, 1) for a placeholder node aten_clone_default. There seems to be a way to insert a compiler pass to fix it without changing the source code. I will give it a try.

Any luck on the compiler pass?

Also, do you think this gates support for torch.export? It seems ExecuTorch-specific. Maybe we can scope this PR down to torch.export generally and focus on adding ExecuTorch support in another PR? Happy to help with that.

@philkuz philkuz closed this Nov 4, 2024
@philkuz philkuz reopened this Nov 4, 2024
@guangy10 (Contributor) commented Nov 4, 2024

> > I extended the script in pytorch/executorch#6509 to support lowering (with the simplest recipe) the DepthEstimation and SemanticSegmentation models enabled in this PR. The dpt model works as expected. However, the depth-anything model fails due to an unsupported dim order: ExecuTorch supports dim_order (0, 1, 2, 3) but got dim_order (0, 2, 3, 1) for a placeholder node aten_clone_default. There seems to be a way to insert a compiler pass to fix it without changing the source code. I will give it a try.
>
> Any luck on the compiler pass?

@philkuz Sorry, I haven't got a chance to write the pass yet.

> Also, do you think this gates support for torch.export? It seems ExecuTorch-specific. Maybe we can scope this PR down to torch.export generally and focus on adding ExecuTorch support in another PR? Happy to help with that.

Right, it's ExecuTorch-specific, i.e. all tensors need to be contiguous. BTW, do you happen to know where the channels_last tensor comes from in eager mode? If so, we can fix it here; otherwise, a separate PR for ExecuTorch is fine. Please note that unlike a compiled artifact, the exported program is just an intermediate representation and should typically only be used as the entry point for further optimizations, e.g. in ExecuTorch.
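For context on the dim_order point (a generic PyTorch sketch, unrelated to the DPT code): a channels_last tensor carries dim order (0, 2, 3, 1), and .contiguous() restores the default (0, 1, 2, 3) layout that ExecuTorch expects:

```python
import torch

x = torch.randn(1, 3, 8, 8).to(memory_format=torch.channels_last)

# channels_last strides correspond to dim order (0, 2, 3, 1)
assert x.is_contiguous(memory_format=torch.channels_last)
assert not x.is_contiguous()

# .contiguous() copies back to the default (0, 1, 2, 3) layout
y = x.contiguous()
assert y.is_contiguous()
assert torch.equal(x, y)  # values are unchanged; only the memory layout differs
```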

@philkuz (Contributor, Author) commented Nov 6, 2024

> > Any luck on the compiler pass?
>
> @philkuz Sorry, I haven't got a chance to write the pass yet.
>
> > Also, do you think this gates support for torch.export? It seems ExecuTorch-specific. Maybe we can scope this PR down to torch.export generally and focus on adding ExecuTorch support in another PR? Happy to help with that.
>
> Right, it's ExecuTorch-specific, i.e. all tensors need to be contiguous. BTW, do you happen to know where the channels_last tensor comes from in eager mode? If so, we can fix it here; otherwise, a separate PR for ExecuTorch is fine. Please note that unlike a compiled artifact, the exported program is just an intermediate representation and should typically only be used as the entry point for further optimizations, e.g. in ExecuTorch.

I did a very quick scan for the channels_last tensor, and I believe it's in DINOv2 (the backbone), which is not part of this particular modeling code. I think we should move it to another PR.

@guangy10 (Contributor) commented

Any other blockers for merging this PR?

@qubvel (Member) commented Nov 19, 2024

@guangy10 no blockers IMO; we're waiting for @ArthurZucker's review, and he has quite a few in line.

@ArthurZucker (Collaborator) left a comment

Great contribution! Thanks all for iterating 🤗
Super good in general, as slicing does not play well with compile either!

@ArthurZucker ArthurZucker merged commit 8cadf76 into huggingface:main Nov 20, 2024
23 checks passed
@ArthurZucker (Collaborator) commented

Sorry for the delay @guangy10 we were on a company wide offsite! 🌴
