
Fix failure due to invalid elementCount when unpacking tensors in DmlExecutionProvider #19221

Closed

Conversation

mtavenrath
Contributor

Description

I've replaced the *_data_size() function calls used to determine the element count with a computation of the element count from the tensor dimensions.
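For illustration, a minimal sketch of the dimension-based computation (names are illustrative, not the exact patch):

// Derive the element count from the tensor's declared shape instead of
// trusting *_data_size(), which can be 0 when the data is external.
int64_t element_count = 1;
for (int i = 0; i < tensor_proto.dims_size(); ++i) {
    element_count *= tensor_proto.dims(i);
}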

Motivation and Context

To benchmark the performance difference between external tensors and protobuf tensors, I exported a medium-sized model (1.5 GB) with the script below, converting a protobuf-only ONNX file to an ONNX file with external tensors. Afterwards I could no longer load the model into the DML EP because *_data_size returned 0.

import sys

import onnx

# onnx_model is an in-memory ModelProto
onnx_model = onnx.load(sys.argv[1])

# Save the ONNX model
onnx.save(onnx_model, sys.argv[2], save_as_external_data=True, all_tensors_to_one_file=False, size_threshold=0, convert_attribute=False)

@mtavenrath
Contributor Author

@jeffbloo @PatriceVignola Hi, I am not sure if this is a bug in the input network after the conversion or in the code uploading the tensor.

@PatriceVignola
Contributor

@jeffbloo @PatriceVignola Hi, I am not sure if this is a bug in the input network after the conversion or in the code uploading the tensor.

Would you be able to use this existing helper internally?

Status UnpackTensor(const ONNX_NAMESPACE::TensorProto& tensor, const void* raw_data, size_t raw_data_len, \

If the function from tensorprotoutils works, then we should just start using it to avoid code duplication and to make sure we benefit from future improvements to this logic. If the function from tensorprotoutils fails, then we should probably fail here as well, since that most likely indicates a model or converter issue.

Either way, we should probably use those functions.
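For illustration, a hedged sketch of what routing through the shared helper might look like (element_count and the float element type are assumptions for the example, not the actual DML EP call site):

// Sketch only: unpack a float tensor via the shared tensorprotoutils helper.
// element_count is assumed to be derived from the tensor's dims, as above.
std::vector<float> unpacked(element_count);
const bool has_raw = tensor_proto.has_raw_data();
ORT_RETURN_IF_ERROR(onnxruntime::utils::UnpackTensor<float>(
    tensor_proto,
    has_raw ? tensor_proto.raw_data().data() : nullptr,
    has_raw ? tensor_proto.raw_data().size() : 0,
    unpacked.data(),
    element_count));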

@mtavenrath
Contributor Author

This function uses the same attributes as the code I had trouble with. I'll give it a try anyway, and if it fails as well I'll try to find a sufficiently large public model that reproduces the problem and file an issue.

@PatriceVignola
Contributor

It's also failing on some Stable Diffusion models converted through Olive, so this is a pretty high-priority issue. It affects the 1.17 release, so we'll probably need to release a patch.

I'm fine with merging this change in (I confirmed that it fixes the SD crash locally), but it's pretty weird that those functions started returning 0 all of a sudden. I'm going to try to find the root cause in the meantime.

@mtavenrath
Contributor Author

I guess this means

Status UnpackTensor(const ONNX_NAMESPACE::TensorProto& tensor, const void* raw_data, size_t raw_data_len, \

is broken as well, since it uses the same fields?

@PatriceVignola
Contributor

It's not. The way we unpack the tensor in the DML EP has regressed: we are trying to unpack external data as if it were internal data. I'll have a fix soon.
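To illustrate the distinction, a hedged sketch of the check (field names follow the ONNX TensorProto schema; the actual fix may differ):

// External tensors store their bytes in a separate file; reading them
// through raw_data() as if they were inline yields empty/zero-size data.
const bool is_external =
    tensor_proto.has_data_location() &&
    tensor_proto.data_location() == ONNX_NAMESPACE::TensorProto_DataLocation_EXTERNAL;
if (is_external) {
  // Resolve the external_data file path, offset, and length instead.
}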

@PatriceVignola
Contributor

@mtavenrath #19415 should address the issue. It works for the SDXL crash, but let me know if it also works for your internal model.

@mtavenrath
Contributor Author

@PatriceVignola #19415 fixes the issue for my internal model as well. Closing this PR.

mtavenrath closed this Feb 6, 2024