Only output bo needs to be synced from device after result is available #849
base: main
Conversation
bool not_ofm = (bindings.values[j].buffer->memory_type & IREE_HAL_MEMORY_TYPE_HOST_VISIBLE) &&
               (bindings.values[j].buffer->allowed_usage & IREE_HAL_MEMORY_TYPE_HOST_VISIBLE);
this is not how input/output is determined - https://github.com/nod-ai/iree-amd-aie/blob/makslevental/xrt-lite/runtime/src/iree-amd-aie/driver/xrt/cts/executable_cache_test.mlir#L21-L23
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<32x32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<32x32xf32>>
%2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<32x32xf32>>
note !flow.dispatch.tensor<readonly:tensor<32x32xf32>
vs !flow.dispatch.tensor<writeonly:tensor<32x32xf32>
Also, as it clearly states, IREE_HAL_MEMORY_TYPE_HOST_VISIBLE is an iree_hal_memory_type_bits_t flag, not an iree_hal_buffer_usage_bits_t flag. See https://github.com/iree-org/iree/blob/fbf677d2096aea4c544cd8d0558b49bfc729fd91/runtime/src/iree/hal/buffer.h
What I see is that all input bos are created from here:
mappable_params.type |= IREE_HAL_MEMORY_TYPE_HOST_VISIBLE;
mappable_params.usage |= IREE_HAL_BUFFER_USAGE_MAPPING;
To repeat: that is not the only place in the HAL where buffers are created. Specifically, those are the buffer_view APIs, which we do not use. We use the pure allocate_buffer APIs:
static iree_status_t iree_hal_xrt_allocator_allocate_buffer(
For example, xrt_lite_dispatch_test.log
For example, xrt_dispatch_test.log
Will this change affect xrt-lite? I thought this change only affects xrt.
Function iree_hal_buffer_view_generate_buffer_in_situ is at the iree level. It will call the xrt-level function as you pointed out. I do not see a contradiction here.
Function iree_hal_buffer_view_generate_buffer_in_situ is at the iree level. It will call the xrt-level function as you pointed out.
You're completely wrong. Feel free to run https://github.com/nod-ai/iree-amd-aie/blob/main/runtime/src/iree-amd-aie/driver/xrt/cts/matmul_dispatch_test.cc and see for yourself.
What I see is that the file you referenced has this line
#include "iree/hal/buffer_view_util.h"
That is aligned with my understanding.
your understanding is wrong: #852
feel free to consult any reference on how C++ headers work.
@@ -360,7 +362,7 @@ static iree_status_t iree_hal_xrt_direct_command_buffer_dispatch(
     return iree_make_status(IREE_STATUS_UNKNOWN, e.what());
   }

-  for (xrt::bo& bo : bos) bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
+  ofm_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
Just because today we only have one output, doesn't mean tomorrow we will only have one output.
So far, all NPU products use only one output buffer.
This has absolutely nothing to do with NPU and everything to do with the model.
This has absolutely nothing to do with NPU and everything to do with the model.
Do we have a model running on NPU with multiple output bos?
This indeed has nothing to do with the NPU or with whether we currently have such a model. We shouldn't assume that there is a single output buffer.
This is not correct; they are created in many places and the flags are set in many places, such as in runtime/src/iree-amd-aie/driver/xrt/direct_allocator.cc
Please reread my comment.
This is not the correct way to do this; see my comments.
run.set_arg(arg_index + j, arg_buffer);
bool not_ofm = (bindings.values[j].buffer->memory_type & IREE_HAL_MEMORY_TYPE_HOST_VISIBLE) &&
               (bindings.values[j].buffer->allowed_usage & IREE_HAL_BUFFER_USAGE_MAPPING);
IREE_HAL_BUFFER_USAGE_MAPPING is also not the correct flag:
// WARNING: mapping can be extremely expensive, use limited hardware
// resources, introduce data hazards, and synchronize host and device
// execution. Unless an application knows that such issues will not arise
// (as in tests where there's never concurrent usage) mapping should be used
// judiciously: do not assume mapping is a high-performance technique!
Please refer to this comment:
#849 (comment)
@dezhiAmd please do not resolve conversations without making requested changes. That is our policy.
@dezhiAmd please also do not delete comments since it completely distorts the discussion.
You have made no changes.
Issue description:
In function iree_hal_xrt_direct_command_buffer_dispatch, all buffers are synced from the device regardless of whether a buffer holds input or output. This may impact performance.
Solution:
Use the allowed_usage and memory_type members of struct iree_hal_buffer_t to decide whether a buffer is for input or output.
Currently all input buffers are created here
Assumption:
For the foreseeable future, an NPU compute pipeline uses only one output buffer. This is a significant difference from a GPU compute kernel, which can use multiple output buffers.
An output buffer will never use the IREE_HAL_BUFFER_USAGE_MAPPING flag, since mapping can be extremely expensive, use limited hardware resources, introduce data hazards, and synchronize host and device execution. But an input buffer cannot avoid using this flag.