Only output bo needs to be synced from device after result is available #849
base: main
Conversation
bool not_ofm = (bindings.values[j].buffer->memory_type & IREE_HAL_MEMORY_TYPE_HOST_VISIBLE) &&
               (bindings.values[j].buffer->allowed_usage & IREE_HAL_MEMORY_TYPE_HOST_VISIBLE);
this is not how input/output is determined - https://github.com/nod-ai/iree-amd-aie/blob/makslevental/xrt-lite/runtime/src/iree-amd-aie/driver/xrt/cts/executable_cache_test.mlir#L21-L23
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<32x32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<32x32xf32>>
%2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<32x32xf32>>
note !flow.dispatch.tensor<readonly:tensor<32x32xf32>
vs !flow.dispatch.tensor<writeonly:tensor<32x32xf32>
Also, as it clearly states, IREE_HAL_MEMORY_TYPE_HOST_VISIBLE is an iree_hal_memory_type_bits_t flag, not an iree_hal_buffer_usage_bits_t flag. See https://github.com/iree-org/iree/blob/fbf677d2096aea4c544cd8d0558b49bfc729fd91/runtime/src/iree/hal/buffer.h
What I see is that all input bos are created from here:
mappable_params.type |= IREE_HAL_MEMORY_TYPE_HOST_VISIBLE;
mappable_params.usage |= IREE_HAL_BUFFER_USAGE_MAPPING;
To repeat: that is not the only place in the HAL where buffers are created. Specifically, those are the buffer_view APIs, which we do not use. We use the pure allocate_buffer APIs:
static iree_status_t iree_hal_xrt_allocator_allocate_buffer(
For example, xrt_lite_dispatch_test.log
For example, xrt_dispatch_test.log
Will this change affect xrt-lite? I thought this change only affects xrt.
Function iree_hal_buffer_view_generate_buffer_in_situ is at the iree level. It will call the xrt-level function as you pointed out. I do not see a contradiction here.
Function iree_hal_buffer_view_generate_buffer_in_situ is at the iree level. It will call the xrt-level function as you pointed out.
You're completely wrong. Feel free to run https://github.com/nod-ai/iree-amd-aie/blob/main/runtime/src/iree-amd-aie/driver/xrt/cts/matmul_dispatch_test.cc and see for yourself.
What I see is that the file you referenced has this line
#include "iree/hal/buffer_view_util.h"
That is aligned with my understanding.
your understanding is wrong: #852
feel free to consult any reference on how C++ headers work.
@@ -360,7 +362,7 @@ static iree_status_t iree_hal_xrt_direct_command_buffer_dispatch(
     return iree_make_status(IREE_STATUS_UNKNOWN, e.what());
   }

-  for (xrt::bo& bo : bos) bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
+  ofm_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
Just because today we only have one output, doesn't mean tomorrow we will only have one output.
So far, all NPU products use only one output buffer.
This has absolutely nothing to do with NPU and everything to do with the model.
This has absolutely nothing to do with NPU and everything to do with the model.
Do we have a model running on NPU with multiple output bos?
This indeed has nothing to do with the NPU or with whether we currently have such a model. We shouldn't assume that there is a single output buffer.
This is not correct; they are created in many places and the flags are set in many places, such as in runtime/src/iree-amd-aie/driver/xrt/direct_allocator.cc
Please reread my comment.
This is not the correct way to do this; see my comments.
run.set_arg(arg_index + j, arg_buffer);
bool not_ofm = (bindings.values[j].buffer->memory_type & IREE_HAL_MEMORY_TYPE_HOST_VISIBLE) &&
               (bindings.values[j].buffer->allowed_usage & IREE_HAL_BUFFER_USAGE_MAPPING);
IREE_HAL_BUFFER_USAGE_MAPPING is also not the correct flag:
// WARNING: mapping can be extremely expensive, use limited hardware
// resources, introduce data hazards, and synchronize host and device
// execution. Unless an application knows that such issues will not arise
// (as in tests where there's never concurrent usage) mapping should be used
// judiciously: do not assume mapping is a high-performance technique!
Please refer to this comment:
#849 (comment)
@dezhiAmd please do not resolve conversations without making requested changes. That is our policy.
@dezhiAmd please also do not delete comments since it completely distorts the discussion.
You have made no changes.
Issue description:
In function iree_hal_xrt_direct_command_buffer_dispatch, all buffers are synced from the device regardless of whether a buffer holds input or output. This may impact performance.
Solution:
Use the allowed_usage and memory_type members of struct iree_hal_buffer_t to decide whether a buffer is for input or output.
Currently all input buffers are created here
Assumption:
For the foreseeable future, an NPU compute pipeline uses only one output buffer. This is a significant difference from a GPU compute kernel, which can use multiple output buffers.
An output buffer will never use the IREE_HAL_BUFFER_USAGE_MAPPING flag, since mapping can be extremely expensive, use limited hardware resources, introduce data hazards, and synchronize host and device execution. But an input buffer cannot avoid using this flag.