Add basic command buffer support to level zero adapter v2 #2532

Xewar313 · 2025-01-08T11:51:20Z

This pull request implements the basic command buffer calls in the Level Zero v2 adapter. These calls are required by the SYCL graph functionality implemented in LLVM, such as record and replay.

source/adapters/level_zero/v2/command_buffer.cpp

igchor · 2025-01-09T16:43:15Z

source/adapters/level_zero/v2/queue_api.hpp

@@ -14,6 +14,16 @@

 #include <ur_api.h>

+#include "../common.hpp"


Those includes should not be needed here. Also, this file (and queue_api.cpp) is auto-generated, so you can't modify it manually. You need to update https://github.com/oneapi-src/unified-runtime/blob/main/scripts/templates/queue_api.hpp.mako and run make generate.

Actually, could you also add a comment to the queue_api.hpp.mako saying that queue_api.hpp is being auto-generated? We should have already added that.

I have a small problem with removing all includes. As far as I understand, queue_api.hpp.mako is generated based on some file containing declarations (ur_api.hpp?), and all of these declarations use only structures defined inside UR (all starting with ur_*). However, I had to add a function that uses ze_command_list_handle_t, because there is no corresponding class with the ur_ prefix. The problem is that the definition of ze_command_list_handle_t must be included, so I have to include at least "common.hpp" or "../common.hpp"; without that, this simply won't work. The other option would be creating something like ur_command_list_handle_t, but I believe that should be a separate PR because of the scope of the change.

If you need to use ze_command_list_handle_t, it should be enough to include ze_api.h. Add it (and the enqueueCommandBuffer declaration) to https://github.com/oneapi-src/unified-runtime/blob/main/scripts/templates/queue_api.hpp.mako and run make generate.

igchor · 2025-01-09T17:02:04Z

source/adapters/level_zero/v2/command_buffer.hpp

+#include "queue_api.hpp"
+
+struct command_buffer_profiling_t {
+  ur_exp_command_buffer_sync_point_t NumEvents;


nit: in v2 we name variables/params in all lower-case

In that case, I believe that command_list_cache should also be changed, right? (Its fields are named starting with upper-case)

Yes, good point, we never got to fixing it

igchor · 2025-01-09T17:03:25Z

source/adapters/level_zero/v2/command_buffer.hpp

+  ur_context_handle_t Context;
+  // Device associated with this command buffer
+  ur_device_handle_t Device;
+  ze_command_list_handle_t ZeCommandList;


please use v2::raii::ze_command_list_handle_t from v2/common.hpp
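
The request above is to hold the command list in an owning wrapper instead of a raw handle. A minimal sketch of that idea follows; `fake_command_list`, `fakeCommandListDestroy`, and the `command_buffer` struct are hypothetical stand-ins for the Level Zero handle, `zeCommandListDestroy`, and the adapter class, and the real type in v2/common.hpp may be defined differently.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for the Level Zero handle and its destroy call.
struct fake_command_list {
  int id;
};
using fake_command_list_handle_t = fake_command_list *;

static int destroyed_count = 0;

void fakeCommandListDestroy(fake_command_list_handle_t handle) {
  ++destroyed_count;
  delete handle;
}

// Stateless deleter, so the unique_ptr stays the size of a raw pointer.
struct command_list_deleter {
  void operator()(fake_command_list_handle_t handle) const {
    fakeCommandListDestroy(handle);
  }
};

using command_list_unique_handle =
    std::unique_ptr<fake_command_list, command_list_deleter>;

// The owner never calls destroy itself; the member's destructor does it.
struct command_buffer {
  explicit command_buffer(command_list_unique_handle list)
      : commandList(std::move(list)) {}

  fake_command_list_handle_t getZeCommandList() const {
    return commandList.get();
  }

private:
  command_list_unique_handle commandList;
};
```

With this shape the command buffer needs no hand-written cleanup for the list, which also fits the later suggestion to move release logic into the destructor.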

igchor · 2025-01-09T17:04:55Z

source/adapters/level_zero/v2/command_buffer.cpp

+ * @param[out] CommandList The L0 command-list created by this function.
+ * @return UR_RESULT_SUCCESS or an error code on failure
+ */
+ur_result_t createMainCommandList(ur_context_handle_t Context,


Would it make sense to cache command lists like we do for non-command-buffer path?

I don't know what the non-command-buffer path does, but for reference creating a pool of command-lists to use has been an idea we've had for the v1 adapter but never got around to. TODO comment that has since been removed

In v2, we have https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/v2/command_list_cache.hpp which could be used here.
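
As a rough illustration of what caching command lists buys, here is a minimal sketch of a cache that reuses released lists before creating new ones. The `fake_list` type and the `acquire`/`release` names are assumptions made for this example; the interface of the real command_list_cache.hpp may differ.

```cpp
#include <cassert>
#include <memory>
#include <stack>
#include <utility>

// Hypothetical stand-in for a Level Zero command list.
struct fake_list {
  int id;
};

class command_list_cache {
public:
  // Hands out an idle list when one exists, otherwise creates a new one.
  std::unique_ptr<fake_list> acquire() {
    if (!idle.empty()) {
      std::unique_ptr<fake_list> list = std::move(idle.top());
      idle.pop();
      return list;
    }
    return std::make_unique<fake_list>(fake_list{created++});
  }

  // Returns a list to the cache so a later acquire() can reuse it.
  void release(std::unique_ptr<fake_list> list) { idle.push(std::move(list)); }

  int createdCount() const { return created; }

private:
  std::stack<std::unique_ptr<fake_list>> idle;
  int created = 0;
};
```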

igchor · 2025-01-09T17:07:32Z

source/adapters/level_zero/v2/command_buffer.cpp

+
+  ze_command_list_handle_t ZeCommandList = nullptr;
+  UR_CALL(createMainCommandList(Context, Device, IsUpdatable, ZeCommandList));
+  try {


please move the try/catch block to cover the entire function and add try/catch to other functions as well, i.e.

    urCommandBufferCreateExp(...) try {
      ...
    } catch (...) {
      return exceptionToResult(std::current_exception());
    }

In v2, we have helper functions that might throw, so it's best to wrap every function with try/catch.
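
The pattern requested here is a function-try-block whose `catch (...)` handler converts any exception into a result code. The sketch below shows it end to end; `result_t`, this `exceptionToResult`, `mayThrow`, and `commandBufferCreate` are simplified stand-ins written for the example, not the adapter's actual definitions.

```cpp
#include <cassert>
#include <exception>
#include <new>
#include <stdexcept>

// Hypothetical stand-in for ur_result_t.
enum result_t {
  RESULT_SUCCESS,
  RESULT_ERROR_OUT_OF_HOST_MEMORY,
  RESULT_ERROR_UNKNOWN
};

// Simplified stand-in for the adapter's exceptionToResult helper:
// rethrow the captured exception and map it to a result code.
result_t exceptionToResult(std::exception_ptr eptr) {
  try {
    std::rethrow_exception(eptr);
  } catch (const std::bad_alloc &) {
    return RESULT_ERROR_OUT_OF_HOST_MEMORY;
  } catch (...) {
    return RESULT_ERROR_UNKNOWN;
  }
}

// Stand-in for a helper (for example a cache lookup) that may throw.
void mayThrow(int mode) {
  if (mode == 1)
    throw std::bad_alloc();
  if (mode == 2)
    throw std::runtime_error("cache failure");
}

// The try block covers the whole function body, so nothing escapes
// across the C API boundary.
result_t commandBufferCreate(int mode) try {
  mayThrow(mode);
  return RESULT_SUCCESS;
} catch (...) {
  return exceptionToResult(std::current_exception());
}
```

Catching `...` rather than only `std::bad_alloc` matters because helpers can throw other exception types as well.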

igchor · 2025-01-09T17:11:52Z

source/adapters/level_zero/v2/command_buffer.cpp

+  UR_CALL_THROWS(ur::level_zero::urDeviceRetain(Device));
+}
+
+void ur_exp_command_buffer_handle_t_::cleanupCommandBufferResources() {


I don't think urContextRelease/urDeviceRelease can actually fail.

I would suggest just removing this entire function and moving the logic to the destructor.

UR_CALL_THROWS can throw an exception, which is bad practice in C++ destructors. So if we want to make this move, I think removing that throw behaviour is the main thing.

What about urKernelRelease? I also need to call it when releasing the buffer, so I would need to know whether it can fail, and I am not sure whether that is the case.

Well, technically, urKernelRelease can fail if zeKernelDestroy fails but if that happens and we throw from this function we can have a memory leak (if there are more unreleased kernels in the vector).

Also, is it really that useful to return an error from urCommandBufferReleaseExp?

Perhaps we could just log all the failures from urKernelRelease, urContextRelease, etc. and always return UR_RESULT_SUCCESS in urCommandBufferReleaseExp. This would ensure that all resources are (attempted to be) freed, even if there is some error during release of one of them.
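
A minimal sketch of the "log and keep going" release path proposed above. The `releaseLog` vector stands in for the adapter's logger, and each entry in `releases` stands in for one urKernelRelease, urContextRelease, or similar call; all of these names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for ur_result_t.
enum result_t { RESULT_SUCCESS, RESULT_ERROR_UNKNOWN };

// Hypothetical log sink; the real adapter has its own logger.
static std::vector<std::string> releaseLog;

struct command_buffer {
  // Each entry stands in for one release call on an owned resource.
  std::vector<std::function<result_t()>> releases;
};

// Attempts every release, records failures, and always reports success,
// so one failing release cannot leak the resources that follow it.
result_t commandBufferRelease(command_buffer &buffer) {
  for (std::size_t i = 0; i < buffer.releases.size(); ++i) {
    if (buffer.releases[i]() != RESULT_SUCCESS) {
      releaseLog.push_back("release step " + std::to_string(i) + " failed");
    }
  }
  buffer.releases.clear();
  return RESULT_SUCCESS;
}
```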

igchor · 2025-01-09T17:16:27Z

source/adapters/level_zero/v2/command_buffer.hpp

+  ur_exp_command_buffer_command_handle_t_(ur_exp_command_buffer_handle_t,
+                                          uint64_t);
+
+  virtual ~ur_exp_command_buffer_command_handle_t_();


Doesn't need to be virtual, I think

pbalcer · 2025-01-14T10:00:02Z

scripts/templates/queue_api.cpp.mako

@@ -19,6 +19,8 @@ from templates import helper as th
 *
 */

+// This file was generated basing on scripts/templates/queue_api.cpp.mako


nit:

Do not edit. This file is auto generated from a template: scripts/templates/queue_api.cpp.mako

pbalcer · 2025-01-14T10:04:09Z

source/adapters/level_zero/v2/command_buffer.cpp

+} // namespace
+
+std::pair<ze_event_handle_t *, uint32_t>
+ur_exp_command_buffer_handle_t_::getWaitListView(


This is identical to the implementation in queue. Please create simple WaitListView abstraction usable in both.
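
One possible shape for such a shared abstraction is sketched below: a small storage object that converts UR-level events into a contiguous array of native handles and returns a non-owning view. `ze_like_event_t`, `ur_like_event`, `wait_list_view`, and `wait_list_storage` are hypothetical names, not types from the adapter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for ze_event_handle_t and the UR event object.
using ze_like_event_t = int *;
struct ur_like_event {
  ze_like_event_t ze;
  ze_like_event_t getZeEvent() const { return ze; }
};

// Non-owning view in the form the append calls expect.
struct wait_list_view {
  ze_like_event_t *handles;
  uint32_t num;
};

// Owns the backing array; both a queue and a command buffer could hold one.
class wait_list_storage {
public:
  wait_list_view getView(const ur_like_event *const *events,
                         uint32_t numEvents) {
    storage.resize(numEvents);
    for (uint32_t i = 0; i < numEvents; ++i) {
      storage[i] = events[i]->getZeEvent();
    }
    // The view is valid until the next getView call on this object.
    return {numEvents ? storage.data() : nullptr, numEvents};
  }

private:
  std::vector<ze_like_event_t> storage;
};
```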

pbalcer · 2025-01-14T10:08:41Z

source/adapters/level_zero/v2/command_buffer.cpp

+                                  ze_command_list_handle_t &commandList) {
+
+  using queue_group_type = ur_device_handle_t_::queue_group_info_t::type;
+  // that should be call to queue getZeOrdinal,


This, together with the fact that we have no way to allocate events from the correct queue, makes me think we either need to defer creation of these objects until the first enqueue of the command buffer or urCommandBufferCreateExp should take a queue.
@EwanC thoughts?

I'm also thinking whether it wouldn't make sense just to make the CommandBuffer allocate a whole pool of events from the context. That way, when we need an event, we don't have to acquire context locks.

This, together with the fact that we have no way to allocate events from the correct queue, makes me think we either need to defer creation of these objects until the first enqueue of the command buffer or urCommandBufferCreateExp should take a queue.
@EwanC thoughts?

I'm not that keen on urCommandBufferCreateExp taking a queue. 1) it doesn't match the SYCL API, where we don't have a queue object when creating the UR command-buffer. 2) It is the opposite direction to where the OpenCL WG is going with command-buffers in KhronosGroup/OpenCL-Docs#1292 to separate the queues used on command-buffer creation and enqueue.

Are there only 2 types of queue ordinal relevant here, compute and copy? If so, I would suggest creating command-lists for both. This PR doesn't have the v1 functionality of splitting commands from the UR command-buffers into compute and copy command-lists. But we saw this having good perf benefits on v1, so I would imagine creating a copy engine command-list is something we'll need to do at some point anyway.

Are there only 2 types of queue ordinal relevant here, compute and copy?

In v2 by default we are only going to use the compute ordinal, letting the UMD decide whether to offload the copy to a separate engine. So the answer here is: "just use compute". I was more thinking about the general direction.
What you say makes sense.

cool, letting the UMD handle this consideration definitely sounds like it will simplify the adapter code

pbalcer · 2025-01-14T10:13:48Z

source/adapters/level_zero/v2/command_buffer.cpp

+    checkImmediateAppendSupport(context);
+
+    if (isUpdatable) {
+      UR_ASSERT(context->getPlatform()->ZeMutableCmdListExt.Supported,


Please don't use these "asserts". Do a normal if and return UR_RESULT_ERROR_UNSUPPORTED_FEATURE.
I just dislike overloading the term "assert" to mean fail with error.
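
A minimal sketch of the plain-`if` form asked for above. `result_t`, `platform_caps`, and `checkUpdatableSupport` are hypothetical names; in the adapter the capability flag would come from the platform object.

```cpp
#include <cassert>

// Hypothetical stand-in for ur_result_t.
enum result_t { RESULT_SUCCESS, RESULT_ERROR_UNSUPPORTED_FEATURE };

// Hypothetical stand-in for the platform's extension flags.
struct platform_caps {
  bool mutableCmdListSupported;
};

// An explicit branch makes it clear that this is a runtime capability
// check with an error return, not a debug-only assertion.
result_t checkUpdatableSupport(const platform_caps &caps, bool isUpdatable) {
  if (isUpdatable && !caps.mutableCmdListSupported) {
    return RESULT_ERROR_UNSUPPORTED_FEATURE;
  }
  return RESULT_SUCCESS;
}
```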

pbalcer · 2025-01-14T10:15:54Z

source/adapters/level_zero/v2/command_buffer.cpp

+ * @param[out] commandList The L0 command-list created by this function.
+ * @return UR_RESULT_SUCCESS or an error code on failure
+ */
+ur_result_t createMainCommandList(ur_context_handle_t context,


this isn't used anywhere.

pbalcer · 2025-01-14T10:26:27Z

source/adapters/level_zero/v2/command_buffer.cpp

+  std::ignore = kernelAlternatives;
+  std::ignore = command;
+  try {
+    UR_ASSERT(hKernel, UR_RESULT_ERROR_INVALID_NULL_HANDLE);


This is near identical to the queue enqueue implementation. I imagine a lot of other similar enqueue methods will be likewise similar.

We need to come up with an abstraction that lets us share this implementation between queue and command buffers.

While reviewing this, I had a thought that a command buffer is a subset of the queue functionality with some extra bits. This sounds like this should be solvable with composition:

    class cmds { // enqueuable? command_list?
      WaitViewList waitlist;
      CmdList cmdlist;
      event_pool events;
      enqueue_kernel enqueue_kernel(...);
      ...
    }
    class queue { cmds cmd; }
    class command_buffer { cmds cmd; }

@igchor thoughts?
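
The composition idea can be made concrete with a small compilable sketch: one class holds the shared state and enqueue logic, and both the queue and the command buffer embed it. All names here (`command_list_state`, `enqueueKernel`, and so on) are hypothetical, and a vector of strings stands in for the real command list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Shared state and enqueue logic; a vector of strings stands in for the
// Level Zero command list.
class command_list_state {
public:
  void enqueueKernel(const std::string &kernelName) {
    recorded.push_back("kernel:" + kernelName);
  }
  const std::vector<std::string> &commands() const { return recorded; }

private:
  std::vector<std::string> recorded;
};

// The queue forwards straight to the shared implementation.
class queue {
public:
  void enqueueKernelLaunch(const std::string &name) {
    cmds.enqueueKernel(name);
  }
  const command_list_state &state() const { return cmds; }

private:
  command_list_state cmds;
};

// The command buffer reuses the same logic and adds its "extra bits",
// here a finalized flag that blocks further appends.
class command_buffer {
public:
  bool appendKernelLaunch(const std::string &name) {
    if (finalized)
      return false;
    cmds.enqueueKernel(name);
    return true;
  }
  void finalize() { finalized = true; }
  const command_list_state &state() const { return cmds; }

private:
  command_list_state cmds;
  bool finalized = false;
};
```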

I agree, also, we can simplify the enqueue operations by combining getSignalEvent and getWaitListView functions. I would move this functionality to the ur_command_list_handler_t (we might want to rename it) and implement it like this:

    struct ur_command_list_handler_t {
      ur_command_list_handler_t(ur_context_handle_t hContext,
                                ur_device_handle_t hDevice,
                                const ur_queue_properties_t *pProps);
      ur_command_list_handler_t(ze_command_list_handle_t hZeCommandList,
                                bool ownZeHandle);

      std::tuple<ze_event_handle_t, uint32_t, ze_event_handle_t *>
      getSignalEventAndWaitList(ur_event_handle_t *hUserEvent,
                                ur_command_t commandType,
                                const ur_event_handle_t *phWaitEvents,
                                uint32_t numWaitEvents);

      raii::command_list_unique_handle commandList;
      std::vector<ze_event_handle_t> waitList;
      event_pool events;
    };

then, in enqueue* we can do:

    auto [signalEvent, numWaitEvents, waitEvent] = getSignalEventAndWaitList(...);
    ...
    ZE2UR_CALL(zeCommandListAppendSomething(..., signalEvent, numWaitEvents, waitEvent));

Also, right now, getSignalEvent returns ur_event_handle_t (not ze_event_handle_t) because in enqueueTimestampRecordingExp, we need to call timestamp-related functions on it. But actually, we already have a pointer to the ur_event: it's the same one that we pass as the first argument to getSignalEvent.

It's better to make getSignalEventAndWaitList return ze_event_handle_t so that we can avoid checking for nullptr.

I changed getSignalEvent to use ze_event_handle_t instead of ur_event_handle_t, but I don't think that merging getSignalEvent with getWaitListView makes sense: due to some C++ weirdness, referencing variables bound using [] (structured bindings) inside a lambda is impossible. So the following code does not compile:

    auto [signalEvent, numWaitEvents, waitEvent] = getSignalEventAndWaitList(...);
    hMem->unmapHostPtr(pMappedPtr, [&](void *src, void *dst, size_t size) {
      ZE2UR_CALL_THROWS(zeCommandListAppendMemoryCopy,
                        (commandListManager.getZeCommandList(), dst, src, size,
                         nullptr, numWaitEvents, waitEvent));
      memoryMigrated = true;
    });

This pattern is used across the whole queue implementation, so I would have to call std::get on the tuple inside the function, which would be even less readable.

Right, I forgot about this. In that case, instead of the tuple we can just use a custom structure:

    struct events {
      ze_event_handle_t signalEvent;
      size_t numWaitEvents;
      ze_event_handle_t *waitEvents;
    };

and then, the code just becomes:

    auto e = getSignalEventAndWaitList(...);
    hMem->unmapHostPtr(pMappedPtr, [&](void *src, void *dst, size_t size) {
      ZE2UR_CALL_THROWS(zeCommandListAppendMemoryCopy,
                        (commandListManager.getZeCommandList(), dst, src, size,
                         nullptr, e.numWaitEvents, e.waitEvents));
      memoryMigrated = true;
    });
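
The struct-based variant can be checked with a self-contained sketch. Because `e` is an ordinary variable rather than a set of structured bindings, a `[&]` lambda can use it under C++17 rules. `event_handle_t`, `command_list_manager`, `runCallback`, and `countWaitEventsInLambda` are hypothetical names, with an `int *` standing in for the native event handle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ze_event_handle_t.
using event_handle_t = int *;

struct events {
  event_handle_t signalEvent;
  std::size_t numWaitEvents;
  event_handle_t *waitEvents;
};

struct command_list_manager {
  std::vector<event_handle_t> waitList;
  int signalStorage = 0; // stands in for an event taken from a pool

  // Copies the caller's wait events into owned storage and returns the
  // signal event together with a view of the wait list.
  events getSignalEventAndWaitList(event_handle_t *userWaitEvents,
                                   std::size_t count) {
    waitList.assign(userWaitEvents, userWaitEvents + count);
    return {&signalStorage, waitList.size(), waitList.data()};
  }
};

// Accepts a callback, like unmapHostPtr does in the discussion.
template <typename F> void runCallback(F &&f) { f(); }

std::size_t countWaitEventsInLambda(command_list_manager &manager,
                                    event_handle_t *waits, std::size_t n) {
  auto e = manager.getSignalEventAndWaitList(waits, n);
  std::size_t seen = 0;
  // `e` is a normal variable, so capturing it by reference is fine.
  runCallback([&] { seen = e.numWaitEvents; });
  return seen;
}
```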

Another problem that I have noticed with that solution is inside enqueueTimestampRecordingExp: there we call only getWaitListView without getSignalEvent, because the ze_event_handle_t is obtained from a ur_event_handle_t method. So we would get a signal event without needing it (the optimal solution would again be to split this merged method into two).

Let's skip this change for now. Returning a 24-byte structure might prevent register use.

@Xewar313 Yeah, I don't think your current implementation is correct. You should still call getSignalEvent(), as you need to somehow allocate the event (the passed handle is an out handle only; it doesn't point to a valid event).

So, with getSignalEventAndWaitList you would do:

    auto e = getSignalEventAndWaitList(phEvent, ...);
    (*phEvent)->recordStartTimestamp();
    assert((*phEvent)->getZeHandle() == e.signalEvent);
    ...

And that's the point, I don't think there is any use case where we need to only call getWaitList so it would be safer to have it combined.

@pbalcer I don't think there should be any difference from the current usage, where we call those functions one after another and only then use the resulting values.

If that's really something that you'd like to optimize we can just move the implementation to the header and make it inline.

@Xewar313 you can leave the implementation as is for now (just fix enqueueTimestampRecordingExp) and we can implement getSignalEventAndWaitList in another PR.

pbalcer · 2025-01-14T10:26:58Z

source/adapters/level_zero/v2/command_buffer.hpp

+  std::pair<ze_event_handle_t *, uint32_t>
+  getWaitListView(const ur_event_handle_t *phWaitEvents,
+                  uint32_t numWaitEvents);
+


pbalcer · 2025-01-14T10:27:04Z

source/adapters/level_zero/v2/command_buffer.hpp

+                                          uint64_t);
+
+  ~ur_exp_command_buffer_command_handle_t_();
+


source/adapters/level_zero/v2/command_buffer.cpp

pbalcer · 2025-01-21T11:50:39Z

source/loader/layers/sanitizer/asan/asan_report.hpp

- * See LICENSE.TXT
- * SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+ * Part of the Unified-Runtime Project, under the Apache License v2.0 with LLVM
+ * Exceptions. See LICENSE.TXT SPDX-License-Identifier: Apache-2.0 WITH


invalid rebase?

This is a strange one - for some reason code formatting formats this header, but it shouldn't be formatted

pbalcer · 2025-01-21T11:55:56Z

source/adapters/level_zero/v2/command_list_manager.cpp

+  return UR_RESULT_SUCCESS;
+}
+
+ur_result_t ur_command_list_manager::closeCommandList() {


this should be inside of command buffer. Immediate command lists never need to be closed.

pbalcer · 2025-01-21T11:56:34Z

source/adapters/level_zero/v2/command_list_manager.cpp

+
+  auto zeSignalEvent = signalEvent ? signalEvent->getZeEvent() : nullptr;
+
+  ZE2UR_CALL(zeCommandListImmediateAppendCommandListsExp,


this is command buffer specific.

Actually it is queue specific (since we need to enqueue command buffer, by appending command list to the queue execution list), however since it is modifying the command list, I feel like it makes more sense to add it here, instead of queue implementation. What do you think?

Ah, right. I'd move it to queue then. The "command list manager" is not meant to be a clean command list abstraction, but merely a place where we store and retrieve command list and associated state. Otherwise we risk pushing everything into this class.

pbalcer · 2025-01-21T11:58:57Z

source/adapters/level_zero/v2/command_buffer.cpp

+} // namespace
+
+std::pair<ze_event_handle_t *, uint32_t>
+ur_exp_command_buffer_handle_t_::getWaitListView(


this can be removed, right?

This reverts commit 545e577.

pbalcer

directionally lgtm, but we can't do this refactoring piecemeal with cmdlist-related state duplicated between queue class and the new manager. That might lead to inconsistency issues/duplicating event pools etc.

igchor · 2025-01-21T17:07:42Z

source/adapters/level_zero/v2/command_buffer.cpp

+      context, device, std::move(zeCommandList), commandBufferDesc);
+  return UR_RESULT_SUCCESS;
+
+} catch (const std::bad_alloc &) {


I would change this to catch ... - the cache can throw other things as well

igchor · 2025-01-21T17:08:10Z

source/adapters/level_zero/v2/command_buffer.cpp

+urCommandBufferFinalizeExp(ur_exp_command_buffer_handle_t hCommandBuffer) try {
+  UR_ASSERT(hCommandBuffer, UR_RESULT_ERROR_INVALID_NULL_POINTER);
+  UR_ASSERT(!hCommandBuffer->isFinalized, UR_RESULT_ERROR_INVALID_OPERATION);
+  hCommandBuffer->closeCommandList();


igchor · 2025-01-21T17:08:54Z

source/adapters/level_zero/v2/command_buffer.cpp

+ur_result_t
+urCommandBufferFinalizeExp(ur_exp_command_buffer_handle_t hCommandBuffer) try {
+  UR_ASSERT(hCommandBuffer, UR_RESULT_ERROR_INVALID_NULL_POINTER);
+  UR_ASSERT(!hCommandBuffer->isFinalized, UR_RESULT_ERROR_INVALID_OPERATION);


this is not thread safe, I think you should move this (and setting isFinalized) to closeCommandList.
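
A minimal sketch of the suggested change, assuming the command buffer owns a mutex: the check of `isFinalized` and the write to it happen under the same lock inside `closeCommandList`, so two concurrent finalize calls cannot both pass the check. The `result_t` enum and the `closeCalls` counter are hypothetical and exist only to make the example observable.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for ur_result_t.
enum result_t { RESULT_SUCCESS, RESULT_ERROR_INVALID_OPERATION };

struct command_buffer {
  // Check-and-set happens atomically with respect to other callers.
  result_t closeCommandList() {
    std::lock_guard<std::mutex> lock(mutex);
    if (isFinalized) {
      return RESULT_ERROR_INVALID_OPERATION;
    }
    // The real code would close the Level Zero command list here.
    ++closeCalls;
    isFinalized = true;
    return RESULT_SUCCESS;
  }

  int closeCalls = 0;

private:
  std::mutex mutex;
  bool isFinalized = false;
};
```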

igchor · 2025-01-21T17:14:35Z

source/adapters/level_zero/v2/command_list_manager.cpp

+    : context(context), device(device),
+      eventPool(context->eventPoolCache.borrow(device->Id.value(), flags)),
+      zeCommandList(
+          std::forward<v2::raii::command_list_unique_handle>(commandList)),


std::move()
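
For context on this one-word comment: `std::forward` is meant for forwarding references in templates, while a parameter declared with a concrete rvalue-reference type is simply moved. A small sketch, using `std::unique_ptr<int>` as a hypothetical stand-in for the command list handle type:

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct manager {
  // `list` has a fixed type, so it is not a forwarding reference;
  // std::move states the intent directly.
  explicit manager(std::unique_ptr<int> &&list)
      : commandList(std::move(list)) {}

  std::unique_ptr<int> commandList;
};
```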

igchor · 2025-01-21T17:16:52Z

source/adapters/level_zero/v2/queue_immediate_in_order.cpp

+              reinterpret_cast<ze_command_list_handle_t>(hNativeHandle),
+              [ownZeQueue](ze_command_list_handle_t hZeCommandList) {
+                if (ownZeQueue) {
+                  zeCommandListDestroy(hZeCommandList);


ZE_CALL_NOCHECK, this is needed to have proper logs

Xewar313 and others added 8 commits December 18, 2024 13:06

Prepare ground for command_buffer in v2

519c9c3

Enforce in order list usage, and add initialization and destruction t…

f87741e

…o buffer

Merge branch 'oneapi-src:main' into add-command-buffer-support

94ce521

Add initial support of command buffers to adapter v2

159ebc8

Merge branch 'add-command-buffer-support' of github.com:Xewar313/unif…

6d0d8b3

…ied-runtime into add-command-buffer-support

Update UR calls handling

bb90ee5

Remove unnecessary comment

84ef0df

Move not implemented command buffer commands to previous position

1716db3

Xewar313 requested review from a team as code owners January 8, 2025 11:51

Xewar313 requested a review from reble January 8, 2025 11:51

github-actions bot added level-zero L0 adapter specific issues command-buffer Command Buffer feature addition/changes/specification labels Jan 8, 2025

EwanC reviewed Jan 8, 2025

source/adapters/level_zero/v2/command_buffer.cpp

igchor reviewed Jan 9, 2025

Fix most issues with code

7da53d8

Xewar313 requested a review from a team as a code owner January 10, 2025 14:19

Xewar313 added 3 commits January 13, 2025 10:56

Fix formatting and modify queue_api template

895f5c6

Move command buffer cleanup to destructor

384326c

Use cached command lists instead of created ones

a1dd428

pbalcer reviewed Jan 14, 2025

igchor reviewed Jan 15, 2025

source/adapters/level_zero/v2/command_buffer.cpp

Xewar313 added 7 commits January 20, 2025 09:27

Remove not needed function and change phrasing

4e3072a

Add initial implementation of command list manager

cbfba58

Use list manager instead of custom implementation in queue

1de57ef

Optimalize imports

de2f273

Remove not needed destructor

d979f6a

Merge branch 'main' into add-command-buffer-support

021c0e4

Fix formatting

545e577

github-actions bot added the loader Loader related feature/bug label Jan 21, 2025

pbalcer reviewed Jan 21, 2025

Xewar313 added 2 commits January 21, 2025 12:20

Revert "Fix formatting"

ea643b3

This reverts commit 545e577.

Move command list close to the command buffer

8b7b269

pbalcer reviewed Jan 21, 2025

Xewar313 added 5 commits January 21, 2025 13:16

Moved try outside function block

95f978c

Move enqueue generic command list back to queue

30f2f91

Share events and lists between queue and command list manager

c00d960

Use ze events instead of ur in getSignalEvent

06e7807

Remove not needed structs and reformat code

9f53547

igchor reviewed Jan 21, 2025
