Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BLAS] Simplify CublasScopedContextHandler #609

Merged

Conversation

konradkusiak97
Copy link
Contributor

@konradkusiak97 konradkusiak97 commented Oct 30, 2024

Nowadays we only use the cuda primary context in DPC++, hence we can refactor CublasScopedContextHandler.

  • Since now we only use the primary context which is unique to a device, we can change the
    unordered_map<context, handle> to be unordered_map<device, handle>,
  • I believe that the call to sycl::detail::contextSetExtendedDeleter is not necessary. It provides additional feature tying the lifetime of cublasHandle to the corresponding sycl queue but it makes the code very complicated (with atomics etc.) and uses the detail namespace which is not ideal.
  • cublas_handle.hpp was modified such that it is not templated anymore. Both DPC++ and AdaptiveCpp versions use the same mapping so that could be simplified. Also, we need to set the correct context in the custom destructor in order to destroy cublas handles so the native type was necessary there.

Side note:
We can definitely remove the dependency on UR headers just by changing the templated types to the native types.

test_main_blas_rt.txt
test_main_blas_ct.txt

@konradkusiak97 konradkusiak97 requested a review from a team as a code owner October 30, 2024 18:07
@konradkusiak97
Copy link
Contributor Author

@oneapi-src/onemkl-blas-write any thoughts on this?

@andrewtbarker
Copy link
Contributor

I very much like the simplification here. If a user has two cuda devices available, how do they specify which one to use?

@konradkusiak97
Copy link
Contributor Author

I very much like the simplification here. If a user has two cuda devices available, how do they specify which one to use?

They would specify that at the creation of a sycl::queue. Each queue can be mapped only to a single device and CublasScopedContextHandler constructor takes a queue parameter. My understanding is that we want to use one cublasHandle_t per one cuda device.

@konradkusiak97
Copy link
Contributor Author

Thanks for the review @andrewtbarker! In that case I will apply similar changes to other backends as well.

Copy link
Contributor

@Rbiessy Rbiessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. I agree that the CublasScopedContextHandler can be improved but I have 2 big concerns with the suggested changes.
Ideally we should have a closer look at the impact of this change on an application using oneMKL+cuBLAS or create an example that calls a few oneMKL functions.

src/blas/backends/cublas/cublas_scope_handle.cpp Outdated Show resolved Hide resolved
…xtendedDeleter

It seems the static thread_local unordered map needs to stay because of
all the thread shenanigans. But we're removing the use of detail
namespace in sycl since it's not necessary for correctness.
@konradkusiak97
Copy link
Contributor Author

konradkusiak97 commented Nov 8, 2024

I modified this PR such that the cublasHandle(s) are destroyed only in one place: at the end of the program. Because of all the threading shenanigans I believe the static thread_local unordered_map needs to stay.

By removing the use of sycl::detail::contextSetExtendedDeleter we aren't allowing the possibility of cublasHandle being destroyed when sycl::queue is destroyed. This is useful feature but it makes this code very complicated and I'm not sure exactly how much it is actually used in practice. If it is really necessary I think we should provide a public API for it in DPC++ and not use the detail namespace.

The good news is that those changes make AdaptiveCpp and DPC++ implementations identical (if that's of any benefit).

@Rbiessy
Copy link
Contributor

Rbiessy commented Nov 15, 2024

Alright, that looks fine to me then.

The good news is that those changes make AdaptiveCpp and DPC++ implementations identical (if that's of any benefit).

Do you think it would be possible to remove cublas_scope_handle_hipsycl.hpp and cublas_scope_handle_hipsycl.cpp as part of this PR then? I'm not sure if the blas domain still compile with AdaptiveCpp so it may be too difficult to test right now.

@konradkusiak97
Copy link
Contributor Author

Good point, I think those files can be removed but I don't have much experience with AdaptiveCpp so I wouldn't be able to test the build (at least right now).

@Rbiessy
Copy link
Contributor

Rbiessy commented Nov 15, 2024

Looking at the internal CI, it does not seem that the blas domain compiles with AdaptiveCpp anyway. One of the issue is #567. I see that there are still some differences like ih.get_native_device<sycl::backend::ext_oneapi_cuda>() vs interop_h.get_native_device<sycl::backend::cuda>() so let's not bother trying to fix that today. I think we can merge this next week, thanks for the work!

@Rbiessy Rbiessy merged commit c0cef0c into uxlfoundation:develop Nov 18, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants