
Implement some Dark API functions #41

Merged: 2 commits into lshqqytiger:dev on Aug 29, 2024
Conversation

@SEt-t (Author) commented Aug 27, 2024

These are the functions required by CUDA runtime 6.5 and 7.0. CUDA-Z is now able to run the memory benchmark.

The compute benchmark fails with

Unrecognized statement mul24.lo.s32 %r8, %r7, %r7; found at 641406:641433

and I don't understand how to add a new instruction to the ptx translator.
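
For reference, per the PTX ISA docs, mul24.lo multiplies the low 24 bits of each operand and returns the low 32 bits of the 48-bit product. A minimal C++ sketch of the semantics (my own summary, not project code):

```cpp
#include <cstdint>

// Reference semantics of PTX mul24.lo.s32 (a sketch, not ZLUDA code):
// sign-extend the low 24 bits of each operand, multiply, and keep the
// low 32 bits of the 48-bit product.
int32_t mul24_lo_s32(int32_t a, int32_t b) {
    int64_t a24 = ((int64_t)(a & 0xFFFFFF) ^ 0x800000) - 0x800000;
    int64_t b24 = ((int64_t)(b & 0xFFFFFF) ^ 0x800000) - 0x800000;
    return (int32_t)(a24 * b24);
}
```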

Thank you for continuing ROCm 5 support, as version 6 dropped support for my GPU.

@lshqqytiger added the implementation (Unimplemented feature(s)) label on Aug 27, 2024
@lshqqytiger changed the base branch from master to dev on August 29, 2024 06:44
@lshqqytiger self-requested a review on August 29, 2024 08:08
@lshqqytiger (Owner) left a comment

Thank you for your contribution. Minor changes are requested.

Three review threads on zluda/src/impl/dark_api.rs (outdated, resolved)
@SEt-t (Author) commented Aug 29, 2024

Updated the code.

@lshqqytiger (Owner) commented Aug 29, 2024

Can I merge this PR?

@SEt-t (Author) commented Aug 29, 2024

Yes, I think it's a reasonable implementation.

Can you comment on what needs to be done to add the mul24 instruction to the ptx translator?

@lshqqytiger (Owner) commented Aug 29, 2024

It is a work in progress and I expect it will be ready in a few days.
Thank you for your contribution.

@lshqqytiger merged commit 5fa0aee into lshqqytiger:dev on Aug 29, 2024
@lshqqytiger (Owner) commented Aug 30, 2024

mul24.lo is implemented. However, as CUDA-Z uses the legacy mov behavior, comgr fails to build mov.u16 instructions with special registers. Could you have a look at it?

You can debug it using the offline compiler (zoc.exe).

mul24.lo introduction: f342702..230e5dd
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-ntid
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-ctaid
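
For reference, %ntid and %ctaid are .v4.u32 registers in current PTX, but PTX ISA 1.x code reads their components with 16-bit mov. A hypothetical sketch of the conversion such a mov implies (the helper name is mine, not ZLUDA's API):

```cpp
#include <cstdint>

// Legacy `mov.u16 %rh, %ntid.x` reads a now-32-bit special register into
// a 16-bit destination, so the translator has to treat it as a truncating
// conversion (cvt.u16.u32) rather than a width mismatch.
uint16_t legacy_mov_u16_from_sreg(uint32_t sreg_value) {
    return (uint16_t)sreg_value;
}
```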

@SEt-t (Author) commented Aug 30, 2024

I see that the tests are not very thorough: they should test something like 0x9F12345 * 511. Do you translate mul24 to a native mul24, or emulate it with a regular mul?
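
To spell out why 0x9F12345 * 511 is a useful test: 0x9F12345 does not fit in 24 bits, so a native 24-bit multiply and an emulation with a regular 32-bit mul disagree unless the emulation masks and sign-extends first. A host-side sketch (mine, not the project's tests):

```cpp
#include <cassert>
#include <cstdint>

int main() {
    int32_t a = 0x9F12345, b = 511;
    // Correct mul24.lo: truncate a to its low 24 bits (0xF12345, which is
    // negative as a signed 24-bit value) before multiplying.
    int64_t a24 = ((int64_t)(a & 0xFFFFFF) ^ 0x800000) - 0x800000; // -974011
    int32_t mul24_lo = (int32_t)(a24 * b);
    // A naive 32-bit multiply keeps the high bits of a and wraps differently.
    int32_t naive = (int32_t)((uint32_t)a * (uint32_t)b);
    assert(mul24_lo != naive);
    return 0;
}
```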

Sorry, I really don't understand that ptx translation code. To me it looks like a panic in Rust code that has nothing to do with comgr.

thread 'main' panicked at ptx\src\translate.rs:56:1:
MismatchedType
stack backtrace:
0: std::panicking::begin_panic_handler
at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library\std\src\panicking.rs:652
1: core::panicking::panic_fmt
at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library\core\src\panicking.rs:72
2: enum2$<ptx::translate::TranslateError>::mismatched_type
at C:\Lang\zluda\ptx\src\translate.rs:39
3: ptx::translate::default_implicit_conversion_type
at C:\Lang\zluda\ptx\src\translate.rs:9048
4: ptx::translate::default_implicit_conversion
at C:\Lang\zluda\ptx\src\translate.rs:8972
5: ptx::translate::implicit_conversion_mov
at C:\Lang\zluda\ptx\src\translate.rs:9115
6: ptx::translate::insert_implicit_conversions_impl::closure$0<enum2$<ptx::ast::Instruction<enum2$<ptx::translate::ExpandedArgParams> > > >
at C:\Lang\zluda\ptx\src\translate.rs:4903

@lshqqytiger (Owner) commented Aug 31, 2024

mul24.lo will be translated into a call to a function __ockl_mul24_<scalar_type>. (Otherwise, zoc translates it into something like %"13" = mul i32 %"14", 9815513.)
But I could not find a corresponding ockl function for the hi instruction.
(See ptx/lib/zluda_ptx_impl.cpp.)

Sorry for the confusion. I meant zoc, not comgr.
The MismatchedType error occurs because zoc does not support the legacy behavior of 16-bit mov.
When I tried to apply ConversionKind::Default in that situation, comgr panicked with a thread 'main' has overflowed its stack error.
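
For the hi part, one possible emulation without an ockl helper (a sketch based on the PTX definition, not shipped code) is to widen to 64 bits and take bits [47:16] of the product:

```cpp
#include <cstdint>

// Hypothetical emulation of mul24.hi.s32: bits [47:16] of the 48-bit
// product of the sign-extended low 24 bits of each operand.
__device__ int32_t mul24_hi_s32_emul(int32_t a, int32_t b) {
    int64_t a24 = ((int64_t)(a & 0xFFFFFF) ^ 0x800000) - 0x800000;
    int64_t b24 = ((int64_t)(b & 0xFFFFFF) ^ 0x800000) - 0x800000;
    return (int32_t)((a24 * b24) >> 16);
}
```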

@lshqqytiger mentioned this pull request on Sep 11, 2024
@lshqqytiger added a commit that referenced this pull request on Sep 11, 2024
* Restore cublas argument. (injector)

* Implement some Dark API functions (#41)

* Implement some Dark API functions

* Better error handling

* Implement mul24.lo.

* Implement mul24.hi.

* Fix mul24.lo implementation.

* Make mul24 tests more thorough.

* Add ZLUDA_COMGR_LOG_LEVEL.

* Bring back the minimal implementations of runtime API. (#45)

* [Fix] Handle stream correctly.

* WIP

* Fix fatbin.

* Revert.

* wip

* Remove redundant functions.

* Bump version.

---------

Co-authored-by: SEt <[email protected]>
@SEt-t (Author) commented Sep 13, 2024

Use these instead of ockl:

#include <cstdint>

// Declarations bound straight to the AMDGPU 24-bit multiply intrinsics.
__device__  int32_t __attribute__((const))  mul24( int32_t a,  int32_t b) __asm("llvm.amdgcn.mul.i24");
__device__ uint32_t __attribute__((const)) umul24(uint32_t a, uint32_t b) __asm("llvm.amdgcn.mul.u24");
__device__  int32_t __attribute__((const))  mul24_hi( int32_t a,  int32_t b) __asm("llvm.amdgcn.mulhi.i24");
__device__ uint32_t __attribute__((const)) umul24_hi(uint32_t a, uint32_t b) __asm("llvm.amdgcn.mulhi.u24");
// mad24 is mul24 plus an add; the backend folds it into a single mad instruction.
__device__  int32_t __attribute__((const))  mad24( int32_t a,  int32_t b,  int32_t c) {return  mul24(a, b) + c;}
__device__ uint32_t __attribute__((const)) umad24(uint32_t a, uint32_t b, uint32_t c) {return umul24(a, b) + c;}

That should give you a direct mapping to AMD instructions; even the mad variants are exactly one instruction (at least when compiling from OpenCL).
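
A hypothetical HIP smoke test for these bindings (the kernel and harness are mine; assumes a ROCm toolchain that accepts the llvm.amdgcn.* asm labels above):

```cpp
#include <cstdint>
#include <cstdio>
#include <hip/hip_runtime.h>

__global__ void mul24_smoke(int32_t a, int32_t b, int32_t* out) {
    out[0] = mul24(a, b);      // low 32 bits of the 48-bit product
    out[1] = mul24_hi(a, b);   // hi intrinsic result
    out[2] = mad24(a, b, 1);   // multiply-add
}

int main() {
    int32_t* out;
    hipMalloc(&out, 3 * sizeof(int32_t));
    hipLaunchKernelGGL(mul24_smoke, dim3(1), dim3(1), 0, 0, 0x9F12345, 511, out);
    int32_t host[3];
    hipMemcpy(host, out, sizeof(host), hipMemcpyDeviceToHost);
    printf("lo=%d hi=%d mad=%d\n", host[0], host[1], host[2]);
    hipFree(out);
    return 0;
}
```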

@lshqqytiger (Owner) commented Sep 13, 2024

Would you open a new pull request? Or I can add you as a co-author.

@SEt-t (Author) commented Sep 13, 2024

Just add me as a co-author – you understand that translation code far better.

@lshqqytiger (Owner) commented Sep 13, 2024

llvm.amdgcn.mul works fine, but mulhi seems to return the high 32 bits of the 64-bit result, while CUDA returns the high 32 bits of the 48-bit result.

Test case:
x: 0xEA129B
y: 0x95C5D9
64-bit product: 0x0000 88F1 BAF2 0C63

Expected (CUDA, bits [47:16] of the product): 0x88F1 BAF2
Actual (llvm.amdgcn.mulhi, bits [63:32]): 0x0000 88F1

@SEt-t (Author) commented Sep 13, 2024

I missed that PTX has a non-standard definition of the hi part. (Current ZLUDA is also wrong there, by the way: try the same test, 0x9F12345 * 511.)
I guess the best we can do is:

// alignbit concatenates {hi, lo} and shifts right by 16, yielding bits [47:16] of the 48-bit product: the hi part as PTX defines it.
__device__  int32_t __attribute__((const))  mul24h( int32_t a,  int32_t b) {return __builtin_amdgcn_alignbit( mul24_hi(a, b),  mul24(a, b), 16);}
__device__ uint32_t __attribute__((const)) umul24h(uint32_t a, uint32_t b) {return __builtin_amdgcn_alignbit(umul24_hi(a, b), umul24(a, b), 16);}

It's 3x slower, but if someone wants the 'true' hi part and writes mul24h() >> 16, the compiler is smart enough to reduce that to a single hi instruction.
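
For example, a hypothetical caller that wants the intrinsic-style high 32 bits:

```cpp
// The arithmetic shift drops bits [31:16] of mul24h's result, leaving
// bits [47:32] of the product; the compiler folds this back into the
// single mulhi instruction.
__device__ int32_t true_hi(int32_t a, int32_t b) {
    return mul24h(a, b) >> 16;
}
```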

@SEt-t (Author) commented Sep 18, 2024

Good news: I'm not sure what changed, but it's now possible to compile that kernel, and the compute benchmark in CUDA-Z works.

Bad news: while compiling, amd_comgr.dll exhausts the default stack through recursion and crashes. Raising the process stack limit lets it finish successfully, but it's quite slow.
