
Implement some Dark API functions #41

Merged: 2 commits into lshqqytiger:dev on Aug 29, 2024
Conversation

@SEt-t (Author) commented Aug 27, 2024

These are the functions required by CUDA runtime 6.5 and 7.0. CUDA-Z is now able to run the memory benchmark.

The compute benchmark fails with

Unrecognized statement mul24.lo.s32 %r8, %r7, %r7; found at 641406:641433

and I don't understand how to add a new instruction to the ptx translator.
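
For reference, per the PTX ISA docs, mul24.lo multiplies the low 24 bits of each operand and returns the low 32 bits of the 48-bit product. A minimal C++ sketch of the semantics (my own summary, not project code):

```cpp
#include <cstdint>

// Reference semantics of PTX mul24.lo.s32 (a sketch, not ZLUDA code):
// sign-extend the low 24 bits of each operand, multiply, and keep the
// low 32 bits of the 48-bit product.
int32_t mul24_lo_s32(int32_t a, int32_t b) {
    int64_t a24 = ((int64_t)(a & 0xFFFFFF) ^ 0x800000) - 0x800000;
    int64_t b24 = ((int64_t)(b & 0xFFFFFF) ^ 0x800000) - 0x800000;
    return (int32_t)(a24 * b24);
}
```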

Thank you for continuing ROCm 5 support, as version 6 dropped support for my GPU.

@lshqqytiger added the implementation (Unimplemented feature(s)) label on Aug 27, 2024
@lshqqytiger changed the base branch from master to dev on August 29, 2024 06:44
@lshqqytiger self-requested a review on August 29, 2024 08:08
@lshqqytiger (Owner) left a comment

Thank you for your contribution. Minor changes are requested.

Three review threads on zluda/src/impl/dark_api.rs (outdated, resolved)
@SEt-t (Author) commented Aug 29, 2024

Updated the code.

@lshqqytiger (Owner) commented Aug 29, 2024

Can I merge this PR?

@SEt-t (Author) commented Aug 29, 2024

Yes, I think it's a reasonable implementation.

Can you comment on what needs to be done to add the mul24 instruction to the ptx translator?

@lshqqytiger (Owner) commented Aug 29, 2024

It is a work in progress and I expect it will be ready in a few days.
Thank you for your contribution.

@lshqqytiger merged commit 5fa0aee into lshqqytiger:dev on Aug 29, 2024
@lshqqytiger (Owner) commented Aug 30, 2024

mul24.lo is implemented. However, as CUDA-Z uses the legacy mov behavior, comgr fails to build mov.u16 instructions with special registers. Could you have a look at it?

You can debug it using the offline compiler (zoc.exe).

mul24.lo introduction: f342702..230e5dd
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-ntid
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-ctaid
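
For reference, %ntid and %ctaid are .v4.u32 registers in current PTX, but PTX ISA 1.x code reads their components with 16-bit mov. A hypothetical sketch of the conversion such a mov implies (the helper name is mine, not ZLUDA's API):

```cpp
#include <cstdint>

// Legacy `mov.u16 %rh, %ntid.x` reads a now-32-bit special register into
// a 16-bit destination, so the translator has to treat it as a truncating
// conversion (cvt.u16.u32) rather than a width mismatch.
uint16_t legacy_mov_u16_from_sreg(uint32_t sreg_value) {
    return (uint16_t)sreg_value;
}
```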

@SEt-t (Author) commented Aug 30, 2024

I see that the tests are not very thorough: they should test something like 0x9F12345 * 511. Do you translate mul24 to a native mul24, or emulate it with a regular mul?
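
To spell out why 0x9F12345 * 511 is a useful test: 0x9F12345 does not fit in 24 bits, so a native 24-bit multiply and an emulation with a regular 32-bit mul disagree unless the emulation masks and sign-extends first. A host-side sketch (mine, not the project's tests):

```cpp
#include <cassert>
#include <cstdint>

int main() {
    int32_t a = 0x9F12345, b = 511;
    // Correct mul24.lo: truncate a to its low 24 bits (0xF12345, which is
    // negative as a signed 24-bit value) before multiplying.
    int64_t a24 = ((int64_t)(a & 0xFFFFFF) ^ 0x800000) - 0x800000; // -974011
    int32_t mul24_lo = (int32_t)(a24 * b);
    // A naive 32-bit multiply keeps the high bits of a and wraps differently.
    int32_t naive = (int32_t)((uint32_t)a * (uint32_t)b);
    assert(mul24_lo != naive);
    return 0;
}
```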

Sorry, I really don't understand that ptx translation code. To me it looks like a panic in Rust code that has nothing to do with comgr.

thread 'main' panicked at ptx\src\translate.rs:56:1:
MismatchedType
stack backtrace:
0: std::panicking::begin_panic_handler
at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library\std\src\panicking.rs:652
1: core::panicking::panic_fmt
at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library\core\src\panicking.rs:72
2: enum2$<ptx::translate::TranslateError>::mismatched_type
at C:\Lang\zluda\ptx\src\translate.rs:39
3: ptx::translate::default_implicit_conversion_type
at C:\Lang\zluda\ptx\src\translate.rs:9048
4: ptx::translate::default_implicit_conversion
at C:\Lang\zluda\ptx\src\translate.rs:8972
5: ptx::translate::implicit_conversion_mov
at C:\Lang\zluda\ptx\src\translate.rs:9115
6: ptx::translate::insert_implicit_conversions_impl::closure$0<enum2$<ptx::ast::Instruction<enum2$<ptx::translate::ExpandedArgParams> > > >
at C:\Lang\zluda\ptx\src\translate.rs:4903

@lshqqytiger (Owner) commented Aug 31, 2024

mul24.lo will be translated into a call to a function __ockl_mul24_<scalar_type>. (Otherwise, zoc translates it into something like %"13" = mul i32 %"14", 9815513.)
But I could not find a corresponding ockl function for the hi instruction.
(See ptx/lib/zluda_ptx_impl.cpp.)

Sorry for the confusion. I meant zoc, not comgr.
The MismatchedType error occurs because zoc does not support the legacy behavior of 16-bit mov.
When I tried to apply ConversionKind::Default in that situation, comgr panicked with a thread 'main' has overflowed its stack error.
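
For the hi part, one possible emulation without an ockl helper (a sketch based on the PTX definition, not shipped code) is to widen to 64 bits and take bits [47:16] of the product:

```cpp
#include <cstdint>

// Hypothetical emulation of mul24.hi.s32: bits [47:16] of the 48-bit
// product of the sign-extended low 24 bits of each operand.
__device__ int32_t mul24_hi_s32_emul(int32_t a, int32_t b) {
    int64_t a24 = ((int64_t)(a & 0xFFFFFF) ^ 0x800000) - 0x800000;
    int64_t b24 = ((int64_t)(b & 0xFFFFFF) ^ 0x800000) - 0x800000;
    return (int32_t)((a24 * b24) >> 16);
}
```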

@lshqqytiger mentioned this pull request on Sep 11, 2024
@lshqqytiger added a commit that referenced this pull request on Sep 11, 2024
* Restore cublas argument. (injector)

* Implement some Dark API functions (#41)

* Implement some Dark API functions

* Better error handling

* Implement mul24.lo.

* Implement mul24.hi.

* Fix mul24.lo implementation.

* Make mul24 tests more thorough.

* Add ZLUDA_COMGR_LOG_LEVEL.

* Bring back the minimal implementations of runtime API. (#45)

* [Fix] Handle stream correctly.

* WIP

* Fix fatbin.

* Revert.

* wip

* Remove redundant functions.

* Bump version.

---------

Co-authored-by: SEt <[email protected]>
@SEt-t (Author) commented Sep 13, 2024

Use these instead of ockl:

#include <cstdint>

// Declarations bound straight to the AMDGPU 24-bit multiply intrinsics.
__device__  int32_t __attribute__((const))  mul24( int32_t a,  int32_t b) __asm("llvm.amdgcn.mul.i24");
__device__ uint32_t __attribute__((const)) umul24(uint32_t a, uint32_t b) __asm("llvm.amdgcn.mul.u24");
__device__  int32_t __attribute__((const))  mul24_hi( int32_t a,  int32_t b) __asm("llvm.amdgcn.mulhi.i24");
__device__ uint32_t __attribute__((const)) umul24_hi(uint32_t a, uint32_t b) __asm("llvm.amdgcn.mulhi.u24");
// mad24 is mul24 plus an add; the backend folds it into a single mad instruction.
__device__  int32_t __attribute__((const))  mad24( int32_t a,  int32_t b,  int32_t c) {return  mul24(a, b) + c;}
__device__ uint32_t __attribute__((const)) umad24(uint32_t a, uint32_t b, uint32_t c) {return umul24(a, b) + c;}

That should give you a direct mapping to AMD instructions; even the mad variants are exactly one instruction (at least when compiling from OpenCL).
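
A hypothetical HIP smoke test for these bindings (the kernel and harness are mine; assumes a ROCm toolchain that accepts the llvm.amdgcn.* asm labels above):

```cpp
#include <cstdint>
#include <cstdio>
#include <hip/hip_runtime.h>

__global__ void mul24_smoke(int32_t a, int32_t b, int32_t* out) {
    out[0] = mul24(a, b);      // low 32 bits of the 48-bit product
    out[1] = mul24_hi(a, b);   // hi intrinsic result
    out[2] = mad24(a, b, 1);   // multiply-add
}

int main() {
    int32_t* out;
    hipMalloc(&out, 3 * sizeof(int32_t));
    hipLaunchKernelGGL(mul24_smoke, dim3(1), dim3(1), 0, 0, 0x9F12345, 511, out);
    int32_t host[3];
    hipMemcpy(host, out, sizeof(host), hipMemcpyDeviceToHost);
    printf("lo=%d hi=%d mad=%d\n", host[0], host[1], host[2]);
    hipFree(out);
    return 0;
}
```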

@lshqqytiger (Owner) commented Sep 13, 2024

Would you open a new pull request? Or I can add you as a co-author.

@SEt-t (Author) commented Sep 13, 2024

Just add me as a co-author – you understand that translation code far better.

@lshqqytiger (Owner) commented Sep 13, 2024

llvm.amdgcn.mul works fine, but mulhi seems to return the high 32 bits of the 64-bit result, while CUDA returns the high 32 bits of the 48-bit result.

Test case:
x: 0xEA129B
y: 0x95C5D9
64-bit product: 0x0000 88F1 BAF2 0C63

Expected (CUDA, bits [47:16] of the product): 0x88F1 BAF2
Actual (llvm.amdgcn.mulhi, bits [63:32]): 0x0000 88F1

@SEt-t (Author) commented Sep 13, 2024

I missed that PTX has a non-standard definition of the hi part. (Current ZLUDA is also wrong there, by the way: try the same test, 0x9F12345 * 511.)
I guess the best we can do is:

// alignbit concatenates {hi, lo} and shifts right by 16, yielding bits [47:16] of the 48-bit product: the hi part as PTX defines it.
__device__  int32_t __attribute__((const))  mul24h( int32_t a,  int32_t b) {return __builtin_amdgcn_alignbit( mul24_hi(a, b),  mul24(a, b), 16);}
__device__ uint32_t __attribute__((const)) umul24h(uint32_t a, uint32_t b) {return __builtin_amdgcn_alignbit(umul24_hi(a, b), umul24(a, b), 16);}

It's 3x slower, but if someone wants the 'true' hi part and writes mul24h() >> 16, the compiler is smart enough to reduce that to a single hi instruction.
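
For example, a hypothetical caller that wants the intrinsic-style high 32 bits:

```cpp
// The arithmetic shift drops bits [31:16] of mul24h's result, leaving
// bits [47:32] of the product; the compiler folds this back into the
// single mulhi instruction.
__device__ int32_t true_hi(int32_t a, int32_t b) {
    return mul24h(a, b) >> 16;
}
```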

@SEt-t (Author) commented Sep 18, 2024

Good news: I'm not sure what changed, but it's now possible to compile that kernel, and the compute benchmark in CUDA-Z works.

Bad news: while compiling, amd_comgr.dll exhausts the default stack through recursion and crashes. Raising the process stack limit lets it finish successfully, but it's quite slow.
