Implement some Dark API functions #41
e84105e to 4ee8315
Thank you for the contribution. Minor changes are requested.
Updated the code.
Can I merge this PR?
Yes, I think it's a reasonable implementation. Can you comment on what needs to be done to add |
It is a work in progress, and I expect it will be ready in a few days.
You can debug it using the offline compiler (zoc.exe).
I see that the tests are not very thorough: they should test something like 0x9F12345*511. Do you translate mul24 to native mul24 or emulate it with regular mul?
Sorry, I really don't understand that PTX translation code. To me it looks like a panic in the Rust code that has nothing to do with comgr.
Sorry for the confusion. I meant zoc, not comgr.
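For reference, the input suggested above can be checked on the host. This is a minimal sketch, not ZLUDA code: the helper names are hypothetical, and it assumes the PTX semantics in which mul24.lo multiplies the low 24 bits of each operand (sign- or zero-extended) and returns the low 32 bits of the 48-bit product.

```cpp
#include <cstdint>

// Hypothetical host-side reference helpers (not part of ZLUDA).

// Zero-extend the low 24 bits of x.
constexpr uint64_t zext24(uint32_t x) {
    return x & 0xFFFFFFu;
}

// Sign-extend the low 24 bits of x (bit 23 is the sign bit).
constexpr int64_t sext24(uint32_t x) {
    return static_cast<int64_t>(x & 0xFFFFFFu) - ((x & 0x800000u) ? 0x1000000 : 0);
}

// Assumed model of mul24.lo.u32: low 32 bits of the unsigned 24x24 product.
constexpr uint32_t ref_umul24_lo(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>(zext24(a) * zext24(b));
}

// Assumed model of mul24.lo.s32, returned as a raw 32-bit pattern.
constexpr uint32_t ref_mul24_lo(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>(static_cast<uint64_t>(sext24(a) * sext24(b)));
}

// Plain 32-bit multiply that does not truncate the operands to 24 bits.
constexpr uint32_t plain_mul(uint32_t a, uint32_t b) {
    return a * b;
}

// 0x9F12345 has bits set above bit 23, and bit 23 itself is set,
// so the signed, unsigned and plain results all differ.
static_assert(ref_umul24_lo(0x9F12345u, 511u) == 0xE15566BBu, "unsigned lo");
static_assert(ref_mul24_lo(0x9F12345u, 511u) == 0xE25566BBu, "signed lo");
static_assert(plain_mul(0x9F12345u, 511u) == 0xD85566BBu, "plain mul");
```

Under these assumptions, small in-range operands cannot tell the three variants apart, whereas this input separates all of them.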
* Restore cublas argument. (injector)
* Implement some Dark API functions (#41)
* Implement some Dark API functions
* Better error handling
* Implement mul24.lo.
* Implement mul24.hi.
* Fix mul24.lo implementation.
* Make mul24 tests more thorough.
* Add ZLUDA_COMGR_LOG_LEVEL.
* Bring back the minimal implementations of runtime API. (#45)
* [Fix] Handle stream correctly.
* WIP
* Fix fatbin.
* Revert.
* wip
* Remove redundant functions.
* Bump version.
---------
Co-authored-by: SEt <[email protected]>
Use these instead of ockl:
__device__ int32_t __attribute__((const)) mul24( int32_t a, int32_t b) __asm("llvm.amdgcn.mul.i24");
__device__ uint32_t __attribute__((const)) umul24(uint32_t a, uint32_t b) __asm("llvm.amdgcn.mul.u24");
__device__ int32_t __attribute__((const)) mul24_hi( int32_t a, int32_t b) __asm("llvm.amdgcn.mulhi.i24");
__device__ uint32_t __attribute__((const)) umul24_hi(uint32_t a, uint32_t b) __asm("llvm.amdgcn.mulhi.u24");
__device__ int32_t __attribute__((const)) mad24( int32_t a, int32_t b, int32_t c) {return mul24(a, b) + c;}
__device__ uint32_t __attribute__((const)) umad24(uint32_t a, uint32_t b, uint32_t c) {return umul24(a, b) + c;}
That should give you direct mapping to AMD instructions, even |
Would you open a new pull request? Or I can add you as a co-author.
Just add me as a co-author – you understand that translation code far better.
I missed that PTX has a non-standard definition of the hi part. (Current ZLUDA is also wrong there, by the way: try the same test, 0x9F12345*511.)
__device__ int32_t __attribute__((const)) mul24h( int32_t a, int32_t b) {return __builtin_amdgcn_alignbit( mul24_hi(a, b), mul24(a, b), 16);}
__device__ uint32_t __attribute__((const)) umul24h(uint32_t a, uint32_t b) {return __builtin_amdgcn_alignbit(umul24_hi(a, b), umul24(a, b), 16);}
It's 3x slower, but if someone wants 'true' hi part and does |
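The difference between the two "hi" definitions can be sketched on the host. This is a rough model with hypothetical helper names, not ZLUDA code. It assumes that PTX mul24.hi returns bits 47..16 of the 48-bit product, that the AMD mulhi intrinsics return bits 63..32 of the extended product (which is what the alignbit fix above implies), and that alignbit shifts the 64-bit concatenation of its first two arguments right and keeps the low 32 bits.

```cpp
#include <cstdint>

// Hypothetical host-side models (not part of ZLUDA or any AMD header).

// Sign-extend the low 24 bits of x.
constexpr int64_t sext24(uint32_t x) {
    return static_cast<int64_t>(x & 0xFFFFFFu) - ((x & 0x800000u) ? 0x1000000 : 0);
}

// Full products as 64-bit patterns (signed and unsigned 24x24 multiply).
constexpr uint64_t prod_s24(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(sext24(a) * sext24(b));
}
constexpr uint64_t prod_u24(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}

// Assumed PTX definition of the hi part: bits 47..16 of the product.
constexpr uint32_t ptx_hi(uint64_t prod) { return static_cast<uint32_t>(prod >> 16); }

// Assumed AMD definition of the hi part: bits 63..32 of the product.
constexpr uint32_t amd_hi(uint64_t prod) { return static_cast<uint32_t>(prod >> 32); }

// Low 32 bits of the product.
constexpr uint32_t lo32(uint64_t prod) { return static_cast<uint32_t>(prod); }

// Model of alignbit: low 32 bits of ({hi, lo} >> shift), shift taken mod 32.
constexpr uint32_t alignbit_model(uint32_t hi, uint32_t lo, uint32_t shift) {
    return static_cast<uint32_t>(((static_cast<uint64_t>(hi) << 32) | lo) >> (shift & 31u));
}

// The combination used by mul24h/umul24h in the comment above.
constexpr uint32_t emulated_ptx_hi(uint64_t prod) {
    return alignbit_model(amd_hi(prod), lo32(prod), 16);
}

// Signed case for 0x9F12345 * 511: the two hi definitions disagree.
static_assert(amd_hi(prod_s24(0x9F12345u, 511u)) == 0xFFFFFFFFu, "bits 63..32");
static_assert(ptx_hi(prod_s24(0x9F12345u, 511u)) == 0xFFFFE255u, "bits 47..16");
static_assert(emulated_ptx_hi(prod_s24(0x9F12345u, 511u)) == 0xFFFFE255u, "alignbit fix");
```

In this model the alignbit combination reproduces bits 47..16 from the AMD-style hi and lo halves, at the cost of three operations instead of one.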
Good news: I'm not sure what changed, but it's now possible to compile that kernel, and the compute benchmark in CUDA-Z works. Bad news: while compiling, amd_comgr.dll exhausts the default stack through recursion and crashes. Editing the process limit allows it to finish successfully, but it's quite slow.
This adds the functions that are required for CUDA runtime 6.5 and 7.0. CUDA-Z is now able to run the memory bench.
Compute bench fails on
and I don't understand how to add a new command to the PTX translator.
Thank you for continuing ROCm5 support, as version 6 dropped support for my GPU.