[WIP] Playing around a bit with outlining GPU kernels. #76
I did a thing: the changes are neither especially significant nor polished (using them currently requires a lot of manual pasting), but I wanted to open this PR to enable discussion on what GPU support in Thorin could look like.
What this does: I added an axiom `%core.gpu` that wraps a function (obviously, this should not stay in `core`). This axiom is never lowered away, so in the backend we can take the lam it wraps, copy it (recursively) into a new world, and mark it as external in that world.
This new world is then emitted by the backend. The backend uses a very simple notion of a target for this, which allows overriding some default imports (for thread id, ...) and the like per target.
The LLVM IR is currently emitted to the same stream as the CPU code, so one has to manually copy it into a separate kernel file, which can then be compiled by e.g. ROCm clang into an HSA code object file (`hsaco`). This `hsaco` file can then be loaded using the HIP driver API and executed by a runtime (see the example in `launch_gcn`). The biggest todos to make this actually useful are probably splitting the CPU and GPU outputs and integrating some sort of runtime (possibly an adaptation of AnyDSL/runtime?).
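For reference, loading and launching such a `hsaco` file through the HIP module API could look roughly like the sketch below. This is only an illustration of the flow described above, not the actual code in `launch_gcn`; the file name `kernel.hsaco`, the kernel entry name `kernel`, the `gfx90a` architecture in the compile command, and the single-buffer argument layout are all assumptions.

```cpp
// Minimal sketch: load a compiled hsaco and launch one kernel from it via the
// HIP module API. Requires a ROCm installation and an AMD GPU.
//
// The device code would be built beforehand, e.g. with ROCm clang:
//   clang --target=amdgcn-amd-amdhsa -mcpu=gfx90a kernel.ll -o kernel.hsaco
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                                      \
    do {                                                                     \
        hipError_t err_ = (expr);                                            \
        if (err_ != hipSuccess) {                                            \
            std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err_));\
            std::exit(1);                                                    \
        }                                                                    \
    } while (0)

int main() {
    // Load the code object and look up the externalized lam by name.
    hipModule_t module;
    hipFunction_t kernel;
    HIP_CHECK(hipModuleLoad(&module, "kernel.hsaco"));
    HIP_CHECK(hipModuleGetFunction(&kernel, module, "kernel"));

    // Device buffer handed to the kernel; the argument layout must match the
    // signature of the lam wrapped by %core.gpu.
    float* buf;
    size_t n = 1024;
    HIP_CHECK(hipMalloc(&buf, n * sizeof(float)));

    void* args[] = {&buf};
    HIP_CHECK(hipModuleLaunchKernel(kernel,
                                    /*gridDim*/ 4, 1, 1,
                                    /*blockDim*/ 256, 1, 1,
                                    /*sharedMemBytes*/ 0,
                                    /*stream*/ nullptr,
                                    args, /*extra*/ nullptr));
    HIP_CHECK(hipDeviceSynchronize());

    HIP_CHECK(hipFree(buf));
    HIP_CHECK(hipModuleUnload(module));
    return 0;
}
```

A future runtime integration would presumably hide exactly this module/function lookup and launch boilerplate behind something like the `launch_gcn` entry point.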
One would probably also want a GPU dialect that exhibits an assortment of axioms representing OpenCL kernel functions, like the index accesses present here as an example.
In the long run, one might want to add ways to use Thorin as a driver, so that compilation of the (various) kernels is performed automatically; maybe even the clang-offload-bundler could be used to produce a single binary with the kernels embedded.
Note that the compilation model proposed here follows the design of the NVC++ compiler more closely: the frontend parses the code just once, and only the backends (and maybe parts of the middle end) run for each targeted GPU architecture. This is in contrast to NVCC's and Clang's default CUDA/HIP compilation flows, which parse the C(++) code twice: once for the host and once for the devices. The latter comes with a bunch of complications, e.g. consistent lambda numbering. Ideally, this approach reduces the number of such oddities.