
[WIP] Playing around a bit with outlining GPU kernels. #76

Draft · wants to merge 2 commits into master from feature/gpu-outlining

Conversation

fodinabor
Collaborator

I did a thing - even though the changes are neither especially significant nor polished and thus require much manual pasting, I just wanted to have the PR to enable discussion on what GPU support in Thorin could look like.

What this does: I added an axiom %core.gpu that wraps a function (obviously, this should not stay in core).
This axiom is never lowered away; thus, in the backend, we can take the lam wrapped by the axiom, copy it (recursively) into a new world, and mark it as external in that world.
This new world is then emitted by the backend. The backend uses a super simple target notion for this, which allows overriding some default imports (for the thread id, ...) and so on per target.

The LLVM IR (.ll) code is currently emitted to the same stream as the CPU code, so one has to manually copy it into a separate kernel file, where it can be compiled by e.g. ROCm clang into an HSA code object file (hsaco).

This hsaco file can then be loaded using the HIP driver API and executed by a runtime (see example in launch_gcn).
The biggest todos to make this actually useful are probably splitting the CPU and GPU outputs and integrating some sort of runtime (possibly an adaptation of AnyDSL/runtime?).
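For orientation, loading such a hsaco file and launching a kernel via the HIP driver API (as the launch_gcn example does) looks roughly like the sketch below. The function names are the actual HIP driver API, but the snippet is pseudocode: the kernel name, arguments, and launch geometry are made up, and all error handling is elided.

```
// sketch: load a compiled code object and launch a kernel from it
hipModule_t   module;
hipFunction_t kernel;
hipModuleLoad(&module, "kernel.hsaco");             // load the code object
hipModuleGetFunction(&kernel, module, "my_kernel"); // look up the entry point (name hypothetical)

void* args[] = { &dev_ptr, &n };                    // kernel arguments (hypothetical)
hipModuleLaunchKernel(kernel,
                      grid_x, 1, 1,                 // grid dimensions
                      block_x, 1, 1,                // block dimensions
                      0 /*shared mem*/, nullptr /*default stream*/,
                      args, nullptr);
hipModuleUnload(module);
```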

One would probably want a GPU dialect that provides an assortment of axioms representing OpenCL kernel functions, like the index accesses present here as examples.

In the long run, one might want to add ways to use Thorin as a driver, so that compilation of the (various) kernels is performed automatically and maybe even the clang-offload-bundler can be used to produce a single binary with the kernels embedded.

Note that the compilation model proposed here follows the design of the NVC++ compiler more closely: the frontend parses the code just once, and only the backends (and maybe parts of the middle end) run for each targeted GPU architecture. This is in contrast to NVCC's and Clang's default CUDA / HIP compilation flows, which parse the C(++) code twice: once for the CPU and once for the devices. The latter comes with a bunch of complications, e.g. keeping lambda numbering consistent across the two passes.
Ideally, this approach reduces the number of these oddities.

@leissa
Member

leissa commented Sep 5, 2022

Just some random thoughts:

We need to eliminate the free vars within the kernel in order to make it invokable:

x = ...
for i in gpu(...) {
    ... x ...
}

Here, we need to add x to the signature of the lambda.
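One way to picture that lambda lifting, in the same pseudocode as above (names hypothetical): the free variable becomes an explicit kernel parameter, and the launch passes it along.

```
// before lifting: x is free inside the kernel body
x = ...
for i in gpu(...) {
    ... x ...
}

// after lifting: x is part of the kernel's signature,
// and the launch supplies it as an argument
kernel(i, x) { ... x ... }
launch(kernel, ..., x)
```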

We also need to transitively import everything that is used within the kernel:

for i in gpu(...) {
    foo()
    bar()
}

Here, we need to import foo, and bar into the new gpu module - possibly more stuff that is transitively used within foo or bar.
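Sketched in the same pseudocode (module syntax and names hypothetical), the outlined gpu module would have to end up containing the whole transitive call graph of the kernel:

```
// everything reachable from the kernel must live in the gpu module
module gpu {
    kernel(i) { foo(); bar() }
    foo() { baz() }   // imported because the kernel calls it
    bar() { ... }     // imported because the kernel calls it
    baz() { ... }     // imported transitively via foo
}
```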

Another design decision is which axioms should "just work" within the kernel and which ones are "CPU-only", so to speak, and vice versa:

x = sin(...)
for i in gpu(...) {
    ... sin(...) ...
}

In Thorin1 we had a set of builtin operations (+, -, *, ...), and everything else, like sin, ..., consisted of special functions that had to be accessed explicitly, like so:

for i, intrin in nvvm(...) { // can't remember the exact method - but you get the idea ...
    ... intrin.sin(...) ...
}

I never really liked this approach. I think we should simply add sin, cos, ... to core and deal with that on our side.
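To illustrate what that would buy us (the axiom name and the lowering targets here are hypothetical): the same axiom appears on both sides, and each backend lowers it to whatever its target provides.

```
x = %core.sin(...)          // CPU backend: lowered to e.g. an llvm.sin intrinsic
for i in gpu(...) {
    ... %core.sin(...) ...  // GPU backend: lowered to e.g. a device-library call
}
```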

However, there do exist functions that only make sense on the GPU. In this case, we can either use the intrin-approach as we originally had in Thorin1, or we enforce proper use of device functions with the type system (see below).

Now that we have this fancy type system in Thorin, we can also track different devices more explicitly with the type system. Maybe something like this:

fn foo(d: Dev)(m: Mem d, p: ptr m) ...
    bar(d)(p, ...)

And then, instead of automagically importing, we just partially apply all functions in question to the specific Dev. Then, adding gpu-only functions is pretty straightforward:

fn gpu_only(m: Mem GPU, p: ptr m) ...

(Note that I use Mem: Dev -> * instead of Mem: * here)
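A hypothetical usage sketch of that scheme, in the same pseudocode: picking a concrete device partially applies the function, and the device-indexed Mem type then rejects mixing devices.

```
foo(GPU)(m_gpu, p_gpu)   // ok: m_gpu : Mem GPU matches d = GPU
foo(CPU)(m_gpu, p_gpu)   // type error: Mem GPU is not Mem CPU
gpu_only(m_cpu, p_cpu)   // type error: gpu_only requires a Mem GPU
```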

@fodinabor force-pushed the feature/gpu-outlining branch from 53f7edc to 875b469 on November 2, 2022 11:33