[WIP] Playing around a bit with outlining GPU kernels. #76
I did a thing: the changes are neither especially significant nor polished (using them currently requires a lot of manual pasting), but I wanted to open this PR to enable discussion on what GPU support in Thorin could look like.
What this does: I added an axiom `%core.gpu` that wraps a function (obviously, this should not stay in `core`). This axiom is never lowered away, so in the backend we can take the lam it wraps, copy it (recursively) into a new world, and mark it as external in that world.
This new world is then emitted by the backend. The backend uses a very simple notion of a target for this, which allows overriding some default imports (for thread id, ...) and the like per target.
The LLVM IR is currently emitted to the same stream as the CPU code, so one has to manually copy it into a separate kernel file, which can then be compiled by e.g. ROCm clang into an HSA code object file (`hsaco`). This `hsaco` file can then be loaded using the HIP driver API and executed by a runtime (see the example in `launch_gcn`). The biggest todos to make this actually useful are probably splitting the CPU and GPU outputs and integrating some sort of runtime (possibly an adaptation of AnyDSL/runtime?).
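For reference, loading and launching such a `hsaco` file through the HIP module API could look roughly like the sketch below. This is only an illustration of the flow described above, not the actual code in `launch_gcn`; the file name `kernel.hsaco`, the kernel entry name `kernel`, the `gfx90a` architecture in the compile command, and the single-buffer argument layout are all assumptions.

```cpp
// Minimal sketch: load a compiled hsaco and launch one kernel from it via the
// HIP module API. Requires a ROCm installation and an AMD GPU.
//
// The device code would be built beforehand, e.g. with ROCm clang:
//   clang --target=amdgcn-amd-amdhsa -mcpu=gfx90a kernel.ll -o kernel.hsaco
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                                      \
    do {                                                                     \
        hipError_t err_ = (expr);                                            \
        if (err_ != hipSuccess) {                                            \
            std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err_));\
            std::exit(1);                                                    \
        }                                                                    \
    } while (0)

int main() {
    // Load the code object and look up the externalized lam by name.
    hipModule_t module;
    hipFunction_t kernel;
    HIP_CHECK(hipModuleLoad(&module, "kernel.hsaco"));
    HIP_CHECK(hipModuleGetFunction(&kernel, module, "kernel"));

    // Device buffer handed to the kernel; the argument layout must match the
    // signature of the lam wrapped by %core.gpu.
    float* buf;
    size_t n = 1024;
    HIP_CHECK(hipMalloc(&buf, n * sizeof(float)));

    void* args[] = {&buf};
    HIP_CHECK(hipModuleLaunchKernel(kernel,
                                    /*gridDim*/ 4, 1, 1,
                                    /*blockDim*/ 256, 1, 1,
                                    /*sharedMemBytes*/ 0,
                                    /*stream*/ nullptr,
                                    args, /*extra*/ nullptr));
    HIP_CHECK(hipDeviceSynchronize());

    HIP_CHECK(hipFree(buf));
    HIP_CHECK(hipModuleUnload(module));
    return 0;
}
```

A future runtime integration would presumably hide exactly this module/function lookup and launch boilerplate behind something like the `launch_gcn` entry point.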
One would probably also want a GPU dialect that exhibits an assortment of axioms representing OpenCL kernel functions, like the index accesses present here as an example.
In the long run, one might want to add ways to use Thorin as a driver, so that compilation of the (various) kernels is performed automatically; maybe even the clang-offload-bundler could be used to produce a single binary with the kernels embedded.
Note that the compilation model proposed here follows the design of the NVC++ compiler more closely: the frontend parses the code just once, and only the backends (and maybe parts of the middle end) run for each targeted GPU architecture. This is in contrast to NVCC's and Clang's default CUDA/HIP compilation flows, which parse the C(++) code twice: once for the host and once for the devices. The latter comes with a bunch of complications, e.g. consistent lambda numbering. Ideally, this approach reduces the number of such oddities.