A CUDA extension for PyTorch to compute the linear recurrence
import torch, linrec
kwargs = dict(size=(512, 2**16), device='cuda')
inputs, coeffs = torch.randn(**kwargs), torch.rand(**kwargs)
outputs = linrec.linrec(inputs, coeffs, reverse=False)
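The operator follows the usual first-order linear recurrence convention, `out[..., t] = coeffs[..., t] * out[..., t-1] + inputs[..., t]`, scanned along the last dimension. The loop below is an illustrative pure-PyTorch sketch under that assumption; `linrec_reference` is a hypothetical name used here for illustration and is not part of the package (see `linrec/impl/python/ops.py` for the package's own Python interface).

```python
import torch

def linrec_reference(inputs: torch.Tensor, coeffs: torch.Tensor, reverse: bool = False) -> torch.Tensor:
    # Sequential sketch of out[..., t] = coeffs[..., t] * out[..., t-1] + inputs[..., t]
    # along the last dimension (assumed convention, for illustration only).
    if reverse:  # a reverse scan runs right-to-left over the same data
        inputs, coeffs = inputs.flip(-1), coeffs.flip(-1)
    outputs = torch.empty_like(inputs)
    state = torch.zeros_like(inputs[..., 0])
    for t in range(inputs.shape[-1]):
        state = coeffs[..., t] * state + inputs[..., t]
        outputs[..., t] = state
    return outputs.flip(-1) if reverse else outputs
```

With the tensors from the quickstart snippet, the result should match `outputs` up to differences in floating-point accumulation order.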
├── eval
│ ├── add_linrec_to_path.py # add root to path if not installed
│ ├── bench.py # benchmark implementations
│ ├── debug.py # call/debug kernels
│ ├── test.py # test implementations
│ ├── tune.py # tune configurations
│ └── utils.py # eval utilities
├── linrec
│ ├── impl
│ │ ├── cuda
│ │ │ ├── build.py # build system
│ │ │ ├── cuhelpers.cuh # various cuda helpers
│ │ │ ├── dispatch.h # dispatch features
│ │ │ ├── executable.cpp # standalone executable
│ │ │ ├── extension.cpp # pytorch extension
│ │ │ ├── linrec.h # header files
│ │ │ ├── linrec_pipe.cu # pipe host-side function
│ │ │ ├── linrec_pipe.cuh # pipe kernel implementation
│ │ │ ├── linrec_ref.cu # reference host-side function
│ │ │ ├── linrec_ref.cuh # reference kernel implementation
│ │ │ ├── linrec_tile.cu # tile host-side function
│ │ │ ├── linrec_tile.cuh # tile kernel implementation
│ │ │ ├── memio.cuh # memory loading
│ │ │ └── ops.py # cuda ops interface
│ │ └── python
│ │   └── ops.py # python ops interface
├── pyproject.toml
├── README.md
├── requirements.txt
└── setup.py
Installing the CUDA Toolkit can be a bit tricky and is explained in the NVIDIA CUDA Installation Guide. For this specific project we have the following requirements:
- CUDA requires `gxx<=13.2`, as explained here and here.
- C++20 with `static constexpr` requires `gxx>=12.1`.
- triton/torch.compile requires `python>=3.9,<3.13`.
- the latest PyTorch is built with `cuda-runtime==12.4` (can be ignored).
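A quick way to sanity-check an existing setup against these constraints is to query the toolchain versions directly; this is a plain shell sketch with nothing project-specific in it:

```bash
${CXX:-g++} --version | head -n1   # needs gxx>=12.1 and gxx<=13.2
nvcc --version | grep release      # installed CUDA toolkit version
python --version                   # needs >=3.9,<3.13
python -c "import torch; print(torch.__version__, torch.version.cuda)"  # if PyTorch is already installed
```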
We follow the Conda installation instructions to set up a build environment:
conda create -n CUDA12.4 -c conda-forge gxx==13.2 python==3.12 nvidia::cuda==12.4
If you see the error `InvalidSpec: The package "cuda==XX" is not available for the specified platform`, install `gxx` first and then `cuda`. The exact compiler requirements for a given CUDA installation are listed in `.../include/crt/host_config.h`. Note that `nvidia::cuda` includes packages from `nvidia::cuda-runtime`, which are additionally installed through `pip` as dependencies of `torch` (see the Pip Wheels section in the Installation Guide).
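Once the environment is activated, a quick check that PyTorch and the conda-installed toolkit line up is to query `torch.version.cuda` and `torch.utils.cpp_extension.CUDA_HOME`; the values in the comments below are only what one would hope to see:

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

print(torch.version.cuda)         # runtime the PyTorch wheel was built with, e.g. 12.4
print(CUDA_HOME)                  # toolkit used to compile extensions, e.g. $CONDA_PREFIX
print(torch.cuda.is_available())  # True if a GPU and a working driver are visible
```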
Once the build environment is ready, you can simply build and install `linrec` with:
pip install git+https://github.com/safelix/linrec.git
For evals and lightweight development, a quick editable install is available (clone, then `pip install -e linrec[eval]`). For C++/CUDA development, accessing the code directly without installation provides more fine-grained control over the compilation process (make sure to `pip uninstall linrec`). Let's get started and run the eval suite:
git clone git@github.com:safelix/linrec.git
pip install -r linrec/requirements.txt
python linrec/eval/test.py
python linrec/eval/bench.py
This automatically compiles the extension and stores all the build files in the `.build` directory. The C++/CUDA build process is controlled from `linrec/impl/cuda/build.py`. For example, calling `_C = linrec.impl.cuda.build.extension()` dynamically loads the extension into the Python runtime and triggers recompilation if the C++/CUDA source files have changed. Its command-line interface `python -m linrec.impl.cuda.build` allows you to force clean re-builds, show compilation outputs, and compile a lightweight standalone executable for C++/CUDA debugging or profiling.
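As a concrete sketch of that workflow (the attribute listing is plain Python introspection, not a documented `linrec` API):

```python
from linrec.impl.cuda import build

_C = build.extension()  # builds into .build/ on first use, recompiles only if sources changed
print([name for name in dir(_C) if not name.startswith('_')])  # ops exposed by extension.cpp
```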
Tips: Link the `.build/compile_commands.json` to your IDE for C++/CUDA linting and code integrations. Set the environment variable `MAX_JOBS` to enable parallel compilation. Demangle function names in the compilation outputs with `[build command] | cu++filt`.
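For instance, the tips could be combined along these lines, assuming the symlink is created in the repository root and that your IDE picks up `compile_commands.json` from there:

```bash
ln -s .build/compile_commands.json .   # expose compile commands to clangd/your IDE
MAX_JOBS=$(nproc) python -m linrec.impl.cuda.build 2>&1 | cu++filt
```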
To use Nsight Compute, install it on your local machine and use its Remote Launch feature to start a profiling activity via SSH. You can also install it on a server using `conda install nvidia::nsight-compute`, but this requires a workaround. For this, obtain the paths with `which ncu` and `which ncu-ui`, open the two bash scripts, and insert at line 47:
# WORKAROUND
# If installed with cuda, nsight-compute tools will be under nsight-compute/<version> folder. e.g nsight-compute/2019.4.0
for nsight_compute_tool_dir_path in "$CUDA_TOOLKIT_BIN_DIR"/../nsight-compute/*; do
    if [ ! -e "$nsight_compute_tool_dir_path" ]; then
        # Glob didn't match anything. Let's skip this single iteration.
        continue
    fi
    setLatestNsightComputeToolDir "$(basename "$nsight_compute_tool_dir_path")" "$nsight_compute_tool_dir_path"
done
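With the wrapper scripts patched, a profiling run can be launched directly on the server; the report name and the `--set full` metric set below are arbitrary example choices:

```bash
ncu --set full -o linrec_profile python linrec/eval/debug.py   # writes linrec_profile.ncu-rep
ncu-ui linrec_profile.ncu-rep                                  # inspect the report in the GUI
```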
- Associative Scan Interfaces in CUB: Device-Level Scan, Block-Level Scan, Warp-Level Scan
- Warp-level associative scan implementation in Triton: github.com/triton-lang/triton
- Device- and warp-level associative scan operations in PyTorch: github.com/pytorch/pytorch
- Warp-level associative scan implementation by Volodymyr Kyrylov: github.com/proger/accelerated-scan
- Device-level associative scan implementation by Alexandre TL: github.com/alxndrTL/mamba.py
- Associative scan implementations by John Ryan: github.com/johnryan465/pscan