XLA backend for Lc0 #1949
This is WIP (but close to being functional). The backend requires a PjRt C API plugin. To compile the XLA PjRt C API plugin, do the following:
The elixir-nx repository has a pre-compiled XLA repo, and it's possible that the
Can't test but it looks good and is well separated. Just one build system suggestion.
(cherry picked from commit 04f73fc)
This PR contains a few fairly independent modules which can be reviewed separately; here is a short description of each.
pjrt.cc, pjrt.h
(a C++ wrapper around the PJRT C API)
PjRt is an API that takes a graph representation in the HLO format, then compiles and executes it. In theory, different backends are possible, but in practice there's only XLA (ok, there's also IREE).
HLO is a format for defining a computation graph. There are actually many slightly different dialects, most notably "StableHLO" and "XLA HLO", and they have different serialization formats, most notably text, `HloModuleProto` (a protobuf), and MLIR bytecode (a binary format used in the LLVM/MLIR ecosystem).
For now the code uses "XLA HLO" + `HloModuleProto`, which was simply the easiest to do.
To make it possible to use PjRt without pulling in a gazillion header dependencies, there is the PJRT C API. It works like this: a `.so` file, called a PjRt plugin, exports a function `GetPjrtApi()` (or sometimes several, like `GetGPUPjrtApi()`) which returns a pointer to the `PJRT_Api` structure defined in `pjrt_c_api.h`. By calling the functions exposed through this structure, you can do stuff.

See the comment below for instructions on how to build a `pjrt_c_api_gpu_plugin.so` XLA PJRT plugin for GPU. For TPU it's supposed to work without any changes: just download https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/libtpu/2023-09-12/libtpu.so (or a newer file from here), and it should just work. No one has ever tested that, though.
This API is not extremely convenient to use directly, so `pjrt.cc` and `pjrt.h` are simple wrappers around the functions we use for now.

hlo_builder.cc, hlo_builder.h
(a helper class to build `HloModuleProto`)

HLO is an input format for XLA (which we use through the PjRt API layer).
`hlo_builder` builds an HLO module in the `HloModuleProto` format.

An HLO module consists of HLO computations (in other programming languages, they'd be called "functions"), one of which is the "entry computation", usually named "main". In the current version of the PR, which only supports `11248.pb.gz`, the module consists of just one computation (the entry one), but there are scenarios where more computations are necessary (e.g. the `reduction` op gets a reduction-step computation as an attribute, or `condition` and `loop` get computations to execute on the `true` or `false` condition, or as a loop body; it's also possible to `call` a computation just like you normally do with a function, but in automatically generated code there's little reason for that).

An HLO computation is a DAG of HLO instructions, in static single assignment form. Every instruction has zero or more operands (HLO instruction dependencies), some "hardcoded" attributes, and exactly one output. The shapes of operands and outputs must be defined. The last instruction of the computation is called the "root instruction", and it is the output. If we want to return multiple tensors from a computation, the root instruction is usually a `tuple`.

Here is an example of an HLO computation in text form, just to get a feeling for it (batch size 56):
onnx2hlo.h, onnx2hlo.cc
(Converter from ONNX graph to HLO module)

It goes node by node through the ONNX graph, and uses `hlo_builder` to emit the HLO graph.

xla_runner.h, xla_runner.cc
This is a module that compiles HLO for different batch sizes, keeps the compiled executables loaded, owns shared input buffers (through which large constants are passed), rounds up the sizes of input tensors to match the minibatch size, and decides which executable to run (based on the batch size).
It does most of the stuff through `pjrt.h`, and it doesn't depend on converters etc. (it has no idea what ONNX is; it operates on HLO, although HLO is also just passed through to PjRt and treated as a black box).

network_xla.cc
A Lc0 backend code. It converts the network to ONNX first, then calls `onnx2hlo` to convert ONNX to HLO, then calls `xla_runner` to compile and execute the net.

print_hlo.cc, print_hlo.h
Pretty-prints the HLO module (possibly still even compatible with the "official" format, so the output can be passed to XLA tools). Currently it is not plugged in anywhere (though I insert it into various places in the code when debugging the conversion), but the plan is to use it in `leela2onnx` for (optional) output as text HLO, and maybe for dumping the optimized HLO from the backend.