- Installation
- Examples and Documentation
- Going through basics: by example
- Layers that can't be expressed right now
- Note about performance / tuning
- Communication
A blogpost on Tensor Comprehensions can be read here.
We provide integration of Tensor Comprehensions (TC) with PyTorch for both training and inference purposes. Using TC, you can express an operator using Einstein notation and get the fast CUDA code for that layer with a few lines of code. By providing TC integration with PyTorch, we hope to make it further easy to write new operations with TC.
Here is what the PyTorch-TC package provides:
- inputs and outputs to functions are are
torch.*Tensor
s - Integration with PyTorch
autograd
: if you specify forward and backward functions, you get an autograd function that takesVariable
as input and returnsVariable
as output. Here's an example. - autotuner results can be cached to a file (for reuse)
To make it easy to use TC, we provide conda packages for it. Follow the instructions below on how to install the conda package. Building from source is not easy, because of large dependencies like llvm, so using the conda package is ideal.
You will need anaconda to install conda packages of TC. If you don't have it, follow the next step, otherwise verify conda is in your $PATH and proceed to Step 2.
Install anaconda3 by following the instructions below:
cd $HOME && wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh -O anaconda3.sh
chmod +x anaconda3.sh && ./anaconda3.sh -b -p $HOME/anaconda3 && rm anaconda3.sh
Now add anaconda3 to your PATH so that you can use it. For that run the following command:
export PATH=$HOME/anaconda3/bin:$PATH
Now, verify your conda installation and check the version:
which conda
This command should print the path of your conda bin. If it doesn't, make sure conda is in your $PATH.
Now, go ahead and install Tensor Comprehensions by running following commands.
conda install -y -c pytorch -c https://conda.anaconda.org/t/oJuz1IosRLQ5/prigoyal tensor_comprehensions
In order to explore Tensor Comprehensions (TC), there are few helpful resources to get started:
- We provide examples of TC definitions covering wide range of Deep Learning layers.
The list of examples we provide are: avgpool, maxpool, matmul, matmul - give output buffers and batch-matmul, convolution, strided-convolution, batchnorm, copy, cosine similarity, Fully-connected, fused FC + ReLU, group-convolutions, strided group-convolutions, indexing, Embedding (lookup table), small-mobilenet, softmax, tensordot, transpose
-
Tensor Comprehensions are based on Einstein notation which is very well explained here. This notation is also widely used in Numpy. If you don't know the notation, we recommend doing a 5 minute read of the above link.
-
TC Documentation is a very helpful resource to understand how Tensor Comprehensions are expressed. The sections on introduction, range inference, semantics are particularly helpful to get insights into writing Tensor Comprehensions.
-
Autotuner: TC provides an evolutionary search based algorithm to automatically tune the kernel. You can read briefly about autotuner here and look at various examples of autotuning.
-
To construct a TC autograd function, here is one self-descriptive example.
Let's see few examples of what features Tensor Comprehensions has and what you can do as a starter. I'll pick a simple layer matmul
for the purpose of examples and start with describing reduction operator we will use.
Reduction Operator:
+=!
operator with !
means that the output tensor will be initialized to reduction identity i.e. 0
for +
.
Now, let's cover the basics:
- New Tensor Comprehension:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
output(i, j) +=! A(i, kk) * B(kk, j)
}
"""
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)
- New Tensor Comprehension, Autotune it, run it:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
output(i, j) +=! A(i, kk) * B(kk, j)
}
"""
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(100, 400).cuda(), torch.randn(400, 500).cuda()
best_options = matmul.autotune(mat1, mat2, **tc.autotuner_default_options)
out = matmul(mat1, mat2, options=best_options)
- New Tensor Comprehension, Autotune it, save cache:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
output(i, j) +=! A(i, kk) * B(kk, j)
}
"""
matmul = tc.define(lang, name="matmul")
matmul.autotune((3, 4), (4, 5), cache="matmul_345.tc", **tc.small_size_autotuner_options)
matmul.autotune((100, 400), (400, 500), cache="matmul_100400500.tc", **tc.autotuner_default_options)
The big advantage of specifying cache
is that the next time you run the program, the cached autotuned values are used.
Beware that if you move to a significantly different type of GPU, then you might want to tune again for maximum performance.
- Train layer with TC, Autotune it and run it:
lang = """
def KRU(float(D2, N2) W2, float(M, N0, N1, N2) X) -> (XW2) {
XW2(m, n0, n1, d2) +=! X(m, n0, n1, n2_red) * W2(d2, n2_red)
}
def KRU_grad(float(D2, N2) W2, float(M, N0, N1, N2) X, float(M, N0, N1, D2) XW2_grad) -> (W2_grad, X_grad)
{
W2_grad(d2, n2) +=! XW2_grad(m_red, n0_red, n1_red, d2) * X(m_red, n0_red, n1_red, n2)
X_grad(m, n0, n1, n2) +=! XW2_grad(m, n0, n1, d2_red) * W2(d2_red, n2)
}
"""
KRU = tc.define(lang, training=True, name="KRU", backward="KRU_grad")
X = Variable(torch.randn(256, 16, 16, 16).cuda(), requires_grad=True)
W2 = Parameter(torch.randn(32, 16)).cuda()
options = KRU.autotune(W2, X, **tc.autotuner_default_options)
out = KRU(W2, X, options=options)
out[0].sum().backward()
- Dump out generated CUDA code (for fun?):
import tensor_comprehensions as tc
tc.GlobalDebugInit(["tc", "--dump_cuda=true"])
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
output(i, j) +=! A(i, kk) * B(kk, j)
}
"""
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)
- Inject your own CUDA code and run it (because you might have faster code):
lang = """
def add(float(N) A, float(N) B) -> (output) {
output(i) = A(i) + B(i) + 1
}
"""
cuda_code = """
extern "C"{
__global__ void my_add(float* __restrict__ output, const float* __restrict__ A, const float* __restrict B) {
int t = threadIdx.x;
output[t] = A[t] + B[t];
}
}
"""
add = tc.define(lang, name="add", inject_kernel="my_add", cuda_code=cuda_code)
a, b = torch.randn(100).cuda(), torch.randn(100).cuda()
out = add(a, b, grid=[1, 1, 1], block=[100, 1, 1]) # change grid/block for adjusting kernel performance
- Reshaping Tensors inside the language
- Dropout : RNGs are not suppported inside TC language, because TC doesn't do internal allocations
- Strided "tensors" : input Tensors have to be contiguous. If they are not contiguous, they are made contiguous before passing to the TC backend.
- RNNs : TC language doesn't have loops yet. You can write them unrolled if you want :)
We are actively working on these and many more features. If there is some feature that can be very helpful to you, please send your request our way.
Tensor Comprehensions have an autotuner that uses evolutionary search to find faster kernels. Here is what you should know about the polyhederal exploration / evolutionary search:
- The autotuner needs static input sizes (for now). You can not tune a kernel, for say: batchsize between
16 and 32
- you can autotune
avgpool2x2
for input shape(16, 32, 24, 23)
:avgpool.autotune((16, 32, 24, 23), **tc.small_size_autotuner_options, cache="16x32x24x23.tc")
- if you want to target multiple input shapes, run multiple autotune calls:
avgpool.autotune((16, 32, 24, 23), **tc.small_size_autotuner_options, cache="mysize1.tc") avgpool.autotune((32, 1, 128, 128), **tc.small_size_autotuner_options, cache="mysize2.tc")
- The more static we make the sizes, the better and faster the search procedure. Hence, we made this trade-off of only supporting static sizes in the initial release.
- you can autotune
By default, tc.autotuner_default_options
is:
options = {
"threads": 32, "generations": 2, "pop_size": 10, "number_elites": 1
}
Good for quick autotuning (2 generations finish quickly)
good default that runs for a bit longer (maybe in exchange for better performance)
options = {
"threads": 32, "generations": 5, "pop_size": 10, "number_elites": 1
}
good default that runs for a LOT longer
options = {
"threads": 32, "generations": 25, "pop_size": 100, "number_elites": 10
}
brief explanation
threads
- set this to number of CPU cores available.generations
- 5 to 10 generations is a good number.pop_size
- 10 is usually reasonable. You can try 10 to 20.number_elites
- number of candidates preserved intact between generations.1
is usually sufficient.min_launch_total_threads
- If you have really input small sizes, set this to1
.gpus
: Number of gpus to use for autotuning. Default value is "0". Set this to "0,1" if you wish to use two gpus.
Look at docs for more details
- Email: [email protected]
- GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
- Slack: For discussion around framework integration, build support, collaboration, etc. join our slack channel https://tensorcomprehensions.herokuapp.com/.