This is a proposal for a new code generation facility for writing kernels in PyTorch, where we automatically generate the easy-to-get-wrong boilerplate for the functional (add), inplace (add_) and out (add_out) variants of functions, as well as common code such as device guards. The net result is that you only need to write a shape checking function and an out-kernel when writing a function.
- Be an opt-in mechanism: it is always OK to continue to write code as it is written today. This ensures that third-party backend extenders can also make things work without making use of codegen.
- Reduce shape checking boilerplate: make it easier to share common shape checking code between CPU and CUDA implementations, as well as with out-of-tree backend implementations.
- Reduce functional/inplace/out boilerplate: avoid having to write foo, foo_ and foo_out variants for every function; make it harder to forget to add a foo_out variant when it is appropriate.
- Be able to run forward shape computation without a kernel: so compilers and static runtimes can easily compute the sizes of all elements in the framework without having to actually run the computation.
- Provide an entry point for static runtime: provide a public API for accessing operators directly, bypassing output allocation, shape checking and device guards.
- Bypass dispatcher overhead: in high performance cases, e.g., core framework operators, reduce the number of redispatches to improve performance.
- (Under question) Unify with version counter bumps and view metadata tracking: these are currently done in autograd generated code but must be performed unconditionally, even when autograd is disabled. This logic could be incorporated into the logic here.
- Work with mobile selective build: mobile selective build is implemented in codegen, and the implementation strategy must be compatible with the design constraints on mobile.
- Fix some operator UX paper cuts: have the C++ compiler tell you when you've specified the signature of a native function incorrectly, rather than getting a linker error.
- Handle TensorIterator operators without major refactoring: TensorIterator operators cover a third of all operators in PyTorch, and structured kernels must be able to account for them without requiring major changes to how TensorIterator is implemented.
- (Non-goal) Be zero runtime cost all the time: we are willing to give up some runtime efficiency for cleaner code. We are willing to give up efficiency by default for out-of-tree implementers, unless they opt in to higher performance. There is always an escape hatch to be high performance if absolutely necessary.
- (Non-goal) No codegen: as long as it is possible to implement things out of tree (at the cost of human understandable boilerplate), codegen is a reasonable strategy for implementing the features we need.
- (Non-goal) Major format change to native_functions.yaml: we could do this, but we are choosing not to, in order to reduce the degrees of freedom in the changes we make to native_functions.yaml.
- Core
  - Define a new API for shape checking functions. Shape checking functions are called from generated code that implements the functional/inplace/out variants of functions.
  - Define a new API for kernels. The static kernel API does not do shape checking; the output is always preallocated at the correct size. This API is public and is a suitable entry point for static runtime. Static kernels are called from generated code like the above.
  - Code generate meta functions from shape checking functions, to provide a public API for running shape computations without any kernel.
  - Generated code is augmented with extra boilerplate, like device guards, that all kernels typically need.
- Extensions
  - Add a new dispatch key Common, which contains shape checking and device guards. External backend implementations of PyTorch operators which do not explicitly opt out of this dispatch key will get this logic applied to them. This key is skipped for core operators, as this checking logic is fused directly into the backend kernel. External backends can also opt to fuse this logic in.
  - Add a new dispatch key Meta, whose kernels run shape/dtype computation only, without doing any real compute (structured kernels implicitly register such a kernel; see below).
Let's suppose you want to write a new operator from scratch. The first thing you will do is add a native_functions.yaml declaration for it. Let's take upsample_nearest1d as the example for this post.
In the classic system, you will write an entry like this, describing the functional version of this operator:
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor
  dispatch:
    CPU: upsample_nearest1d_cpu
    CUDA: upsample_nearest1d_cuda
Because you are a conscientious implementor, you also want to provide an out= variant of this function. This version gets a separate entry:
- func: upsample_nearest1d.out(Tensor self, int[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!)
  dispatch:
    CPU: upsample_nearest1d_out_cpu
    CUDA: upsample_nearest1d_out_cuda
Ordinarily, these two declarations would cause function signatures for upsample_nearest1d_cpu, etc., to be generated into NativeFunctions.h, and then you would go ahead and implement them.
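To make the contrast concrete, the classic path generates roughly the following hand-implemented signatures into NativeFunctions.h (the exact signatures below are illustrative, not copied from the generated header); each of the four functions must repeat its own shape checking and output allocation/resizing:
namespace at { namespace native {
Tensor upsample_nearest1d_cpu(const Tensor& self, IntArrayRef output_size, optional<double> scales);
Tensor upsample_nearest1d_cuda(const Tensor& self, IntArrayRef output_size, optional<double> scales);
Tensor& upsample_nearest1d_out_cpu(Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales);
Tensor& upsample_nearest1d_out_cuda(Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales);
}} // namespace at::native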
We propose a new structured format for writing kernels. In this proposal variant, we’ll do this by marking the out version of this operator as structured and deleting dispatch entries from the functional version, instead delegating its implementation to the out version:
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor
  structured_delegate: upsample_nearest1d.out  # [NEW], replacing dispatch
- func: upsample_nearest1d.out(Tensor self, int[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!)
  structured: True  # [NEW]
  dispatch:
    CPU: upsample_nearest1d_structured_cpu
    CUDA: upsample_nearest1d_structured_cuda
Structured definitions require a different set of functions to be written to implement the operator. In this particular case, the functions you have to implement are:
namespace meta {
/* macro expands to: void structured_upsample_nearest1d::meta( */
TORCH_META_FUNC(upsample_nearest1d) (
const Tensor& self, IntArrayRef output_size, optional<double> scales
) {
... compute sizes and options, check shapes ...
set_output(sizes, options);
}
}
namespace native {
// Precondition: out is an allocated and appropriately sized tensor;
// all shape checks have passed, device guards have been set,
// version counter bumps are all handled, etc...
/* macro expands to: void structured_upsample_nearest1d_structured_cpu::impl( */
TORCH_IMPL_FUNC(upsample_nearest1d_structured_cpu) (
const Tensor& self, IntArrayRef output_size, optional<double> scales, const Tensor& out
);
/* macro expands to: void structured_upsample_nearest1d_structured_cuda::impl( */
TORCH_IMPL_FUNC(upsample_nearest1d_structured_cuda) (
const Tensor& self, IntArrayRef output_size, optional<double> scales, const Tensor& out
);
}
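For reference, the elided meta body above might look something like the following; the specific checks are a hedged sketch, not the exact checks PyTorch performs for this operator:
TORCH_META_FUNC(upsample_nearest1d) (
  const Tensor& self, IntArrayRef output_size, optional<double> scales
) {
  // illustrative checks: a 3-D (N, C, W) input and a single positive output width
  TORCH_CHECK(output_size.size() == 1, "expected output_size to have 1 element, got ", output_size.size());
  TORCH_CHECK(self.dim() == 3, "expected a 3-D input tensor, got ", self.dim(), " dims");
  TORCH_CHECK(output_size[0] > 0, "output width must be positive");
  // the meta function's only obligation: declare the output's size and options
  set_output({self.size(0), self.size(1), output_size[0]}, self.options());
}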
The code generator then generates the boilerplate code to put these functions together. The boilerplate is somewhat involved, and we will explain it below.
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ //
// Common code
// Abridged from aten/src/ATen/TensorMeta.h
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ //
#define TORCH_META_FUNC(name) void structured_##name::meta
#define TORCH_META_FUNC2(name, overload) void structured_##name##_##overload::meta
#define TORCH_IMPL_FUNC(name) void structured_##name::impl
// Parent class for all code-generated meta:: classes
struct MetaBase {
// TODO: Maybe some of these should be optional, but in many cases they
// can be implicitly made optional by passing an empty list
virtual void set_output(int64_t output_idx, IntArrayRef size, IntArrayRef strides, TensorOptions options, DimnameList names) = 0;
// Returns a reference to an undefined tensor if no output is
// available
virtual const Tensor& maybe_get_output(int64_t output_idx) = 0;
// Convenience helpers
void set_output(IntArrayRef size, TensorOptions options) { set_output(0, size, {}, options, {}); }
const Tensor& maybe_get_output() { return maybe_get_output(0); }
};
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ //
// Code generated per operator
// Generated to build/aten/src/ATen/MetaFunctions.h
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ //
namespace meta {
struct structured_upsample_nearest1d : public MetaBase {
void meta(const Tensor& self, IntArrayRef output_size, optional<double> scales); // user defined
};
} // namespace meta
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ //
// Code generated per dispatch table entry for operator
// Generated to, e.g., build/aten/src/ATen/RegisterCUDA.cpp
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ //
namespace native {
struct structured_upsample_nearest1d_structured_cuda : public meta::structured_upsample_nearest1d {
void impl(const Tensor& self, IntArrayRef output_size, optional<double> scales, const Tensor& out); // user defined
};
// functional implementation
// NB: set_output could be devirtualized with CRTP, but for now we don't do this
struct structured_upsample_nearest1d_structured_cuda_functional final : public structured_upsample_nearest1d_structured_cuda {
void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) override {
outputs_[output_idx] = at::native::empty_strided(sizes, strides, options);
if (!names.empty()) namedinference::propagate_names(outputs_[output_idx], names);
}
const Tensor& maybe_get_output(int64_t output_idx) override {
return outputs_[output_idx];
}
std::array<Tensor, 1> outputs_;
};
Tensor structured_upsample_nearest1d_cuda(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
CUDADeviceGuard g(self.device());
structured_upsample_nearest1d_structured_cuda_functional op;
op.meta(self, output_size, scales);
op.impl(self, output_size, scales, op.outputs_[0]);
return std::move(op.outputs_[0]);
}
// out-place implementation
struct structured_upsample_nearest1d_structured_cuda_out final : public structured_upsample_nearest1d_structured_cuda {
structured_upsample_nearest1d_structured_cuda_out(Tensor& out) : outputs_{std::ref(out)} {}
void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) override {
at::native::resize_output(outputs_[output_idx], sizes);
if (!strides.empty()) {
TORCH_INTERNAL_ASSERT(!options.memory_format_opt().has_value());
outputs_[output_idx].get().as_strided_(sizes, strides);
} else if (options.memory_format_opt().has_value()) {
outputs_[output_idx].get().unsafeGetTensorImpl()->empty_tensor_restride(*options.memory_format_opt());
}
if (!names.empty()) namedinference::propagate_names(outputs_[output_idx], names);
}
const Tensor& maybe_get_output(int64_t output_idx) override { return outputs_[output_idx]; }
std::array<std::reference_wrapper<Tensor>, 1> outputs_;
};
Tensor& structured_upsample_nearest1d_out_cuda(Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales) {
// In event of multiple tensor arguments, code generation should
// be responsible for making sure all devices are consistent
CUDADeviceGuard g(self.device());
structured_upsample_nearest1d_structured_cuda_out op(out);
op.meta(self, output_size, scales);
op.impl(self, output_size, scales, out);
return out;
}
// CPU follows similarly
} // namespace native
The key idea is that we use object-oriented programming to factor the boilerplate into several parts (the user-provided meta and impl definitions, as well as the framework-provided set_output helper), which we then specialize for each variation (functional/out/inplace) of the kernel that we need to generate. Here is the step-by-step:
- At the top of the inheritance hierarchy is MetaBase, which defines the set_output virtual method that is varied depending on which variant of the operator we are defining. In a functional kernel, this method is overridden to actually allocate the output tensor; in an out kernel, the override only resizes the pre-existing output.
- meta::structured_upsample_nearest1d inherits from MetaBase; there is one such class per structured function group. The user defines the meta method on this class. This method does the general shape checking work and eventually makes a call to the virtual set_output, which declares what the output shape should be (how the output is actually produced is still unspecified at this point!).
- For each device type to be implemented, we extend the meta class into a class with a user-defined impl method that says how to actually do the kernel computation for that device. This method is assumed to take an out tensor that has been appropriately sized and placed on the correct device. Because impl is a method, users of the scaffolding get nice error messages when they write a method implementation whose C++ type doesn't match the generated header (this is in contrast to the previous approach of generating function prototypes, which results in a linker error if you get the signature wrong).
- Finally, for each variant of the function we need (functional, out, inplace), we extend one last time to provide the correct override of set_output (a sketch of the inplace case, which the example above does not include, follows this list).
- In the final kernel function we register for the operator, we construct one of these classes, call its meta and impl methods, and then return the output tensor in an appropriate way. These functions also take care of other boilerplate operations, such as setting up device guards, version counter bumps, etc.
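The example above only generates functional and out wrappers, since upsample_nearest1d has no inplace variant. As a hedged sketch of the inplace case, consider a hypothetical operator foo_ whose generated structured_foo_cpu class has meta(const Tensor&) and impl(const Tensor&, const Tensor& out) methods (all names here are made up for illustration); self doubles as the preallocated output, so set_output only validates, never allocates or resizes:
struct structured_foo_cpu_inplace final : public structured_foo_cpu {
  structured_foo_cpu_inplace(Tensor& self) : outputs_{std::ref(self)} {}
  void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) override {
    // inplace: self must already have the shape the meta function declared
    TORCH_CHECK(outputs_[output_idx].get().sizes().equals(sizes), "foo_: self does not have the right size for the result");
    if (!names.empty()) namedinference::propagate_names(outputs_[output_idx], names);
  }
  const Tensor& maybe_get_output(int64_t output_idx) override { return outputs_[output_idx]; }
  std::array<std::reference_wrapper<Tensor>, 1> outputs_;
};
Tensor& foo_(Tensor& self) {
  structured_foo_cpu_inplace op(self);
  op.meta(self);
  op.impl(self, /*out=*/self);
  return self;
}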
The boilerplate here is written very carefully for performance:
- Because we generate code separately for CPU/CUDA, we can bypass the dispatcher entirely. In the current implementation we don't do this, but the optimization opportunity is available.
- The use of set_output as a method means we can avoid allocating an owning vector to store sizes; instead, initializer lists can be used whenever the size is statically known.
- set_output is virtual. This is a tradeoff between code size and devirtualization: by making set_output virtual, we can reuse the generated code for a single meta function across all variations of a function. It is not too difficult to devirtualize with CRTP (see the sketch after this list), but we don't expect there to be many benefits to inlining set_output, as at::native::empty won't inline.
- The outputs_ field is a statically-sized array, rather than a vector, because we generate the code per operator and thus know at compile time how many outputs there are.
- set_output doesn't return a reference to the Tensor that was freshly allocated. This is to prevent users from falling into an antipattern where they first allocate a tensor and then restride it in a subsequent call; the intention is that (eventually) these functions can do all of the allocation and striding in one go, without redundancy. (This also applies to setting names!) This means that we may provide multiple set_output overloads for handling various situations.
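For concreteness, here is a hedged sketch of what CRTP devirtualization could look like if it ever became worthwhile; this is not part of the proposal, and the class names here are made up:
namespace meta {
// The user-written meta body is templated over the concrete variant, so the
// set_output call resolves statically rather than through a vtable.
template <typename Variant>
struct structured_upsample_nearest1d_crtp {
  void meta(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
    // ... same shape checks as the virtual version ...
    static_cast<Variant*>(this)->set_output(0, {self.size(0), self.size(1), output_size[0]}, {}, self.options(), {});
  }
};
} // namespace meta
namespace native {
struct crtp_functional final : public meta::structured_upsample_nearest1d_crtp<crtp_functional> {
  // non-virtual: the call in meta() above can inline this directly
  void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) {
    outputs_[output_idx] = at::native::empty_strided(sizes, strides, options);
  }
  std::array<Tensor, 1> outputs_;
};
} // namespace native
The code-size cost noted in the bullet above is visible here: the meta body is stamped out once per variant instead of being shared.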
In absolute terms, the boilerplate code saved here is not all that large; only several lines. However, when multiplied over the number of kernels in PyTorch, and given the subtlety of remembering to handle all of these issues, we think the use of code generation to automatically generate this boilerplate is worth it.
The functionality of the code generated here is not made available to other backends. This is problematic, so we also generate a separate dispatch key handler which automatically handles shape checking and other concerns for backends that are not concerned with performance:
// Name TBD; for this example we will call it DispatchKey::Common, as the
// functionality here is common to all backends. This is an alias key
// that resolves CommonXLA/CommonMSNPU/... in the same way as Autograd.
struct structured_upsample_nearest1d_common final : public meta::structured_upsample_nearest1d {
void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) override {
outputs_[output_idx] = at::empty_strided(sizes, strides, options); // go via the dispatcher
if (!names.empty()) namedinference::propagate_names(outputs_[output_idx], names);
}
const Tensor& maybe_get_output(int64_t output_idx) override { return outputs_[output_idx]; }
std::array<Tensor, 1> outputs_;
};
Tensor upsample_nearest1d_common(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
// TODO: RecordFunction could be added here, if desired
DeviceGuard g(self.device());
structured_upsample_nearest1d_common op;
op.meta(self, output_size, scales);
ExcludeDispatchKeyGuard g2(DispatchKey::Common);
// Notice that a backend's own registration for upsample_nearest1d is ignored
// here (we go straight to the out variant). It may be a good idea to skip this
// implementation, or use a slightly different variant, if a backend explicitly
// registered upsample_nearest1d
at::upsample_nearest1d_out(op.outputs_[0], self, output_size, scales);
return std::move(op.outputs_[0]);
}
TORCH_LIBRARY_IMPL(aten, Common, m) {
m.impl("upsample_nearest1d", upsample_nearest1d_common);
}
// out variant proceeds similarly; with a dispatched resize_ call in set_output
A backend that would like to fuse this boilerplate into their kernel for performance reasons can simply override the Common entry with fallthrough:
TORCH_LIBRARY_IMPL(aten, CommonXLA, m) {
m.impl("upsample_nearest1d", CppFunction::makeFallthrough);
}
Finally, structured kernels also implicitly register an implementation for the Meta key, which is the API for dry running shape/dtype calculations without any kernels involved:
struct structured_upsample_nearest1d_meta final : public meta::structured_upsample_nearest1d {
void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) override {
outputs_[output_idx] = at::native::empty_strided_meta(sizes, strides, options);
if (!names.empty()) namedinference::propagate_names(outputs_[output_idx], names);
}
const Tensor& maybe_get_output(int64_t output_idx) override { return outputs_[output_idx]; }
std::array<Tensor, 1> outputs_;
};
Tensor upsample_nearest1d_meta(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
structured_upsample_nearest1d_meta op;
op.meta(self, output_size, scales);
return std::move(op.outputs_[0]);
}
TORCH_LIBRARY_IMPL(aten, Meta, m) {
m.impl("upsample_nearest1d", upsample_nearest1d_meta);
}
// out variant proceeds similarly; with a meta::resize_ call in set_output
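As a hedged usage sketch (assuming meta tensor factory functions are available), this is the kind of dry run the Meta registration enables; no real kernel ever executes:
// inputs allocated on the meta device carry sizes/dtype/strides but no storage
Tensor self = at::empty({2, 3, 8}, at::TensorOptions().device(at::kMeta));
// dispatches to upsample_nearest1d_meta above: shape checks and output metadata
// computation run, but no upsampling is performed
Tensor out = at::upsample_nearest1d(self, {16}, c10::nullopt);
TORCH_INTERNAL_ASSERT(out.sizes().equals({2, 3, 16}));  // out is itself a meta tensor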
Discussion:
- Why not just use the Common dispatch key for everything?
  - Performance reasons. Introducing the Common key would induce an extra redispatch, which at the time of writing would give up quite a bit of performance due to dispatch overhead, for no particularly good reason.
- I don't like codegen.
  - An earlier version of this proposal had the boilerplate generated using C++ templates rather than codegen. However, we think the formulation in this proposal is superior under the constraint that mobile selective build must keep working, as we cannot directly write registrations in source files, and so we must intermediate between the structured and non-structured variants.
TensorIterator accounts for a third of the operators in PyTorch, and is characterized by a class which computes a lot of metadata and carries out allocation. Previous iterations of the structured kernel design struggled to account for this style of kernel writing in a clean way without requiring a major rewrite of TensorIterator.
This class-based design permits TensorIterator to work by making TensorIterator itself a subclass of MetaBase. The modified class hierarchy now looks like this:
struct TensorIteratorBase : public MetaBase;
struct TensorIterator : public TensorIteratorBase;
namespace meta {
struct add : public TensorIteratorBase;
}
TensorIterator itself remains an implementation of the old-style API for kernels that are not yet ported to structured kernels. TensorIteratorBase contains the bulk of the implementation, but all places that previously allocated tensors now call set_output:
// allocate memory for output, memory format depends on setup_type
switch (setup_type) {
case FastSetupType::CONTIGUOUS:
{
for (int i = 0; i < num_outputs_; i++){
auto& op = operands_[i];
/* BEFORE:
if (!op.tensor.defined()) {
TORCH_INTERNAL_ASSERT(op.is_type_defined(), "no type for operand", i);
op.tensor = at::empty(shape_, op.options(), MemoryFormat::Contiguous);
op.current_dtype = op.target_dtype;
} else if (op.will_resize) {
at::native::resize_output(op.tensor, shape_);
}
*/
// AFTER:
set_output(i, shape_, {}, op.options(), names_);
}
break;
}
TensorIterator defines an override of set_output that recovers the old behavior, while structured kernel subclasses override set_output in the same way as before.
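A hedged sketch of what that compatibility override might look like, reconstructed from the BEFORE snippet above (field and helper names are illustrative):
void TensorIterator::set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) {
  auto& op = operands_[output_idx];
  if (!op.tensor.defined()) {
    // old behavior: TensorIterator allocates the output itself
    op.tensor = at::empty(sizes, options, MemoryFormat::Contiguous);
    op.current_dtype = op.target_dtype;
  } else if (op.will_resize) {
    at::native::resize_output(op.tensor, sizes);
  }
  if (!names.empty()) namedinference::propagate_names(op.tensor, names);
}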
The code generation only requires a very modest extension: a structured_inherits field that lets you replace MetaBase with your own custom base implementation class:
- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase  # [NEW]
  dispatch:
    CPU: add_out
    CUDA: add_out
Now you can simply construct it appropriately in your function definitions:
TORCH_META_FUNC2(add, Tensor) (
const Tensor& self, const Tensor& other, Scalar alpha
) {
build_binary_op(maybe_get_output(), self, other);
native::alpha_check(dtype(), alpha);
}
TORCH_IMPL_FUNC(add_out) (
Tensor& result, const Tensor& self, const Tensor& other, Scalar alpha
) {
add_stub(device_type(), *this, alpha);
TORCH_INTERNAL_ASSERT(result.scalar_type() == output().dtype());
}
In the case of TensorIterator, all of the arguments are stored in the struct at construction time, so the impl func doesn't need to make use of any of the tensors. Additionally, when set_output is invoked in the meta function, after doing all appropriate allocations, it delegates to the underlying set_output in TensorIterator, letting it query with maybe_get_output() to register any outputs necessary.
Some operators do not have out variants, and thus do not make sense as structured kernels per se (i.e., there is no boilerplate reduction to be had between out and functional implementations). However, these operators still require meta implementations, and can still usefully share error checking code.
TODO WRITE MORE
Ports to structured kernels must be done individually by hand. Because the porting process is already labor intensive, we should do other improvements "while the patient is open". Here are some candidate improvements we should consider applying:
Historically, out and inplace variants of functions take a Tensor& rather than a const Tensor&. This convention has led to no end of confusion for kernel writers, who incorrectly surmise that, given a mutable reference Tensor& out, one may assign the output by writing out = ... some expression ... (this doesn't work).
The absence of the const modifier is currently relied upon by template metaprogramming machinery (to detect whether arguments are out tensors or not); however, because the implementations of structured kernels are a layer below the operator registration layer, the mutable reference can be eliminated from the TORCH_IMPL_FUNC API without requiring the rest of the system to be updated.
One implication of this change is that the out parameter cannot be
easily passed to existing public API that requires a mutable reference.
This can be easily remedied by updating the existing APIs to accept
const references and not only mutable references, or, if truly
necessary, using a const_cast
to get out of jail free.
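A hedged sketch of what this looks like inside a structured impl, reusing the upsample example from above; legacy_upsample_kernel_cpu is a hypothetical helper that still takes a mutable reference:
TORCH_IMPL_FUNC(upsample_nearest1d_structured_cpu) (
  const Tensor& self, IntArrayRef output_size, optional<double> scales, const Tensor& out
) {
  // out is preallocated and correctly sized; we only write into it. With a
  // const reference, accidentally writing `out = at::empty(...)` no longer
  // compiles, which removes the footgun described above.
  legacy_upsample_kernel_cpu(const_cast<Tensor&>(out), self, output_size, scales); // TODO: make the helper take const Tensor&
}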
Mutable reference removal has landed.
Once we have dispatched to a CPU kernel, we know that the tensor in
question is in fact a CPU tensor, and not (for example) a CUDA tensor.
However, this information is not retained inside of the body of the
kernel, and so if a user makes a method call or regular at::
namespace
function call, the dispatcher still must inspect the type tag to
rediscover, yes, indeed, we still have a CPU tensor.
One promising approach to solving this problem is to refine the type of a tensor from Tensor to CPUTensor, where a CPUTensor represents a tensor that is statically known to be a CPU tensor. Operations (functions and methods) on CPUTensor bypass dispatching and go directly to the CPU implementations in question. const CPUTensor& can be defined to implicitly convert into const Tensor&, which means existing APIs that don't know how to short circuit can continue to do pre-existing behavior.
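As a hedged sketch of the idea (nothing below exists today; all names are hypothetical):
// A statically-refined handle: operations route straight to the CPU entry
// points, and it implicitly converts back to const Tensor& for APIs that
// don't care about the refinement.
class CPUTensor {
 public:
  /*implicit*/ operator const Tensor&() const { return t_; }
  CPUTensor add(const CPUTensor& other, Scalar alpha = 1) const {
    // bypasses the dispatcher: goes directly to the generated CPU function
    return CPUTensor(at::cpu::add(t_, other.t_, alpha));
  }
  // ... the rest of the method surface area would be code generated ...
 private:
  explicit CPUTensor(Tensor t) : t_(std::move(t)) {}
  Tensor t_;
  friend CPUTensor unsafe_as_cpu(Tensor t); // hypothetical unchecked refinement
};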
The primary consequence of making this change now is that we must immediately create a CPUTensor class with enough methods to cover the usual surface area (even if those methods don't apply any performance optimization). With code generation this should not be too much code. This would also require the creation of a CPUTensorRef class to ensure that CPUTensors can be created from const Tensor& without incurring a reference count bump.
One question is whether or not the existence of CPUTensor means we should eliminate the at::cpu:: namespace (as they serve nearly equivalent purposes; if you have functions which support CPUTensor, simply (unsafely) cast your Tensor to a CPUTensor and then use the regular API). One possible argument for retaining the at::cpu:: namespace is that these functions are guaranteed to bypass dispatching, whereas other functions may implicitly convert the CPUTensor back to Tensor and make an ordinary dispatched call.
Type refinement HAS NOT landed.
Structured operators are currently strictly defined in terms of an out operation. However, there are some operators in PyTorch which do not have out variants, because they typically don't make sense. Some of the most notable operator types of this form:
- View operations
- Factory functions
- copy_
Since non-structured kernels simply operate at a lower level of abstraction, in principle, it is not a big deal if some operators never become structured; to make an analogy, sometimes you have to write assembly, and as long as it is not too frequent, there is not too much to be gained from trying to extend the functionality of your system to expunge these entirely.
However, there is one practical problem with doing this: we continue
to have separate code generation paths for structured and unstructured
kernels, and in some cases, there are improvements that could be
profitably applied to both structured and unstructured kernels (for
example, elimination of mutable Tensor&
references). For some
improvements, the correct answer to "I want my operator to have this
improvement" is "port it to structured kernels". However, in the cases
where this is not possible, there must be some alternate recourse.
Here is the list of planned improvements to structured kernels which should also be applied equivalently to unstructured kernels:
- Generation of at::cpu:: stubs for static runtime. Suggested resolution: implement for unstructured as well. This has LANDED.
- Removal of mutable references. Suggested resolution: don't bother fixing in the unstructured case (until someone decides to purge mutable references from the public API. Which, let's be honest, probably isn't going to happen).
There's still a lot of work to be done in structured kernels! Here are some of the things to be done.
Folks who have expressed interest in helping: @ailzhang, @bdhirsh, @hameerabbasi
Kernels vary wildly in how difficult they are to port, but at least a substantial chunk of operators should be possible to port to structured without too much difficulty.
Ailing has graciously attempted to port a kernel, and made some observations about things you might have to do when porting:
- Every kernel for the operator must be c10-full. If something is using the hacky wrapper, port it to stop using the hacky wrapper first (usually by reordering arguments).
- Don't accidentally remove the dispatch table for the out kernel; you still need that one!
- Change all Tensor& arguments to const Tensor& (as structured kernels do not take mutable references). If possible, change all the helper functions the kernel uses to also take const Tensor& (most commonly, you will need to change DispatchStub signatures). If you can't conveniently change an API because it would have large knock-on effects, use a const_cast<Tensor&> and mark it with a TODO (see the sketch after this list).
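A hedged sketch of the DispatchStub signature change mentioned in the last item, using a made-up foo kernel:
// BEFORE: the stub demanded a mutable reference, which structured impls can no longer provide
// using foo_fn = void (*)(Tensor& out, const Tensor& self);
// AFTER: take the (preallocated) output by const reference, matching the structured impl
using foo_fn = void (*)(const Tensor& out, const Tensor& self);
DECLARE_DISPATCH(foo_fn, foo_stub);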
There are some kernels which should be easier to port to structured. This post taxonomizes operators; we have working examples of non-reduction TensorIterator kernels and Fixed kernels, and those should work out without too many hiccups. For example, try finishing the rest of the upsample kernels.
Things that are known not to work:
- Operators that call a lot of other operators (even if they are not technically composite). In these situations, there may not be any place where shape computation is actually done in the old kernel, so you would have to reconstruct this logic from scratch. Many linear algebra kernels fall into this bucket.
- Reductions. These use the TensorIterator API in a way that we have trouble supporting today (they allocate the output outside of TensorIterator and then "force" TensorIterator not to resize it during shape computation).
- Kernels that directly overwrite a Tensor& argument (reductions are known to do this!). Then again, these kernels are just WRONG and should be fixed.
What operators to prioritize? Consider picking some important model (James Reed and Yinghai Lu may have some suggestions) and getting to 100% coverage there.
Right now, torch.jit.trace requires you to provide real tensors, because that's the only way to get accurate shape tracking. Meta tensors, which take advantage of structured kernels, allow for JAX-style tracing, where you can feed in tracers that have shapes but no data and do fast tracing on the fly.
Most of the pieces to make this work should exist; it's just a matter of putting it all together.