[RFC][WIP] Tensor Expression level automatic differentiation #1996
Comments
This looks great. I would also encourage us to think a bit more about how this can play together with the current system. Specifically, while being able to run AD at the tensor expression level end to end is fun, we may not necessarily use this method, since AD on the high-level IR still offers advantages such as general graph optimizations and the choice of algorithms (Winograd vs direct), so technically we would still encourage the use of high-level IR AD. On the other hand, this would become practical if we could have a proposal on how to write a quick expression op that can be integrated with other high-level IR and its AD system; this would enable a powerful way to write custom operators and automatically generate gradients for them |
@MarisaKirisame could you provide more feedback after PLDI? |
@jroesch yeah |
@tqchen A couple of months ago I tried to integrate this autodiff into NNVM; the attempt (now abandoned) is still preserved in some old commits: Being able to easily define new high-level ops just from tensor expressions sounds like a good idea, and I definitely missed something like this in NNVM. Also, I think it would be nice to be able to override operations' attributes, like gradients, from Python code. At the very least, this would be great for comparing automatic gradients with manually written gradients. |
FYI, we are moving toward NNVM v2 (Relay), and this is a good time to think about related questions while the scaffolding is being done. Relay already supports overriding ops' attributes from Python |
Indeed, I think that higher-level IRs will always be able to decide which method to use. Since Relay uses TVM under the hood, it should be in a good position to make the choice: either substitute operations when building the backward path (e.g. Winograd -> matmul), switch to hand-written gradients, or use the semantically correct default implementation provided by the current approach. |
Continuing the discussion about training started in #2122: @junrushao1994 @were @sgrechanik-h @szha I suggest we continue training-related discussion here.
We plan to do the following regarding scheduling:
|
For 1. and 2.:
```python
diff = tvm.diff([op1, op2, op3])
sch = tvm.create_schedule(diff.op)
```
The differentiation is also a kind of op node, so that we can unify the API with TVM's general flow. |
I've updated our automatic differentiation branch. Now the result of differentiating flatten is acceptable, and operations like max pool work better as well. We have also improved the API; the simple use-case looks pretty much the same up to function renaming:
```python
[dw1, dw2, dw3] = tvm.differentiate(loss, [w1, w2, w3])
```
The result object also gives access to the adjoints of intermediate tensors:
```python
x = tvm.placeholder((32, 3, 28, 28), name='x')
w1 = tvm.placeholder((10, 3, 3, 3), name='w1')
t1 = topi.nn.conv2d(x, w1, 1, 0, 1)
t2 = topi.nn.flatten(t1)
t3 = topi.sum(t2)

res = tvm.differentiate(t3)
res.adjoints[t1]
```
(This may be useful for manually scheduling intermediate tensors.) It is also possible to override the differentiation of particular tensors:
```python
def mydiff(out, inp, head):
    return tvm.compute(inp.shape, lambda ax0, ax1, ax2, ax3: head[ax0, ax3 + ax2*26 + ax1*676])

res = tvm.differentiate(t3, [x, w1], manual={(t2, t1): mydiff})
```
(This may be useful when the autodiff does a poor job.) |
@sgrechanik-h Thanks for the hard work on proposing and bringing in the implementation; this seems to be an RFC that many folks are interested in. I have no doubt about the possibility of making a correct and efficient tensor expression autodiff. The key issue we need to address first is how we can make a clear set of APIs and namespaces so that:
These discussions and decisions are important because they will affect how users use the API, as per https://docs.tvm.ai/contribute/code_review.html#deliberate-on-api-and-data-structures. I suggest we discuss them and reach consensus before we move on. cc @merrymercy @jroesch @yzhliu @junrushao1994 @ZihengJiang @grwlf @dmlc/tvm-committer |
@sgrechanik-h it would be great if you could edit the first post and list the choices or options clearly so that everyone can discuss in a coordinated way, see #2535 |
Here are the things to discuss that I came up with.

**Naming**

Concerning names, here is a list of things in the tensor expression AD that we might want to rename:
**API**

The main thing to discuss here is the Python function
```python
def differentiate(output, inputs=None, head=None, override=None, fdiff=None):
    """Perform reverse-mode automatic differentiation.

    Parameters:
    ...
    """
```
The returned value of the function is an instance of DifferentiationResult. It mimics a list of gradients (corresponding to inputs) but also contains two other fields. Several issues to discuss:
Another important function is the C++ counterpart, tvm::ir::Differentiate. This function is for developers, though. Currently my only suggestion is that we might want to support multiple outputs and heads.

**Relay integration API**

The user mostly needs two functions (code here):
The operation representing the generated gradients currently takes a single tuple as input and returns a single tuple as output (here). The attributes of the operation are as follows:
However, I think this operation should be considered an implementation detail, and the stability of its attributes, inputs and outputs shouldn't be relied upon. |
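To make the discussion of the Python API more concrete, here is a minimal usage sketch of the DifferentiationResult described above, assuming the tensor names from earlier examples in this thread (exact field names may differ):
```python
# Hedged sketch: `loss`, `w1`, `w2`, `t1` are the tensors from the earlier examples.
res = tvm.differentiate(loss, [w1, w2])   # reverse-mode AD of loss w.r.t. w1 and w2

# The result mimics a list of gradients corresponding to the inputs...
dw1, dw2 = res[0], res[1]

# ...but also exposes additional information, e.g. adjoints of intermediate tensors,
# which can be useful when scheduling them manually.
dloss_dt1 = res.adjoints[t1]
```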
|
Do you mean renaming
Wouldn't it make more sense to put it in |
Hello everyone. I want to tell you about the current status of tensor expression automatic differentiation. The latest version can be found here. The main improvements are as follows:
However, there are several TVM-related problems which should be addressed before creating pull requests:
|
I see a fundamental problem in this PR. Jacobian(Y, W) is a really, really big tensor, and the approach this PR takes has a "cliff of death" performance profile. The PR then relies on simplification to eliminate all those tensors; if any tensor is not eliminated (which seems to be the case for more complex tensors), the performance will be very bad. Reverse-mode automatic differentiation should only calculate the vector-Jacobian product. The jacobian of Y, W, should be dW times jacobian Y W. The Jacobian should simply never be manifested. Can this be fixed so that the algorithm is not asymptotically slower without optimization? |
@MarisaKirisame could you elaborate on "The jacobian of Y, W, should be dW times jacobian Y W"? I'm not sure I correctly understand the notation you use. |
@yzhliu let me explain it another way: in this PR, the gradient of f might be MANY times more expensive to compute than f itself. |
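To illustrate the cost argument with a concrete (non-TVM) sketch: for Y = X·W the full Jacobian of Y with respect to W is a 4-D tensor that never needs to be materialized, since reverse mode only needs the vector-Jacobian product. The NumPy example below uses made-up shapes:
```python
import numpy as np

n, k, m = 32, 64, 128
X = np.random.randn(n, k)
W = np.random.randn(k, m)
dY = np.random.randn(n, m)                    # adjoint of Y = X @ W from the rest of the graph

# Full Jacobian J[a, b, c, d] = dY[a, b] / dW[c, d] has n*m*k*m elements.
J = np.einsum('ik,jl->ijkl', X, np.eye(m))    # J[a, b, c, d] = X[a, c] * delta(b, d)
dW_via_jacobian = np.einsum('ab,abcd->cd', dY, J)

# Reverse mode computes the vector-Jacobian product directly: only O(n*k*m) work,
# and the Jacobian is never manifested.
dW_vjp = X.T @ dY
assert np.allclose(dW_via_jacobian, dW_vjp)
```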
@MarisaKirisame One approach that I know of is to differentiate by literally reversing the computation, replacing loads with stores (with accumulation).
@sergei-grechanik that is what people do. The problem with parallel AD is that shared reads become shared writes, but you can pull optimization tricks to turn shared writes into a single write (for example, https://people.csail.mit.edu/tzumao/gradient_halide/). |
@MarisaKirisame Yes, I think you are right. I guess implementing this approach requires extending tensor expressions with an additional construct or working on a different level. |
It probably requires working on the tensor IR, as the TVM compute level is pure.
@MarisaKirisame What do you think about adding a scatter-like reduction to TVM's tensor expressions? (I mean a construct that writes to computed output indices with an accumulating reduction.)
@sergei-grechanik Looks great! Another way to do it is to add a pass that turns scatter into gather, effectively doing Halide's autodiff. So the IR is not pure, but it is after some more processing. |
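As a toy (non-TVM) illustration of the scatter vs. gather forms being discussed, take a simple gather as the forward computation; its reverse is naturally a scatter with accumulation, which a scatter-to-gather pass would turn back into per-output sums:
```python
import numpy as np

x = np.random.randn(8)
idx = np.array([0, 2, 2, 5])                  # note the repeated index: shared reads
y = x[idx]                                    # forward: a gather, y[i] = x[idx[i]]
dy = np.random.randn(idx.size)

# Reverse as a scatter: shared reads become shared (accumulating) writes.
dx_scatter = np.zeros_like(x)
np.add.at(dx_scatter, idx, dy)                # dx[idx[i]] += dy[i]

# The same gradient in gather form: each dx[j] is produced by a single write,
# summing over the forward indices that read x[j].
dx_gather = np.array([dy[idx == j].sum() for j in range(x.size)])
assert np.allclose(dx_scatter, dx_gather)
```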
The TE AD has landed in the mainline |
I'm working on automatic differentiation at the level of compute expressions, and I would like to share some progress and hear any comments. Currently the automatic differentiation works well enough for some operations that it is possible to train a simple model; here is a tutorial on how to do this. Yet for many operations the performance is unacceptable, but I'm working on it.
My implementation mostly follows this paper. In this notebook I describe how my implementation works internally and give a list of operations which are known to work or not to work. Basically, the AD consists of two parts: the differentiation proper and a set of expression transformations that optimize the result.
All transformations work on the level of compute expressions (before scheduling). Their general goal is to eliminate summation over zeros by moving up conditional expressions of the form
`cond ? val : 0`
and then using them to simplify the iteration domains of reductions. Hopefully, these transformations may be useful for some other tasks besides AD once they are powerful enough. Currently the main problem is that they don't understand modular arithmetic, which is needed for differentiating dilated and strided convolutions and for the flattening operation. (A small illustration of the transformation is given at the end of this post.)
- The git branch
- The squashed commit
- The tutorial on training a simple model
- The notebook describing some internals
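As a small, purely illustrative example of eliminating summation over zeros, consider the adjoint of a flatten-like operation: the naive result sums a condition of the form `k == j + i*width ? head[k] : 0` over the whole flattened range, and moving the condition out of the reduction collapses the sum to a single term. The NumPy sketch below uses a tiny made-up width instead of the 26*26 = 676 factor from the conv2d/flatten example earlier in the thread:
```python
import numpy as np

width = 16                                    # made-up width; the real flatten example uses 676
head = np.random.randn(2 * width)             # adjoint of the flattened tensor

# Naive adjoint: every output element sums a `cond ? val : 0` expression
# over the entire flattened range.
naive = np.array([[sum(head[k] if k == j + i * width else 0.0 for k in range(2 * width))
                   for j in range(width)] for i in range(2)])

# After moving the condition up and shrinking the reduction domain,
# the summation over zeros disappears entirely.
simplified = np.array([[head[j + i * width] for j in range(width)] for i in range(2)])
assert np.allclose(naive, simplified)
```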