Add `Op`s for `solve_discrete_lyapunov` and `solve_continuous_lyapunov` #1020
Conversation
Looks great. I have only a couple of minor questions.
See the implementation here.
@aesara-devs/aesara, as always, we should take a quick look at the underlying NumPy/SciPy implementations to see whether or not something can be implemented using existing Aesara `Op`s.
While the direct solver can be written elegantly using only Aesara `Op`s, I still think there is value added in handing the forward part to SciPy. As noted in the SciPy documentation, the direct solver scales very poorly with the size of the matrices. Even at N=50, the bilinear solver is two orders of magnitude faster than the direct solver (under 1 ms versus about 180 ms), and its memory footprint is much lower because no Kronecker product is required. Here is a graph comparing the performance of the `Op` from this PR, using the bilinear solver, to the pure-Aesara direct solver; sizes larger than N=250 could not be allocated in memory for the direct solution. In the application I am thinking about (Kalman filtering), this routine would need to be called once per logp evaluation, so speed is important.

I realize this is an unfair comparison, and that the real question is whether the bilinear solver could be directly implemented in Aesara. That would first require an implementation of the Schur decomposition (the imaginary part is discarded, so this is not an issue), then an implementation of a Sylvester equation solver, which would in turn require an efficient solution routine like `trsyl`, otherwise we run into the direct-solver problem again. Alternatively, we could decide that having access to `trsyl` is good, wrap the SciPy Sylvester equation solver, and call that from a semi-pure Aesara implementation of `solve_lyapunov`, but this strikes me as arbitrary, especially if the Schur decomposition ended up as a SciPy wrapper as well.
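For concreteness, here is a minimal sketch (my own illustration, not code from this PR) of the pure-Aesara "direct" discrete solver being compared above, assuming `kron` and `solve` from `aesara.tensor.slinalg`; the N²×N² Kronecker system is exactly what makes it scale poorly:

```python
import aesara
import aesara.tensor as at
from aesara.tensor.slinalg import kron, solve

A = at.matrix("A")
Q = at.matrix("Q")
N = A.shape[0]

# Direct method for the (real) discrete equation A X A^T - X + Q = 0:
# vec(X) = (I - kron(A, A))^{-1} vec(Q)
vec_X = solve(at.eye(N * N) - kron(A, A), Q.flatten())
X = vec_X.reshape((N, N))

solve_direct = aesara.function([A, Q], X)
```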
There's no need to justify the inclusion of a relevant SciPy feature, and, yes, the question is whether or not the other estimation approach that is supported by SciPy can be implemented with existing Aesara `Op`s. If need be, a custom `Op` is perfectly fine.
For reference, the other discrete approach (i.e. "bilinear") appears to be here, and, from a quick glance, it looks to be well covered by existing `Op`s.
The relevant lines I see as being not covered are here, inside the continuous-case solver that the bilinear method ultimately calls:

```python
# Compute the Schur decomposition form of a
r, u = schur(a, output='real')

# Construct f = u'*q*u
f = u.conj().T.dot(q.dot(u))

# Call the Sylvester equation solver
trsyl = get_lapack_funcs('trsyl', (r, f))
```
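For reference, a rough sketch (an assumption on my part, not code from this PR or from Aesara) of what a forward-only, SciPy-wrapping `Op` for the Schur step shown above could look like, to give a sense of the effort involved:

```python
import aesara.tensor as at
import scipy.linalg
from aesara.graph.basic import Apply
from aesara.graph.op import Op


class Schur(Op):
    """Forward-only wrapper around ``scipy.linalg.schur`` (real Schur form)."""

    __props__ = ()

    def make_node(self, A):
        A = at.as_tensor_variable(A)
        # Outputs: quasi-triangular T and orthogonal Z (dtype details glossed over)
        return Apply(self, [A], [A.type(), A.type()])

    def perform(self, node, inputs, output_storage):
        (A,) = inputs
        T, Z = scipy.linalg.schur(A, output="real")
        output_storage[0][0] = T
        output_storage[1][0] = Z
```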
Yes, a Schur decomposition `Op` and a `trsyl` `Op` would cover it.

"Arbitrary" here refers to the criterion for writing an `Op` wrapper around a given SciPy function; the word was chosen more out of my ignorance of that criterion than any capriciousness (or lack thereof) in the criterion itself. What I am gleaning is that the ideal criterion for a custom SciPy `Op` is "as close to LAPACK as possible"? One could get more atomic than that, but I doubt it would be productive.

I am also struggling with the merits and demerits of having analytic gradient expressions. My naive perspective is that, even if we had implementations of Schur and `trsyl`, it would still be better to manually implement the Lyapunov `Op`s, because we have an analytic expression for the reverse-mode gradients. We relieve the system of the need to compute these, and side-step any potential issues arising from approximation.
Ah, yeah, the "depth" at which something should be implemented is always measured relative to its complexity and ancillary benefits. As you've noticed, this library's current "depth" is somewhere around the BLAS/LAPACK level. As a result, the complexity of implementing something at that level is relatively low, and there are multiple examples illustrating how to implement and test such `Op`s; those new `Op`s would also be reusable elsewhere in the library. Work at this "depth" provides two important ancillary benefits that are exclusive to this approach:
Our ability to reason about expressions and manipulate them in an automated fashion is what allows many of the performance and stability optimizations that set this project apart from plain NumPy and SciPy, as well as other tensor libraries. Redundant, high-level `Op`s undermine that ability, because none of the existing rewrites apply to them. Consider an `Op` that duplicates functionality already expressible with existing `Op`s: one could construct new rewrites explicitly for the redundant `Op`, but those rewrites would only duplicate the ones that already exist. In this situation, redundancy begets redundancy, so, unless the benefits of introducing redundancy in a particular instance are very clear, we must avoid it altogether.
First, we always have "analytic" expressions for the gradients; numerical approximations are not produced or used by Aesara. Second, the gradient of a composite graph is assembled automatically from the gradients of its constituent `Op`s, so there is nothing to "relieve" the system of by hand-writing the composite gradient.
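To make this concrete (a small illustration of my own, not tied to this PR): the gradient of a composite graph is built symbolically from each constituent `Op`'s gradient, with no numerical approximation anywhere:

```python
import numpy as np

import aesara
import aesara.tensor as at

x = at.vector("x")
y = at.sum(at.exp(x) * at.sin(x))  # a composite of several elemwise Ops and a Sum

# `aesara.grad` assembles the symbolic gradient by chaining each Op's L_op
dy_dx = aesara.grad(y, x)
dy_dx_fn = aesara.function([x], dy_dx)

dy_dx_fn(np.zeros(3))  # exact: exp(0)*sin(0) + exp(0)*cos(0) = 1
# array([1., 1., 1.])
```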
Thanks for laying it out clearly. I'm clearly not getting it through my head that Aesara straddles the line between SymPy and NumPy. I appreciate your patience as I learn all this. I'm happy to work on the lower-level `Op`s.

I spent this morning doing some research regarding the Schur decomposition, and it seems it would be non-trivial to implement. I couldn't find an implementation in TF, Torch, or MXNet. I dug around on Google Scholar and found implementations for gradients of the SVD, QR, and LU decompositions (here and here), but not Schur or QZ. If anyone has a reference, I'm happy to work on figuring an implementation out. But, as I'm sure you can tell from repeated interactions with me, I'm already extremely out of my depth here. Trying to work out the gradients myself from first principles is not likely to happen.
If we can't get a gradient for the Schur decomposition in place, then a separate PR for a gradient-less Schur `Op` would still be welcome.
Same with those other decompositions (e.g. QR, LU) with known gradients; having those available as `Op`s would be valuable on its own. In the end, we really only need to do this sort of due diligence to determine exactly why something can/can't reasonably be done at a lower "depth".
It wouldn't surprise me if no one has published a solution to this one, but my first thought is that the approach would be very similar to the one for QR, since the two decompositions are related. If I recall correctly, a common QR algorithm essentially computes the Schur decomposition, but there might be some subtleties involving real/imaginary results.
@brandonwillard But if we can't easily get Schur, we can't get Lyapunov that way, no? If that's the case, it sounds like this PR would be our next best bet.
Yes, that's part of what I was saying. There are other options we could consider, though. For instance, it might be straightforward to identify these computations in a graph and rewrite them into a more suitable form.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1020      +/-   ##
==========================================
+ Coverage   74.10%   74.16%   +0.06%
==========================================
  Files         174      174
  Lines       48673    48774     +101
  Branches    10373    10376       +3
==========================================
+ Hits        36067    36175     +108
+ Misses      10315    10311       -4
+ Partials     2291     2288       -3
```
I hope I don't seem like a broken record when saying this, but implementing the backward-mode gradient operations through, e.g., the Schur decomposition isn't ideal. Let's say we want to compute the gradients of the Lyapunov solution $X$ for a symmetric $A$ with repeated eigenvalues. For a given $\bar{X}$ the values for $\bar{Q}$ and $\bar{A}$ are also nicely defined, and the formulas from the paper work just fine. But if we compute the gradients through the Schur decomposition we run into trouble. First, we can notice that because $A$ is symmetric, the Schur decomposition coincides with the eigenvalue decomposition. But because the eigenvalues of $A$ are not all distinct, the function that maps $A$ to its eigenvalue decomposition isn't a well-defined function in the mathematical sense, because any orthogonal matrix $Q$ can be used as the eigenvector matrix. In the forward code this isn't an issue, because we don't care which $Q$ we get as long as it is orthogonal. However, the backward operation of the eigenvalue decomposition isn't well defined because of this, which shows up in the formulas for $\bar{A}$ as a division by zero (a term of the form $1 / (\lambda_j - \lambda_i)$).

This makes me think that the `Op` as it is implemented in this PR is actually better than a rewrite that uses the Schur decomposition and its gradient directly. It is reasonably stable, tested, and produces values and gradients whenever they are well defined (i.e. whenever the Lyapunov equation itself has a unique solution). But if we want an even better solution, maybe we could get the best of both worlds: if we have forward `Op`s for Schur, `trsyl`, QZ, etc., we could write the Lyapunov et al. `Op`s as graphs of those forward `Op`s while keeping the hand-derived gradient expressions.
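For reference, the reverse-mode rule for the eigen-decomposition $A = U D U^{-1}$ (real case) from Giles (2008), which the code further below implements, is roughly the following (bars denote cotangents, $\circ$ the elementwise product); the off-diagonal entries of $F$ are where the division by zero appears when eigenvalues coincide:

$$
\bar{A} = U^{-T}\left(\bar{D} + F \circ \left(U^{T} \bar{U}\right)\right) U^{T},
\qquad
F_{ij} =
\begin{cases}
\dfrac{1}{\lambda_j - \lambda_i} & i \neq j, \\
0 & i = j.
\end{cases}
$$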
I've still yet to see how "expanding" the Lyapunov-solving function/operator into its constituent parts wouldn't be ideal. In other words, if it could be done, why wouldn't it be better than the un-expanded version? So far, in this discussion, the only concrete issue I've noticed is whether or not one component of the expanded version could be implemented without more effort than it's theoretically worth, but that's distinct from the question(s) "Is it possible or ideal?". Aside from that, I've already given a directly relevant example demonstrating how such expansions can be at least as performant as their un-expanded counterparts (i.e. the discrete "direct" case). Just so we're clear, the take-away from that example is that we should dispatch based on the method, so that the expanded version is used when possible, and, if need be, we can have an un-expanded `Op` as a fallback.
It sounds like you're restating one of the potential challenges behind implementing a complete gradient for the constituent Schur step, and not necessarily making a statement about the performance or quality of an expanded approach, both of which are important aspects of an ideal/not-ideal approach. Before going further, it should be clear that we're talking about one very specific "degenerate" set of inputs for which we have not even performed any sort of analysis for the un-expanded case (i.e. the current implementation), let alone a relevant comparative analysis with an expanded implementation of any form. That said, this is really a constrained issue, and one that might be surmountable in any number of ways, some of which could have everything/nothing to do with the specifics of an `Op`'s gradient implementation.

This is especially relevant given that the provided example equates eigen-decompositions with Schur decompositions, which may be true in some cases at a high level, but not true in terms of the solution-generating processes and their choices regarding mathematically ambiguous mappings, subgradients, removable singularities, etc. If we want consistent gradient implementations, then we need to be consistent about how those choices are made across `Op`s.
The actual function mapping $A$ to an eigen-decomposition (i.e. the one implemented in code) necessarily makes a specific choice among the valid decompositions. More importantly, consider a simpler function with a similar issue: the absolute value.

```python
import numpy as np

import aesara
import aesara.tensor as at

x = at.scalar("x")
y = at.abs(x)

y_fn = aesara.function([x], [y, aesara.grad(y, x)])

y_fn(0.0)
# [array(0.), array(0.)]
```

As with the absolute value, it's possible to choose a specific subgradient value. I don't see why the same isn't possible/reasonable here.
I keep noticing mention of "forward" and "backward", but I don't yet see why these distinctions are relevant, so I can't address those aspects at the moment. Regardless, let's see if we can quickly do something along the lines of the subgradient approach mentioned above, but for this exact example. Currently, our `Eigh` gradient produces `nan`s at such degenerate inputs (demonstrated below). For comparison, we'll simply add the implementation from Giles, Mike. 2008. "An Extended Collection of Matrix Derivative Results for Forward and Reverse Mode Automatic Differentiation." to a custom `Op`:

```python
import numpy as np
import aesara
import aesara.tensor as at
from aesara.tensor.nlinalg import Eig, _zero_disconnected
A = at.matrix("A")
d, U = at.linalg.eigh(A)
eigh_fn = aesara.function([A], [d, U, at.grad(U.sum(), A)])

class MyEig(Eig):
    def L_op(
        self,
        inputs,
        outputs,
        g_outputs,
    ):
        (A,) = inputs
        d, U = outputs
        dd, dU = _zero_disconnected([d, U], g_outputs)

        dD = at.diag(dd)

        # Compute all differences of the elements in `d`, i.e. `d[j] - d[i]`
        E = at.outer(at.ones(d.shape[0]), d) - d[..., None]

        # This is what the current version of `Eigh.grad` effectively does:
        # from aesara.tensor.subtensor import set_subtensor
        # non_diag_mask = tm.invert(at.eye(E.shape[0], dtype="bool"))
        # F = at.zeros_like(E)
        # F = set_subtensor(F[non_diag_mask], tm.reciprocal(E[non_diag_mask]))

        # This replaces all `d[j] == d[i]` with 0, instead of just the diagonal
        # of E
        F = at.switch(at.neq(E, 0.0), at.reciprocal(E), 0.0)

        # TODO: This inverse probably isn't good.
        dA = at.linalg.inv(U).T @ (dD + F * U.T @ dU) @ U.T

        return [dA]

myeig = MyEig()
d_2, U_2 = myeig(A)
myeig_fn = aesara.function([A], [d_2, U_2, at.grad(U_2.sum(), A)])
rng = np.random.default_rng(2039)
A_val = rng.normal(size=(3, 3))
A_val = A_val.T @ A_val
A_val = A_val + A_val.T
d_val, U_val, dU_sum_val = eigh_fn(A_val)
d_val, U_val, dU_sum_val
# [array([ 1.60014544, 6.97086699, 10.3810344 ]),
# array([[ 0.60544104, -0.06363686, 0.79334198],
# [-0.65820594, -0.6004278 , 0.4541491 ],
# [-0.44744396, 0.7971429 , 0.40540979]]),
# array([[-0.0907336 , 0. , 0. ],
# [ 0.32372021, 0.14822482, 0. ],
# [-0.30375936, 0.0925893 , -0.05749122]])]
# Rough check of the eigen-decomposition
np.allclose(U_val * d_val @ np.linalg.inv(U_val), A_val)
# True
%timeit eigh_fn(A_val)
# 148 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
d_2_val, U_2_val, dU_sum_2_val = myeig_fn(A_val)
d_2_val, U_2_val, dU_sum_2_val
# [array([ 1.60014544, 10.3810344 , 6.97086699]),
# array([[-0.60544104, 0.79334198, -0.06363686],
# [ 0.65820594, 0.4541491 , -0.6004278 ],
# [ 0.44744396, 0.40540979, 0.7971429 ]]),
# array([[-0.03123565, -0.12868062, -0.41475149],
# [ 0.01339011, 0.05516287, 0.17779589],
# [-0.01800771, -0.07418586, -0.23910902]])]
np.allclose(U_2_val * d_2_val @ np.linalg.inv(U_2_val), A_val)
# True
%timeit myeig_fn(A_val)
# 97.6 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Perform some rough checks to make sure that the new gradient implementation
# for our new `Op` is working:
aesara.gradient.verify_grad(lambda A: myeig(A)[0].sum(), [A_val], rng=rng)
# How about at a (previously) "undefined" input?
aesara.gradient.verify_grad(lambda A: myeig(A)[0].sum(), [np.eye(3)], rng=rng)
# What about `Eigh`?
aesara.gradient.verify_grad(lambda A: at.linalg.eigh(A)[0].sum(), [A_val], rng=rng)
aesara.gradient.verify_grad(lambda A: at.linalg.eigh(A)[0].sum(), [np.eye(3)], rng=rng)
# ValueError: ('abs_err not finite', 'array([[nan, 0., 0.],\n [nan, nan, 0.],\n [nan, nan, nan]])')
# Now, try the degenerate case:
eigh_fn(np.eye(3))
# [array([1., 1., 1.]),
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.]]),
# array([[nan, 0., 0.],
# [nan, nan, 0.],
# [nan, nan, nan]])]
myeig_fn(np.eye(3))
# [array([1., 1., 1.]),
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.]]),
# array([[0., 0., 0.],
# [0., 0., 0.],
#        [0., 0., 0.]])]
```

Like the expanded discrete Lyapunov example, this expanded gradient implementation is at least as performant as a comparable non-expanded implementation. While the comparison isn't exact, because `Eigh` and this `Eig` subclass use different underlying routines, it's close enough to make the point.
To reiterate, you've only stated one of the implementation challenges/choices, not a point regarding the relative quality of different implementations; otherwise, what you're saying appears to amount to "We have an implementation in this branch and I want to use it, so it's better than the unimplemented approaches we're discussing". It's fine to say you want this implementation, but it doesn't make sense to use that as a means of comparing their relevant qualities (aside from being present or not). Also, functions are only as well defined as they're implemented/specified: the fact that a common mathematical definition of a function isn't well defined at certain points doesn't mean that a viable, refined definition of it isn't possible; however, you seem to be implying the latter.
@aseyboldt neatly summarized the possible approaches and their pros/cons in a private conversation, and one of the points he made was that what we're calling the "subgradient approach" (i.e. making "usable" gradient values available at the degenerate points) is one of the viable options. I absolutely agree that the summary is worth sharing here.
My summary from that chat (slightly edited, I hope I didn't break anything @brandonwillard): We know that if we split the solve `Op` and then chain the backward ops of the computation graph of `solve_lyapunov` (i.e. the Schur decomposition and `trsyl`), we get additional singularities that we don't want: the Lyapunov equation always has a solution and gradient unless the eigenvalues of $A$ are degenerate in a specific way (for the continuous equation, $\lambda_i + \lambda_j = 0$ for some pair; for the discrete equation, $\lambda_i \lambda_j = 1$), but the chained backward pass also breaks down whenever eigenvalues merely coincide. We know of these 5 options:
And my current personal feelings about those would be something like this:
This turned out to be more tricky than I thought; I hope we didn't scare you away @jessegrabowski :-)
Not scared away, just staying quiet to avoid making myself look like a fool (a common occurrence). Some questions:
Plus a comment:
You can define a function for the gradient of an `OpFromGraph`, so that approach could keep the hand-derived gradients.
Yes, I believe so.
Also, if NumPy/SciPy expose their own C-level means of accessing BLAS/LAPACK functions, then one can always use those.
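As a small illustration (my own, not from the PR): SciPy already exposes the LAPACK routine quoted earlier at the Python level via `get_lapack_funcs`, and `scipy.linalg.cython_lapack` exposes the same routines at the C level for use from a C implementation:

```python
import numpy as np
from scipy.linalg import get_lapack_funcs

# Fetch the double-precision Sylvester solver (trsyl) that SciPy's
# Lyapunov routines use internally.
(trsyl,) = get_lapack_funcs(("trsyl",), (np.empty((2, 2)),))
print(trsyl)  # a wrapper around LAPACK's dtrsyl
```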
I updated this PR with the following changes:
My idea with this setup is to direct users to the Aesara implementation, while leaving the "by hand" version until we can cover all the corner cases: complex gradients and a native Aesara implementation of the continuous case. If the consensus is to remove it altogether, though, I can do that.
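To make the dispatch idea from earlier in the thread concrete, here is a rough sketch (names and structure are my own assumptions, not the PR's actual code) of a user-facing helper that expands the "direct" method into existing Aesara `Op`s and would hand the "bilinear" method to the SciPy-backed `Op`:

```python
import aesara.tensor as at
from aesara.tensor.slinalg import kron, solve


def solve_discrete_lyapunov(A, Q, method="direct"):
    """Solve A X A^T - X + Q = 0 for X (real case)."""
    if method == "direct":
        # Expanded, pure-Aesara path: visible to (and rewritable by) Aesara
        N = A.shape[0]
        vec_X = solve(at.eye(N * N) - kron(A, A), Q.flatten())
        return vec_X.reshape((N, N))
    if method == "bilinear":
        # Here the SciPy-backed Op from this PR (with its hand-derived
        # gradient) would be applied; left abstract since its exact name
        # isn't shown in this thread.
        raise NotImplementedError("apply the SciPy-backed bilinear Op here")
    raise ValueError(f"Unknown method: {method}")
```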
These changes look great. We just need to resolve the conflict(s) and we can get to merging this.
I think I messed this all up trying to update the branch to match the current aesara `main`.
One minute; I'll take a (local) look.
Force-pushed from 3ff4822 to 84029ae
I just pushed some changes to this branch. @jessegrabowski, take a look and make sure all your changes are in there.
If you want to make more changes, you'll need to fetch the upstream changes and do something like `git reset --hard <origin-name>/solve_lyapunov` to pull in these remote changes and overwrite your local ones.
Before you do that, make sure your `<origin-name>` (e.g. `origin`) remote is set to pull from your fork (i.e. `jessegrabowski/aesara.git`), where these changes reside. You can use `git remote` to do that. (N.B. there are a lot of ways to do this, but the important point is that you want to overwrite your local changes with whatever is here in `jessegrabowski:solve_lyapunov`.)
You should probably check your upstream remote, too, and just make sure it points to `aesara-devs/aesara.git`.
Force-pushed from 84029ae to 4fc0168
Thanks for taking the time to clean up my mess. I'll take some time to get my git situation squared away before I move forward with any other contributions.
If it helps anyone, I am also posting here. For reference, I have a background in applied linear algebra.
Force-pushed from 4fc0168 to 48965b2
Sorry for the delay in reviewing this. It looks like Brandon approved this PR, so unless the tests fail after rebasing on `main`, I think this is good to merge.
Following our discussions in #1015 and #1011, this PR adds `Op`s to `aesara.tensor.slinalg` that wrap `scipy.linalg.solve_discrete_lyapunov` and `scipy.linalg.solve_continuous_lyapunov`, as well as compute the reverse-mode gradients for each.

One note: I had an error from mypy when running the pre-commit hooks:
```
aesara\link\c\cmodule.py:2439: error: Unused "type: ignore" comment
```
But this is unrelated to any changes I made, so I didn't know what to do with it.