Inconsistent Behaviour in Broadcasting #1036
I guess the culprit is the discrepancy between the versions of the rule in DiffRules and ChainRules. In the latter we adopt a different convention from the former.
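For concreteness, here is one way to put the two rule sets side by side at a kink (a hedged sketch -- I'm using abs as the example, and the exact outputs depend on package versions):

```julia
using DiffRules, ForwardDiff, ChainRulesCore, ChainRules

# The derivative expression DiffRules registers for abs (consumed by ForwardDiff-style tools):
DiffRules.diffrule(:Base, :abs, :x)

# ForwardDiff's answer at the kink:
ForwardDiff.derivative(abs, 0.0)   # 1.0 on the versions I've tried

# ChainRules' rrule at the kink:
_, abs_pullback = rrule(abs, 0.0)
abs_pullback(1.0)                  # (NoTangent(), 0.0) -- the sign(0) = 0 convention
```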
Hmm, using DiffRules seems to be more "correct". Thoughts?
Unclear to me -- the "subgradient" convention adopted in ChainRules works perfectly in the KernelFunctions use-case. Also, it's what we do for lots of other things (ReLUs, abs, etc). My guess would be it's a useful thing to do?
Hmm, I interpreted this as meaning that you believe the convention adopted in DiffRules is more appropriate than the one in ChainRules. Was I correct in this interpretation?
Returning […] Considering […] The derivative wrt […] This also has the attractive property that if you enumerate all points at which the derivative is zero, […]. Which is where we get the subgradient convention (which I will write up one day; it comes from a discussion I had at a conference).
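Concretely, my reading of that convention (pending the promised write-up): at a kink of a convex function the subdifferential is an interval, and returning its least-norm element means the reported derivative is zero exactly at the stationary points. For abs:

$$
\partial\,|x|\,\big|_{x=0} = [-1,\,1], \qquad \text{returned value} = \operatorname*{arg\,min}_{g \,\in\, \partial f(x)} |g|, \qquad \text{which is } 0 \iff 0 \in \partial f(x).
$$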
I think this thread urgently needs a graph: As x -> 0 from above, the gradient does go to positive real infinity. Whether to return […] A few more cases to check, while I was at it. We should be sure to get all of these right.
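In lieu of the plot, a quick numerical check of the limit (my own snippet, evaluating the analytic slope just above the boundary):

```julia
# One-sided behaviour of ∂/∂x x^y as x -> 0 from above, for a few exponents.
for y in (-1.0, 0.0, 0.5, 1.0, 2.0, 3.14)
    x = 1e-8
    println("y = ", y, ":  y * x^(y-1) = ", y * x^(y - 1))
end
# y < 1 (and y != 0) blows up in magnitude as x -> 0, y = 1 gives 1, y > 1 goes to 0.
```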
How did you obtain these plots? x^y for y non-integer and x negative is impossible to define in a consistent way (e.g. see your 3.14 plot, which does not look like x^3).
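For reference, Julia's real-valued ^ already refuses exactly those inputs (standard behaviour, shown only to anchor the point):

```julia
(-2.0) ^ 3.0    # -8.0: fine, the exponent is integer-valued
(-2.0) ^ 3.14   # throws a DomainError: no consistent real value exists
```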
They allow complex numbers; the dashed lines are the imaginary part when it is nonzero. To get one answer you have to pick a branch, and Julia does.
It does not pick a branch? (nor should it)
Ah, you gave it complex numbers, I see. This is different, then. In that case the choice is inferred from the fact that zeros are signed:
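A small illustration of that signed-zero behaviour (my example, not from the original comment):

```julia
# The sign of the zero imaginary part selects which side of the branch cut you land on.
(-8.0 + 0.0im) ^ (1/3)   # ≈ 1.0 + 1.732im  (argument +π taken for the log)
(-8.0 - 0.0im) ^ (1/3)   # ≈ 1.0 - 1.732im  (argument -π taken for the log)
```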
Yes, there's a cut on the negative real axis in Julia's conventions. The 3rd choice is:
Here's my opinion of what should happen here
The crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives at the boundary of the domain, where the function is undefined on the other side, but not where the function is discontinuous. Does that make sense?
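One way I'd phrase that distinction in code (my own examples; the exact outputs depend on the rule versions in use):

```julia
using Zygote

# Boundary of the domain: sqrt is undefined for x < 0, so the one-sided
# derivative at 0 is the only candidate, and returning it (+Inf) seems right.
Zygote.gradient(sqrt, 0.0)   # (Inf,) on the versions I've tried

# Interior kink: abs is defined on both sides of 0, so no single slope is
# "the" answer; any subgradient in [-1, 1] is valid and a convention is needed.
Zygote.gradient(abs, 0.0)    # the ChainRules sign(0) = 0 convention gives (0.0,)
```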
Except at y=1, where it's 1 -- the 2nd panel of my graph. And at y=0 it's zero. And negative y is allowed too. But what I'm not sure about is whether returning
Oooh that's evil. Then yeah I agree with you: when y is integer (positive or negative), the rule should be that pow(x::Real, y::Real) follows pow(x::Real, y::Int). The derivative wrt y when y is integer and x <= 0 should be NaN. I don't see the point of regularizing: the one-sided derivative is infinite.
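A rough sketch of that proposal, just to pin it down (hypothetical helper, not the actual ChainRules rule):

```julia
# For integer-valued y the ∂/∂x partial follows the x^Int rule; the ∂/∂y partial is
# NaN whenever x <= 0, since x^y is not a real-differentiable function of y there.
function pow_partials(x::Real, y::Real)
    dx = y * x ^ (y - 1)    # for x < 0 this only makes sense when y is integer-valued
    dy = x > 0 ? x ^ y * log(x) : oftype(float(x), NaN)
    return dx, dy
end

pow_partials(-2.0, 3.0)   # (12.0, NaN)
pow_partials( 2.0, 3.0)   # (12.0, 8 * log(2))
```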
Yes, I think I agree. When at a boundary, we implicitly assume that this is the boundary of the domain of your parameter too, so of course you care about the one-sided derivative. When at a singularity in the interior there's no obvious answer mathematically. Although sometimes, like
The point would be to try to save people from NaNs downstream. What you are ultimately doing (in my head at least) is gradient descent, and if your parameter is bounded to [0,1], sometimes you need an arrow telling you to move away from the wall. A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness. That said, I'm far from 100% sure that this wouldn't have weird consequences elsewhere. It just seems worth some thought, before jumping from an infinite limit in the pure mathematics to the concrete
With what you suggest the arrow would be 1/eps(); if you're doing things like that, it will break. I would imagine you want to make sure it breaks fast; that's easier to debug.
The regulated arrow would depend on the function. So it goes smoothly from "infinite" to 0 for powers near 1:
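If I'm reading "regulated" right, it would be something along these lines (purely my guess at the shape, not the code referred to above):

```julia
# Evaluate the analytic slope at max(x, eps) rather than exactly at the wall,
# so the "arrow" near the boundary is large but finite.
regulated_dx(x, y) = y * max(x, eps(one(x))) ^ (y - 1)

regulated_dx(0.0, 0.5)   # ≈ 3.4e7   (instead of Inf)
regulated_dx(0.0, 1.0)   # 1.0
regulated_dx(0.0, 2.0)   # ≈ 4.4e-16 (essentially 0)
```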
That was my thinking in writing
How obvious is this? If I have some parameter x in [0,1], and choose to re-parameterise to work with
Behaviour with ChainRules v1.11.4 now matches ForwardDiff, and the graph, for the gradient with respect to
ForwardDiff still has NaN for the
Hi, I discovered inconsistent behavior when using Zygote:

```julia
using Zygote

x = [1.0, -1.0]
w = [1.0, 0.0]

g1 = gradient(x) do x
    sum(abs.(x .- w))
end # ([1.0, -1.0],)

g2 = gradient(x) do x
    abs(x[1] - w[1]) + abs(x[2] - w[2])
end # ([0.0, -1.0],)
```

Basically, both are correct in the sense that both provide a valid subgradient. Based on the first post in this thread, the issue might come (but I don't know enough of Zygote internals) from different definitions for the subgradient: DiffRules vs ChainRules. Let me know if I can provide more info!
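For what it's worth, a hedged way to check which convention each of these paths follows (my own snippet; the outputs are what I'd expect from the discussion above, not verified on every version):

```julia
using ForwardDiff, Zygote

x = [1.0, -1.0]
w = [1.0, 0.0]

# The broadcast path goes through dual numbers, i.e. the DiffRules-style abs'(0) = 1:
ForwardDiff.gradient(x -> sum(abs.(x .- w)), x)   # expected [1.0, -1.0], matching g1

# The scalar path uses the ChainRules rule for abs, with abs'(0) = sign(0) = 0:
Zygote.gradient(xi -> abs(xi - w[1]), x[1])       # expected (0.0,), matching g2's first entry
```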
As of version 0.6.15, the following:
The same code in version 0.6.14:
the crucial difference being in the last case for each.
If you do this with ForwardDiff, you get
so this smells to me like ForwardDiff is being invoked for `f`, whereas individual chain rules are used for `sum(x .^ y)`, because Zygote un-fuses broadcasting where it can (it can't un-fuse `f`). @oxinabox @mcabbott @DhairyaLGandhi I'm assuming that this is related to our recent efficiency upgrades?
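To spell out what "un-fuses" means here, a toy comparison (my own example, not the reported f; whether each path is taken depends on the Zygote version):

```julia
using Zygote

x, y = [1.0, 2.0], 2.5

# A bare broadcast of a known function: Zygote can un-fuse this and apply the
# scalar ^ rule from ChainRules element by element.
Zygote.gradient((x, y) -> sum(x .^ y), x, y)

# Broadcasting an opaque user closure: there is no per-element rule to reach for,
# so Zygote pushes dual numbers (ForwardDiff machinery) through the fused broadcast.
pw(xi, y) = xi ^ y
Zygote.gradient((x, y) -> sum(pw.(x, y)), x, y)
```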
This is breaking the build in KernelFunctions.jl. We can of course allow the tests to fail for the time being, but it essentially means that we can't use Zygote with one of our kernels.