# consider DifferentiationInterface.jl #219
Hi, thank you very much for the information. I'll take a look. Do you think using Enzyme could improve performance compared to ForwardDiff?
---

I tried the following code, but it seems that ForwardDiff.jl is much faster than Enzyme.jl:

```julia
julia> using DifferentiationInterface, StaticArrays, BenchmarkTools

julia> import ForwardDiff, Enzyme

julia> f(x) = sum(abs2, x)
f (generic function with 1 method)

julia> x = SVector(1.0, 2.0)
2-element SVector{2, Float64} with indices SOneTo(2):
 1.0
 2.0

julia> @benchmark value_and_gradient(f, AutoForwardDiff(), $x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.791 ns … 17.750 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.952 ns ±  0.564 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value_and_gradient(f, AutoEnzyme(), $x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.708 ns … 5.727 μs   ┊ GC (min … max):  0.00% … 99.60%
 Time  (median):     6.292 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.022 ns ± 70.177 ns  ┊ GC (mean ± σ):  14.08% ±  1.72%
 Memory estimate: 32 bytes, allocs estimate: 1.
```
---

Yeah, the front end for Enzyme has had a lot of issues (especially with StaticArrays). You should mostly find that the low-level […] Anyway, I think it's safe to say there is also a broader advantage from the flexibility of using DI, as opposed to merely a direct comparison of Enzyme and ForwardDiff. I'll try to remember to post back here when I believe the most egregious Enzyme StaticArrays cases are solved.
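To make that flexibility concrete, here is a minimal sketch (my own example, not from the thread; `gradient_norm` is a hypothetical downstream function) of what backend-generic code looks like with DI:

```julia
using DifferentiationInterface
import ForwardDiff, Enzyme

# Hypothetical downstream code: the AD backend is an ordinary argument,
# so callers can swap ForwardDiff for Enzyme without touching this function.
function gradient_norm(f, backend, x)
    y, g = value_and_gradient(f, backend, x)
    return sqrt(sum(abs2, g))
end

f(x) = sum(abs2, x)
x = [3.0, 4.0]

gradient_norm(f, AutoForwardDiff(), x)                   # ForwardDiff under the hood
gradient_norm(f, AutoEnzyme(; mode=Enzyme.Reverse), x)   # Enzyme under the hood
```

The gradient of `sum(abs2, x)` is `2x`, so both calls should agree on the result; only the backend object changes.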
---

Just popping by to say that maximum performance with DI can only be attained with preparation, so the right way to benchmark would look more like this:

```julia
backend1 = AutoForwardDiff();
prep1 = prepare_gradient(f, AutoForwardDiff(), x);
@btime value_and_gradient(f, $prep1, $backend1, $x)

backend2 = AutoEnzyme(; mode=Enzyme.Forward);
prep2 = prepare_gradient(f, backend2, x);
@btime value_and_gradient(f, $prep2, $backend2, $x)

backend3 = AutoEnzyme(; mode=Enzyme.Reverse);
prep3 = prepare_gradient(f, backend3, x);
@btime value_and_gradient(f, $prep3, $backend3, $x)
```

You're right that ForwardDiff is still faster for the gradient, but not by much, and ideally Enzyme in reverse mode should be on par with it (though that is hard to optimize). Related: […]
---

With the latest version of DI (v0.6.14) on Julia 1.10.5, I get benchmarks like these.

For […]:

```julia
julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  2.856 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  4.314 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  4.122 ns (0 allocations: 0 bytes)
```

For […]:

```julia
julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  16.767 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  33.211 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  8.411 ns (0 allocations: 0 bytes)
```

For […]:

```julia
julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  4.293 μs (19 allocations: 12.05 KiB)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  16.703 μs (330 allocations: 13.83 KiB)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  721.333 ns (2 allocations: 960 bytes)
```
---

Thank you for the detailed benchmarks. I have one question regarding your comment: I understand that the preparation step should be excluded when benchmarking DI, but in practice, can the preparation process actually be avoided? If not, it seems to have a significant impact on performance.
---

The preparation step can be avoided in practice, but in general that will make your code much slower: the "unprepared" version essentially falls back on calling preparation and then running the "prepared" version. In many cases, preparation involves preallocating a cache, recording a tape, or making some type-unstable choices like batch size, all of which are slow but can be reused whenever you compute the same operator for similar inputs. The mantra here is "prepare once, differentiate many times", which is why it usually doesn't make sense to benchmark preparation itself (a bit like JIT compilation).
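To illustrate the "prepare once, differentiate many times" pattern, here is a small sketch (my own example, using the DI API shown earlier in the thread):

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
x = rand(100)

backend = AutoForwardDiff()
prep = prepare_gradient(f, backend, x)  # pay the setup cost once, outside the hot loop

# Reuse the same preparation for many inputs of the same type and size.
total = 0.0
for _ in 1:1_000
    xi = rand(100)
    y, g = value_and_gradient(f, prep, backend, xi)
    total += y
end
```

The preparation result is only valid for inputs similar to the one it was built with (same type and size here), which is what makes the reuse safe.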
---

Hello! I've been lurking around this package for a while, thinking of maybe using it instead of StaticArrays.jl, or maybe what I want is a little different and I should try to do something similar but distinct, or whatever. I've been looking at some of the AD stuff in here, and I just thought I should point out the existence of DifferentiationInterface.jl. While ForwardDiff certainly seems particularly relevant for this package, the advantage of using the interface is that it can potentially be used to make the code generic over arbitrary AD back-ends. Another one that is potentially quite relevant here is Enzyme. It might also allow you to ditch a lot of the ForwardDiff internal stuff that is currently necessary (see for example `pushforward`, which is a slightly lower-level function provided by DI). Anyway, cool package, keep up the good work!