
consider DifferentiationInterface.jl #219

Open
ExpandingMan opened this issue Oct 11, 2024 · 7 comments

Comments

@ExpandingMan

Hello! I've been lurking around this package for a while, thinking of maybe using it instead of StaticArrays.jl, or maybe what I want is a little different and I should try to do something similar but distinct, or whatever.

I've been looking at some of the AD stuff in here, and I just thought I should point out the existence of DifferentiationInterface.jl. While ForwardDiff certainly seems particularly relevant for this package, the advantage of using the interface is that it can make the code generic over arbitrary AD back-ends. Another back-end that is potentially quite relevant here is Enzyme. It might also allow you to ditch a lot of the ForwardDiff internals that are currently necessary (see for example `pushforward`, a slightly lower-level function provided by DI).
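To illustrate what "generic over back-ends" means in practice, here is a minimal hedged sketch (not from this package; `f` and `x` are placeholders): the differentiation call stays identical and only the backend object changes.

```julia
# Minimal sketch: one DI call, swappable back-ends.
using DifferentiationInterface
import ForwardDiff, Enzyme

f(x) = sum(abs2, x)
x = [1.0, 2.0]

# Only the backend argument differs; the surrounding code stays generic.
y1, g1 = value_and_gradient(f, AutoForwardDiff(), x)
y2, g2 = value_and_gradient(f, AutoEnzyme(), x)
```

Both calls should return the same value and gradient; which backend is fastest depends on the problem.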

Anyway, cool package, keep up the good work!

@KeitaNakamura
Owner

Hi, thank you very much for the information. I'll take a look. Do you think using Enzyme could improve performance compared to ForwardDiff?

@KeitaNakamura
Owner

I tried the following code, but it seems that ForwardDiff.jl is much faster than Enzyme.jl:

julia> using DifferentiationInterface, StaticArrays, BenchmarkTools

julia> import ForwardDiff, Enzyme

julia> f(x) = sum(abs2, x)
f (generic function with 1 method)

julia> x = SVector(1.0, 2.0)
2-element SVector{2, Float64} with indices SOneTo(2):
 1.0
 2.0

julia> @benchmark value_and_gradient(f, AutoForwardDiff(), $x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max):  1.791 ns … 17.750 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.952 ns ±  0.564 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▇       █        ▂                       ▂ ▁
  ▃▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█ █
  1.79 ns      Histogram: log(frequency) by time     2.08 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value_and_gradient(f, AutoEnzyme(), $x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max):  5.708 ns …  5.727 μs  ┊ GC (min … max):  0.00% … 99.60%
 Time  (median):     6.292 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.022 ns ± 70.177 ns  ┊ GC (mean ± σ):  14.08% ±  1.72%

  ▇█▇█▇▄  ▁   ▄▆▅▅▃▂                                         ▂
  █████████▆▂▆███████▅▅▅▄▆▆▅▄▃▆▇▆▇▇█▇▇▆▆▆▅▆▇▇▇█▇██▆▆▂▅▄▅▅▆▇▆ █
  5.71 ns      Histogram: log(frequency) by time     14.4 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

@ExpandingMan
Author

ExpandingMan commented Oct 15, 2024

Yeah, the front-end for Enzyme has had a lot of issues (especially with StaticArrays). You should find that the low-level autodiff from Enzyme is blazing fast in most cases. @gdalle and I have been working on getting StaticArrays to work efficiently with DI as-is. It would be interesting to see how it works for Tensorial; it might suggest optimizations for DI or the core packages.

Anyway, I think it's safe to say that the flexibility of DI offers a broader advantage than a direct comparison of Enzyme and ForwardDiff alone would suggest. I'll try to remember to post back here once I believe the most egregious Enzyme + StaticArrays cases are solved.

@gdalle

gdalle commented Oct 15, 2024

Just popping by to say that maximum performance with DI can only be attained with preparation, so the right way to benchmark would look more like this:

# assumes f and x are defined as in the earlier benchmark
using DifferentiationInterface, BenchmarkTools
import ForwardDiff, Enzyme

backend1 = AutoForwardDiff();
prep1 = prepare_gradient(f, backend1, x);
@btime value_and_gradient(f, $prep1, $backend1, $x)

backend2 = AutoEnzyme(; mode=Enzyme.Forward);
prep2 = prepare_gradient(f, backend2, x);
@btime value_and_gradient(f, $prep2, $backend2, $x)

backend3 = AutoEnzyme(; mode=Enzyme.Reverse);
prep3 = prepare_gradient(f, backend3, x);
@btime value_and_gradient(f, $prep3, $backend3, $x)

You're right that ForwardDiff is still faster for the gradient but not by much, and ideally Enzyme in reverse mode should be on par with it (but it's hard to optimize).

Related:

@gdalle

gdalle commented Oct 16, 2024

With the latest version of DI (v0.6.14) on Julia 1.10.5, I get benchmarks like these:

For x = @SVector(rand(2)):

julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  2.856 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  4.314 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  4.122 ns (0 allocations: 0 bytes)

For x = @SVector(rand(10)):

julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  16.767 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  33.211 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  8.411 ns (0 allocations: 0 bytes)

For x = rand(100):

julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  4.293 μs (19 allocations: 12.05 KiB)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  16.703 μs (330 allocations: 13.83 KiB)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  721.333 ns (2 allocations: 960 bytes)

@KeitaNakamura
Owner

Thank you for the detailed benchmarks. I have one question regarding your comment:

Just popping by to say that maximum performance with DI can only be attained with preparation, so the right way to benchmark would look more like this:

I understand that the preparation step should be excluded when benchmarking DI, but in practice, can the preparation step actually be avoided? If not, it seems like it would have a significant impact on performance.

@gdalle

gdalle commented Oct 17, 2024

The preparation step can be avoided in practice, but in general that will make your code much slower: the "unprepared" version essentially falls back on running preparation and then calling the "prepared" version. In many cases, preparation involves pre-allocating a cache, recording a tape, or making type-unstable choices like picking a batch size, all of which are slow but can be reused whenever you compute the same operator for similar inputs. The mantra here is "prepare once, differentiate many times", which is why it usually doesn't make sense to benchmark preparation itself (a bit like JIT compilation).
Check out this tutorial for more details: https://gdalle.github.io/DifferentiationInterface.jl/DifferentiationInterface/dev/explanation/operators/#Preparation
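As a rough sketch of that workflow (the function, backend choice, and loop below are illustrative, not from this thread): prepare once up front, then reuse the preparation result for every subsequent call with similar inputs.

```julia
# Hedged sketch of "prepare once, differentiate many times".
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
backend = AutoForwardDiff()

# Prepare once for this function / backend / input-shape combination...
prep = prepare_gradient(f, backend, rand(10))

# ...then reuse the preparation across many similar inputs,
# e.g. inside an optimization loop, so only the fast prepared call runs.
for _ in 1:100
    x = rand(10)
    y, g = value_and_gradient(f, prep, backend, x)
end
```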
