
consider DifferentiationInterface.jl #219

Open
ExpandingMan opened this issue Oct 11, 2024 · 7 comments

Comments

@ExpandingMan

Hello! I've been lurking around this package for a while, thinking of maybe using it instead of StaticArrays.jl, or maybe what I want is a little different and I should try to do something similar but distinct, or whatever.

I've been looking at some of the AD stuff in here, and I just thought I should point out the existence of DifferentiationInterface.jl. While ForwardDiff certainly seems particularly relevant for this package, the advantage of using the interface is that it can make the code generic over arbitrary AD back-ends. Another back-end that is potentially quite relevant here is Enzyme. It might also allow you to ditch a lot of the ForwardDiff internals that are currently necessary (see for example `pushforward`, a slightly lower-level function provided by DI).
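To illustrate what "generic over back-ends" means in practice, here is a minimal hedged sketch (not from this package; `f` and `x` are placeholders): the differentiation call stays identical and only the backend object changes.

```julia
# Minimal sketch: one DI call, swappable back-ends.
using DifferentiationInterface
import ForwardDiff, Enzyme

f(x) = sum(abs2, x)
x = [1.0, 2.0]

# Only the backend argument differs; the surrounding code stays generic.
y1, g1 = value_and_gradient(f, AutoForwardDiff(), x)
y2, g2 = value_and_gradient(f, AutoEnzyme(), x)
```

Both calls should return the same value and gradient; which backend is fastest depends on the problem.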

Anyway, cool package, keep up the good work!

@KeitaNakamura
Owner

Hi, thank you very much for the information. I'll take a look. Do you think using Enzyme could improve performance compared to ForwardDiff?

@KeitaNakamura
Owner

I tried the following code, but it seems that ForwardDiff.jl is much faster than Enzyme.jl:

julia> using DifferentiationInterface, StaticArrays, BenchmarkTools

julia> import ForwardDiff, Enzyme

julia> f(x) = sum(abs2, x)
f (generic function with 1 method)

julia> x = SVector(1.0, 2.0)
2-element SVector{2, Float64} with indices SOneTo(2):
 1.0
 2.0

julia> @benchmark value_and_gradient(f, AutoForwardDiff(), $x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max):  1.791 ns … 17.750 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.952 ns ±  0.564 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▇       █        ▂                       ▂ ▁
  ▃▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█ █
  1.79 ns      Histogram: log(frequency) by time     2.08 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value_and_gradient(f, AutoEnzyme(), $x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max):  5.708 ns …  5.727 μs  ┊ GC (min … max):  0.00% … 99.60%
 Time  (median):     6.292 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.022 ns ± 70.177 ns  ┊ GC (mean ± σ):  14.08% ±  1.72%

  ▇█▇█▇▄  ▁   ▄▆▅▅▃▂                                         ▂
  █████████▆▂▆███████▅▅▅▄▆▆▅▄▃▆▇▆▇▇█▇▇▆▆▆▅▆▇▇▇█▇██▆▆▂▅▄▅▅▆▇▆ █
  5.71 ns      Histogram: log(frequency) by time     14.4 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

@ExpandingMan
Author

ExpandingMan commented Oct 15, 2024

Yeah, the front-end for Enzyme has had a lot of issues (especially with StaticArrays). You should find that the low-level autodiff from Enzyme is blazing fast in most cases. @gdalle and I have been working on getting StaticArrays to work efficiently with DI as-is. It would be interesting to see how it works for Tensorial; it might suggest optimizations for DI or the core packages.

Anyway, I think it's safe to say that the flexibility of DI offers a broader advantage than a direct comparison of Enzyme and ForwardDiff alone would suggest. I'll try to remember to post back here once I believe the most egregious Enzyme + StaticArrays cases are solved.

@gdalle

gdalle commented Oct 15, 2024

Just popping by to say that maximum performance with DI can only be attained with preparation, so the right way to benchmark would look more like this:

# assumes f and x are defined as in the earlier benchmark
using DifferentiationInterface, BenchmarkTools
import ForwardDiff, Enzyme

backend1 = AutoForwardDiff();
prep1 = prepare_gradient(f, backend1, x);
@btime value_and_gradient(f, $prep1, $backend1, $x)

backend2 = AutoEnzyme(; mode=Enzyme.Forward);
prep2 = prepare_gradient(f, backend2, x);
@btime value_and_gradient(f, $prep2, $backend2, $x)

backend3 = AutoEnzyme(; mode=Enzyme.Reverse);
prep3 = prepare_gradient(f, backend3, x);
@btime value_and_gradient(f, $prep3, $backend3, $x)

You're right that ForwardDiff is still faster for the gradient but not by much, and ideally Enzyme in reverse mode should be on par with it (but it's hard to optimize).

Related:

@gdalle

gdalle commented Oct 16, 2024

With the latest version of DI (v0.6.14) on Julia 1.10.5, I get benchmarks like these:

For x = @SVector(rand(2)):

julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  2.856 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  4.314 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  4.122 ns (0 allocations: 0 bytes)

For x = @SVector(rand(10)):

julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  16.767 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  33.211 ns (0 allocations: 0 bytes)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  8.411 ns (0 allocations: 0 bytes)

For x = rand(100):

julia> @btime value_and_gradient(f, $prep_forwarddiff, $forwarddiff, $x);
  4.293 μs (19 allocations: 12.05 KiB)

julia> @btime value_and_gradient(f, $prep_enzyme_forward, $enzyme_forward, $x);
  16.703 μs (330 allocations: 13.83 KiB)

julia> @btime value_and_gradient(f, $prep_enzyme_reverse, $enzyme_reverse, $x);
  721.333 ns (2 allocations: 960 bytes)

@KeitaNakamura
Owner

Thank you for the detailed benchmarks. I have one question regarding your comment:

Just popping by to say that maximum performance with DI can only be attained with preparation, so the right way to benchmark would look more like this:

I understand that the preparation step should be excluded when benchmarking DI, but in practice, can the preparation step actually be avoided? If not, it seems like it would have a significant impact on performance.

@gdalle

gdalle commented Oct 17, 2024

The preparation step can be avoided in practice, but in general that will make your code much slower: the "unprepared" version essentially falls back on running preparation and then calling the "prepared" version. In many cases, preparation involves pre-allocating a cache, recording a tape, or making type-unstable choices like picking a batch size, all of which are slow but can be reused whenever you compute the same operator for similar inputs. The mantra here is "prepare once, differentiate many times", which is why it usually doesn't make sense to benchmark preparation itself (a bit like JIT compilation).
Check out this tutorial for more details: https://gdalle.github.io/DifferentiationInterface.jl/DifferentiationInterface/dev/explanation/operators/#Preparation
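As a rough sketch of that workflow (the function, backend choice, and loop below are illustrative, not from this thread): prepare once up front, then reuse the preparation result for every subsequent call with similar inputs.

```julia
# Hedged sketch of "prepare once, differentiate many times".
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
backend = AutoForwardDiff()

# Prepare once for this function / backend / input-shape combination...
prep = prepare_gradient(f, backend, rand(10))

# ...then reuse the preparation across many similar inputs,
# e.g. inside an optimization loop, so only the fast prepared call runs.
for _ in 1:100
    x = rand(10)
    y, g = value_and_gradient(f, prep, backend, x)
end
```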
