A combined CostGrad structure #139
Comments
There is already a Julia package that has combined function and gradient calculation figured out: https://github.com/JuliaNLSolvers/NLSolversBase.jl . |
Thanks for the link, maybe it can then also be done more easily (with their last |
The interesting part that I have not yet written above, and that they cover, is to only evaluate one even when having |
When you calculate the gradient using AD, you get the cost "for free" as part of the AD output – so it's nice to be able to use that. |
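A minimal sketch of that point, assuming ForwardDiff.jl together with DiffResults.jl as the AD backend (the cost function here is purely illustrative):

```julia
using ForwardDiff, DiffResults

f(x) = sum(abs2, x)            # illustrative cost f(x) = ||x||²

x = [1.0, 2.0, 3.0]
res = DiffResults.GradientResult(x)
ForwardDiff.gradient!(res, f, x)    # a single AD pass fills value and gradient
c = DiffResults.value(res)          # cost "for free" alongside the gradient: 14.0
X = DiffResults.gradient(res)       # [2.0, 4.0, 6.0]
```

One forward-mode pass yields both quantities, which is exactly the situation where a combined cost/gradient interface pays off.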
sure, I also know a few cases where I can use that. I am planning to do this issue at some point, I just do not see the point of using |
I think I might have an idea based on a discussion I had with a student today, where we actually needed something like this. We introduce a struct

```julia
struct ProblemCache{P,T,F} <: AbstractCache
    p::P        # current point
    X::T        # current gradient (at p)
    value::F    # current cost (at p)
    # maybe further fields
end
```

that might have more fields depending on what information one wants to share. Now the cost and the gradient are functors (see #172, since they are also indicators whether they are in-place or allocating), so for example

```julia
struct MyGradient{PC <: ProblemCache} <: AbstractGradient
    pc::PC
end
```

and

```julia
struct MyCost{PC <: ProblemCache} <: AbstractCost
    pc::PC
end
```

as a functor for the cost. The overhead would be that one has to check in a function

```julia
function update_cache!(::Union{<:MyCost, <:MyGradient}, p)
    ...
end
```

so that common operations / updates would even only be implemented once. On the other hand, this approach still keeps two separate functions for grad and cost and enables the caching we talked about. Sure, one can also do simple helpers like a

```julia
struct CostCache{P,C,F} <: AbstractCache
    p::P
    cost::C
    f::F
end
```

which could wrap any function |
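To illustrate how the two functors could share one cache, here is a self-contained sketch of this idea; all names and the concrete cost are hypothetical placeholders, not actual Manopt types:

```julia
# hypothetical sketch of the shared-cache idea (not actual Manopt types)
abstract type AbstractCache end

mutable struct ProblemCache{P,T,F} <: AbstractCache
    p::P        # point the cache is valid at
    X::T        # gradient at p
    value::F    # cost at p
end

struct MyCost{PC<:ProblemCache}
    pc::PC
end
struct MyGradient{PC<:ProblemCache}
    pc::PC
end

# the common update is implemented only once; here with f(p) = ||p||² as example
function update_cache!(c::Union{MyCost,MyGradient}, p)
    pc = c.pc
    if pc.p != p
        pc.value = sum(abs2, p)
        pc.X = 2 .* p
        pc.p = copy(p)
    end
    return pc
end

# both functors read from the same cache
(f::MyCost)(p) = update_cache!(f, p).value
(g::MyGradient)(p) = update_cache!(g, p).X

pc = ProblemCache([0.0], [0.0], 0.0)
f, g = MyCost(pc), MyGradient(pc)
f([1.0, 2.0])   # computes cost (5.0) and gradient together
g([1.0, 2.0])   # served from the cache, no recomputation
```

Calling cost and gradient at the same point triggers only one update, which is the caching behaviour described above.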
I think combined cost and gradient calculation and caching are two separate issues? By that I mean that there should be a way to have a combined calculation without caching. Probably the simplest interface for that would be something like

```julia
abstract type AbstractObjective end

struct SeparateGradCostObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end

struct CombinedGradCostObjective{TFG} <: AbstractObjective
    fg::TFG
end

get_cost(::AbstractObjective, p) = ...
get_gradient(::AbstractObjective, p) = ...
# either calls `fg` for `CombinedGradCostObjective`
# or `f` and `g` for `SeparateGradCostObjective`
get_cost_and_gradient(::AbstractObjective, p) = ...

# and so on with mutating variants, Hessians,
# multivalued objectives for nonlinear least squares, Jacobians...
```
|
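To make the proposed dispatch concrete, here is a runnable sketch of the two objective types with the three accessors filled in (simplified and outside of Manopt; the example cost and gradient are placeholders):

```julia
abstract type AbstractObjective end

struct SeparateGradCostObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end
struct CombinedGradCostObjective{TFG} <: AbstractObjective
    fg::TFG
end

get_cost(o::SeparateGradCostObjective, p) = o.f(p)
get_gradient(o::SeparateGradCostObjective, p) = o.g(p)
get_cost_and_gradient(o::SeparateGradCostObjective, p) = (o.f(p), o.g(p))

# the combined variant evaluates fg once and picks out what is needed
get_cost(o::CombinedGradCostObjective, p) = first(o.fg(p))
get_gradient(o::CombinedGradCostObjective, p) = last(o.fg(p))
get_cost_and_gradient(o::CombinedGradCostObjective, p) = o.fg(p)

sep = SeparateGradCostObjective(p -> sum(abs2, p), p -> 2 .* p)
com = CombinedGradCostObjective(p -> (sum(abs2, p), 2 .* p))
rsep = get_cost_and_gradient(sep, [1.0, 2.0])  # two calls: (5.0, [2.0, 4.0])
rcom = get_cost_and_gradient(com, [1.0, 2.0])  # one call:  (5.0, [2.0, 4.0])
```

Both objectives answer the same three queries; only the combined one can share work between cost and gradient, and note that no caching is involved here at all.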
Well, I would say to some extend those have similar components. This would have the advantage that the two fields edit: or to phrase differently – if the function I think this might even be a little more flexible than your approach, but sure if storing a gradient is a problem, then one would need a |
The current two fields seem to be incompatible with cache-less combined cost-gradient computation, which I'd like to have available, so some rework would have to happen there. Either a combined thing or a third field for the combined computation.
In my approach there could also be something like

```julia
struct CombinedAndSeparateGradCostObjective{TF,TG,TFG} <: AbstractObjective
    f::TF
    g::TG
    fg::TFG
end

get_cost(obj::CombinedAndSeparateGradCostObjective, p) = obj.f(p)
get_gradient(obj::CombinedAndSeparateGradCostObjective, p) = obj.g(p)
get_cost_and_gradient(obj::CombinedAndSeparateGradCostObjective, p) = obj.fg(p)
```

and then we could have a

```julia
mutable struct CachedObjective{TObj<:AbstractObjective,TP} <: AbstractObjective
    obj::TObj
    cached_at::TP
    cached_values::Dict{Symbol,Any}
end

# check if p is obj.cached_at and either return the value from obj.cached_values
# or compute a new one, deleting the contents of cached_values
get_cost(obj::CachedObjective, p)
```

which would wrap an arbitrary objective and cache values exactly as you wrote. The advantage is that we could then use combined evaluation without caching. Good caching is not easy. For example, you need an efficient cache invalidation mechanism. When you have stored objective and gradient, but you need to compute just the objective at a new point, there may be no need to compute the gradient, but the old one needs to be marked as invalid (if we have a cache for a single point). Or we could have semi-separate caches for cost and gradient computation. I'm not sure if a combined objective would be better or worse than three separate functions; my main point is that caching should be optional. |
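A sketch of what such a wrapper could look like, including the single-point invalidation just discussed (the `SimpleObjective` type and the call counter are only for illustration):

```julia
abstract type AbstractObjective end

struct SimpleObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end

mutable struct CachedObjective{TObj<:AbstractObjective,TP} <: AbstractObjective
    obj::TObj
    cached_at::TP
    cached_values::Dict{Symbol,Any}
end

function get_cost(co::CachedObjective, p)
    if co.cached_at != p           # moving to a new point invalidates everything
        empty!(co.cached_values)
        co.cached_at = copy(p)
    end
    get!(co.cached_values, :cost) do   # compute only when not already cached
        co.obj.f(p)
    end
end

calls = Ref(0)  # counts actual cost evaluations
obj = SimpleObjective(p -> (calls[] += 1; sum(abs2, p)), p -> 2 .* p)
co = CachedObjective(obj, [NaN], Dict{Symbol,Any}())
get_cost(co, [1.0, 2.0])   # computes: 5.0
get_cost(co, [1.0, 2.0])   # cached: 5.0, calls[] is still 1
```

The `Dict{Symbol,Any}` makes it easy to cache cost and gradient semi-separately: a stale gradient is simply absent from the dictionary after `empty!`, so nothing invalid can ever be returned.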
Hm, that sounds valid, but it makes the approach I had in mind not doable. The other disadvantage of your approach is that it would require a lot – really a lot – of rework, since the
with all their accessor functions would then need to be rewritten. I am not saying we should not do that, but I will maybe remove the “good first issue” label, because this large change would more likely put it on the other side of the spectrum of issues. And yes, sure, partially invalidated data was not yet on my mind, but that should be doable with the dictionary you proposed (good idea!) if one maybe always stores tuples of valid flags |
Yes, right, a unified cost/grad/etc. field would be a significant rework. So maybe a better (or at least easier-to-implement) idea would be to introduce additional fields in specific problems for combined computations? It would at least be non-breaking, so it could be introduced gradually. |
Hm, if that is an urgent issue then sure, we could do that; otherwise that might be my new year's project maybe – then, well, eh, my last year's new-year's project took until April, but we will see ;) |
No, it's not urgent, actually the opposite. I doubt it will ever make anything more than 2x faster, which is neat but rarely makes a difference 😉. I think there are more important things that can be improved in Manopt. |
Oh, then feel free to open more issues – there are several parts that I wrote to learn Julia around 2015/2016, so sure, that might not all be optimal performance-wise, but I hope I at least gave most interfaces some thought. For the gf/cache (if so, this would model both), sure, that is not urgent, but with our discussion I think I have a good idea of how one can do this. |
Ah, and I think there might be complicated costs/grads where at least caching might be useful, but again, since that would be some rework, one might approach the combination of functions in the same PR. |
Actually, I think performance is mostly fine now, at least in the few solvers that I've used so far and for cost functions that are not too cheap to evaluate. A benchmark that could attract more users by showing advantages over other libraries would be great. I think we could just take any Manopt.jl example, rewrite it in Optim.jl for example as a constrained optimization problem, and show how much faster Manopt.jl converges. |
Ah, that might be a good thing as a tutorial / comparison, but if so, maybe also as an honest one:
Sounds like a good plan for a tutorial. |
Depending on some feedback from the talk yesterday:

It would be nice to have a function that can reuse parts from the computation of the cost for the gradient and vice versa.
In manopt there is such a method available ( https://www.manopt.org/tutorial.html#costdescription ), in pymanopt there is an issue open ( pymanopt/pymanopt#65 ).

A first idea would be to introduce a struct
and a combined function
which would compute `c` and the `grad` if `p` is different from the `gc.p`, and store `p, grad, c` in the struct. The actual cost and gradient then depend on an initialised `gc`
as
and one could write a constructor for all abstract gradient problems (or all that have a gradient) such that, if they are called with a `CombinedCostAndGrad` structure, these two functions would be generated automatically within the generated problem. Then the user really only has to implement the above new struct type thing as a combined function.
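One way this idea could look as a sketch; the field names follow the description above, while `evaluate!`, the `make_*` helpers, and the concrete cost are hypothetical illustrations:

```julia
mutable struct CombinedCostAndGrad{TP,TC,TG,TFG}
    p::TP      # last point evaluated
    c::TC      # cost at p
    grad::TG   # gradient at p
    fg::TFG    # user-provided combined function p -> (cost, gradient)
end

# recompute c and grad only if p differs from gc.p, then store p, grad, c
function evaluate!(gc::CombinedCostAndGrad, p)
    if gc.p != p
        gc.c, gc.grad = gc.fg(p)
        gc.p = copy(p)
    end
    return gc
end

# the actual cost and gradient generated from an initialised gc
make_cost(gc::CombinedCostAndGrad) = p -> evaluate!(gc, p).c
make_gradient(gc::CombinedCostAndGrad) = p -> evaluate!(gc, p).grad

gc = CombinedCostAndGrad([NaN], NaN, [NaN], p -> (sum(abs2, p), 2 .* p))
F, gradF = make_cost(gc), make_gradient(gc)
F([1.0, 2.0])      # computes both, returns the cost 5.0
gradF([1.0, 2.0])  # same point: gradient [2.0, 4.0] from the stored values
```

A problem constructor could then generate `F` and `gradF` automatically from a user-supplied `fg`, which is the "only implement the combined function" workflow described above.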