
A combined CostGrad structure #139

Closed
kellertuer opened this issue Jul 27, 2022 · 17 comments

@kellertuer
Member

Depending on some feedback from the talk yesterday:

It would be nice to have a function that can reuse parts from the computation of the cost for the gradient and vice versa.

In Manopt (Matlab) such a method is available ( https://www.manopt.org/tutorial.html#costdescription ); in pymanopt there is an open issue ( pymanopt/pymanopt#65 ).

A first idea would be to introduce a struct

abstract type AbstractCombinedCostAndGradient{P,T} end

mutable struct CombinedCostAndGradient{P,T,F} <: AbstractCombinedCostAndGradient{P,T}
    p::P
    grad::T
    cost::F
end

and a combined function

function (gc::CombinedCostAndGradient)(M, p)
...
end

which would compute the cost and the gradient if p differs from gc.p, and store p, grad, and cost in the struct. The actual cost and gradient then depend on an initialised gc as

function cost(M, p)
    gc(M, p)
    return gc.cost
end
function grad(M, p)
    gc(M, p)
    return gc.grad
end

and one could write a constructor for all abstract gradient problems (or all that have a gradient) such that, if they are called with a CombinedCostAndGradient structure, these two functions would be generated automatically within the generated problem.

Then the user really only has to implement the combined function and wrap it in the struct above.
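A minimal runnable sketch of this idea, with hypothetical names and a plain Euclidean cost f(p) = ‖p‖² standing in for a real problem (no manifold argument, `isapprox` as the point comparison):

```julia
# A minimal runnable sketch of the combined struct above (hypothetical names);
# a plain vector stands in for a manifold point.
mutable struct CombinedCostAndGradient{P,T,F}
    p::P       # last evaluation point
    grad::T    # gradient at p
    cost::F    # cost at p
end

# The combined evaluation: recompute (sharing work) only when p changed.
function (gc::CombinedCostAndGradient)(p)
    if !isapprox(gc.p, p)
        gc.cost = sum(abs2, p)  # example cost f(p) = ‖p‖²
        gc.grad = 2 .* p        # its gradient
        gc.p = copy(p)
    end
    return gc
end

cost(gc::CombinedCostAndGradient, p) = gc(p).cost
grad(gc::CombinedCostAndGradient, p) = gc(p).grad

gc = CombinedCostAndGradient(fill(NaN, 2), zeros(2), 0.0)
c = cost(gc, [1.0, 2.0])  # computes cost and gradient once
X = grad(gc, [1.0, 2.0])  # cache hit: no recomputation
```

Here the second call is a cache hit, so the shared computation runs only once per point.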

@mateuszbaran
Member

There is already a Julia package that has combined function and gradient calculation figured out: https://github.com/JuliaNLSolvers/NLSolversBase.jl .

@kellertuer
Member Author

Thanks for the link; maybe it can also be done more easily then (with their last fg! idea), though I would also want a non-mutating fg there, I think – and I want it to be easily compatible with the current state we have here (so that, with just a little preprocessing, the existing gradient problems can be reused).

@kellertuer
Member Author

The interesting part that they cover, which I have not yet written about above, is to evaluate only one of the two even when having a combined costgrad (or their mutating costgrad!); for the non-mutating one I might need an extra argument and not just a Nothing like them.

@mateuszbaran
Member

When you calculate gradient using AD then you get cost "for free" as a part of the output of AD -- so it's nice to be able to use that.
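As a toy illustration of that point, a tiny forward-mode AD sketch (a hypothetical `Dual` type, not any real AD package) where a single evaluation yields value and derivative together:

```julia
# Minimal forward-mode dual numbers: evaluating the derivative
# produces the function value as a by-product, "for free".
struct Dual
    val::Float64
    der::Float64
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)

f(x) = x * x + x  # generic: works for Float64 and Dual alike

# One evaluation with a seeded Dual yields value and derivative together.
function value_and_derivative(f, x::Float64)
    y = f(Dual(x, 1.0))
    return y.val, y.der
end

value_and_derivative(f, 3.0)  # (12.0, 7.0)
```

A combined cost/grad interface lets a solver exploit exactly this: the value comes out of the same pass as the derivative.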

@kellertuer
Member Author

Sure, I also know a few cases where I can use that. I am planning to do this issue at some point; I just do not see the point of using Nothing to deactivate that.

@kellertuer
Member Author

kellertuer commented Nov 29, 2022

I think I might have an idea based on a discussion I had with a student today, where we actually needed something like this.
The idea is as follows; it would also be an approach to solve #171, since it also covers the cache case.

We introduce a

mutable struct ProblemCache{P,T,F} <: AbstractCache
    p::P      # current point
    X::T      # current gradient (at p)
    value::F  # current cost (at p)
    # maybe further fields
end

that might have more fields depending on what information one wants to share. Now the cost and the gradient are functors (see #172, since they also indicate whether they are in-place or allocating), so for example

struct MyGradient{PC <: ProblemCache} <: AbstractGradient
    pc::PC
end 

(AbstractGradient would assume the functor has both an allocating and a mutating variant implemented, AbstractMutatingGradient just the mutating one, and so on)

and

struct MyCost{PC <: ProblemCache} <: AbstractCost
    pc::PC
end

as a functor for the cost.

The overhead would be that, for a gradient g = MyGradient(pc) called as g(M, p), one has to check whether isapprox(M, p, g.pc.p), that is, whether the cache is up to date or needs to be updated; but common operations could be implemented in a

function update_cache!(::Union{<:MyCost, <:MyGradient}, p)
...
end

so that common operations / updates would only need to be implemented once.

On the other hand, this approach still keeps cost and gradient as two separate functions and enables the caching we talked about.

Sure one can also do simple helpers like a

struct CostCache{P,C,F} <: AbstractCache
    p::P
    cost::C
    f::F
end

which could wrap any function f and provide a cache for one function evaluation (or, if one wants to be fancy, for a certain number of the last n function evaluations). This of course would directly be a “Cost functor”.
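Put together, the shared-cache functor idea could look like this runnable sketch (hypothetical names, the AbstractCache/AbstractCost supertypes dropped so it runs standalone, again with f(p) = ‖p‖² as example cost):

```julia
mutable struct ProblemCache{P,T,F}
    p::P      # current point
    X::T      # current gradient (at p)
    value::F  # current cost (at p)
end

# Shared (pre-)computation, done at most once per point.
function update_cache!(pc::ProblemCache, p)
    if !isapprox(pc.p, p)
        pc.value = sum(abs2, p)  # example cost f(p) = ‖p‖²
        pc.X = 2 .* p            # its gradient
        pc.p = copy(p)
    end
    return pc
end

# Cost and gradient are two separate functors sharing one cache.
struct MyCost{PC<:ProblemCache}
    pc::PC
end
(c::MyCost)(p) = update_cache!(c.pc, p).value

struct MyGradient{PC<:ProblemCache}
    pc::PC
end
(g::MyGradient)(p) = update_cache!(g.pc, p).X

pc = ProblemCache(fill(NaN, 2), zeros(2), 0.0)
fc, gr = MyCost(pc), MyGradient(pc)
fc([1.0, 2.0])  # computes and fills the cache
gr([1.0, 2.0])  # cache hit, returns the stored gradient
```

This keeps the two-function interface a solver expects while the shared work happens only once per point.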

@mateuszbaran
Member

I think combined cost and gradient calculation and caching are two separate issues? By that I mean that there should be a way to have a combined calculation without caching. Probably the simplest interface for that would be something like

abstract type AbstractObjective end

struct SeparateGradCostObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end

struct CombinedGradCostObjective{TFG} <: AbstractObjective
    fg::TFG
end

get_cost(::AbstractObjective, p) = ...
get_gradient(::AbstractObjective, p) = ...
get_cost_and_gradient(::AbstractObjective, p) = ... # (either calls `fg` for `CombinedGradCostObjective` or `f` and `g` for `SeparateGradCostObjective`)

# and so on with mutating variants, Hessians, multivalued objectives for nonlinear least squares, Jacobians...
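A runnable version of this sketch, filling in the bodies under the assumption that fg returns a (cost, gradient) tuple:

```julia
abstract type AbstractObjective end

struct SeparateGradCostObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end

struct CombinedGradCostObjective{TFG} <: AbstractObjective
    fg::TFG  # assumed to return a (cost, gradient) tuple
end

get_cost(obj::SeparateGradCostObjective, p) = obj.f(p)
get_gradient(obj::SeparateGradCostObjective, p) = obj.g(p)
get_cost_and_gradient(obj::SeparateGradCostObjective, p) = (obj.f(p), obj.g(p))

get_cost(obj::CombinedGradCostObjective, p) = obj.fg(p)[1]
get_gradient(obj::CombinedGradCostObjective, p) = obj.fg(p)[2]
get_cost_and_gradient(obj::CombinedGradCostObjective, p) = obj.fg(p)

o1 = SeparateGradCostObjective(p -> sum(abs2, p), p -> 2 .* p)
o2 = CombinedGradCostObjective(p -> (sum(abs2, p), 2 .* p))
get_cost_and_gradient(o1, [1.0, 2.0])  # two calls under the hood
get_cost_and_gradient(o2, [1.0, 2.0])  # one combined call
```

Dispatch on the objective type decides whether the combined call is one evaluation or two, with no caching involved.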

@kellertuer
Member Author

kellertuer commented Nov 29, 2022

Well, I would say to some extent those have similar components.
It might be that both the cost and the gradient are based on a common (pre-)computation of some h(x), so if at p the cost comes first and computes h(p), then the gradient can reuse that.

This would have the advantage that the two fields cost and grad that are currently within the problem would stay,
which is not possible in your sketch, since that would require a common Objective structure?

edit: or to phrase it differently – if the function fg were the total common function h above that always “returns both”, then the gradient and cost would basically be accessors to the cache, which would update the cache (so all of p, X, and value) if we are at a new point.

I think this might even be a little more flexible than your approach. But sure, if storing a gradient is a problem, then one would need a get_cost that computes both grad and value but throws the grad away instead of storing it (that is how I understand your get_cost in the combined variant). So what I miss in your sketch is that, if I call get_cost and after that get_gradient on your last struct, it would compute cost & grad twice (the first run throws away the gradient, the second run throws away the cost).

@mateuszbaran
Member

This would have the advantage that the two fields cost and grad that are currently within the problem would stay,
which is not possible in your sketch, since that would require a common Objective structure?

The current two fields seem to be incompatible with cache-less combined cost-gradient computation, which I'd like to have available, so some rework would have to happen there. Either a combined thing or a third field for the combined computation.

I think this might even be a little more flexible than your approach, but sure if storing a gradient is a problem, then one would need a get_cost that computes both grad and value but throws grad away instead of storing that (that is how I understand your get_cost for the case of the combined variant). So what I miss in your sketch is, that if I call get_cost and after that get_gradient for your last struct, it would compute cost & grad twice (first throw away the gradient, second run throw away the cost).

In my approach there could also be something like

struct CombinedAndSeparateGradCostObjective{TF,TG,TFG} <: AbstractObjective
    f::TF
    g::TG
    fg::TFG
end

get_cost(obj::CombinedAndSeparateGradCostObjective, p) = obj.f(p)
get_gradient(obj::CombinedAndSeparateGradCostObjective, p) = obj.g(p)
get_cost_and_gradient(obj::CombinedAndSeparateGradCostObjective, p) = obj.fg(p)

and then we could have a

mutable struct CachedObjective{TObj<:AbstractObjective,TP} <: AbstractObjective
    obj::TObj
    cached_at::TP
    cached_values::Dict{Symbol,Any}
end

get_cost(obj::CachedObjective, p) # check if p is obj.cached_at and either returns value from obj.cached_values or computes a new one, deleting contents of cached_values.

which would wrap an arbitrary objective and cache values exactly as you wrote. The advantage is that we could then use combined evaluation without caching. Good caching is not easy. For example, you need an efficient cache invalidation mechanism. When you have stored objective and gradient, but you need to compute just the objective at a new point, there may be no need to compute the gradient but the old one needs to be marked as invalid (if we have a cache for a single point). Or we could have semi-separate caches for cost and gradient computation.

I'm not sure if a combined objective would be better or worse than three separate functions, my main point is that caching should be optional.
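A minimal runnable version of that wrapper (the inner objective is mocked as a NamedTuple with an fg field returning a (cost, gradient) tuple, and a call counter shows the cache hit):

```julia
mutable struct CachedObjective{TObj,TP}
    obj::TObj
    cached_at::TP
    cached_values::Dict{Symbol,Any}
end
CachedObjective(obj, p) = CachedObjective(obj, p, Dict{Symbol,Any}())

# Return the cached cost if p matches; otherwise invalidate and recompute.
function get_cost(co::CachedObjective, p)
    if !(isequal(co.cached_at, p) && haskey(co.cached_values, :cost))
        isequal(co.cached_at, p) || empty!(co.cached_values)  # new point: drop stale entries
        co.cached_values[:cost] = co.obj.fg(p)[1]
        co.cached_at = copy(p)
    end
    return co.cached_values[:cost]
end

calls = Ref(0)  # count how often the underlying fg actually runs
inner = (fg = p -> (calls[] += 1; (sum(abs2, p), 2 .* p)),)
co = CachedObjective(inner, fill(NaN, 2))
get_cost(co, [1.0, 2.0])  # computes: calls[] becomes 1
get_cost(co, [1.0, 2.0])  # cache hit: calls[] stays 1
```

The wrapper is agnostic to what it wraps, so combined evaluation and caching stay composable and optional.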

@kellertuer
Member Author

Hm, that sounds valid, but makes the approach I had in mind not doable.

The other disadvantage of your approach is that it would require a lot – really a lot – of rework, since the problems currently always have different fields for cost/grad/... and those would be joined into one, which would also change all accessors.
This is a lot of work, because we do not just have cost/grad but also

  • prox
  • a set of proxes
  • a primal prox
  • a dual prox
  • a prox and a reflection
  • a Hessian
  • an approximate Hessian

with all their accessor functions, which would then need to be rewritten. I am not saying we should not do that, but I will maybe remove the “good first issue” label, because this large change would put it more on the other side of the spectrum of issues.

And yes, sure, partially invalidated data was not yet on my mind, but it should be doable with the dictionary you proposed (good idea!) if one stores tuples (p, grad_p) with validity flags instead of just one cached_at value.
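One way to realise that per-entry validity, sketched with hypothetical names: each cache entry stores the point it was computed at, so cost and gradient can independently be valid at different points:

```julia
# Hypothetical per-entry cache: each entry stores (point, value), so the cost
# and the gradient can each be valid at a different point.
struct EntryCache{TP}
    entries::Dict{Symbol,Tuple{TP,Any}}  # :cost / :gradient => (point, value)
end
EntryCache{TP}() where {TP} = EntryCache{TP}(Dict{Symbol,Tuple{TP,Any}}())

# Recompute `key` only if it is missing or cached at a different point.
function cached!(compute, c::EntryCache, key::Symbol, p)
    entry = get(c.entries, key, nothing)
    if entry === nothing || !isequal(entry[1], p)
        c.entries[key] = (copy(p), compute(p))
    end
    return c.entries[key][2]
end

c = EntryCache{Vector{Float64}}()
cached!(p -> sum(abs2, p), c, :cost, [1.0, 2.0])  # cost cached at [1, 2]
cached!(p -> 2 .* p, c, :gradient, [3.0, 0.0])    # gradient valid at another point
```

This avoids eagerly invalidating the whole cache when only one quantity is requested at a new point.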

@mateuszbaran
Member

Yes, right, a unified cost/grad/etc. field would be a significant rework. So maybe a better (or at least easier to implement) idea would be to introduce additional fields in specific problems for combined computations? It would be non-breaking at least so it could be introduced gradually.

@kellertuer
Member Author

Hm, if that is an urgent issue then sure, we could do that; otherwise it might be my new year's project – though, well, my last year's new year's project took until April, but we will see ;)

@mateuszbaran
Member

No, it's not urgent, actually the opposite. I doubt it will ever make anything more than 2x faster which is neat but rarely makes a difference 😉 . I think there are more important things that can be improved in Manopt.

@kellertuer
Member Author

Oh, then feel free to open more issues – there are several parts that I wrote to learn Julia around 2015/2016, so sure, not all of that might be optimal performance-wise, but I hope I at least gave most interfaces some thought.

For the fg/cache (if so, this would model both): sure, that is not urgent, but after our discussion I think I have a good idea of how one can do this.

@kellertuer
Member Author

Ah, and I think there might be complicated costs/grads where at least caching might be useful; but again, since that would require some rework, one might approach the combination of functions in the same PR.

@mateuszbaran
Member

Oh, then feel free to open more issues – there are several parts that I wrote to learn Julia around 2015/2016, so sure that might not all be optimal performance wise, but I hope I at least gave most interfaces some thoughts.

Actually I think performance is mostly fine now, at least in the few solvers that I've used so far and for not too cheap to evaluate cost functions. A benchmark that could attract more users by showing advantages over other libraries would be great. I think we could just take any Manopt.jl example, rewrite it in Optim.jl for example as a constrained optimization problem and show how much faster Manopt.jl converges.

@kellertuer
Member Author

Ah, that might be a good thing as a tutorial / comparison, but if so, maybe also as an honest one:

  1. In the Euclidean case we are similar to / maybe a little slower (hopefully not 😎) than Optim.jl.
  2. Look for a good manifold example that is more efficient than the corresponding constrained optimisation.

Sounds like a good plan for a tutorial.
