more deprecations
CarloLucibello committed Apr 4, 2024
1 parent 8d35864 commit 63b7613
Showing 8 changed files with 59 additions and 172 deletions.
19 changes: 13 additions & 6 deletions docs/src/destructure.md
@@ -49,20 +49,27 @@ julia> Flux.destructure(grad) # acts on non-models, too
(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))
```
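The returned `Restructure` rebuilds the original nested shape from any flat vector of matching length. A minimal sketch (assuming the `model` defined earlier on this page):

```julia
flat, re = Flux.destructure(model)   # flat parameter vector, plus a rebuilder
model2 = re(flat .* 2)               # same structure as `model`, every parameter doubled
```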

!!! compat "Flux ≤ 0.12"
Old versions of Flux had an entirely different implementation of `destructure`, which
had many bugs (and almost no tests). Many comments online still refer to that now-deleted
function, or to memories of it.
In order to collect all parameters of a model into a list instead, you can use the `trainables` function:

```julia
julia> Flux.trainables(model)
4-element Vector{AbstractArray}:
[0.863101 1.2454957]
[0.0]
[1.290355429422727;;]
[0.0]
```
Any mutation of the elements of the resulting list will affect the model's parameters.
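For example, a minimal sketch of such in-place mutation (continuing with the `model` above):

```julia
ws = Flux.trainables(model)
ws[1] .= 0                 # zero the first weight matrix in place
model.layers[1].weight     # the model's own weights now reflect this change
```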

### All Parameters

The function `destructure` now lives in [`Optimisers.jl`](https://github.com/FluxML/Optimisers.jl).
(Be warned this package is unrelated to the `Flux.Optimisers` sub-module! The confusion is temporary.)
The functions `destructure` and `trainables` live in [`Optimisers.jl`](https://github.com/FluxML/Optimisers.jl).


```@docs
Optimisers.destructure
Optimisers.trainable
Optimisers.trainables
Optimisers.isnumeric
```

41 changes: 0 additions & 41 deletions docs/src/models/advanced.md
@@ -80,47 +80,6 @@ Flux.@layer Affine trainable=(W,)

There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling `Functors.@functor Affine (W,)` means that no exploration of the model will ever visit the other fields: They will not be moved to the GPU by [`gpu`](@ref), and their precision will not be changed by `f32`. This requires the `struct` to have a corresponding constructor that accepts only `W` as an argument.
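For instance, a sketch of what that stricter opt-in might look like, assuming the `Affine` struct with fields `W` and `b` defined earlier on this page (the one-argument constructor below is a hypothetical addition, not part of the definition above):

```julia
using Functors

Affine(W) = Affine(W, zeros(size(W, 1)))  # reconstruct from `W` alone, with a fresh bias

Functors.@functor Affine (W,)             # now only `W` is visited by gpu, f32, etc.
```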


## Freezing Layer Parameters

When we do not want to include all of the model's parameters (e.g. for transfer learning), we can simply leave those layers out of our call to `params`.

!!! compat "Flux ≤ 0.14"
The mechanism described here is for Flux's old "implicit" training style.
When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`; a sketch of that style appears at the end of this section.

Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain
this using the slicing features `Chain` provides:

```julia
m = Chain(
Dense(784 => 64, relu),
Dense(64 => 64, relu),
Dense(64 => 10)
);

ps = Flux.params(m[3:end])
```

The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it.

During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that would have its parameters changed.

`Flux.params` also takes multiple inputs, making it easy to collect parameters from heterogeneous models with a single call. For instance, to omit optimising the second `Dense` layer in the previous example:

```julia
Flux.params(m[1], m[3:end])
```

Sometimes finer-grained control is needed. We can freeze a specific parameter of a specific layer that has already entered a `Params` object `ps`,
simply by deleting it from `ps`:

```julia
ps = Flux.params(m)
delete!(ps, m[2].bias)
```
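For reference, a rough sketch of the explicit-style replacement mentioned in the note above, using `Flux.setup` together with `freeze!` and `thaw!` (a sketch only, assuming the same `m` as above):

```julia
opt_state = Flux.setup(Adam(), m)   # the state mirrors the model's structure

Flux.freeze!(opt_state.layers[1])   # exclude the first Dense layer from updates
Flux.freeze!(opt_state.layers[2])   # ... and the second

# Flux.train!(loss, m, data, opt_state) would now update only the last layer.

Flux.thaw!(opt_state)               # re-enable training for all layers
```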

## Custom multiple input or output layer

Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in the machine learning literature is the [inception module](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf).
76 changes: 24 additions & 52 deletions docs/src/models/basics.md
@@ -74,68 +74,40 @@ julia> Flux.withgradient(g, nt)
(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))
```

!!! note "Implicit gradients"
Flux used to handle many parameters in a different way, using the [`params`](@ref Flux.params) function.
This uses a method of `gradient` which takes a zero-argument function, and returns a dictionary
through which the resulting gradients can be looked up:

```jldoctest basics
julia> x = [2, 1];

julia> y = [2, 0];

julia> gs = gradient(Flux.params(x, y)) do
f(x, y)
end
Grads(...)

julia> gs[x]
2-element Vector{Float64}:
0.0
2.0

julia> gs[y]
2-element Vector{Float64}:
-0.0
-2.0
```


## Building Simple Models

Consider a simple linear regression, which tries to predict an output array `y` from an input `x`.

```julia
W = rand(2, 5)
b = rand(2)

predict(x) = W*x .+ b
predict(W, b, x) = W*x .+ b

function loss(x, y)
ŷ = predict(x)
function loss(W, b, x, y)
ŷ = predict(W, b, x)
sum((y .- ŷ).^2)
end

x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3
W = rand(2, 5)
b = rand(2)

loss(W, b, x, y) # ~ 3
```

To improve the prediction we can take the gradients of the loss with respect to `W` and `b` and perform gradient descent.

```julia
using Flux

gs = gradient(() -> loss(x, y), Flux.params(W, b))
dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
```

Now that we have gradients, we can pull them out and update `W` to train the model.

```julia
W̄ = gs[W]
W .-= 0.1 .* dW

W .-= 0.1 .* W̄

loss(x, y) # ~ 2.5
loss(W, b, x, y) # ~ 2.5
```

The loss has decreased a little, meaning that our prediction `ŷ` is closer to the target `y`. If we have some data we can already try [training the model](../training/training.md).
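To keep improving the prediction, one could simply repeat this update step in a loop; a minimal sketch continuing from the code above:

```julia
for step in 1:100
    dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
    W .-= 0.1 .* dW
    b .-= 0.1 .* db
end

loss(W, b, x, y)  # should now be much smaller
```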
@@ -144,7 +116,7 @@ All deep learning in Flux, however complex, is a simple generalisation of this e

## Building Layers

It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) (`σ`) in between them. In the above style we could write this as:
It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) in between them. We could write this as:

```julia
using Flux
@@ -157,7 +129,7 @@ W2 = rand(2, 3)
b2 = rand(2)
layer2(x) = W2 * x .+ b2

model(x) = layer2(σ.(layer1(x)))
model(x) = layer2(sigmoid.(layer1(x)))

model(rand(5)) # => 2-element vector
```
@@ -174,7 +146,7 @@ end
linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)

model(x) = linear2(σ.(linear1(x)))
model(x) = linear2(sigmoid.(linear1(x)))

model(rand(5)) # => 2-element vector
```
@@ -188,7 +160,7 @@ struct Affine
end

Affine(in::Integer, out::Integer) =
Affine(randn(out, in), randn(out))
Affine(randn(out, in), zeros(out))

# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b
@@ -198,16 +170,16 @@ a = Affine(10, 5)
a(rand(10)) # => 5-element vector
```

Congratulations! You just built the `Dense` layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.
Congratulations! You just built the [`Dense`](@ref) layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.

(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, σ)`.)
(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, sigmoid)`.)
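To make the comparison concrete, a small sketch (weights are random, so the two outputs will differ):

```julia
a = Affine(10, 5)
y1 = sigmoid.(a(rand(10)))     # activation applied by hand

d = Dense(10 => 5, sigmoid)
y2 = d(rand(10))               # activation applied inside the layer
```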

## Stacking It Up

It's pretty common to write models that look something like:

```julia
layer1 = Dense(10 => 5, σ)
layer1 = Dense(10 => 5, relu)
# ...
model(x) = layer3(layer2(layer1(x)))
```
@@ -217,7 +189,7 @@ For long chains, it might be a bit more intuitive to have a list of layers, like
```julia
using Flux

layers = [Dense(10 => 5, σ), Dense(5 => 2), softmax]
layers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]

model(x) = foldl((x, m) -> m(x), layers, init = x)

@@ -228,7 +200,7 @@ Handily, this is also provided for in Flux:

```julia
model2 = Chain(
Dense(10 => 5, σ),
Dense(10 => 5, relu),
Dense(5 => 2),
softmax)

@@ -255,22 +227,22 @@ m(5) # => 26

## Layer Helpers

There is still one problem with this `Affine` layer, that Flux does not know to look inside it. This means that [`Flux.train!`](@ref) won't see its parameters, nor will [`gpu`](@ref) be able to move them to your GPU. These features are enabled by the [`@layer`](@ref Flux.@layer) macro:
There is still one problem with this `Affine` layer, that Flux does not know to look inside it. This means that [`Flux.train!`](@ref Flux.train!) won't see its parameters, nor will [`gpu`](@ref) be able to move them to your GPU. These features are enabled by the [`@layer`](@ref Flux.@layer) macro:

```julia
Flux.@layer Affine
```
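A quick sketch of what this enables, continuing with the `Affine` defined above (`trainables` is documented on the destructure page):

```julia
a = Affine(3, 2)

Flux.trainables(a)   # 2-element list: the weight matrix and the bias vector
a = gpu(a)           # both arrays move to the GPU, if one is available
```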

Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. We can easily add these refinements to the `Affine` layer as follows, using the helper function [`create_bias`](@ref Flux.create_bias):

```
function Affine((in, out)::Pair; bias=true, init=Flux.randn32)
```julia
function Affine((in, out)::Pair; bias=true, init=glorot_uniform)
W = init(out, in)
b = Flux.create_bias(W, bias, out)
Affine(W, b)
return Affine(W, b)
end

Affine(3 => 1, bias=false, init=ones) |> gpu
Affine(3 => 1, bias=false) |> gpu
```

```@docs
18 changes: 9 additions & 9 deletions docs/src/models/quickstart.md
@@ -16,11 +16,11 @@ truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element
model = Chain(
Dense(2 => 3, tanh), # activation function inside layer
BatchNorm(3),
Dense(3 => 2),
softmax) |> gpu # move model to GPU, if available
Dense(3 => 2)) |> gpu # move model to GPU, if available

# The model encapsulates parameters, randomly initialised. Its initial output is:
out1 = model(noisy |> gpu) |> cpu # 2×1000 Matrix{Float32}
probs1 = softmax(out1) # normalise to get probabilities

# To train the model, we use batches of 64 samples, and one-hot encoding:
target = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix
@@ -36,7 +36,7 @@ losses = []
loss, grads = Flux.withgradient(model) do m
# Evaluate model and loss inside gradient context:
y_hat = m(x)
Flux.crossentropy(y_hat, y)
Flux.logitcrossentropy(y_hat, y)
end
Flux.update!(optim, model, grads[1])
push!(losses, loss) # logging, outside gradient context
@@ -45,8 +45,8 @@ end

optim # parameters, momenta and output have all changed
out2 = model(noisy |> gpu) |> cpu # first row is prob. of true, second row p(false)

mean((out2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
probs2 = softmax(out2) # normalise to get probabilities
mean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
```

![](../assets/quickstart/oneminute.png)
@@ -55,8 +55,8 @@ mean((out2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
using Plots # to draw the above figure

p_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title="True classification", legend=false)
p_raw = scatter(noisy[1,:], noisy[2,:], zcolor=out1[1,:], title="Untrained network", label="", clims=(0,1))
p_done = scatter(noisy[1,:], noisy[2,:], zcolor=out2[1,:], title="Trained network", legend=false)
p_raw = scatter(noisy[1,:], noisy[2,:], zcolor=probs1[1,:], title="Untrained network", label="", clims=(0,1))
p_done = scatter(noisy[1,:], noisy[2,:], zcolor=probs2[1,:], title="Trained network", legend=false)

plot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))
```
Expand Down Expand Up @@ -87,7 +87,7 @@ Some things to notice in this example are:

* The `model` can be called like a function, `y = model(x)`. Each layer like [`Dense`](@ref Flux.Dense) is an ordinary `struct`, which encapsulates some arrays of parameters (and possibly other state, as for [`BatchNorm`](@ref Flux.BatchNorm)).

* But the model does not contain the loss function, nor the optimisation rule. The momenta needed by [`Adam`](@ref Flux.Adam) are stored in the object returned by [setup](@ref Flux.Train.setup). And [`Flux.crossentropy`](@ref Flux.Losses.crossentropy) is an ordinary function.
* But the model does not contain the loss function, nor the optimisation rule. The momenta needed by [`Adam`](@ref Flux.Adam) are stored in the object returned by [setup](@ref Flux.Train.setup). And [`Flux.logitcrossentropy`](@ref Flux.Losses.logitcrossentropy) is an ordinary function that combines the [`softmax`](@ref Flux.softmax) and [`crossentropy`](@ref Flux.crossentropy) functions.

* The `do` block creates an anonymous function, as the first argument of `gradient`. Anything executed within this is differentiated.

@@ -97,7 +97,7 @@ Instead of calling [`gradient`](@ref Zygote.gradient) and [`update!`](@ref Flux.
for epoch in 1:1_000
Flux.train!(model, loader, optim) do m, x, y
y_hat = m(x)
Flux.crossentropy(y_hat, y)
Flux.logitcrossentropy(y_hat, y)
end
end
```
2 changes: 1 addition & 1 deletion docs/src/models/recurrence.md
@@ -173,7 +173,7 @@ Flux.reset!(m)
[m(x) for x in seq_init]

ps = Flux.params(m)
opt= Adam(1e-3)
opt = Adam(1e-3)
Flux.train!(loss, ps, data, opt)
```
