more deprecations
CarloLucibello committed Apr 4, 2024
1 parent 8d35864 commit 63b7613
Showing 8 changed files with 59 additions and 172 deletions.
19 changes: 13 additions & 6 deletions docs/src/destructure.md
@@ -49,20 +49,27 @@ julia> Flux.destructure(grad) # acts on non-models, too
(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))
```
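The returned `Restructure` rebuilds the original nested shape from any flat vector of matching length. A minimal sketch (assuming the `model` defined earlier on this page):

```julia
flat, re = Flux.destructure(model)   # flat parameter vector, plus a rebuilder
model2 = re(flat .* 2)               # same structure as `model`, every parameter doubled
```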

!!! compat "Flux ≤ 0.12"
Old versions of Flux had an entirely different implementation of `destructure`, which
had many bugs (and almost no tests). Many comments online still refer to that now-deleted
function, or to memories of it.
In order to collect all parameters of a model into a list instead, you can use the `trainables` function:

```julia
julia> Flux.trainables(model)
4-element Vector{AbstractArray}:
[0.863101 1.2454957]
[0.0]
[1.290355429422727;;]
[0.0]
```
Any mutation of the elements of the resulting list will affect the model's parameters.
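For example, a minimal sketch of such in-place mutation (continuing with the `model` above):

```julia
ws = Flux.trainables(model)
ws[1] .= 0                 # zero the first weight matrix in place
model.layers[1].weight     # the model's own weights now reflect this change
```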

### All Parameters

The function `destructure` now lives in [`Optimisers.jl`](https://github.com/FluxML/Optimisers.jl).
(Be warned this package is unrelated to the `Flux.Optimisers` sub-module! The confusion is temporary.)
The functions `destructure` and `trainables` live in [`Optimisers.jl`](https://github.com/FluxML/Optimisers.jl).


```@docs
Optimisers.destructure
Optimisers.trainable
Optimisers.trainables
Optimisers.isnumeric
```

41 changes: 0 additions & 41 deletions docs/src/models/advanced.md
@@ -80,47 +80,6 @@ Flux.@layer Affine trainable=(W,)

There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling `Functors.@functor Affine (W,)` means that no exploration of the model will ever visit the other fields: They will not be moved to the GPU by [`gpu`](@ref), and their precision will not be changed by `f32`. This requires the `struct` to have a corresponding constructor that accepts only `W` as an argument.
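For instance, a sketch of what that stricter opt-in might look like, assuming the `Affine` struct with fields `W` and `b` defined earlier on this page (the one-argument constructor below is a hypothetical addition, not part of the definition above):

```julia
using Functors

Affine(W) = Affine(W, zeros(size(W, 1)))  # reconstruct from `W` alone, with a fresh bias

Functors.@functor Affine (W,)             # now only `W` is visited by gpu, f32, etc.
```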


## Freezing Layer Parameters

When we do not want to include all of the model's parameters (e.g. for transfer learning), we can simply leave those layers out of our call to `params`.

!!! compat "Flux ≤ 0.14"
The mechanism described here is for Flux's old "implicit" training style.
When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`; a sketch of that style appears at the end of this section.

Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain
this using the slicing features `Chain` provides:

```julia
m = Chain(
Dense(784 => 64, relu),
Dense(64 => 64, relu),
Dense(64 => 10)
);

ps = Flux.params(m[3:end])
```

The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it.

During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that would have its parameters changed.

`Flux.params` also takes multiple inputs, making it easy to collect parameters from heterogeneous models with a single call. For instance, to omit optimising the second `Dense` layer in the previous example:

```julia
Flux.params(m[1], m[3:end])
```

Sometimes finer-grained control is needed. We can freeze a specific parameter of a specific layer that has already entered a `Params` object `ps`,
simply by deleting it from `ps`:

```julia
ps = Flux.params(m)
delete!(ps, m[2].bias)
```
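For reference, a rough sketch of the explicit-style replacement mentioned in the note above, using `Flux.setup` together with `freeze!` and `thaw!` (a sketch only, assuming the same `m` as above):

```julia
opt_state = Flux.setup(Adam(), m)   # the state mirrors the model's structure

Flux.freeze!(opt_state.layers[1])   # exclude the first Dense layer from updates
Flux.freeze!(opt_state.layers[2])   # ... and the second

# Flux.train!(loss, m, data, opt_state) would now update only the last layer.

Flux.thaw!(opt_state)               # re-enable training for all layers
```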

## Custom multiple input or output layer

Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in the machine learning literature is the [inception module](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf).
76 changes: 24 additions & 52 deletions docs/src/models/basics.md
@@ -74,68 +74,40 @@ julia> Flux.withgradient(g, nt)
(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))
```

!!! note "Implicit gradients"
Flux used to handle many parameters in a different way, using the [`params`](@ref Flux.params) function.
This uses a method of `gradient` which takes a zero-argument function, and returns a dictionary
through which the resulting gradients can be looked up:

```jldoctest basics
julia> x = [2, 1];

julia> y = [2, 0];

julia> gs = gradient(Flux.params(x, y)) do
f(x, y)
end
Grads(...)

julia> gs[x]
2-element Vector{Float64}:
0.0
2.0

julia> gs[y]
2-element Vector{Float64}:
-0.0
-2.0
```


## Building Simple Models

Consider a simple linear regression, which tries to predict an output array `y` from an input `x`.

```julia
W = rand(2, 5)
b = rand(2)

predict(x) = W*x .+ b
predict(W, b, x) = W*x .+ b

function loss(x, y)
ŷ = predict(x)
function loss(W, b, x, y)
ŷ = predict(W, b, x)
sum((y .- ŷ).^2)
end

x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3
W = rand(2, 5)
b = rand(2)

loss(W, b, x, y) # ~ 3
```

To improve the prediction we can take the gradients of the loss with respect to `W` and `b` and perform gradient descent.

```julia
using Flux

gs = gradient(() -> loss(x, y), Flux.params(W, b))
dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
```

Now that we have gradients, we can pull them out and update `W` to train the model.

```julia
W̄ = gs[W]
W .-= 0.1 .* dW

W .-= 0.1 .* W̄

loss(x, y) # ~ 2.5
loss(W, b, x, y) # ~ 2.5
```

The loss has decreased a little, meaning that our prediction `ŷ` is closer to the target `y`. If we have some data we can already try [training the model](../training/training.md).
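To keep improving the prediction, one could simply repeat this update step in a loop; a minimal sketch continuing from the code above:

```julia
for step in 1:100
    dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
    W .-= 0.1 .* dW
    b .-= 0.1 .* db
end

loss(W, b, x, y)  # should now be much smaller
```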
@@ -144,7 +116,7 @@ All deep learning in Flux, however complex, is a simple generalisation of this e

## Building Layers

It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) (`σ`) in between them. In the above style we could write this as:
It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) in between them. We could write this as:

```julia
using Flux
@@ -157,7 +129,7 @@ W2 = rand(2, 3)
b2 = rand(2)
layer2(x) = W2 * x .+ b2

model(x) = layer2(σ.(layer1(x)))
model(x) = layer2(sigmoid.(layer1(x)))

model(rand(5)) # => 2-element vector
```
@@ -174,7 +146,7 @@ end
linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)

model(x) = linear2(σ.(linear1(x)))
model(x) = linear2(sigmoid.(linear1(x)))

model(rand(5)) # => 2-element vector
```
@@ -188,7 +160,7 @@ struct Affine
end

Affine(in::Integer, out::Integer) =
Affine(randn(out, in), randn(out))
Affine(randn(out, in), zeros(out))

# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b
@@ -198,16 +170,16 @@ a = Affine(10, 5)
a(rand(10)) # => 5-element vector
```

Congratulations! You just built the `Dense` layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.
Congratulations! You just built the [`Dense`](@ref) layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.

(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, σ)`.)
(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, sigmoid)`.)
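To make the comparison concrete, a small sketch (weights are random, so the two outputs will differ):

```julia
a = Affine(10, 5)
y1 = sigmoid.(a(rand(10)))     # activation applied by hand

d = Dense(10 => 5, sigmoid)
y2 = d(rand(10))               # activation applied inside the layer
```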

## Stacking It Up

It's pretty common to write models that look something like:

```julia
layer1 = Dense(10 => 5, σ)
layer1 = Dense(10 => 5, relu)
# ...
model(x) = layer3(layer2(layer1(x)))
```
@@ -217,7 +189,7 @@ For long chains, it might be a bit more intuitive to have a list of layers, like
```julia
using Flux

layers = [Dense(10 => 5, σ), Dense(5 => 2), softmax]
layers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]

model(x) = foldl((x, m) -> m(x), layers, init = x)

@@ -228,7 +200,7 @@ Handily, this is also provided for in Flux:

```julia
model2 = Chain(
Dense(10 => 5, σ),
Dense(10 => 5, relu),
Dense(5 => 2),
softmax)

@@ -255,22 +227,22 @@ m(5) # => 26

## Layer Helpers

There is still one problem with this `Affine` layer, that Flux does not know to look inside it. This means that [`Flux.train!`](@ref) won't see its parameters, nor will [`gpu`](@ref) be able to move them to your GPU. These features are enabled by the [`@layer`](@ref Flux.@layer) macro:
There is still one problem with this `Affine` layer, that Flux does not know to look inside it. This means that [`Flux.train!`](@ref Flux.train!) won't see its parameters, nor will [`gpu`](@ref) be able to move them to your GPU. These features are enabled by the [`@layer`](@ref Flux.@layer) macro:

```julia
Flux.@layer Affine
```
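A quick sketch of what this enables, continuing with the `Affine` defined above (`trainables` is documented on the destructure page):

```julia
a = Affine(3, 2)

Flux.trainables(a)   # 2-element list: the weight matrix and the bias vector
a = gpu(a)           # both arrays move to the GPU, if one is available
```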

Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. We can easily add these refinements to the `Affine` layer as follows, using the helper function [`create_bias`](@ref Flux.create_bias):

```
function Affine((in, out)::Pair; bias=true, init=Flux.randn32)
```julia
function Affine((in, out)::Pair; bias=true, init=glorot_uniform)
W = init(out, in)
b = Flux.create_bias(W, bias, out)
Affine(W, b)
return Affine(W, b)
end

Affine(3 => 1, bias=false, init=ones) |> gpu
Affine(3 => 1, bias=false) |> gpu
```

```@docs
18 changes: 9 additions & 9 deletions docs/src/models/quickstart.md
@@ -16,11 +16,11 @@ truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element
model = Chain(
Dense(2 => 3, tanh), # activation function inside layer
BatchNorm(3),
Dense(3 => 2),
softmax) |> gpu # move model to GPU, if available
Dense(3 => 2)) |> gpu # move model to GPU, if available

# The model encapsulates parameters, randomly initialised. Its initial output is:
out1 = model(noisy |> gpu) |> cpu # 2×1000 Matrix{Float32}
probs1 = softmax(out1) # normalise to get probabilities

# To train the model, we use batches of 64 samples, and one-hot encoding:
target = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix
@@ -36,7 +36,7 @@ losses = []
loss, grads = Flux.withgradient(model) do m
# Evaluate model and loss inside gradient context:
y_hat = m(x)
Flux.crossentropy(y_hat, y)
Flux.logitcrossentropy(y_hat, y)
end
Flux.update!(optim, model, grads[1])
push!(losses, loss) # logging, outside gradient context
@@ -45,8 +45,8 @@ end

optim # parameters, momenta and output have all changed
out2 = model(noisy |> gpu) |> cpu # first row is prob. of true, second row p(false)

mean((out2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
probs2 = softmax(out2) # normalise to get probabilities
mean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
```

![](../assets/quickstart/oneminute.png)
@@ -55,8 +55,8 @@ mean((out2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
using Plots # to draw the above figure

p_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title="True classification", legend=false)
p_raw = scatter(noisy[1,:], noisy[2,:], zcolor=out1[1,:], title="Untrained network", label="", clims=(0,1))
p_done = scatter(noisy[1,:], noisy[2,:], zcolor=out2[1,:], title="Trained network", legend=false)
p_raw = scatter(noisy[1,:], noisy[2,:], zcolor=probs1[1,:], title="Untrained network", label="", clims=(0,1))
p_done = scatter(noisy[1,:], noisy[2,:], zcolor=probs2[1,:], title="Trained network", legend=false)

plot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))
```
Expand Down Expand Up @@ -87,7 +87,7 @@ Some things to notice in this example are:

* The `model` can be called like a function, `y = model(x)`. Each layer like [`Dense`](@ref Flux.Dense) is an ordinary `struct`, which encapsulates some arrays of parameters (and possibly other state, as for [`BatchNorm`](@ref Flux.BatchNorm)).

* But the model does not contain the loss function, nor the optimisation rule. The momenta needed by [`Adam`](@ref Flux.Adam) are stored in the object returned by [setup](@ref Flux.Train.setup). And [`Flux.crossentropy`](@ref Flux.Losses.crossentropy) is an ordinary function.
* But the model does not contain the loss function, nor the optimisation rule. The momenta needed by [`Adam`](@ref Flux.Adam) are stored in the object returned by [setup](@ref Flux.Train.setup). And [`Flux.logitcrossentropy`](@ref Flux.Losses.logitcrossentropy) is an ordinary function that combines the [`softmax`](@ref Flux.softmax) and [`crossentropy`](@ref Flux.crossentropy) functions.

* The `do` block creates an anonymous function, as the first argument of `gradient`. Anything executed within this is differentiated.

@@ -97,7 +97,7 @@ Instead of calling [`gradient`](@ref Zygote.gradient) and [`update!`](@ref Flux.
for epoch in 1:1_000
Flux.train!(model, loader, optim) do m, x, y
y_hat = m(x)
Flux.crossentropy(y_hat, y)
Flux.logitcrossentropy(y_hat, y)
end
end
```
2 changes: 1 addition & 1 deletion docs/src/models/recurrence.md
@@ -173,7 +173,7 @@ Flux.reset!(m)
[m(x) for x in seq_init]

ps = Flux.params(m)
opt= Adam(1e-3)
opt = Adam(1e-3)
Flux.train!(loss, ps, data, opt)
```
