---
title: "Why does stochastic gradient descent not converge?"
---

Many people routinely apply SGD (stochastic gradient descent), but not everyone knows that with a constant stepsize (learning rate) it is guaranteed(!) not to converge, even for the world's nicest function, a strongly convex quadratic, and even in expectation.

Why is that? The point is that at each iteration SGD actually solves a different problem, built from the data sampled at that iteration, and this batch problem may differ radically from the full problem (although, as the careful reader may note, this does not by itself guarantee a very bad step). In other words, at each iteration the method does converge, but toward the minimum of a different problem; every iteration changes the rules of the game, preventing the method from taking more than one step.
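The effect is easy to reproduce numerically. Below is a minimal sketch (all problem sizes and the stepsize are illustrative choices, not from the original post): constant-stepsize SGD on a one-dimensional strongly convex quadratic keeps fluctuating in a neighborhood of the minimizer instead of settling down, because each sampled loss has its own minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(x) = (1/n) * sum_i (a_i*x - b_i)^2 / 2.
# It is a strongly convex quadratic in x with a closed-form minimizer.
n = 100
a = rng.normal(1.0, 0.3, n)
b = rng.normal(2.0, 0.5, n)
x_star = (a @ b) / (a @ a)  # exact minimizer of the full problem

# SGD with a constant stepsize: each step targets the minimizer of the
# *sampled* problem, b_i / a_i, which generally differs from x_star.
x, lr = 0.0, 0.2
tail = []
for t in range(5000):
    i = rng.integers(n)
    grad = a[i] * (a[i] * x - b[i])  # gradient of the single-sample loss
    x -= lr * grad
    if t >= 4000:
        tail.append(x)

# The tail of the trajectory hovers around x_star but its spread does
# not shrink to zero: the iterates fluctuate rather than converge.
print(np.std(tail), abs(np.mean(tail) - x_star))
```

Decreasing `lr` shrinks the noise ball around `x_star` but, for any fixed constant value, the fluctuation never vanishes.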

At the end of the attached video you can see that, using the points sampled from a linear regression problem, we can construct the optimal solution of the batch problem and the gradient pointing toward it; this is what is called the stochastic gradient for the original problem. Most often the stochasticity of SGD is analyzed as noise in the gradient, and less often as noise due to the random/incomplete choice of the problem being solved (interestingly, these are not exactly the same thing).
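The two viewpoints can be checked side by side in code. The sketch below (problem data and batch size are illustrative assumptions) builds the gradient of the subproblem defined only on the sampled rows of a linear regression, shows that a single such gradient differs from the full gradient, yet its average over many sampled batches matches the full gradient, which is the "noise in the gradient" reading of the same construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear regression f(w) = (1/n) * ||A w - b||^2 / 2 (toy data).
n, d = 200, 3
A = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true + 0.1 * rng.normal(size=n)

def full_grad(w):
    return A.T @ (A @ w - b) / n

def batch_grad(w, idx):
    # Gradient of the problem built only on the sampled rows: this is
    # exactly the stochastic gradient for the original problem.
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ w - bb) / len(idx)

w = np.zeros(d)
g_full = full_grad(w)

# A single batch gradient differs from the full gradient...
idx = rng.choice(n, size=10, replace=False)
single_err = np.linalg.norm(batch_grad(w, idx) - g_full)

# ...but averaged over many sampled batches it matches it (unbiasedness).
avg = np.mean([batch_grad(w, rng.choice(n, size=10, replace=False))
               for _ in range(20000)], axis=0)
avg_err = np.linalg.norm(avg - g_full)
print(single_err, avg_err)
```

Unbiasedness is what the usual "gradient plus zero-mean noise" analysis relies on, even though each individual step is the exact gradient of a different, smaller problem.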

This is certainly not a reason to avoid the method, because convergence to an approximate solution is still guaranteed. For convex problems, the non-convergence can be dealt with by
* gradually decreasing the stepsize (slow convergence)
* increasing the batch size (expensive)
* using variance reduction methods (more on this later)
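The first remedy is simple to try. A minimal sketch (the schedule constants are illustrative, not prescribed by the post): on the same kind of toy quadratic, a constant stepsize stalls in a noise ball, while a stepsize decaying like O(1/t) drives the iterate to the true minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy strongly convex quadratic: f(x) = (1/n) * sum_i (a_i*x - b_i)^2 / 2.
n = 100
a = rng.normal(1.0, 0.3, n)
b = rng.normal(2.0, 0.5, n)
x_star = (a @ b) / (a @ a)

def run_sgd(stepsize, iters=20000):
    x = 0.0
    for t in range(iters):
        i = rng.integers(n)
        x -= stepsize(t) * a[i] * (a[i] * x - b[i])
    return x

x_const = run_sgd(lambda t: 0.2)                  # stays in a noise ball
x_decay = run_sgd(lambda t: 0.2 / (1 + 0.01 * t))  # O(1/t) decay: converges

print(abs(x_const - x_star), abs(x_decay - x_star))
```

The price of the decaying schedule is speed: the stepsize becomes tiny long before the iterate is exact, which is the "slow convergence" caveat above.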

:::{.video}
sgd_divergence.mp4
:::

[Code](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/SGD_2d_visualization.ipynb)