add sgd divergence visualization
MerkulovDaniil committed Nov 28, 2024
1 parent 86affc4 commit 05a6c23
Showing 7 changed files with 3,325 additions and 14 deletions.
Binary file added assets/Notebooks/SGD_1_pic.pdf
Binary file not shown.
25 changes: 15 additions & 10 deletions assets/Notebooks/SGD_1d_visualization.ipynb

Large diffs are not rendered by default.

Binary file added assets/Notebooks/gd_scalar_convergence.pdf
Binary file not shown.
Binary file not shown.
3,294 changes: 3,290 additions & 4 deletions assets/graphon.drawio

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions docs/visualizations/sgd_divergence.md
@@ -0,0 +1,20 @@
---
title: "Why does stochastic gradient descent not converge?"
---

Many people are used to applying SGD (stochastic gradient descent), but not everyone knows that with a constant stepsize (learning rate) it is guaranteed(!) not to converge, even for the world's nicest function, a strongly convex quadratic, and even in expectation.

Why? The point is that at each iteration SGD actually solves a different problem, built on the data selected at that iteration. This batch problem may be radically different from the full problem (although, as a careful reader may note, this does not by itself guarantee a very bad step). In other words, at each iteration the method does make progress, but toward the minimum of a different problem, and since the rules of the game change every iteration, the method never gets to take more than one step toward any of them.

At the end of the attached video you can see how, using the points selected from a linear regression problem, we can construct the optimal solution of the batch problem and its gradient, which serves as the stochastic gradient for the original problem. Most often the stochasticity of SGD is analyzed as noise added to the gradient; less often it is treated as noise coming from the random, incomplete choice of the problem being solved (interestingly, the two are not exactly the same thing).
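
As a minimal illustration (a sketch with assumed data, not the code of the attached notebook: the problem size, noise level, stepsize, and the helper `stochastic_grad` are all chosen here for demonstration), here is constant-stepsize SGD on a synthetic least-squares problem. Each stochastic gradient is the exact gradient of the subproblem built on the sampled rows, and the iterates stall at a noise floor around the full-problem minimizer instead of converging to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic linear regression: minimize ||A x - b||^2 / (2n)
n, d = 200, 2
A = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0])
b = A @ x_true + 0.1 * rng.normal(size=n)

def stochastic_grad(x, batch):
    # Exact gradient of the subproblem built only on the sampled rows;
    # for the full problem this is a stochastic gradient.
    Ab, bb = A[batch], b[batch]
    return Ab.T @ (Ab @ x - bb) / len(batch)

x_full = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the full problem

x = np.zeros(d)
lr, batch_size = 0.2, 5                         # constant stepsize
for k in range(3000):
    batch = rng.choice(n, size=batch_size, replace=False)
    x -= lr * stochastic_grad(x, batch)

# The error stops decreasing: the iterates keep oscillating around x_full
print("||x_k - x*|| with constant stepsize:", np.linalg.norm(x - x_full))
```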

This is certainly not a reason to avoid the method, since convergence to an approximate solution is still guaranteed. For convex problems the non-convergence can be dealt with by
* gradually decreasing the stepsize (slow convergence; see the sketch below)
* increasing the batch size (expensive)
* using variance reduction methods (more on this later)
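
Continuing the sketch above (same hypothetical problem and `stochastic_grad`), the first remedy can be illustrated with a decaying stepsize; the schedule below is an arbitrary illustrative choice, not taken from the notebook:

```python
# First remedy: decay the stepsize. The schedule lr0 / (1 + 0.01 k) is illustrative.
x_const, x_decay = np.zeros(d), np.zeros(d)
lr0 = 0.2
for k in range(20000):
    batch = rng.choice(n, size=batch_size, replace=False)
    x_const -= lr0 * stochastic_grad(x_const, batch)                 # stalls at a noise floor
    x_decay -= lr0 / (1 + 0.01 * k) * stochastic_grad(x_decay, batch)  # keeps improving, slowly

print("constant stepsize:", np.linalg.norm(x_const - x_full))
print("decaying stepsize:", np.linalg.norm(x_decay - x_full))
```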

:::{.video}
sgd_divergence.mp4
:::

[Code](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/SGD_2d_visualization.ipynb)
Binary file added docs/visualizations/sgd_divergence.mp4
Binary file not shown.
