Commit
fix headers
MerkulovDaniil committed Nov 15, 2023
1 parent b4abb67 commit d1750b4
Showing 85 changed files with 2,584 additions and 2,558 deletions.
22 changes: 19 additions & 3 deletions 404.md
@@ -1,7 +1,23 @@
---
title: Page Not Found
number-sections: false
title: 👓 Eureka!
toc: false
heading: false
---

The page you requested cannot be found (perhaps it was moved or renamed).
You've just discovered uncharted territory on [💎fmin.xyz](/index.md)!

You may want to try searching to find the page's new location
It seems you've tried to access a page that's as elusive as a global minimum in a non-convex optimization problem. 😄

:::{.plotly}
docs/theory/dual_balls.html
:::

But fear not, intrepid explorer! Here are some tools to navigate back to familiar grounds:

* [💎fmin.xyz home page](/index.md): Like restarting your gradient descent, head back to the start.
* 👆 Search with Precision: Use our search engine, which is more reliable than Newton's method with a distant starting point.

Keep Calm and Optimize On!

Who knew a 404 error could be an opportunity for an adventure in learning? Happy exploring and may your journey be gradient-vanishing-free! 🚀
3 changes: 1 addition & 2 deletions _extensions/fminxyz/_extension.yml
@@ -4,14 +4,13 @@ version: 2.0.0
contributes:
formats:
html:
template-partials:
- title-block.html # Empty title block
number-sections: true
link-external-newwindow: true
theme: [cosmo, theme.scss]
toc: true
code-link: true
code-copy: true
anchor-sections: true
html-math-method: katex
# reference-location: margin
include-in-header:
25 changes: 19 additions & 6 deletions _extensions/fminxyz/theme.scss
@@ -160,6 +160,24 @@ div.callout-proof.callout-style-default > .callout-header {
}
}

.callout.callout-titled .callout-body>:last-child:not(.sourceCode), .callout.callout-titled .callout-body>div>:last-child:not(.sourceCode) {
padding-bottom: .1rem;
}

.callout {
&.callout-style-default {
&.no-icon {
&.callout-titled {
&.callout-proof,
&.callout-answer,
&.callout-solution {
margin-bottom: 0.1rem;
}
}
}
}
}

// For correct print 2 pdf (without sidebar and toc)
@media print {
#quarto-sidebar, #quarto-margin-sidebar {
@@ -198,11 +216,6 @@ div.callout-proof.callout-style-default > .callout-header {
}
}

#quarto-content>*{
padding-top: 0px;
margin-top: 0px;
}

.table {
display: block; // Change the table's display property
max-width: 100%; // Limit the maximum width
@@ -211,4 +224,4 @@ div.callout-proof.callout-style-default > .callout-header {

.navbar #quarto-search{
margin-left: unset;
}
}
2 changes: 0 additions & 2 deletions _extensions/fminxyz/title-block.html

This file was deleted.

1 change: 1 addition & 0 deletions _quarto.yml
@@ -3,6 +3,7 @@ project:
render:
- /docs/**/*.md
- index.md
- 404.md
output-dir: _site
resources:
- "docs/**/*.mp4"
Binary file removed assets/files/Numerical_Optimization2006.pdf
Binary file not shown.
24 changes: 12 additions & 12 deletions docs/applications/A-Star.md
@@ -3,7 +3,7 @@ title: $A^*$ algorithm for path finding
parent: Applications
---

# Problem
## Problem

The **graph** is one of the most significant structures in algorithms, because it can represent many real-life cases, from streets to networks.

@@ -14,10 +14,10 @@ Generally, we need to determine the input and output data:
- **Input data:** a graph map and the end or start point/node (or both, for a specific path)
- **Output data:** paths (or intermediate points/nodes) with the smallest total weight of graph edges as the result

# Solutions
## Solutions
Today there is a variety of algorithms for solving this problem, each with its own advantages and disadvantages depending on the task, so let's consider the main ones:

## Breadth First Search
### Breadth First Search
This is the simplest algorithm for graph traversal. It starts at the tree root (which may be the start/end node) and explores all the neighbor nodes at the present depth before moving on to the nodes at the next depth level.

| Origin Graph | Result Tree | Animation |
@@ -28,13 +28,13 @@
Obviously, this algorithm has low performance: $O(\vert V \vert + \vert E\vert) = O(b^d)$, where **b** is the *branching factor* (the average number of children per node, e.g. $b=2$ for a binary tree) and **d** is the depth/distance from the root.
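
Below is a minimal sketch of Breadth First Search in Python, assuming a hypothetical `graph.neighbors(node)` interface (similar to the `SquareGrid` class used later); it returns the tree of visited nodes as a `came_from` dictionary.

```python
from collections import deque

def breadth_first_search(graph, start):
    # The frontier is a FIFO queue: nodes are expanded in order of discovery,
    # i.e. level by level.
    frontier = deque([start])
    came_from = {start: None}          # visited nodes and the resulting tree
    while frontier:
        current = frontier.popleft()
        for nxt in graph.neighbors(current):
            if nxt not in came_from:   # each node enters the frontier at most once
                came_from[nxt] = current
                frontier.append(nxt)
    return came_from
```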


## Dijkstra's algorithm
### Dijkstra's algorithm

| Description | Animation|
|---|---|
|Dijkstra’s Algorithm (also called Uniform Cost Search) lets us prioritize which paths to explore. Instead of exploring all possible paths equally (like in [Breadth First Search](#breadth-first-search)), it favors lower cost paths.|_________________________![](https://upload.wikimedia.org/wikipedia/commons/2/23/Dijkstras_progress_animation.gif)

## Greedy Best-First-Search
### Greedy Best-First-Search
With Breadth First Search and Dijkstra’s Algorithm, the frontier expands in all directions. This is a reasonable choice if you’re trying to find a path to all locations or to many locations. However, a common case is to find a path to only one location.
Let's make the frontier expand towards the goal more than it expands in other directions. First, we'll define a **heuristic function** that tells us how close we are to the goal. E.g., on a flat map we can use a function like $H(A, B) = |A.x - B.x| + |A.y - B.y|$, where **A** and **B** are nodes with coordinates **{x, y}**.
Let's consider not only shortest edges, but also use the estimated distance to the goal for the priority queue ordering. The location closest to the goal will be explored first.
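
As a small sketch (assuming nodes are stored as `(x, y)` tuples, which is an assumption about the representation rather than part of the original text), such a heuristic might look like this:

```python
def heuristic(a, b):
    # Manhattan distance H(A, B) = |A.x - B.x| + |A.y - B.y| for (x, y) tuples
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Greedy Best-First-Search orders its priority queue by the heuristic alone:
# priority = heuristic(goal, next_node)
```
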
@@ -43,17 +43,17 @@
|---|---|
| We can see that nodes closer to the target are considered first. But when the algorithm finds a barrier, it tries to find a path around it; this path is the best one from the corner, not from the start position, so the resulting path is not the shortest. This is a consequence of the *heuristic function*. To solve this problem, let's consider the next algorithm. |_________________________![_________________________](https://upload.wikimedia.org/wikipedia/commons/8/85/Weighted_A_star_with_eps_5.gif)

## A-Star Algorithm
### A-Star Algorithm
Dijkstra’s Algorithm works well to find the shortest path, but it wastes time exploring in directions that aren’t promising. Greedy Best First Search explores in promising directions but it may not find the shortest path. The A* algorithm uses _both_ the actual distance from the start and the estimated distance to the goal.

| Result of Cost and Heuristic function | Animation |
|---|---|
| Because both the **cost** and the result of the **heuristic function** are used as the priority metric, we can find the shortest path faster than raw Dijkstra's algorithm and more precisely than Greedy Best-First-Search. |_________________________![](https://upload.wikimedia.org/wikipedia/commons/5/5d/Astar_progress_animation.gif)|

## A-Star Implementation
### A-Star Implementation
Let's take a closer look at this algorithm and analyze it with a code example. First of all, you need to create a *Priority Queue*, because points that seem closer to the destination should be considered first. *Priority does not equal cost*. This queue contains the candidate *points* that may lie on the shortest way to the destination.
```python
# Only main methods
## Only main methods
class PriorityQueue:
    # Puts item in collection, sorted by priority.
    def put(self, item, priority):
```

@@ -63,7 +63,7 @@ class PriorityQueue:
You also need a class that describes your graph model with two methods: the first finds the neighbors of the current node, and the second returns the cost between the current node and the next one. These methods allow you to implement any structure, be it a grid, a hexagonal map, or a general graph.

```python
# Only main methods
## Only main methods
class SquareGrid:
    # Returns neighbours of 'id' cell
    # according to map and 'walls'.
```

@@ -114,7 +114,7 @@ def a_star_search(graph, start, goal):
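
The body of `a_star_search` is collapsed in this diff view; below is a minimal sketch of how it might look, assuming the `SquareGrid`-style interface above (`neighbors` and `cost`) and `(x, y)` tuple nodes. The key line is the priority: the actual cost from the start plus the heuristic estimate of the remaining distance.

```python
import heapq

def heuristic(a, b):
    # Manhattan distance between two (x, y) cells
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star_search(graph, start, goal):
    frontier = [(0, start)]            # priority queue of (priority, node)
    came_from = {start: None}
    cost_so_far = {start: 0}
    while frontier:
        _, current = heapq.heappop(frontier)
        if current == goal:
            break
        for nxt in graph.neighbors(current):
            new_cost = cost_so_far[current] + graph.cost(current, nxt)
            if nxt not in cost_so_far or new_cost < cost_so_far[nxt]:
                cost_so_far[nxt] = new_cost
                # A*: actual cost from the start + estimated cost to the goal
                priority = new_cost + heuristic(goal, nxt)
                heapq.heappush(frontier, (priority, nxt))
                came_from[nxt] = current
    return came_from, cost_so_far
```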

# Results
## Results
Now let's try to compare Dijkstra's algorithm with A-Star. For this task we will generate maps with sizes from 5 to 50 with a step of 3. The start position is in the top-left corner, and the end position is in the opposite corner. Also, we will generate corners (their quantity is SIZE^0.4), with a random length on one side and the other side extending to the edge of the map. The generated maps can be found below; here we show only an example and a comparison plot of the number of iterations depending on the map size.

| Dijkstra's | A-Star |
@@ -126,11 +126,11 @@

In most cases $A^*$ finds the path faster. However, there are situations where the heuristic does not help, and in such cases A-Star works the same way as Dijkstra's algorithm.

# Code
## Code

[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/A_star.ipynb){: .btn }

# References
## References

* [Artificial Intelligence: A New Synthesis](https://epdf.pub/artificial-intelligence-a-new-synthesis.html)
* [Introduction_to_algorithms](https://edutechlearners.com/download/Introduction_to_algorithms-3rd%20Edition.pdf)
16 changes: 8 additions & 8 deletions docs/applications/MLE.md
@@ -2,19 +2,19 @@
title: Maximum likelihood estimation
---

# Problem
## Problem
We need to estimate the probability density $p(x)$ of a random variable from observed values.

![Illustration](mle.svg)

# Approach
## Approach
We will use the idea of parametric distribution estimation, which involves choosing *the best* parameters of a chosen family of densities $p_\theta(x)$, indexed by a parameter $\theta$. The idea is very natural: we choose the parameters that maximize the probability (or the logarithm of the probability) of the observed values.

$$
\theta^* = \arg \max\limits_{\theta} \log p_\theta(x)
$$

## Linear measurements with i.i.d. noise
### Linear measurements with i.i.d. noise

Suppose we are given the set of observations:

@@ -36,7 +36,7 @@

The sum appears from the fact that all observations are independent, which leads to $p(\xi) = \prod\limits_{i=1}^m p(\xi_i)$. The target function is called the log-likelihood function $L(\theta)$.

### Gaussian noise
#### Gaussian noise

$$
p(z) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{z^2}{2 \sigma^2}}
$$

@@ -55,7 +55,7 @@

This means that the maximum likelihood estimate in the case of Gaussian noise is the least squares solution.
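
As a quick numerical sketch (with hypothetical synthetic data, not taken from the original notebook), the Gaussian-noise MLE can be computed with an ordinary least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))                       # rows are the measurement vectors a_i
theta_true = rng.standard_normal(n)
x = A @ theta_true + 0.1 * rng.standard_normal(m)     # x_i = a_i^T theta + Gaussian noise

# MLE under Gaussian noise = least squares solution
theta_mle, *_ = np.linalg.lstsq(A, x, rcond=None)
print(np.linalg.norm(theta_mle - theta_true))
```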

### Laplacian noise
#### Laplacian noise

$$
p(z) = \dfrac{1}{2a} e^{-\frac{|z|}{a}}
$$

@@ -74,7 +74,7 @@

This means that the maximum likelihood estimate in the case of Laplacian noise is the $l_1$-norm solution.
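
Similarly, the Laplacian-noise MLE is an $l_1$-norm fit, which can be solved as a linear program by introducing slack variables $t_i \geq |x_i - a_i^\top \theta|$; a hedged sketch with synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
x = A @ theta_true + rng.laplace(scale=0.1, size=m)

# minimize sum(t)  subject to  -t <= x - A @ theta <= t
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([x, -x])
bounds = [(None, None)] * n + [(0, None)] * m
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
theta_l1 = res.x[:n]
print(np.linalg.norm(theta_l1 - theta_true))
```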

### Uniform noise
#### Uniform noise

$$
p(z) = \begin{cases}
\dfrac{1}{2a}, & |z| \leq a \\
0, & \text{otherwise}
\end{cases}
$$

@@ -100,7 +100,7 @@

This means that the maximum likelihood estimate in the case of uniform noise is any vector $\theta$ that satisfies $\vert x_i - \theta^\top a_i \vert \leq a$ for all $i$.
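
One way to find such a $\theta$ is to minimize the worst-case residual (a Chebyshev, or $l_\infty$, fit, again written as a linear program) and check that it does not exceed $a$; a sketch with synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, a = 200, 5, 0.1
A = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
x = A @ theta_true + rng.uniform(-a, a, size=m)

# minimize s  subject to  -s <= x - A @ theta <= s  (single slack variable s)
c = np.concatenate([np.zeros(n), [1.0]])
ones = np.ones((m, 1))
A_ub = np.block([[A, -ones], [-A, -ones]])
b_ub = np.concatenate([x, -x])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n + [(0, None)])
theta_inf, s = res.x[:n], res.x[-1]
print(s <= a)    # True means theta_inf is one of the maximum likelihood estimates
```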

## Binary logistic regression
### Binary logistic regression

Suppose we are given a set of binary random variables $y_i \in \{0,1\}$. Let us parametrize the distribution function as a sigmoid, using a linear transformation of the input as the argument of the sigmoid.

@@ -119,7 +119,7 @@

$$
L(\theta_0, \theta_1) = \sum\limits_{i=1}^k (\theta_0^\top x_i + \theta_1) - \sum\limits_{i=1}^m \log(1 + \text{exp}(\theta_0^\top x_i + \theta_1))
$$
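
A small numerical sketch of this log-likelihood (assuming the points with $y_i = 1$ are exactly the first $k$ ones, as in the formula, and using hypothetical random data):

```python
import numpy as np

def log_likelihood(theta0, theta1, X, y):
    z = X @ theta0 + theta1                      # linear transformation of the inputs
    # sum of z_i over the positive examples minus sum of log(1 + exp(z_i)) over all
    return np.sum(z[y == 1]) - np.sum(np.log1p(np.exp(z)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (rng.random(100) < 0.5).astype(int)
print(log_likelihood(np.zeros(3), 0.0, X, y))    # theta = 0 gives -m * log(2)
```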

# References
## References

* [Convex Optimization @ UCLA](http://www.seas.ucla.edu/~vandenbe/ee236b/ee236b.html) by Prof. L. Vandenberghe
* [Numerical explanation](https://cvxopt.org/examples/book/logreg.html)
6 changes: 3 additions & 3 deletions docs/applications/NN_Loss_Surface.md
@@ -3,7 +3,7 @@ title: Neural Network Loss Surface Visualization
parent: Applications
---

# Scalar Projection
## Scalar Projection

Let's consider the training of our neural network by solving the following optimization problem:

@@ -31,7 +31,7 @@ It is important to note that the characteristics of the resulting graph heavily

![Illustration](nn_vis_CNN_line_drop.svg)
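
The projection formula itself is collapsed in this diff view, but the idea can be sketched as follows: take the trained (flattened) weights, pick a normalized random direction, and evaluate the loss along that line. The `loss` function and `w_star` below are placeholders for the real network loss and trained weights.

```python
import numpy as np

def scan_1d(loss, w_star, radius=1.0, n_points=41, seed=0):
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(w_star.shape)
    d /= np.linalg.norm(d)                          # normalized random direction
    alphas = np.linspace(-radius, radius, n_points)
    return alphas, np.array([loss(w_star + a * d) for a in alphas])

# Toy usage with a quadratic "loss" instead of a real network:
alphas, values = scan_1d(lambda w: 0.5 * np.sum(w ** 2), np.zeros(10))
print(values.min(), values.max())
```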

# Two dimensional projection
## Two dimensional projection
We can explore this idea further and draw the projection of the loss surface onto a plane, which is defined by 2 random vectors. Note that 2 random Gaussian vectors in a high-dimensional space are almost certainly orthogonal.

So, as previously, we generate random normalized gaussian vectors $w_1, w_2 \in \mathbb{R}^p$ and evaluate the loss function
@@ -50,5 +50,5 @@ nn_vis_CNN_plane_no_drop.html
nn_vis_CNN_plane_drop.html
:::
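
A sketch of the same idea for the plane spanned by two random directions (again with placeholder `loss` and `w_star`):

```python
import numpy as np

def scan_2d(loss, w_star, radius=1.0, n_points=21, seed=0):
    rng = np.random.default_rng(seed)
    d1, d2 = rng.standard_normal((2, *w_star.shape))
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)                        # almost orthogonal to d1 in high dimensions
    grid = np.linspace(-radius, radius, n_points)
    return np.array([[loss(w_star + a * d1 + b * d2) for b in grid] for a in grid])

surface = scan_2d(lambda w: 0.5 * np.sum(w ** 2), np.zeros(10))
print(surface.shape)    # (21, 21) grid of loss values
```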

# Code
## Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/NN_Surface_Visualization.ipynb){: .btn }
4 changes: 2 additions & 2 deletions docs/applications/Neural_Lipschitz_constant.md
@@ -3,7 +3,7 @@ title: Neural network Lipschitz constant
parent: Applications
---

# Lipschitz constant of a convolutional layer in neural network
## Lipschitz constant of a convolutional layer in neural network

It has been observed that a small perturbation of a neural network's input can lead to significant errors, i.e. misclassifications.

@@ -17,5 +17,5 @@

In this notebook we will try to estimate Lipschitz constant of some convolutional layer of a Neural Network.

# Code
## Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/Neural_Lipschitz.ipynb){: .btn }
84 changes: 42 additions & 42 deletions docs/applications/deep_learning.md
@@ -1,42 +1,42 @@
---
title: Deep learning
---

# Problem

![Illustration](dl.png)

A lot of practical tasks nowadays are solved using the deep learning approach, which usually implies finding a local minimum of a non-convex function that generalizes well (enough 😉). The goal of this short text is to show you the importance of the optimization behind neural network training.

## Cross entropy
One of the most commonly used loss functions in classification tasks is the normalized categorical cross-entropy in $K$ class problem:

$$
L(\theta) = - \dfrac{1}{n}\sum_{i=1}^n (y_i^\top\log(h_\theta(x_i)) + (1 - y_i)^\top\log(1 - h_\theta(x_i))), \qquad h_\theta^k(x_i) = \dfrac{e^{\theta_k^\top x_i}}{\sum_{j = 1}^K e^{\theta_j^\top x_i}}
$$

Since in Deep Learning tasks the number of points in a dataset could be really huge, we usually use {%include link.html title='Stochastic gradient descent'%} based approaches as a workhorse.

In such algorithms one uses an estimate of the gradient at each step instead of the full gradient vector; for example, for the cross-entropy we have:

$$
\nabla_\theta L(\theta) = \dfrac{1}{n} \sum\limits_{i=1}^n \left( h_\theta(x_i) - y_i \right) x_i^\top
$$

The simplest approximation is an unbiased estimate of the gradient:

$$
g(\theta) = \dfrac{1}{b} \sum\limits_{i=1}^b \left( h_\theta(x_i) - y_i \right) x_i^\top\approx \nabla_\theta L(\theta)
$$

where we randomly sample only $b \ll n$ points and calculate the sample average. It can also be considered a noisy version of the full gradient approach.

![Illustration](MLP_optims.svg)


# Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/Deep%20learning.ipynb){: .btn }

# References
* [Optimization for Deep Learning Highlights in 2017](http://ruder.io/deep-learning-optimization-2017/)
* [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)
---
title: Deep learning
---

## Problem

![Illustration](dl.png)

A lot of practical tasks nowadays are solved using the deep learning approach, which usually implies finding a local minimum of a non-convex function that generalizes well (enough 😉). The goal of this short text is to show you the importance of the optimization behind neural network training.

### Cross entropy
One of the most commonly used loss functions in classification tasks is the normalized categorical cross-entropy in $K$ class problem:

$$
L(\theta) = - \dfrac{1}{n}\sum_{i=1}^n (y_i^\top\log(h_\theta(x_i)) + (1 - y_i)^\top\log(1 - h_\theta(x_i))), \qquad h_\theta^k(x_i) = \dfrac{e^{\theta_k^\top x_i}}{\sum_{j = 1}^K e^{\theta_j^\top x_i}}
$$

Since in Deep Learning tasks the number of points in a dataset could be really huge, we usually use {%include link.html title='Stochastic gradient descent'%} based approaches as a workhorse.

In such algorithms one uses an estimate of the gradient at each step instead of the full gradient vector; for example, for the cross-entropy we have:

$$
\nabla_\theta L(\theta) = \dfrac{1}{n} \sum\limits_{i=1}^n \left( h_\theta(x_i) - y_i \right) x_i^\top
$$

The simplest approximation is an unbiased estimate of the gradient:

$$
g(\theta) = \dfrac{1}{b} \sum\limits_{i=1}^b \left( h_\theta(x_i) - y_i \right) x_i^\top\approx \nabla_\theta L(\theta)
$$

where we randomly sample only $b \ll n$ points and calculate the sample average. It can also be considered a noisy version of the full gradient approach.
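
A sketch of this mini-batch estimate for the softmax model above, with hypothetical random data (shapes: `X` is $(n, d)$, one-hot `Y` is $(n, K)$, `theta` is $(d, K)$):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def minibatch_grad(theta, X, Y, b, rng):
    idx = rng.choice(X.shape[0], size=b, replace=False)   # sample b << n points
    Xb, Yb = X[idx], Y[idx]
    # (1/b) * sum_i x_i (h_theta(x_i) - y_i)^T
    return Xb.T @ (softmax(Xb @ theta) - Yb) / b

rng = np.random.default_rng(0)
n, d, K, b = 10_000, 20, 5, 64
X = rng.standard_normal((n, d))
Y = np.eye(K)[rng.integers(0, K, size=n)]
theta = np.zeros((d, K))
print(minibatch_grad(theta, X, Y, b, rng).shape)   # (d, K), same shape as theta
```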

![Illustration](MLP_optims.svg)


## Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/Deep%20learning.ipynb){: .btn }

## References
* [Optimization for Deep Learning Highlights in 2017](http://ruder.io/deep-learning-optimization-2017/)
* [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)