Commit
fix headers
MerkulovDaniil committed Nov 15, 2023
1 parent b4abb67 commit d1750b4
Showing 85 changed files with 2,584 additions and 2,558 deletions.
22 changes: 19 additions & 3 deletions 404.md
@@ -1,7 +1,23 @@
---
title: Page Not Found
number-sections: false
title: 👓 Eureka!
toc: false
heading: false
---

The page you requested cannot be found (perhaps it was moved or renamed).
You've just discovered uncharted territory on [💎fmin.xyz](/index.md)!

You may want to try searching to find the page's new location
It seems you've tried to access a page that's as elusive as a global minimum in a non-convex optimization problem. 😄

:::{.plotly}
docs/theory/dual_balls.html
:::

But fear not, intrepid explorer! Here are some tools to navigate back to familiar grounds:

* [💎fmin.xyz home page](/index.md): Like restarting your gradient descent, head back to the start.
* 👆 Search with Precision: Use our search engine, which is more reliable than Newton's method with a distant starting point.

Keep Calm and Optimize On!

Who knew a 404 error could be an opportunity for an adventure in learning? Happy exploring and may your journey be gradient-vanishing-free! 🚀
3 changes: 1 addition & 2 deletions _extensions/fminxyz/_extension.yml
@@ -4,14 +4,13 @@ version: 2.0.0
contributes:
formats:
html:
template-partials:
- title-block.html # Empty title block
number-sections: true
link-external-newwindow: true
theme: [cosmo, theme.scss]
toc: true
code-link: true
code-copy: true
anchor-sections: true
html-math-method: katex
# reference-location: margin
include-in-header:
25 changes: 19 additions & 6 deletions _extensions/fminxyz/theme.scss
@@ -160,6 +160,24 @@ div.callout-proof.callout-style-default > .callout-header {
}
}

.callout.callout-titled .callout-body>:last-child:not(.sourceCode), .callout.callout-titled .callout-body>div>:last-child:not(.sourceCode) {
padding-bottom: .1rem;
}

.callout {
&.callout-style-default {
&.no-icon {
&.callout-titled {
&.callout-proof,
&.callout-answer,
&.callout-solution {
margin-bottom: 0.1rem;
}
}
}
}
}

// For correct print 2 pdf (without sidebar and toc)
@media print {
#quarto-sidebar, #quarto-margin-sidebar {
@@ -198,11 +216,6 @@ div.callout-proof.callout-style-default > .callout-header {
}
}

#quarto-content>*{
padding-top: 0px;
margin-top: 0px;
}

.table {
display: block; // Change the table's display property
max-width: 100%; // Limit the maximum width
@@ -211,4 +224,4 @@ div.callout-proof.callout-style-default > .callout-header {

.navbar #quarto-search{
margin-left: unset;
}
}
2 changes: 0 additions & 2 deletions _extensions/fminxyz/title-block.html

This file was deleted.

1 change: 1 addition & 0 deletions _quarto.yml
@@ -3,6 +3,7 @@ project:
render:
- /docs/**/*.md
- index.md
- 404.md
output-dir: _site
resources:
- "docs/**/*.mp4"
Binary file removed assets/files/Numerical_Optimization2006.pdf
Binary file not shown.
24 changes: 12 additions & 12 deletions docs/applications/A-Star.md
@@ -3,7 +3,7 @@ title: $A^*$ algorithm for path finding
parent: Applications
---

# Problem
## Problem

The **graph** is one of the most significant structures in algorithms, because it can represent many real-life cases, from streets to networks.

@@ -14,10 +14,10 @@ Generally, we need to determine the input and output data:
- **Input data:** a graph map and the end or start point/node (or both, for a specific path)
- **Output data:** paths (or intermediate points/nodes) with the smallest total weight of graph edges as the result

# Solutions
## Solutions
Today there is a variety of algorithms for solving this problem, each with its own advantages and disadvantages depending on the task, so let's consider the main ones:

## Breadth First Search
### Breadth First Search
This is the simplest algorithm for graph traversal. It starts at the tree root (which may be the start/end node) and explores all the neighbor nodes at the present depth before moving on to the nodes at the next depth level.

| Origin Graph | Result Tree | Animation |
@@ -28,13 +28,13 @@
Obviously, this algorithm has low performance: $O(\vert V \vert + \vert E\vert) = O(b^d)$, where **b** is the *branching factor* (the average number of children per node, e.g. $b=2$ for a binary tree) and **d** is the depth/distance from the root.
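
Below is a minimal sketch of Breadth First Search in Python, assuming a hypothetical `graph.neighbors(node)` interface (similar to the `SquareGrid` class used later); it returns the tree of visited nodes as a `came_from` dictionary.

```python
from collections import deque

def breadth_first_search(graph, start):
    # The frontier is a FIFO queue: nodes are expanded in order of discovery,
    # i.e. level by level.
    frontier = deque([start])
    came_from = {start: None}          # visited nodes and the resulting tree
    while frontier:
        current = frontier.popleft()
        for nxt in graph.neighbors(current):
            if nxt not in came_from:   # each node enters the frontier at most once
                came_from[nxt] = current
                frontier.append(nxt)
    return came_from
```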


## Dijkstra's algorithm
### Dijkstra's algorithm

| Description | Animation|
|---|---|
|Dijkstra’s Algorithm (also called Uniform Cost Search) lets us prioritize which paths to explore. Instead of exploring all possible paths equally (like in [Breadth First Search](#breadth-first-search)), it favors lower cost paths.|_________________________![](https://upload.wikimedia.org/wikipedia/commons/2/23/Dijkstras_progress_animation.gif)

## Greedy Best-First-Search
### Greedy Best-First-Search
With Breadth First Search and Dijkstra’s Algorithm, the frontier expands in all directions. This is a reasonable choice if you’re trying to find a path to all locations or to many locations. However, a common case is to find a path to only one location.
Let's make the frontier expand towards the goal more than it expands in other directions. First, we'll define a **heuristic function** that tells us how close we are to the goal. E.g., on a flat map we can use a function like $H(A, B) = |A.x - B.x| + |A.y - B.y|$, where **A** and **B** are nodes with coordinates **{x, y}**.
Let's consider not only shortest edges, but also use the estimated distance to the goal for the priority queue ordering. The location closest to the goal will be explored first.
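
As a small sketch (assuming nodes are stored as `(x, y)` tuples, which is an assumption about the representation rather than part of the original text), such a heuristic might look like this:

```python
def heuristic(a, b):
    # Manhattan distance H(A, B) = |A.x - B.x| + |A.y - B.y| for (x, y) tuples
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Greedy Best-First-Search orders its priority queue by the heuristic alone:
# priority = heuristic(goal, next_node)
```
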
@@ -43,17 +43,17 @@
|---|---|
| We can see that nodes closer to the target are considered first. But when the algorithm finds a barrier, it tries to find a path around it; this path is the best one from the corner, not from the start position, so the resulting path is not the shortest. This is a consequence of the *heuristic function*. To solve this problem, let's consider the next algorithm. |_________________________![_________________________](https://upload.wikimedia.org/wikipedia/commons/8/85/Weighted_A_star_with_eps_5.gif)

## A-Star Algorithm
### A-Star Algorithm
Dijkstra’s Algorithm works well to find the shortest path, but it wastes time exploring in directions that aren’t promising. Greedy Best First Search explores in promising directions but it may not find the shortest path. The A* algorithm uses _both_ the actual distance from the start and the estimated distance to the goal.

| Result of Cost and Heuristic function | Animation |
|---|---|
| Because both the **cost** and the result of the **heuristic function** are used as the priority metric, we can find the shortest path faster than raw Dijkstra's algorithm and more precisely than Greedy Best-First-Search. |_________________________![](https://upload.wikimedia.org/wikipedia/commons/5/5d/Astar_progress_animation.gif)|

## A-Star Implementation
### A-Star Implementation
Let's take a closer look at this algorithm and analyze it with a code example. First of all, you need to create a *Priority Queue*, because points that seem closer to the destination should be considered first. *Priority does not equal cost*. This queue contains the candidate *points* that may lie on the shortest way to the destination.
```python
# Only main methods
## Only main methods
class PriorityQueue:
    # Puts item in collection, sorted by priority.
    def put(self, item, priority):
```

@@ -63,7 +63,7 @@ class PriorityQueue:
You also need a class that describes your graph model with two methods: the first finds the neighbors of the current node, and the second returns the cost between the current node and the next one. These methods allow you to implement any structure, be it a grid, a hexagonal map, or a general graph.

```python
# Only main methods
## Only main methods
class SquareGrid:
    # Returns neighbours of 'id' cell
    # according to map and 'walls'.
```

@@ -114,7 +114,7 @@ def a_star_search(graph, start, goal):
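
The body of `a_star_search` is collapsed in this diff view; below is a minimal sketch of how it might look, assuming the `SquareGrid`-style interface above (`neighbors` and `cost`) and `(x, y)` tuple nodes. The key line is the priority: the actual cost from the start plus the heuristic estimate of the remaining distance.

```python
import heapq

def heuristic(a, b):
    # Manhattan distance between two (x, y) cells
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star_search(graph, start, goal):
    frontier = [(0, start)]            # priority queue of (priority, node)
    came_from = {start: None}
    cost_so_far = {start: 0}
    while frontier:
        _, current = heapq.heappop(frontier)
        if current == goal:
            break
        for nxt in graph.neighbors(current):
            new_cost = cost_so_far[current] + graph.cost(current, nxt)
            if nxt not in cost_so_far or new_cost < cost_so_far[nxt]:
                cost_so_far[nxt] = new_cost
                # A*: actual cost from the start + estimated cost to the goal
                priority = new_cost + heuristic(goal, nxt)
                heapq.heappush(frontier, (priority, nxt))
                came_from[nxt] = current
    return came_from, cost_so_far
```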

# Results
## Results
Now let's try to compare Dijkstra's algorithm with A-Star. For this task we will generate maps with sizes from 5 to 50 with a step of 3. The start position is in the top-left corner, and the end position is in the opposite corner. Also, we will generate corners (their quantity is SIZE^0.4), with a random length on one side and the other side extending to the edge of the map. The generated maps can be found below; here we show only an example and a comparison plot of the number of iterations depending on the map size.

| Dijkstra's | A-Star |
@@ -126,11 +126,11 @@

In most cases $A^*$ finds the path faster. However, there are situations where the heuristic does not help, and in such cases A-Star works the same way as Dijkstra's algorithm.

# Code
## Code

[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/A_star.ipynb){: .btn }

# References
## References

* [Artificial Intelligence: A New Synthesis](https://epdf.pub/artificial-intelligence-a-new-synthesis.html)
* [Introduction_to_algorithms](https://edutechlearners.com/download/Introduction_to_algorithms-3rd%20Edition.pdf)
16 changes: 8 additions & 8 deletions docs/applications/MLE.md
@@ -2,19 +2,19 @@
title: Maximum likelihood estimation
---

# Problem
## Problem
We need to estimate the probability density $p(x)$ of a random variable from observed values.

![Illustration](mle.svg)

# Approach
## Approach
We will use the idea of parametric distribution estimation, which involves choosing *the best* parameters of a chosen family of densities $p_\theta(x)$, indexed by a parameter $\theta$. The idea is very natural: we choose the parameters that maximize the probability (or the logarithm of the probability) of the observed values.

$$
\theta^* = \arg \max\limits_{\theta} \log p_\theta(x)
$$

## Linear measurements with i.i.d. noise
### Linear measurements with i.i.d. noise

Suppose we are given the set of observations:

@@ -36,7 +36,7 @@

The sum appears from the fact that all observations are independent, which leads to $p(\xi) = \prod\limits_{i=1}^m p(\xi_i)$. The target function is called the log-likelihood function $L(\theta)$.

### Gaussian noise
#### Gaussian noise

$$
p(z) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{z^2}{2 \sigma^2}}
$$

@@ -55,7 +55,7 @@

This means that the maximum likelihood estimate in the case of Gaussian noise is the least squares solution.
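
As a quick numerical sketch (with hypothetical synthetic data, not taken from the original notebook), the Gaussian-noise MLE can be computed with an ordinary least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))                       # rows are the measurement vectors a_i
theta_true = rng.standard_normal(n)
x = A @ theta_true + 0.1 * rng.standard_normal(m)     # x_i = a_i^T theta + Gaussian noise

# MLE under Gaussian noise = least squares solution
theta_mle, *_ = np.linalg.lstsq(A, x, rcond=None)
print(np.linalg.norm(theta_mle - theta_true))
```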

### Laplacian noise
#### Laplacian noise

$$
p(z) = \dfrac{1}{2a} e^{-\frac{|z|}{a}}
$$

@@ -74,7 +74,7 @@

This means that the maximum likelihood estimate in the case of Laplacian noise is the $l_1$-norm solution.
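
Similarly, the Laplacian-noise MLE is an $l_1$-norm fit, which can be solved as a linear program by introducing slack variables $t_i \geq |x_i - a_i^\top \theta|$; a hedged sketch with synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
x = A @ theta_true + rng.laplace(scale=0.1, size=m)

# minimize sum(t)  subject to  -t <= x - A @ theta <= t
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([x, -x])
bounds = [(None, None)] * n + [(0, None)] * m
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
theta_l1 = res.x[:n]
print(np.linalg.norm(theta_l1 - theta_true))
```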

### Uniform noise
#### Uniform noise

$$
p(z) = \begin{cases}
\dfrac{1}{2a}, & |z| \leq a \\
0, & \text{otherwise}
\end{cases}
$$

@@ -100,7 +100,7 @@

This means that the maximum likelihood estimate in the case of uniform noise is any vector $\theta$ that satisfies $\vert x_i - \theta^\top a_i \vert \leq a$ for all $i$.
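
One way to find such a $\theta$ is to minimize the worst-case residual (a Chebyshev, or $l_\infty$, fit, again written as a linear program) and check that it does not exceed $a$; a sketch with synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, a = 200, 5, 0.1
A = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
x = A @ theta_true + rng.uniform(-a, a, size=m)

# minimize s  subject to  -s <= x - A @ theta <= s  (single slack variable s)
c = np.concatenate([np.zeros(n), [1.0]])
ones = np.ones((m, 1))
A_ub = np.block([[A, -ones], [-A, -ones]])
b_ub = np.concatenate([x, -x])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n + [(0, None)])
theta_inf, s = res.x[:n], res.x[-1]
print(s <= a)    # True means theta_inf is one of the maximum likelihood estimates
```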

## Binary logistic regression
### Binary logistic regression

Suppose we are given a set of binary random variables $y_i \in \{0,1\}$. Let us parametrize the distribution function as a sigmoid, using a linear transformation of the input as the argument of the sigmoid.

@@ -119,7 +119,7 @@

$$
L(\theta_0, \theta_1) = \sum\limits_{i=1}^k (\theta_0^\top x_i + \theta_1) - \sum\limits_{i=1}^m \log(1 + \text{exp}(\theta_0^\top x_i + \theta_1))
$$
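
A small numerical sketch of this log-likelihood (assuming the points with $y_i = 1$ are exactly the first $k$ ones, as in the formula, and using hypothetical random data):

```python
import numpy as np

def log_likelihood(theta0, theta1, X, y):
    z = X @ theta0 + theta1                      # linear transformation of the inputs
    # sum of z_i over the positive examples minus sum of log(1 + exp(z_i)) over all
    return np.sum(z[y == 1]) - np.sum(np.log1p(np.exp(z)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (rng.random(100) < 0.5).astype(int)
print(log_likelihood(np.zeros(3), 0.0, X, y))    # theta = 0 gives -m * log(2)
```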

# References
## References

* [Convex Optimization @ UCLA](http://www.seas.ucla.edu/~vandenbe/ee236b/ee236b.html) by Prof. L. Vandenberghe
* [Numerical explanation](https://cvxopt.org/examples/book/logreg.html)
6 changes: 3 additions & 3 deletions docs/applications/NN_Loss_Surface.md
@@ -3,7 +3,7 @@ title: Neural Network Loss Surface Visualization
parent: Applications
---

# Scalar Projection
## Scalar Projection

Let's consider the training of our neural network by solving the following optimization problem:

@@ -31,7 +31,7 @@ It is important to note that the characteristics of the resulting graph heavily

![Illustration](nn_vis_CNN_line_drop.svg)
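
The projection formula itself is collapsed in this diff view, but the idea can be sketched as follows: take the trained (flattened) weights, pick a normalized random direction, and evaluate the loss along that line. The `loss` function and `w_star` below are placeholders for the real network loss and trained weights.

```python
import numpy as np

def scan_1d(loss, w_star, radius=1.0, n_points=41, seed=0):
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(w_star.shape)
    d /= np.linalg.norm(d)                          # normalized random direction
    alphas = np.linspace(-radius, radius, n_points)
    return alphas, np.array([loss(w_star + a * d) for a in alphas])

# Toy usage with a quadratic "loss" instead of a real network:
alphas, values = scan_1d(lambda w: 0.5 * np.sum(w ** 2), np.zeros(10))
print(values.min(), values.max())
```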

# Two dimensional projection
## Two dimensional projection
We can explore this idea further and draw the projection of the loss surface onto a plane, which is defined by 2 random vectors. Note that 2 random Gaussian vectors in a high-dimensional space are almost certainly orthogonal.

So, as previously, we generate random normalized gaussian vectors $w_1, w_2 \in \mathbb{R}^p$ and evaluate the loss function
@@ -50,5 +50,5 @@ nn_vis_CNN_plane_no_drop.html
nn_vis_CNN_plane_drop.html
:::
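
A sketch of the same idea for the plane spanned by two random directions (again with placeholder `loss` and `w_star`):

```python
import numpy as np

def scan_2d(loss, w_star, radius=1.0, n_points=21, seed=0):
    rng = np.random.default_rng(seed)
    d1, d2 = rng.standard_normal((2, *w_star.shape))
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)                        # almost orthogonal to d1 in high dimensions
    grid = np.linspace(-radius, radius, n_points)
    return np.array([[loss(w_star + a * d1 + b * d2) for b in grid] for a in grid])

surface = scan_2d(lambda w: 0.5 * np.sum(w ** 2), np.zeros(10))
print(surface.shape)    # (21, 21) grid of loss values
```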

# Code
## Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/NN_Surface_Visualization.ipynb){: .btn }
4 changes: 2 additions & 2 deletions docs/applications/Neural_Lipschitz_constant.md
@@ -3,7 +3,7 @@ title: Neural network Lipschitz constant
parent: Applications
---

# Lipschitz constant of a convolutional layer in neural network
## Lipschitz constant of a convolutional layer in neural network

It has been observed that a small perturbation of a neural network's input can lead to significant errors, i.e. misclassifications.

@@ -17,5 +17,5 @@

In this notebook we will try to estimate Lipschitz constant of some convolutional layer of a Neural Network.

# Code
## Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/Neural_Lipschitz.ipynb){: .btn }
84 changes: 42 additions & 42 deletions docs/applications/deep_learning.md
@@ -1,42 +1,42 @@
---
title: Deep learning
---

# Problem

![Illustration](dl.png)

A lot of practical tasks nowadays are solved using the deep learning approach, which usually implies finding a local minimum of a non-convex function that generalizes well (enough 😉). The goal of this short text is to show you the importance of the optimization behind neural network training.

## Cross entropy
One of the most commonly used loss functions in classification tasks is the normalized categorical cross-entropy in $K$ class problem:

$$
L(\theta) = - \dfrac{1}{n}\sum_{i=1}^n (y_i^\top\log(h_\theta(x_i)) + (1 - y_i)^\top\log(1 - h_\theta(x_i))), \qquad h_\theta^k(x_i) = \dfrac{e^{\theta_k^\top x_i}}{\sum_{j = 1}^K e^{\theta_j^\top x_i}}
$$

Since in Deep Learning tasks the number of points in a dataset could be really huge, we usually use {%include link.html title='Stochastic gradient descent'%} based approaches as a workhorse.

In such algorithms one uses an estimate of the gradient at each step instead of the full gradient vector; for example, for the cross-entropy we have:

$$
\nabla_\theta L(\theta) = \dfrac{1}{n} \sum\limits_{i=1}^n \left( h_\theta(x_i) - y_i \right) x_i^\top
$$

The simplest approximation is an unbiased estimate of the gradient:

$$
g(\theta) = \dfrac{1}{b} \sum\limits_{i=1}^b \left( h_\theta(x_i) - y_i \right) x_i^\top\approx \nabla_\theta L(\theta)
$$

where we randomly sample only $b \ll n$ points and calculate the sample average. It can also be considered a noisy version of the full gradient approach.

![Illustration](MLP_optims.svg)


# Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/Deep%20learning.ipynb){: .btn }

# References
* [Optimization for Deep Learning Highlights in 2017](http://ruder.io/deep-learning-optimization-2017/)
* [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)
---
title: Deep learning
---

## Problem

![Illustration](dl.png)

A lot of practical tasks nowadays are solved using the deep learning approach, which usually implies finding a local minimum of a non-convex function that generalizes well (enough 😉). The goal of this short text is to show you the importance of the optimization behind neural network training.

### Cross entropy
One of the most commonly used loss functions in classification tasks is the normalized categorical cross-entropy in $K$ class problem:

$$
L(\theta) = - \dfrac{1}{n}\sum_{i=1}^n (y_i^\top\log(h_\theta(x_i)) + (1 - y_i)^\top\log(1 - h_\theta(x_i))), \qquad h_\theta^k(x_i) = \dfrac{e^{\theta_k^\top x_i}}{\sum_{j = 1}^K e^{\theta_j^\top x_i}}
$$

Since in Deep Learning tasks the number of points in a dataset could be really huge, we usually use {%include link.html title='Stochastic gradient descent'%} based approaches as a workhorse.

In such algorithms one uses an estimate of the gradient at each step instead of the full gradient vector; for example, for the cross-entropy we have:

$$
\nabla_\theta L(\theta) = \dfrac{1}{n} \sum\limits_{i=1}^n \left( h_\theta(x_i) - y_i \right) x_i^\top
$$

The simplest approximation is an unbiased estimate of the gradient:

$$
g(\theta) = \dfrac{1}{b} \sum\limits_{i=1}^b \left( h_\theta(x_i) - y_i \right) x_i^\top\approx \nabla_\theta L(\theta)
$$

where we randomly sample only $b \ll n$ points and calculate the sample average. It can also be considered a noisy version of the full gradient approach.
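
A sketch of this mini-batch estimate for the softmax model above, with hypothetical random data (shapes: `X` is $(n, d)$, one-hot `Y` is $(n, K)$, `theta` is $(d, K)$):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def minibatch_grad(theta, X, Y, b, rng):
    idx = rng.choice(X.shape[0], size=b, replace=False)   # sample b << n points
    Xb, Yb = X[idx], Y[idx]
    # (1/b) * sum_i x_i (h_theta(x_i) - y_i)^T
    return Xb.T @ (softmax(Xb @ theta) - Yb) / b

rng = np.random.default_rng(0)
n, d, K, b = 10_000, 20, 5, 64
X = rng.standard_normal((n, d))
Y = np.eye(K)[rng.integers(0, K, size=n)]
theta = np.zeros((d, K))
print(minibatch_grad(theta, X, Y, b, rng).shape)   # (d, K), same shape as theta
```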

![Illustration](MLP_optims.svg)


## Code
[Open In Colab](https://colab.research.google.com/github/MerkulovDaniil/optim/blob/master/assets/Notebooks/Deep%20learning.ipynb){: .btn }

## References
* [Optimization for Deep Learning Highlights in 2017](http://ruder.io/deep-learning-optimization-2017/)
* [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)