diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index 1c360db32..6129af79a 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -1,10 +1,10 @@
 --ar
   Amjad Khatabi (translation of deep learning)
   Zaid Alyafeai (review of deep learning)
-  
+
 --de
 
---es 
+--es
   Erick Gabriel Mendoza Flores (translation of deep learning)
   Fernando Diaz (review of deep learning)
   Fernando González-Herrera (review of deep learning)
@@ -13,12 +13,12 @@
   Alonso Melgar López (review of deep learning)
   Gustavo Velasco-Hernández (review of deep learning)
   Juan Manuel Nava Zamudio (review of deep learning)
-  
+
   Fernando González-Herrera (translation of linear algebra)
   Fernando Diaz (review of linear algebra)
   Gustavo Velasco-Hernández (review of linear algebra)
   Juan P. Chavat (review of linear algebra)
-  
+
   David Jiménez Paredes (translation of machine learning tips and tricks)
   Fernando Diaz (translation of machine learning tips and tricks)
   Gustavo Velasco-Hernández (review of machine learning tips and tricks)
@@ -36,7 +36,7 @@
   Jaime Noel Alvarez Luna (translation of unsupervised learning)
   Alonso Melgar López (review of unsupervised learning)
   Fernando Diaz (review of unsupervised learning)
-  
+
 --fa
   AlisterTA (translation of deep learning)
   Mohammad Karimi (review of deep learning)
@@ -44,7 +44,7 @@
 
   Erfan Noury (translation of linear algebra)
   Mohammad Karimi (review of linear algebra)
-  
+
   AlisterTA (translation of machine learning tips and tricks)
   Mohammad Reza (translation of machine learning tips and tricks)
   Erfan Noury (review of machine learning tips and tricks)
@@ -52,14 +52,14 @@
 
   Erfan Noury (translation of probabilities and statistics)
   Mohammad Karimi (review of probabilities and statistics)
-  
+
   Amirhosein Kazemnejad (translation of supervised learning)
   Erfan Noury (review of supervised learning)
   Mohammad Karimi (review of supervised learning)
-  
+
   Erfan Noury (translation of unsupervised learning)
   Mohammad Karimi (review of unsupervised learning)
-  
+
 --fr
   Original authors
 
@@ -75,7 +75,7 @@
 
   Gabriel Fonseca (translation of linear algebra)
   Leticia Portella (review of linear algebra)
-  
+
   Fernando Santos (translation of machine learning tips and tricks)
   Leticia Portella (review of machine learning tips and tricks)
   Gabriel Fonseca (review of machine learning tips and tricks)
@@ -86,21 +86,21 @@
   Leticia Portella (translation of supervised learning)
   Gabriel Fonseca (review of supervised learning)
   Flavio Clesio (review of supervised learning)
-  
+
   Gabriel Fonseca (translation of unsupervised learning)
   Tiago Danin (review of unsupervised learning)
 
 --tr
   Ekrem Çetinkaya (translation of deep learning)
   Omer Bukte (review of deep learning)
-  
+
   Kadir Tekeli (translation of linear algebra)
   Ekrem Çetinkaya (review of linear algebra)
-  
+
 --uk
   Gregory Reshetniak (translation of probabilities and statistics)
   Denys (review of probabilities and statistics)
-  
+
 --zh
   Wang Hongnian (translation of supervised learning)
   Xiaohu Zhu (朱小虎) (review of supervised learning)
@@ -109,3 +109,6 @@
 --zh-tw
   kevingo (translation of deep learning)
   TobyOoO (review of deep learning)
+
+--pl
+  Michał Jamry
diff --git a/pl/cheatsheet-deep-learning.md b/pl/cheatsheet-deep-learning.md
new file mode 100644
index 000000000..7e64938b7
--- /dev/null
+++ b/pl/cheatsheet-deep-learning.md
@@ -0,0 +1,321 @@
+**1. Deep Learning cheatsheet**
+
+&#10230; Deep Learning - ściąga
+
+<br>
+
+**2. Neural Networks**
+
+&#10230; Sieci neuronowe
+
+<br>
+
+**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
+
+&#10230; Sieci neuronowe to klasa modeli zbudowanych z warstw. Często wykorzystywane rodzaje sieci neuronowych to konwolucyjne i rekurencyjne sieci neuronowe.
+
+<br>
+
+**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
+
+&#10230; Architektura - Słownictwo związane z architekturami sieci neuronowych jest opisane na rysunku poniżej:
+
+<br>
+
+**5. [Input layer, hidden layer, output layer]**
+
+&#10230; [Warstwa wejściowa, warstwa ukryta, warstwa wyjściowa]
+
+<br>
+
+**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+&#10230; Oznaczając przez i i-tą warstwę sieci, a przez j j-ty neuron ukryty tej warstwy, mamy:
+
+<br>
+
+**7. where we note w, b, z the weight, bias and output respectively.**
+
+&#10230; gdzie w to wagi (współczynniki), b to wyraz wolny funkcji i z to wynik.
+
+<br>
+
+**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
+
+&#10230; Funkcja aktywacji - Funkcje aktywacji stosowane są po wyliczeniu warstwy ukrytej w celu wprowadzenia nieliniowości do modelu. Oto najczęściej stosowane:
+
+<br>
+
+**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
+
+&#10230; [Sigmoid, Tanh, ReLU, Leaky ReLU]
+
+<br>
+
+**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+&#10230; Strata logarytmiczna (Cross-entropy loss) ― W kontekście sieci neuronowych strata logarytmiczna L(z,y) jest często stosowana i jest zdefiniowana następująco:
+
+<br>
+
+**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+&#10230; Współczynnik uczenia ― Współczynnik uczenia, często zapisywany jako α lub rzadziej η, określa z jaką szybkością będą aktualizowane wagi. Może on mieć wartość stałą lub zmienną. Obecnie najpopularniejszą metodą optymalizacji funkcji kosztu jest metoda Adam, która dostosowuje wartość współczynnika uczenia.
+
+<br>
+
+**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
+
+&#10230; Propagacja wsteczna ― Propagacja wsteczna jest metodą aktualizacji wag w sieci neuronowej, która bierze pod uwagę różnicę pomiędzy wynikiem uzyskanym a oczekiwanym. Pochodna względem wagi w jest liczona z wykorzystaniem reguły łańcuchowej i ma następującą postać:
+
+<br>
+
+**13. As a result, the weight is updated as follows:**
+
+&#10230; W wyniku czego, wagi są aktualizowane w następujący sposób:
+
+<br>
+
+**14. Updating weights ― In a neural network, weights are updated as follows:**
+
+&#10230; Aktualizacja wag ― W sieci neuronowej wagi są aktualizowane w następujący sposób:
+
+<br>
+
+**15. Step 1: Take a batch of training data.**
+
+&#10230; Krok 1: Pobierz pakiet danych treningowych.
+
+<br>
+
+**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
+
+&#10230; Krok 2: Dokonaj propagacji do przodu, aby uzyskać odpowiadającą wartość straty.
+
+<br>
+
+**17. Step 3: Backpropagate the loss to get the gradients.**
+
+&#10230; Krok 3: Z wykorzystaniem propagacji wstecznej użyj straty, aby uzyskać gradienty.
+
+<br>
+
+**18. Step 4: Use the gradients to update the weights of the network.**
+
+&#10230; Krok 4: Wykorzystaj gradienty, aby zaktualizować wagi w sieci neuronowej.
+
+<br>
+
+**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
+
+&#10230; Dropout ― Dropout jest techniką zapobiegania nadmiernemu dopasowaniu (overfitting) do danych treningowych poprzez pomijanie niektórych neuronów w sieci. W praktyce, neurony są pomijane z prawdopodobieństwem p lub nie są pomijane z prawdopodobieństwem 1-p
+
+<br>
+
+**20. Convolutional Neural Networks**
+
+&#10230; Konwolucyjne Sieci Neuronowe
+
+<br>
+
+**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
+
+&#10230; Wymagania warstwy konwolucyjnej ― Oznaczając przez W rozmiar danych wejściowych, przez F rozmiar neuronów warstwy konwolucyjnej, a przez P rozmiar uzupełnienia zerami, liczbę neuronów N, które mieszczą się w danej objętości, określamy następująco:
+
+<br>
+
+**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
+
+&#10230; Normalizacja pakietu (Batch normalization) ― Jest to krok z hiperparametrami γ,β, który normalizuje pakiet {xi}. Oznaczając przez μB średnią, a przez σ2B wariancję pakietu, normalizacja wygląda następująco:
+
+<br>
+
+**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+&#10230; Jest ona zazwyczaj stosowana po warstwie w pełni połączonej lub konwolucyjnej, a przed nieliniową funkcją aktywacji, i ma na celu umożliwienie stosowania większych współczynników uczenia oraz zmniejszenie silnej zależności od inicjalizacji.
+
+<br>
+
+**24. Recurrent Neural Networks**
+
+&#10230; Rekurencyjne Sieci Neuronowe
+
+<br>
+
+**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
+
+&#10230; Rodzaje bramek ― Przedstawiamy różne rodzaje bramek, które możemy spotkać w typowych sieciach rekurencyjnych (RNN):
+
+<br>
+
+**26. [Input gate, forget gate, gate, output gate]**
+
+&#10230; [Bramka wejściowa, bramka zapominająca, bramka, bramka wyjściowa]
+
+<br>
+
+**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
+
+&#10230; [Pisać do komórki, czy nie?, Wyczyścić komórkę, czy nie?, Jak dużo zapisać do komórki?, Jak dużo ujawnić z komórki?]
+
+<br>
+
+**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
+
+&#10230; LSTM ― Sieć z długą pamięcią krótkotrwałą (long short-term memory, LSTM) to rodzaj sieci rekurencyjnej (RNN), która radzi sobie z problemem zanikającego gradientu poprzez wykorzystanie bramek zapominających.
+
+<br>
+
+**29. Reinforcement Learning and Control**
+
+&#10230; Uczenie Wspomagane i Kontrola
+
+<br>
+
+**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
+
+&#10230; Celem uczenia wspomaganego jest nauczenie agenta tego, w jaki sposób ewoluować w danym środowisku.
+
+<br>
+
+**31. Definitions**
+
+&#10230; Definicje:
+
+<br>
+
+**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
+
+&#10230; Proces decyzyjny Markowa ― Proces decyzyjny Markowa (MDP) jest 5-krotką (S,A,{Psa},γ,R), gdzie:
+
+<br>
+
+**33. S is the set of states**
+
+&#10230; S jest zbiorem stanów
+
+<br>
+
+**34. A is the set of actions**
+
+&#10230; A jest zbiorem działań
+
+<br>
+
+**35. {Psa} are the state transition probabilities for s∈S and a∈A**
+
+&#10230; {Psa} to prawdopodobieństwa przejść pomiędzy stanami dla s∈S i a∈A
+
+<br>
+
+**36. γ∈[0,1[ is the discount factor**
+
+&#10230; γ∈[0,1[ jest współczynnikiem dyskontowym
+
+<br>
+
+**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
+
+&#10230; R:S×A⟶R lub R:S⟶R to funkcja nagrody, którą algorytm ma za zadanie zmaksymalizować.
+
+<br>
+
+**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
+
+&#10230; Strategia - Strategia π jest funkcją π:S⟶A, która mapuje stany na działania.
+
+<br>
+
+**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
+
+&#10230; Uwaga: mówimy, że wykonujemy daną strategię π, jeżeli w danym stanie s wykonujemy działanie a=π(s).
+
+<br>
+
+**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
+
+&#10230; Funkcja wartości ― Dla danej strategii π i danego stanu s definiujemy funkcję wartości Vπ w następujący sposób:
+
+<br>
+
+**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
+
+&#10230; Równanie Bellmana - Optymalne równania Bellmana charakteryzują funkcję wartości Vπ∗ optymalnej strategii π∗:
+
+<br>
+
+**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
+
+&#10230; Uwaga: zauważmy, że optymalna strategia π∗ dla danego stanu s jest taka, że:
+
+<br>
+
+**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
+
+&#10230; Algorytm iteracji wartości - Algorytm iteracji wartości składa się z dwóch kroków:
+
+<br>
+
+**44. 1) We initialize the value:**
+
+&#10230; 1) Inicjalizujemy wartość:
+
+<br>
+
+**45. 2) We iterate the value based on the values before:**
+
+&#10230; 2) Iteracyjnie aktualizujemy wartość w oparciu o wartości poprzednie:
+
+<br>
+
+**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
+
+&#10230; Estymacja największej wiarygodności - Estymaty największej wiarygodności dla prawdopodobieństw przejść pomiędzy stanami wyglądają następująco:
+
+<br>
+
+**47. times took action a in state s and got to s′**
+
+&#10230; ile razy podjęto działanie a w stanie s i otrzymano stan s'
+
+<br>
+
+**48. times took action a in state s**
+
+&#10230; ile razy podjęto działanie a w stanie s
+
+<br>
+
+**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
+
+&#10230; Q-learning ― Q-learning jest bezmodelowym sposobem estymowania Q, który wygląda następująco:
+
+<br>
+
+**50. View PDF version on GitHub**
+
+&#10230; Zobacz wersję PDF na GitHubie
+
+<br>
+
+**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
+
+&#10230; [Sieci neuronowe, Architektura, Funkcja aktywacji, Propagacja wsteczna, Dropout]
+
+<br>
+
+**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
+
+&#10230; [Konwolucyjne Sieci Neuronowe, Warstwa konwolucyjna, Normalizacja pakietu (Batch normalization)]
+
+<br>
+
+**53. [Recurrent Neural Networks, Gates, LSTM]**
+
+&#10230; [Rekurencyjne Sieci Neuronowe, Bramki, LSTM]
+
+<br>
+
+**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
+
+&#10230; [Uczenie wspomagane, Proces decyzyjny Markowa, Iteracja wartości/strategii, Przybliżone programowanie dynamiczne, Wyszukiwanie strategii]
diff --git a/pl/cheatsheet-machine-learning-tips-and-tricks.md b/pl/cheatsheet-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..ddec59ee5
--- /dev/null
+++ b/pl/cheatsheet-machine-learning-tips-and-tricks.md
@@ -0,0 +1,285 @@
+**1. Machine Learning tips and tricks cheatsheet**
+
+&#10230; Uczenie maszynowe - ściąga z poradami
+
+<br>
+
+**2. Classification metrics**
+
+&#10230; Miary efektywności klasyfikatorów
+
+<br>
+
+**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+&#10230; W przypadku klasyfikacji binarnej, następujące miary są użyteczne do ustalenia efektywności modelu.
+
+<br>
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+&#10230; Macierz pomyłek - Macierz pomyłek jest wykorzystywana w celu przedstawienia bardziej całościowego obrazu efektywności modelu. Definiuje się ją w następujący sposób:
+
+<br>
+
+**5. [Predicted class, Actual class]**
+
+&#10230; [Klasa predykowana, Klasa rzeczywista]
+
+<br>
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+&#10230; Główne miary - Następujące miary często wykorzystywane są do ustalenia efektywności modelu:
+
+<br>
+
+**7. [Metric, Formula, Interpretation]**
+
+&#10230; [Miara, Wzór, Interpretacja]
+
+<br>
+
+**8. Overall performance of model**
+
+&#10230; Dokładność - całościowa efektywność modelu
+
+<br>
+
+**9. How accurate the positive predictions are**
+
+&#10230; Precyzja - jak dokładne są predykcje pozytywne
+
+<br>
+
+**10. Coverage of actual positive sample**
+
+&#10230; Czułość - stosunek wyników prawdziwie dodatnich do sumy prawdziwie dodatnich i fałszywie ujemnych
+
+<br>
+
+**11. Coverage of actual negative sample**
+
+&#10230; Swoistość - stosunek wyników prawdziwie ujemnych do sumy prawdziwie ujemnych i fałszywie dodatnich
+
+<br>
+
+**12. Hybrid metric useful for unbalanced classes**
+
+&#10230; Hybrydowa miara, przydatna przy niezbalansowanych klasach
+
+<br>
+
+**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are are summed up in the table below:**
+
+&#10230; ROC - jest to wykres TPR do FPR przy zmiennym progu. Podsumowanie tych miar znajduje się w tabeli poniżej:
+
+<br>
+
+**14. [Metric, Formula, Equivalent]**
+
+&#10230; [Miara, Wzór, Odpowiednik]
+
+<br>
+
+**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+&#10230; AUC - Pole powierzchni pod krzywą ROC, zapisywane także jako AUC lub AUROC, to pole pod wykresem ROC, jak pokazano na poniższym rysunku:
+
+<br>
+
+**16. [Actual, Predicted]**
+
+&#10230; [Rzeczywiste, Predykowane]
+
+<br>
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+&#10230; Miary podstawowe - Mając model regresyjny f, następujące miary są często używane do sprawdzenia efektywności modelu:
+
+<br>
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+&#10230; [Całkowita suma kwadratów, Wyjaśniona suma kwadratów, Resztowa suma kwadratów]
+
+<br>
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+&#10230; Współczynnik determinacji - często zapisywany jako R2 lub r2, jest miarą tego, jak dobrze zaobserwowane wyniki są replikowane przez model. Definiuje się go następująco:
+
+<br>
+
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+&#10230; Główne miary - Następujące miary często wykorzystywane są do ustalenia efektywności modeli regresyjnych z uwzględnieniem liczby zmiennych n, które model wykorzystuje:
+
+<br>
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+&#10230; gdzie L jest wiarygodnością, a ˆσ2 jest estymatą wariancji związanej z każdą odpowiedzią.
+
+<br>
+
+**22. Model selection**
+
+&#10230; Wybór modelu
+
+<br>
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+&#10230; Słownictwo - Przy wybieraniu modelu rozróżniamy 3 różne porcje danych. Określamy je następująco:
+
+<br>
+
+**24. [Training set, Validation set, Testing set]**
+
+&#10230; [Zbiór treningowy, Zbiór walidacyjny, Zbiór testowy]
+
+<br>
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+&#10230; [Model jest trenowany, Model jest sprawdzany, Model generuje predykcje]
+
+<br>
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+&#10230; [Zazwyczaj 80% zbioru danych, Zazwyczaj 20% zbioru danych]
+
+<br>
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+&#10230; [Zwany także zbiorem zachowanym albo zbiorem deweloperskim, Niewidziane dane]
+
+<br>
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+&#10230; Po wyborze modelu, szkolimy go na całym zbiorze danych (treningowy + walidacyjny) i testujemy na niewidzianym zbiorze (testowy). Zbiory są przedstawione na obrazkach poniżej:
+
+<br>
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+&#10230; Walidacja krzyżowa - Walidacja krzyżowa (cross-validation), zapisywana także jako CV, jest metodą wyboru modelu, który nie polega zbytnio na początkowym zbiorze treningowym. Różne rodzaje tej metody opisane są w tabeli poniżej:
+
+<br>
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+&#10230; [Trenowanie na k-1 podzbiorach i sprawdzanie na pozostałym podzbiorze, Trenowanie na n-p obserwacjach i sprawdzanie na p pozostałych]
+
+<br>
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+&#10230; [Zazwyczaj k=5 lub 10, przypadek przy p=1 zwany jest leave-one-out]
+
+<br>
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+&#10230; Najczęściej stosowanym rodzajem walidacji krzyżowej jest metoda zwana k-fold cross-validation (k-krotna walidacja krzyżowa). Dzieli ona dane treningowe na k podzbiorów. Model jest trenowany na k-1 podzbiorach i walidowany na pozostałym jednym podzbiorze. Proces powtarzany jest k razy, za każdym razem z innym podzbiorem walidacyjnym. Błąd jest następnie uśredniany po k podzbiorach i nazywany błędem walidacji krzyżowej.
+
+<br>
+
+**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+&#10230; Regularyzacja - Jest to proces mający na celu uniknięcie nadmiernego dopasowania (overfitting) modelu do danych, a tym samym rozwiązanie problemu wysokiej wariancji. Poniższa tabela przedstawia rodzaje często stosowanych metod regularyzacji:
+
+<br>
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+&#10230; [Zmniejsza współczynniki do 0, Dobra do doboru zmiennych, Zmniejsza współczynniki, Rozwiązanie pośrednie pomiędzy doborem zmiennych a małymi współczynnikami]
+
+<br>
+
+**35. Diagnostics**
+
+&#10230; Diagnostyka
+
+<br>
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+&#10230; Obciążenie (bias) - Obciążenie modelu to różnica pomiędzy oczekiwaną predykcją a poprawnym modelem, który próbujemy przewidzieć dla danych punktów. Wysokie obciążenie objawia się niewystarczającym dopasowaniem (underfitting) do danych treningowych.
+
+<br>
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+&#10230; Wariancja (variance) - Wariancja modelu to zmienność predykcji modelu dla danych punktów. Wysoka wariancja objawia się nadmiernym dopasowaniem (overfitting) do danych treningowych.
+
+<br>
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+&#10230; Kompromis między obciążeniem (bias) a wariancją - Im prostszy model, tym większe obciążenie (niewystarczające dopasowanie), a im bardziej złożony model, tym większa wariancja (nadmierne dopasowanie).
+
+<br>
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+&#10230; [Objawy, Regresja, Klasyfikacja, Deep learning, Co zrobić?]
+
+<br>
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+&#10230; [Wysoki błąd treningowy, Błąd treningowy zbliżony do błędu testowego, Wysokie obciążenie (bias), Błąd treningowy odrobinę mniejszy niż błąd testowy, Bardzo mały błąd treningowy, Błąd treningowy o wiele mniejszy niż błąd testowy, Wysoka wariancja]
+
+<br>
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+&#10230; [Uczyń model bardziej złożonym, Dodaj więcej cech, Ucz model dłużej, Zastosuj regularyzację, Zdobądź więcej danych]
+
+<br>
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+&#10230; Analiza błędu - Jest to analiza głównych przyczyn różnicy w efektywności pomiędzy modelem obecnym a modelem doskonałym, prowadzona w celu poprawy efektywności modelu.
+
+<br>
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+&#10230; Analiza ablacyjna - Jest to analiza głównych przyczyn różnicy w efektywności pomiędzy modelem obecnym a modelem podstawowym (baseline), prowadzona w celu uproszczenia modelu.
+
+<br>
+
+**44. Regression metrics**
+
+&#10230; Miary regresji
+
+<br>
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+&#10230; [Miary klasyfikacji, macierz pomyłek, dokładność, precyzja, czułość, F1, ROC]
+
+<br>
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+&#10230; [Miary regresji, R kwadrat, Cp Mallowsa, AIC, BIC]
+
+<br>
+
+**47. [Model selection, cross-validation, regularization]**
+
+&#10230; [Wybór modelu, walidacja krzyżowa, regularyzacja]
+
+<br>
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+&#10230; [Diagnostyka, Kompromis obciążenie/wariancja, analiza błędu/ablacyjna]
diff --git a/pl/cheatsheet-supervised-learning.md b/pl/cheatsheet-supervised-learning.md
new file mode 100644
index 000000000..8cfba9a40
--- /dev/null
+++ b/pl/cheatsheet-supervised-learning.md
@@ -0,0 +1,567 @@
+**1. Supervised Learning cheatsheet**
+
+&#10230; Uczenie nadzorowane - ściąga
+
+<br>
+
+**2. Introduction to Supervised Learning**
+
+&#10230; Wprowadzenie do uczenia nadzorowanego
+
+<br>
+
+**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+&#10230; Mając zbiór danych {x(1),...,x(m)} i powiązany z nimi zbiór wyników {y(1),...,y(m)}, chcemy zbudować klasyfikator, który nauczy się predykcji y na podstawie x.
+
+<br>
+
+**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+&#10230; Rodzaje predykcji ― Różne rodzaje predykcji opisane są w tabelce poniżej:
+
+<br>
+
+**5. [Regression, Classifier, Outcome, Examples]**
+
+&#10230; [Regresja, Klasyfikator, Wynik, Przykłady]
+
+<br>
+
+**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+&#10230; [Ciągły, Klasa, Regresja liniowa, Regresja logistyczna, SVM, Naive Bayes]
+
+<br>
+
+**7. Type of model ― The different models are summed up in the table below:**
+
+&#10230; Rodzaj modelu ― Różne rodzaje modeli opisane są w tabelce poniżej:
+
+<br>
+
+**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+&#10230; [Model dyskryminacyjny, Model generatywny, Cel, Co jest uczone?, Ilustracja, Przykłady]
+
+<br>
+
+**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+&#10230; [Bezpośrednia estymacja P(y|x), Estymacja P(x|y) w celu wywnioskowania P(y|x), Granica decyzyjna, Rozkłady prawdopodobieństwa danych, Regresje, SVM, GDA, Naive Bayes]
+
+<br>
+
+**10. Notations and general concepts**
+
+&#10230; Zapis i stwierdzenia ogólne
+
+<br>
+
+**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+&#10230; Hipoteza ― Hipotezę zapisujemy jako hθ; jest ona wybranym przez nas modelem. Dla danych wejściowych x(i) predykcją modelu jest hθ(x(i)).
+
+<br>
+
+**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+&#10230; Funkcja straty - Funkcja straty jest funkcją L:(z,y)∈R×Y⟼L(z,y)∈R, która przyjmuje na wejściu predykowaną wartość z oraz odpowiadającą jej wartość rzeczywistą y i zwraca miarę tego, jak bardzo się one różnią. Często stosowane funkcje straty przedstawione są w tabelce poniżej:
+
+<br>
+
+**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+&#10230; [Błąd najmniejszych kwadratów, Strata logistyczna, Strata zawiasowa (hinge loss), Strata logarytmiczna (Cross-entropy)]
+
+<br>
+
+**14. [Linear regression, Logistic regression, SVM, Neural Network]**
+
+&#10230; [Regresja liniowa, Regresja logistyczna, SVM, Sieć neuronowa]
+
+<br>
+
+**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+&#10230; Funkcja kosztu - Funkcja kosztu J jest często używana w celu określenia efektywności modelu; definiuje się ją za pomocą funkcji straty L w następujący sposób:
+
+<br>
+
+**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+&#10230; Schodzenie gradientu (Gradient descent) ― Przyjmując, że współczynnik uczenia to α∈R, zasadę aktualizacji przy schodzeniu gradientu można wyrazić za pomocą współczynnika uczenia i funkcji kosztu J w następujący sposób:
+
+<br>
+
+**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
+
+&#10230; Uwaga: Stochastyczne schodzenie gradientu (Stochastic Gradient Descent, SGD) aktualizuje parametry (wagi) w oparciu o każdy przykład z danych treningowych z osobna, a pakietowe schodzenie gradientu (batch gradient descent) aktualizuje je na podstawie pakietu przykładów z danych treningowych.
+
+<br>
+
+**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+&#10230; Wiarygodność ― Wiarygodność modelu L(θ) przy parametrach θ jest wykorzystywana do znalezienia optymalnych parametrów θ poprzez maksymalizację wiarygodności. W praktyce używamy logarytmu wiarygodności ℓ(θ)=log(L(θ)), który łatwiej zoptymalizować. Mamy więc:
+
+<br>
+
+**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+&#10230; Algorytm Newtona ― Algorytm Newtona to numeryczna metoda znalezienia takiego parametru θ, dla którego ℓ′(θ)=0. Zasada jego aktualizacji:
+
+<br>
+
+**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+&#10230; Uwaga: uogólnienie wielowymiarowe, znane także jako metoda Newtona-Raphsona, ma następującą zasadę aktualizacji:
+
+<br>
+
+**21. Linear models**
+
+&#10230; Modele liniowe
+
+<br>
+
+**22. Linear regression**
+
+&#10230; Regresja liniowa
+
+<br>
+
+**23. We assume here that y|x;θ∼N(μ,σ²)**
+
+&#10230; Zakładamy tutaj, że y|x;θ∼N(μ,σ²)
+
+<br>
+
+**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+&#10230; Równania normalne ― Oznaczając przez X macierz planu (design matrix), wartość θ minimalizująca funkcję kosztu ma rozwiązanie w postaci zamkniętej:
+
+<br>
+
+**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+&#10230; Algorytm LMS ― Przyjmując, że α to współczynnik uczenia, zasada aktualizacji algorytmu najmniejszych średnich kwadratów (Least Mean Squares, LMS) dla zbioru treningowego złożonego z m przykładów (zwana także regułą uczenia Widrowa-Hoffa) wygląda następująco:
+
+<br>
+
+**26. Remark: the update rule is a particular case of the gradient ascent.**
+
+&#10230; Przypomnienie: zasada aktualizacji to szczególny przypadek wchodzenia gradientu.
+
+<br>
+
+**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+&#10230; LWR ― Regresja ważona lokalnie (Locally Weighted Regression, LWR) jest odmianą regresji liniowej, w której każdy przykład ze zbioru treningowego jest ważony w funkcji kosztu przez w(i)(x), zdefiniowane z wykorzystaniem parametru τ∈R w następujący sposób:
+
+<br>
+
+**28. Classification and logistic regression**
+
+&#10230; Klasyfikacja i regresja logistyczna
+
+<br>
+
+**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+&#10230; Funkcja sigmoidalna ― Funkcja sigmoidalna g, znana także jako funkcja logistyczna, jest zdefiniowana w następujący sposób:
+
+<br>
+
+**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+&#10230; Regresja logistyczna ― Zakładamy tutaj, że y|x;θ∼Bernoulli(ϕ). Mamy następującą postać:
+
+<br>
+
+**31. Remark: there is no closed form solution for the case of logistic regressions.**
+
+&#10230; Przypomnienie: nie istnieje zamknięte rozwiązanie przypadku regresji logistycznej.
+
+<br>
+
+**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+&#10230; Regresja softmax ― Regresja softmax, zwana także wieloklasową regresją logistyczną, jest uogólnieniem regresji logistycznej na przypadek, gdy mamy więcej niż 2 klasy wynikowe. Zgodnie z konwencją przyjmujemy θK=0, przez co parametr Bernoulliego ϕi każdej klasy i jest równy:
+
+<br>
+
+**33. Generalized Linear Models**
+
+&#10230; Uogólnione modele liniowe
+
+<br>
+
+**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+&#10230; Rodzina wykładnicza ― O klasie rozkładów mówi się, że należy do rodziny wykładniczej, jeśli można ją zapisać z wykorzystaniem parametru naturalnego η, zwanego także parametrem kanonicznym lub funkcją łączącą (link function), statystyki dostatecznej T(y) i funkcji log-podziału (log-partition function) a(η) w następujący sposób:
+
+<br>
+
+**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+&#10230; Przypomnienie: często będziemy mieli T(y)=y. Ponadto exp(−a(η)) można rozumieć jako parametr normalizujący, który zapewnia, że prawdopodobieństwa sumują się do 1.
+
+<br>
+
+**36. Here are the most common exponential distributions summed up in the following table:**
+
+&#10230; W poniższej tabeli zebrano najczęściej spotykane rozkłady z rodziny wykładniczej:
+
+<br>
+
+**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+&#10230; [Rozkład, Bernoulliego, Gaussa, Poissona, Geometryczny]
+
+<br>
+
+**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+&#10230; Założenia uogólnionych modeli liniowych ― Uogólnione modele liniowe (Generalized Linear Models, GLM) mają za zadanie przewidzieć zmienną losową y jako funkcję x∈Rn+1 i opierają się na następujących 3 założeniach:
+
+<br>
+
+**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+&#10230; Przypomnienie: zwykła metoda najmniejszych kwadratów i regresja logistyczna to szczególne przypadki uogólnionych modeli liniowych.
+
+<br>
+
+**40. Support Vector Machines**
+
+&#10230; Maszyny wektorów nośnych (Support Vector Machines)
+
+<br>
+
+**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+
+&#10230; Celem maszyn wektorów nośnych jest znalezienie hiperpłaszczyzny, która maksymalizuje margines pomiędzy przykładami oddzielnych klas.
+
+<br>
+
+**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+&#10230; Klasyfikator optymalnego marginesu ― Klasyfikator optymalnego marginesu h jest opisany następująco:
+
+<br>
+
+**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+&#10230; gdzie (w,b)∈Rn×R jest rozwiązaniem następującego problemu optymalizacyjnego:
+
+<br>
+
+**44. such that**
+
+&#10230; takich, że
+
+<br>
+
+**45. support vectors**
+
+&#10230; wektory nośne 
+
+<br>
+
+**46. Remark: the line is defined as wTx−b=0.**
+
+&#10230; Przypomnienie: linia zdefiniowana jest jako wTx−b=0.
+
+<br>
+
+**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+&#10230; Zawiasowa funkcja straty (hinge loss) ― Zawiasowa funkcja straty jest wykorzystywana w maszynach wektorów nośnych i jest zdefiniowana następująco:
+
+<br>
+
+**48. Kernel ― Given a feature mapping ϕ, we define the kernel K as:**
+
+&#10230; Jądro ― Mając dane odwzorowanie cech ϕ, definiujemy jądro K jako:
+
+<br>
+
+**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||²/(2σ²)) is called the Gaussian kernel and is commonly used.**
+
+&#10230; W praktyce jądro K zdefiniowane jako K(x,z)=exp(−||x−z||²/(2σ²)) nazywane jest jądrem Gaussa i jest powszechnie używane.
+
+<br>
+
+**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+&#10230; [Nieliniowa separowalność, Użycie odwzorowania jądrowego, Granica decyzyjna w oryginalnej przestrzeni]
+
+<br>
+
+**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+&#10230; Przypomnienie: mówimy, że używamy „sztuczki jądrowej” (kernel trick) do obliczenia funkcji kosztu z wykorzystaniem jądra, ponieważ w rzeczywistości nie musimy znać jawnego odwzorowania ϕ, które często bywa bardzo skomplikowane. Zamiast tego potrzebne są jedynie wartości K(x,z).
+
+<br>
+
+**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+&#10230; Funkcja Lagrange'a ― Definiujemy funkcję Lagrange'a L(w,b) następująco:
+
+<br>
+
+**53. Remark: the coefficients βi are called the Lagrange multipliers.**
+
+&#10230; Przypomnienie: współczynniki βi nazywane są mnożnikami Lagrange'a.
+
+<br>
+
+**54. Generative Learning**
+
+&#10230; Uczenie generatywne
+
+<br>
+
+**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+&#10230; Model generatywny najpierw stara się nauczyć, w jaki sposób dane są generowane, poprzez estymację P(x|y), którą następnie możemy wykorzystać do estymacji P(y|x), korzystając z reguły Bayesa.
+
+<br>
+
+**56. Gaussian Discriminant Analysis**
+
+&#10230; Gaussowska analiza dyskryminacyjna
+
+<br>
+
+**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+&#10230; Założenia ― Gaussowska analiza dyskryminacyjna zakłada, że y, x|y=0 oraz x|y=1 są takie, że:
+
+<br>
+
+**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+&#10230; Estymacja ― Poniższa tabela przedstawia oszacowania, które otrzymujemy przy maksymalizacji wiarygodności:
+
+<br>
+
+**59. Naive Bayes**
+
+&#10230; Naiwny klasyfikator bayesowski
+
+<br>
+
+**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+&#10230; Założenie ― Naiwny klasyfikator bayesowski zakłada, że wszystkie cechy (features) każdego przykładu są niezależne:
+
+<br>
+
+**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+&#10230; Rozwiązania ― Maksymalizacja logarytmu wiarygodności daje następujące rozwiązania, gdzie k∈{0,1},l∈[[1,L]]
+
+<br>
+
+**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+&#10230; Przypomnienie: Naiwny klasyfikator bayesowski jest powszechnie używany do klasyfikacji tekstu i detekcji spamu.
+
+<br>
+
+**63. Tree-based and ensemble methods**
+
+&#10230; Metody oparte na drzewach i metody zespołowe (ensemble methods)
+
+<br>
+
+**64. These methods can be used for both regression and classification problems.**
+
+&#10230; Te metody mogą być używane zarówno do problemów regresyjnych jak i klasyfikacyjnych.
+
+<br>
+
+**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
+
+&#10230; CART ― Drzewa klasyfikacyjne i regresyjne (Classification and Regression Trees), zwane także drzewami decyzyjnymi, mogą być reprezentowane jako drzewa binarne. Zaletą tych metod jest ich wysoka interpretowalność.
+
+<br>
+
+**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+&#10230; Las losowy ― Jest to metoda oparta na drzewach, która wykorzystuje dużą liczbę drzew decyzyjnych zbudowanych na losowo wybranych zbiorach cech. W przeciwieństwie do prostego drzewa decyzyjnego jest on bardzo trudny do interpretacji, jednak jego ogólnie dobra skuteczność czyni go popularnym algorytmem.
+
+<br>
+
+**67. Remark: random forests are a type of ensemble methods.**
+
+&#10230; Przypomnienie: lasy losowe są rodzajem metod zespołowych (ensemble methods).
+
+<br>
+
+**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+&#10230; Boostowanie ― Idea metod boostowania polega na połączeniu kilku słabych modeli w celu utworzenia jednego silniejszego. Główne rodzaje zebrano w poniższej tabeli:
+
+<br>
+
+**69. [Adaptive boosting, Gradient boosting]**
+
+&#10230; [Boostowanie adaptacyjne, Boostowanie gradientowe]
+
+<br>
+
+**70. High weights are put on errors to improve at the next boosting step**
+
+&#10230; Duża waga kładziona jest na błędy w celu polepszenia wyniku w następnym kroku boostującym 
+
+<br>
+
+**71. Weak learners trained on remaining errors**
+
+&#10230; Słabe modele trenowane są na pozostałych błędach
+
+<br>
+
+**72. Other non-parametric approaches**
+
+&#10230; Inne podejścia nieparametryczne
+
+<br>
+
+**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+&#10230; k-najbliżsi sąsiedzi ― Algorytm k-najbliższych sąsiadów, znany powszechnie jako k-NN, jest podejściem nieparametrycznym, w którym odpowiedź dla danego punktu zależy od charakteru jego k sąsiadów ze zbioru treningowego. Może być wykorzystywany zarówno w klasyfikacji, jak i w regresji.
+
+<br>
+
+**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+&#10230; Przypomnienie: Im wyższy parametr k, tym większe obciążenie (bias), a im niższy parametr k, tym większa wariancja.
+
+<br>
+
+**75. Learning Theory**
+
+&#10230; Teoria uczenia
+
+<br>
+
+**76. Union bound ― Let A1,...,Ak be k events. We have:**
+
+&#10230; Nierówność Boole'a (Boole's inequality, union bound) ― Niech A1,...,Ak będą k zdarzeniami. Mamy:
+
+<br>
+
+**77. Hoeffding inequality ― Let Z1,...,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+&#10230; Nierówność Hoeffdinga ― Niech Z1,...,Zm będą m niezależnymi zmiennymi o jednakowym rozkładzie (iid), pobranymi z rozkładu Bernoulliego z parametrem ϕ. Niech ˆϕ będzie ich średnią z próby, a γ>0 ustalone. Mamy:
+
+<br>
+
+**78. Remark: this inequality is also known as the Chernoff bound.**
+
+&#10230; Przypomnienie: nierówność ta znana jest także jako ograniczenie Chernoffa (Chernoff bound).
+
+<br>
+
+**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+&#10230; Błąd treningowy ― Dla danego klasyfikatora h definiujemy błąd treningowy ˆϵ(h), znany także jako ryzyko empiryczne lub błąd empiryczny, w następujący sposób:
+
+<br>
+
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
+
+&#10230; Probably Approximately Correct (PAC) ― PAC to ramy teoretyczne, w których udowodniono wiele wyników teorii uczenia, oparte na następującym zbiorze założeń:
+
+<br>
+
+**81: the training and testing sets follow the same distribution**
+
+&#10230; zbiór treningowy i testowy mają taki sam rozkład.
+
+<br>
+
+**82. the training examples are drawn independently**
+
+&#10230; przykłady ze zbioru treningowego są wybierane niezależnie.
+
+<br>
+
+**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+&#10230; Rozbijanie (shattering) ― Mając zbiór S={x(1),...,x(d)} i zbiór klasyfikatorów H, mówimy, że H rozbija (shatters) S, jeśli dla dowolnego zbioru etykiet {y(1),...,y(d)} mamy:
+
+<br>
+
+**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+&#10230; Twierdzenie o ograniczeniu górnym (Upper bound theorem) ― Niech H będzie skończoną klasą hipotez taką, że |H|=k, i niech δ oraz rozmiar próby m będą ustalone. Wtedy z prawdopodobieństwem co najmniej 1−δ mamy:
+
+<br>
+
+**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+&#10230; Wymiar VC ― Wymiar Vapnika-Chervonenkisa (VC) danej nieskończonej klasy hipotez H, oznaczany VC(H), jest rozmiarem największego zbioru rozbijanego (shattered) przez H.
+
+<br>
+
+**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+&#10230; Przypomnienie: wymiar VC dla H={zbiór klasyfikatorów liniowych w 2 wymiarach} wynosi 3.
+
+<br>
+
+**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+&#10230; Twierdzenie (Vapnik) ― Niech dane będzie H, gdzie VC(H)=d, a m oznacza liczbę przykładów treningowych. Z prawdopodobieństwem co najmniej 1−δ mamy:
+
+<br>
+
+**88. [Introduction, Type of prediction, Type of model]**
+
+&#10230; [Wprowadzenie, Rodzaj predykcji, Rodzaj modelu]
+
+<br>
+
+**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
+
+&#10230; [Notacja i ogólne pojęcia, funkcja straty, schodzenie gradientu, wiarygodność]
+
+<br>
+
+**90. [Linear models, linear regression, logistic regression, generalized linear models]**
+
+&#10230; [Modele liniowe, regresja liniowa, regresja logistyczna, uogólnione modele liniowe]
+
+<br>
+
+**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
+
+&#10230; [Maszyny wektorów nośnych, Klasyfikator optymalnego marginesu, Zawiasowa funkcja straty, Jądro]
+
+<br>
+
+**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
+
+&#10230; [Uczenie generatywne, Gaussowska analiza dyskryminacyjna, Naiwny klasyfikator bayesowski (Naive Bayes)]
+
+<br>
+
+**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
+
+&#10230; [Drzewa i metody zespołowe, CART, Las losowy, Boostowanie]
+
+<br>
+
+**94. [Other methods, k-NN]**
+
+&#10230; [Inne metody, k-NN]
+
+<br>
+
+**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
+
+&#10230; [Teoria uczenia, Nierówność Hoeffdinga, PAC, wymiar VC]
diff --git a/pl/cheatsheet-unsupervised-learning.md b/pl/cheatsheet-unsupervised-learning.md
new file mode 100644
index 000000000..1bf117d72
--- /dev/null
+++ b/pl/cheatsheet-unsupervised-learning.md
@@ -0,0 +1,340 @@
+**1. Unsupervised Learning cheatsheet**
+
+&#10230;
+
+<br>
+
+**2. Introduction to Unsupervised Learning**
+
+&#10230;
+
+<br>
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+&#10230;
+
+<br>
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+&#10230;
+
+<br>
+
+**5. Clustering**
+
+&#10230;
+
+<br>
+
+**6. Expectation-Maximization**
+
+&#10230;
+
+<br>
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+&#10230;
+
+<br>
+
+**8. [Setting, Latent variable z, Comments]**
+
+&#10230;
+
+<br>
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+&#10230;
+
+<br>
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+&#10230;
+
+<br>
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+&#10230;
+
+<br>
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+&#10230;
+
+<br>
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+&#10230;
+
+<br>
+
+**14. k-means clustering**
+
+&#10230;
+
+<br>
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+&#10230;
+
+<br>
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+&#10230;
+
+<br>
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+&#10230;
+
+<br>
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+&#10230;
+
+<br>
+
+**19. Hierarchical clustering**
+
+&#10230;
+
+<br>
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
+
+&#10230;
+
+<br>
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+&#10230;
+
+<br>
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
+
+&#10230;
+
+<br>
+
+**24. Clustering assessment metrics**
+
+&#10230;
+
+<br>
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+&#10230;
+
+<br>
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+&#10230;
+
+<br>
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+&#10230;
+
+<br>
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+**29. Dimension reduction**
+
+&#10230;
+
+<br>
+
+**30. Principal component analysis**
+
+&#10230;
+
+<br>
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+&#10230;
+
+<br>
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+&#10230;
+
+<br>
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+&#10230;
+
+<br>
+
+**34. diagonal**
+
+&#10230;
+
+<br>
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+&#10230;
+
+<br>
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
+dimensions by maximizing the variance of the data as follows:**
+
+&#10230;
+
+<br>
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+&#10230;
+
+<br>
+
+**38. Step 2: Compute Σ=(1/m)∑_{i=1}^{m} x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
+
+&#10230;
+
+<br>
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+&#10230;
+
+<br>
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+&#10230;
+
+<br>
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+&#10230;
+
+<br>
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+&#10230;
+
+<br>
+
+**43. Independent component analysis**
+
+&#10230;
+
+<br>
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+&#10230;
+
+<br>
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+&#10230;
+
+<br>
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+&#10230;
+
+<br>
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+&#10230;
+
+<br>
+
+**48. Write the probability of x=As=W−1s as:**
+
+&#10230;
+
+<br>
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+&#10230;
+
+<br>
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+&#10230;
+
+<br>
+
+**51. The Machine Learning cheatsheets are now available in Polish.**
+
+&#10230;
+
+<br>
+
+**52. Original authors**
+
+&#10230;
+
+<br>
+
+**53. Translated by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**54. Reviewed by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+&#10230;
+
+<br>
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+&#10230;
+
+<br>
+
+**57. [Dimension reduction, PCA, ICA]**
+
+&#10230;
diff --git a/pl/refresher-linear-algebra.md b/pl/refresher-linear-algebra.md
new file mode 100644
index 000000000..a6b440d1e
--- /dev/null
+++ b/pl/refresher-linear-algebra.md
@@ -0,0 +1,339 @@
+**1. Linear Algebra and Calculus refresher**
+
+&#10230;
+
+<br>
+
+**2. General notations**
+
+&#10230;
+
+<br>
+
+**3. Definitions**
+
+&#10230;
+
+<br>
+
+**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
+
+&#10230;
+
+<br>
+
+**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
+
+&#10230;
+
+<br>
+
+**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
+
+&#10230;
+
+<br>
+
+**7. Main matrices**
+
+&#10230;
+
+<br>
+
+**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
+
+&#10230;
+
+<br>
+
+**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
+
+&#10230;
+
+<br>
+
+**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
+
+&#10230;
+
+<br>
+
+**11. Remark: we also note D as diag(d1,...,dn).**
+
+&#10230;
+
+<br>
+
+**12. Matrix operations**
+
+&#10230;
+
+<br>
+
+**13. Multiplication**
+
+&#10230;
+
+<br>
+
+**14. Vector-vector ― There are two types of vector-vector products:**
+
+&#10230;
+
+<br>
+
+**15. inner product: for x,y∈Rn, we have:**
+
+&#10230;
+
+<br>
+
+**16. outer product: for x∈Rm,y∈Rn, we have:**
+
+&#10230;
+
+<br>
+
+**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
+
+&#10230;
+
+<br>
+
+**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
+
+&#10230;
+
+<br>
+
+**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
+
+&#10230;
+
+<br>
+
+**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
+
+&#10230;
+
+<br>
+
+**21. Other operations**
+
+&#10230;
+
+<br>
+
+**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
+
+&#10230;
+
+<br>
+
+**23. Remark: for matrices A,B, we have (AB)T=BTAT**
+
+&#10230;
+
+<br>
+
+**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
+
+&#10230;
+
+<br>
+
+**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
+
+&#10230;
+
+<br>
+
+**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
+
+&#10230;
+
+<br>
+
+**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
+
+&#10230;
+
+<br>
+
+**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
+
+&#10230;
+
+<br>
+
+**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
+
+&#10230;
+
+<br>
+
+**30. Matrix properties**
+
+&#10230;
+
+<br>
+
+**31. Definitions**
+
+&#10230;
+
+<br>
+
+**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
+
+&#10230;
+
+<br>
+
+**33. [Symmetric, Antisymmetric]**
+
+&#10230;
+
+<br>
+
+**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+
+&#10230;
+
+<br>
+
+**35. N(ax)=|a|N(x) for a scalar**
+
+&#10230;
+
+<br>
+
+**36. if N(x)=0, then x=0**
+
+&#10230;
+
+<br>
+
+**37. For x∈V, the most commonly used norms are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**38. [Norm, Notation, Definition, Use case]**
+
+&#10230;
+
+<br>
+
+**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
+
+&#10230;
+
+<br>
+
+**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
+
+&#10230;
+
+<br>
+
+**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
+
+&#10230;
+
+<br>
+
+**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
+
+&#10230;
+
+<br>
+
+**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies xTAx>0 for every non-zero vector x.**
+
+&#10230;
+
+<br>
+
+**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+&#10230;
+
+<br>
+
+**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+&#10230;
+
+<br>
+
+**46. diagonal**
+
+&#10230;
+
+<br>
+
+**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
+
+&#10230;
+
+<br>
+
+**48. Matrix calculus**
+
+&#10230;
+
+<br>
+
+**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
+
+&#10230;
+
+<br>
+
+**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
+
+&#10230;
+
+<br>
+
+**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
+
+&#10230;
+
+<br>
+
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+
+&#10230;
+
+<br>
+
+**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
+
+&#10230;
+
+<br>
+
+**54. [General notations, Definitions, Main matrices]**
+
+&#10230;
+
+<br>
+
+**55. [Matrix operations, Multiplication, Other operations]**
+
+&#10230;
+
+<br>
+
+**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
+
+&#10230;
+
+<br>
+
+**57. [Matrix calculus, Gradient, Hessian, Operations]**
+
+&#10230;
diff --git a/pl/refresher-probability.md b/pl/refresher-probability.md
new file mode 100644
index 000000000..5c9b34656
--- /dev/null
+++ b/pl/refresher-probability.md
@@ -0,0 +1,381 @@
+**1. Probabilities and Statistics refresher**
+
+&#10230;
+
+<br>
+
+**2. Introduction to Probability and Combinatorics**
+
+&#10230;
+
+<br>
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+&#10230;
+
+<br>
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+&#10230;
+
+<br>
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+&#10230;
+
+<br>
+
+**6. Axiom 1 ― Every probability is between 0 and 1 inclusive, i.e.:**
+
+&#10230;
+
+<br>
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e.:**
+
+&#10230;
+
+<br>
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+&#10230;
+
+<br>
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+&#10230;
+
+<br>
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+&#10230;
+
+<br>
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+&#10230;
+
+<br>
+
+**12. Conditional Probability**
+
+&#10230;
+
+<br>
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+&#10230;
+
+<br>
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+&#10230;
+
+<br>
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+&#10230;
+
+<br>
+
+**16. Remark: for any event B in the sample space, we have P(B)=∑_{i=1}^n P(B|Ai)P(Ai).**
+
+&#10230;
+
+<br>
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+&#10230;
+
+<br>
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+&#10230;
+
+<br>
+
+**19. Random Variables**
+
+&#10230;
+
+<br>
+
+**20. Definitions**
+
+&#10230;
+
+<br>
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to the real line.**
+
+&#10230;
+
+<br>
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that lim_{x→−∞} F(x)=0 and lim_{x→+∞} F(x)=1, is defined as:**
+
+&#10230;
+
+<br>
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+&#10230;
+
+<br>
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+&#10230;
+
+<br>
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+&#10230;
+
+<br>
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+&#10230;
+
+<br>
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+&#10230;
+
+<br>
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ^2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+&#10230;
+
+<br>
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+&#10230;
+
+<br>
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. Denoting by fX and fY the distribution functions of X and Y respectively, we have:**
+
+&#10230;
+
+<br>
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+&#10230;
+
+<br>
+
+**32. Probability Distributions**
+
+&#10230;
+
+<br>
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+&#10230;
+
+<br>
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+&#10230;
+
+<br>
+
+**35. [Type, Distribution]**
+
+&#10230;
+
+<br>
+
+**36. Jointly Distributed Random Variables**
+
+&#10230;
+
+<br>
+
+**37. Marginal density and cumulative distribution ― From the joint probability density function fXY, we have:**
+
+&#10230;
+
+<br>
+
+**38. [Case, Marginal density, Cumulative function]**
+
+&#10230;
+
+<br>
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+&#10230;
+
+<br>
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+&#10230;
+
+<br>
+
+**41. Covariance ― We define the covariance of two random variables X and Y, which we note σ_XY^2 or more commonly Cov(X,Y), as follows:**
+
+&#10230;
+
+<br>
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+&#10230;
+
+<br>
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+&#10230;
+
+<br>
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+&#10230;
+
+<br>
+
+**45. Parameter estimation**
+
+&#10230;
+
+<br>
+
+**46. Definitions**
+
+&#10230;
+
+<br>
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+&#10230;
+
+<br>
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+&#10230;
+
+<br>
+
+**49. Bias ― The bias of an estimator θ̂ is defined as being the difference between the expected value of the distribution of θ̂ and the true value, i.e.:**
+
+&#10230;
+
+<br>
+
+**50. Remark: an estimator is said to be unbiased when we have E[θ̂]=θ.**
+
+&#10230;
+
+<br>
+
+**51. Estimating the mean**
+
+&#10230;
+
+<br>
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted X̄ and is defined as follows:**
+
+&#10230;
+
+<br>
+
+**53. Remark: the sample mean is unbiased, i.e. E[X̄]=μ.**
+
+&#10230;
+
+<br>
+
+**54. Central Limit Theorem ― Let X1,...,Xn be a random sample following a given distribution with mean μ and variance σ^2. Then we have:**
+
+&#10230;
+
+<br>
+
+**55. Estimating the variance**
+
+&#10230;
+
+<br>
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ^2 of a distribution, is often noted s^2 or σ̂^2 and is defined as follows:**
+
+&#10230;
+
+<br>
+
+**57. Remark: the sample variance is unbiased, i.e. E[s^2]=σ^2.**
+
+&#10230;
+
+<br>
+
+**58. Chi-Squared relation with sample variance ― Let s^2 be the sample variance of a random sample. We have:**
+
+&#10230;
+
+<br>
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+&#10230;
+
+<br>
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+&#10230;
+
+<br>
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+&#10230;
+
+<br>
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+&#10230;
+
+<br>
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+&#10230;
+
+<br>
+
+**64. [Parameter estimation, Mean, Variance]**
+
+&#10230;