From 13c9834c81ac466c42ea7c683d39dd9a79e46e6c Mon Sep 17 00:00:00 2001 From: Ruben Maso Date: Tue, 25 Sep 2018 20:18:39 +0200 Subject: [PATCH] Catalan languaje. Translation of: machine learning tips and tricks, probabilities and statistics, linear algebra --- CONTRIBUTORS | 6 + ca/cheatsheet-deep-learning.md | 321 ++++++++++ ...tsheet-machine-learning-tips-and-tricks.md | 285 +++++++++ ca/cheatsheet-supervised-learning.md | 567 ++++++++++++++++++ ca/cheatsheet-unsupervised-learning.md | 340 +++++++++++ ca/refresher-linear-algebra.md | 341 +++++++++++ ca/refresher-probability.md | 381 ++++++++++++ 7 files changed, 2241 insertions(+) create mode 100644 ca/cheatsheet-deep-learning.md create mode 100644 ca/cheatsheet-machine-learning-tips-and-tricks.md create mode 100644 ca/cheatsheet-supervised-learning.md create mode 100644 ca/cheatsheet-unsupervised-learning.md create mode 100644 ca/refresher-linear-algebra.md create mode 100644 ca/refresher-probability.md diff --git a/CONTRIBUTORS b/CONTRIBUTORS index 0547b6c5f..c12468c4a 100644 --- a/CONTRIBUTORS +++ b/CONTRIBUTORS @@ -1,5 +1,11 @@ --ar +--ca + Ruben Maso Carcases (translation of probabilities and statistics) + Ruben Maso Carcases (translation of linear algebra) + Ruben Maso Carcases (translation of machine learning tips and tricks) + + --de --es diff --git a/ca/cheatsheet-deep-learning.md b/ca/cheatsheet-deep-learning.md new file mode 100644 index 000000000..a5aa3756c --- /dev/null +++ b/ca/cheatsheet-deep-learning.md @@ -0,0 +1,321 @@ +**1. Deep Learning cheatsheet** + +⟶ + +
+ +**2. Neural Networks** + +⟶ + +
+ +**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** + +⟶ + +
+ +**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** + +⟶ + +
+ +**5. [Input layer, hidden layer, output layer]** + +⟶ + +
+ +**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** + +⟶ + +
+ +**7. where we note w, b, z the weight, bias and output respectively.** + +⟶ + +
+ +**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** + +⟶ + +
+ +**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** + +⟶ + +
+ +**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** + +⟶ + +
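The formula for entry 10 is not reproduced in this template. As a minimal sketch, assuming the usual binary form L(z,y) = −[y log(z) + (1−y) log(1−z)] with z a predicted probability and y a 0/1 label, the loss could be computed as follows. The function name and the `eps` clamp are illustrative assumptions.

```python
import math


def cross_entropy(z: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy L(z, y) = -[y*log(z) + (1-y)*log(1-z)]."""
    # Clamp z away from 0 and 1 so that log() stays finite.
    # The eps value is an assumption of this sketch, not part of the definition.
    z = min(max(z, eps), 1.0 - eps)
    return -(y * math.log(z) + (1 - y) * math.log(1.0 - z))
```

A confident correct prediction (z close to y) gives a loss near 0, while a confident wrong one gives a large loss.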
+ +**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** + +⟶ + +
+ +**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** + +⟶ + +
+ +**13. As a result, the weight is updated as follows:** + +⟶ + +
+ +**14. Updating weights ― In a neural network, weights are updated as follows:** + +⟶ + +
+ +**15. Step 1: Take a batch of training data.** + +⟶ + +
+ +**16. Step 2: Perform forward propagation to obtain the corresponding loss.** + +⟶ + +
+ +**17. Step 3: Backpropagate the loss to get the gradients.** + +⟶ + +
+ +**18. Step 4: Use the gradients to update the weights of the network.** + +⟶ + +
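The four steps in entries 15 to 18 can be sketched on a toy problem. The example below assumes a one-variable linear model y ≈ w·x + b trained with squared error, which stands in for a real network. The function name, learning rate, and batch size are illustrative choices.

```python
def train_linear(data, lr=0.05, epochs=2000, batch_size=2):
    """Fit y ~ w*x + b with mini-batch gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            # Step 1: take a batch of training data.
            batch = data[start:start + batch_size]
            # Step 2: forward propagation gives predictions and errors.
            errors = [(w * x + b) - y for x, y in batch]
            # Step 3: gradients of the mean of (1/2)*error^2 w.r.t. w and b.
            grad_w = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            # Step 4: use the gradients to update the weights.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Points lying exactly on y = 2x + 1.
w, b = train_linear([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

In a real network, step 3 would be carried out by backpropagation through every layer. Here the gradients are simple enough to write by hand.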
+ +**19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** + +⟶ + +
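A small sketch of the dropout rule described in entry 19, applied to a list of activations. The rescaling of kept units by 1/(1−p), often called inverted dropout, is an assumption added here so that the expected activation is unchanged. It is not stated in the entry itself.

```python
import random


def dropout(activations, p, rng=None):
    """Drop each unit with probability p, keep it with probability 1 - p."""
    rng = rng or random.Random()
    if p >= 1.0:
        # Every unit is dropped; avoid dividing by zero below.
        return [0.0 for _ in activations]
    keep = 1.0 - p
    # Kept units are scaled by 1/keep (assumption: inverted dropout).
    return [a / keep if rng.random() >= p else 0.0 for a in activations]
```

At test time, dropout is normally switched off, which corresponds to calling this function with p = 0.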
+ +**20. Convolutional Neural Networks** + +⟶ + +
+ +**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** + +⟶ + +
+ +**22. Batch normalization ― It is a step with hyperparameters γ, β that normalizes the batch {xi}. By noting μB, σB² the mean and variance of the batch that we want to correct, it is done as follows:** + +⟶ + +
+ +**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** + +⟶ + +
+ +**24. Recurrent Neural Networks** + +⟶ + +
+ +**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** + +⟶ + +
+ +**26. [Input gate, forget gate, gate, output gate]** + +⟶ + +
+ +**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** + +⟶ + +
+ +**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** + +⟶ + +
+ +**29. Reinforcement Learning and Control** + +⟶ + +
+ +**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** + +⟶ + +
+ +**31. Definitions** + +⟶ + +
+ +**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** + +⟶ + +
+ +**33. S is the set of states** + +⟶ + +
+ +**34. A is the set of actions** + +⟶ + +
+ +**35. {Psa} are the state transition probabilities for s∈S and a∈A** + +⟶ + +
+ +**36. γ∈[0,1[ is the discount factor** + +⟶ + +
+ +**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** + +⟶ + +
+ +**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** + +⟶ + +
+ +**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** + +⟶ + +
+ +**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** + +⟶ + +
+ +**41. Bellman equation ― The optimal Bellman equation characterizes the value function Vπ∗ of the optimal policy π∗:** + +⟶ + +
+ +**42. Remark: we note that the optimal policy π∗ for a given state s is such that:** + +⟶ + +
+ +**43. Value iteration algorithm ― The value iteration algorithm is in two steps:** + +⟶ + +
+ +**44. 1) We initialize the value:** + +⟶ + +
+ +**45. 2) We iterate the value based on the values before:** + +⟶ + +
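The two steps of value iteration (entries 44 and 45) can be sketched as below. The formulas are not reproduced in this template, so this assumes the standard update V(s) ← R(s) + max over a of γ·Σ Psa(s′)·V(s′), with a reward that depends on the state only. The dictionary layout of `P` and `R` is a hypothetical choice for illustration.

```python
def value_iteration(states, actions, P, R, gamma, iters=100):
    """P[s][a] maps next states to probabilities; R[s] is the reward of s."""
    # 1) Initialize the value: V_0(s) = 0 for every state.
    V = {s: 0.0 for s in states}
    # 2) Iterate the value based on the previous values.
    for _ in range(iters):
        V = {
            s: R[s] + gamma * max(
                sum(prob * V[s2] for s2, prob in P[s][a].items())
                for a in actions
            )
            for s in states
        }
    return V


# Toy MDP: two states, deterministic "stay" and "move" actions.
states, actions = ["a", "b"], ["stay", "move"]
P = {
    "a": {"stay": {"a": 1.0}, "move": {"b": 1.0}},
    "b": {"stay": {"b": 1.0}, "move": {"a": 1.0}},
}
R = {"a": 0.0, "b": 1.0}
V = value_iteration(states, actions, P, R, gamma=0.5)
```

With γ = 0.5 the values converge to V(b) = 2 (stay in the rewarding state forever) and V(a) = 1 (move to b first).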
+ +**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** + +⟶ + +
+ +**47. times took action a in state s and got to s′** + +⟶ + +
+ +**48. times took action a in state s** + +⟶ + +
+ +**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** + +⟶ + +
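Entry 49 omits the update formula. A hedged sketch, assuming the standard rule Q(s,a) ← Q(s,a) + α[R(s,a,s′) + γ·max over a′ of Q(s′,a′) − Q(s,a)], with Q stored in a dictionary keyed by (state, action). The function and argument names are illustrative.

```python
def q_update(Q, s, a, reward, s_next, actions, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Unseen (state, action) pairs default to 0.0 (assumption of this sketch).
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(s, a)]
```

The update needs only an observed transition (s, a, reward, s′) and no transition probabilities, which is what makes the method model-free.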
+ +**50. View PDF version on GitHub** + +⟶ + +
+ +**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** + +⟶ + +
+ +**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** + +⟶ + +
+ +**53. [Recurrent Neural Networks, Gates, LSTM]** + +⟶ + +
+ +**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** + +⟶ diff --git a/ca/cheatsheet-machine-learning-tips-and-tricks.md b/ca/cheatsheet-machine-learning-tips-and-tricks.md new file mode 100644 index 000000000..d5ec7b2ff --- /dev/null +++ b/ca/cheatsheet-machine-learning-tips-and-tricks.md @@ -0,0 +1,285 @@ +**1. Machine Learning tips and tricks cheatsheet** + +⟶Consells i trucs sobre Aprenentatge Automàtic + +
+ +**2. Classification metrics** + +⟶ Mètriques per a classificació + +
+ +**3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** + +⟶ En el context d'una classificació binària, aquestes són les principals mètriques que és important seguir per a avaluar el rendiment del model. + +
+ +**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** + +⟶ Matriu de confusió ― La matriu de confusió s'utilitza per a tindre una visió més completa en avaluar el rendiment d'un model. Es defineix de la següent forma: + +
+ +**5. [Predicted class, Actual class]** + +⟶ [Classe predita, classe real] + +
+ +**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** + +⟶ Mètriques principals ― Les següents mètriques s'utilitzen comunament per a avaluar el rendiment dels models de classificació: + +
+ +**7. [Metric, Formula, Interpretation]** + +⟶ [Mètrica, Fórmula, Interpretació] + +
+ +**8. Overall performance of model** + +⟶ Rendiment general del model + +
+ +**9. How accurate the positive predictions are** + +⟶ Com de precises són les prediccions positives + +
+ +**10. Coverage of actual positive sample** + +⟶ Cobertura de la mostra positiva real + +
+ +**11. Coverage of actual negative sample** + +⟶ Cobertura de la mostra negativa real + +
+ +**12. Hybrid metric useful for unbalanced classes** + +⟶ Mètrica híbrida útil per a classes desbalancejades + +
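The metrics described in entries 8 to 12 (accuracy, precision, recall, specificity, F1 score) can be computed from the four cells of the confusion matrix. This is a minimal sketch using the standard definitions. The function name and the returned dictionary keys are illustrative, and no guard is included for empty denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        # Overall performance of the model.
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # How accurate the positive predictions are.
        "precision": tp / (tp + fp),
        # Coverage of the actual positive sample.
        "recall": tp / (tp + fn),
        # Coverage of the actual negative sample.
        "specificity": tn / (tn + fp),
        # Hybrid metric, useful for unbalanced classes.
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

The F1 expression 2TP / (2TP + FP + FN) is the harmonic mean of precision and recall written directly in terms of counts.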
+ +**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** + +⟶ ROC ― La corba característica operativa del receptor, també coneguda com ROC, és la representació gràfica de la TPR enfront de la FPR en variar el llindar. Aquestes mètriques es resumeixen en la taula a continuació: + +
+ +**14. [Metric, Formula, Equivalent]** + +⟶ [Mètrica, Fórmula, Equivalent] + +
+ +**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** + +⟶ AUC ― L'àrea sota la corba característica operativa del receptor, també denotada com AUC o AUROC, és l'àrea sota la ROC, com es mostra en la següent figura: + +
+ +**16. [Actual, Predicted]** + +⟶ [Real, Predita] + +
+ +**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** + +⟶ Mètriques bàsiques ― Donat un model de regressió f, les següents mètriques s'utilitzen comunament per a avaluar el rendiment del model: + +
+ +**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** + +⟶ [Suma total de quadrats, suma de quadrats explicada, suma residual de quadrats] + +
+ +**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** + +⟶ Coeficient de determinació ― El coeficient de determinació, sovint denotat com R2 o r2, proporciona una mesura de com de bé els resultats observats són replicats pel model i es defineix de la següent forma: + +
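The definition in entry 19 is not reproduced in this template. As a sketch, assuming the standard form R² = 1 − SSres/SStot built from the residual and total sums of squares named in entry 18:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives 1, and a model that always predicts the mean of the observations gives 0.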
+ +**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** + +⟶ Mètriques principals ― Les següents mètriques s'utilitzen comunament per a avaluar el rendiment dels models de regressió, tenint en compte la quantitat de variables n que consideren: + +
+ +**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** + +⟶ on L és la versemblança i ˆσ2 és una estimació de la variància associada amb cada resposta. + +
+ +**22. Model selection** + +⟶ Selecció de model + +
+ +**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** + +⟶ Vocabulari - al seleccionar un model, distingim 3 parts diferents de les dades que tenim de la següent forma: + +
+ +**24. [Training set, Validation set, Testing set]** + +⟶ [Conjunt d'entrenament, Conjunt de validació, Conjunt de prova] + +
+ +**25. [Model is trained, Model is assessed, Model gives predictions]** + +⟶ [El model és entrenat, El model és avaluat, El model dona prediccions] + +
+ +**26. [Usually 80% of the dataset, Usually 20% of the dataset]** + +⟶ [Generalment el 80% del conjunt de dades, Generalment el 20% del conjunt de dades] + +
+ +**27. [Also called hold-out or development set, Unseen data]** + +⟶ [També anomenat hold-out o conjunt de desenvolupament, Dades no vistes] + +
+ +**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** + +⟶ Una vegada que s'ha escollit el model, s'entrena sobre tot el conjunt de dades i es testeja sobre el conjunt de prova no vist. Aquests estan representats en la figura a continuació: + +
+ +**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** + +⟶ Validació creuada ― La validació creuada, també denominada CV, és un mètode que s'utilitza per a seleccionar un model que no confie excessivament en el conjunt d'entrenament inicial. Els diferents tipus es resumeixen en la taula següent: + +
+ +**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** + +⟶ [Entrenament sobre k−1 conjunts i avaluació sobre el restant, Entrenament sobre n−p observacions i avaluació sobre les p restants] + +
+ +**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** + +⟶ [Generalment k = 5 o 10, el cas p = 1 s'anomena leave-one-out (deixar-ne un fora)] + +
+ +**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** + +⟶ El mètode més comunament utilitzat es denomina validació creuada k-fold i divideix les dades d'entrenament en k conjunts per a validar el model sobre un conjunt mentre s'entrena el model amb els altres k-1 conjunts, tot açò k vegades. Després es calcula la mitjana de l'error sobre els k conjunts, que es denomina error de validació creuada. + +
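The k-fold procedure of entry 32 can be sketched with plain index lists. The helper names and the `fit_and_score` callback, which is expected to train on the training indices and return the error on the validation indices, are hypothetical. Folds are contiguous here for simplicity, whereas in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def cross_validation_error(n, k, fit_and_score):
    """Average the validation error returned by fit_and_score over k folds."""
    errors = [fit_and_score(train, val) for train, val in k_fold_indices(n, k)]
    return sum(errors) / k
```

Each observation is used for validation exactly once and for training k−1 times.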
+ +**33. Regularization ― The regularization procedure aims to prevent the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** + +⟶ Regularització ― El procediment de regularització té com a objectiu evitar que el model es sobreajuste a les dades i, per tant, resol els problemes d'alta variància. La següent taula resumeix els diferents tipus de tècniques de regularització comunament utilitzades: + +
+ +**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** + +⟶ [Redueix els coeficients a 0, Bo per a la selecció de variables, Fa que els coeficients siguen més petits, Compensació entre la selecció de variables i els coeficients petits] + +
+ +**35. Diagnostics** + +⟶ Diagnòstic + +
+ +**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** + +⟶ Biaix ― El biaix d'un model és la diferència entre la predicció esperada i el model correcte que tractem de predir per a determinats punts de dades. + +
+ +**37. Variance ― The variance of a model is the variability of the model prediction for given data points.** + +⟶ Variància ― La variància d'un model és la variabilitat de la predicció del model per a punts de dades donats. + +
+ +**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** + +⟶ Compromís biaix/variància ― Com més simple és el model, major és el biaix, i com més complex és el model, major és la variància. + +
+ +**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** + +⟶ [Símptomes, exemple de regressió, exemple de classificació, exemple d'aprenentatge profund, possibles solucions] + +
+ +**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** + +⟶ [Error d'entrenament alt, Error d'entrenament proper a l'error de prova, Biaix alt, Error d'entrenament lleugerament inferior a l'error de prova, Error d'entrenament molt baix, Error d'entrenament molt més baix que l'error de prova, Variància alta] + +
+ +**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** + +⟶ [Fer el model més complex, Afegir més característiques, Entrenar més temps, Realitzar regularització, Obtindre més dades] + +
+ +**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** + +⟶ Anàlisi d'errors ― L'anàlisi d'errors analitza la causa arrel de la diferència de rendiment entre el model actual i el model perfecte. + +
+ +**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** + +⟶ Anàlisi ablativa ― L'anàlisi ablativa analitza la causa arrel de la diferència de rendiment entre el model actual i el model de referència. + +
+ +**44. Regression metrics** + +⟶ Mètriques de regressió + +
+ +**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** + +⟶ [Mètriques de classificació, matriu de confusió, exactitud, precisió, recall, F1 score, ROC] + +
+ +**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** + +⟶ [Mètriques de regressió, R quadrat, Mallow's CP, AIC, BIC] + +
+ +**47. [Model selection, cross-validation, regularization]** + +⟶ [Selecció de model, validació creuada, regularització] + +
+ +**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** + +⟶ [Diagnòstic, Compromís biaix/variància, Anàlisi d'errors/ablativa] \ No newline at end of file diff --git a/ca/cheatsheet-supervised-learning.md b/ca/cheatsheet-supervised-learning.md new file mode 100644 index 000000000..a6b19ea1c --- /dev/null +++ b/ca/cheatsheet-supervised-learning.md @@ -0,0 +1,567 @@ +**1. Supervised Learning cheatsheet** + +⟶ + +
+ +**2. Introduction to Supervised Learning** + +⟶ + +
+ +**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** + +⟶ + +
+ +**4. Type of prediction ― The different types of predictive models are summed up in the table below:** + +⟶ + +
+ +**5. [Regression, Classifier, Outcome, Examples]** + +⟶ + +
+ +**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** + +⟶ + +
+ +**7. Type of model ― The different models are summed up in the table below:** + +⟶ + +
+ +**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** + +⟶ + +
+ +**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** + +⟶ + +
+ +**10. Notations and general concepts** + +⟶ + +
+ +**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** + +⟶ + +
+ +**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** + +⟶ + +
+ +**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]** + +⟶ + +
+ +**14. [Linear regression, Logistic regression, SVM, Neural Network]** + +⟶ + +
+ +**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** + +⟶ + +
+ +**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** + +⟶ + +
+ +**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** + +⟶ + +
+ +**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** + +⟶ + +
+ +**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** + +⟶ + +
+ +**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** + +⟶ + +
+ +**21. Linear models** + +⟶ + +
+ +**22. Linear regression** + +⟶ + +
+ +**23. We assume here that y|x;θ∼N(μ,σ2)** + +⟶ + +
+ +**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** + +⟶ + +
+ +**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** + +⟶ + +
+ +**26. Remark: the update rule is a particular case of the gradient ascent.** + +⟶ + +
+ +**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** + +⟶ + +
+ +**28. Classification and logistic regression** + +⟶ + +
+ +**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** + +⟶ + +
+ +**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** + +⟶ + +
+ +**31. Remark: there is no closed form solution for the case of logistic regressions.** + +⟶ + +
+ +**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** + +⟶ + +
+ +**33. Generalized Linear Models** + +⟶ + +
+ +**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** + +⟶ + +
+ +**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** + +⟶ + +
+ +**36. Here are the most common exponential distributions summed up in the following table:** + +⟶ + +
+ +**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]** + +⟶ + +
+ +**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:** + +⟶ + +
+ +**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** + +⟶ + +
+ +**40. Support Vector Machines** + +⟶ + +
+ +**41. The goal of support vector machines is to find the line that maximizes the minimum distance from the data points to the line.** + +⟶ + +
+ +**42. Optimal margin classifier ― The optimal margin classifier h is such that:** + +⟶ + +
+ +**43. where (w,b)∈Rn×R is the solution of the following optimization problem:** + +⟶ + +
+ +**44. such that** + +⟶ + +
+ +**45. support vectors** + +⟶ + +
+ +**46. Remark: the line is defined as wTx−b=0.** + +⟶ + +
+ +**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** + +⟶ + +
+ +**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** + +⟶ + +
+ +**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||^2/(2σ^2)) is called the Gaussian kernel and is commonly used.** + +⟶ + +
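As a short illustration of entry 49, the Gaussian kernel K(x,z) = exp(−‖x−z‖²/(2σ²)) can be evaluated directly on two vectors without ever building the feature mapping ϕ. The function name is an illustrative choice.

```python
import math


def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)) for equal-length vectors."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The value is 1 when x = z and decays towards 0 as the points move apart, with σ controlling how quickly.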
+ +**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** + +⟶ + +
+ +**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** + +⟶ + +
+ +**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:** + +⟶ + +
+ +**53. Remark: the coefficients βi are called the Lagrange multipliers.** + +⟶ + +
+ +**54. Generative Learning** + +⟶ + +
+ +**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** + +⟶ + +
+ +**56. Gaussian Discriminant Analysis** + +⟶ + +
+ +**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** + +⟶ + +
+ +**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** + +⟶ + +
+ +**59. Naive Bayes** + +⟶ + +
+ +**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** + +⟶ + +
+ +**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** + +⟶ + +
+ +**62. Remark: Naive Bayes is widely used for text classification and spam detection.** + +⟶ + +
+ +**63. Tree-based and ensemble methods** + +⟶ + +
+ +**64. These methods can be used for both regression and classification problems.** + +⟶ + +
+ +**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.** + +⟶ + +
+ +**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** + +⟶ + +
+ +**67. Remark: random forests are a type of ensemble methods.** + +⟶ + +
+ +**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:** + +⟶ + +
+ +**69. [Adaptive boosting, Gradient boosting]** + +⟶ + +
+ +**70. High weights are put on errors to improve at the next boosting step** + +⟶ + +
+ +**71. Weak learners trained on remaining errors** + +⟶ + +
+ +**72. Other non-parametric approaches** + +⟶ + +
+ +**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** + +⟶ + +
+ +**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** + +⟶ + +
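A minimal sketch of k-NN in the classification setting of entry 73, assuming Euclidean distance and a majority vote among the k nearest training points. Both choices, and the function name, are assumptions of this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (point, label); return the majority label of the
    k training points closest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
```

With k equal to the size of the training set, every query receives the overall majority label, which is the high-bias extreme mentioned in the remark above.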
+ +**75. Learning Theory** + +⟶ + +
+ +**76. Union bound ― Let A1,...,Ak be k events. We have:** + +⟶ + +
+ +**77. Hoeffding inequality ― Let Z1,...,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** + +⟶ + +
+ +**78. Remark: this inequality is also known as the Chernoff bound.** + +⟶ + +
+ +**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** + +⟶ + +
+ +**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:** + +⟶ + +
+ +**81. the training and testing sets follow the same distribution** + +⟶ + +
+ +**82. the training examples are drawn independently** + +⟶ + +
+ +**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** + +⟶ + +
+ +**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** + +⟶ + +
+ +**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.** + +⟶ + +
+ +**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** + +⟶ + +
+ +**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** + +⟶ + +
+ +**88. [Introduction, Type of prediction, Type of model]** + +⟶ + +
+ +**89. [Notations and general concepts, loss function, gradient descent, likelihood]** + +⟶ + +
+ +**90. [Linear models, linear regression, logistic regression, generalized linear models]** + +⟶ + +
+ +**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]** + +⟶ + +
+ +**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]** + +⟶ + +
+ +**93. [Trees and ensemble methods, CART, Random forest, Boosting]** + +⟶ + +
+ +**94. [Other methods, k-NN]** + +⟶ + +
+ +**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]** + +⟶ diff --git a/ca/cheatsheet-unsupervised-learning.md b/ca/cheatsheet-unsupervised-learning.md new file mode 100644 index 000000000..709e72834 --- /dev/null +++ b/ca/cheatsheet-unsupervised-learning.md @@ -0,0 +1,340 @@ +**1. Unsupervised Learning cheatsheet** + +⟶ + +
+ +**2. Introduction to Unsupervised Learning** + +⟶ + +
+ +**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** + +⟶ + +
+ +**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** + +⟶ + +
+ +**5. Clustering** + +⟶ + +
+ +**6. Expectation-Maximization** + +⟶ + +
+ +**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** + +⟶ + +
+ +**8. [Setting, Latent variable z, Comments]** + +⟶ + +
+ +**9. [Mixture of k Gaussians, Factor analysis]** + +⟶ + +
+ +**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** + +⟶ + +
+ +**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** + +⟶ + +
+ +**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** + +⟶ + +
+ +**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** + +⟶ + +
+ +**14. k-means clustering** + +⟶ + +
+ +**15. We note c(i) the cluster of data point i and μj the center of cluster j.** + +⟶ + +
+ +**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** + +⟶ + +
+ +**17. [Means initialization, Cluster assignment, Means update, Convergence]** + +⟶ + +
+ +**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** + +⟶ + +
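The k-means loop of entry 16 and the distortion function of entry 18 can be sketched in one dimension. The update formulas are not reproduced in this template, so this assumes the usual rules: assign each point to its closest centroid, move each centroid to the mean of its points, and measure J = Σ (x(i) − μc(i))². Initial centroids are passed in explicitly rather than drawn at random so that the example is reproducible.

```python
def k_means(points, centroids, iters=100):
    """1-D k-means; returns (centroids, assignments, distortion)."""
    centroids = list(centroids)
    assign = []
    for _ in range(iters):
        # Cluster assignment: c(i) is the index of the closest centroid.
        assign = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points
        ]
        # Means update: each centroid moves to the mean of its points.
        new = []
        for j, mu in enumerate(centroids):
            members = [x for x, c in zip(points, assign) if c == j]
            new.append(sum(members) / len(members) if members else mu)
        if new == centroids:  # Convergence: nothing moved.
            break
        centroids = new
    # Distortion: sum of squared distances to the assigned centroid.
    distortion = sum((x - centroids[c]) ** 2 for x, c in zip(points, assign))
    return centroids, assign, distortion


centers, labels, J = k_means([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

Neither step can increase the distortion, so tracking J across iterations is a simple way to check that the algorithm is converging.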
+ +**19. Hierarchical clustering** + +⟶ + +
+ +**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** + +⟶ + +
+ +**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:** + +⟶ + +
+ +**22. [Ward linkage, Average linkage, Complete linkage]** + +⟶ + +
+ +**23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** + +⟶ + +
+ +**24. Clustering assessment metrics** + +⟶ + +
+ +**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** + +⟶ + +
+ +**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** + +⟶ + +
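As a small illustration of the definition above, the silhouette coefficient of one sample can be computed as s = (b − a) / max(a, b). The sketch below assumes Euclidean distance, and the helper name `silhouette` and the toy points are hypothetical.

```python
import math


def silhouette(sample, own_cluster, nearest_cluster):
    """Silhouette coefficient of one sample.

    own_cluster: the other points of the sample's cluster (sample excluded).
    nearest_cluster: the points of the next nearest cluster.
    """
    a = sum(math.dist(sample, p) for p in own_cluster) / len(own_cluster)
    b = sum(math.dist(sample, p) for p in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)


s = silhouette((0, 0), [(0, 1), (1, 0)], [(10, 0), (0, 10)])
print(s)  # a = 1 and b = 10 here, so s is close to 1 (well-clustered sample)
```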
+ +**27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** + +⟶ + +
+ +**28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** + +⟶ + +
+ +**29. Dimension reduction** + +⟶ + +
+ +**30. Principal component analysis** + +⟶ + +
+ +**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** + +⟶ + +
+ +**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ + +
+ +**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** + +⟶ + +
+ +**34. diagonal** + +⟶ + +
+ +**35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.** + +⟶ + +
+ +**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** + +⟶ + +
+ +**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** + +⟶ + +
+ +**38. Step 2: Compute $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^T\in\mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.** + +⟶ + +
+ +**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** + +⟶ + +
+ +**40. Step 4: Project the data on spanR(u1,...,uk).** + +⟶ + +
+ +**41. This procedure maximizes the variance among all k-dimensional spaces.** + +⟶ + +
+ +**42. [Data in feature space, Find principal components, Data in principal components space]** + +⟶ + +
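The four PCA steps above can be sketched for the simplest case k = 1. This is only an illustration under stated assumptions: the principal eigenvector is found by power iteration (a technique swapped in here, not one named in the cheatsheet), every feature is assumed to have non-zero standard deviation, and the start vector is assumed not to be orthogonal to the principal eigenvector. All function names are hypothetical.

```python
import math


def normalize(data):
    """Step 1: give every feature mean 0 and standard deviation 1."""
    m = len(data)
    cols = list(zip(*data))
    means = [sum(c) / m for c in cols]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in c) / m)
            for c, mu in zip(cols, means)]
    return [tuple((v - mu) / sd for v, mu, sd in zip(row, means, stds))
            for row in data]


def covariance(data):
    """Step 2: Sigma = (1/m) * sum_i x(i) x(i)^T for normalized data."""
    m, n = len(data), len(data[0])
    return [[sum(row[a] * row[b] for row in data) / m for b in range(n)]
            for a in range(n)]


def principal_eigenvector(matrix, iters=200):
    """Step 3 for k = 1: power iteration towards the top eigenvector."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(data, u):
    """Step 4: coordinates of each sample along the direction u."""
    return [sum(x * ui for x, ui in zip(row, u)) for row in data]


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
z = normalize(data)
u = principal_eigenvector(covariance(z))
print(u, project(z, u))
```

For k > 1, a library eigendecomposition routine is the usual choice instead of repeated power iteration.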
+ +**43. Independent component analysis** + +⟶ + +
+ +**44. It is a technique meant to find the underlying generating sources.** + +⟶ + +
+ +**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** + +⟶ + +
+ +**46. The goal is to find the unmixing matrix W=A−1.** + +⟶ + +
+ +**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** + +⟶ + +
+ +**48. Write the probability of x=As=W−1s as:** + +⟶ + +
+ +**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** + +⟶ + +
+ +**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** + +⟶ + +
+ +**51. The Machine Learning cheatsheets are now available in Spanish.** + +⟶ + +
+ +**52. Original authors** + +⟶ + +
+ +**53. Translated by X, Y and Z** + +⟶ + +
+ +**54. Reviewed by X, Y and Z** + +⟶ + +
+ +**55. [Introduction, Motivation, Jensen's inequality]** + +⟶ + +
+ +**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** + +⟶ + +
+ +**57. [Dimension reduction, PCA, ICA]** + +⟶ diff --git a/ca/refresher-linear-algebra.md b/ca/refresher-linear-algebra.md new file mode 100644 index 000000000..8ca056b6a --- /dev/null +++ b/ca/refresher-linear-algebra.md @@ -0,0 +1,341 @@ +**1. Linear Algebra and Calculus refresher** + +⟶ Repàs d'Àlgebra lineal i càlcul + +
+ +**2. General notations** + +⟶ Notacions Generals + +
+ +**3. Definitions** + +⟶ Definicions + +
+ +**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** + +⟶ Vector ― Siga x∈Rn un vector amb n entrades, on xi∈R és la i-èsima entrada: + +
+ +**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** + +⟶ Matriu ― Siga A∈Rm×n una matriu amb m files i n columnes, on Ai,j∈R és el valor situat a la i-èsima fila i la j-èsima columna: + +
+ +**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** + +⟶ Nota: el vector x definit dalt pot ser vist com una matriu d'n×1 i s'anomena més concretament vector columna. + +
+ +**7. Main matrices** + +⟶ Matrius principals + +
+ +**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** + +⟶ Matriu identitat ― La matriu identitat I∈Rn×n és una matriu quadrada amb uns a la seva diagonal i zeros a la resta: + +
+ +**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** + +⟶ Nota: per a totes les matrius A∈Rn×n, tenim A×I=I×A=A. + +
+ +**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** + +⟶ Matriu diagonal ― Una matriu diagonal D∈Rn×n és una matriu quadrada amb valors diferents de zero a la seva diagonal i zeros a la resta: + +
+ +**11. Remark: we also note D as diag(d1,...,dn).** + +⟶ Nota: També denotem D com diag(d1,...,dn). + +
+ +**12. Matrix operations** + +⟶ Operacions de matrius + +
+ +**13. Multiplication** + +⟶ Multiplicació + +
+ +**14. Vector-vector ― There are two types of vector-vector products:** + +⟶ Vector-vector ― Hi ha dos tipus de multiplicacions vector-vector: + +
+ +**15. inner product: for x,y∈Rn, we have:** + +⟶ producte intern: per a x,y∈Rn, es té: + +
+ +**16. outer product: for x∈Rm,y∈Rn, we have:** + +⟶ producte extern: per a x∈Rm,y∈Rn, es té: + +
+ +**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that** + +⟶ Matriu-vector ― El producte de la matriu A∈Rm×n i el vector x∈Rn és un vector de mida Rm, tal que: + +
+ +**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** + +⟶ on aTr,i són els vectors fila i ac,j són els vectors columna d'A, i xi són les entrades d'x. + +
+ +**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** + +⟶ Matriu-matriu ― El producte de les matrius A∈Rm×n i B∈Rn×p és una matriu de mida Rm×p, tal que: + +
+ +**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** + +⟶ on aTr,i,bTr,i són els vectors fila i ac,j,bc,j els vectors columna d'A i B respectivament + +
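To make the shapes of these products concrete, here is a minimal sketch of matrix multiplication on lists of rows. The function name `matmul` is an assumption for illustration. Entry (i, j) of the result is the inner product of row i of A with column j of B, so an m×n matrix times an n×p matrix has m rows and p columns.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix, both given as lists of rows."""
    m, n, p = len(a), len(b), len(b[0])
    if any(len(row) != n for row in a):
        raise ValueError("inner dimensions do not match")
    # Entry (i, j) is the inner product of row i of a with column j of b.
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
x = [[1], [0], [2]]    # 3 x 1 column vector
print(matmul(A, x))    # a 2 x 1 result: one entry per row of A
```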
+ +**21. Other operations** + +⟶ Altres operacions + +
+ +**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** + +⟶ Transposada ― La transposada de la matriu A∈Rm×n, denotada AT, és tal que les seves entrades estan voltejades: + +
+ +**23. Remark: for matrices A,B, we have (AB)T=BTAT** + +⟶ Nota: per a matrius A,B, es té (AB)T=BTAT + +
+ +**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** + +⟶ Inversa ― La inversa d'una matriu quadrada invertible A es denota per A−1 i és l'única matriu tal que: + +
+ +**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** + +⟶ Nota: no totes les matrius quadrades són invertibles. A més, per a les matrius A,B, es té que (AB)−1=B−1A−1 + +
+ +**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** + +⟶ Traça ― La traça d'una matriu quadrada A, denotada tr(A), és la suma dels elements de la seva diagonal: + +
+ +**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** + +⟶ Nota: per a matrius A,B, es té tr(AT)=tr(A) i tr(AB)=tr(BA) + +
+ +**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** + +⟶ Determinant ― El determinant d'una matriu quadrada A∈Rn×n, denotat per |A| o det(A), s'expressa recursivament en termes d'A∖i,∖j, que és la matriu A sense la seva i-èsima fila i j-èsima columna, com es mostra: + +
+ +**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** + +⟶ Nota: A té inversa si i sols si |A|≠0. A més, |AB|=|A||B| i |AT|=|A|. + +
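The recursive definition of the determinant can be sketched as a cofactor expansion along the first row. This is an illustration only: the function names are hypothetical, and the approach takes factorial time, so it suits small matrices rather than production use.

```python
def minor(a, i, j):
    """The matrix a without its ith row and jth column (0-based indices)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j))
               for j in range(len(a)))


A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 2]]
AB = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
At = [list(col) for col in zip(*A)]

print(det(A), det(B), det(AB))  # |AB| equals |A| * |B|
print(det(At))                  # |A^T| equals |A|
```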
+ +**30. Matrix properties** + +⟶ Propietats de matrius + +
+ +**31. Definitions** + +⟶ Definicions + +
+ +**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** + +⟶ Descomposició simètrica ― Una matriu A pot ser expressada en termes de les seves parts simètrica i antisimètrica, com es mostra: + +
+ +**33. [Symmetric, Antisymmetric]** + +⟶ [Simètrica, Antisimètrica] + +
+ +**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** + +⟶ Norma ― La norma o mòdul és una funció N:V⟶[0,+∞[ on V és un espai vectorial, tal que per a tots els x,y∈V, es té: + +
+ +**35. N(ax)=|a|N(x) for a scalar** + +⟶ N(ax)=|a|N(x) per a un escalar + +
+ +**36. if N(x)=0, then x=0** + +⟶ si N(x)=0, aleshores x=0 + +
+ +**37. For x∈V, the most commonly used norms are summed up in the table below:** + +⟶ Per a x∈V, els mòduls o normes més comunament utilitzats estan descrits en la taula inferior: + +
+ +**38. [Norm, Notation, Definition, Use case]** + +⟶ [Norma, Notació, Definició, Cas d'ús] + +
+ +**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** + +⟶ Dependència lineal ― Es diu que un conjunt de vectors és linealment dependent si un dels vectors del conjunt pot ser definit com una combinació lineal dels altres. + +
+ +**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** + +⟶ Nota: si cap vector no es pot escriure d'aquesta forma, aleshores es diu que els vectors són linealment independents + +
+ +**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** + +⟶ Rang matricial ― El rang d'una matriu A, denotat per rank(A), és la dimensió de l'espai vectorial generat per les seves columnes. És equivalent al nombre màxim de columnes linealment independents d'A. + +
+ +**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** + +⟶ Matriu semidefinida positiva ― Una matriu A∈Rn×n és semidefinida positiva (PSD) i es denota per A⪰0 si: + +
+ +**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** + +⟶ Nota: d'igual forma, una matriu A es diu definida positiva, i es denota A≻0, si és una matriu PSD que satisfà, per a tot vector x diferent de zero, xTAx>0. + +
+ +**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ Eigenvalor, eigenvector ― Donada una matriu A∈Rn×n, es diu que λ és un eigenvalor d'A si existeix un vector z∈Rn∖{0}, anomenat eigenvector, tal que es té: + +
+ +**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** + +⟶ Teorema espectral ― Siga A∈Rn×n. Si A és simètrica, aleshores A és diagonalitzable per una matriu real ortogonal U∈Rn×n. Denotant Λ=diag(λ1,...,λn), es té que: + +
+ +**46. diagonal** + +⟶ diagonal + +
+ +**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** + +⟶ Descomposició en valors singulars ― Per a una matriu A de dimensions m×n, la descomposició en valors singulars (SVD) és una tècnica de factorització que garantitza que existeixen matrius U m×m unitària, Σ m×n diagonal i V n×n unitària, tals que: + +
+ +**48. Matrix calculus** + +⟶ Càlcul de matrius + +
+ +**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** + +⟶ Gradient ― Siga f:Rm×n→R una funció i A∈Rm×n una matriu. El gradient d'f respecte a A és una matriu de m×n, denotada ∇Af(A), tal que: + +
+ +**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** + +⟶ Nota: el gradient d'f està sols definit quan f és una funció que retorna un escalar. + +
+ +**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** + +⟶ Matriu Hessiana ― Siga f:Rn→R una funció i x∈Rn un vector. La matriu hessiana o hessià d'f respecte a x és una matriu simètrica de n×n, denotada ∇2xf(x), tal que: + +
+ +**52. Remark: the hessian of f is only defined when f is a function that returns a scalar** + +⟶ Nota: la matriu hessiana d'f sols està definida quan f és una funció que retorna un escalar + +
+ +**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** + +⟶ Operacions de gradient ― Per a matrius A,B,C, val la pena tindre en compte les següents propietats del gradient: + +
+ +**54. [General notations, Definitions, Main matrices]** + +⟶ [Notacions generals, Definicions, Matrius principals] + +
+ +**55. [Matrix operations, Multiplication, Other operations]** + +⟶ [Operacions matricials, Multiplicació, Altres operacions] + +
+ +**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** + +⟶ [Propietats matricials, Norma, Eigenvalor/Eigenvector, Descomposició de valors singulars] + +
+ +**57. [Matrix calculus, Gradient, Hessian, Operations]** + +⟶ [Càlcul matricial, Gradient, Matriu Hessiana, Operacions] diff --git a/ca/refresher-probability.md b/ca/refresher-probability.md new file mode 100644 index 000000000..d81fef1c9 --- /dev/null +++ b/ca/refresher-probability.md @@ -0,0 +1,381 @@ +**1. Probabilities and Statistics refresher** + +⟶ Repàs de Probabilitat i Estadística + +
+ +**2. Introduction to Probability and Combinatorics** + +⟶ Introducció a la Probabilitat i Combinatòria + +
+ +**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** + +⟶ Espai mostral - El conjunt de tots els possibles resultats d'un experiment és conegut com l'espai mostral de l'experiment i es denota per S. + +
+ +**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** + +⟶ Event - Qualsevol subconjunt E de l'espai mostral és conegut com un event. Açò significa que un event és un conjunt de possibles resultats d'un experiment. Si el resultat d'un experiment està contingut en E, aleshores diem que l'event E ha ocorregut. + +
+ +**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.** + +⟶ Axiomes de la probabilitat - Per a cada event E, denotem P(E) com la probabilitat que l'event E ocorrega. + +
+ +**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e.:** + +⟶ Axioma 1 - Cada probabilitat té valors entre 0 i 1 inclosos, és a dir: + +
+ +**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e.:** + +⟶ Axioma 2 - La probabilitat que almenys un dels events elementals de tot l'espai mostral ocorrega és 1, és a dir: + +
+ +**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** + +⟶ Axioma 3 - Per a qualsevol seqüència d'events mútuament excloents E1,...,En, es té: + +
+ +**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** + +⟶ Permutació - Una permutació és un arranjament d'r objectes d'un grup d'n objectes, en un ordre donat. El nombre d'aquests arranjaments ve donat per P(n,r), definit com: + +
+ +**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** + +⟶ Combinació - Una combinació és un arranjament d'r objectes d'un grup d'n objectes, on l'ordre no importa. El nombre d'aquests arranjaments ve donat per C(n,r), definit com: + +
+ +**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** + +⟶ Observació: cal ressaltar que per a 0⩽r⩽n, es té P(n,r)⩾C(n,r) + +
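As a quick numerical check of the two counting formulas, the sketch below uses the standard definitions P(n,r) = n!/(n−r)! and C(n,r) = n!/(r!(n−r)!). The function names are chosen here for illustration.

```python
from math import factorial


def permutations(n, r):
    """Number of ordered arrangements of r objects out of n: n! / (n - r)!"""
    return factorial(n) // factorial(n - r)


def combinations(n, r):
    """Number of unordered selections of r objects out of n: n! / (r! (n - r)!)"""
    return factorial(n) // (factorial(r) * factorial(n - r))


print(permutations(5, 2), combinations(5, 2))
# Each combination corresponds to r! permutations, hence P(n, r) >= C(n, r).
print(all(permutations(5, r) >= combinations(5, r) for r in range(6)))
```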
+ +**12. Conditional Probability** + +⟶ Probabilitat Condicionada + +
+ +**13. Bayes' rule ― For events A and B such that P(B)>0, we have:** + +⟶ Regla de Bayes - Per a events A i B tal que P(B)>0, es té: + +
+ +**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** + +⟶ Observació: Es té P(A∩B)=P(A)P(B|A)=P(A|B)P(B) + +
+ +**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** + +⟶ Partició - Siga {Ai,i∈[[1,n]]} tal que per a tot i, Ai≠∅. Es diu que {Ai} es una partició si es compleix: + +
+ +**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** + +⟶ Observació: Per a qualsevol event B de l'espai mostral, es compleix P(B)=n∑i=1P(B|Ai)P(Ai). + +
+ +**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** + +⟶ Regla de Bayes estesa - Siga {Ai,i∈[[1,n]]} una partició de l'espai mostral. Es compleix: + +
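The extended form of Bayes' rule, P(Ak|B) = P(B|Ak)P(Ak) / Σi P(B|Ai)P(Ai), can be sketched numerically as follows. The function name `posterior` and the probabilities in the example are made up for illustration.

```python
def posterior(priors, likelihoods):
    """Posterior P(A_i | B) for each part A_i of a partition.

    priors[i] is P(A_i), likelihoods[i] is P(B | A_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) P(A_i)
    evidence = sum(joint)                                 # P(B), by total probability
    return [j / evidence for j in joint]


# Hypothetical numbers: a rare condition A_1 (prior 1%) and its complement A_2.
post = posterior(priors=[0.01, 0.99], likelihoods=[0.95, 0.05])
print(post)
```

The denominator is exactly the total-probability sum from the remark on partitions, so the posteriors always add up to 1.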
+ +**18. Independence ― Two events A and B are independent if and only if we have:** + +⟶ Independència - Dos events A i B són independents si i sols si es compleix: + +
+ +**19. Random Variables** + +⟶ Variables Aleatòries + +
+ +**20. Definitions** + +⟶ Definicions + +
+ +**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to the real line.** + +⟶ Variable aleatòria - Una variable aleatòria, generalment denotada per X, és una funció que associa cada element d'un espai mostral a la recta real. + +
+ +**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** + +⟶ Funció de distribució acumulada (FDA) - La funció de distribució acumulada F, la qual és monòtonament no decreixent i és tal que limx→−∞F(x)=0 i limx→+∞F(x)=1, es defineix com: + +
+ +**23. Remark: we have P(a < X ⩽ b) = F(b) − F(a).** + +⟶ Observació: es té P(a < X ⩽ b) = F(b) − F(a). + +
+ +**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** + +⟶ Funció de densitat de probabilitat (FDP) - La funció de densitat de probabilitat f és la probabilitat que X prenga valors entre dues ocurrències adjacents de la variable aleatòria. + +
+ +**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** + +⟶ Relacions entre la FDP i la FDA - Aquestes són les propietats més importants a conèixer en els casos discret (D) i continu (C). + +
+ +**26. [Case, CDF F, PDF f, Properties of PDF]** + +⟶ [Cas, FDA F, FDP f, Propietats de FDP] + +
+ +**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** + +⟶ Valor esperat i Moments de la Distribució - Ací estan les expressions del valor esperat E[X], valor esperat generalitzat E[g(X)], k-èsim moment E[Xk] i funció característica ψ(ω) per als casos discret i continu: + +
+ +**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** + +⟶ Variància - La variància d'una variable aleatòria, freqüentment denotada per Var(X) o σ2, és una mesura de la dispersió de la seva funció de distribució. Es determina de la següent forma: + +
+ +**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** + +⟶ Desviació estàndard - La desviació estàndard d'una variable aleatòria, freqüentment denotada per σ, és una mesura de la dispersió de la seva funció de distribució que és compatible amb les unitats de la corresponent variable aleatòria. Es determina de la següent forma: + +
+ +**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** + +⟶ Transformació de variables aleatòries - Siguen les variables X i Y associades per alguna funció. Denotant per fX i fY la funció de distribució d'X i Y respectivament, es té: + +
+ +**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** + +⟶ Regla integral de Leibniz - Siga g una funció de x i possiblement de c, i a més a més siga a,b un interval que pot dependre de c. Es té: + +
+ +**32. Probability Distributions** + +⟶ Distribucions de probabilitat + +
+ +**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** + +⟶ Desigualtat de Chebyshev - Siga X una variable aleatòria amb valor esperat μ. Per a k,σ>0, es té la següent desigualtat: + +
+ +**34. Main distributions ― Here are the main distributions to have in mind:** + +⟶ Distribucions importants - Ací estan les distribucions més importants per a tindre en compte: + +
+ +**35. [Type, Distribution]** + +⟶ [Tipus, Distribució] + +
+ +**36. Jointly Distributed Random Variables** + +⟶ Variables aleatòries conjuntes + +
+ +**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** + +⟶ Densitat marginal i distribució acumulada - De la funció conjunta de densitat de probabilitat fXY , es té + +
+ +**38. [Case, Marginal density, Cumulative function]** + +⟶ [Cas, Densitat marginal, Funció acumulativa] + +
+ +**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** + +⟶ Densitat condicional - La densitat condicional d'X respecte a Y, freqüentment denotada com fX|Y, es defineix com: + +
+ +**40. Independence ― Two random variables X and Y are said to be independent if we have:** + +⟶ Independència - Dos variables aleatòries X i Y es consideren independents si es té: + +
+ +**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** + +⟶ Covariància - Definim la covariància de dos variables aleatòries X i Y, denotada com σ2XY o més comunament com Cov(X,Y), de la següent forma: + +
+ +**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** + +⟶ Correlació - Siguen σX,σY les desviacions estàndard d'X i Y, definim la correlació entre aquestes variables, denotada com ρXY, de la següent forma: + +
+ +**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** + +⟶ Observació 1: Ressaltem que per a X,Y variables aleatòries qualsevol, es té que ρXY∈[−1,1]. + +
+ +**44. Remark 2: If X and Y are independent, then ρXY=0.** + +⟶ Observació 2: Si X i Y són independents, aleshores ρXY=0. + +
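The covariance and correlation definitions above can be illustrated on paired observations. This is a sketch under assumptions: the expectations are replaced by averages over equally weighted samples (population-style formulas, dividing by the number of points), and the function names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)], estimated on paired samples."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])


def correlation(x, y):
    """rho_XY = Cov(X, Y) / (sigma_X * sigma_Y), which lies in [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


x = [1, 2, 3, 4]
print(correlation(x, [2, 4, 6, 8]))  # perfectly increasing linear relation
print(correlation(x, [8, 6, 4, 2]))  # perfectly decreasing linear relation
```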
+ +**45. Parameter estimation** + +⟶ Estimació de paràmetres + +
+ +**46. Definitions** + +⟶ Definicions + +
+ +**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** + +⟶ Mostra aleatòria - Una mostra aleatòria és una col·lecció d'n variables aleatòries X1,...,Xn que són independents i idènticament distribuïdes amb X. + +
+ +**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** + +⟶ Estimador - Un estimador és una funció de les dades que s'utilitza per a inferir el valor d'un paràmetre desconegut en un model estadístic. + +
+ +**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** + +⟶ Biaix - El biaix d'un estimador ^θ es defineix com la diferència entre el valor esperat de la distribució de ^θ i el valor exacte, és a dir: + +
+ +**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** + +⟶ Observació: es diu que un estimador és no esbiaixat quan es té E[^θ]=θ. + +
+ +**51. Estimating the mean** + +⟶ Estimació de la mitjana + +
+ +**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:** + +⟶ Mitjana de la mostra - La mitjana d'una mostra aleatòria s'utilitza per a estimar el valor exacte de la mitjana μ de la distribució, es denota freqüentment com ¯¯¯¯¯X i es defineix de la següent forma: + +
+ +**53. Remark: the sample mean is unbiased, i.e. E[¯¯¯¯¯X]=μ.** + +⟶ Observació: La mitjana de la mostra és no esbiaixada, és a dir E[¯¯¯¯¯X]=μ. + +
+ +**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** + +⟶ Teorema Central del Límit - Siguen X1,...,Xn una mostra aleatòria que segueix una distribució amb mitjana μ i variància σ2, aleshores es té: + +
+ +**55. Estimating the variance** + +⟶ Estimació de la variància + +
+ +**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** + +⟶ Variància de la mostra - La variància de la mostra aleatòria s'utilitza per a estimar el valor exacte de la variància σ2 d'una distribució, es denota freqüentment com s2 o ^σ2 i es defineix de la següent forma: + +
+ +**57. Remark: the sample variance is unbiased, i.e. E[s2]=σ2.** + +⟶ Observació: La variància de la mostra és no esbiaixada, és a dir E[s2]=σ2. + +
+ +**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** + +⟶ Relació Chi-quadrat amb la variància de la mostra - Siga s2 la variància de la mostra d'una mostra aleatòria. Es té: + +
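For a concrete feel of the two estimators, the sketch below computes the sample mean and the sample variance with the n − 1 denominator, the form for which the unbiasedness remark holds. It then compares them with Python's standard `statistics` module. The helper names and the data are illustrative assumptions.

```python
import statistics


def sample_mean(xs):
    """X-bar = (1/n) * sum of X_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """s^2 = (1/(n-1)) * sum of (X_i - X-bar)^2."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_mean(data), statistics.mean(data))
print(sample_variance(data), statistics.variance(data))
```

Dividing by n instead of n − 1 (as `statistics.pvariance` does) gives the population variance of the data, which underestimates σ2 on average when used as an estimator.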
+ +**59. [Introduction, Sample space, Event, Permutation]** + +⟶ [Introducció, Espai mostral, Event, Permutació] + +
+ +**60. [Conditional probability, Bayes' rule, Independence]** + +⟶ [Probabilitat condicionada, Regla de Bayes, Independència] + +
+ +**61. [Random variables, Definitions, Expectation, Variance]** + +⟶ [Variables aleatòries, Definicions, Valor esperat, Variància] + +
+ +**62. [Probability distributions, Chebyshev's inequality, Main distributions]** + +⟶ [Distribucions de probabilitat, Desigualtat de Chebyshev, Principals distribucions] + +
+ +**63. [Jointly distributed random variables, Density, Covariance, Correlation]** + +⟶ [Variables aleatòries distribuïdes conjuntament, Densitat, Covariància, Correlació] + +
+ +**64. [Parameter estimation, Mean, Variance]** + +⟶ [Estimació de Paràmetres, Mitjana, Variància]