diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index a4a83d86a..0516e623f 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -121,7 +121,7 @@
--ko
Soyoung Lee (translation of convolutional neural networks)
Jack Kang (review of convolutional neural networks)
-
+
Soyoung Lee (translation of linear algebra)
Wooil Jeong (translation of machine learning tips and tricks)
diff --git a/ar/cheatsheet-supervised-learning.md b/ar/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/ar/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Supervised Learning**
-
-⟶
-
-
-
-**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-⟶
-
-
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-⟶
-
-
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-⟶
-
-
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-⟶
-
-
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-⟶
-
-
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-⟶
-
-
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-⟶
-
-
-
-**10. Notations and general concepts**
-
-⟶
-
-
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-⟶
-
-
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-⟶
-
-
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-⟶
-
-
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-⟶
-
-
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-⟶
-
-
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-⟶
-
-
-
-**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
-
-⟶
-
-
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-⟶
-
-
-
-**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-⟶
-
-
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-⟶
-
-
-
-**21. Linear models**
-
-⟶
-
-
-
-**22. Linear regression**
-
-⟶
-
-
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-⟶
-
-
-
-**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**
-
-⟶
-
-
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-⟶
-
-
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-⟶
-
-
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-⟶
-
-
-
-**28. Classification and logistic regression**
-
-⟶
-
-
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-⟶
-
-
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-⟶
-
-
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-⟶
-
-
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-⟶
-
-
-
-**33. Generalized Linear Models**
-
-⟶
-
-
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-⟶
-
-
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-⟶
-
-
-
-**36. Here are the most common exponential distributions summed up in the following table:**
-
-⟶
-
-
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-⟶
-
-
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:**
-
-⟶
-
-
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-⟶
-
-
-
-**40. Support Vector Machines**
-
-⟶
-
-
-
-**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
-
-⟶
-
-
-
-**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-⟶
-
-
-
-**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
-
-⟶
-
-
-
-**44. such that**
-
-⟶
-
-
-
-**45. support vectors**
-
-⟶
-
-
-
-**46. Remark: the line is defined as wTx−b=0.**
-
-⟶
-
-
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-⟶
-
-
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
-
-⟶
-
-
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**
-
-⟶
-
-
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-⟶
-
-
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-⟶
-
-
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-⟶
-
-
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-⟶
-
-
-
-**54. Generative Learning**
-
-⟶
-
-
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-⟶
-
-
-
-**56. Gaussian Discriminant Analysis**
-
-⟶
-
-
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-⟶
-
-
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-⟶
-
-
-
-**59. Naive Bayes**
-
-⟶
-
-
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-⟶
-
-
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-⟶
-
-
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-⟶
-
-
-
-**63. Tree-based and ensemble methods**
-
-⟶
-
-
-
-**64. These methods can be used for both regression and classification problems.**
-
-⟶
-
-
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
-
-⟶
-
-
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-⟶
-
-
-
-**67. Remark: random forests are a type of ensemble methods.**
-
-⟶
-
-
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-⟶
-
-
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-⟶
-
-
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-⟶
-
-
-
-**71. Weak learners trained on remaining errors**
-
-⟶
-
-
-
-**72. Other non-parametric approaches**
-
-⟶
-
-
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-⟶
-
-
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-⟶
-
-
-
-**75. Learning Theory**
-
-⟶
-
-
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-⟶
-
-
-
-**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
-
-⟶
-
-
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-⟶
-
-
-
-**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
-
-⟶
-
-
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
-
-⟶
-
-
-
-**81: the training and testing sets follow the same distribution **
-
-⟶
-
-
-
-**82. the training examples are drawn independently**
-
-⟶
-
-
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-⟶
-
-
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-⟶
-
-
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-⟶
-
-
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-⟶
-
-
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-⟶
-
-
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-⟶
-
-
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-⟶
-
-
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-⟶
-
-
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-⟶
-
-
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-⟶
-
-
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-⟶
-
-
-
-**94. [Other methods, k-NN]**
-
-⟶
-
-
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-⟶
diff --git a/ar/cheatsheet-unsupervised-learning.md b/ar/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index 1d80c47b5..000000000
--- a/ar/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Unsupervised Learning**
-
-⟶
-
-
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-⟶
-
-
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-⟶
-
-
-
-**5. Clustering**
-
-⟶
-
-
-
-**6. Expectation-Maximization**
-
-⟶
-
-
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-⟶
-
-
-
-**8. [Setting, Latent variable z, Comments]**
-
-⟶
-
-
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-⟶
-
-
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-⟶
-
-
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-⟶
-
-
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-⟶
-
-
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-⟶
-
-
-
-**14. k-means clustering**
-
-⟶
-
-
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-⟶
-
-
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-⟶
-
-
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-⟶
-
-
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-⟶
-
-
-
-**19. Hierarchical clustering**
-
-⟶
-
-
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
-
-⟶
-
-
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
-
-⟶
-
-
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-⟶
-
-
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
-
-⟶
-
-
-
-**24. Clustering assessment metrics**
-
-⟶
-
-
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-⟶
-
-
-
-**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-⟶
-
-
-
-**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-⟶
-
-
-
-**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-⟶
-
-
-
-**29. Dimension reduction**
-
-⟶
-
-
-
-**30. Principal component analysis**
-
-⟶
-
-
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-⟶
-
-
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**34. diagonal**
-
-⟶
-
-
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-⟶
-
-
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-⟶
-
-
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-⟶
-
-
-
-**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-
-⟶
-
-
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-⟶
-
-
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-⟶
-
-
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-⟶
-
-
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-⟶
-
-
-
-**43. Independent component analysis**
-
-⟶
-
-
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-⟶
-
-
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-⟶
-
-
-
-**46. The goal is to find the unmixing matrix W=A−1.**
-
-⟶
-
-
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-⟶
-
-
-
-**48. Write the probability of x=As=W−1s as:**
-
-⟶
-
-
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-⟶
-
-
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-⟶
-
-
-
-**51. The Machine Learning cheatsheets are now available in Arabic.**
-
-⟶
-
-
-
-**52. Original authors**
-
-⟶
-
-
-
-**53. Translated by X, Y and Z**
-
-⟶
-
-
-
-**54. Reviewed by X, Y and Z**
-
-⟶
-
-
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-**57. [Dimension reduction, PCA, ICA]**
-
-⟶
diff --git a/ar/cs-229-deep-learning.md b/ar/cs-229-deep-learning.md
new file mode 100644
index 000000000..197538d2b
--- /dev/null
+++ b/ar/cs-229-deep-learning.md
@@ -0,0 +1,431 @@
+
+**1. Deep Learning cheatsheet**
+
+⟶
+
+ملخص مختصر التعلم العميق
+
+
+
+**2. Neural Networks**
+
+⟶
+
+الشبكة العصبونية الاصطناعية(Neural Networks)
+
+
+**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
+
+⟶
+
+الشبكة العصبونية الاصطناعية هي عبارة عن نوع من النماذج يبنى من عدة طبقات، اكثر هذه الانواع استخداما هي الشبكات العصبونية الالتفافية و الشبكات العصبونية التكرارية.
+
+
+
+
+**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
+
+⟶
+
+البنية - المصطلحات حول بنية الشبكة العصبونية موضح في الشكل ادناة
+
+
+
+**5. [Input layer, hidden layer, output layer]**
+
+⟶
+
+[طبقة ادخال, طبقة مخفية, طبقة اخراج ]
+
+
+
+**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶
+
+عبر تدوين i كالطبقة رقم i و j للدلالة على رقم الوحده الخفية في تلك الطبقة , نحصل على:
+
+
+
+**7. where we note w, b, z the weight, bias and output respectively.**
+
+⟶
+
+حيث نعرف w, b, z كالوزن , و معامل التعديل , و الناتج حسب الترتيب.
+
+
+
+**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
+
+⟶
+
+دالة التفعيل(Activation function) - دالة التفعيل تستخدم في نهاية الوحده الخفية لتضمن المكونات الغير خطية للنموذج. هنا بعض دوال التفعيل الشائعة
+
+
+
+**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
+
+⟶
+
+[Sigmoid, Tanh, ReLU, Leaky ReLU]
+
+
+
+**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶
+
+دالة الانتروبيا التقاطعية للخسارة(Cross-entropy loss) - في سياق الشبكات العصبونية, دالة الأنتروبيا L(z,y) تستخدم و تعرف كالاتي:
+
+
+
+**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶
+
+معدل التعلم (Learning rate) - معدل التعلم، يرمز له عادة ب α او احيانا ب η، و هو مؤشر على السرعة التي يتم بها تحديث الاوزان. يمكن تثبيت هذا المعامل او تحديثه بشكل تأقلمي. حاليا اكثر الطرق شيوعا تدعى Adam، وهي طريقة تغير معدل التعلم بشكل تأقلمي.
+
+
+
+**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
+
+⟶
+
+التغذية الخلفية(Backpropagation) - التغذية الخلفية هي طريقة لتحديث الاوزان في الشبكة العصبونية عبر اعتبار القيم الحقيقة للناتج مع القيمة المطلوبة للخرج. المشتقة بالنسبة للوزن w يتم حسابها باستخدام قاعدة التسلسل و تكون عبر الشكل الاتي:
+
+
+
+**13. As a result, the weight is updated as follows:**
+
+⟶
+
+كنتيجة , الوزن سيتم تحديثة كالتالي:
+
+
+
+**14. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶
+
+تحديث الاوزان - في الشبكات العصبونية , يتم تحديث الاوزان كما يلي:
+
+
+
+**15. Step 1: Take a batch of training data.**
+
+⟶
+
+الخطوة 1: خذ حزمة من بيانات التدريب
+
+
+
+**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
+
+⟶
+
+الخطوة 2: قم بعملية التغذيه الامامية لحساب الخسارة الناتجة
+
+
+
+**17. Step 3: Backpropagate the loss to get the gradients.**
+
+⟶
+
+الخطوة 3: قم بتغذية خلفية للخسارة للحصول على قيم الانحدار
+
+
+
+**18. Step 4: Use the gradients to update the weights of the network.**
+
+⟶
+
+الخطوة 4: استخدم قيم الانحدار لتحديث اوزان الشبكة
+
+
+
+**19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
+
+⟶
+
+الاسقاط (Dropout) - الاسقاط هي طريقة الغرض منها منع التكيف الزائد للنموذج مع بيانات التدريب عبر اسقاط بعض الوحدات في الشبكة العصبونية. عمليا، العصبونات يتم اما اسقاطها باحتمالية p او الحفاظ عليها باحتمالية 1−p.
+
+
+
+**20. Convolutional Neural Networks**
+
+⟶
+
+الشبكات العصبونية الالتفافية(CNN)
+
+
+
+**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
+
+⟶
+
+احتياج الطبقة الالتفافية - عبر رمز W لحجم المدخل، F حجم العصبونات للطبقة الالتفافية، P عدد الحشوات الصفرية، فأن N عدد العصبونات لكل حجم معطى يحسب عبر الاتي:
+
+
+
+**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+⟶
+
+تنظيم الحزمة (Batch normalization) - هي خطوة من قيم التحسين الخاصة γ,β والتي تعدل الحزمة {xi}. لنجعل μB,σ2B المتوسط و التباين للحزمة التي نريد تصحيحها، يتم ذلك كالتالي:
+
+
+
+**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶
+
+في الغالب تتم بعد الطبقة الالتفافية أو المتصلة كليا و قبل طبقة التغيرات الغير خطية، و تهدف للسماح بمعدلات تعلم اعلى و للتقليل من الاعتمادية القوية على القيم الاولية.
+
+
+
+
+**24. Recurrent Neural Networks**
+
+⟶
+
+الشبكات العصبونية التكرارية (RNN)
+
+
+
+**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
+
+⟶
+
+انواع البوابات - هنا الانواع المختلفة من البوابات التي ممكن مواجهتها في الشبكة العصبونية التكرارية الاعتيادية:
+
+
+
+**26. [Input gate, forget gate, gate, output gate]**
+
+⟶
+
+[بوابة ادخال, بوابة نسيان, بوابة منفذ, بوابة اخراج ]
+
+
+
+**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
+
+⟶
+
+[كتابة ام عدم كتابة الى الخلية؟, مسح ام عدم مسح الخلية؟, كمية الكتابة الى الخلية ؟ , مدى الافصاح عن الخلية ؟ ]
+
+
+
+**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
+
+⟶
+
+LSTM - ذاكرة طويلة قصير الامد (long short-term memory) هي نوع من نموذج ال RNN تستخدم لتجنب مشكلة اختفاء الانحدار عبر اضافة بوابات النسيان.
+
+
+
+**29. Reinforcement Learning and Control**
+
+⟶
+
+التعلم و التحكم المعزز(Reinforcement Learning)
+
+
+
+**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
+
+⟶
+
+الهدف من التعلم المعزز للعميل الذكي هو التعلم لكيفية التأقلم في اي بيئة.
+
+
+
+**31. Definitions**
+
+⟶
+
+تعريفات
+
+
+
+**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
+
+⟶
+
+عملية ماركوف لاتخاذ القرار - عملية ماركوف لاتخاذ القرار هي سلسلة خماسية (S,A,{Psa},γ,R) حيث
+
+
+**33. S is the set of states**
+
+⟶
+
+ S هي مجموعة من حالات البيئة
+
+
+
+**34. A is the set of actions**
+
+⟶
+
+A هي مجموعة الاجراءات
+
+
+
+**35. {Psa} are the state transition probabilities for s∈S and a∈A**
+
+⟶
+
+{Psa} هي احتماليات انتقال الحالة ل s∈S و a∈A
+
+
+
+**36. γ∈[0,1[ is the discount factor**
+
+⟶
+
+γ∈[0,1[ هي عامل الخصم
+
+
+
+**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
+
+⟶
+
+R:S×A⟶R أو R:S⟶R هي دالة المكافأة والتي تعمل الخوارزمية على جعلها اعلى قيمة
+
+
+
+**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
+
+⟶
+
+دالة القواعد - دالة القواعد π:S⟶A هي التي تقوم بترجمة الحالات الى اجراءات.
+
+
+
+**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
+
+⟶
+
+ملاحظة: نقول اننا ننفذ القاعدة المعينة π اذا كنا، للحالة المعطاة s، نتخذ الاجراء a=π(s).
+
+
+
+**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
+
+⟶
+
+دالة القيمة - لاي قاعدة معطاة π و حالة معطاة s، نقوم بتعريف دالة القيمة Vπ كما يلي:
+
+
+
+**41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:**
+
+⟶
+
+معادلة بيلمان - معادلات بيلمان المثلى تشخص دالة القيمة Vπ∗ للقاعدة المثلى π∗:
+
+
+
+**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
+
+⟶
+
+ملاحظة: نلاحظ ان القاعدة المثلى π∗ للحالة المعطاة s تعطى كالتالي:
+
+
+
+**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
+
+⟶
+
+خوارزمية تكرار القيمة(Value iteration algorithm) - خوارزمية تكرار القيمة تكون في خطوتين:
+
+
+
+**44. 1) We initialize the value:**
+
+⟶
+
+ 1) نقوم بوضع قيمة اولية:
+
+
+
+**45. 2) We iterate the value based on the values before:**
+
+⟶
+
+2) نقوم بتكرير القيمة حسب القيم السابقة:
+
+
+
+**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
+
+⟶
+
+تقدير الامكانية القصوى - تقديرات الامكانية القصوى (تقدير الاحتمال الأرجح) لحتماليات انتقال الحالة تكون كما يلي :
+
+
+
+**47. times took action a in state s and got to s′**
+
+⟶
+
+عدد مرات تنفيذ الاجراء a في الحالة s و الانتقال الى s′
+
+
+
+**48. times took action a in state s**
+
+⟶
+
+عدد مرات تنفيذ الاجراء a في الحالة s
+
+
+
+**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
+
+⟶
+
+التعلم-Q (Q-learning) -هي طريقة غير منمذجة لتقدير Q , و تتم كالاتي:
+
+
+
+**50. View PDF version on GitHub**
+
+⟶
+
+قم باستعراض نسخة ال PDF على GitHub
+
+
+
+**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
+
+⟶
+
+ [شبكات عصبونية, البنية , دالة التفعيل , التغذية الخلفية , الاسقاط ]
+
+
+
+**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
+
+⟶
+
+[ الشبكة العصبونية الالتفافية , طبقة التفافية , تنظيم الحزمة ]
+
+
+
+**53. [Recurrent Neural Networks, Gates, LSTM]**
+
+⟶
+
+[الشبكة العصبونية التكرارية , البوابات , LSTM]
+
+
+
+**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
+
+⟶
+
+[التعلم المعزز , عملية ماركوف لاتخاذ القرار , تكرير القيمة / القاعدة , البرمجة الديناميكية التقريبية , بحث القاعدة]
+
diff --git a/ar/cs-229-linear-algebra.md b/ar/cs-229-linear-algebra.md
new file mode 100644
index 000000000..d0e88a543
--- /dev/null
+++ b/ar/cs-229-linear-algebra.md
@@ -0,0 +1,413 @@
+**1. Linear Algebra and Calculus refresher**
+
+
+ملخص الجبر الخطي و التفاضل و التكامل
+
+
+
+**2. General notations**
+
+الرموز العامة
+
+
+
+
+**3. Definitions**
+
+
+التعريفات
+
+
+
+
+**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
+
+ متجه (vector) - نرمز ل $x \in \mathbb{R^n}$ متجه يحتوي على $n$ مدخلات، حيث $x_i \in \mathbb{R}$ يعتبر المدخل رقم $i$ .
+
+
+
+**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
+
+
+ مصفوفة (Matrix) - نرمز ل $A \in \mathbb{R}^{m\times n}$ مصفوفة تحتوي على $m$ صفوف و $n$ أعمدة، حيث $A_{i,j} \in \mathbb{R}$ يرمز للمدخل في الصف $i$ و العمود $j$ :
+
+
+
+
+**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
+
+ملاحظة : المتجه $x$ المعرف مسبقا يمكن اعتباره مصفوفة من الشكل $n \times 1$ والذي يسمى ب مصفوفة من عمود واحد.
+
+
+
+
+**7. Main matrices**
+
+
+المصفوفات الأساسية
+
+
+
+**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
+
+ مصفوفة الوحدة (Identity) - مصفوفة الوحدة $I \in \mathbb{R}^{n\times n}$ تعتبر مصفوفة مربعة تحتوي على المدخل 1 في قطر المصفوفة و 0 في بقية المدخلات:
+
+
+
+
+**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
+
+
+ملاحظة : لجميع المصفوفات من الشكل $A \in \mathbb{R}^{n\times n}$ فإن $A \times I = I \times A = A$.
+
+
+
+**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
+
+مصفوفة قطرية (diagonal) - المصفوفة القطرية هي مصفوفة من الشكل
+ $D \in \mathbb{R}^{n\times n}$ حيث أن جميع العناصر الواقعة خارج القطر الرئيسي تساوي الصفر والعناصر على القطر الرئيسي تحتوي أعداد لاتساوي الصفر.
+
+
+
+**11. Remark: we also note D as diag(d1,...,dn).**
+
+
+ملاحظة: نرمز كذلك ل $D$ ب $\text{diag}(d_1, \dots, d_n)$.
+
+
+
+**12. Matrix operations**
+
+
+ عمليات المصفوفات
+
+
+
+
+**13. Multiplication**
+
+
+ الضرب
+
+
+
+
+**14. Vector-vector ― There are two types of vector-vector products:**
+
+
+ ضرب المتجهات - توجد طريقتين لضرب متجه بمتجه :
+
+
+
+
+**15. inner product: for x,y∈Rn, we have:**
+
+
+ ضرب داخلي (inner product): ل $x,y \in \mathbb{R}^n$ نستنتج :
+
+
+
+
+**16. outer product: for x∈Rm,y∈Rn, we have:**
+
+
+ ضرب خارجي (outer product): ل $x \in \mathbb{R}^m, y \in \mathbb{R}^n$ نستنتج :
+
+
+
+
+**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
+
+
+ مصفوفة - متجه : ضرب المصفوفة $A \in \mathbb{R}^{m\times n}$ والمتجه $x \in \mathbb{R}^n$ ينتج عنه متجه من الشكل $Ax \in \mathbb{R}^m$ حيث :
+
+
+
+**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
+
+
+ حيث $a^{T}_{r,i}$ يعتبر متجه الصفوف و $a_{c,j}$ يعتبر متجه الأعمدة ل $A$ كذلك $x_i$ يرمز لعناصر $x$.
+
+
+
+
+**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
+
+
+ ضرب مصفوفة ومصفوفة - ضرب المصفوفة $A \in \mathbb{R}^{m \times n}$ و $B \in \mathbb{R}^{n \times p}$ ينتج عنه المصفوفة $AB \in \mathbb{R}^{m \times p}$ حيث أن :
+
+
+
+
+**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
+
+
+حيث $a^T_{r, i}$ و $b^T_{r, i}$ يعتبر متجه الصفوف و $a_{c, j}$ و $b_{c, j}$ متجه الأعمدة ل $A$ و $B$ على التوالي.
+
+
+
+
+**21. Other operations**
+
+
+ عمليات أخرى
+
+
+
+
+**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
+
+
+ المنقول (Transpose) - منقول المصفوفة $A \in \mathbb{R}^{m \times n}$ يرمز له ب $A^T$ حيث الصفوف يتم تبديلها مع الأعمدة :
+
+
+
+
+**23. Remark: for matrices A,B, we have (AB)T=BTAT**
+
+
+ ملاحظة: لأي مصفوفتين $A$ و $B$، نستنتج $(AB)^T = B^T A^T$.
+
+
+
+**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
+
+
+ المعكوس (Inverse)- معكوس أي مصفوفة $A$ قابلة للعكس (Invertible) يرمز له ب $A^{-1}$ ويعتبر المعكوس المصفوفة الوحيدة التي لديها الخاصية التالية :
+
+
+
+**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
+
+
+ملاحظة: ليس جميع المصفوفات يمكن إيجاد معكوس لها. كذلك لأي مصفوفتين $A$ و $B$ نستنتج $(AB)^{-1} = B^{-1} A^{-1}$.
+
+
+
+
+**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
+
+
+أثر المصفوفة (Trace) - أثر أي مصفوفة مربعة $A$ يرمز له ب $tr(A)$ يعتبر مجموع العناصر التي في القطر:
+
+
+
+**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
+
+
+ ملاحظة : لأي مصفوفتين $A$ و $B$ لدينا $tr(A^T) = tr(A)$ و $tr(AB) = tr(BA)$.
+
+
+
+**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
+
+
+المحدد (Determinant) - المحدد لأي مصفوفة مربعة من الشكل $A \in \mathbb{R}^{n \times n}$ يرمز له ب $|A|$ او $\text{det}(A)$ يتم تعريفه بإستخدام $A_{\backslash i, \backslash j}$ والذي يعتبر المصفوفة $A$ مع حذف الصف $i$ والعمود $j$ كالتالي :
+
+
+
+**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
+
+
+ ملاحظة: $A$ يكون لديه معكوس إذا وفقط إذا $|A| \neq 0$. كذلك $|A B| = |A| |B|$ و $|A^T| = |A|$.
+
+
+
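The recursive definition of the determinant can be sketched as a cofactor expansion along the first row. This is an illustrative pure-Python version under the assumption that matrices are lists of rows; the names `minor` and `det` are hypothetical, and the cost grows factorially with n, so it is only suitable for very small matrices.

```python
def minor(A, i, j):
    """Matrix A without row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("determinant needs a square matrix")
    if n == 1:
        return A[0][0]
    # Alternate signs across the first row and recurse on each minor.
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(n))


A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
AB = [[2, 1],
      [4, 3]]  # product of A and B, written out by hand
print(det(A), det(B), det(AB))
```

Comparing `det(AB)` with `det(A) * det(B)` on small examples is a quick way to check the multiplicative property stated in the remark.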
+**30. Matrix properties**
+
+
+خواص المصفوفات
+
+
+
+**31. Definitions**
+
+
+التعريفات
+
+
+
+**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
+
+
+ التفكيك المتماثل (Symmetric Decomposition)- المصفوفة $A$ يمكن التعبير عنها بإستخدام جزئين متماثل (Symmetric) وغير متماثل (Antisymmetric) كالتالي :
+
+
+
+**33. [Symmetric, Antisymmetric]**
+
+
+[متماثل، غير متماثل]
+
+
+
+
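As an illustration of the symmetric decomposition, the sketch below splits a square matrix into (A + Aᵀ)/2 and (A − Aᵀ)/2. It assumes matrices are lists of rows, and the function names are hypothetical.

```python
def transpose(A):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def symmetric_parts(A):
    """Return (S, K) with S = (A + A^T) / 2 symmetric and K = (A - A^T) / 2 antisymmetric."""
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("the decomposition needs a square matrix")
    At = transpose(A)
    S = [[(A[i][j] + At[i][j]) / 2 for j in range(n)] for i in range(n)]
    K = [[(A[i][j] - At[i][j]) / 2 for j in range(n)] for i in range(n)]
    return S, K


A = [[1.0, 2.0],
     [4.0, 3.0]]
S, K = symmetric_parts(A)
print(S)  # equals its own transpose
print(K)  # equals minus its own transpose
```

Adding `S` and `K` entry by entry recovers `A`, which is the identity the definition expresses.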
+**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+
+
+المعيار (Norm) - المعيار يعتبر دالة $N: V \to [0, +\infty)$ حيث $V$ يعتبر فضاء متجه (Vector Space)، حيث أن لكل $x,y \in V$ لدينا :
+
+
+
+**35. N(ax)=|a|N(x) for a scalar**
+
+
+لأي عدد $a$ فإن $N(ax) = |a| N(x)$
+
+
+
+**36. if N(x)=0, then x=0**
+
+
+$N(x) =0 \implies x = 0$
+
+
+
+**37. For x∈V, the most commonly used norms are summed up in the table below:**
+
+
+لأي $x \in V$ المعايير الأكثر إستخداماً ملخصة في الجدول التالي:
+
+
+
+**38. [Norm, Notation, Definition, Use case]**
+
+
+[المعيار، الرمز، التعريف، مثال للإستخدام]
+
+
+
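The table of common norms is not reproduced in this file, so the following sketch assumes the usual p-norm family (with p = 1 for Manhattan and p = 2 for Euclidean) and the infinity norm. The function names are hypothetical.

```python
def norm_p(x, p):
    """p-norm of a vector: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for this to be a norm")
    return sum(abs(v) ** p for v in x) ** (1 / p)


def norm_inf(x):
    """Infinity norm: the largest absolute entry."""
    return max(abs(v) for v in x)


x = [3, -4]
print(norm_p(x, 1))   # Manhattan (L1) norm
print(norm_p(x, 2))   # Euclidean (L2) norm
print(norm_inf(x))    # infinity norm
```

Scaling `x` by a scalar `a` scales each of these values by `|a|`, matching the homogeneity property listed above.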
+**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
+
+
+ الارتباط الخطي (Linear Dependence): مجموعة المتجهات تعتبر تابعة خطياً إذا أمكن كتابة أحد متجهات المجموعة كتركيبة خطية من المتجهات الأخرى.
+
+
+
+**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
+
+
+ملاحظة: إذا لم يتحقق هذا الشرط فإنها تسمى مستقلة خطياً .
+
+
+
+**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
+
+
+ رتبة المصفوفة (Rank) - رتبة المصفوفة $A$ يرمز لها ب $\text{rank}(A)$ وهي بُعد الفضاء المتجهي الذي تولّده أعمدة المصفوفة. يمكن وصفها كذلك بأقصى عدد من أعمدة المصفوفة $A$ التي تمتلك خاصية أنها مستقلة خطياً.
+
+
+
+**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
+
+
+ مصفوفة شبه معرفة موجبة (Positive semi-definite) - المصفوفة $A \in \mathbb{R}^{n \times n}$ تعتبر مصفوفة شبه معرفة موجبة (PSD) ويرمز لها بالرمز $A \succeq 0$ إذا :
+
+
+
+**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
+
+
+ ملاحظة: بالمثل، تعتبر المصفوفة $A$ مصفوفة معرفة موجبة، ويرمز لها بالرمز $A \succ 0$، إذا كانت مصفوفة (PSD) تستوفي الشرط $x^TAx>0$ لكل متجه غير صفري $x$.
+
+
+
+**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+
+ القيمة الذاتية (eigenvalue), المتجه الذاتي (eigenvector) - إذا كان لدينا مصفوفة $A \in \mathbb{R}^{n \times n}$، القيمة $\lambda$ تعتبر قيمة ذاتية للمصفوفة $A$ إذا وجد متجه $z \in \mathbb{R}^n \backslash \{0\}$ يسمى متجه ذاتي حيث أن :
+
+
+
+**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+
+ النظرية الطيفية (spectral theorem) - نفرض $A \in \mathbb{R}^{n \times n}$. إذا كانت المصفوفة $A$ متماثلة فإنها قابلة للتحويل إلى مصفوفة قطرية (diagonalizable) بإستخدام مصفوفة متعامدة (orthogonal) حقيقية $U \in \mathbb{R}^{n \times n}$. وبالرمز $\Lambda = \textrm{diag}(\lambda_1, \dots, \lambda_n)$ لدينا:
+
+
+
+**46. diagonal**
+
+
+ قطرية
+
+
+
+**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
+
+
+ تفكيك القيمة المنفردة (singular value decomposition) : لأي مصفوفة $A$ من الشكل $m\times n$ ، تفكيك القيمة المنفردة (SVD) يعتبر طريقة تحليل تضمن وجود مصفوفة أحادية (unitary) $U \in \mathbb{R}^{m \times m}$ , مصفوفة قطرية $\Sigma \in \mathbb{R}^{m \times n}$ ومصفوفة أحادية (unitary) $V \in \mathbb{R}^{n \times n}$ حيث أن :
+
+
+
+**48. Matrix calculus**
+
+
+ حساب المصفوفات
+
+
+
+**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
+
+
+ المشتقة في فضاءات عالية (gradient) - افترض $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$ تعتبر دالة و $A \in \mathbb{R}^{m \times n}$ تعتبر مصفوفة. المشتقة العليا ل $f$ بالنسبة ل $A$ يعتبر مصفوفة $m\times n$ يرمز له $\nabla_A f(A)$ حيث أن:
+
+
+
+**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
+
+
+ملاحظة : المشتقة العليا معرفة فقط إذا كانت الدالة $f$ لديها مدى ضمن الأعداد الحقيقية.
+
+
+
+**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
+
+
+هيشيان (Hessian) - افترض $f: \mathbb{R}^n \rightarrow \mathbb{R}$ تعتبر دالة و $x \in \mathbb{R}^n$ يعتبر متجه. الهيشيان ل $f$ بالنسبة ل $x$ تعتبر مصفوفة متماثلة من الشكل $n \times n$ يرمز لها بالرمز $\nabla^2_x f(x)$ حيث أن :
+
+
+
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+
+
+ ملاحظة : الهيشيان معرفة فقط إذا كانت الدالة $f$ لديها مدى ضمن الأعداد الحقيقية.
+
+
+
+
+**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
+
+
+ الحساب في مشتقة الفضاءات العالية- لأي مصفوفات $A,B,C$ فإن الخواص التالية مهمة :
+
+
+
+
+**54. [General notations, Definitions, Main matrices]**
+
+
+ [الرموز العامة، التعاريف، المصفوفات الرئيسية]
+
+
+
+
+**55. [Matrix operations, Multiplication, Other operations]**
+
+
+ [عمليات المصفوفات، الضرب، عمليات أخرى]
+
+
+
+**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
+
+
+ [خواص المصفوفات، المعيار، قيمة ذاتية/متجه ذاتي، تفكيك القيمة المنفردة]
+
+
+
+**57. [Matrix calculus, Gradient, Hessian, Operations]**
+
+
+ [حساب المصفوفات، مشتقة الفضاءات العالية، الهيشيان، العمليات]
+
diff --git a/ar/cs-229-machine-learning-tips-and-tricks.md b/ar/cs-229-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..d48445a75
--- /dev/null
+++ b/ar/cs-229-machine-learning-tips-and-tricks.md
@@ -0,0 +1,338 @@
+**Machine Learning tips and tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks)
+
+
+
+**1. Machine Learning tips and tricks cheatsheet**
+
+
+مرجع سريع لنصائح وحيل تعلّم الآلة
+
+
+
+**2. Classification metrics**
+
+
+مقاييس التصنيف
+
+
+
+**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+
+في سياق التصنيف الثنائي، هذه المقاييس (metrics) المهمة التي يجدر مراقبتها من أجل تقييم أداء النموذج.
+
+
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+
+مصفوفة الدقّة (confusion matrix) - تستخدم مصفوفة الدقّة لأخذ تصور شامل عند تقييم أداء النموذج. وهي تعرّف كالتالي:
+
+
+
+**5. [Predicted class, Actual class]**
+
+
+[التصنيف المتوقع، التصنيف الفعلي]
+
+
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+
+المقاييس الأساسية - المقاييس التالية تستخدم في العادة لتقييم أداء نماذج التصنيف:
+
+
+
+**7. [Metric, Formula, Interpretation]**
+
+
+[المقياس، المعادلة، التفسير]
+
+
+
+**8. Overall performance of model**
+
+
+الأداء العام للنموذج
+
+
+
+**9. How accurate the positive predictions are**
+
+
+دقّة التوقعات الإيجابية (positive)
+
+
+
+**10. Coverage of actual positive sample**
+
+
+تغطية عينات التوقعات الإيجابية الفعلية
+
+
+
+**11. Coverage of actual negative sample**
+
+
+تغطية عينات التوقعات السلبية الفعلية
+
+
+
+**12. Hybrid metric useful for unbalanced classes**
+
+
+مقياس هجين مفيد للأصناف غير المتوازنة (unbalanced)
+
+
+
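The formulas for these metrics are not reproduced in this file, so the sketch below assumes the standard definitions in terms of the confusion-matrix counts TP, FP, FN and TN. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,       # overall performance of the model
        "precision": tp / (tp + fp),         # how accurate the positive predictions are
        "recall": tp / (tp + fn),            # coverage of actual positive samples
        "specificity": tn / (tn + fp),       # coverage of actual negative samples
        "f1": 2 * tp / (2 * tp + fp + fn),   # hybrid metric for unbalanced classes
    }


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

This version raises `ZeroDivisionError` when a denominator is zero, for example when the model makes no positive predictions; real code would need to decide how to report that case.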
+**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
+
+
+منحنى دقّة الأداء (ROC) - منحنى دقّة الأداء، ويطلق عليه ROC، هو رسمة لمعدل التصنيفات الإيجابية الصحيحة (TPR) مقابل معدل التصنيفات الإيجابية الخاطئة (FPR) باستخدام قيم حد (threshold) متغيرة. هذه المقاييس ملخصة في الجدول التالي:
+
+
+
+**14. [Metric, Formula, Equivalent]**
+
+
+[المقياس، المعادلة، مرادف]
+
+
+
+**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+
+المساحة تحت منحنى دقة الأداء (المساحة تحت المنحنى) (AUC) - المساحة تحت منحنى دقة الأداء (المساحة تحت المنحنى)، ويطلق عليها AUC أو AUROC، هي المساحة تحت ROC كما هو موضح في الرسمة التالية:
+
+
+
+**16. [Actual, Predicted]**
+
+
+[الفعلي، المتوقع]
+
+
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+
+المقاييس الأساسية - إذا كان لدينا نموذج الانحدار f، فإن المقاييس التالية غالباً ما تستخدم لتقييم أداء النموذج:
+
+
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+
+[المجموع الكلي للمربعات، مجموع المربعات المُفسَّر، مجموع المربعات المتبقي]
+
+
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+
+مُعامل التحديد (Coefficient of determination) - مُعامل التحديد، وغالباً يرمز له بـ R2 أو r2، يعطي قياس لمدى مطابقة النموذج للنتائج الملحوظة، ويعرف كما يلي:
+
+
+
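A small sketch of the coefficient of determination, assuming the usual definition R² = 1 − SS_res / SS_tot with the residual and total sums of squares named above. The function name and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)               # total sum of squares
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))    # residual sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all outcomes are equal")
    return 1 - ss_res / ss_tot


# Made-up observations and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(y_true, y_pred), 4))
```

A model that always predicts the mean of the observed outcomes gets R² = 0 under this definition, and a perfect fit gets R² = 1.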
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+
+المقاييس الرئيسية - المقاييس التالية تستخدم غالباً لتقييم أداء نماذج الانحدار، وذلك بأن يتم الأخذ في الحسبان عدد المتغيرات n المستخدمة فيها:
+
+
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+
+حيث L هو الأرجحية، و ˆσ2 تقدير التباين الخاص بكل نتيجة.
+
+
+
+**22. Model selection**
+
+
+اختيار النموذج
+
+
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+
+مفردات - عند اختيار النموذج، نفرق بين 3 أجزاء من البيانات التي لدينا كالتالي:
+
+
+
+**24. [Training set, Validation set, Testing set]**
+
+
+[مجموعة تدريب، مجموعة تحقق، مجموعة اختبار]
+
+
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+
+[يتم تدريب النموذج، يتم تقييم النموذج، النموذج يعطي التوقعات]
+
+
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+
+[غالباً 80% من مجموعة البيانات، غالباً 20% من مجموعة البيانات]
+
+
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+
+[يطلق عليها كذلك المجموعة المُجنّبة أو مجموعة التطوير، بيانات لم يسبق رؤيتها من قبل]
+
+
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+
+بمجرد اختيار النموذج، يتم تدريبه على مجموعة البيانات بالكامل ثم يتم اختباره على مجموعة اختبار لم يسبق رؤيتها من قبل. كما هو موضح في الشكل التالي:
+
+
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+
+التحقق المتقاطع (Cross-validation) - التحقق المتقاطع، وكذلك يختصر بـ CV، هو طريقة تستخدم لاختيار نموذج بحيث لا يعتمد بشكل كبير على مجموعة بيانات التدريب المبدأية. أنواع التحقق المتقاطع المختلفة ملخصة في الجدول التالي:
+
+
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+
+[التدريب على k-1 جزء والتقييم باستخدام الجزء الباقي، التدريب على n−p عينة والتقييم باستخدام الـ p عينات المتبقية]
+
+
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+
+[بشكل عام k=5 أو 10، الحالة p=1 يطلق عليها الإبقاء على واحد (leave-one-out)]
+
+
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+
+الطريقة الأكثر استخداماً يطلق عليها التحقق المتقاطع س جزء/أجزاء (k-fold)، ويتم فيها تقسيم البيانات إلى k جزء، بحيث يتم تدريب النموذج باستخدام k−1 والتحقق باستخدام الجزء المتبقي، ويتم تكرار ذلك k مرة. يتم بعد ذلك حساب معدل الأخطاء في الأجزاء k ويسمى خطأ التحقق المتقاطع.
+
+
+
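The k-fold procedure described above can be sketched as follows. This is an illustrative version that splits indices in order without shuffling; `k_fold_indices`, `cross_validation_error` and the toy "predict the training mean" model are all assumptions made for this example.

```python
def k_fold_indices(n, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        validation = list(range(start, start + size))
        training = list(range(start)) + list(range(start + size, n))
        yield training, validation
        start += size


def cross_validation_error(n, k, fit_and_score):
    """Average the validation error returned by fit_and_score over the k folds."""
    errors = [fit_and_score(train, val) for train, val in k_fold_indices(n, k)]
    return sum(errors) / k


# Toy model: predict the mean of the training targets, score with mean squared error.
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def mean_model_mse(train, val):
    prediction = sum(targets[i] for i in train) / len(train)
    return sum((targets[i] - prediction) ** 2 for i in val) / len(val)


print(cross_validation_error(len(targets), 3, mean_model_mse))
```

In practice the data is usually shuffled before splitting; that step is left out here to keep the folds easy to inspect.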
+**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+
+ضبط (Regularization) - عمليه الضبط تهدف إلى تفادي فرط التخصيص (overfit) للنموذج، وهو بذلك يتعامل مع مشاكل التباين العالي. الجدول التالي يلخص أنواع وطرق الضبط الأكثر استخداماً:
+
+
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+
+[يقلص المُعاملات إلى 0، جيد لاختيار المتغيرات، يجعل المُعاملات أصغر، المفاضلة بين اختيار المتغيرات والمُعاملات الصغيرة]
+
+
+
+**35. Diagnostics**
+
+
+التشخيصات
+
+
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+
+الانحياز (Bias) - الانحياز للنموذج هو الفرق بين التنبؤ المتوقع والنموذج الحقيقي الذي نحاول تنبؤه للبيانات المعطاة.
+
+
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+
+التباين (Variance) - تباين النموذج هو مقدار التغير في تنبؤ النموذج لنقاط البيانات المعطاة.
+
+
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+
+موازنة الانحياز/التباين (Bias/variance tradeoff) - كلما زادت بساطة النموذج، زاد الانحياز، وكلما زاد تعقيد النموذج، زاد التباين.
+
+
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+
+[الأعراض، توضيح الانحدار، توضيح التصنيف، توضيح التعلم العميق، العلاجات الممكنة]
+
+
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+
+[خطأ التدريب عالي، خطأ التدريب قريب من خطأ الاختبار، انحياز عالي، خطأ التدريب أقل بقليل من خطأ الاختبار، خطأ التدريب منخفض جداً، خطأ التدريب أقل بكثير من خطأ الاختبار، تباين عالي]
+
+
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+
+[زيادة تعقيد النموذج، إضافة المزيد من الخصائص، تدريب لمدة أطول، إجراء الضبط (regularization)، الحصول على المزيد من البيانات]
+
+
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+
+تحليل الخطأ - تحليل الخطأ هو تحليل السبب الرئيسي للفرق في الأداء بين النماذج الحالية والنماذج المثالية.
+
+
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+
+تحليل استئصالي (Ablative analysis) - التحليل الاستئصالي هو تحليل السبب الرئيسي للفرق في الأداء بين النماذج الحالية والنماذج المبدئية (baseline).
+
+
+
+**44. Regression metrics**
+
+
+مقاييس الانحدار
+
+
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+
+[مقاييس التصنيف، مصفوفة الدقّة، الضبط (accuracy)، الدقة (precision)، الاستدعاء (recall)، درجة F1، منحنى دقّة الأداء (ROC)]
+
+
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+
+[مقاييس الانحدار، مربع R، معيار معامل مالوس (Mallow's)، معيار أكايكي المعلوماتي (AIC)، معيار المعلومات البايزي (BIC)]
+
+
+
+**47. [Model selection, cross-validation, regularization]**
+
+
+[اختيار النموذج، التحقق المتقاطع، الضبط]
+
+
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+
+[التشخيصات، موازنة الانحياز/التباين، تحليل الخطأ/التحليل الاستئصالي]
+
diff --git a/ar/cs-229-probability.md b/ar/cs-229-probability.md
new file mode 100644
index 000000000..d57cadf9f
--- /dev/null
+++ b/ar/cs-229-probability.md
@@ -0,0 +1,385 @@
+**Probabilities and Statistics translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-probabilities-statistics)
+
+
+
+**1. Probabilities and Statistics refresher**
+
+مراجعة للاحتمالات والإحصاء
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+مقدمة في الاحتمالات والتوافيق
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+فضاء العينة ― يعرَّف فضاء العينة لتجربة ما بمجموعة كل النتائج الممكنة لهذه التجربة ويرمز لها بـ S.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+الحدث ― أي مجموعة جزئية E من فضاء العينة تعتبر حدثاً. أي، الحدث هو مجموعة من النتائج الممكنة للتجربة. إذا كانت نتيجة التجربة محتواة في E، عندها نقول أن الحدث E وقع.
+
+
+
+**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occurring.**
+
+مسلَّمات الاحتمالات. لكل حدث E، نرمز لاحتمال وقوعه بـ P(E).
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+المسلَّمة 1 ― كل احتمال يأخذ قيماً بين الـ 0 والـ 1 مضمَّنة:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+المسلَّمة 2 ― احتمال وقوع حدث ابتدائي واحد على الأقل من الأحداث الابتدائية في فضاء العينة يساوي الـ 1:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+المسلَّمة 3 ― لأي سلسلة من الأحداث الغير متداخلة E1,...,En، لدينا:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+التباديل ― التبديل هو عبارة عن عدد الاختيارات لـ r غرض من مجموعة مكونة من n غرض بترتيب محدد. عدد هكذا تراتيب يرمز له بـ P(n, r)، المعرف كالتالي:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+التوافيق ― التوفيق هو عدد الاختيارات لـ r غرض من مجموعة مكونة من n غرض بدون إعطاء الترتيب أية أهمية. عدد هكذا توافيق يرمز له بـ C(n, r)، المعرف كالتالي:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+ملاحظة: لكل 0⩽r⩽n، يكون لدينا P(n,r)⩾C(n,r)
+
+
+
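The counting formulas are not reproduced in this file, so the sketch below assumes the standard definitions P(n, r) = n! / (n − r)! and C(n, r) = n! / (r! (n − r)!). The function names are hypothetical.

```python
from math import factorial


def permutations_count(n, r):
    """Number of ordered arrangements of r objects chosen from n: n! / (n - r)!."""
    if not 0 <= r <= n:
        raise ValueError("need 0 <= r <= n")
    return factorial(n) // factorial(n - r)


def combinations_count(n, r):
    """Number of unordered selections of r objects chosen from n: n! / (r! (n - r)!)."""
    if not 0 <= r <= n:
        raise ValueError("need 0 <= r <= n")
    return factorial(n) // (factorial(r) * factorial(n - r))


n, r = 5, 2
print(permutations_count(n, r), combinations_count(n, r))
```

Since P(n, r) = r! · C(n, r) and r! ≥ 1, the inequality in the remark follows directly.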
+**12. Conditional Probability**
+
+الاحتمال الشرطي
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+قاعدة بايز ― إذا كانت لدينا الأحداث A و B بحيث P(B)>0، يكون لدينا:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+ملاحظة: لدينا P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+القسم ― ليكن {Ai,i∈[[1,n]]} بحيث لكل i لديناAi≠∅ . نقول أن {Ai} قسم إذا كان لدينا:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+ملاحظة: لأي حدث B في فضاء العينة، لدينا P(B)=n∑i=1P(B|Ai)P(Ai).
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+النسخة الموسعة من قاعدة بايز ― ليكن {Ai,i∈[[1,n]]} قسم من فضاء العينة. لدينا:
+
+
+
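The extended form of Bayes' rule can be sketched numerically. The code assumes the standard formula P(A_k|B) = P(B|A_k)P(A_k) / Σ_i P(B|A_i)P(A_i) over a partition {A_i}; the function name and the probabilities in the example are made up for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(A_k | B) for every k, given P(A_i) and P(B | A_i) over a partition."""
    if len(priors) != len(likelihoods):
        raise ValueError("one likelihood is needed per event of the partition")
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) P(A_i)
    evidence = sum(joint)                                 # total probability P(B)
    if evidence == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return [j / evidence for j in joint]


# Hypothetical two-event partition: a rare condition and its complement.
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]   # P(B | A_1), P(B | A_2)
print(posterior(priors, likelihoods))
```

The denominator is exactly the sum from the remark on P(B), which is why the returned values always add up to 1.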
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+الاستقلال ― يكون الحدثان A و B مستقلين إذا وفقط إذا كان لدينا:
+
+
+
+**19. Random Variables**
+
+المتحولات العشوائية
+
+
+
+**20. Definitions**
+
+تعاريف
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+المتحول العشوائي ― المتحول العشوائي، ويرمز له عادة بـ X، هو دالة تربط كل عنصر في فضاء العينة إلى خط الأعداد الحقيقية.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+دالة التوزيع التراكمي (CDF) ― تعرف دالة التوزيع التراكمي F، والتي تكون غير متناقصة بشكل رتيب وتحقق limx→−∞F(x)=0 و limx→+∞F(x)=1، كالتالي:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+ملاحظة: لدينا P(a<X⩽b)=F(b)−F(a).
+
+
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+دالة الكثافة الاحتمالية (PDF) ― دالة الكثافة الاحتمالية f هي احتمال أن يأخذ X قيماً بين قيمتين متجاورتين من قيم المتحول العشوائي.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+علاقات تتضمن دالة الكثافة الاحتمالية ودالة التوزيع التراكمي ― هذه بعض الخصائص التي من المهم معرفتها في الحالتين المتقطعة (D) والمستمرة (C).
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+[الحالة، دالة التوزيع التراكمي F، دالة الكثافة الاحتمالية f، خصائص دالة الكثافة الاحتمالية]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+التوقع وعزوم التوزيع ― فيما يلي المصطلحات المستخدمة للتعبير عن القيمة المتوقعة E[X]، الصيغة العامة للقيمة المتوقعة E[g(X)]، العزم رقم k E[Xk] ودالة السمة ψ(ω) للحالات المتقطعة والمستمرة:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+التباين ― تباين متحول عشوائي، والذي يرمز له عادةً ب Var(X) أو σ2، هو مقياس لانتشار دالة توزيع هذا المتحول. يحسب بالشكل التالي:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+الانحراف المعياري ― الانحراف المعياري لمتحول عشوائي، والذي يرمز له عادةً ب σ، هو مقياس لانتشار دالة توزيع هذا المتحول بما يتوافق مع وحدات قياس المتحول العشوائي. يحسب بالشكل التالي:
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+تحويل المتحولات العشوائية ― لتكن المتحولات العشوائية X وY مرتبطة من خلال دالة ما. باعتبار fX وfY دالتا التوزيع لX وY على التوالي، يكون لدينا:
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+ قاعدة لايبنتز (Leibniz) للتكامل ― لتكن g دالة لـ x وربما لـ c، ولتكن a وb حدود قد تعتمد على c. يكون لدينا:
+
+
+
+**32. Probability Distributions**
+
+التوزيعات الاحتمالية
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+متراجحة تشيبشيف (Chebyshev) ― ليكن X متحولاً عشوائياً قيمته المتوقعة تساوي μ. إذا كان لدينا k ،σ>0، سنحصل على المتراجحة التالية:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+التوزيعات الأساسية ― فيما يلي التوزيعات الأساسية لأخذها بالاعتبار:
+
+
+
+**35. [Type, Distribution]**
+
+[النوع، التوزيع]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+المتحولات العشوائية الموزعة اشتراكياً
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+الكثافة الهامشية والتوزيع التراكمي ― من دالة الكثافة الاحتمالية المشتركة fXY، لدينا:
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+[الحالة، الكثافة الهامشية، الدالة التراكمية]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+الكثافة الشرطية ― الكثافة الشرطية لـ X بالنسبة لـ Y، والتي يرمز لها عادةً بـ fX|Y، تعرف بالشكل التالي:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+الاستقلال ― يقال عن متحولين عشوائيين X و Y أنهما مستقلان إذا كان لدينا:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+التغاير ― نعرف تغاير متحولين عشوائيين X و Y، والذي نرمز له بـ σ2XY أو بالرمز الأكثر شيوعاً Cov(X,Y)، كالتالي:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+الارتباط ― بأخذ σX، σY كانحراف معياري لـ X و Y، نعرف الارتباط بين المتحولات العشوائية X و Y، والمرمز بـ ρXY، كالتالي:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+ملاحظة 1: لأي متحولات عشوائية X، Y، لدينا ρXY∈[−1,1].
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+ملاحظة 2: إذا كان X و Y مستقلين، فإن ρXY=0.
+
+
+
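Covariance and correlation can be illustrated on a finite list of paired values. The sketch below assumes the standard definitions Cov(X,Y) = E[(X − μ_X)(Y − μ_Y)] and ρ_XY = Cov(X,Y) / (σ_X σ_Y), and treats the pairs as equally likely outcomes (a population-style estimate). The function names are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)], with every pair weighted equally."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """rho_XY = Cov(X, Y) / (sigma_X * sigma_Y), using Var(X) = Cov(X, X)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0]
print(correlation(xs, [2.0, 4.0, 6.0, 8.0]))   # increasing linear relation
print(correlation(xs, [8.0, 6.0, 4.0, 2.0]))   # decreasing linear relation
```

Both printed values sit at the ends of the interval [−1, 1] from Remark 1, because each second list is an exact linear function of the first.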
+**45. Parameter estimation**
+
+تقدير المُدخَل (Parameter)
+
+
+
+**46. Definitions**
+
+تعاريف
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+العينة العشوائية ― العينة العشوائية هي مجموعة من n متحول عشوائي X1,...,Xn والتي تكون مستقلة وموزعة تطابقياً مع X.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+المُقَدِّر ― المُقَدِّر هو دالة للبيانات المستخدمة ويستخدم لاستنباط قيمة مُدخل غير معلوم ضمن نموذج إحصائي.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+الانحياز ― انحياز مُقَدِّر ^θ هو الفرق بين القيمة المتوقعة لتوزيع ^θ والقيمة الحقيقية، كالتالي:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+ملاحظة: يقال عن مُقَدِّر أنه غير منحاز عندما يكون لدينا E[^θ]=θ.
+
+
+
+**51. Estimating the mean**
+
+تقدير المتوسط
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+متوسط العينة ― يستخدم متوسط عينة عشوائية لتقدير المتوسط الحقيقي μ لتوزيع ما، عادةً ما يرمز له بـ ¯¯¯¯¯X ويعرف كالتالي:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
+
+ملاحظة: متوسط العينة غير منحاز، أي E[¯¯¯¯¯X]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+مبرهنة النهاية المركزية ― ليكن لدينا عينة عشوائية X1,...,Xn والتي تتبع لتوزيع معطى له متوسط μ وتباين σ2، فيكون:
+
+
+
+**55. Estimating the variance**
+
+تقدير التباين
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+تباين العينة ― يستخدم تباين عينة عشوائية لتقدير التباين الحقيقي σ2 لتوزيع ما، والذي يرمز له عادةً بـ s2 أو ^σ2 ويعرّف بالشكل التالي:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+ملاحظة: تباين العينة غير منحاز، أي E[s2]=σ2.
+
+
+
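A short sketch of the two estimators, assuming the standard definitions: the sample mean X̄ = (1/n) Σ X_i and the sample variance s² = (1/(n−1)) Σ (X_i − X̄)². The n − 1 denominator is what makes s² unbiased. The function names and the data are made up for illustration.

```python
import statistics


def sample_mean(data):
    """Sample mean: (1 / n) * sum of the observations."""
    return sum(data) / len(data)


def sample_variance(data):
    """Unbiased sample variance: squared deviations divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sample_mean(data)
    return sum((x - mean) ** 2 for x in data) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(data), sample_variance(data))
# The standard library's statistics.variance also divides by n - 1.
print(statistics.mean(data), statistics.variance(data))
```

Dividing by n instead of n − 1 gives the population variance (`statistics.pvariance`), which underestimates σ² on average when applied to a sample.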
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+علاقة مربع كاي (Chi-Squared) مع تباين العينة ― ليكن s2 تباين العينة لعينة عشوائية. لدينا:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+[مقدمة، فضاء العينة، الحدث، التبديل]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+[الاحتمال الشرطي، قاعدة بايز، الاستقلال]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+[المتحولات العشوائية، تعاريف، القيمة المتوقعة، التباين]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+[التوزيعات الاحتمالية، متراجحة تشيبشيف، توزيعات رئيسية]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+[المتحولات العشوائية الموزعة اشتراكياً، الكثافة، التغاير، الارتباط]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+[تقدير المُدخَل، المتوسط، التباين]
+
diff --git a/ar/cs-229-supervised-learning.md b/ar/cs-229-supervised-learning.md
new file mode 100644
index 000000000..9104d46a1
--- /dev/null
+++ b/ar/cs-229-supervised-learning.md
@@ -0,0 +1,663 @@
+**1. Supervised Learning cheatsheet**
+
+
+مرجع سريع للتعلّم المُوَجَّه
+
+
+
+**2. Introduction to Supervised Learning**
+
+
+مقدمة للتعلّم المُوَجَّه
+
+
+
+**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+
+إذا كان لدينا مجموعة من نقاط البيانات {x(1),...,x(m)} مرتبطة بمجموعة مخرجات {y(1),...,y(m)}، نريد أن نبني مُصَنِّف يتعلم كيف يتوقع y من x.
+
+
+
+**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+
+نوع التوقّع - أنواع نماذج التوقّع المختلفة موضحة في الجدول التالي:
+
+
+
+**5. [Regression, Classifier, Outcome, Examples]**
+
+
+[الانحدار (Regression)، التصنيف (Classification)، المُخرَج، أمثلة]
+
+
+
+**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+
+[مستمر، صنف، انحدار خطّي (Linear regression)، انحدار لوجستي (Logistic regression)، آلة المتجهات الداعمة (SVM)، بايز البسيط (Naive Bayes)]
+
+
+
+**7. Type of model ― The different models are summed up in the table below:**
+
+
+نوع النموذج - أنواع النماذج المختلفة موضحة في الجدول التالي:
+
+
+
+**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+
+[نموذج تمييزي (Discriminative)، نموذج توليدي (Generative)، الهدف، ماذا يتعلم، توضيح، أمثلة]
+
+
+
+**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+
+[التقدير المباشر لـ P(y|x)، تقدير P(x|y) ثم استنتاج P(y|x)، حدود القرار، التوزيع الاحتمالي للبيانات، الانحدار (Regression)، آلة المتجهات الداعمة (SVM)، GDA، بايز البسيط (Naive Bayes)]
+
+
+
+**10. Notations and general concepts**
+
+
+الرموز ومفاهيم أساسية
+
+
+
+**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+
+الفرضية (Hypothesis) - الفرضية، ويرمز لها بـ hθ، هي النموذج الذي نختاره. إذا كان لدينا المدخل x(i)، فإن المخرج الذي سيتوقعه النموذج هو hθ(x(i)).
+
+
+
+**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+
+دالة الخسارة (Loss function) - دالة الخسارة هي الدالة L:(z,y)∈R×Y⟼L(z,y)∈R التي تأخذ كمدخلات القيمة المتوقعة z والقيمة الحقيقية y وتعطينا الاختلاف بينهما. الجدول التالي يحتوي على بعض دوال الخسارة الشائعة:
+
+
+
+**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+
+[خطأ أصغر تربيع (Least squared error)، خسارة لوجستية (Logistic loss)، خسارة مفصلية (Hinge loss)، الانتروبيا التقاطعية (Cross-entropy)]
+
+
+
+**14. [Linear regression, Logistic regression, SVM, Neural Network]**
+
+
+[الانحدار الخطّي (Linear regression)، الانحدار اللوجستي (Logistic regression)، آلة المتجهات الداعمة (SVM)، الشبكات العصبية (Neural Network)]
+
+
+
+**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+
+دالة التكلفة (Cost function) - دالة التكلفة J تستخدم عادة لتقييم أداء نموذج ما، ويتم تعريفها مع دالة الخسارة L كالتالي:
+
+
+
+**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+
+النزول الاشتقاقي (Gradient descent) - لنعرّف معدل التعلّم α∈R، يمكن تعريف القانون الذي يتم تحديث خوارزمية النزول الاشتقاقي من خلاله باستخدام معدل التعلّم ودالة التكلفة J كالتالي:
+
+
+
+**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
+
+
+ملاحظة: في النزول الاشتقاقي العشوائي (Stochastic gradient descent (SGD)) يتم تحديث المُعاملات (parameters) بناءً على كل عينة تدريب على حدة، بينما في النزول الاشتقاقي الحُزَمي (batch gradient descent) يتم تحديثها باستخدام حُزَم من عينات التدريب.
+
+
+
+**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+
+الأرجحية (Likelihood) - تستخدم أرجحية النموذج L(θ)، حيث أن θ هي المُدخلات، للبحث عن المُدخلات θ الأحسن عن طريق تعظيم (maximizing) الأرجحية. عملياً يتم استخدام الأرجحية اللوغاريثمية (log-likelihood) ℓ(θ)=log(L(θ)) حيث أنها أسهل في التحسين (optimize). فيكون لدينا:
+
+
+
+**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+
+خوارزمية نيوتن (Newton's algorithm) - خوارزمية نيوتن هي طريقة حسابية للعثور على θ بحيث يكون ℓ′(θ)=0. قاعدة التحديث للخوارزمية كالتالي:
+
+
+
+**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+
+ملاحظة: هناك خوارزمية أعم وهي متعددة الأبعاد (multidimensional)، يطلق عليها خوارزمية نيوتن-رافسون (Newton-Raphson)، ويتم تحديثها عبر القانون التالي:
+
+
+
+**21. Linear models**
+
+
+النماذج الخطيّة (Linear models)
+
+
+
+**22. Linear regression**
+
+
+الانحدار الخطّي (Linear regression)
+
+
+
+**23. We assume here that y|x;θ∼N(μ,σ2)**
+
+
+هنا نفترض أن y|x;θ∼N(μ,σ2)
+
+
+
+**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+
+المعادلة الطبيعية/الناظمية (Normal) - إذا رمزنا لمصفوفة التصميم (design matrix) بـ X، فإن القيمة θ التي تقلل من دالة التكلفة يمكن حلها رياضياً بشكل مغلق (closed-form) عن طريق:
+
+
+
+**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+
+خوارزمية أصغر معدل تربيع LMS - إذا كان لدينا معدل التعلّم α، فإن قانون التحديث لخوارزمية أصغر معدل تربيع (Least Mean Squares (LMS)) لمجموعة بيانات من m عينة، ويطلق عليه قانون تعلم ويدرو-هوف (Widrow-Hoff)، كالتالي:
+
+
+
+**26. Remark: the update rule is a particular case of the gradient ascent.**
+
+
+ملاحظة: قانون التحديث هذا يعتبر حالة خاصة من النزول الاشتقاقي (Gradient descent).
+
+
+
+**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+
+الانحدار الموزون محليّاً (LWR) - الانحدار الموزون محليّاً (Locally Weighted Regression)، ويعرف بـ LWR، هو نوع من الانحدار الخطي يَزِن كل عينة تدريب أثناء حساب دالة التكلفة باستخدام w(i)(x)، التي يمكن تعريفها باستخدام المُدخل (parameter) τ∈R كالتالي:
+
+
+
+**28. Classification and logistic regression**
+
+
+التصنيف والانحدار اللوجستي
+
+
+
+**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+
+دالة سيجمويد (Sigmoid) - دالة سيجمويد g، وتعرف كذلك بالدالة اللوجستية، تعرّف كالتالي:
+
+
+
+**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+
+الانحدار اللوجستي (Logistic regression) - نفترض هنا أن y|x;θ∼Bernoulli(ϕ). فيكون لدينا:
+
+
+
+**31. Remark: there is no closed form solution for the case of logistic regressions.**
+
+
+ملاحظة: ليس هناك حل رياضي مغلق للانحدار اللوجستي.
+
+
+
+**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+
+انحدار سوفت ماكس (Softmax) - ويطلق عليه الانحدار اللوجستي متعدد الأصناف (multiclass logistic regression)، يستخدم لتعميم الانحدار اللوجستي إذا كان لدينا أكثر من صنفين. في العرف يتم تعيين θK=0، بحيث تجعل مُدخل بيرنوللي (Bernoulli) ϕi لكل فئة i يساوي:
+
+
+
+**33. Generalized Linear Models**
+
+
+النماذج الخطية العامة (Generalized Linear Models - GLM)
+
+
+
+**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+
+العائلة الأُسيّة (Exponential family) - يطلق على صنف من التوزيعات (distributions) بأنها تنتمي إلى العائلة الأسيّة إذا كان يمكن كتابتها بواسطة مُدخل طبيعي (natural parameter) η، ويسمى أيضاً المُدخل القانوني (canonical parameter) أو دالة الربط (link function)، وإحصاء كافٍ (sufficient statistic) T(y)، ودالة تجزئة لوغاريثمية (log-partition function) a(η)، كالتالي:
+
+
+
+**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+
+ملاحظة: كثيراً ما سيكون T(y)=y. كذلك فإن exp(−a(η)) يمكن أن تفسر كمُدخل تسوية (normalization) للتأكد من أن الاحتمالات يكون حاصل جمعها يساوي واحد.
+
+
+
+**36. Here are the most common exponential distributions summed up in the following table:**
+
+
+تم تلخيص أكثر التوزيعات الأسيّة استخداماً في الجدول التالي:
+
+
+
+**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+
+[التوزيع، بِرنوللي (Bernoulli)، جاوسي (Gaussian)، بواسون (Poisson)، هندسي (Geometric)]
+
+
+
+**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+
+افتراضات GLMs - تهدف النماذج الخطيّة العامة (GLM) إلى توقع المتغير العشوائي y كدالة لـ x∈Rn+1، وتستند إلى ثلاثة افتراضات:
+
+
+
+**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+
+ملاحظة: أصغر تربيع (least squares) الاعتيادي و الانحدار اللوجستي يعتبران من الحالات الخاصة للنماذج الخطيّة العامة.
+
+
+
+**40. Support Vector Machines**
+
+
+آلة المتجهات الداعمة (Support Vector Machines)
+
+
+
+**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+
+
+تهدف آلة المتجهات الداعمة (SVM) إلى العثور على الخط الذي يعظم أصغر مسافة إليه:
+
+
+
+**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+
+مُصنِّف الهامش الأحسن (Optimal margin classifier) - يعرَّف مُصنِّف الهامش الأحسن h كالتالي:
+
+
+
+**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+
+حيث (w,b)∈Rn×R هو الحل لمشكلة التحسين (optimization) التالية:
+
+
+
+**44. such that**
+
+
+بحيث أن
+
+
+
+**45. support vectors**
+
+
+المتجهات الداعمة (support vectors)
+
+
+
+**46. Remark: the line is defined as wTx−b=0.**
+
+
+ملاحظة: يتم تعريف الخط بهذه المعادلة wTx−b=0.
+
+
+
+**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+
+الخسارة المفصلية (Hinge loss) - تستخدم الخسارة المفصلية في حل SVM وتعرَّف على النحو التالي:
+
+
+
+**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
+
+
+النواة (Kernel) - إذا كان لدينا دالة ربط الخصائص (features) ϕ، يمكننا تعريف النواة K كالتالي:
+
+
+
+**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||^2/(2σ^2)) is called the Gaussian kernel and is commonly used.**
+
+
+عملياً، يمكن أن تُعَرَّف الدالة K عن طريق المعادلة K(x,z)=exp(−||x−z||^2/(2σ^2))، ويطلق عليها النواة الجاوسية (Gaussian kernel)، وهي تستخدم بكثرة.
+
+
+
+**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+
+[قابلية الفصل غير الخطي، استخدام ربط النواة، حد القرار في الفضاء الأصلي]
+
+
+
+**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+
+ملاحظة: نقول أننا نستخدم "حيلة النواة" (kernel trick) لحساب دالة التكلفة عند استخدام النواة لأننا في الحقيقة لا نحتاج أن نعرف التحويل الصريح ϕ، الذي يكون في الغالب شديد التعقيد. بدلاً من ذلك، نحتاج فقط إلى القيم K(x,z).
+
+
+
+**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+
+اللّاغرانجي (Lagrangian) - يتم تعريف اللّاغرانجي L(w,b) على النحو التالي:
+
+
+
+**53. Remark: the coefficients βi are called the Lagrange multipliers.**
+
+
+ملاحظة: المعامِلات (coefficients) βi يطلق عليها مضروبات لاغرانج (Lagrange multipliers).
+
+
+
+**54. Generative Learning**
+
+
+التعلم التوليدي (Generative Learning)
+
+
+
+**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+
+النموذج التوليدي في البداية يحاول أن يتعلم كيف تم توليد البيانات عن طريق تقدير P(x|y)، التي يمكن حينها استخدامها لتقدير P(y|x) باستخدام قانون بايز (Bayes' rule).
+
+
+
+**56. Gaussian Discriminant Analysis**
+
+
+تحليل التمايز الجاوسي (Gaussian Discriminant Analysis)
+
+
+
+**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+
+الإطار - تحليل التمايز الجاوسي يفترض أن y و x|y=0 و x|y=1 بحيث يكونوا كالتالي:
+
+
+
+**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+
+التقدير - الجدول التالي يلخص التقديرات التي يمكننا التوصل لها عند تعظيم الأرجحية (likelihood):
+
+
+
+**59. Naive Bayes**
+
+
+بايز البسيط (Naive Bayes)
+
+
+
+**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+
+الافتراض - يفترض نموذج بايز البسيط أن جميع الخصائص لكل عينة بيانات مستقلة (independent):
+
+
+
+**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+
+الحل - تعظيم الأرجحية اللوغاريثمية (log-likelihood) يعطينا الحلول التالية إذا كان k∈{0,1}، l∈[[1,L]]:
+
+
+
+**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+
+ملاحظة: بايز البسيط يستخدم بشكل واسع لتصنيف النصوص واكتشاف البريد الإلكتروني المزعج.
+
+
+
+**63. Tree-based and ensemble methods**
+
+
+الطرق الشجرية (tree-based) والتجميعية (ensemble)
+
+
+
+**64. These methods can be used for both regression and classification problems.**
+
+
+هذه الطرق يمكن استخدامها لكلٍ من مشاكل الانحدار (regression) والتصنيف (classification).
+
+
+
+**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
+
+
+التصنيف والانحدار الشجري (CART) - والاسم الشائع له أشجار القرار (decision trees)، يمكن أن يمثل كأشجار ثنائية (binary trees). من المزايا لهذه الطريقة إمكانية تفسيرها بسهولة.
+
+
+
+**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+
+الغابة العشوائية (Random forest) - هي إحدى الطرق الشجرية التي تستخدم عدداً كبيراً من أشجار القرار مبنية باستخدام مجموعة عشوائية من الخصائص. بخلاف شجرة القرار البسيطة لا يمكن تفسير النموذج بسهولة، ولكن أداءها العالي جعلها إحدى الخوارزميات المشهورة.
+
+
+
+**67. Remark: random forests are a type of ensemble methods.**
+
+
+ملاحظة: الغابات العشوائية نوع من الخوارزميات التجميعية (ensemble).
+
+
+
+**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+
+التعزيز (Boosting) - فكرة خوارزميات التعزيز هي دمج عدة خوارزميات تعلم ضعيفة لتكوين نموذج قوي. الطرق الأساسية ملخصة في الجدول التالي:
+
+
+
+**69. [Adaptive boosting, Gradient boosting]**
+
+
+[التعزيز التَكَيُّفي (Adaptive boosting)، التعزيز الاشتقاقي (Gradient boosting)]
+
+
+
+**70. High weights are put on errors to improve at the next boosting step**
+
+
+يتم التركيز على مواطن الخطأ لتحسين النتيجة في الخطوة التالية.
+
+
+
+**71. Weak learners trained on remaining errors**
+
+
+يتم تدريب خوارزميات التعلم الضعيفة على الأخطاء المتبقية.
+
+
+
+**72. Other non-parametric approaches**
+
+
+طرق أخرى غير بارامترية (non-parametric)
+
+
+
+**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+
+خوارزمية أقرب الجيران (k-nearest neighbors) - تعتبر خوارزمية أقرب الجيران، وتعرف بـ k-NN، طريقة غير بارامترية، حيث يتم تحديد نتيجة عينة من البيانات من خلال عدد k من البيانات المجاورة في مجموعة التدريب. ويمكن استخدامها للتصنيف والانحدار.
+
+
+
+**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+
+ملاحظة: كلما زاد المُدخل k، كلما زاد الانحياز (bias)، وكلما نقص k، زاد التباين (variance).
+
+
+
+**75. Learning Theory**
+
+
+نظرية التعلُّم
+
+
+
+**76. Union bound ― Let A1,...,Ak be k events. We have:**
+
+
+حد الاتحاد (Union bound) - لنجعل A1,...,Ak تمثل k حدث. فيكون لدينا:
+
+
+
+**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+
+متراجحة هوفدينج (Hoeffding) - لنجعل Z1,..,Zm تمثل m متغير مستقلة وموزعة بشكل مماثل (iid) مأخوذة من توزيع بِرنوللي (Bernoulli distribution) ذا مُدخل ϕ. لنجعل ˆϕ متوسط العينة (sample mean) و γ>0 ثابت. فيكون لدينا:
+
+
+
+**78. Remark: this inequality is also known as the Chernoff bound.**
+
+
+ملاحظة: هذه المتراجحة تعرف كذلك بحد تشرنوف (Chernoff bound).
+
+
+
+**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+
+خطأ التدريب - ليكن لدينا المُصنِّف h، يمكن تعريف خطأ التدريب ˆϵ(h)، ويعرف كذلك بالخطر التجريبي أو الخطأ التجريبي، كالتالي:
+
+
+
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
+
+
+تقريباً صحيح احتمالياً (Probably Approximately Correct (PAC)) - هو إطار يتم من خلاله إثبات العديد من نظريات التعلم، ويحتوي على الافتراضات التالية:
+
+
+
+**81: the training and testing sets follow the same distribution**
+
+
+مجموعتا التدريب والاختبار تتبعان نفس التوزيع.
+
+
+
+**82. the training examples are drawn independently**
+
+
+عينات التدريب تؤخذ بشكل مستقل.
+
+
+
+**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+
+مجموعة تكسيرية (Shattering Set) - إذا كان لدينا المجموعة S={x(1),...,x(d)}، ومجموعة مُصنِّفات H، نقول أن H تكسر S (H shatters S) إذا كان لكل مجموعة علامات (labels) {y(1),...,y(d)} لدينا:
+
+
+
+**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+
+مبرهنة الحد الأعلى (Upper bound theorem) - لنجعل H فئة فرضية محدودة (finite hypothesis class) بحيث |H|=k، و δ وحجم العينة m ثابتين. حينها سيكون لدينا، مع احتمال على الأقل 1−δ، التالي:
+
+
+
+**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+
+بُعْد فابنيك-تشرفونيكس (Vapnik-Chervonenkis - VC) لفئة فرضية غير محدودة (infinite hypothesis class) H، ويرمز له بـ VC(H)، هو حجم أكبر مجموعة (set) يتم تكسيرها بواسطة H (shattered by H).
+
+
+
+**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+
+ملاحظة: بُعْد فابنيك-تشرفونيكس VC لـ H = {مجموعة المُصنِّفات الخطية في بُعدين} يساوي 3.
+
+
+
+**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+
+مبرهنة فابنيك (Vapnik theorem) - ليكن لدينا H، مع VC(H)=d وعدد عيّنات التدريب m. سيكون لدينا، مع احتمال على الأقل 1−δ، التالي:
+
+
+
+**88. [Introduction, Type of prediction, Type of model]**
+
+
+[مقدمة، نوع التوقع، نوع النموذج]
+
+
+
+**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
+
+
+[الرموز ومفاهيم أساسية، دالة الخسارة، النزول الاشتقاقي، الأرجحية]
+
+
+
+**90. [Linear models, linear regression, logistic regression, generalized linear models]**
+
+
+[النماذج الخطيّة، الانحدار الخطّي، الانحدار اللوجستي، النماذج الخطية العامة]
+
+
+
+**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
+
+
+[آلة المتجهات الداعمة (SVM)، مُصنِّف الهامش الأحسن، الخسارة المفصلية، النواة]
+
+
+
+**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
+
+
+[التعلم التوليدي، تحليل التمايز الجاوسي، بايز البسيط]
+
+
+
+**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
+
+
+[الطرق الشجرية والتجميعية، التصنيف والانحدار الشجري (CART)، الغابة العشوائية (Random forest)، التعزيز (Boosting)]
+
+
+
+**94. [Other methods, k-NN]**
+
+
+[طرق أخرى، خوارزمية أقرب الجيران (k-NN)]
+
+
+
+**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
+
+
+[نظرية التعلُّم، متراجحة هوفدينج، تقريباً صحيح احتمالياً (PAC)، بُعْد فابنيك-تشرفونيكس (VC dimension)]
+
diff --git a/ar/cs-229-unsupervised-learning.md b/ar/cs-229-unsupervised-learning.md
new file mode 100644
index 000000000..6e309b36d
--- /dev/null
+++ b/ar/cs-229-unsupervised-learning.md
@@ -0,0 +1,399 @@
+**1. Unsupervised Learning cheatsheet**
+
+
+ مرجع سريع للتعلّم غير المُوَجَّه
+
+
+
+
+**2. Introduction to Unsupervised Learning**
+
+
+ مقدمة للتعلّم غير المُوَجَّه
+
+
+
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+
+الحافز ― الهدف من التعلّم غير المُوَجَّه هو إيجاد الأنماط الخفية في البيانات غير المُعلَّمة {x(1),...,x(m)}.
+
+
+
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+
+متباينة جينسن ― لتكن f دالة محدبة و X متغير عشوائي. لدينا المتباينة التالية:
+
+
+
+
+**5. Clustering**
+
+
+ التجميع
+
+
+
+**6. Expectation-Maximization**
+
+
+ تعظيم القيمة المتوقعة (Expectation-Maximization)
+
+
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+
+ المتغيرات الكامنة ― المتغيرات الكامنة هي متغيرات مخفية/غير معاينة تزيد من صعوبة مشاكل التقدير، وغالباً ما يرمز لها بالحرف z. فيما يلي الإعدادات الشائعة التي تحتوي على متغيرات كامنة:
+
+
+
+**8. [Setting, Latent variable z, Comments]**
+
+
+ [الإعداد، المتغير الكامن z، ملاحظات]
+
+
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+
+ [خليط من k توزيع جاوسي، تحليل عاملي]
+
+
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+
+خوارزمية ― تعظيم القيمة المتوقعة (Expectation-Maximization (EM)) هي عبارة عن طريقة فعالة لتقدير المُدخل θ عبر تقدير الأرجحية الأعلى (maximum likelihood estimation)، ويتم ذلك بشكل تكراري حيث يتم إيجاد حد أدنى للأرجحية (الخطوة E)، ثم يتم تحسين (optimizing) ذلك الحد الأدنى (الخطوة M)، كما يلي:
+
+
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+
+الخطوة E : حساب الاحتمال البعدي Qi(z(i)) بأن تصدر كل نقطة x(i) من مجموعة (cluster) z(i) كما يلي:
+
+
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+
+ الخطوة M : يتم استعمال الاحتمالات البعدية Qi(z(i)) كأوزان خاصة لكل مجموعة (cluster) على النقط x(i)، لكي يتم تقدير نموذج لكل مجموعة بشكل منفصل، و ذلك كما يلي:
+
+
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+
+[استهلالات جاوسية، خطوة القيمة المتوقعة، خطوة التعظيم، التقارب]
+
+
+
+**14. k-means clustering**
+
+
+التجميع بالمتوسطات k (k-means clustering)
+
+
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+
+نرمز لمجموعة النقطة i بـ c(i)، ونرمز لمركز المجموعة j بـ μj.
+
+
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+
+خوارزمية - بعد الاستهلال العشوائي للنقاط المركزية (centroids) للمجموعات μ1,μ2,...,μk∈Rn، تكرر خوارزمية التجميع بالمتوسطات k الخطوة التالية حتى التقارب:
+
+
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+
+[استهلال المتوسطات، تعيين المجموعات، تحديث المتوسطات، التقارب]
+
+
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+
+دالة التحريف (distortion function) - لكي نتأكد من أن الخوارزمية تقاربت، ننظر إلى دالة التحريف المعرفة كما يلي:
+
+
+
+**19. Hierarchical clustering**
+
+
+ التجميع الهرمي
+
+
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
+
+
+خوارزمية - هي عبارة عن خوارزمية تجميع تعتمد على طريقة تجميع هرمية تبني مجموعات متداخلة بشكل متتال.
+
+
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
+
+
+الأنواع - هنالك عدة أنواع من خوارزميات التجميع الهرمي التي ترمي إلى تحسين دوال هدف (objective functions) مختلفة، هذه الأنواع ملخصة في الجدول التالي:
+
+
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+
+[ربط وارْد (ward linkage)، الربط المتوسط، الربط الكامل]
+
+
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
+
+
+[تصغير المسافة داخل المجموعة، تصغير متوسط المسافة بين أزواج المجموعات، تصغير المسافة العظمى بين أزواج المجموعات]
+
+
+**24. Clustering assessment metrics**
+
+
+مقاييس تقدير المجموعات
+
+
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+
+في التعلّم غير المُوَجَّه من الصعب غالباً تقدير أداء نموذج ما، لأن القيم الحقيقية تكون غير متوفرة كما هو الحال في التعلّم المُوَجَّه.
+
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+
+معامل الظّل (silhouette coefficient) - إذا رمزنا a و b لمتوسط المسافة بين عينة وكل النقط المنتمية لنفس الصنف، و بين عينة وكل النقط المنتمية لأقرب مجموعة، المعامل الظِلِّي s لعينة واحدة معرف كالتالي:
+
+
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+
+مؤشر كالينسكي-هارباز (Calinski-Harabaz index) - إذا رمزنا بـ k لعدد المجموعات، فإن Bk و Wk مصفوفتي التشتت بين المجموعات وداخلها تعرف كالتالي:
+
+
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+
+مؤشر كالينسكي-هارباز s(k) يشير إلى جودة نموذج تجميعي في تعريف مجموعاته، بحيث كلما كانت النتيجة أعلى كلما دل ذلك على أن المجموعات أكثر كثافة وأكثر انفصالاً فيما بينها. هذا المؤشر معرّف كالتالي:
+
+
+
+**29. Dimension reduction**
+
+
+تقليص الأبعاد
+
+
+**30. Principal component analysis**
+
+
+تحليل المكون الرئيس
+
+
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+
+إنها طريقة لتقليص الأبعاد ترمي إلى إيجاد الاتجاهات المعظمة للتباين من أجل إسقاط البيانات عليها.
+
+
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+
+قيمة ذاتية (eigenvalue)، متجه ذاتي (eigenvector) - لتكن A∈Rn×n مصفوفة، نقول أن λ قيمة ذاتية للمصفوفة A إذا وُجِد متجه z∈Rn∖{0} يسمى متجهاً ذاتياً، بحيث:
+
+
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+
+مبرهنة الطّيف (Spectral theorem) - لتكن A∈Rn×n. إذا كانت A متناظرة فإنها قابلة للاستقطار (diagonalizable) عن طريق مصفوفة متعامدة حقيقية U∈Rn×n. إذا رمزنا Λ=diag(λ1,...,λn)، لدينا:
+
+
+
+**34. diagonal**
+
+
+قطري
+
+
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+
+ملحوظة: المتجه الذاتي المرتبط بأكبر قيمة ذاتية يسمى بالمتجه الذاتي الرئيسي (principal eigenvector) للمصفوفة A.
+
+
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
+dimensions by maximizing the variance of the data as follows:**
+
+
+خوارزمية - تحليل المكون الرئيس (Principal Component Analysis (PCA)) طريقة لخفض الأبعاد تهدف إلى إسقاط البيانات على k بُعد بحيث يتم تعظيم التباين (variance)، خطواتها كالتالي:
+
+
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+
+الخطوة 1: تسوية البيانات بحيث تصبح ذات متوسط يساوي صفر وانحراف معياري يساوي واحد.
+
+
+
+**38. Step 2: Compute Σ=(1/m)∑_{i=1}^{m} x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
+
+
+الخطوة 2: حساب Σ=(1/m)∑_{i=1}^{m} x(i)x(i)T∈Rn×n، وهي متناظرة وذات قيم ذاتية حقيقية.
+
+
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+
+الخطوة 3: حساب u1,...,uk∈Rn المتجهات الذاتية الرئيسية المتعامدة لـ Σ وعددها k ، بعبارة أخرى، k من المتجهات الذاتية المتعامدة ذات القيم الذاتية الأكبر.
+
+
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+
+الخطوة 4: إسقاط البيانات على spanR(u1,...,uk).
+
+
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+
+هذا الإجراء يعظم التباين من بين كل الفضاءات ذات k بُعد.
+
+
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+
+[بيانات في فضاء الخصائص، أوجد المكونات الرئيسية، بيانات في فضاء المكونات الرئيسية]
+
+
+
+**43. Independent component analysis**
+
+
+تحليل المكونات المستقلة
+
+
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+
+هي طريقة تهدف إلى إيجاد المصادر التوليدية الكامنة.
+
+
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+
+افتراضات - لنفترض أن بياناتنا x تم توليدها عن طريق المتجه المصدر s=(s1,...,sn) ذا n بُعد، حيث si متغيرات عشوائية مستقلة، وذلك عبر مصفوفة خلط غير منفردة (mixing and non-singular) A كالتالي:
+
+
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+
+الهدف هو العثور على مصفوفة الفصل W=A−1.
+
+
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+خوارزمية تحليل المكونات المستقلة (ICA) لبيل وسجنوسكي (Bell and Sejnowski) - هذه الخوارزمية تجد مصفوفة الفصل W عن طريق الخطوات التالية:
+
+
+
+**48. Write the probability of x=As=W−1s as:**
+
+
+اكتب الاحتمال لـ x=As=W−1s كالتالي:
+
+
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+
+لتكن {x(i),i∈[[1,m]]} بيانات التدريب و g دالة سيجمويد، اكتب الأرجحية اللوغاريتمية (log likelihood) كالتالي:
+
+
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+
+هكذا، باستخدام الصعود الاشتقاقي العشوائي (stochastic gradient ascent)، لكل عينة تدريب x(i) نقوم بتحديث W كما يلي:
+
+
+
+**51. The Machine Learning cheatsheets are now available in Arabic.**
+
+
+المرجع السريع لتعلم الآلة متوفر الآن باللغة العربية.
+
+
+
+**52. Original authors**
+
+
+المحررون الأصليون
+
+
+
+**53. Translated by X, Y and Z**
+
+
+تمت الترجمة بواسطة X,Y و Z
+
+
+
+**54. Reviewed by X, Y and Z**
+
+
+تمت المراجعة بواسطة X,Y و Z
+
+
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+
+[مقدمة، الحافز، متباينة جينسن]
+
+
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+
+[التجميع، تعظيم القيمة المتوقعة، التجميع بالمتوسطات k، التجميع الهرمي، مقاييس]
+
+
+
+**57. [Dimension reduction, PCA, ICA]**
+
+
+[تقليص الأبعاد، تحليل المكون الرئيس (PCA)، تحليل المكونات المستقلة (ICA)]
+
+
diff --git a/ar/refresher-probability.md b/ar/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/ar/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-⟶
-
-
-
-**2. Introduction to Probability and Combinatorics**
-
-⟶
-
-
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-⟶
-
-
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-⟶
-
-
-
-**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.**
-
-⟶
-
-
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-⟶
-
-
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-⟶
-
-
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-⟶
-
-
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-⟶
-
-
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-⟶
-
-
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-⟶
-
-
-
-**12. Conditional Probability**
-
-⟶
-
-
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-⟶
-
-
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-⟶
-
-
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-⟶
-
-
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-⟶
-
-
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-⟶
-
-
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-⟶
-
-
-
-**19. Random Variables**
-
-⟶
-
-
-
-**20. Definitions**
-
-⟶
-
-
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
-
-⟶
-
-
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-⟶
-
-
-
-**23. Remark: we have P(a
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-⟶
-
-
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-⟶
-
-
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-⟶
-
-
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-⟶
-
-
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-⟶
-
-
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-⟶
-
-
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-⟶
-
-
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-⟶
-
-
-
-**32. Probability Distributions**
-
-⟶
-
-
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-⟶
-
-
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-⟶
-
-
-
-**35. [Type, Distribution]**
-
-⟶
-
-
-
-**36. Jointly Distributed Random Variables**
-
-⟶
-
-
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-⟶
-
-
-
-**38. [Case, Marginal density, Cumulative function]**
-
-⟶
-
-
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-⟶
-
-
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-⟶
-
-
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-⟶
-
-
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-⟶
-
-
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-⟶
-
-
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-⟶
-
-
-
-**45. Parameter estimation**
-
-⟶
-
-
-
-**46. Definitions**
-
-⟶
-
-
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-⟶
-
-
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-⟶
-
-
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-⟶
-
-
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-⟶
-
-
-
-**51. Estimating the mean**
-
-⟶
-
-
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
-
-⟶
-
-
-
-**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
-
-⟶
-
-
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-⟶
-
-
-
-**55. Estimating the variance**
-
-⟶
-
-
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-⟶
-
-
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-⟶
-
-
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-⟶
-
-
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-⟶
-
-
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-⟶
-
-
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-⟶
-
-
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-⟶
-
-
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-⟶
-
-
-
-**64. [Parameter estimation, Mean, Variance]**
-
-⟶
diff --git a/de/cheatsheet-deep-learning.md b/de/cheatsheet-deep-learning.md
deleted file mode 100644
index a5aa3756c..000000000
--- a/de/cheatsheet-deep-learning.md
+++ /dev/null
@@ -1,321 +0,0 @@
-**1. Deep Learning cheatsheet**
-
-⟶
-
-
-
-**2. Neural Networks**
-
-⟶
-
-
-
-**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
-
-⟶
-
-
-
-**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
-
-⟶
-
-
-
-**5. [Input layer, hidden layer, output layer]**
-
-⟶
-
-
-
-**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-
-⟶
-
-
-
-**7. where we note w, b, z the weight, bias and output respectively.**
-
-⟶
-
-
-
-**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-
-⟶
-
-
-
-**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
-
-⟶
-
-
-
-**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-⟶
-
-
-
-**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-
-⟶
-
-
-
-**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
-
-⟶
-
-
-
-**13. As a result, the weight is updated as follows:**
-
-⟶
-
-
-
-**14. Updating weights ― In a neural network, weights are updated as follows:**
-
-⟶
-
-
-
-**15. Step 1: Take a batch of training data.**
-
-⟶
-
-
-
-**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
-
-⟶
-
-
-
-**17. Step 3: Backpropagate the loss to get the gradients.**
-
-⟶
-
-
-
-**18. Step 4: Use the gradients to update the weights of the network.**
-
-⟶
-
-
-
-**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
-
-⟶
-
-
-
-**20. Convolutional Neural Networks**
-
-⟶
-
-
-
-**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-
-⟶
-
-
-
-**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
-
-⟶
-
-
-
-**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-⟶
-
-
-
-**24. Recurrent Neural Networks**
-
-⟶
-
-
-
-**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-
-⟶
-
-
-
-**26. [Input gate, forget gate, gate, output gate]**
-
-⟶
-
-
-
-**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
-
-⟶
-
-
-
-**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-
-⟶
-
-
-
-**29. Reinforcement Learning and Control**
-
-⟶
-
-
-
-**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
-
-⟶
-
-
-
-**33. S is the set of states**
-
-⟶
-
-
-
-**34. A is the set of actions**
-
-⟶
-
-
-
-**35. {Psa} are the state transition probabilities for s∈S and a∈A**
-
-⟶
-
-
-
-**36. γ∈[0,1[ is the discount factor**
-
-⟶
-
-
-
-**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
-
-⟶
-
-
-
-**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-
-⟶
-
-
-
-**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
-
-⟶
-
-
-
-**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
-
-⟶
-
-
-
-**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
-
-⟶
-
-
-
-**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
-
-⟶
-
-
-
-**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
-
-⟶
-
-
-
-**44. 1) We initialize the value:**
-
-⟶
-
-
-
-**45. 2) We iterate the value based on the values before:**
-
-⟶
-
-
-
-**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-
-⟶
-
-
-
-**47. times took action a in state s and got to s′**
-
-⟶
-
-
-
-**48. times took action a in state s**
-
-⟶
-
-
-
-**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-
-⟶
-
-
-
-**50. View PDF version on GitHub**
-
-⟶
-
-
-
-**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-
-⟶
-
-
-
-**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-
-⟶
-
-
-
-**53. [Recurrent Neural Networks, Gates, LSTM]**
-
-⟶
-
-
-
-**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-
-⟶
diff --git a/de/cheatsheet-machine-learning-tips-and-tricks.md b/de/cheatsheet-machine-learning-tips-and-tricks.md
deleted file mode 100644
index 9712297b8..000000000
--- a/de/cheatsheet-machine-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,285 +0,0 @@
-**1. Machine Learning tips and tricks cheatsheet**
-
-⟶
-
-
-
-**2. Classification metrics**
-
-⟶
-
-
-
-**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-
-⟶
-
-
-
-**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-
-⟶
-
-
-
-**5. [Predicted class, Actual class]**
-
-⟶
-
-
-
-**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-
-⟶
-
-
-
-**7. [Metric, Formula, Interpretation]**
-
-⟶
-
-
-
-**8. Overall performance of model**
-
-⟶
-
-
-
-**9. How accurate the positive predictions are**
-
-⟶
-
-
-
-**10. Coverage of actual positive sample**
-
-⟶
-
-
-
-**11. Coverage of actual negative sample**
-
-⟶
-
-
-
-**12. Hybrid metric useful for unbalanced classes**
-
-⟶
-
-
-
-**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are are summed up in the table below:**
-
-⟶
-
-
-
-**14. [Metric, Formula, Equivalent]**
-
-⟶
-
-
-
-**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-
-⟶
-
-
-
-**16. [Actual, Predicted]**
-
-⟶
-
-
-
-**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-
-⟶
-
-
-
-**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-
-⟶
-
-
-
-**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-
-⟶
-
-
-
-**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-
-⟶
-
-
-
-**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
-
-⟶
-
-
-
-**22. Model selection**
-
-⟶
-
-
-
-**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-
-⟶
-
-
-
-**24. [Training set, Validation set, Testing set]**
-
-⟶
-
-
-
-**25. [Model is trained, Model is assessed, Model gives predictions]**
-
-⟶
-
-
-
-**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-
-⟶
-
-
-
-**27. [Also called hold-out or development set, Unseen data]**
-
-⟶
-
-
-
-**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-
-⟶
-
-
-
-**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-
-⟶
-
-
-
-**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-
-⟶
-
-
-
-**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-
-⟶
-
-
-
-**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-
-⟶
-
-
-
-**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-
-⟶
-
-
-
-**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-⟶
-
-
-
-**35. Diagnostics**
-
-⟶
-
-
-
-**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-
-⟶
-
-
-
-**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-
-⟶
-
-
-
-**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-
-⟶
-
-
-
-**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-
-⟶
-
-
-
-**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-
-⟶
-
-
-
-**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-
-⟶
-
-
-
-**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-
-⟶
-
-
-
-**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-
-⟶
-
-
-
-**44. Regression metrics**
-
-⟶
-
-
-
-**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-⟶
-
-
-
-**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-
-⟶
-
-
-
-**47. [Model selection, cross-validation, regularization]**
-
-⟶
-
-
-
-**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-⟶
diff --git a/de/cheatsheet-supervised-learning.md b/de/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/de/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Supervised Learning**
-
-⟶
-
-
-
-**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-⟶
-
-
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-⟶
-
-
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-⟶
-
-
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-⟶
-
-
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-⟶
-
-
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-⟶
-
-
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-⟶
-
-
-
-**10. Notations and general concepts**
-
-⟶
-
-
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-⟶
-
-
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-⟶
-
-
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-⟶
-
-
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-⟶
-
-
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-⟶
-
-
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-⟶
-
-
-
-**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
-
-⟶
-
-
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-⟶
-
-
-
-**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-⟶
-
-
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-⟶
-
-
-
-**21. Linear models**
-
-⟶
-
-
-
-**22. Linear regression**
-
-⟶
-
-
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-⟶
-
-
-
-**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**
-
-⟶
-
-
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-⟶
-
-
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-⟶
-
-
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-⟶
-
-
-
-**28. Classification and logistic regression**
-
-⟶
-
-
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-⟶
-
-
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-⟶
-
-
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-⟶
-
-
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-⟶
-
-
-
-**33. Generalized Linear Models**
-
-⟶
-
-
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-⟶
-
-
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-⟶
-
-
-
-**36. Here are the most common exponential distributions summed up in the following table:**
-
-⟶
-
-
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-⟶
-
-
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:**
-
-⟶
-
-
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-⟶
-
-
-
-**40. Support Vector Machines**
-
-⟶
-
-
-
-**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
-
-⟶
-
-
-
-**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-⟶
-
-
-
-**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
-
-⟶
-
-
-
-**44. such that**
-
-⟶
-
-
-
-**45. support vectors**
-
-⟶
-
-
-
-**46. Remark: the line is defined as wTx−b=0.**
-
-⟶
-
-
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-⟶
-
-
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
-
-⟶
-
-
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**
-
-⟶
-
-
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-⟶
-
-
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-⟶
-
-
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-⟶
-
-
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-⟶
-
-
-
-**54. Generative Learning**
-
-⟶
-
-
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-⟶
-
-
-
-**56. Gaussian Discriminant Analysis**
-
-⟶
-
-
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-⟶
-
-
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-⟶
-
-
-
-**59. Naive Bayes**
-
-⟶
-
-
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-⟶
-
-
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-⟶
-
-
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-⟶
-
-
-
-**63. Tree-based and ensemble methods**
-
-⟶
-
-
-
-**64. These methods can be used for both regression and classification problems.**
-
-⟶
-
-
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
-
-⟶
-
-
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-⟶
-
-
-
-**67. Remark: random forests are a type of ensemble methods.**
-
-⟶
-
-
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-⟶
-
-
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-⟶
-
-
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-⟶
-
-
-
-**71. Weak learners trained on remaining errors**
-
-⟶
-
-
-
-**72. Other non-parametric approaches**
-
-⟶
-
-
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-⟶
-
-
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-⟶
-
-
-
-**75. Learning Theory**
-
-⟶
-
-
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-⟶
-
-
-
-**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
-
-⟶
-
-
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-⟶
-
-
-
-**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
-
-⟶
-
-
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
-
-⟶
-
-
-
-**81: the training and testing sets follow the same distribution **
-
-⟶
-
-
-
-**82. the training examples are drawn independently**
-
-⟶
-
-
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-⟶
-
-
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-⟶
-
-
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-⟶
-
-
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-⟶
-
-
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-⟶
-
-
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-⟶
-
-
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-⟶
-
-
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-⟶
-
-
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-⟶
-
-
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-⟶
-
-
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-⟶
-
-
-
-**94. [Other methods, k-NN]**
-
-⟶
-
-
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-⟶
diff --git a/de/cs-229-deep-learning.md b/de/cs-229-deep-learning.md
new file mode 100644
index 000000000..9d6cc3cfc
--- /dev/null
+++ b/de/cs-229-deep-learning.md
@@ -0,0 +1,321 @@
+**1. Deep Learning cheatsheet**
+
+⟶ Deep-Learning-Spickzettel
+
+
+
+**2. Neural Networks**
+
+⟶ Neuronale Netze
+
+
+
+**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
+
+⟶ Neuronale Netze sind eine Klasse von Modellen, die in Schichten aufgebaut sind. Gängige Typen neuronaler Netze sind unter anderem faltende und rekurrente neuronale Netze.
+
+
+
+**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
+
+⟶ Architektur ― Das Vokabular rund um Architekturen neuronaler Netze ist in folgender Abbildung beschrieben:
+
+
+
+**5. [Input layer, hidden layer, output layer]**
+
+⟶ [Eingabeschicht, versteckte Schicht, Ausgabeschicht]
+
+
+
+**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶ Sei i die i-te Schicht des Netzes und j die j-te verborgene Einheit der Schicht, so ist:
+
+
+
+**7. where we note w, b, z the weight, bias and output respectively.**
+
+⟶ wobei w, b und z jeweils Gewicht, Verzerrung und Ausgabe bezeichnen.
+
+
+
+**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
+
+⟶ Aktivierungsfunktion ― Aktivierungsfunktionen werden am Ende einer verborgenen Einheit benutzt, um nicht-lineare Komplexität in das Modell einzubringen. Am häufigsten werden folgende Funktionen angewendet:
+
+
+
+**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
+
+⟶ [Sigmoid, Tanh, ReLU, undichte ReLU (Leaky ReLU)]
+
+
+
+**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶ Kreuzentropieverlust ― Im Kontext neuronaler Netze wird der Kreuzentropieverlust L(z,y) gebräuchlicherweise benutzt und definiert wie folgt:
+
+
+
+**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶ Lernrate ― Die Lernrate, oft mit α oder manchmal mit η bezeichnet, gibt an mit welcher Rate die Gewichtungen aktualisiert werden. Die Lernrate kann konstant sein oder dynamisch angepasst werden. Die aktuell populärste Methode, Adam, aktualisiert die Lernrate dynamisch.
+
+
+
+**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
+
+⟶ Fehlerrückführung (backpropagation) ― Fehlerrückführung aktualisiert die Gewichte in neuronalen Netzen durch Einberechnung der tatsächlichen und der gewünschten Ausgabe. Die Ableitung nach der Gewichtung w wird mit Hilfe der Kettenregel berechnet und hat die folgende Form:
+
+
+
+**13. As a result, the weight is updated as follows:**
+
+⟶ Das Gewicht wird wie folgt aktualisiert:
+
+
+
+**14. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶ Das Aktualisieren der Gewichte ― In neuronalen Netzen werden die Gewichtungen wie folgt aktualisiert:
+
+
+
+**15. Step 1: Take a batch of training data.**
+
+⟶ Schritt 1: Nimm ein Bündel von Lerndaten.
+
+
+
+**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
+
+⟶ Schritt 2: Führe Vorwärtsausbreitung durch um den dazugehörigen Verlust zu erhalten.
+
+
+
+**17. Step 3: Backpropagate the loss to get the gradients.**
+
+⟶ Schritt 3: Führe Fehlerrückführung mit dem Verlust durch um die Gradienten zu erhalten.
+
+
+
+**18. Step 4: Use the gradients to update the weights of the network.**
+
+⟶ Schritt 4: Verwende die Gradienten um die Gewichte im Netz zu aktualisieren.
+
+
+
+**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
+
+⟶ Aussetzen ― Aussetzen ist eine Technik, um eine Überanpassung an die Lerndaten zu verhindern, bei der Einheiten in einem neuronalen Netz ausfallen. In der Praxis setzen Neuronen entweder mit Wahrscheinlichkeit p aus oder werden mit Wahrscheinlichkeit 1−p behalten.
+
+
+
+**20. Convolutional Neural Networks**
+
+⟶ Faltende neuronale Netzwerke (convolutional neural networks, CNN)
+
+
+
+**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
+
+⟶ Voraussetzung für eine faltende Schicht ― Sei W die Größe des Eingangsvolumens, F die Größe der Neuronen der faltenden Schicht, P die Anzahl der aufgefüllten Nullen, dann ist die Anzahl der Neuronen N, die in ein gegebenes Volumen passen:
+
+
+
+**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
+
+⟶ Bündelnormalisierung ― Ein Schritt mit den Hyperparametern γ,β, welcher das Bündel {xi} normalisiert. Seien μB der Mittelwert und σ2B die Varianz, mit denen das Bündel korrigiert werden soll, dann gilt Folgendes:
+
+
+
+**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶ Wird üblicherweise nach einer vollständig verbundenen/faltenden Schicht und vor einer nicht-linearen Schicht durchgeführt und soll höhere Lernraten ermöglichen sowie die starke Abhängigkeit von der Initialisierung reduzieren.
+
+
+
+**24. Recurrent Neural Networks**
+
+⟶ Rekurrente neuronale Netze (recurrent neural networks, RNN)
+
+
+
+**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
+
+⟶ Typen von Gattern ― Hier sind die verschiedenen Typen von Gattern, die in einem typischen rekurrenten neuronalen Netz vorkommen:
+
+
+
+**26. [Input gate, forget gate, gate, output gate]**
+
+⟶ [Eingangsgatter, Vergessgatter, Speichergatter, Ausgangsgatter]
+
+
+
+**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
+
+⟶ [Zelle beschreiben oder nicht?, Zelle löschen oder nicht?, Wie viel in die Zelle schreiben?, Wie viel von der Zelle aufdecken?]
+
+
+
+**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
+
+⟶ LSTM ― Ein langes Kurzzeitgedächtnis (long short-term memory, LSTM) ist ein RNN-Modell, das durch Hinzufügen von Vergessgattern das Problem der verschwindenden Gradienten vermeidet.
+
+
+
+**29. Reinforcement Learning and Control**
+
+⟶ Bestärkendes Lernen und Regelung
+
+
+
+**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
+
+⟶ Das Ziel des bestärkenden Lernens ist, dass ein Agent lernt, sich in einer Umgebung zu entwickeln.
+
+
+
+**31. Definitions**
+
+⟶ Definitionen
+
+
+
+**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
+
+⟶ Markow-Entscheidungsproblem ― Ein Markow-Entscheidungsproblem (Markov decision process, MDP) ist ein 5-Tupel (S,A,{Psa},γ,R), wobei
+
+
+
+**33. S is the set of states**
+
+⟶ S die Menge von Zuständen ist
+
+
+
+**34. A is the set of actions**
+
+⟶ A die Menge von Aktionen ist
+
+
+
+**35. {Psa} are the state transition probabilities for s∈S and a∈A**
+
+⟶ {Psa} die Übergangswahrscheinlichkeiten von s∈S und a∈A sind
+
+
+
+**36. γ∈[0,1[ is the discount factor**
+
+⟶ γ∈[0,1[ der Discount-Faktor ist
+
+
+
+**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
+
+⟶ R:S×A⟶R oder R:S⟶R die Belohnungsfunktion ist, die der Algorithmus maximieren will
+
+
+
+**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
+
+⟶ Strategie ― Eine Strategie π ist eine Funktion π:S⟶A, welche Zustände auf Aktionen abbildet.
+
+
+
+**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
+
+⟶ Hinweis: Wir sagen, dass wir eine gegebene Strategie π ausführen, wenn wir für einen gegebenen Zustand s die Aktion a=π(s) wählen.
+
+
+
+**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
+
+⟶ Wertfunktion ― Für eine gegebene Strategie π und einen gegebenen Zustand s definieren wir die Wertfunktion Vπ wie folgt:
+
+
+
+**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
+
+⟶ Bellman-Gleichung ― Die optimale Bellman-Gleichung charakterisiert die Wertfunktion Vπ∗ der optimalen Strategie π∗:
+
+
+
+**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
+
+⟶ Hinweis: Die optimale Strategie π∗ für einen gegebenen Zustand s ist so beschaffen, dass:
+
+
+
+**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
+
+⟶ Wert-Iterationsalgorithmus ― Der Wert-Iterationsalgorithmus hat zwei Schritte:
+
+
+
+**44. 1) We initialize the value:**
+
+⟶ 1) Wir initialisieren den Wert:
+
+
+
+**45. 2) We iterate the value based on the values before:**
+
+⟶ 2) Wir iterieren den Wert aufbauend auf dem vorherigen Wert:
+
+
+
+**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
+
+⟶ Maximum-Likelihood-Schätzung ― Die Maximum-Likelihood-Schätzungen für die Zustandsübergangswahrscheinlichkeiten sind wie folgt:
+
+
+
+**47. times took action a in state s and got to s′**
+
+⟶ Anzahl der Ausführungen von Aktion a in Zustand s, die zu s′ führten
+
+
+
+**48. times took action a in state s**
+
+⟶ Anzahl der Ausführungen von Aktion a in Zustand s
+
+
+
+**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
+
+⟶ Q-Lernen ― Q-Lernen ist eine modellfreie Schätzung von Q welche wie folgt durchgeführt wird:
+
+
+
+**50. View PDF version on GitHub**
+
+⟶ Finde die PDF-Version auf GitHub
+
+
+
+**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
+
+⟶ [Neuronale Netze, Architektur, Aktivierungsfunktion, Fehlerrückführung, Aussetzen]
+
+
+
+**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
+
+⟶ [Faltende neuronale Netzwerke, faltende Schicht, Bündelnormalisierung]
+
+
+
+**53. [Recurrent Neural Networks, Gates, LSTM]**
+
+⟶ [Rekurrente neuronale Netze, Gatter, LSTM]
+
+
+
+**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
+
+⟶ [Bestärkendes Lernen, Markow-Entscheidungsproblem, Wert-/Strategie-Iteration, approximative dynamische Programmierung, Strategiesuche]
diff --git a/de/refresher-probability.md b/de/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/de/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-⟶
-
-
-
-**2. Introduction to Probability and Combinatorics**
-
-⟶
-
-
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-⟶
-
-
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-⟶
-
-
-
-**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.**
-
-⟶
-
-
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-⟶
-
-
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-⟶
-
-
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-⟶
-
-
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-⟶
-
-
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-⟶
-
-
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-⟶
-
-
-
-**12. Conditional Probability**
-
-⟶
-
-
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-⟶
-
-
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-⟶
-
-
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-⟶
-
-
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-⟶
-
-
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-⟶
-
-
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-⟶
-
-
-
-**19. Random Variables**
-
-⟶
-
-
-
-**20. Definitions**
-
-⟶
-
-
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
-
-⟶
-
-
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-⟶
-
-
-
-**23. Remark: we have P(a
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-⟶
-
-
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-⟶
-
-
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-⟶
-
-
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-⟶
-
-
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-⟶
-
-
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-⟶
-
-
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-⟶
-
-
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-⟶
-
-
-
-**32. Probability Distributions**
-
-⟶
-
-
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-⟶
-
-
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-⟶
-
-
-
-**35. [Type, Distribution]**
-
-⟶
-
-
-
-**36. Jointly Distributed Random Variables**
-
-⟶
-
-
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-⟶
-
-
-
-**38. [Case, Marginal density, Cumulative function]**
-
-⟶
-
-
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-⟶
-
-
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-⟶
-
-
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-⟶
-
-
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-⟶
-
-
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-⟶
-
-
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-⟶
-
-
-
-**45. Parameter estimation**
-
-⟶
-
-
-
-**46. Definitions**
-
-⟶
-
-
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-⟶
-
-
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-⟶
-
-
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-⟶
-
-
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-⟶
-
-
-
-**51. Estimating the mean**
-
-⟶
-
-
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
-
-⟶
-
-
-
-**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
-
-⟶
-
-
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-⟶
-
-
-
-**55. Estimating the variance**
-
-⟶
-
-
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-⟶
-
-
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-⟶
-
-
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-⟶
-
-
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-⟶
-
-
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-⟶
-
-
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-⟶
-
-
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-⟶
-
-
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-⟶
-
-
-
-**64. [Parameter estimation, Mean, Variance]**
-
-⟶
diff --git a/es/cheatsheet-deep-learning.md b/es/cs-229-deep-learning.md
similarity index 100%
rename from es/cheatsheet-deep-learning.md
rename to es/cs-229-deep-learning.md
diff --git a/es/refresher-linear-algebra.md b/es/cs-229-linear-algebra.md
similarity index 100%
rename from es/refresher-linear-algebra.md
rename to es/cs-229-linear-algebra.md
diff --git a/es/cheatsheet-machine-learning-tips-and-tricks.md b/es/cs-229-machine-learning-tips-and-tricks.md
similarity index 100%
rename from es/cheatsheet-machine-learning-tips-and-tricks.md
rename to es/cs-229-machine-learning-tips-and-tricks.md
diff --git a/es/refresher-probability.md b/es/cs-229-probability.md
similarity index 100%
rename from es/refresher-probability.md
rename to es/cs-229-probability.md
diff --git a/es/cheatsheet-supervised-learning.md b/es/cs-229-supervised-learning.md
similarity index 100%
rename from es/cheatsheet-supervised-learning.md
rename to es/cs-229-supervised-learning.md
diff --git a/es/cheatsheet-unsupervised-learning.md b/es/cs-229-unsupervised-learning.md
similarity index 100%
rename from es/cheatsheet-unsupervised-learning.md
rename to es/cs-229-unsupervised-learning.md
diff --git a/ar/cheatsheet-machine-learning-tips-and-tricks.md b/et/cs-229-machine-learning-tips-and-tricks.md
similarity index 50%
rename from ar/cheatsheet-machine-learning-tips-and-tricks.md
rename to et/cs-229-machine-learning-tips-and-tricks.md
index 9712297b8..2f4c488b2 100644
--- a/ar/cheatsheet-machine-learning-tips-and-tricks.md
+++ b/et/cs-229-machine-learning-tips-and-tricks.md
@@ -1,285 +1,289 @@
+**Machine Learning tips and tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks)
+
+
+
**1. Machine Learning tips and tricks cheatsheet**
-⟶
+⟶ Masinõppe näpunäidete spikker
**2. Classification metrics**
-⟶
+⟶ Klassifikatsiooni mõõdikud
**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-⟶
+⟶ Siin on meil põhi mõõdikud, mille jälgimine on oluline, et mõõta meie mudeli võimekust binaarse klassifikatsiooni kontekstis.
**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-⟶
+⟶ Eksimismaatriks ― Eksimismaatriksit kasutatakse mudeli võimekust hindamisel täielikuma pildi saamiseks. See on defineeritud järgmiselt:
**5. [Predicted class, Actual class]**
-⟶
+⟶ [Ennustatud klass, Tegelik klass]
**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-⟶
+⟶ Põhimõõdikud ― järgnevad mõõdikud on tavaliselt kasutatud, et hinnata klassifikatsiooni mudelite võimekust:
**7. [Metric, Formula, Interpretation]**
-⟶
+⟶ [Mõõdik, Valem, Tõlgendus]
**8. Overall performance of model**
-⟶
+⟶ Mudeli üldine võimekus
**9. How accurate the positive predictions are**
-⟶
+⟶ Kui täpsed mudeli positiivsed ennustused on
**10. Coverage of actual positive sample**
-⟶
+⟶ Tegeliku positiivse valimi katvus
**11. Coverage of actual negative sample**
-⟶
+⟶ Tegeliku negatiivse valimi katvus
**12. Hybrid metric useful for unbalanced classes**
-⟶
+⟶ Hübriidmõõdik on kasulik tasakaalustamata klasside jaoks
**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are are summed up in the table below:**
-⟶
+⟶ ROC ― Vastuvõtja töökõver on graafik, millel kuvatakse TPR-i ja FPR-i suhet erineval lävel. Need mõõdikud on kokku võetud allolevas tabelis:
**14. [Metric, Formula, Equivalent]**
-⟶
+⟶ [Mõõdik, Valem, Ekvivalent]
**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-⟶
+⟶ AUC ― Vastuvõtva töökõvera alune ala, tuntud kui AUC või AUROC, on ROC töökõvera alune ala, mis on näidatud järgmisel joonisel:
**16. [Actual, Predicted]**
-⟶
+⟶ [Tegelik, Ennustatud]
**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-⟶
+⟶ Põhilised mõõdikud ― Arvestades regressioonimudelit f, kasutatakse mudeli võimekuse hindamiseks tavaliselt järgmisi mõõdikuid:
**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-⟶
+⟶ [Kogu ruutude summa, selgitatud ruutude summa, jääkliikmete ruutude summa]
**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-⟶
+⟶ Determinatsioonikordaja ― Determinatsioonikordaja, tuntud kui R2 või r2 näitab, kui hästi jälgitavad tulemused mudelis korduvad, ja see on määratletud järgmiselt:
**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-⟶
+⟶ Põhimõõdikud ― Regressioonimudelite võimekuse hindamiseks kasutatakse tavaliselt järgmisi mõõdikuid, võttes arvesse muutujate arvu n, mida nad arvestavad.
**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
-⟶
+⟶ kus L on tõenäosus ning ˆσ2 on iga vastusega seotud dispersiooni hinnang.
**22. Model selection**
-⟶
+⟶ Mudeli valik
**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-⟶
+⟶ Sõnavara ― Mudeli valimisel eristame 3 erinevat osa andmetest järgnevalt:
**24. [Training set, Validation set, Testing set]**
-⟶
+⟶ [Treenimiskomplekt, Valideerimiskomplekt, Testimiskomplekt]
**25. [Model is trained, Model is assessed, Model gives predictions]**
-⟶
+⟶ [Mudel treenitakse, Mudel valideeritakse, Mudel annab ennustusi]
**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-⟶
+⟶ [Tavaliselt 80% andmekogust, Tavaliselt 20% andmekogust]
**27. [Also called hold-out or development set, Unseen data]**
-⟶
+⟶ [Nimetatakse arenduskomplektiks, andmed mida mudel pole näinud]
**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-⟶
+⟶ Kui mudel on valitud, treenitakse seda kogu andmehulgaga ja testitakse andmetega mida mudel ei ole näinud. Need on esitatud alloleval joonisel:
**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-⟶
+⟶ Ristvalideerimine ― Ristvalideerimine, tuntud kui CV, on meetod, mida kasutatakse mudeli valimiseks ning mis ei sõltu liiga palju algsest treeningkomplektist. Erinevad ristkontrolli tüübid on kokku võetud allolevas tabelis:
**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-⟶
+⟶ [Treenitakse k-1 ühikuga (treeningkomplektist) ning hindamine ülejäänuga, Treenimine n-p vaatlusega ja hindamine ülejäänud p vaatlustega]
**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-⟶
+⟶ [Tavaliselt k=5 või 10, Kui p=1 siis selle nimi on jäta-üks-välja]
**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-⟶
+⟶ Kõige sagedamini kasutatavat meetodit nimetatakse k-kordseks ristvalideerimiseks ja see jagab treeningu andmehulga k kordseks osadeks, et mudelit valideerida ühe osaga, samal ajal treenides mudelit k−1 teisel osaga, seda kõike k korda. Seejärel arvutatakse viga keskmiselt k korda ja seda nimetatakse ristvalideerimise veaks.
**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-⟶
+⟶ Regulariseerimine ― Regulariseerimisprotsessi eesmärk on vältida mudeli ülemäärast ülesobitust treeningandmestikule ning seeläbi tegeleb kõrge dispersiooni probleemidega. Järgmises tabelis on toodud levinumate reguleerimist meetodite tüübid:
**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-⟶
+⟶ [Kahandab koefitsiente 0-ni, Hea muutujate valimiseks, Muudab koefitsendid väiksemaks, Muutuja valiku ja väikeste koefitsentide kompromiss]
**35. Diagnostics**
-⟶
+⟶ Diagnostika
**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-⟶
+⟶ Nihe (Vabaliige) ― Mudeli vabaliige on erinevus eeldatava ennustuse ja õige mudeli vahel, mida proovime antud andmepunktidega ennustada.
**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-⟶
+⟶ Variatsioon ― Mudeli variatsioon on mudeli ennustuse ja antud andmepunkti varieeruvus.
**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-⟶
+⟶ Nihke/Variatsioon kompromiss ― Mida lihtsam mudel seda suurem on nihe (vabaliige) ning mida keerulisem on mudel seda suurem on dispersioon.
**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-⟶
+⟶ [Sümptomid, Regressiooni näide, Klassifikatsiooni näide, Süvaõppe näide, võimalikud abinõud]
**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-⟶
+⟶ [Kõrge treenimisviga, Treenimisviga sarnane testimisveaga, Kõrge nihe, Treenimisviga veidi madalam kui testimisviga, Väga madal treenimisviga, Treenimisviga palju madalam kui testimisviga, Kõrge variatsioon]
**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-⟶
+⟶ [Mudeli keerukamaks muutmine, Muutujate lisamine, Pikem treenimine, Regulariseerimise lisamine, Andmehulga suurendamine]
**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-⟶
+⟶ Veaanalüüs ― Veaanalüüs on olemasoleva ja perfektse mudeli võimekuse erinevuse algpõhjuse leidmine.
**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-⟶
+⟶ Ablatiivne analüüs ― Ablatiivne analüüs on olemasoleva ja baas mudeli võimekuse erinevuse algpõhjuse leidmine.
**44. Regression metrics**
-⟶
+⟶ Regressiooni mõõdikud
**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-⟶
+⟶ [Klassifikatsiooni mõõdikud, Eksimismaatriks, õigsus, täpsus, saagis, F1 skoor, ROC]
**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-⟶
+⟶ [Regressiooni mõõdikud, R ruudus, Mallow-i CP, AIC, BIC]
**47. [Model selection, cross-validation, regularization]**
-⟶
+⟶ [Mudeli valik, ristvalideerimine, regulariseerimine]
**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-⟶
+⟶ [Diagnostika, Nihke/variatsiooni kompromiss, vea/ablatiivne analüüs]
diff --git a/fa/cheatsheet-deep-learning.md b/fa/cs-229-deep-learning.md
similarity index 100%
rename from fa/cheatsheet-deep-learning.md
rename to fa/cs-229-deep-learning.md
diff --git a/fa/refresher-linear-algebra.md b/fa/cs-229-linear-algebra.md
similarity index 100%
rename from fa/refresher-linear-algebra.md
rename to fa/cs-229-linear-algebra.md
diff --git a/fa/cheatsheet-machine-learning-tips-and-tricks.md b/fa/cs-229-machine-learning-tips-and-tricks.md
similarity index 100%
rename from fa/cheatsheet-machine-learning-tips-and-tricks.md
rename to fa/cs-229-machine-learning-tips-and-tricks.md
diff --git a/fa/refresher-probability.md b/fa/cs-229-probability.md
similarity index 100%
rename from fa/refresher-probability.md
rename to fa/cs-229-probability.md
diff --git a/fa/cheatsheet-supervised-learning.md b/fa/cs-229-supervised-learning.md
similarity index 100%
rename from fa/cheatsheet-supervised-learning.md
rename to fa/cs-229-supervised-learning.md
diff --git a/fa/cheatsheet-unsupervised-learning.md b/fa/cs-229-unsupervised-learning.md
similarity index 100%
rename from fa/cheatsheet-unsupervised-learning.md
rename to fa/cs-229-unsupervised-learning.md
diff --git a/fa/cs-230-convolutional-neural-networks.md b/fa/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..ee4201100
--- /dev/null
+++ b/fa/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,923 @@
+**Convolutional Neural Networks translation**
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+
+راهنمای کوتاه شبکههای عصبی پیچشی (کانولوشنی)
+
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+
+کلاس CS 230 - یادگیری عمیق
+
+
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+
+[نمای کلی، ساختار معماری]
+
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+
+[انواع لایه، کانولوشنی، ادغام، تماممتصل]
+
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+
+[ابرفراسنجهای فیلتر، ابعاد، گام، حاشیه]
+
+
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+
+[تنظیم ابرفراسنجها، سازشپذیری فراسنج، پیچیدگی مدل، ناحیهی تاثیر]
+
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+
+[توابع فعالسازی، تابع یکسوساز خطی، تابع بیشینهی هموار]
+
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+
+[شناسایی شیء، انواع مدلها، شناسایی، نسبت همپوشانی اشتراک به اجتماع، فروداشت غیربیشینه، YOLO، R-CNN]
+
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+
+[تایید/بازشناسایی چهره، یادگیری یکبارهای (One shot)، شبکهی Siamese، خطای سهگانه]
+
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+
+[انتقالِ سبکِ عصبی، فعال سازی، ماتریسِ سبک، تابع هزینهی محتوا/سبک]
+
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+
+[معماریهای با ترفندهای محاسباتی، شبکهی همآوردِ مولد، ResNet، شبکهی Inception]
+
+
+
+
+
+**12. Overview**
+
+
+نمای کلی
+
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+
+معماری یک CNN سنتی – شبکههای عصبی مصنوعی پیچشی، که همچنین با عنوان CNN شناخته می شوند، یک نوع خاص از شبکه های عصبی هستند که عموما از لایههای زیر تشکیل شدهاند:
+
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+
+لایهی کانولوشنی و لایهی ادغام میتوانند به نسبت ابرفراسنجهایی که در بخشهای بعدی بیان شدهاند تنظیم و تعدیل شوند.
+
+
+
+
+
+**15. Types of layer**
+
+
+انواع لایهها
+
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+
+لایه کانولوشنی (CONV) - لایه کانولوشنی (CONV) از فیلترهایی استفاده میکند که عملیات کانولوشنی را در هنگام پویش ورودی I به نسبت ابعادش، اجرا میکند. ابرفراسنجهای آن شامل اندازه فیلتر F و گام S هستند. خروجی حاصل شده O نگاشت ویژگی یا نگاشت فعالسازی نامیده میشود.
+
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+
+نکته: مرحله کانولوشنی همچنین میتواند به موارد یک بُعدی و سه بُعدی تعمیم داده شود.
+
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average values are taken, respectively.**
+
+
+لایه ادغام (POOL) - لایه ادغام (POOL) یک عمل نمونهکاهی است، که معمولا بعد از یک لایه کانولوشنی اعمال میشود، که تا حدی منجر به ناوردایی مکانی میشود. به طور خاص، ادغام بیشینه و میانگین انواع خاص ادغام هستند که به ترتیب مقدار بیشینه و میانگین گرفته میشود.
+
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+
+[نوع، هدف، نگاره، توضیحات]
+
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+
+[ادغام بیشینه، ادغام میانگین، هر عمل ادغام مقدار بیشینهی نمای فعلی را انتخاب میکند، هر عمل ادغام مقدار میانگینِ نمای فعلی را انتخاب میکند]
+
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+
+[ویژگیهای شناسایی شده را حفظ میکند، اغلب مورد استفاده قرار میگیرد، کاستن نگاشت ویژگی، در (معماری) LeNet استفاده شده است]
+
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+
+تماممتصل (FC) - لایهی تماممتصل (FC) بر روی یک ورودی مسطح به طوری که هر ورودی به تمامی نورونها متصل است، عمل میکند. در صورت وجود، لایههای FC معمولا در انتهای معماریهای CNN یافت میشوند و میتوان آنها را برای بهینهسازی اهدافی مثل امتیازات کلاس به کار برد.
+
+
+
+
+**23. Filter hyperparameters**
+
+
+ابرفراسنجهای فیلتر
+
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind their hyperparameters.**
+
+
+لایه کانولوشنی شامل فیلترهایی است که دانستن مفهوم نهفته در فراسنجهای آن اهمیت دارد.
+
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+
+ابعاد یک فیلتر - یک فیلتر به اندازه F×F اعمال شده بر روی یک ورودیِ حاوی C کانال، یک توده F×F×C است که (عملیات) پیچشی بر روی یک ورودی به اندازه I×I×C اعمال میکند و یک نگاشت ویژگی خروجی (که همچنین نگاشت فعالسازی نامیده میشود) به اندازه O×O×1 تولید میکند.
+
+
+
+
+
+**26. Filter**
+
+
+فیلتر
+
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+
+نکته: اعمال K فیلتر به اندازهی F×F، منتج به یک نگاشت ویژگی خروجی به اندازه O×O×K میشود.
+
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+
+گام – در یک عملیات ادغام یا پیچشی، اندازه گام S به تعداد پیکسلهایی که پنجره بعد از هر عملیات جابهجا میشود، اشاره دارد.
+
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+
+حاشیهی صفر – حاشیهی صفر به فرآیند افزودن P صفر به هر طرف از کرانههای ورودی اشاره دارد. این مقدار میتواند به طور دستی مشخص شود یا به طور خودکار به سه روش زیر تعیین گردد:
+
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+
+[نوع، مقدار، نگاره، هدف، Valid، Same، Full]
+
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+
+[فاقد حاشیه، اگر ابعاد مطابقت ندارد آخرین کانولوشنی را رها کن، (اعمال) حاشیه به طوری که اندازه نگاشت ویژگی ⌈I/S⌉ باشد، (محاسبه) اندازه خروجی به لحاظ ریاضیاتی آسان است، همچنین حاشیهی 'نیمه' نامیده میشود، بالاترین حاشیه (اعمال میشود) به طوری که (عملیات) کانولوشنی انتهایی بر روی مرزهای ورودی اعمال میشود، فیلتر ورودی را به صورت یکپارچه 'میپیماید']
+
+
+
+
+
+**32. Tuning hyperparameters**
+
+
+تنظیم ابرفراسنجها
+
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+
+سازشپذیری فراسنج در لایه کانولوشنی – با ذکر I به عنوان طول اندازه توده ورودی، F طول فیلتر، P میزان حاشیهی صفر، S گام، اندازه خروجی نگاشت ویژگی O در امتداد ابعاد خواهد بود:
+
+
+
+
+
+**34. [Input, Filter, Output]**
+
+
+[ورودی، فیلتر، خروجی]
+
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+
+نکته: اغلب Pstart=Pend≜P است، در این صورت Pstart+Pend را میتوان با 2P در فرمول بالا جایگزین کرد.
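
The output-size relation described in entries 33 to 35 is commonly written as O = (I − F + Pstart + Pend)/S + 1. The sketch below assumes that standard form with floor division, since the formula itself is not reproduced in this file; the function name `conv_output_size` and its argument names are illustrative only.

```python
def conv_output_size(i, f, s, p_start=0, p_end=0):
    """Output length O along one dimension of a CONV layer.

    Assumes the standard relation O = floor((I - F + P_start + P_end) / S) + 1.
    """
    if s <= 0:
        raise ValueError("stride must be positive")
    return (i - f + p_start + p_end) // s + 1


# Symmetric padding P = 2 with a 5-wide filter and stride 1 keeps the size.
print(conv_output_size(32, 5, 1, p_start=2, p_end=2))
# No padding, stride 2: a 7-wide input with a 3-wide filter.
print(conv_output_size(7, 3, 2))
```

When Pstart = Pend = P, passing the same value for both arguments reproduces the 2P shortcut from the remark.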
+
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+
+درک پیچیدگی مدل – برای برآورد پیچیدگی مدل، اغلب تعیین تعداد فراسنجهایی که معماری آن میتواند داشته باشد، مفید است. در یک لایه مفروض شبکه پیچشی عصبی این امر به صورت زیر انجام میشود:
+
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+
+[نگاره، اندازه ورودی، اندازه خروجی، تعداد فراسنجها، ملاحظات]
+
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S<F, A common choice for K is 2C]**
+
+
+[یک پیشقدر به ازای هر فیلتر، در بیشتر موارد S<F است، یک انتخاب رایج برای K، 2C است]
+
+
+
+
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+
+[عملیات ادغام به صورت کانالبهکانال انجام میشود، در بیشتر موارد S=F است]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+
+[ورودی مسطح شده است، یک پیشقدر به ازای هر نورون، تعداد نورونهای FC فاقد محدودیتهای ساختاریست]
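
As a rough check on the table in entries 37 to 40, the usual parameter counts are (F·F·C + 1)·K for a CONV layer, 0 for a POOL layer, and (N_in + 1)·N_out for an FC layer. These are the standard formulas and are assumed here, because the table cells are not reproduced in this file; the function names are hypothetical.

```python
def conv_params(f, c, k):
    """K filters of shape F x F x C, plus one bias parameter per filter."""
    return (f * f * c + 1) * k


def pool_params():
    """Pooling is done channel-wise and has no learnable parameters."""
    return 0


def fc_params(n_in, n_out):
    """Flattened input of size n_in, plus one bias parameter per neuron."""
    return (n_in + 1) * n_out


# 64 filters of size 3x3 applied to a 3-channel input.
print(conv_params(3, 3, 64))
# Fully connected layer from 4096 inputs to 1000 neurons.
print(fc_params(4096, 1000))
```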
+
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+
+ناحیه تاثیر – ناحیه تاثیر در لایه k محدودهای از ورودی Rk×Rk است که هر پیکسلِ kاٌم نگاشت ویژگی میتواند 'ببیند'. با ذکر Fj به عنوان اندازه فیلتر لایه j و Si مقدار گام لایه i و با این توافق که S0=1 است، ناحیه تاثیر در لایه k با فرمول زیر محاسبه میشود:
+
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+
+در مثال زیر داریم، F1=F2=3 و S1=S2=1 که منتج به R2=1+2⋅1+2⋅1=5 میشود.
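
The receptive-field computation in entries 41 and 42 can be sketched as follows, assuming the usual formula R_k = 1 + Σ_{j=1..k} (F_j − 1) · Π_{i=0..j−1} S_i with S_0 = 1. The function name and the list-based interface are illustrative choices.

```python
def receptive_field(filter_sizes, strides):
    """Receptive field R_k after k layers.

    filter_sizes holds F_1..F_k and strides holds S_1..S_k;
    S_0 = 1 is applied by convention.
    """
    if len(filter_sizes) != len(strides):
        raise ValueError("need one stride per filter size")
    r = 1
    jump = 1  # running product S_0 * S_1 * ... * S_{j-1}
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r


# The example from entry 42: F1 = F2 = 3 and S1 = S2 = 1.
print(receptive_field([3, 3], [1, 1]))
```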
+
+
+
+
+
+**43. Commonly used activation functions**
+
+
+توابع فعالسازی پرکاربرد
+
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+
+تابع یکسوساز خطی – تابع یکسوساز خطی (ReLU) یک تابع فعالسازی g است که بر روی تمامی عناصر توده اعمال میشود. هدف آن ارائه (رفتار) غیرخطی به شبکه است. انواع آن در جدول زیر بهصورت خلاصه آمدهاند:
+
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+
+[ReLU ، ReLUنشتدار، ELU، با]
+
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+
+[پیچیدگیهای غیر خطی که از دیدگاه زیستی قابل تفسیر هستند، مسئله افول ReLU برای مقادیر منفی را مهار میکند، در تمامی نقاط مشتقپذیر است]
+
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+
+بیشینهی هموار – مرحله بیشینهی هموار را میتوان به عنوان یک تابع لجستیکی تعمیم داده شده که یک بردار x∈Rn را از ورودی میگیرد و یک بردار خروجی احتمال p∈Rn، بهواسطهی تابع بیشینهی هموار در انتهای معماری، تولید میکند. این تابع بهصورت زیر تعریف میشود:
+
+
+
+
+
+**48. where**
+
+
+که
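
The softmax definition referred to in entries 47 and 48 is p_i = exp(x_i) / Σ_j exp(x_j). A minimal sketch using only the standard library follows; subtracting the maximum score is a common numerical-stability trick that is assumed here, not something the cheatsheet states.

```python
import math


def softmax(scores):
    """Map a vector of scores x in R^n to probabilities p in R^n."""
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```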
+
+
+
+
+
+**49. Object detection**
+
+
+شناسایی شیء
+
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+
+انواع مدل – سه نوع اصلی از الگوریتمهای بازشناسایی وجود دارد، که ماهیت آنچهکه شناسایی شده متفاوت است. این الگوریتمها در جدول زیر توضیح داده شدهاند:
+
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+
+[دستهبندی تصویر، دستهبندی با موقعیتیابی، شناسایی]
+
+
+
+
+
+**52. [Teddy bear, Book]**
+
+
+[خرس تدی، کتاب]
+
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+
+[یک عکس را دستهبندی میکند، احتمال شیء را پیشبینی میکند، یک شیء را در یک عکس شناسایی میکند، احتمال یک شیء و موقعیت آن را پیشبینی میکند، چندین شیء در یک عکس را شناسایی میکند، احتمال اشیاء و موقعیت آنها را پیشبینی میکند]
+
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+
+[CNN سنتی، YOLO ساده شده، R-CNN، YOLO، R-CNN]
+
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+
+شناسایی – در مضمون شناسایی شیء، روشهای مختلفی بسته به اینکه آیا فقط میخواهیم موقعیت قرارگیری شیء را پیدا کنیم یا شکل پیچیدهتری در تصویر را شناسایی کنیم، استفاده میشوند. دو مورد از اصلی ترین آنها در جدول زیر بهصورت خلاصه آورده شدهاند:
+
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+
+[پیشبینی کادر محصورکننده، شناسایی نقاط (برجسته)]
+
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+
+[بخشی از تصویر که شیء در آن قرار گرفته را شناسایی میکند، یک شکل یا مشخصات یک شیء (مثل چشمها) را شناسایی میکند، موشکافانهتر]
+
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+
+[مرکزِ کادر (bx,by)، ارتفاع bh و عرض bw، نقاط مرجع (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+
+نسبت همپوشانی اشتراک به اجتماع - نسبت همپوشانی اشتراک به اجتماع، همچنین به عنوان IoU شناخته میشود، تابعی است که میزان موقعیت دقیق کادر محصورکننده Bp نسبت به کادر محصورکننده حقیقی Ba را میسنجد. این تابع بهصورت زیر تعریف میشود:
+
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+
+نکته: همواره داریم IoU∈[0,1]. به صورت قرارداد، یک کادر محصورکننده Bp را میتوان نسبتا خوب در نظر گرفت اگر IoU(Bp,Ba)⩾0.5 باشد.
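
Intersection over Union from entries 59 and 60 can be sketched as below. The corner format `(x_min, y_min, x_max, y_max)` is an assumption made for brevity; the cheatsheet itself describes boxes by center, height and width.

```python
def iou(box_p, box_a):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(box_p[2], box_a[2]) - max(box_p[0], box_a[0]))
    inter_h = max(0.0, min(box_p[3], box_a[3]) - max(box_p[1], box_a[1]))
    inter = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    union = area_p + area_a - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
actual = (1, 1, 3, 3)
# By the convention in entry 60, a prediction is reasonably good if IoU >= 0.5.
print(iou(predicted, actual), iou(predicted, actual) >= 0.5)
```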
+
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+
+کادرهای محوری – کادر بندی محوری روشی است که برای پیشبینی کادرهای محصورکننده همپوشان استفاده میشود. در عمل، شبکه این اجازه را دارد که بیش از یک کادر بهصورت همزمان پیشبینی کند جاییکه هر پیشبینی کادر مقید به داشتن یک مجموعه خصوصیات هندسی مفروض است. به عنوان مثال، اولین پیشبینی میتواند یک کادر مستطیلی با قالب خاص باشد حال آنکه کادر دوم، یک کادر مستطیلی محوری با قالب هندسی متفاوتی خواهد بود.
+
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+
+فروداشت غیربیشینه – هدف روش فروداشت غیربیشینه، حذف کادرهای محصورکننده همپوشان تکراریِ دسته یکسان با انتخاب معرفترینها است. بعد از حذف همه کادرهایی که احتمال پیشبینی پایینتر از 0.6 دارند، مراحل زیر با وجود آنکه کادرهایی باقی میمانند، تکرار میشوند:
+
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+
+[برای یک دسته مفروض، گام اول: کادر با بالاترین احتمال پیشبینی را انتخاب کن، گام دوم: هر کادری که IoU≥0.5 نسبت به کادر پیشین دارد را رها کن.]
+
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+
+[پیشبینی کادرها، انتخاب کادرِ با احتمال بیشینه، حذف (کادر) همپوشان دسته یکسان، کادرهای محصورکننده نهایی]
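
The non-max suppression steps in entries 62 to 64 can be sketched for a single class as follows. The `(score, box)` tuple layout, the corner box format and the function names are assumptions for illustration; the 0.6 and 0.5 thresholds are the ones quoted in the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, score_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (score, box) pairs predicted for ONE class."""
    # Preliminary filter: drop boxes whose probability is below the threshold.
    remaining = sorted(
        (b for b in boxes if b[0] >= score_threshold),
        key=lambda b: b[0],
        reverse=True,
    )
    kept = []
    while remaining:
        # Step 1: pick the box with the largest prediction probability.
        best = remaining.pop(0)
        kept.append(best)
        # Step 2: discard any box whose IoU with that box is >= the threshold.
        remaining = [b for b in remaining if iou(best[1], b[1]) < iou_threshold]
    return kept


predictions = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # heavy overlap with the first box
    (0.7, (20, 20, 30, 30)),  # separate object
    (0.3, (0, 0, 10, 10)),    # below the probability threshold
]
print(non_max_suppression(predictions))
```

For several classes, the same routine would be run once per class.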
+
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+
+YOLO - «شما فقط یکبار نگاه میکنید» (YOLO) یک الگوریتم شناسایی شیء است که مراحل زیر را اجرا میکند:
+
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+
+[گام اول: تصویر ورودی را به یک مشبک G×G تقسیم کن، گام دوم: برای هر سلول مشبک، یک CNN که y را به شکل زیر پیشبینی میکند، اجرا کن:، k مرتبه تکرارشده]
+
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+
+که pc احتمال شناسایی یک شیء است، bx,by,bh,bw اندازههای نسبی کادر محیطی شناسایی شده است، c1,...,cp نمایش «تکفعال» یک دسته از p دسته که تشخیص داده شده است، و k تعداد کادرهای محوری است.
+
+
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+
+گام سوم: الگوریتم فروداشت غیربیشینه را برای حذف هر کادر محصورکننده همپوشان تکراری بالقوه، اجرا کن.
+
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+
+[تصویر اصلی، تقسیم به GxG مشبک، پیشبینی کادر محصورکننده، فروداشت غیربیشینه]
+
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+
+نکته: زمانیکه pc=0 است، شبکه هیچ شیئی را شناسایی نمیکند. در چنین حالتی، پیشبینیهای متناظر bx,…,cp بایستی نادیده گرفته شوند.
+
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potentially relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+
+R-CNN - ناحیه با شبکههای عصبی پیچشی (R-CNN) یک الگوریتم شناسایی شیء است که ابتدا تصویر را برای یافتن کادرهای محصورکننده مربوط بالقوه قطعهبندی میکند و سپس الگوریتم شناسایی را برای یافتن محتملترین اشیاء در این کادرهای محصور کننده اجرا میکند.
+
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+
+[تصویر اصلی، قطعه بندی، پیشبینی کادر محصور کننده، فروداشت غیربیشینه]
+
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+
+نکته: هرچند الگوریتم اصلی به لحاظ محاسباتی پرهزینه و کند است، معماریهای جدید از قبیل Fast R-CNN و Faster R-CNN باعث شدند که الگوریتم سریعتر اجرا شود.
+
+
+
+
+
+**74. Face verification and recognition**
+
+
+تایید چهره و بازشناسایی
+
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in the table below:**
+
+
+انواع مدل – دو نوع اصلی از مدل در جدول زیر بهصورت خلاصه آورده شدهاند:
+
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+
+[تایید چهره، بازشناسایی چهره، جستار، مرجع، پایگاه داده]
+
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+
+[فرد مورد نظر است؟، جستجوی یکبهیک، این فرد یکی از K فرد پایگاه داده است؟، جستجوی یکبهچند]
+
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+
+یادگیری یکبارهای – یادگیری یکبارهای یک الگوریتم تایید چهره است که از یک مجموعه آموزشی محدود برای یادگیری یک تابع مشابهت که میزان اختلاف دو تصویر مفروض را تعیین میکند، بهره میبرد. تابع مشابهت اعمالشده بر روی دو تصویر اغلب با نماد d(image 1, image 2) نمایش داده میشود.
+
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+
+شبکهی Siamese - هدف شبکهی Siamese یادگیری طریقه رمزنگاری تصاویر و سپس تعیین اختلاف دو تصویر است. برای یک تصویر مفروض ورودی x(i)، خروجی رمزنگاری شده اغلب با نماد f(x(i)) نمایش داده میشود.
+
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+
+خطای سهگانه – خطای سهگانه ℓ یک تابع خطا است که بر روی بازنمایی تعبیهی سهگانهی تصاویر A (محور)، P (مثبت) و N (منفی) محاسبه میشود. نمونههای محور (anchor) و مثبت به دسته یکسانی تعلق دارند، حال آنکه نمونه منفی به دسته دیگری تعلق دارد. با نامیدن α∈R+ (به عنوان) فراسنج حاشیه، این خطا بهصورت زیر تعریف میشود:
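
The triplet loss in entry 80 is usually written ℓ(A, P, N) = max(d(A, P) − d(A, N) + α, 0). The sketch below assumes d is the squared Euclidean distance between embeddings, which is a common choice but not stated in the text; the function names are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha):
    """Loss on the embeddings of anchor A, positive P and negative N."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
# The negative is already far from the anchor, so the loss is zero.
print(triplet_loss(anchor, positive, negative, alpha=0.2))
```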
+
+
+
+
+
+**81. Neural style transfer**
+
+
+انتقالِ سبک عصبی
+
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+
+انگیزه – هدف انتقالِ سبک عصبی تولید یک تصویر G بر مبنای یک محتوای مفروض C و سبک مفروض S است.
+
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+
+[محتوای C، سبک S، تصویر تولیدشدهی G]
+
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+
+فعالسازی – در یک لایه مفروض l، فعالسازی با a[l] نمایش داده میشود و به ابعاد nH×nw×nc است
+
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+
+تابع هزینهی محتوا – تابع هزینهی محتوا Jcontent(C,G) برای تعیین میزان اختلاف تصویر تولیدشده G از تصویر اصلی C استفاده میشود. این تابع بهصورت زیر تعریف میشود:
+
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+
+ماتریسِ سبک - ماتریسِ سبک G[l] یک لایه مفروض l، یک ماتریس گرَم (Gram) است که هر کدام از عناصر G[l]kk′ میزان همبستگی کانالهای k و k′ را میسنجند. این ماتریس نسبت به فعالسازیهای a[l] بهصورت زیر محاسبه میشود:
+
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+
+نکته: ماتریس سبک برای تصویر سبک و تصویر تولید شده، به ترتیب با G[l] (S) و G[l] (G) نمایش داده میشوند.
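
The style (Gram) matrix from entries 86 and 87 is typically computed as G[l]_kk′ = Σ_i Σ_j a[l]_ijk · a[l]_ijk′, summing over the nH × nw spatial positions. A dependency-free sketch, assuming the activation is stored as a nested list of shape nH × nw × nc:

```python
def style_matrix(activation):
    """Gram matrix of an activation volume of shape n_H x n_W x n_C."""
    n_c = len(activation[0][0])
    gram = [[0.0] * n_c for _ in range(n_c)]
    for row in activation:
        for pixel in row:  # pixel holds the n_C channel values at one position
            for k in range(n_c):
                for k2 in range(n_c):
                    gram[k][k2] += pixel[k] * pixel[k2]
    return gram


# A tiny 1 x 2 x 2 activation: two spatial positions, two channels.
a = [[[1.0, 2.0], [3.0, 4.0]]]
print(style_matrix(a))
```

The result is symmetric, as expected for a Gram matrix.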
+
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+
+تابع هزینهی سبک – تابع هزینهی سبک Jstyle(S,G) برای تعیین میزان اختلاف تصویر تولیدشده G و سبک S استفاده میشود. این تابع به صورت زیر تعریف میشود:
+
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+
+تابع هزینهی کل – تابع هزینهی کل به صورت ترکیبی از توابع هزینهی سبک و محتوا تعریف شده است که با فراسنجهای α,β, به شکل زیر وزندار شده است:
+
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+
+نکته: مقدار بیشتر α مدل را به توجه بیشتر به محتوا وا میدارد حال آنکه مقدار بیشتر β مدل را به توجه بیشتر به سبک وا میدارد.
+
+
+
+
+
+**91. Architectures using computational tricks**
+
+
+معماریهایی که از ترفندهای محاسباتی استفاده میکنند.
+
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+
+شبکهی همآوردِ مولد – شبکهی همآوردِ مولد، همچنین با نام GANs شناخته میشوند، ترکیبی از یک مدل مولد و تمیزدهنده هستند، جاییکه مدل مولد هدفش تولید واقعیترین خروجی است که به (مدل) تمیزدهنده تغذیه میشود و این (مدل) هدفش تفکیک بین تصویر تولیدشده و واقعی است.
+
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+
+[آموزش، نویز، تصویر دنیای واقعی، مولد، تمیز دهنده، واقعی بدلی]
+
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+
+نکته: موارد استفاده متنوع GAN ها شامل تبدیل متن به تصویر، تولید موسیقی و تلفیقی از آنهاست.
+
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+
+ResNet – معماری شبکهی پسماند (همچنین با عنوان ResNet شناخته میشود) از بلاکهای پسماند با تعداد لایههای زیاد به منظور کاهش خطای آموزش استفاده میکند. بلاک پسماند معادلهای با خصوصیات زیر دارد:
+
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+
+شبکهی Inception – این معماری از ماژولهای inception استفاده میکند و هدفش فرصت دادن به (عملیات) کانولوشنی مختلف برای افزایش کارایی از طریق تنوعبخشی ویژگیها است. به طور خاص، این معماری از ترفند کانولوشنی 1×1 برای محدود سازی بار محاسباتی استفاده میکند.
+
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+
+راهنمای یادگیری عمیق هم اکنون به زبان [فارسی] در دسترس است.
+
+
+
+
+
+**98. Original authors**
+
+
+نویسندگان اصلی
+
+
+
+
+
+**99. Translated by X, Y and Z**
+
+
+ترجمه شده توسط X،Y و Z
+
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+
+بازبینی شده توسط X،Y و Z
+
+
+
+
+
+**101. View PDF version on GitHub**
+
+
+نسخه پیدیاف را در گیتهاب ببینید
+
+
+
+
+
+**102. By X and Y**
+
+
+توسط X و Y
+
+
+
+
diff --git a/fa/cs-230-deep-learning-tips-and-tricks.md b/fa/cs-230-deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..1248a06bf
--- /dev/null
+++ b/fa/cs-230-deep-learning-tips-and-tricks.md
@@ -0,0 +1,586 @@
+
+**Deep Learning Tips and Tricks translation**
+
+
+
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+
+راهنمای کوتاه نکات و ترفندهای یادگیری عمیق
+
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+
+کلاس CS 230 - یادگیری عمیق
+
+
+
+
+
+**3. Tips and tricks**
+
+
+نکات و ترفندها
+
+
+
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+
+[پردازش داده، دادهافزایی، نرمالسازی دستهای]
+
+
+
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+
+[آموزش یک شبکهی عصبی، تکرار(Epoch)، دستهی کوچک، خطای آنتروپی متقاطع، انتشار معکوس، گرادیان نزولی، بهروزرسانی وزنها، وارسی گرادیان]
+
+
+
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+
+[تنظیم فراسنج، مقداردهی اولیه Xavier،یادگیری انتقالی، نرخ یادگیری، نرخ یادگیری سازگارشونده]
+
+
+
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+
+[نظامبخشی، بروناندازی، نظامبخشی وزن، توقف زودهنگام]
+
+
+
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+
+[عادتهای خوب، بیشبرارزش دستهی کوچک، وارسی گرادیان]
+
+
+
+
+
+**9. View PDF version on GitHub**
+
+
+نسخه پیدیاف را در گیتهاب ببینید
+
+
+
+
+
+**10. Data processing**
+
+
+پردازش داده
+
+
+
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+
+دادهافزایی ― مدلهای یادگیری عمیق معمولا به دادههای زیادی نیاز دارند تا بتوانند به خوبی آموزش ببینند. اغلب، استفاده از روشهای دادهافزایی برای گرفتن دادهی بیشتر از دادههای موجود، مفید است. اصلیترین آنها در جدول زیر به اختصار آمدهاند. به عبارت دقیقتر، با در نظر گرفتن تصویر ورودی زیر، روشهایی که میتوان اعمال کرد بدین شرح هستند:
+
+
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+
+[تصویر اصلی، قرینه، چرخش، برش تصادفی]
+
+
+
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+
+[تصویر (آغازین) بدون هیچگونه تغییری، قرینهشده نسبت به محوری که معنای (محتوای) تصویر را حفظ میکند، چرخش با زاویهی اندک، خط افق نادرست را شبیهسازی میکند، روی ناحیهای تصادفی از تصویر متمرکز میشود، چندین برش تصادفی را میتوان پشتسرهم انجام داد]
+
+
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+
+[تغییر رنگ، اضافهکردن نویز، هدررفت اطلاعات، تغییر تباین(کُنتراست)]
+
+
+
+
+
+**15. [Nuances of RGB are slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposure due to time of day]**
+
+
+[عناصر RGB کمی تغییر کرده است، نویزی که در هنگام مواجه شدن با نور رخ میدهد را شبیهسازی میکند، افزودگی نویز، مقاومت بیشتر نسبت به تغییر کیفیت تصاویر ورودی، بخشهایی از تصویر نادیده گرفته میشوند، تقلید (شبیه سازی) هدررفت بالقوه بخشهایی از تصویر، تغییر درخشندگی، با توجه به زمان روز تفاوت نمایش (تصویر) را کنترل میکند]
+
+
+
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+
+نکته: دادهها معمولا در فرآیند آموزش (به صورت درجا) افزایش پیدا میکنند.
+
+
+
+
+
+**17. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+
+نرمالسازی دستهای ― یک مرحله از فراسنجهای γ و β که دستهی {xi} را نرمال میکند. نماد μB و σ2B به میانگین و وردایی دستهای که میخواهیم آن را اصلاح کنیم اشاره دارد که به صورت زیر است:
+
+
+
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+
+معمولا بعد از یک لایهی تماممتصل یا لایهی کانولوشنی و قبل از یک لایهی غیرخطی اعمال میشود و امکان استفاده از نرخ یادگیری بالاتر را میدهد و همچنین باعث میشود که وابستگی شدید مدل به مقداردهی اولیه کاهش یابد.
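
The batch normalization step in entries 17 and 18 is commonly written x_i ← γ · (x_i − μ_B) / √(σ²_B + ε) + β. The sketch below assumes that form for a batch of scalars; the small constant `eps` and the function name are assumptions added for numerical safety and illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch {x_i} of scalars, then scale by gamma and shift by beta."""
    n = len(batch)
    mu = sum(batch) / n                          # batch mean
    var = sum((x - mu) ** 2 for x in batch) / n  # batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


print(batch_norm([1.0, 2.0, 3.0]))
print(batch_norm([1.0, 2.0, 3.0], gamma=2.0, beta=5.0))
```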
+
+
+
+
+
+**19. Training a neural network**
+
+
+آموزش یک شبکهی عصبی
+
+
+
+
+
+**20. Definitions**
+
+
+تعاریف
+
+
+
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+
+تکرار (epoch) ― در مضمون آموزش یک مدل، تکرار اصطلاحی است که مدل در یک دوره تکرار تمامی نمونههای آموزشی را برای بهروزرسانی وزنها میبیند.
+
+
+
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+
+گرادیان نزولی دستهیکوچک ― در فاز آموزش، بهروزرسانی وزنها معمولا بر مبنای تمامی مجموعه آموزش به علت پیچیدگیهای محاسباتی، یا یک نمونه داده به علت مشکل نویز، نیست. در عوض، گام بهروزرسانی بر روی دستههای کوچک انجام می شود، که تعداد نمونههای داده در یک دسته یک ابرفراسنج است که میتوان آن را تنظیم کرد.
+
+
+
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+
+تابع خطا ― به منظور سنجش کارایی یک مدل مفروض، معمولا از تابع خطای L برای ارزیابی اینکه تا چه حد خروجی حقیقی y به شکل صحیح توسط خروجی z مدل پیشبینی شدهاند، استفاده میشود.
+
+
+
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+
+خطای آنتروپی متقاطع – در مضمون دستهبندی دودویی در شبکههای عصبی، عموما از تابع خطای آنتروپی متقاطع L(z,y) استفاده و به صورت زیر تعریف میشود:
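
The binary cross-entropy loss in entry 24 is L(z, y) = −[y log(z) + (1 − y) log(1 − z)]. A minimal sketch follows; clipping z away from 0 and 1 with `eps` is an assumption added to avoid log(0) and is not part of the definition.

```python
import math


def binary_cross_entropy(z, y, eps=1e-12):
    """Cross-entropy between a predicted probability z and a label y in {0, 1}."""
    z = min(max(z, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(z) + (1 - y) * math.log(1.0 - z))


# A confident correct prediction costs less than a confident wrong one.
print(binary_cross_entropy(0.9, 1), binary_cross_entropy(0.1, 1))
```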
+
+
+
+
+
+**25. Finding optimal weights**
+
+
+یافتن وزنهای بهینه
+
+
+
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+
+انتشار معکوس ― انتشار معکوس روشی برای بهروزرسانی وزنها با توجه به خروجی واقعی و خروجی مورد انتظار در شبکهی عصبی است. مشتق نسبت به هر وزن w توسط قاعدهی زنجیری محاسبه میشود.
+
+
+
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+
+با استفاده از این روش، هر وزن با قانون زیر بهروزرسانی میشود:
+
+
+
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+
+بهروزرسانی وزنها – در یک شبکهی عصبی، وزنها به شکل زیر بهروزرسانی میشوند:
+
+
+
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+
+[گام 1: یک دسته از دادههای آموزشی گرفته شده و با استفاده از انتشار مستقیم خطا محاسبه میشود، گام 2: با استفاده از انتشار معکوس مشتق خطا نسبت به هر وزن محاسبه میشود، گام 3: با استفاده از مشتقات، وزنهای شبکه بهروزرسانی میشوند.]
+
+
+
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+
+[انتشار مستقیم، انتشار معکوس، بهروزرسانی وزنها]
+
+
+
+
+
+**31. Parameter tuning**
+
+
+تنظیم فراسنج
+
+
+
+
+
+**32. Weights initialization**
+
+
+مقداردهی اولیهی وزنها
+
+
+
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization yields initial weights that take into account characteristics that are unique to the architecture.**
+
+
+مقداردهی اولیه Xavier ― بهجای مقداردهی اولیهی وزنها به شیوهی کاملا تصادفی، مقداردهی اولیه Xavier این امکان را فراهم میسازد تا وزنهای اولیهای داشته باشیم که ویژگیهای منحصر به فرد معماری را به حساب میآورند.
+
+
+
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+
+یادگیری انتقالی ― آموزش یک مدل یادگیری عمیق به دادههای زیاد و مهمتر از آن به زمان زیادی احتیاج دارد. اغلب بهتر است که از وزنهای پیشآموخته روی پایگاه دادههای عظیم که آموزش بر روی آنها روزها یا هفتهها طول میکشند استفاده کرد، و آنها را برای موارد استفادهی خود به کار برد. بسته به میزان دادههایی که در اختیار داریم، در زیر روشهای مختلفی که میتوان از آنها بهره جست آورده شدهاند:
+
+
+
+
+
+**35. [Training size, Illustration, Explanation]**
+
+
+[تعداد دادههای آموزش، نگاره، توضیح]
+
+
+
+
+
+**36. [Small, Medium, Large]**
+
+
+[کوچک، متوسط، بزرگ]
+
+
+
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+
+[منجمد کردن تمامی لایهها، آموزش وزنها در بیشینهی هموار، منجمد کردن اکثر لایهها، آموزش وزنها در لایههای آخر و بیشینهی هموار، آموزش وزنها در (تمامی) لایهها و بیشینهی هموار با مقداردهیاولیهی وزنها بر طبق مقادیر پیشآموخته]
+
+
+
+
+
+**38. Optimizing convergence**
+
+
+بهینهسازی همگرایی
+
+
+
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+
+
+نرخ یادگیری – نرخ یادگیری اغلب با نماد α و گاهی اوقات با نماد η نمایش داده میشود و بیانگر سرعت (گام) بهروزرسانی وزنها است که میتواند مقداری ثابت داشته باشد یا به صورت سازگارشونده تغییر کند. محبوبترین روش حال حاضر Adam نام دارد، روشی است که نرخ یادگیری را در حین فرآیند آموزش تنظیم میکند.
+
+
+
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+
+نرخهای یادگیری سازگارشونده ― داشتن نرخ یادگیری متغیر در فرآیند آموزش یک مدل، میتواند زمان آموزش را کاهش دهد و راهحل بهینه عددی را بهبود ببخشد. با آنکه بهینهساز Adam محبوبترین روش مورد استفاده است، دیگر روشها نیز میتوانند مفید باشند. این روشها در جدول زیر به اختصار آمدهاند:
+
+
+
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+
+[روش، توضیح، بهروزرسانی w، بهروزرسانی b]
+
+
+
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+
+[تکانه، نوسانات را تعدیل میدهد، بهبود SGD، دو فراسنج که نیاز به تنظیم دارند]
+
+
+
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+
+[RMSprop، انتشار جذر میانگین مربعات، سرعت بخشیدن به الگوریتم یادگیری با کنترل نوسانات]
+
+
+
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+
+[Adam، تخمین سازگارشونده ممان، محبوبترین روش، چهار فراسنج که نیاز به تنظیم دارند]
+
+
+
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+
+نکته: سایر متدها شامل Adadelta، Adagrad و SGD هستند.
+
+
+
+
+
+**46. Regularization**
+
+
+نظامبخشی
+
+
+
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+
+برون‌اندازی – برون‌اندازی روشی است که در شبکه‌های عصبی برای جلوگیری از بیش‌برازش بر روی داده‌های آموزشی با حذف تصادفی نورون‌ها با احتمال p>0 استفاده می‌شود. این روش مدل را مجبور می‌کند تا از تکیه کردن بیش‌ازحد بر روی مجموعه خاصی از ویژگی‌ها خودداری کند.
+
+
+
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+
+نکته: بیشتر کتابخانههای یادگیری عمیق بروناندازی را با استفاده از فراسنج 'نگهداشتن' 1-p کنترل میکنند.
+
+
+
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+
+نظام‌بخشی وزن – برای اطمینان از اینکه (مقادیر) وزن‌ها بیش‌ازحد بزرگ نیستند و مدل روی مجموعه‌ی آموزش بیش‌برازش نمی‌کند، روش‌های نظام‌بخشی معمولا بر روی وزن‌های مدل اجرا می‌شوند. اصلی‌ترین آن‌ها در جدول زیر به اختصار آمده‌اند:
+
+
+
+
+
+**50. [LASSO, Ridge, Elastic Net]**
+
+
+[LASSO, Ridge, Elastic Net]
+
+
+
+**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+
+[ضرایب را تا صفر کاهش می‌دهد، برای انتخاب متغیر مناسب است، ضرایب را کوچک‌تر می‌کند، بین انتخاب متغیر و ضرایب کوچک مصالحه می‌کند]
+
+
+
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+
+توقف زودهنگام ― این روش نظامبخشی، فرآیند آموزش را به محض اینکه خطای اعتبارسنجی ثابت میشود یا شروع به افزایش پیدا کند، متوقف میکند.
+
+
+
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+
+[خطا، اعتبارسنجی، آموزش، توقف زودهنگام، تکرارها]
+
+
+
+
+
+**53. Good practices**
+
+
+عادتهای خوب
+
+
+
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+
+بیش‌برازش روی دسته‌ی کوچک ― هنگام اشکال‌زدایی یک مدل، اغلب مفید است که یک سری آزمایش‌های سریع برای اطمینان از اینکه هیچ مشکل عمده‌ای در معماری مدل وجود ندارد، انجام شود. به طور خاص، برای اطمینان از اینکه مدل می‌تواند به شکل صحیح آموزش ببیند، یک دسته‌ی کوچک (از داده‌ها) به شبکه داده می‌شود تا دریابیم که آیا مدل می‌تواند روی آن بیش‌برازش کند. اگر نتواند، بدین معناست که مدل یا بیش از حد پیچیده است یا پیچیدگی کافی برای بیش‌برازش شدن روی دسته‌ی کوچک ندارد، چه برسد به یک مجموعه آموزشی با اندازه عادی.
+
+
+
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+
+وارسی گرادیان – وارسی گرادیان روشی است که در طول پیادهسازی گذر روبهعقبِ یک شبکهی عصبی استفاده میشود. این روش مقدار گرادیان تحلیلی را با گرادیان عددی در نقطههای مفروض مقایسه میکند و نقش بررسی درستی را ایفا میکند.
+
+
+
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+
+[نوع، گرادیان عددی، گرادیان تحلیلی]
+
+
+
+
+
+**57. [Formula, Comments]**
+
+
+[فرمول، توضیحات]
+
+
+
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+
+[پرهزینه (از نظر محاسباتی)، خطا باید دو بار برای هر بُعد محاسبه شود، برای تایید صحت پیادهسازی تحلیلی استفاده میشود، مصالحه در انتخاب h: نه بسیار کوچک (ناپایداری عددی) و نه خیلی بزرگ (تخمین گرادیان ضعیف) باشد]
+
+
+
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+
+[نتیجه 'عینی'، محاسبه مستقیم، در پیادهسازی نهایی استفاده میشود]
+
+
+
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+
+راهنمای یادگیری عمیق هم اکنون به زبان [فارسی] در دسترس است.
+
+
+**61. Original authors**
+
+
+نویسندگان اصلی
+
+
+
+
+**62. Translated by X, Y and Z**
+
+
+ترجمه شده توسط X، Y و Z
+
+
+
+
+**63. Reviewed by X, Y and Z**
+
+
+بازبینی شده توسط X، Y و Z
+
+
+
+
+**64. View PDF version on GitHub**
+
+
+نسخه پیدیاف را در گیتهاب ببینید
+
+
+
+
+**65. By X and Y**
+
+
+توسط X و Y
+
+
+
diff --git a/fa/cs-230-recurrent-neural-networks.md b/fa/cs-230-recurrent-neural-networks.md
new file mode 100644
index 000000000..22a1e2106
--- /dev/null
+++ b/fa/cs-230-recurrent-neural-networks.md
@@ -0,0 +1,868 @@
+**Recurrent Neural Networks translation**
+
+
+
+**1. Recurrent Neural Networks cheatsheet**
+
+
+راهنمای کوتاه شبکههای عصبی برگشتی
+
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+
+کلاس CS 230 - یادگیری عمیق
+
+
+
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+
+[نمای کلی، ساختار معماری، کاربردهای RNNها، تابع خطا، انتشار معکوس]
+
+
+
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+
+[کنترل وابستگیهای بلندمدت، توابع فعالسازی رایج، مشتق صفرشونده/منفجرشونده، برش گرادیان، GRU/LSTM، انواع دروازه، RNN دوسویه، RNN عمیق]
+
+
+
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+
+[یادگیری بازنمائی کلمه، نمادها، ماتریس تعبیه، Word2vec، Skip-gram، نمونه‌برداری منفی، GloVe]
+
+
+
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+
+[مقایسهی کلمات، شباهت کسینوسی، t-SNE]
+
+
+
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+
+[مدل زبانی، ان‌گرام، سرگشتگی]
+
+
+
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+
+[ترجمهی ماشینی، جستجوی پرتو، نرمالسازی طول، تحلیل خطا، امتیاز Bleu]
+
+
+
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+
+[ژرفنگری، مدل ژرفنگری، وزنهای ژرفنگری]
+
+
+
+
+
+**10. Overview**
+
+
+نمای کلی
+
+
+
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+
+معماری RNN سنتی ــ شبکههای عصبی برگشتی که همچنین با عنوان RNN شناخته میشوند، دستهای از شبکههای عصبیاند که این امکان را میدهند خروجیهای قبلی بهعنوان ورودی استفاده شوند و در عین حال حالتهای نهان داشته باشند. این شبکهها بهطور معمول عبارتاند از:
+
+
+
+
+**12. For each timestep t, the activation a and the output y are expressed as follows:**
+
+
+بهازای هر گام زمانی t، فعالسازی a و خروجی y بهصورت زیر بیان میشود:
+
+
+
+
+
+**13. and**
+
+
+و
+
+
+
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+
+که در آن Wax,Waa,Wya,ba,by ضرایبیاند که در راستای زمان به اشتراک گذاشته میشوند و g1، g2 توابع فعالسازی هستند.
+
+
+
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+
+مزایا و معایب معماری RNN بهصورت خلاصه در جدول زیر آورده شدهاند:
+
+
+
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+
+[مزایا، امکان پردازش ورودی با هر طولی، اندازه‌ی مدل مطابق با اندازه‌ی ورودی افزایش نمی‌یابد، اطلاعات (زمان‌های) گذشته در محاسبه در نظر گرفته می‌شود، وزن‌ها در طول زمان به اشتراک گذاشته می‌شوند]
+
+
+
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+
+[معایب، محاسبه کند میشود، دشوار بودن دسترسی به اطلاعات مدتها پیش، در نظر نگرفتن ورودیهای بعدی در وضعیت جاری]
+
+
+
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+
+کاربردهای RNNها ــ مدل‌های RNN غالباً در حوزه‌ی پردازش زبان طبیعی و حوزه‌ی بازشناسایی گفتار به کار می‌روند. کاربردهای مختلف آن‌ها به صورت خلاصه در جدول زیر آورده شده‌اند:
+
+
+
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+
+[نوع RNN، نگاره، مثال]
+
+
+
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+
+[یک به یک، یک به چند، چند به یک، چند به چند]
+
+
+
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Named entity recognition, Machine translation]**
+
+
+[شبکهی عصبی سنتی، تولید موسیقی، دستهبندی حالت احساسی، بازشناسایی موجودیت اسمی، ترجمه ماشینی]
+
+
+
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+
+تابع خطا ــ در شبکه عصبی برگشتی، تابع خطا L برای همهی گامهای زمانی براساس خطا در هر گام به صورت زیر محاسبه میشود:
+
+
+
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+
+انتشار معکوس در طول زمان ـــ انتشار معکوس در هر نقطه از زمان انجام میشود. در گام زمانی T، مشتق خطا L با توجه به ماتریس وزن W بهصورت زیر بیان میشود:
+
+
+
+
+
+**24. Handling long term dependencies**
+
+
+کنترل وابستگیهای بلندمدت
+
+
+
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+
+توابع فعالسازی پرکاربرد ـــ رایجترین توابع فعالسازی بهکاررفته در ماژولهای RNN به شرح زیر است:
+
+
+
+
+
+**26. [Sigmoid, Tanh, RELU]**
+
+
+[سیگموید، تانژانت هذلولوی، یکسو ساز]
+
+
+
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
+
+
+مشتق صفرشونده/منفجرشونده ــ پدیده‌های مشتق صفرشونده و منفجرشونده غالبا در بستر RNNها رخ می‌دهند. علت چنین رخدادی این است که به دلیل گرادیان ضربی، که می‌تواند با توجه به تعداد لایه‌ها به صورت نمایی کاهش/افزایش یابد، به‌دست آوردن وابستگی‌های بلندمدت سخت است.
+
+
+
+
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+
+برش گرادیان ــ یک روش برای مقابله با انفجار گرادیان است که گاهی اوقات هنگام انتشار معکوس رخ میدهد. با تعیین حداکثر مقدار برای گرادیان، این پدیده در عمل کنترل میشود.
+
+
+
+
+
+**29. clipped**
+
+
+برش دادهشده
+
+
+
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+
+انواع دروازه ـــ برای حل مشکل مشتق صفرشونده، در برخی از انواع RNNها، دروازه‌های خاصی استفاده می‌شود و این دروازه‌ها عموما هدف معینی دارند. این دروازه‌ها عموما با نماد Γ نمایش داده می‌شوند و برابرند با:
+
+
+
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+
+که W,U,b ضرایب خاص دروازه و σ تابع سیگموید است. دروازههای اصلی به صورت خلاصه در جدول زیر آورده شدهاند:
+
+
+
+
+
+**32. [Type of gate, Role, Used in]**
+
+
+[نوع دروازه، نقش، بهکار رفته در]
+
+
+
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+
+[دروازه‌ی به‌روزرسانی، دروازه‌ی ربط (میزان اهمیت)، دروازه‌ی فراموشی، دروازه‌ی خروجی]
+
+
+
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+
+[چه میزان از گذشته اکنون اهمیت دارد؟، اطلاعات گذشته رها شوند؟، سلول حذف شود یا خیر؟، چه میزان از (محتوای) سلول آشکار شود؟]
+
+
+
+
+
+**35. [LSTM, GRU]**
+
+
+[LSTM، GRU]
+
+
+
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+
+GRU/LSTM ـــ واحد برگشتی دروازهدار (GRU) و واحدهای حافظهی کوتاه-مدت طولانی (LSTM) مشکل مشتق صفرشونده که در RNNهای سنتی رخ میدهد، را بر طرف میکنند، درحالیکه LSTM شکل عمومیتر GRU است. در جدول زیر، معادلههای توصیفکنندهٔ هر معماری به صورت خلاصه آورده شدهاند:
+
+
+
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+
+[توصیف، واحد برگشتی دروازه‌دار (GRU)، حافظه‌ی کوتاه-مدت طولانی (LSTM)، وابستگی‌ها]
+
+
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+
+نکته: نشانه‌ی ⋆ نمایانگر ضرب عنصربه‌عنصر دو بردار است.
+
+
+
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+
+انواع RNN ها ــ جدول زیر سایر معماریهای پرکاربرد RNN را به صورت خلاصه نشان میدهد.
+
+
+
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+
+[دوسویه (BRNN)، عمیق (DRNN)]
+
+
+
+
+
+**41. Learning word representation**
+
+
+یادگیری بازنمائی کلمه
+
+
+
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+
+در این بخش، برای اشاره به واژگان از V و برای اشاره به اندازهی آن از |V| استفاده میکنیم.
+
+
+
+
+
+**43. Motivation and notations**
+
+
+انگیزه و نمادها
+
+
+
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+
+روشهای بازنمائی ― دو روش اصلی برای بازنمائی کلمات به صورت خلاصه در جدول زیر آورده شدهاند:
+
+
+
+
+
+**45. [1-hot representation, Word embedding]**
+
+
+[بازنمائی تکفعال، تعبیهی کلمه]
+
+
+
+
+
+**46. [teddy bear, book, soft]**
+
+
+[خرس تدی، کتاب، نرم]
+
+
+
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]**
+
+
+[نشان داده شده با نماد ow، رویکرد ساده، فاقد اطلاعات تشابه، نشان داده شده با نماد ew، بهحسابآوردن تشابه کلمات]
+
+
+
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+
+ماتریس تعبیه ـــ به ازای کلمهی مفروض w ، ماتریس تعبیه E ماتریسی است که بازنمائی تکفعال ow را به نمایش تعبیهی ew نگاشت میدهد:
+
+
+
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+
+نکته: یادگیری ماتریس تعبیه را میتوان با استفاده از مدلهای درستنمایی هدف/متن(زمینه) انجام داد.
+
+
+
+
+
+**50. Word embeddings**
+
+
+(نمایش) تعبیهی کلمه
+
+
+
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+
+Word2vec ― Word2vec چهارچوبی است که با محاسبهی احتمال قرار گرفتن یک کلمهی خاص در میان سایر کلمات، تعبیههای کلمه را یاد میگیرد. مدلهای متداول شامل Skip-gram، نمونهبرداری منفی و CBOW هستند.
+
+
+
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+
+[یک خرس تدی بامزه در حال مطالعه است، خرس تدی، نرم، شعر فارسی، هنر]
+
+
+
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+
+[آموزش شبکه بر روی مسئلهی جایگزین، استخراج بازنمائی سطح بالا، محاسبهی نمایش تعبیهی کلمات]
+
+
+
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+
+Skip-gram ــ مدل اسکیپگرام word2vec یک وظیفهی یادگیری بانظارت است که تعبیههای کلمه را با ارزیابی احتمال وقوع کلمهی t هدف با کلمهی زمینه c یاد میگیرد. با توجه به اینکه نماد θt پارامتری مرتبط با t است، احتمال P(t|c) بهصورت زیر بهدست میآید:
+
+
+
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+
+نکته: جمع کل واژگان در بخش مقسومالیه بیشینهیهموار باعث میشود که این مدل از لحاظ محاسباتی گران شود. مدل CBOW مدل word2vec دیگری ست که از کلمات اطراف برای پیشبینی یک کلمهٔ مفروض استفاده میکند.
+
+
+
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target word are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+
+نمونه‌برداری منفی ― مجموعه‌ای از دسته‌بندهای دودویی با استفاده از رگرسیون لجستیک است که مقصودش ارزیابی احتمال ظهور هم‌زمان کلمه‌ی مفروض هدف و کلمه‌ی مفروض زمینه است، که در اینجا مدل‌ها براساس مجموعه‌ی k مثال منفی و 1 مثال مثبت آموزش می‌بینند. با توجه به کلمه‌ی مفروض زمینه c و کلمه‌ی مفروض هدف t، پیش‌بینی به صورت زیر بیان می‌شود:
+
+
+
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+
+نکته: این روش از لحاظ محاسباتی ارزانتر از مدل skip-gram است.
+
+
+
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+
+GloVe ― مدل GloVe، مخفف بردارهای سراسری بازنمائی کلمه، یکی از روشهای تعبیه کلمه است که از ماتریس همرویدادی X استفاده میکند که در آن هر Xi,j به تعداد دفعاتی اشاره دارد که هدف i با زمینهٔ j رخ میدهد. تابع هزینهی J بهصورت زیر است:
+
+
+
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+
+
+که در آن f تابع وزندهی است، بهطوری که Xi,j=0⟹f(Xi,j)=0. با توجه به تقارنی که e و θ در این مدل دارند، نمایش تعبیهی نهایی کلمه e(final)w به صورت زیر محاسبه میشود:
+
+
+
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+
+تذکر: مولفههای مجزا در نمایش تعبیهی یادگرفتهشدهی کلمه الزاما قابل تفسیر نیستند.
+
+
+
+
+
+**60. Comparing words**
+
+
+مقایسهی کلمات
+
+
+
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+
+شباهت کسینوسی - شباهت کسینوسی بین کلمات w1 و w2 به صورت زیر بیان میشود:
+
+
+
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+
+نکته: θ زاویهٔ بین کلمات w1 و w2 است.
+
+
+
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+
+t-SNE ― t-SNE (نمایش تعبیهی همسایهی تصادفی توزیعشده توسط توزیع t) روشی است که هدف آن کاهش تعبیههای ابعاد بالا به فضایی با ابعاد پایینتر است. این روش در تصویرسازی بردارهای کلمه در فضای 2 بعدی کاربرد فراوانی دارد.
+
+
+
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+
+[ادبیات، هنر، کتاب، فرهنگ، شعر، خواندن، دانش، مفرح، دوست‌داشتنی، دوران کودکی، مهربان، خرس تدی، نرم، آغوش، بامزه، ناز]
+
+
+
+
+
+**65. Language model**
+
+
+مدل زبانی
+
+
+
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+
+نمای کلی ـــ هدف مدل زبان تخمین احتمال جملهی P(y) است.
+
+
+
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting the number of times it appears in the training data.**
+
+
+مدل ان‌گرام ــ این مدل یک رویکرد ساده با هدف اندازه‌گیری احتمال ظهور یک عبارت در یک پیکره است که با شمارش دفعات تکرار آن در داده‌های آموزشی محاسبه می‌شود.
+
+
+
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+
+سرگشتگی ـــ مدل‌های زبانی معمولاً با معیار سرگشتگی، که با PP هم نمایش داده می‌شود، سنجیده می‌شوند، که می‌توان آن را معکوس احتمال مجموعه داده دانست که با تعداد کلمات T نرمال‌سازی شده است. هر چه سرگشتگی کمتر باشد بهتر است و به صورت زیر تعریف می‌شود:
+
+
+
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+
+نکته: PP عموما در t-SNE کاربرد دارد.
+
+
+
+
+
+**70. Machine translation**
+
+
+ترجمه ماشینی
+
+
+
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+
+نمای کلی ― مدل ترجمهی ماشینی مشابه مدل زبانی است با این تفاوت که یک شبکهی رمزنگار قبل از آن قرار گرفته است. به همین دلیل، گاهی اوقات به آن مدل زبان شرطی میگویند. هدف آن یافتن جمله y است بطوری که:
+
+
+
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+
+جستجوی پرتو ― یک الگوریتم جستجوی اکتشافی است که در ترجمه‌ی ماشینی و بازشناسایی گفتار برای یافتن محتمل‌ترین جمله‌ی y با توجه به ورودی مفروض x به کار برده می‌شود.
+
+
+
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+
+[گام 1: یافتن B کلمه‌ی محتمل برتر y<1>، گام 2: محاسبه‌ی احتمالات شرطی y<k>|x,y<1>,...,y<k−1>، گام 3: نگه‌داشتن B ترکیب برتر x,y<1>,...,y<k>، خاتمه‌ی فرآیند با کلمه‌ی توقف]
+
+
+
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+
+نکته: اگر پهنای پرتو 1 باشد، آنگاه با جستوجوی حریصانهٔ ساده برابر خواهد بود.
+
+
+
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory use. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+
+پهنای پرتو ـــ پهنای پرتوی B پارامتری برای جستجوی پرتو است. مقادیر بزرگ B به نتیجه‌ی بهتر منتهی می‌شوند اما عملکرد آهسته‌تری دارند و حافظه‌ی بیشتری مصرف می‌کنند. مقادیر کوچک B به نتایج بدتر منتهی می‌شوند اما بار محاسباتی پایین‌تری دارند. مقدار استاندارد B حدود 10 است.
+
+
+
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+
+نرمالسازی طول ― برای بهبود ثبات عددی، جستجوی پرتو معمولا با تابع هدف نرمالشدهی زیر اعمال میشود، که اغلب اوقات هدف درستنمایی لگاریتمی نرمالشده نامیده میشود و بهصورت زیر تعریف میشود:
+
+
+
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+
+تذکر: پارامتر α را میتوان تعدیلکننده نامید و مقدارش معمولا بین 0.5 و 1 است.
+
+
+
+
+
+**78. Error analysis ― When obtaining a predicted translation ŷ that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+
+تحلیل خطا ― زمانی‌که ترجمه‌ی پیش‌بینی‌شده‌ی ŷ مطلوب نیست، می‌توان با انجام تحلیل خطای زیر بررسی کرد که چرا ترجمه‌ی خوب y* به‌دست نیامده است:
+
+
+
+
+
+**79. [Case, Root cause, Remedies]**
+
+
+[قضیه، ریشهی مشکل، راهحل]
+
+
+
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+
+[جستجوی پرتوی معیوب، RNN معیوب، افزایش پهنای پرتو، امتحان معماریهای مختلف، استفاده از تنظیمکننده، جمعآوری دادههای بیشتر]
+
+
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+
+امتیاز Bleu ― جایگزین ارزشیابی دوزبانه (bleu) میزان خوب بودن ترجمه ماشینی را با محاسبهی امتیاز تشابه برمبنای دقت انگرام اندازهگیری میکند. (این امتیاز) به صورت زیر تعریف میشود:
+
+
+
+
+
+**82. where pn is the bleu score on n-gram only defined as follows:**
+
+
+که pn امتیاز bleu تنها براساس انگرام است و به صورت زیر تعریف میشود:
+
+
+
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+
+تذکر: ممکن است برای پیشگیری از امتیاز اغراق آمیز تصنعیbleu ، برای ترجمههای پیشبینیشدهی کوتاه از جریمه اختصار استفاده شود.
+
+
+
+
+**84. Attention**
+
+
+ژرفنگری
+
+
+
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**
+
+
+مدل ژرفنگری ― این مدل به RNN این امکان را میدهد که به بخشهای خاصی از ورودی که حائز اهمیت هستند توجه نشان دهد که در عمل باعث بهبود عملکرد مدل حاصلشده خواهد شد. اگر α به معنای مقدار توجهی باشد که خروجی y باید به فعالسازی a داشته باشد و c نشاندهندهی زمینه (متن) در زمان t باشد، داریم:
+
+
+
+
+
+**86. with**
+
+
+با
+
+
+
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+
+نکته: امتیازات ژرفنگری عموما در عنوانسازی متنی برای تصویر (image captioning) و ترجمه ماشینی کاربرد دارد.
+
+
+
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+
+یک خرس تدی بامزه در حال خواندن ادبیات فارسی است.
+
+
+
+
+
+**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:**
+
+
+وزن ژرفنگری ― مقدار توجهی که خروجی y باید به فعالسازی a داشته باشد بهوسیلهی α بهدست میآید که بهصورت زیر محاسبه میشود:
+
+
+
+
+
+**90. Remark: computation complexity is quadratic with respect to Tx.**
+
+
+نکته: پیچیدگی محاسباتی به نسبت Tx از نوع درجهی دوم است.
+
+
+
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+
+راهنمای یادگیری عمیق هم اکنون به زبان [فارسی] در دسترس است.
+
+
+
+
+**92. Original authors**
+
+
+نویسندگان اصلی
+
+
+
+
+**93. Translated by X, Y and Z**
+
+
+ترجمه شده توسط X، Y و Z
+
+
+
+
+**94. Reviewed by X, Y and Z**
+
+
+بازبینی شده توسط X، Y و Z
+
+
+
+
+**95. View PDF version on GitHub**
+
+
+نسخه پیدیاف را در گیتهاب ببینید
+
+
+
+
+**96. By X and Y**
+
+
+توسط X و Y
+
+
+
diff --git a/fr/cs-221-logic-models.md b/fr/cs-221-logic-models.md
new file mode 100644
index 000000000..aa03a9b9a
--- /dev/null
+++ b/fr/cs-221-logic-models.md
@@ -0,0 +1,462 @@
+**Logic-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-logic-models)
+
+
+
+**1. Logic-based models with propositional and first-order logic**
+
+⟶ Modèles basés sur la logique : logique propositionnelle et calcul des prédicats du premier ordre
+
+
+
+
+**2. Basics**
+
+⟶ Bases
+
+
+
+
+**3. Syntax of propositional logic ― By noting f,g formulas, and ¬,∧,∨,→,↔ connectives, we can write the following logical expressions:**
+
+⟶ Syntaxe de la logique propositionnelle - En notant f, g des formules et ¬,∧,∨,→,↔ des opérateurs, on peut écrire les expressions logiques suivantes :
+
+
+
+
+**4. [Name, Symbol, Meaning, Illustration]**
+
+⟶ [Nom, Symbole, Signification, Illustration]
+
+
+
+
+**5. [Affirmation, Negation, Conjunction, Disjunction, Implication, Biconditional]**
+
+⟶ [Affirmation, Négation, Conjonction, Disjonction, Implication, Biconditionnel]
+
+
+
+
+**6. [not f, f and g, f or g, if f then g, f, that is to say g]**
+
+⟶ [non f, f et g, f ou g, si f alors g, f, c'est à dire g]
+
+
+
+
+**7. Remark: formulas can be built up recursively out of these connectives.**
+
+⟶ Remarque : n'importe quelle formule peut être construite de manière récursive à partir de ces opérateurs.
+
+
+
+
+**8. Model ― A model w denotes an assignment of binary weights to propositional symbols.**
+
+⟶ Modèle - Un modèle w dénote une combinaison de valeurs binaires liées à des symboles propositionnels.
+
+
+
+
+**9. Example: the set of truth values w={A:0,B:1,C:0} is one possible model to the propositional symbols A, B and C.**
+
+⟶ Exemple : l'ensemble de valeurs de vérité w={A:0,B:1,C:0} est un modèle possible pour les symboles propositionnels A, B et C.
+
+
+
+
+**10. Interpretation function ― The interpretation function I(f,w) outputs whether model w satisfies formula f:**
+
+⟶ Interprétation - L'interprétation I(f,w) nous renseigne si le modèle w satisfait la formule f :
+
+
+
+
+**11. Set of models ― M(f) denotes the set of models w that satisfy formula f. Mathematically speaking, we define it as follows:**
+
+⟶ Ensemble de modèles - M(f) dénote l'ensemble des modèles w qui satisfont la formule f. Sa définition mathématique est donnée par :
+
+
+
+
+**12. Knowledge base**
+
+⟶ Base de connaissance
+
+
+
+
+**13. Definition ― The knowledge base KB is the conjunction of all formulas that have been considered so far. The set of models of the knowledge base is the intersection of the set of models that satisfy each formula. In other words:**
+
+⟶ Définition - La base de connaissance KB est la conjonction de toutes les formules considérées jusqu'à présent. L'ensemble des modèles de la base de connaissance est l'intersection de l'ensemble des modèles satisfaisant chaque formule. En d'autres termes :
+
+
+
+
+**14. Probabilistic interpretation ― The probability that query f is evaluated to 1 can be seen as the proportion of models w of the knowledge base KB that satisfy f, i.e.:**
+
+⟶ Interprétation en termes de probabilités - La probabilité que la requête f soit évaluée à 1 peut être vue comme la proportion des modèles w de la base de connaissance KB qui satisfont f, i.e. :
+
+
+
+
+**15. Satisfiability ― The knowledge base KB is said to be satisfiable if at least one model w satisfies all its constraints. In other words:**
+
+⟶ Satisfaisabilité - La base de connaissance KB est dite satisfaisable si au moins un modèle w satisfait toutes ses contraintes. En d'autres termes :
+
+
+
+
+**16. satisfiable**
+
+⟶ satisfaisable
+
+
+
+
+**17. Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge base.**
+
+⟶ Remarque : M(KB) dénote l'ensemble des modèles compatibles avec toutes les contraintes de la base de connaissance.
+
+
+
+
+**18. Relation between formulas and knowledge base - We define the following properties between the knowledge base KB and a new formula f:**
+
+⟶ Relation entre formules et base de connaissance - On définit les propriétés suivantes entre la base de connaissance KB et une nouvelle formule f :
+
+
+
+
+**19. [Name, Mathematical formulation, Illustration, Notes]**
+
+⟶ [Nom, Formulation mathématique, Illustration, Notes]
+
+
+
+
+**20. [KB entails f, KB contradicts f, f contingent to KB]**
+
+⟶ [KB déduit f, KB contredit f, f est contingent à KB]
+
+
+
+
+**21. [f does not bring any new information, Also written KB⊨f, No model satisfies the constraints after adding f, Equivalent to KB⊨¬f, f does not contradict KB, f adds a non-trivial amount of information to KB]**
+
+⟶ [f n'apporte aucune nouvelle information, Aussi écrit KB⊨f, Aucun modèle ne satisfait les contraintes après l'ajout de f, Équivalent à KB⊨¬f, f ne contredit pas KB, f ajoute une quantité non-triviale d'information à KB]
+
+
+
+
+**22. Model checking ― A model checking algorithm takes as input a knowledge base KB and outputs whether it is satisfiable or not.**
+
+⟶ Vérification de modèles - Un algorithme de vérification de modèles (model checking en anglais) prend comme argument une base de connaissance KB et nous renseigne si celle-ci est satisfaisable ou pas.
+
+
+
+
+**23. Remark: popular model checking algorithms include DPLL and WalkSat.**
+
+⟶ Remarque : DPLL et WalkSat sont des exemples populaires d'algorithmes de vérification de modèles.
+
+
+
+
+**24. Inference rule ― An inference rule of premises f1,...,fk and conclusion g is written:**
+
+⟶ Règle d'inférence - Une règle d'inférence de prémisses f1,...,fk et de conclusion g s'écrit :
+
+
+
+
+**25. Forward inference algorithm ― From a set of inference rules Rules, this algorithm goes through all possible f1,...,fk and adds g to the knowledge base KB if a matching rule exists. This process is repeated until no more additions can be made to KB.**
+
+⟶ Algorithme de chaînage avant - Partant d'un ensemble de règles d'inférence Rules, l'algorithme de chaînage avant (en anglais forward inference algorithm) parcourt tous les f1,...,fk et ajoute g à la base de connaissance KB si une règle parvient à une telle conclusion. Cette démarche est répétée jusqu'à ce qu'aucun autre ajout ne puisse être fait à KB.
+
+
+
+
+**26. Derivation ― We say that KB derives f (written KB⊢f) with rules Rules if f already is in KB or gets added during the forward inference algorithm using the set of rules Rules.**
+
+⟶ Dérivation - On dit que KB dérive f (noté KB⊢f) par le biais des règles Rules si f est déjà dans KB ou si elle y est ajoutée pendant l'application du chaînage avant utilisant les règles Rules.
+
+
+
+
+**27. Properties of inference rules ― A set of inference rules Rules can have the following properties:**
+
+⟶ Propriétés des règles d'inférence - Un ensemble de règles d'inférence Rules peut avoir les propriétés suivantes :
+
+
+
+
+**28. [Name, Mathematical formulation, Notes]**
+
+⟶ [Nom, Formulation mathématique, Notes]
+
+
+
+
+**29. [Soundness, Completeness]**
+
+⟶ [Validité, Complétude]
+
+
+
+
+**30. [Inferred formulas are entailed by KB, Can be checked one rule at a time, "Nothing but the truth", Formulas entailed by KB are either already in the knowledge base or inferred from it, "The whole truth"]**
+
+⟶ [Les formules inférées sont déduites par KB, Peut être vérifiée une règle à la fois, "Rien que la vérité", Les formules déduites par KB sont soit déjà dans la base de connaissance, soit inférées de celle-ci, "La vérité dans sa totalité"]
+
+
+
+
+**31. Propositional logic**
+
+⟶ Logique propositionnelle
+
+
+
+
+**32. In this section, we will go through logic-based models that use logical formulas and inference rules. The idea here is to balance expressivity and computational efficiency.**
+
+⟶ Dans cette section, nous allons parcourir les modèles logiques utilisant des formules logiques et des règles d'inférence. L'idée est de trouver le juste milieu entre expressivité et efficacité.
+
+
+
+
+**33. Horn clause ― By noting p1,...,pk and q propositional symbols, a Horn clause has the form:**
+
+⟶ Clause de Horn - En notant p1,...,pk et q des symboles propositionnels, une clause de Horn s'écrit :
+
+
+
+
+**34. Remark: when q=false, it is called a "goal clause", otherwise we denote it as a "definite clause".**
+
+⟶ Remarque : quand q=false, cette clause de Horn est "négative", autrement elle est appelée "stricte".
+
+
+
+
+**35. Modus ponens ― For propositional symbols f1,...,fk and p, the modus ponens rule is written:**
+
+⟶ Modus ponens - Sur les symboles propositionnels f1,...,fk et p, la règle de modus ponens est écrite :
+
+
+
+
+**36. Remark: it takes linear time to apply this rule, as each application generates a clause that contains a single propositional symbol.**
+
+⟶ Remarque : l'application de cette règle se fait en temps linéaire, puisque chaque exécution génère une clause contenant un symbole propositionnel.
+
+
+
+
+**37. Completeness ― Modus ponens is complete with respect to Horn clauses if we suppose that KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus ponens will then derive p.**
+
+⟶ Complétude - Modus ponens est complet pour les clauses de Horn : si l'on suppose que KB contient uniquement des clauses de Horn et que p est un symbole propositionnel déduit de KB, alors l'application de modus ponens dérivera p.
+
+
+
+
+**38. Conjunctive normal form ― A conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is a disjunction of atomic formulas.**
+
+⟶ Forme normale conjonctive - La forme normale conjonctive (en anglais conjunctive normal form ou CNF) d'une formule est une conjonction de clauses, chacune d'entre elles étant une disjonction de formules atomiques.
+
+
+
+
+**39. Remark: in other words, CNFs are ∧ of ∨.**
+
+⟶ Remarque : en d'autres termes, les CNFs sont des ∧ de ∨.
+
+
+
+
+**40. Equivalent representation ― Every formula in propositional logic can be written into an equivalent CNF formula. The table below presents general conversion properties:**
+
+⟶ Représentation équivalente - Chaque formule en logique propositionnelle peut être écrite de manière équivalente sous la forme d'une formule CNF. Le tableau ci-dessous présente les propriétés principales permettant une telle conversion :
+
+
+
+
+**41. [Rule name, Initial, Converted, Eliminate, Distribute, over]**
+
+⟶ [Nom de la règle, Début, Résultat, Élimine, Distribue, sur]
+
+
+
+
+**42. Resolution rule ― For propositional symbols f1,...,fn, and g1,...,gm as well as p, the resolution rule is written:**
+
+⟶ Règle de résolution - Pour des symboles propositionnels f1,...,fn, et g1,...,gm ainsi que p, la règle de résolution s'écrit :
+
+
+
+
+**43. Remark: it can take exponential time to apply this rule, as each application generates a clause that has a subset of the propositional symbols.**
+
+⟶ Remarque : l'application de cette règle peut prendre un temps exponentiel, vu que chaque itération génère une clause constituée d'une partie des symboles propositionnels.
+
+
+
+
+**44. [Resolution-based inference ― The resolution-based inference algorithm follows the following steps:, Step 1: Convert all formulas into CNF, Step 2: Repeatedly apply resolution rule, Step 3: Return unsatisfiable if and only if False, is derived]**
+
+⟶ [Inférence basée sur la règle de résolution - L'algorithme d'inférence basée sur la règle de résolution se déroule en plusieurs étapes :, Étape 1 : Conversion de toutes les formules vers leur forme CNF, Étape 2 : Application répétée de la règle de résolution, Étape 3 : Renvoyer "non satisfaisable" si et seulement si False est dérivé]
+
+
+
+
+**45. First-order logic**
+
+⟶ Calcul des prédicats du premier ordre
+
+
+
+
+**46. The idea here is to use variables to yield more compact knowledge representations.**
+
+⟶ L'idée ici est d'utiliser des variables et ainsi permettre une représentation des connaissances plus compacte.
+
+
+
+
+**47. [Model ― A model w in first-order logic maps:, constant symbols to objects, predicate symbols to tuple of objects]**
+
+⟶ [Modèle - Un modèle w en calcul des prédicats du premier ordre lie :, des symboles constants à des objets, des prédicats à des n-uplets d'objets]
+
+
+
+
+**48. Horn clause ― By noting x1,...,xn variables and a1,...,ak,b atomic formulas, the first-order logic version of a Horn clause has the form:**
+
+⟶ Clause de Horn - En notant x1,...,xn des variables et a1,...,ak,b des formules atomiques, une clause de Horn pour le calcul des prédicats du premier ordre a la forme :
+
+
+
+
+**49. Substitution ― A substitution θ maps variables to terms and Subst[θ,f] denotes the result of substitution θ on f.**
+
+⟶ Substitution - Une substitution θ lie les variables aux termes et Subst[θ,f] désigne le résultat de la substitution θ sur f.
+
+
+
+
+**50. Unification ― Unification takes two formulas f and g and returns the most general substitution θ that makes them equal:**
+
+⟶ Unification - Une unification prend deux formules f et g et renvoie la substitution θ la plus générale les rendant égales :
+
+
+
+
+**51. such that**
+
+⟶ tel que
+
+
+
+
+**52. Note: Unify[f,g] returns Fail if no such θ exists.**
+
+⟶ Note : Unify[f,g] renvoie Fail si un tel θ n'existe pas.
+
+
+
+
+**53. Modus ponens ― By noting x1,...,xn variables, a1,...,ak and a′1,...,a′k atomic formulas and by calling θ=Unify(a′1∧...∧a′k,a1∧...∧ak) the first-order logic version of modus ponens can be written:**
+
+⟶ Modus ponens - En notant x1,...,xn des variables, a1,...,ak et a′1,...,a′k des formules atomiques et en posant θ=Unify(a′1∧...∧a′k,a1∧...∧ak), modus ponens pour le calcul des prédicats du premier ordre s'écrit :
+
+
+
+
+**54. Completeness ― Modus ponens is complete for first-order logic with only Horn clauses.**
+
+⟶ Complétude - Modus ponens est complet pour le calcul des prédicats du premier ordre lorsqu'il agit uniquement sur les clauses de Horn.
+
+
+
+
+**55. Resolution rule ― By noting f1,...,fn, g1,...,gm, p, q formulas and by calling θ=Unify(p,q), the first-order logic version of the resolution rule can be written:**
+
+⟶ Règle de résolution - En notant f1,...,fn, g1,...,gm, p, q des formules et en posant θ=Unify(p,q), la règle de résolution pour le calcul des prédicats du premier ordre s'écrit :
+
+
+
+
+**56. [Semi-decidability ― First-order logic, even restricted to only Horn clauses, is semi-decidable., if KB⊨f, forward inference on complete inference rules will prove f in finite time, if KB⊭f, no algorithm can show this in finite time]**
+
+⟶ [Semi-décidabilité - Le calcul des prédicats du premier ordre, même restreint aux clauses de Horn, n'est que semi-décidable., si KB⊨f, l'algorithme de chaînage avant sur des règles d'inférence complètes prouvera f en temps fini, si KB⊭f, aucun algorithme ne peut le prouver en temps fini]
+
+
+
+
+**57. [Basics, Notations, Model, Interpretation function, Set of models]**
+
+⟶ [Bases, Notations, Modèle, Interprétation, Ensemble de modèles]
+
+
+
+
+**58. [Knowledge base, Definition, Probabilistic interpretation, Satisfiability, Relationship with formulas, Forward inference, Rule properties]**
+
+⟶ [Base de connaissance, Définition, Interprétation en termes de probabilité, Satisfaisabilité, Lien avec les formules, Chaînage en avant, Propriétés des règles]
+
+
+
+
+**59. [Propositional logic, Clauses, Modus ponens, Conjunctive normal form, Representation equivalence, Resolution]**
+
+⟶ [Logique propositionnelle, Clauses, Modus ponens, Forme normale conjonctive, Représentation équivalente, Résolution]
+
+
+
+
+**60. [First-order logic, Substitution, Unification, Resolution rule, Modus ponens, Resolution, Semi-decidability]**
+
+⟶ [Calcul des prédicats du premier ordre, Substitution, Unification, Règle de résolution, Modus ponens, Résolution, Semi-décidabilité]
+
+
+
+
+**61. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub
+
+
+
+
+**62. Original authors**
+
+⟶ Auteurs originaux.
+
+
+
+
+**63. Translated by X, Y and Z**
+
+⟶ Traduit par X, Y et Z.
+
+
+
+
+**64. Reviewed by X, Y and Z**
+
+⟶ Revu par X, Y et Z.
+
+
+
+
+**65. By X and Y**
+
+⟶ Par X et Y.
+
+
+
+
+**66. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'intelligence artificielle sont maintenant disponibles en français.
diff --git a/fr/cs-221-reflex-models.md b/fr/cs-221-reflex-models.md
new file mode 100644
index 000000000..7a7a489e1
--- /dev/null
+++ b/fr/cs-221-reflex-models.md
@@ -0,0 +1,539 @@
+**Reflex-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-reflex-models)
+
+
+
+**1. Reflex-based models with Machine Learning**
+
+⟶ Modèles basés sur le réflexe : apprentissage automatique
+
+
+
+
+**2. Linear predictors**
+
+⟶ Prédicteurs linéaires
+
+
+
+
+**3. In this section, we will go through reflex-based models that can improve with experience, by going through samples that have input-output pairs.**
+
+⟶ Dans cette section, nous allons explorer les modèles basés sur le réflexe qui peuvent s'améliorer avec l'expérience, en s'appuyant sur des exemples ayant une correspondance entrée-sortie.
+
+
+
+
+**4. Feature vector ― The feature vector of an input x is noted ϕ(x) and is such that:**
+
+⟶ Vecteur caractéristique - Le vecteur caractéristique (en anglais feature vector) d'une entrée x est noté ϕ(x) et se décompose en :
+
+
+
+
+**5. Score ― The score s(x,w) of an example (ϕ(x),y)∈Rd×R associated to a linear model of weights w∈Rd is given by the inner product:**
+
+⟶ Score - Le score s(x,w) d'un exemple (ϕ(x),y)∈Rd×R associé à un modèle linéaire de paramètres w∈Rd est donné par le produit scalaire :
+
+
+
+
+**6. Classification**
+
+⟶ Classification
+
+
+
+
+**7. Linear classifier ― Given a weight vector w∈Rd and a feature vector ϕ(x)∈Rd, the binary linear classifier fw is given by:**
+
+⟶ Classifieur linéaire - Étant donnés un vecteur de paramètres w∈Rd et un vecteur caractéristique ϕ(x)∈Rd, le classifieur linéaire binaire est donné par :
+
+
+
+
+**8. if**
+
+⟶ si
+
+
+
+
+**9. Margin ― The margin m(x,y,w)∈R of an example (ϕ(x),y)∈Rd×{−1,+1} associated to a linear model of weights w∈Rd quantifies the confidence of the prediction: larger values are better. It is given by:**
+
+⟶ Marge - La marge (en anglais margin) m(x,y,w)∈R d'un exemple (ϕ(x),y)∈Rd×{−1,+1} associée à un modèle linéaire de paramètre w∈Rd quantifie la confiance associée à une prédiction : plus cette valeur est grande, mieux c'est. Cette quantité est donnée par :
+
+
+
+
+**10. Regression**
+
+⟶ Régression
+
+
+
+
+**11. Linear regression ― Given a weight vector w∈Rd and a feature vector ϕ(x)∈Rd, the output of a linear regression of weights w denoted as fw is given by:**
+
+⟶ Régression linéaire - Étant donnés un vecteur de paramètres w∈Rd et un vecteur caractéristique ϕ(x)∈Rd, le résultat d'une régression linéaire de paramètre w, notée fw, est donné par :
+
+
+
+
+**12. Residual ― The residual res(x,y,w)∈R is defined as being the amount by which the prediction fw(x) overshoots the target y:**
+
+⟶ Résidu - Le résidu res(x,y,w)∈R est défini comme étant la différence entre la prédiction fw(x) et la vraie valeur y :
+
+
+
+
+**13. Loss minimization**
+
+⟶ Minimisation de la fonction objectif
+
+
+
+
+**14. Loss function ― A loss function Loss(x,y,w) quantifies how unhappy we are with the weights w of the model in the prediction task of output y from input x. It is a quantity we want to minimize during the training process.**
+
+⟶ Fonction objectif - Une fonction objectif (en anglais loss function) Loss(x,y,w) traduit notre niveau d'insatisfaction avec les paramètres w du modèle dans la tâche de prédiction de la sortie y à partir de l'entrée x. C'est une quantité que l'on souhaite minimiser pendant la phase d'entraînement.
+
+
+
+
+**15. Classification case - The classification of a sample x of true label y∈{−1,+1} with a linear model of weights w can be done with the predictor fw(x)≜sign(s(x,w)). In this situation, a metric of interest quantifying the quality of the classification is given by the margin m(x,y,w), and can be used with the following loss functions:**
+
+⟶ Cas de la classification - La classification d'un exemple x de vraie classe y∈{−1,+1} peut être faite par le biais d'un modèle linéaire de paramètre w à l'aide du prédicteur fw(x)≜sign(s(x,w)). La qualité de cette prédiction peut alors être évaluée au travers de la marge m(x,y,w) intervenant dans les fonctions objectif suivantes :
+
+
+
+
+**16. [Name, Illustration, Zero-one loss, Hinge loss, Logistic loss]**
+
+⟶ [Nom, Illustration, Fonction objectif zéro-un, Fonction objectif de Hinge, Fonction objectif logistique]
+
+
+
+
+**17. Regression case - The prediction of a sample x of true label y∈R with a linear model of weights w can be done with the predictor fw(x)≜s(x,w). In this situation, a metric of interest quantifying the quality of the regression is given by the residual res(x,y,w) and can be used with the following loss functions:**
+
+⟶ Cas de la régression - La prédiction de la valeur y∈R associée à l'exemple x peut être faite par le biais d'un modèle linéaire de paramètre w à l'aide du prédicteur fw(x)≜s(x,w). La qualité de cette prédiction peut alors être évaluée au travers du résidu res(x,y,w) intervenant dans les fonctions objectif suivantes :
+
+
+
+
+**18. [Name, Squared loss, Absolute deviation loss, Illustration]**
+
+⟶ [Nom, Erreur quadratique, Erreur absolue, Illustration]
+
+
+
+
+**19. Loss minimization framework ― In order to train a model, we want to minimize the training loss, which is defined as follows:**
+
+⟶ Processus de minimisation de la fonction objectif - Lors de l'entraînement d'un modèle, on souhaite minimiser la valeur de la fonction objectif évaluée sur l'ensemble d'entraînement :
+
+
+
+
+**20. Non-linear predictors**
+
+⟶ Prédicteurs non linéaires
+
+
+
+
+**21. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶ k plus proches voisins - L'algorithme des k plus proches voisins (en anglais k-nearest neighbors ou k-NN) est une approche non paramétrique où la réponse associée à un exemple est déterminée par la nature de ses k plus proches voisins de l'ensemble d'entraînement. Cette démarche peut être utilisée pour la classification et la régression.
+
+
+
+
+**22. Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶ Remarque : plus le paramètre k est grand, plus le biais est élevé. À l'inverse, la variance devient plus élevée lorsque l'on réduit la valeur k.
+
+
+
+
+**23. Neural networks ― Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks. The vocabulary around neural networks architectures is described in the figure below:**
+
+⟶ Réseaux de neurones - Les réseaux de neurones (en anglais neural networks) constituent un type de modèle basé sur des couches (en anglais layers). Parmi les types de réseaux populaires, on peut compter les réseaux de neurones convolutionnels et récurrents (abrégés respectivement en CNN et RNN en anglais). Une partie du vocabulaire associé aux réseaux de neurones est détaillée dans la figure ci-dessous :
+
+
+
+
+**24. [Input layer, Hidden layer, Output layer]**
+
+⟶ [Couche d'entrée, Couche cachée, Couche de sortie]
+
+
+
+
+**25. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶ En notant i la i-ème couche du réseau et j son j-ième neurone, on a :
+
+
+
+
+**26. where we note w, b, x, z the weight, bias, input and non-activated output of the neuron respectively.**
+
+⟶ où l'on note w, b, x, z le coefficient, le biais, l'entrée et la sortie non activée du neurone respectivement.
+
+
+
+
+**27. For a more detailed overview of the concepts above, check out the Supervised Learning cheatsheets!**
+
+⟶ Pour un aperçu plus détaillé des concepts ci-dessus, rendez-vous sur le pense-bête d'apprentissage supervisé !
+
+
+
+
+**28. Stochastic gradient descent**
+
+⟶ Algorithme du gradient stochastique
+
+
+
+
+**29. Gradient descent ― By noting η∈R the learning rate (also called step size), the update rule for gradient descent is expressed with the learning rate and the loss function Loss(x,y,w) as follows:**
+
+⟶ Descente de gradient - En notant η∈R le taux d'apprentissage (en anglais learning rate ou step size), la règle de mise à jour des coefficients pour cet algorithme utilise la fonction objectif Loss(x,y,w) de la manière suivante :
+
+
+
+
+**30. Stochastic updates ― Stochastic gradient descent (SGD) updates the parameters of the model one training example (ϕ(x),y)∈Dtrain at a time. This method leads to sometimes noisy, but fast updates.**
+
+⟶ Mises à jour stochastiques - L'algorithme du gradient stochastique (en anglais stochastic gradient descent ou SGD) met à jour les paramètres du modèle en parcourant les exemples (ϕ(x),y)∈Dtrain de l'ensemble d'entraînement un à un. Cette méthode engendre des mises à jour rapides à calculer mais qui manquent parfois de robustesse.
+
+
+
+
+**31. Batch updates ― Batch gradient descent (BGD) updates the parameters of the model one batch of examples (e.g. the entire training set) at a time. This method computes stable update directions, at a greater computational cost.**
+
+⟶ Mises à jour par lot - L'algorithme du gradient par lot (en anglais batch gradient descent ou BGD) met à jour les paramètres du modèle en utilisant des lots entiers d'exemples (e.g. la totalité de l'ensemble d'entraînement) à la fois. Cette méthode calcule des directions de mise à jour des coefficients plus stables au prix d'un plus grand nombre de calculs.
+
+
+
+
+**32. Fine-tuning models**
+
+⟶ Peaufinage de modèle
+
+
+
+
+**33. Hypothesis class ― A hypothesis class F is the set of possible predictors with a fixed ϕ(x) and varying w:**
+
+⟶ Classe d'hypothèses - Une classe d'hypothèses F est l'ensemble des prédicteurs candidats ayant un ϕ(x) fixé et dont le paramètre w peut varier :
+
+
+
+
+**34. Logistic function ― The logistic function σ, also called the sigmoid function, is defined as:**
+
+⟶ Fonction logistique - La fonction logistique σ, aussi appelée en anglais sigmoid function, est définie par :
+
+
+
+
+**35. Remark: we have σ′(z)=σ(z)(1−σ(z)).**
+
+⟶ Remarque : la dérivée de cette fonction s'écrit σ′(z)=σ(z)(1−σ(z)).
+
+
+
+
+**36. Backpropagation ― The forward pass is done through fi, which is the value for the subexpression rooted at i, while the backward pass is done through gi=∂out/∂fi and represents how fi influences the output.**
+
+⟶ Rétropropagation du gradient (en anglais backpropagation) - La propagation avant (en anglais forward pass) est effectuée via fi, valeur correspondant à l'expression appliquée à l'étape i. La propagation de l'erreur vers l'arrière (en anglais backward pass) se fait via gi=∂out/∂fi et décrit la manière dont fi agit sur la sortie du réseau.
+
+
+
+
+**37. Approximation and estimation error ― The approximation error ϵapprox represents how far the entire hypothesis class F is from the target predictor g∗, while the estimation error ϵest quantifies how good the predictor ^f is with respect to the best predictor f∗ of the hypothesis class F.**
+
+⟶ Erreur d'approximation et d'estimation - L'erreur d'approximation ϵapprox représente la distance entre la classe d'hypothèses F et le prédicteur optimal g∗. De son côté, l'erreur d'estimation ϵest quantifie la qualité du prédicteur ^f par rapport au meilleur prédicteur f∗ de la classe d'hypothèses F.
+
+
+
+
+**38. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶ Régularisation - Le but de la régularisation est d'empêcher le modèle de surapprendre (en anglais overfit) les données en s'occupant ainsi des problèmes de variance élevée. La table suivante résume les différents types de régularisation couramment utilisés :
+
+
+
+
+**39. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Réduit les coefficients à 0, Bénéfique pour la sélection de variables, Rapetisse les coefficients, Compromis entre sélection de variables et coefficients de faible magnitude]
+
+
+
+
+**40. Hyperparameters ― Hyperparameters are the properties of the learning algorithm, and include features, regularization parameter λ, number of iterations T, step size η, etc.**
+
+⟶ Hyperparamètres - Les hyperparamètres sont les paramètres de l'algorithme d'apprentissage et incluent parmi d'autres le type de caractéristiques utilisé ainsi que le paramètre de régularisation λ, le nombre d'itérations T, le taux d'apprentissage η.
+
+
+
+
+**41. Sets vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶ Vocabulaire ― Lors de la sélection d'un modèle, on divise les données en 3 différentes parties :
+
+
+
+
+**42. [Training set, Validation set, Testing set]**
+
+⟶ [Données d'entraînement, Données de validation, Données de test]
+
+
+
+
+**43. [Model is trained, Usually 80% of the dataset, Model is assessed, Usually 20% of the dataset, Also called hold-out or development set, Model gives predictions, Unseen data]**
+
+⟶ [Le modèle est entrainé, Constitue normalement 80% du jeu de données, Le modèle est évalué, Constitue normalement 20% du jeu de données, Aussi appelé données de développement (en anglais hold-out ou development set), Le modèle donne ses prédictions, Données jamais observées]
+
+
+
+
+**44. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶ Une fois que le modèle a été choisi, il est entrainé sur le jeu de données entier et testé sur l'ensemble de test (qui n'a jamais été vu). Ces derniers sont représentés dans la figure ci-dessous :
+
+
+
+
+**45. [Dataset, Unseen data, train, validation, test]**
+
+⟶ [Jeu de données, Données inconnues, entrainement, validation, test]
+
+
+
+
+**46. For a more detailed overview of the concepts above, check out the Machine Learning tips and tricks cheatsheets!**
+
+⟶ Pour un aperçu plus détaillé des concepts ci-dessus, rendez-vous sur le pense-bête de petites astuces d'apprentissage automatique !
+
+
+
+
+**47. Unsupervised Learning**
+
+⟶ Apprentissage non supervisé
+
+
+
+
+**48. The class of unsupervised learning methods aims at discovering the structure of the data, which may have rich latent structures.**
+
+⟶ Les méthodes d'apprentissage non supervisé visent à découvrir la structure (parfois riche) des données.
+
+
+
+
+**49. k-means**
+
+⟶ k-moyennes (en anglais k-means)
+
+
+
+
+**50. Clustering ― Given a training set of input points Dtrain, the goal of a clustering algorithm is to assign each point ϕ(xi) to a cluster zi∈{1,...,k}.**
+
+⟶ Partitionnement - Étant donné un ensemble d'entraînement Dtrain, le but d'un algorithme de partitionnement (en anglais clustering) est d'assigner chaque point ϕ(xi) à une partition zi∈{1,...,k}.
+
+
+
+
+**51. Objective function ― The loss function for one of the main clustering algorithms, k-means, is given by:**
+
+⟶ Fonction objectif - La fonction objectif d'un des principaux algorithmes de partitionnement, k-moyennes, est donnée par :
+
+
+
+
+**52. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶ Algorithme - Après avoir aléatoirement initialisé les centroïdes de partitions μ1,μ2,...,μk∈Rn, l'algorithme k-moyennes répète l'étape suivante jusqu'à convergence :
+
+
+
+
+**53. and**
+
+⟶ et
+
+
+
+
+**54. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶ [Initialisation des moyennes, Assignation de partition, Mise à jour des moyennes, Convergence]
+
+
+
+
+**55. Principal Component Analysis**
+
+⟶ Analyse en composantes principales
+
+
+
+
+**56. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ Valeur propre, vecteur propre - Étant donnée une matrice A∈Rn×n, λ est dite être une valeur propre de A s'il existe un vecteur z∈Rn∖{0}, appelé vecteur propre, tel que :
+
+
+
+
+**57. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ Théorème spectral ― Soit A∈Rn×n. Si A est symétrique, alors A est diagonalisable par une matrice réelle orthogonale U∈Rn×n. En notant Λ=diag(λ1,...,λn), on a :
+
+
+
+
+**58. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+⟶ Remarque : le vecteur propre associé à la plus grande valeur propre est appelé le vecteur propre principal de la matrice A.
+
+
+
+
+**59. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
+
+⟶ Algorithme ― La procédure d'analyse des composantes principales (en anglais PCA - Principal Component Analysis) est une technique de réduction de dimension qui projette les données sur k dimensions en maximisant la variance des données de la manière suivante :
+
+
+
+
+**60. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶ Étape 1: Normaliser les données pour avoir une moyenne de 0 et un écart-type de 1.
+
+
+
+
+**61. [where, and]**
+
+⟶ [où, et]
+
+
+
+
+**62. [Step 2: Compute Σ=(1/m)∑_{i=1}^{m} ϕ(xi)ϕ(xi)^T∈Rn×n, which is symmetric with real eigenvalues., Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues., Step 4: Project the data on spanR(u1,...,uk).]**
+
+⟶ [Étape 2: Calculer Σ=(1/m)∑_{i=1}^{m} ϕ(xi)ϕ(xi)^T∈Rn×n, qui est symétrique avec des valeurs propres réelles., Étape 3: Calculer u1,...,uk∈Rn les k vecteurs propres principaux orthogonaux de Σ, i.e. les vecteurs propres orthogonaux des k valeurs propres les plus grandes., Étape 4: Projeter les données sur spanR(u1,...,uk).]
+
+
+
+
+**63. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶ Cette procédure maximise la variance sur tous les espaces à k dimensions.
+
+
+
+
+**64. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶ [Données dans l'espace initial, Trouver les composantes principales, Données dans l'espace des composantes principales]
+
+
+
+
+**65. For a more detailed overview of the concepts above, check out the Unsupervised Learning cheatsheets!**
+
+⟶ Pour un aperçu plus détaillé des concepts ci-dessus, rendez-vous sur le pense-bête d'apprentissage non supervisé !
+
+
+
+
+**66. [Linear predictors, Feature vector, Linear classifier/regression, Margin]**
+
+⟶ [Prédicteurs linéaires, Vecteur caractéristique, Classification/régression linéaire, Marge]
+
+
+
+
+**67. [Loss minimization, Loss function, Framework]**
+
+⟶ [Minimisation de la fonction objectif, Fonction objectif, Cadre]
+
+
+
+
+**68. [Non-linear predictors, k-nearest neighbors, Neural networks]**
+
+⟶ [Prédicteurs non linéaires, k plus proches voisins, Réseaux de neurones]
+
+
+
+
+**69. [Stochastic gradient descent, Gradient, Stochastic updates, Batch updates]**
+
+⟶ [Algorithme du gradient stochastique, Gradient, Mises à jour stochastiques, Mises à jour par lots]
+
+
+
+
+**70. [Fine-tuning models, Hypothesis class, Backpropagation, Regularization, Sets vocabulary]**
+
+⟶ [Peaufiner les modèles, Classe d'hypothèses, Rétropropagation du gradient, Régularisation, Vocabulaire]
+
+
+
+
+**71. [Unsupervised Learning, k-means, Principal components analysis]**
+
+⟶ [Apprentissage non supervisé, k-moyennes, Analyse des composantes principales]
+
+
+
+
+**72. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub
+
+
+
+
+**73. Original authors**
+
+⟶ Auteurs d'origine
+
+
+
+
+**74. Translated by X, Y and Z**
+
+⟶ Traduit par X, Y et Z
+
+
+
+
+**75. Reviewed by X, Y and Z**
+
+⟶ Revu par X, Y et Z
+
+
+
+
+**76. By X and Y**
+
+⟶ De X et Y
+
+
+
+
+**77. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'intelligence artificielle sont maintenant disponibles en français.
diff --git a/fr/cs-221-states-models.md b/fr/cs-221-states-models.md
new file mode 100644
index 000000000..20be6ebb7
--- /dev/null
+++ b/fr/cs-221-states-models.md
@@ -0,0 +1,980 @@
+**States-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-states-models)
+
+
+
+**1. States-based models with search optimization and MDP**
+
+⟶ Modèles basés sur les états : optimisation de parcours et MDPs
+
+
+
+
+**2. Search optimization**
+
+⟶ Optimisation de parcours
+
+
+
+
+**3. In this section, we assume that by accomplishing action a from state s, we deterministically arrive in state Succ(s,a). The goal here is to determine a sequence of actions (a1,a2,a3,a4,...) that starts from an initial state and leads to an end state. In order to solve this kind of problem, our objective will be to find the minimum cost path by using states-based models.**
+
+⟶ Dans cette section, nous supposons qu'en effectuant une action a à partir d'un état s, on arrive de manière déterministe à l'état Succ(s,a). Le but de cette étude est de déterminer une séquence d'actions (a1,a2,a3,a4,...) démarrant d'un état initial et aboutissant à un état final. Pour y parvenir, notre objectif est de minimiser le coût associé à ces actions à l'aide de modèles basés sur les états (state-based model en anglais).
+
+
+
+
+**4. Tree search**
+
+⟶ Parcours d'arbre
+
+
+
+
+**5. This category of states-based algorithms explores all possible states and actions. It is quite memory efficient, and is suitable for huge state spaces but the runtime can become exponential in the worst cases.**
+
+⟶ Cette catégorie d'algorithmes explore tous les états et actions possibles. Même si leur consommation en mémoire est raisonnable et peut supporter des espaces d'états de taille très grande, ce type d'algorithmes est néanmoins susceptible d'engendrer des complexités en temps exponentielles dans le pire des cas.
+
+
+
+
+**6. [Self-loop, More than a parent, Cycle, More than a root, Valid tree]**
+
+⟶ [Boucle, Plus d'un parent, Cycle, Plus d'une racine, Arbre valide]
+
+
+
+
+**7. [Search problem ― A search problem is defined with:, a starting state sstart, possible actions Actions(s) from state s, action cost Cost(s,a) from state s with action a, successor Succ(s,a) of state s after action a, whether an end state was reached IsEnd(s)]**
+
+⟶ [Problème de recherche - Un problème de recherche est défini par :, un état de départ sstart, des actions Actions(s) pouvant être effectuées depuis l'état s, le coût de l'action Cost(s,a) depuis l'état s pour effectuer l'action a, le successeur Succ(s,a) de l'état s après avoir effectué l'action a, la connaissance d'avoir atteint ou non un état final IsEnd(s)]
+
+
+
+
+**8. The objective is to find a path that minimizes the cost.**
+
+⟶ L'objectif est de trouver un chemin minimisant le coût total des actions utilisées.
+
+
+
+
+**9. Backtracking search ― Backtracking search is a naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.**
+
+⟶ Retour sur trace - L'algorithme de retour sur trace (en anglais backtracking search) est un algorithme récursif explorant naïvement toutes les possibilités jusqu'à trouver le chemin de coût minimal. Ici, le coût des actions peut être positif ou négatif.
+
+
+
+
+**10. Breadth-first search (BFS) ― Breadth-first search is a graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant c⩾0.**
+
+⟶ Parcours en largeur (BFS) - L'algorithme de parcours en largeur (en anglais breadth-first search ou BFS) est un algorithme de parcours de graphe traversant chaque niveau de manière successive. On peut le coder de manière itérative à l'aide d'une file stockant à chaque étape les prochains nœuds à visiter. Cet algorithme suppose que le coût de toutes les actions est égal à une constante c⩾0.
+
+
+
+
+**11. Depth-first search (DFS) ― Depth-first search is a search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.**
+
+⟶ Parcours en profondeur (DFS) - L'algorithme de parcours en profondeur (en anglais depth-first search ou DFS) est un algorithme de parcours de graphe traversant chaque chemin qu'il emprunte aussi loin que possible. On peut le coder de manière récursive, ou itérative à l'aide d'une pile qui stocke à chaque étape les prochains nœuds à visiter. Cet algorithme suppose que le coût de toutes les actions est égal à 0.
+
+
+
+
+**12. Iterative deepening ― The iterative deepening trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant c⩾0.**
+
+⟶ Approfondissement itératif - L'astuce de l'approfondissement itératif (en anglais iterative deepening) est une modification de l'algorithme de DFS qui l'arrête après avoir atteint une certaine profondeur, garantissant l'optimalité de la solution trouvée quand toutes les actions ont un même coût constant c⩾0.
+
+
+
+
+**13. Tree search algorithms summary ― By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:**
+
+⟶ Récapitulatif des algorithmes de parcours d'arbre - En notant b le nombre d'actions par état, d la profondeur de la solution et D la profondeur maximale, on a :
+
+
+
+
+**14. [Algorithm, Action costs, Space, Time]**
+
+⟶ [Algorithme, Coût des actions, Espace, Temps]
+
+
+
+
+**15. [Backtracking search, any, Breadth-first search, Depth-first search, DFS-Iterative deepening]**
+
+⟶ [Retour sur trace, peu importe, Parcours en largeur, Parcours en profondeur, DFS-approfondissement itératif]
+
+
+
+
+**16. Graph search**
+
+⟶ Parcours de graphe
+
+
+
+
+**17. This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.**
+
+⟶ Cette catégorie d'algorithmes basés sur les états vise à trouver des chemins optimaux avec une complexité moins grande qu'exponentielle. Dans cette section, nous allons nous concentrer sur la programmation dynamique et la recherche à coût uniforme.
+
+
+
+
+**18. Graph ― A graph is comprised of a set of vertices V (also called nodes) as well as a set of edges E (also called links).**
+
+⟶ Graphe - Un graphe se compose d'un ensemble de sommets V (aussi appelés nœuds) et d'un ensemble d'arêtes E (appelées arcs lorsque le graphe est orienté).
+
+
+
+
+**19. Remark: a graph is said to be acyclic when there is no cycle.**
+
+⟶ Remarque : un graphe est dit être acyclique lorsqu'il ne contient pas de cycle.
+
+
+
+
+**20. State ― A state is a summary of all past actions sufficient to choose future actions optimally.**
+
+⟶ État - Un état contient le résumé des actions passées suffisant pour choisir les actions futures de manière optimale.
+
+
+
+
+**21. Dynamic programming ― Dynamic programming (DP) is a backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state s to an end state send. It can potentially have exponential savings compared to traditional graph search algorithms, and has the property to only work for acyclic graphs. For any given state s, the future cost is computed as follows:**
+
+⟶ Programmation dynamique - La programmation dynamique (en anglais dynamic programming ou DP) est un algorithme de recherche de type retour sur trace qui utilise le principe de mémoïsation (i.e. les résultats intermédiaires sont enregistrés) et ayant pour but de trouver le chemin à coût minimal allant de l'état s à l'état final send. Cette procédure peut potentiellement engendrer des économies exponentielles si on la compare aux algorithmes de parcours de graphe traditionnels, et a la propriété de ne marcher que dans le cas de graphes acycliques. Pour un état s donné, le coût futur est calculé de la manière suivante :
+
+
+
+
+**22. [if, otherwise]**
+
+⟶ [si, sinon]
+
+
+
+
+**23. Remark: the figure above illustrates a bottom-to-top approach whereas the formula provides the intuition of a top-to-bottom problem resolution.**
+
+⟶ Remarque : la figure ci-dessus illustre une approche ascendante alors que la formule nous donne l'intuition d'une résolution avec une approche descendante.
+
+
+
+
+**24. Types of states ― The table below presents the terminology when it comes to states in the context of uniform cost search:**
+
+⟶ Types d'états - La table ci-dessous présente la terminologie relative aux états dans le contexte de la recherche à coût uniforme :
+
+
+
+
+**25. [State, Explanation]**
+
+⟶ [État, Explication]
+
+
+
+
+**26. [Explored, Frontier, Unexplored]**
+
+⟶ [Exploré, Frontière, Inexploré]
+
+
+
+
+**27. [States for which the optimal path has already been found, States seen for which we are still figuring out how to get there with the cheapest cost, States not seen yet]**
+
+⟶ [États pour lesquels le chemin optimal a déjà été trouvé, États rencontrés mais pour lesquels on se demande toujours comment s'y rendre avec un coût minimal, États non rencontrés jusqu'à présent]
+
+
+
+
+**28. Uniform cost search ― Uniform cost search (UCS) is a search algorithm that aims at finding the shortest path from a state sstart to an end state send. It explores states s in increasing order of PastCost(s) and relies on the fact that all action costs are non-negative.**
+
+⟶ Recherche à coût uniforme - La recherche à coût uniforme (uniform cost search ou UCS en anglais) est un algorithme de recherche qui a pour but de trouver le chemin le plus court entre les états sstart et send. Celui-ci explore les états s en les triant par coût croissant de PastCost(s) et repose sur le fait que toutes les actions ont un coût non négatif.
+
+
+
+
+**29. Remark 1: the UCS algorithm is logically equivalent to Dijkstra's algorithm.**
+
+⟶ Remarque 1 : UCS fonctionne de la même manière que l'algorithme de Dijkstra.
+
+
+
+
+**30. Remark 2: the algorithm would not work for a problem with negative action costs, and adding a positive constant to make them non-negative would not solve the problem since this would end up being a different problem.**
+
+⟶ Remarque 2 : cet algorithme ne marche pas sur une configuration contenant des actions à coût négatif. Quelqu'un pourrait penser à ajouter une constante positive à tous les coûts, mais cela ne résoudrait rien puisque le problème résultant serait différent.
+
+
+
+
+**31. Correctness theorem ― When a state s is popped from the frontier F and moved to explored set E, its priority is equal to PastCost(s) which is the minimum cost path from sstart to s.**
+
+⟶ Théorème de correction - Lorsqu'un état s passe de la frontière F à l'ensemble exploré E, sa priorité est égale à PastCost(s), représentant le chemin de coût minimal allant de sstart à s.
+
+
+
+
+**32. Graph search algorithms summary ― By noting N the number of total states, n of which are explored before the end state send, we have:**
+
+⟶ Récapitulatif des algorithmes de parcours de graphe - En notant N le nombre total d'états dont n sont explorés avant l'état final send, on a :
+
+
+
+
+**33. [Algorithm, Acyclicity, Costs, Time/space]**
+
+⟶ [Algorithme, Acyclicité, Coûts, Temps/Espace]
+
+
+
+
+**34. [Dynamic programming, Uniform cost search]**
+
+⟶ [Programmation dynamique, Recherche à coût uniforme]
+
+
+
+
+**35. Remark: the complexity countdown supposes the number of possible actions per state to be constant.**
+
+⟶ Remarque : ce décompte de la complexité suppose que le nombre d'actions possibles à partir de chaque état est constant.
+
+
+
+
+**36. Learning costs**
+
+⟶ Apprendre les coûts
+
+
+
+
+**37. Suppose we are not given the values of Cost(s,a), we want to estimate these quantities from a training set of minimizing-cost-path sequence of actions (a1,a2,...,ak).**
+
+⟶ Supposons que les valeurs de Cost(s,a) ne nous soient pas données. Nous souhaitons estimer ces quantités à partir d'un ensemble d'apprentissage de séquences d'actions (a1,a2,...,ak) correspondant à des chemins de coût minimal.
+
+
+
+
+**38. [Structured perceptron ― The structured perceptron is an algorithm aiming at iteratively learning the cost of each state-action pair. At each step, it:, decreases the estimated cost of each state-action of the true minimizing path y given by the training data, increases the estimated cost of each state-action of the current predicted path y' inferred from the learned weights.]**
+
+⟶ [Perceptron structuré - L'algorithme du perceptron structuré vise à apprendre de manière itérative les coûts des paires état-action. À chaque étape, il :, fait décroître le coût estimé de chaque état-action du vrai chemin minimisant y donné par la base d'apprentissage, fait croître le coût estimé de chaque état-action du chemin y' prédit comme étant minimisant par les paramètres appris par l'algorithme.]
+
+
+
+
+**39. Remark: there are several versions of the algorithm, one of which simplifies the problem to only learning the cost of each action a, and the other parametrizes Cost(s,a) to a feature vector of learnable weights.**
+
+⟶ Remarque : plusieurs versions de cet algorithme existent, l'une d'elles réduisant ce problème à l'apprentissage du coût de chaque action a et l'autre paramétrisant chaque Cost(s,a) à un vecteur de paramètres pouvant être appris.
+
+
+
+
+**40. A* search**
+
+⟶ Algorithme A*
+
+
+
+
+**41. Heuristic function ― A heuristic is a function h over states s, where each h(s) aims at estimating FutureCost(s), the cost of the path from s to send.**
+
+⟶ Fonction heuristique - Une heuristique est une fonction h opérant sur les états s, où chaque h(s) vise à estimer FutureCost(s), le coût du chemin optimal allant de s à send.
+
+
+
+
+**42. Algorithm ― A∗ is a search algorithm that aims at finding the shortest path from a state s to an end state send. It explores states s in increasing order of PastCost(s)+h(s). It is equivalent to a uniform cost search with edge costs Cost′(s,a) given by:**
+
+⟶ Algorithme - A* est un algorithme de recherche visant à trouver le chemin le plus court entre un état s et un état final send. Il le fait en explorant les états s triés par ordre croissant de PastCost(s)+h(s). Cela revient à utiliser l'algorithme UCS où chaque arête est associée au coût Cost′(s,a) donné par :
+
+
+
+
+**43. Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be closer to the end state.**
+
+⟶ Remarque : cet algorithme peut être vu comme une version biaisée de UCS explorant les états estimés comme étant plus proches de l'état final.
+
+
+
+
+**44. [Consistency ― A heuristic h is said to be consistent if it satisfies the two following properties:, For all states s and actions a, The end state verifies the following:]**
+
+⟶ [Consistance - Une heuristique h est dite consistante si elle satisfait les deux propriétés suivantes :, Pour tous états s et actions a, L'état final vérifie la propriété :]
+
+
+
+
+**45. Correctness ― If h is consistent, then A∗ returns the minimum cost path.**
+
+⟶ Correction - Si h est consistante, alors A* renvoie le chemin de coût minimal.
+
+
+
+
+**46. Admissibility ― A heuristic h is said to be admissible if we have:**
+
+⟶ Admissibilité - Une heuristique est dite admissible si l'on a :
+
+
+
+
+**47. Theorem ― Let h(s) be a given heuristic. We have:**
+
+⟶ Théorème - Soit h(s) une heuristique. On a :
+
+
+
+
+**48. [consistent, admissible]**
+
+⟶ [consistante, admissible]
+
+
+
+
+**49. Efficiency ― A* explores all states s satisfying the following equation:**
+
+⟶ Efficacité - A* explore tous les états s satisfaisant l'équation :
+
+
+
+
+**50. Remark: larger values of h(s) are better, as this equation shows that they restrict the set of states s to be explored.**
+
+⟶ Remarque : avoir h(s) élevé est préférable puisque cette équation montre que le nombre d'états s à explorer est alors réduit.
+
+
+
+
+**51. Relaxation**
+
+⟶ Relaxation
+
+
+
+
+**52. It is a framework for producing consistent heuristics. The idea is to find closed-form reduced costs by removing constraints and use them as heuristics.**
+
+⟶ C'est un type de procédure permettant de produire des heuristiques consistantes. L'idée est de trouver une fonction de coût facile à exprimer en enlevant des contraintes au problème, et ensuite l'utiliser en tant qu'heuristique.
+
+
+
+
+**53. Relaxed search problem ― The relaxation of search problem P with costs Cost is noted Prel with costs Costrel, and satisfies the identity:**
+
+⟶ Relaxation d'un problème de recherche - La relaxation d'un problème de recherche P aux coûts Cost est notée Prel avec coûts Costrel, et vérifie la relation :
+
+
+
+
+**54. Relaxed heuristic ― Given a relaxed search problem Prel, we define the relaxed heuristic h(s)=FutureCostrel(s) as the minimum cost path from s to an end state in the graph of costs Costrel(s,a).**
+
+⟶ Heuristique relaxée - Étant donné la relaxation d'un problème de recherche Prel, on définit l'heuristique relaxée h(s)=FutureCostrel(s) comme étant le chemin de coût minimal allant de s à un état final dans le graphe de fonction de coût Costrel(s,a).
+
+
+
+
+**55. Consistency of relaxed heuristics ― Let Prel be a given relaxed problem. By theorem, we have:**
+
+⟶ Consistance de la relaxation d'heuristiques - Soit Prel une relaxation d'un problème de recherche. Par théorème, on a :
+
+
+
+
+**56. consistent**
+
+⟶ consistante
+
+
+
+
+**57. [Tradeoff when choosing heuristic ― We have to balance two aspects in choosing a heuristic:, Computational efficiency: h(s)=FutureCostrel(s) must be easy to compute. It has to produce a closed form, easier search and independent subproblems., Good enough approximation: the heuristic h(s) should be close to FutureCost(s) and we have thus to not remove too many constraints.]**
+
+⟶ [Compromis lors du choix d'heuristique - Le choix d'heuristique repose sur un compromis entre :, Complexité de calcul : h(s)=FutureCostrel(s) doit être facile à calculer. De manière préférable, cette fonction peut s'exprimer de manière explicite et elle permet de diviser le problème en sous-parties indépendantes., Approximation suffisamment bonne : l'heuristique h(s) devrait être proche de FutureCost(s), il ne faut donc pas enlever trop de contraintes.]
+
+
+
+
+**58. Max heuristic ― Let h1(s), h2(s) be two heuristics. We have the following property:**
+
+⟶ Heuristique max - Soient h1(s) et h2(s) deux heuristiques. On a la propriété suivante :
+
+
+
+
+**59. Markov decision processes**
+
+⟶ Processus de décision markoviens
+
+
+
+
+**60. In this section, we assume that performing action a from state s can lead to several states s′1,s′2,... in a probabilistic manner. In order to find our way between an initial state and an end state, our objective will be to find the maximum value policy by using Markov decision processes that help us cope with randomness and uncertainty.**
+
+⟶ Dans cette section, on suppose qu'effectuer l'action a à partir de l'état s peut mener de manière probabiliste à plusieurs états s′1,s′2,... Dans le but de trouver ce qu'il faudrait faire entre un état initial et un état final, on souhaite trouver une stratégie maximisant la quantité des récompenses en utilisant un outil adapté à l'imprévisibilité et l'incertitude : les processus de décision markoviens.
+
+
+
+
+**61. Notations**
+
+⟶ Notations
+
+
+
+
+**62. [Definition ― The objective of a Markov decision process is to maximize rewards. It is defined with:, a starting state sstart, possible actions Actions(s) from state s, transition probabilities T(s,a,s′) from s to s′ with action a, rewards Reward(s,a,s′) from s to s′ with action a, whether an end state was reached IsEnd(s), a discount factor 0⩽γ⩽1]**
+
+⟶ [Définition - L'objectif d'un processus de décision markovien (en anglais Markov decision process ou MDP) est de maximiser la quantité de récompenses. Un tel problème est défini par :, un état de départ sstart, l'ensemble des actions Actions(s) pouvant être effectuées à partir de l'état s, la probabilité de transition T(s,a,s′) de l'état s vers l'état s' après avoir pris l'action a, la récompense Reward(s,a,s′) pour être passé de l'état s à l'état s' après avoir pris l'action a, la connaissance d'avoir atteint ou non un état final IsEnd(s), un facteur de dévaluation 0⩽γ⩽1]
+
+
+
+
+**63. Transition probabilities ― The transition probability T(s,a,s′) specifies the probability of going to state s′ after action a is taken in state s. Each s′↦T(s,a,s′) is a probability distribution, which means that:**
+
+⟶ Probabilités de transition - La probabilité de transition T(s,a,s′) représente la probabilité de transitionner vers l'état s′ après avoir effectué l'action a en étant dans l'état s. Chaque s′↦T(s,a,s′) est une loi de probabilité, ce qui signifie que :
+
+
+
+
+**64. states**
+
+⟶ états
+
+
+
+
+**65. Policy ― A policy π is a function that maps each state s to an action a, i.e.**
+
+⟶ Politique - Une politique π est une fonction liant chaque état s à une action a, i.e. :
+
+
+
+
+**66. Utility ― The utility of a path (s0,...,sk) is the discounted sum of the rewards on that path. In other words,**
+
+⟶ Utilité - L'utilité d'un chemin (s0,...,sk) est la somme des récompenses dévaluées récoltées sur ce chemin. En d'autres termes,
+
+
+
+
+**67. The figure above is an illustration of the case k=4.**
+
+⟶ La figure ci-dessus illustre le cas k=4.
+
+
+
+
+**68. Q-value ― The Q-value of a policy π at state s with action a, also noted Qπ(s,a), is the expected utility from state s after taking action a and then following policy π. It is defined as follows:**
+
+⟶ Q-value - La fonction de valeur des états-actions (Q-value en anglais) d'une politique π évaluée à l'état s avec l'action a, aussi notée Qπ(s,a), est l'espérance de l'utilité partant de l'état s avec l'action a et adoptant ensuite la politique π. Cette fonction est définie par :
+
+
+
+
+**69. Value of a policy ― The value of a policy π from state s, also noted Vπ(s), is the expected utility by following policy π from state s over random paths. It is defined as follows:**
+
+⟶ Fonction de valeur des états d'une politique - La fonction de valeur des états d'une politique π évaluée à l'état s, aussi notée Vπ(s), est l'espérance de l'utilité partant de l'état s et adoptant ensuite la politique π. Cette fonction est définie par :
+
+
+
+
+**70. Remark: Vπ(s) is equal to 0 if s is an end state.**
+
+⟶ Remarque : Vπ(s) vaut 0 si s est un état final.
+
+
+
+
+**71. Applications**
+
+⟶ Applications
+
+
+
+
+**72. [Policy evaluation ― Given a policy π, policy evaluation is an iterative algorithm that aims at estimating Vπ. It is done as follows:, Initialization: for all states s, we have:, Iteration: for t from 1 to TPE, we have, with]**
+
+⟶ [Évaluation d'une politique - Étant donnée une politique π, on peut utiliser l'algorithme itératif d'évaluation de politiques (en anglais policy evaluation) pour estimer Vπ :, Initialisation : pour tous les états s, on a, Itération : pour t allant de 1 à TPE, on a, avec]
+
+
+
+
+**73. Remark: by noting S the number of states, A the number of actions per state, S′ the number of successors and TPE the number of iterations, the time complexity is O(TPE·S·S′).**
+
+⟶ Remarque : en notant S le nombre d'états, A le nombre d'actions par état, S′ le nombre de successeurs et TPE le nombre d'itérations, la complexité en temps est alors de O(TPE·S·S′).
+
+
+
+
+**74. Optimal Q-value ― The optimal Q-value Qopt(s,a) of state s with action a is defined to be the maximum Q-value attained by any policy starting. It is computed as follows:**
+
+⟶ Q-value optimale - La Q-value optimale Qopt(s,a) d'un état s avec l'action a est définie comme étant la Q-value maximale atteinte avec n'importe quelle politique. Elle est calculée avec la formule :
+
+
+
+
+**75. Optimal value ― The optimal value Vopt(s) of state s is defined as being the maximum value attained by any policy. It is computed as follows:**
+
+⟶ Valeur optimale - La valeur optimale Vopt(s) d'un état s est définie comme étant la valeur maximale atteinte par n'importe quelle politique. Elle est calculée avec la formule :
+
+
+
+
+**76. actions**
+
+⟶ actions
+
+
+
+
+**77. Optimal policy ― The optimal policy πopt is defined as being the policy that leads to the optimal values. It is defined by:**
+
+⟶ Politique optimale - La politique optimale πopt est définie comme étant la politique liée aux valeurs optimales. Elle est définie par :
+
+
+
+
+**78. [Value iteration ― Value iteration is an algorithm that finds the optimal value Vopt as well as the optimal policy πopt. It is done as follows:, Initialization: for all states s, we have:, Iteration: for t from 1 to TVI, we have:, with]**
+
+⟶ [Itération sur la valeur - L'algorithme d'itération sur la valeur (en anglais value iteration) vise à trouver la valeur optimale Vopt ainsi que la politique optimale πopt en deux temps :, Initialisation : pour tout état s, on a, Itération : pour t allant de 1 à TVI, on a, avec]
+
+
+
+
+**79. Remark: if we have either γ<1 or the MDP graph being acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.**
+
+⟶ Remarque : si γ<1 ou si le graphe associé au processus de décision markovien est acyclique, alors la convergence de l'algorithme d'itération sur la valeur vers la bonne solution est garantie.
+
+
+
+
+**80. When unknown transitions and rewards**
+
+⟶ Cas des transitions et récompenses inconnues
+
+
+
+
+**81. Now, let's assume that the transition probabilities and the rewards are unknown.**
+
+⟶ On suppose maintenant que les probabilités de transition et les récompenses sont inconnues.
+
+
+
+
+**82. Model-based Monte Carlo ― The model-based Monte Carlo method aims at estimating T(s,a,s′) and Reward(s,a,s′) using Monte Carlo simulation with:**
+
+⟶ Monte-Carlo basé sur modèle - La méthode de Monte-Carlo basée sur modèle (en anglais model-based Monte Carlo) vise à estimer T(s,a,s′) et Reward(s,a,s′) en utilisant des simulations de Monte-Carlo avec :
+
+
+
+
+**83. [# times (s,a,s′) occurs, and]**
+
+⟶ [# de fois où (s,a,s′) se produit, et]
+
+
+
+
+**84. These estimations will be then used to deduce Q-values, including Qπ and Qopt.**
+
+⟶ Ces estimations sont ensuite utilisées pour trouver les Q-values, notamment Qπ et Qopt.
+
+
+
+
+**85. Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not depend on the exact policy.**
+
+⟶ Remarque : la méthode de Monte-Carlo basée sur modèle est dite "hors politique" (en anglais "off-policy") car l'estimation produite ne dépend pas de la politique utilisée.
+
+
+
+
+**86. Model-free Monte Carlo ― The model-free Monte Carlo method aims at directly estimating Qπ, as follows:**
+
+⟶ Monte-Carlo sans modèle - La méthode de Monte-Carlo sans modèle (en anglais model-free Monte Carlo) vise à directement estimer Qπ de la manière suivante :
+
+
+
+
+**87. Qπ(s,a)=average of ut where st−1=s,at=a**
+
+⟶ Qπ(s,a)=moyenne de ut où st−1=s,at=a
+
+
+
+
+**88. where ut denotes the utility starting at step t of a given episode.**
+
+⟶ où ut désigne l'utilité à partir de l'étape t d'un épisode donné.
+
+
+
+
+**89. Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent on the policy π used to generate the data.**
+
+⟶ Remarque : la méthode de Monte-Carlo sans modèle est dite "sur politique" (en anglais "on-policy") car l'estimation produite dépend de la politique π utilisée pour générer les données.
+
+
+
+
+**90. Equivalent formulation - By introducing the constant η=1/(1+(#updates to (s,a))) and for each (s,a,u) of the training set, the update rule of model-free Monte Carlo has a convex combination formulation:**
+
+⟶ Formulation équivalente - En introduisant la constante η=1/(1+(#mises à jour à (s,a))) et pour chaque triplet (s,a,u) de la base d'apprentissage, la formule de récurrence de la méthode de Monte-Carlo sans modèle s'écrit à l'aide de la combinaison convexe :
+
+
+
+
+**91. as well as a stochastic gradient formulation:**
+
+⟶ ainsi qu'à l'aide d'une formulation de type gradient stochastique :
+
+
+
+
+**92. SARSA ― State-action-reward-state-action (SARSA) is a bootstrapping method estimating Qπ by using both raw data and estimates as part of the update rule. For each (s,a,r,s′,a′), we have:**
+
+⟶ SARSA - État-action-récompense-état-action (en anglais state-action-reward-state-action ou SARSA) est une méthode de bootstrap qui estime Qπ en utilisant à la fois des données réelles et estimées dans sa formule de mise à jour. Pour chaque (s,a,r,s′,a′), on a :
+
+
+
+
+**93. Remark: the SARSA estimate is updated on the fly as opposed to the model-free Monte Carlo one where the estimate can only be updated at the end of the episode.**
+
+⟶ Remarque : l'estimation donnée par SARSA est mise à jour à la volée contrairement à celle donnée par la méthode de Monte-Carlo sans modèle où la mise à jour est uniquement effectuée à la fin de l'épisode.
+
+
+
+
+**94. Q-learning ― Q-learning is an off-policy algorithm that produces an estimate for Qopt. On each (s,a,r,s′,a′), we have:**
+
+⟶ Q-learning - Le Q-apprentissage (en anglais Q-learning) est un algorithme hors politique (en anglais off-policy) donnant une estimation de Qopt. Pour chaque (s,a,r,s′,a′), on a :
+
+
+
+
+**95. Epsilon-greedy ― The epsilon-greedy policy is an algorithm that balances exploration with probability ϵ and exploitation with probability 1−ϵ. For a given state s, the policy πact is computed as follows:**
+
+⟶ Epsilon-glouton - La politique epsilon-gloutonne (en anglais epsilon-greedy) est un algorithme essayant de trouver un compromis entre l'exploration avec probabilité ϵ et l'exploitation avec probabilité 1-ϵ. Pour un état s, la politique πact est calculée par :
+
+
+
+
+**96. [with probability, random from Actions(s)]**
+
+⟶ [avec probabilité, aléatoire venant d'Actions(s)]
+
+
+
+
+**97. Game playing**
+
+⟶ Jeux
+
+
+
+
+**98. In games (e.g. chess, backgammon, Go), other agents are present and need to be taken into account when constructing our policy.**
+
+⟶ Dans les jeux (e.g. échecs, backgammon, Go), d'autres agents sont présents et doivent être pris en compte au moment d'élaborer une politique.
+
+
+
+
+**99. Game tree ― A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.**
+
+⟶ Arbre de jeu - Un arbre de jeu est un arbre décrivant les possibilités d'un jeu. En particulier, chaque noeud représente un point de décision pour un joueur et chaque chemin liant la racine à une des feuilles correspond à une issue possible du jeu.
+
+
+
+
+**100. [Two-player zero-sum game ― It is a game where each state is fully observed and such that players take turns. It is defined with:, a starting state sstart, possible actions Actions(s) from state s, successors Succ(s,a) from states s with actions a, whether an end state was reached IsEnd(s), the agent's utility Utility(s) at end state s, the player Player(s) who controls state s]**
+
+⟶ [Jeu à somme nulle à deux joueurs - C'est un type de jeu où chaque état est entièrement observé et où les joueurs jouent à tour de rôle. On le définit par :, un état de départ sstart, des actions possibles Actions(s) à partir de l'état s, des successeurs Succ(s,a) de l'état s après l'action a, la connaissance d'avoir atteint ou non un état final IsEnd(s), l'utilité de l'agent Utility(s) à l'état final s, le joueur Player(s) qui contrôle l'état s]
+
+
+
+
+**101. Remark: we will assume that the utility of the agent has the opposite sign of the one of the opponent.**
+
+⟶ Remarque : nous supposerons que l'utilité de l'agent est de signe opposé à celle de son adversaire.
+
+
+
+
+**102. [Types of policies ― There are two types of policies:, Deterministic policies, noted πp(s), which are actions that player p takes in state s., Stochastic policies, noted πp(s,a)∈[0,1], which are probabilities that player p takes action a in state s.]**
+
+⟶ [Types de politiques - Il y a deux types de politiques :, Les politiques déterministes, notées πp(s), qui représentent pour tout s l'action que le joueur p prend dans l'état s., Les politiques stochastiques, notées πp(s,a)∈[0,1], qui sont décrites pour tout s et a par la probabilité que le joueur p prenne l'action a dans l'état s.]
+
+
+
+
+**103. Expectimax ― For a given state s, the expectimax value Vexptmax(s) is the maximum expected utility of any agent policy when playing with respect to a fixed and known opponent policy πopp. It is computed as follows:**
+
+⟶ Expectimax - Pour un état donné s, la valeur d'expectimax Vexptmax(s) est l'utilité espérée maximale sur l'ensemble des politiques de l'agent lorsque celui-ci joue contre un adversaire de politique fixée et connue πopp. Cette valeur est calculée de la manière suivante :
+
+
+
+
+**104. Remark: expectimax is the analog of value iteration for MDPs.**
+
+⟶ Remarque : expectimax est l'analogue de l'algorithme d'itération sur la valeur pour les MDPs.
+
+
+
+
+**105. Minimax ― The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent's utility. It is done as follows:**
+
+⟶ Minimax - Le but des politiques minimax est de trouver une politique optimale contre un adversaire en supposant le pire cas, i.e. que l'adversaire fait tout pour minimiser l'utilité de l'agent. La valeur correspondante est calculée par :
+
+
+
+
+**106. Remark: we can extract πmax and πmin from the minimax value Vminimax.**
+
+⟶ Remarque : on peut déduire πmax et πmin à partir de la valeur minimax Vminimax.
+
+
+
+
+**107. Minimax properties ― By noting V the value function, there are 3 properties around minimax to have in mind:**
+
+⟶ Propriétés de minimax - En notant V la fonction de valeur, il y a 3 propriétés sur minimax qu'il faut avoir à l'esprit :
+
+
+
+
+**108. Property 1: if the agent were to change its policy to any πagent, then the agent would be no better off.**
+
+⟶ Propriété 1 : si l'agent changeait sa politique en un quelconque πagent, alors il ne s'en sortirait pas mieux.
+
+
+
+
+**109. Property 2: if the opponent changes its policy from πmin to πopp, then it will be no better off.**
+
+⟶ Propriété 2 : si son adversaire change sa politique de πmin à πopp, alors il ne s'en sortira pas mieux.
+
+
+
+
+**110. Property 3: if the opponent is known to be not playing the adversarial policy, then the minimax policy might not be optimal for the agent.**
+
+⟶ Propriété 3 : si l'on sait que son adversaire ne joue pas les pires actions possibles, alors la politique minimax peut ne pas être optimale pour l'agent.
+
+
+
+
+**111. In the end, we have the following relationship:**
+
+⟶ En fin de compte, on a la relation suivante :
+
+
+
+
+**112. Speeding up minimax**
+
+⟶ Accélération de minimax
+
+
+
+
+**113. Evaluation function ― An evaluation function is a domain-specific and approximate estimate of the value Vminimax(s). It is noted Eval(s).**
+
+⟶ Fonction d'évaluation - Une fonction d'évaluation estime de manière approximative la valeur Vminimax(s) selon les paramètres du problème. Elle est notée Eval(s).
+
+
+
+
+**114. Remark: FutureCost(s) is the analogue for search problems.**
+
+⟶ Remarque : l'analogue de cette fonction utilisé dans les problèmes de recherche est FutureCost(s).
+
+
+
+
+**115. Alpha-beta pruning ― Alpha-beta pruning is a domain-general exact method optimizing the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in α for the maximizing player and in β for the minimizing player). At a given step, the condition β<α means that the optimal path is not going to be in the current branch as the earlier player had a better option at their disposal.**
+
+⟶ Élagage alpha-bêta - L'élagage alpha-bêta (en anglais alpha-beta pruning) est une méthode exacte et générale d'optimisation de l'algorithme minimax qui évite l'exploration de parties inutiles de l'arbre de jeu. Pour ce faire, chaque joueur garde en mémoire la meilleure valeur qu'il puisse espérer (appelée α chez le joueur maximisant et β chez le joueur minimisant). À une étape donnée, la condition β<α signifie que le chemin optimal ne peut pas passer par la branche actuelle puisque le joueur qui précédait avait une meilleure option à sa disposition.
+
+
+
+
+**116. TD learning ― Temporal difference (TD) learning is used when we don't know the transitions/rewards. The value is based on the exploration policy. To be able to use it, we need to know the rules of the game Succ(s,a). For each (s,a,r,s′), the update is done as follows:**
+
+⟶ TD learning - L'apprentissage par différence temporelle (en anglais temporal difference learning ou TD learning) est une méthode utilisée lorsque l'on ne connaît pas les transitions/récompenses. La valeur est alors basée sur la politique d'exploration. Pour pouvoir l'utiliser, on a besoin de connaître les règles du jeu Succ(s,a). Pour chaque (s,a,r,s′), la mise à jour est faite de la manière suivante :
+
+
+
+
+**117. Simultaneous games**
+
+⟶ Jeux simultanés
+
+
+
+
+**118. This is the opposite of turn-based games: there is no ordering on the players' moves.**
+
+⟶ Ce cas est l'opposé des jeux joués tour à tour : il n'y a pas d'ordre sur les coups des joueurs.
+
+
+
+
+**119. Single-move simultaneous game ― Let there be two players A and B, with given possible actions. We note V(a,b) to be A's utility if A chooses action a, B chooses action b. V is called the payoff matrix.**
+
+⟶ Jeu simultané à un mouvement - Soient deux joueurs A et B, munis de possibles actions. On note V(a,b) l'utilité de A si A choisit l'action a et B l'action b. V est appelée la matrice de profit (en anglais payoff matrix).
+
+
+
+
+**120. [Strategies ― There are two main types of strategies:, A pure strategy is a single action:, A mixed strategy is a probability distribution over actions:]**
+
+⟶ [Stratégies - Il y a principalement deux types de stratégies :, Une stratégie pure est une seule action :, Une stratégie mixte est une loi de probabilité sur les actions :]
+
+
+
+
+**121. Game evaluation ― The value of the game V(πA,πB) when player A follows πA and player B follows πB is such that:**
+
+⟶ Évaluation de jeu - La valeur d'un jeu V(πA,πB) quand le joueur A suit πA et le joueur B suit πB est telle que :
+
+
+
+
+**122. Minimax theorem ― By noting πA,πB ranging over mixed strategies, for every simultaneous two-player zero-sum game with a finite number of actions, we have:**
+
+⟶ Théorème Minimax - Soient πA et πB parcourant l'ensemble des stratégies mixtes. Pour chaque jeu simultané à somme nulle à deux joueurs ayant un nombre fini d'actions, on a :
+
+
+
+
+**123. Non-zero-sum games**
+
+⟶ Jeux à somme non nulle
+
+
+
+
+**124. Payoff matrix ― We define Vp(πA,πB) to be the utility for player p.**
+
+⟶ Matrice de profit - On définit Vp(πA,πB) l'utilité du joueur p.
+
+
+
+
+**125. Nash equilibrium ― A Nash equilibrium is (π∗A,π∗B) such that no player has an incentive to change its strategy. We have:**
+
+⟶ Équilibre de Nash - Un équilibre de Nash est défini par (π∗A,π∗B) tel qu'aucun joueur n'a intérêt à changer sa stratégie. On a :
+
+
+
+
+**126. and**
+
+⟶ et
+
+
+
+
+**127. Remark: in any finite-player game with finite number of actions, there exists at least one Nash equilibrium.**
+
+⟶ Remarque : dans un jeu à nombre de joueurs et d'actions finis, il existe au moins un équilibre de Nash.
+
+
+
+
+**128. [Tree search, Backtracking search, Breadth-first search, Depth-first search, Iterative deepening]**
+
+⟶ [Parcours d'arbre, Retour sur trace, Parcours en largeur, Parcours en profondeur, Approfondissement itératif]
+
+
+
+
+**129. [Graph search, Dynamic programming, Uniform cost search]**
+
+⟶ [Parcours de graphe, Programmation dynamique, Recherche à coût uniforme]
+
+
+
+
+**130. [Learning costs, Structured perceptron]**
+
+⟶ [Apprendre les coûts, Perceptron structuré]
+
+
+
+
+**131. [A star search, Heuristic function, Algorithm, Consistency, correctness, Admissibility, efficiency]**
+
+⟶ [A étoile, Fonction heuristique, Algorithme, Consistance, correction, Admissibilité, efficacité]
+
+
+
+
+**132. [Relaxation, Relaxed search problem, Relaxed heuristic, Max heuristic]**
+
+⟶ [Relaxation, Relaxation d'un problème de recherche, Relaxation d'une heuristique, Heuristique max]
+
+
+
+
+**133. [Markov decision processes, Overview, Policy evaluation, Value iteration, Transitions, rewards]**
+
+⟶ [Processus de décision markovien, Aperçu, Évaluation d'une politique, Itération sur la valeur, Transitions, récompenses]
+
+
+
+
+**134. [Game playing, Expectimax, Minimax, Speeding up minimax, Simultaneous games, Non-zero-sum games]**
+
+⟶ [Jeux, Expectimax, Minimax, Accélération de minimax, Jeux simultanés, Jeux à somme non nulle]
+
+
+
+
+**135. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub.
+
+
+
+
+**136. Original authors**
+
+⟶ Auteurs d'origine.
+
+
+
+
+**137. Translated by X, Y and Z**
+
+⟶ Traduit de l'anglais par X, Y et Z.
+
+
+
+
+**138. Reviewed by X, Y and Z**
+
+⟶ Revu par X, Y et Z.
+
+
+
+
+**139. By X and Y**
+
+⟶ De X et Y.
+
+
+
+
+**140. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'intelligence artificielle sont maintenant disponibles en français !
diff --git a/fr/cs-221-variables-models.md b/fr/cs-221-variables-models.md
new file mode 100644
index 000000000..9c802583b
--- /dev/null
+++ b/fr/cs-221-variables-models.md
@@ -0,0 +1,617 @@
+**Variables-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-variables-models)
+
+
+
+**1. Variables-based models with CSP and Bayesian networks**
+
+⟶ Modèles basés sur les variables : CSP et réseaux bayésiens
+
+
+
+
+**2. Constraint satisfaction problems**
+
+⟶ Problèmes de satisfaction de contraintes
+
+
+
+
+**3. In this section, our objective is to find maximum weight assignments of variable-based models. One advantage compared to states-based models is that these algorithms make it more convenient to encode problem-specific constraints.**
+
+⟶ Dans cette section, notre but est de trouver des affectations de poids maximal dans des modèles basés sur les variables. Un avantage comparé aux modèles basés sur les états est que ces algorithmes sont plus commodes lorsqu'il s'agit de transcrire des contraintes spécifiques à certains problèmes.
+
+
+
+
+**4. Factor graphs**
+
+⟶ Graphes de facteurs
+
+
+
+
+**5. Definition ― A factor graph, also referred to as a Markov random field, is a set of variables X=(X1,...,Xn) where Xi∈Domaini and m factors f1,...,fm with each fj(X)⩾0.**
+
+⟶ Définition - Un graphe de facteurs, aussi appelé champ aléatoire de Markov, est un ensemble de variables X=(X1,...,Xn) où Xi∈Domaini muni de m facteurs f1,...,fm où chaque fj(X)⩾0.
+
+
+
+
+**6. Domain**
+
+⟶ Domaine
+
+
+
+
+**7. Scope and arity ― The scope of a factor fj is the set of variables it depends on. The size of this set is called the arity.**
+
+⟶ Portée et arité - La portée d'un facteur fj est l'ensemble des variables dont il dépend. La taille de cet ensemble est appelée son arité.
+
+
+
+
+**8. Remark: factors of arity 1 and 2 are called unary and binary respectively.**
+
+⟶ Remarque : les facteurs d'arité 1 et 2 sont respectivement appelés unaire et binaire.
+
+
+
+
+**9. Assignment weight ― Each assignment x=(x1,...,xn) yields a weight Weight(x) defined as being the product of all factors fj applied to that assignment. Its expression is given by:**
+
+⟶ Poids d'une affectation - Chaque affectation x=(x1,...,xn) donne un poids Weight(x) défini comme étant le produit de tous les facteurs fj appliqués à cette affectation. Son expression est donnée par :
+
+
+
+
+**10. Constraint satisfaction problem ― A constraint satisfaction problem (CSP) is a factor graph where all factors are binary; we call them constraints:**
+
+⟶ Problème de satisfaction de contraintes - Un problème de satisfaction de contraintes (en anglais constraint satisfaction problem ou CSP) est un graphe de facteurs où tous les facteurs sont binaires ; on les appelle "contraintes".
+
+
+
+
+**11. Here, the constraint j with assignment x is said to be satisfied if and only if fj(x)=1.**
+
+⟶ Ici, on dit que l'affectation x satisfait la contrainte j si et seulement si fj(x)=1.
+
+
+
+
+**12. Consistent assignment ― An assignment x of a CSP is said to be consistent if and only if Weight(x)=1, i.e. all constraints are satisfied.**
+
+⟶ Affectation consistante - Une affectation x d'un CSP est dite consistante si et seulement si Weight(x)=1, i.e. toutes les contraintes sont satisfaites.
+
+
+
+
+**13. Dynamic ordering**
+
+⟶ Mise en ordre dynamique
+
+
+
+
+**14. Dependent factors ― The set of dependent factors of variable Xi with partial assignment x is called D(x,Xi), and denotes the set of factors that link Xi to already assigned variables.**
+
+⟶ Facteurs dépendants - L'ensemble des facteurs dépendants de la variable Xi dont l'affectation partielle est x est appelé D(x,Xi) et désigne l'ensemble des facteurs liant Xi à des variables déjà affectées.
+
+
+
+
+**15. Backtracking search ― Backtracking search is an algorithm used to find maximum weight assignments of a factor graph. At each step, it chooses an unassigned variable and explores its values by recursion. Dynamic ordering (i.e. choice of variables and values) and lookahead (i.e. early elimination of inconsistent options) can be used to explore the graph more efficiently, although the worst-case runtime stays exponential: O(|Domain|^n).**
+
+⟶ Recherche avec retour sur trace - L'algorithme de recherche avec retour sur trace (en anglais backtracking search) est utilisé pour trouver l'affectation de poids maximum d'un graphe de facteurs. À chaque étape, une variable non assignée est choisie et ses valeurs sont explorées par récursivité. On peut utiliser un processus de mise en ordre dynamique sur le choix des variables et valeurs et/ou d'anticipation (i.e. élimination précoce d'options non consistantes) pour explorer le graphe de manière plus efficace. La complexité temporelle dans le pire des cas reste néanmoins exponentielle : O(|Domaine|^n).
+
+
+
+
+**16. [Forward checking ― It is a one-step lookahead heuristic that preemptively removes inconsistent values from the domains of neighboring variables. It has the following characteristics:, After assigning a variable Xi, it eliminates inconsistent values from the domains of all its neighbors., If any of these domains becomes empty, we stop the local backtracking search., If we un-assign a variable Xi, we have to restore the domain of its neighbors.]**
+
+⟶ [Vérification en avant - La vérification en avant (forward checking en anglais) est une heuristique d'anticipation à une étape qui enlève de manière préemptive les valeurs non consistantes des domaines des variables voisines. Cette méthode a les caractéristiques suivantes :, Après l'affectation d'une variable Xi, les valeurs non consistantes sont éliminées du domaine de tous ses voisins., Si l'un de ces domaines devient vide, la recherche locale s'arrête., Si l'on enlève l'affectation d'une variable Xi, on doit restaurer le domaine de ses voisins.]
+
+
+
+
+**17. Most constrained variable ― It is a variable-level ordering heuristic that selects the next unassigned variable that has the fewest consistent values. This has the effect of making inconsistent assignments fail earlier in the search, which enables more efficient pruning.**
+
+⟶ Variable la plus contrainte - L'heuristique de la variable la plus contrainte (en anglais most constrained variable ou MCV) sélectionne la prochaine variable sans affectation ayant le moins de valeurs consistantes. Cette procédure a pour effet de faire échouer les affectations impossibles plus tôt dans la recherche, permettant un élagage plus efficace.
+
+
+
+
+**18. Least constrained value ― It is a value-level ordering heuristic that assigns the next value that yields the highest number of consistent values of neighboring variables. Intuitively, this procedure chooses first the values that are most likely to work.**
+
+⟶ Valeur la moins contraignante - L'heuristique de la valeur la moins contraignante (en anglais least constrained value ou LCV) sélectionne pour une variable donnée la prochaine valeur maximisant le nombre de valeurs consistantes chez les variables voisines. De manière intuitive, on peut dire que cette procédure choisit en premier les valeurs qui sont les plus susceptibles de marcher.
+
+
+
+
+**19. Remark: in practice, this heuristic is useful when all factors are constraints.**
+
+⟶ Remarque : en pratique, cette heuristique est utile quand tous les facteurs sont des contraintes.
+
+
+
+
+**20. The example above is an illustration of the 3-color problem with backtracking search coupled with most constrained variable exploration and least constrained value heuristic, as well as forward checking at each step.**
+
+⟶ L'exemple ci-dessus est une illustration du problème de coloration de graphe à 3 couleurs en utilisant l'algorithme de recherche avec retour sur trace couplé avec les heuristiques de MCV, de LCV ainsi que de vérification en avant à chaque étape.
+
+
+
+
+**21. [Arc consistency ― We say that arc consistency of variable Xl with respect to Xk is enforced when for each xl∈Domainl:, unary factors of Xl are non-zero, there exists at least one xk∈Domaink such that any factor between Xl and Xk is non-zero.]**
+
+⟶ [Arc-consistance - On dit que l'arc-consistance de la variable Xl par rapport à Xk est vérifiée lorsque pour tout xl∈Domainl :, les facteurs unaires de Xl sont non-nuls, il existe au moins un xk∈Domaink tel que n'importe quel facteur entre Xl et Xk est non nul.]
+
+
+
+
+**22. AC-3 ― The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking to all relevant variables. After a given assignment, it performs forward checking and then successively enforces arc consistency with respect to the neighbors of variables for which the domain changes during the process.**
+
+⟶ AC-3 - L'algorithme d'AC-3 est une heuristique d'anticipation à plusieurs étapes qui applique le principe de vérification en avant à toutes les variables susceptibles d'être concernées. Après l'affectation d'une variable, cet algorithme effectue une vérification en avant et applique successivement l'arc-consistance avec tous les voisins des variables dont le domaine change au cours du processus.
+
+
+
+
+**23. Remark: AC-3 can be implemented both iteratively and recursively.**
+
+⟶ Remarque : AC-3 peut être codé de manière itérative ou récursive.
+
+
+
+
+**24. Approximate methods**
+
+⟶ Méthodes approximatives
+
+
+
+
+**25. Beam search ― Beam search is an approximate algorithm that extends partial assignments of n variables of branching factor b=|Domain| by exploring the K top paths at each step. The beam size K∈{1,...,b^n} controls the tradeoff between efficiency and accuracy. This algorithm has a time complexity of O(n⋅Kb log(Kb)).**
+
+⟶ Recherche en faisceau - L'algorithme de recherche en faisceau (en anglais beam search) est une technique approximative qui étend les affectations partielles de n variables de facteur de branchement b=|Domain| en explorant les K meilleurs chemins qui s'offrent à chaque étape. La largeur du faisceau K∈{1,...,b^n} détermine le compromis entre efficacité et précision de l'algorithme. Sa complexité en temps est de O(n⋅Kb log(Kb)).
+
+
+
+
+**26. The example below illustrates a possible beam search of parameters K=2, b=3 and n=5.**
+
+⟶ L'exemple ci-dessous illustre une recherche en faisceau de paramètres K=2, b=3 et n=5.
+
+
+
+
+**27. Remark: K=1 corresponds to greedy search whereas K→+∞ is equivalent to BFS tree search.**
+
+⟶ Remarque : K=1 correspond à la recherche gloutonne alors que K→+∞ est équivalent à effectuer un parcours en largeur.
+
+
+
+
+**28. Iterated conditional modes ― Iterated conditional modes (ICM) is an iterative approximate algorithm that modifies the assignment of a factor graph one variable at a time until convergence. At step i, we assign to Xi the value v that maximizes the product of all factors connected to that variable.**
+
+⟶ Modes conditionnels itérés - L'algorithme des modes conditionnels itérés (en anglais iterated conditional modes ou ICM) est une technique itérative et approximative qui modifie l'affectation d'un graphe de facteurs une variable à la fois jusqu'à convergence. À l'étape i, Xi prend la valeur v qui maximise le produit de tous les facteurs connectés à cette variable.
+
+
+
+
+**29. Remark: ICM may get stuck in local minima.**
+
+⟶ Remarque : il est possible qu'ICM reste bloqué dans un minimum local.
+
+
+
+
+**30. [Gibbs sampling ― Gibbs sampling is an iterative approximate method that modifies the assignment of a factor graph one variable at a time until convergence. At step i:, we assign to each element u∈Domaini a weight w(u) that is the product of all factors connected to that variable, we sample v from the probability distribution induced by w and assign it to Xi.]**
+
+⟶ [Échantillonnage de Gibbs - La méthode d'échantillonnage de Gibbs (en anglais Gibbs sampling) est une technique itérative et approximative qui modifie les affectations d'un graphe de facteurs une variable à la fois jusqu'à convergence. À l'étape i :, on assigne à chaque élément u∈Domaini un poids w(u) qui est le produit de tous les facteurs connectés à cette variable, on échantillonne v de la loi de probabilité engendrée par w et on l'associe à Xi.]
+
+
+
+
+**31. Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the advantage of being able to escape local minima in most cases.**
+
+⟶ Remarque : la méthode d'échantillonnage de Gibbs peut être vue comme étant la version probabiliste de ICM. Cette méthode a l'avantage de pouvoir échapper aux éventuels minima locaux dans la plupart des situations.
+
+
+
+
+**32. Factor graph transformations**
+
+⟶ Transformations sur les graphes de facteurs
+
+
+
+
+**33. Independence ― Let A,B be a partitioning of the variables X. We say that A and B are independent if there are no edges between A and B and we write:**
+
+⟶ Indépendance - Soit A, B une partition des variables X. On dit que A et B sont indépendants s'il n'y a pas d'arête connectant A et B et on écrit :
+
+
+
+
+**34. Remark: independence is the key property that allows us to solve subproblems in parallel.**
+
+⟶ Remarque : l'indépendance est une propriété importante car elle nous permet de décomposer la situation en sous-problèmes que l'on peut résoudre en parallèle.
+
+
+
+
+**35. Conditional independence ― We say that A and B are conditionally independent given C if conditioning on C produces a graph in which A and B are independent. In this case, it is written:**
+
+⟶ Indépendance conditionnelle - On dit que A et B sont conditionnellement indépendants sachant C si le fait de conditionner sur C produit un graphe dans lequel A et B sont indépendants. Dans ce cas, on écrit :
+
+
+
+
+**36. [Conditioning ― Conditioning is a transformation aiming at making variables independent that breaks up a factor graph into smaller pieces that can be solved in parallel and can use backtracking. In order to condition on a variable Xi=v, we do as follows:, Consider all factors f1,...,fk that depend on Xi, Remove Xi and f1,...,fk, Add gj(x) for j∈{1,...,k} defined as:]**
+
+⟶ [Conditionnement - Le conditionnement est une transformation visant à rendre des variables indépendantes et ainsi diviser un graphe de facteurs en pièces plus petites qui peuvent être traitées en parallèle et utiliser le retour sur trace. Pour conditionner par rapport à une variable Xi=v, on :, considère tous les facteurs f1,...,fk qui dépendent de Xi, enlève Xi et f1,...,fk, ajoute gj(x) pour j∈{1,...,k} défini par :]
+
+
+
+
+**37. Markov blanket ― Let A⊆X be a subset of variables. We define MarkovBlanket(A) to be the neighbors of A that are not in A.**
+
+⟶ Couverture de Markov - Soit A⊆X une partie des variables. On définit MarkovBlanket(A) comme étant les voisins de A qui ne sont pas dans A.
+
+
+
+
+**38. Proposition ― Let C=MarkovBlanket(A) and B=X∖(A∪C). Then we have:**
+
+⟶ Proposition - Soit C=MarkovBlanket(A) et B=X∖(A∪C). On a alors :
+
+
+
+
+**39. [Elimination ― Elimination is a factor graph transformation that removes Xi from the graph and solves a small subproblem conditioned on its Markov blanket as follows:, Consider all factors fi,1,...,fi,k that depend on Xi, Remove Xi
+and fi,1,...,fi,k, Add fnew,i(x) defined as:]**
+
+⟶ [Élimination - L'élimination est une transformation consistant à enlever Xi d'un graphe de facteurs pour ensuite résoudre un sous-problème conditionné sur sa couverture de Markov où l'on :, considère tous les facteurs fi,1,...,fi,k qui dépendent de Xi, enlève Xi et fi,1,...,fi,k, ajoute fnew,i(x) défini par :]
+
+
+
+
+**40. Treewidth ― The treewidth of a factor graph is the maximum arity of any factor created by variable elimination with the best variable ordering. In other words,**
+
+⟶ Largeur arborescente - La largeur arborescente (en anglais treewidth) d'un graphe de facteurs est l'arité maximum de n'importe quel facteur créé par élimination avec le meilleur ordre de variable. En d'autres termes,
+
+
+
+
+**41. The example below illustrates the case of a factor graph of treewidth 3.**
+
+⟶ L'exemple ci-dessous illustre le cas d'un graphe de facteurs ayant une largeur arborescente égale à 3.
+
+
+
+
+**42. Remark: finding the best variable ordering is an NP-hard problem.**
+
+⟶ Remarque : trouver le meilleur ordre de variable est un problème NP-difficile.
+
+
+
+
+**43. Bayesian networks**
+
+⟶ Réseaux bayésiens
+
+
+
+
+**44. In this section, our goal will be to compute conditional probabilities. What is the probability of a query given evidence?**
+
+⟶ Dans cette section, notre but est de calculer des probabilités conditionnelles. Quelle est la probabilité d'un événement étant donné des observations ?
+
+
+
+
+**45. Introduction**
+
+⟶ Introduction
+
+
+
+
+**46. Explaining away ― Suppose causes C1 and C2 influence an effect E. Conditioning on the effect E and on one of the causes (say C1) changes the probability of the other cause (say C2). In this case, we say that C1 has explained away C2.**
+
+⟶ Explication - Supposons que les causes C1 et C2 influencent un effet E. Le conditionnement sur l'effet E et une des causes (disons C1) change la probabilité de l'autre cause (disons C2). Dans ce cas, on dit que C1 a expliqué C2.
+
+
+
+
+**47. Directed acyclic graph ― A directed acyclic graph (DAG) is a finite directed graph with no directed cycles.**
+
+⟶ Graphe orienté acyclique - Un graphe orienté acyclique (en anglais directed acyclic graph ou DAG) est un graphe orienté fini sans cycle orienté.
+
+
+
+
+**48. Bayesian network ― A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over random variables X=(X1,...,Xn) as a product of local conditional distributions, one for each node:**
+
+⟶ Réseau bayésien - Un réseau bayésien (en anglais Bayesian network) est un DAG qui définit une loi de probabilité jointe sur les variables aléatoires X=(X1,...,Xn) comme étant le produit des lois de probabilités conditionnelles locales (une pour chaque noeud) :
+
+
+
+
+**49. Remark: Bayesian networks are factor graphs imbued with the language of probability.**
+
+⟶ Remarque : les réseaux bayésiens sont des graphes de facteurs imprégnés de concepts de probabilité.
+
+
+
+
+**50. Locally normalized ― For each xParents(i), all factors are local conditional distributions. Hence they have to satisfy:**
+
+⟶ Normalisation locale - Pour chaque xParents(i), tous les facteurs sont localement des lois de probabilité conditionnelles. Elles doivent donc vérifier :
+
+
+
+
+**51. As a result, sub-Bayesian networks and conditional distributions are consistent.**
+
+⟶ De ce fait, les sous-réseaux bayésiens et les distributions conditionnelles sont consistants.
+
+
+
+
+**52. Remark: local conditional distributions are the true conditional distributions.**
+
+⟶ Remarque : les lois locales de probabilité conditionnelles sont les vraies lois de probabilité conditionnelles.
+
+
+
+
+**53. Marginalization ― The marginalization of a leaf node yields a Bayesian network without that node.**
+
+⟶ Marginalisation - La marginalisation d'un noeud sans enfant entraine un réseau bayésien sans ce noeud.
+
+
+
+
+**54. Probabilistic programs**
+
+⟶ Programmes probabilistes
+
+
+
+
+**55. Concept ― A probabilistic program randomizes variables assignment. That way, we can write down complex Bayesian networks that generate assignments without us having to explicitly specify associated probabilities.**
+
+⟶ Concept - Un programme probabiliste rend aléatoire l'affectation de variables. De ce fait, on peut imaginer des réseaux bayésiens compliqués pour la génération d'affectations sans avoir à écrire de manière explicite les probabilités associées.
+
+
+
+
+**56. Remark: examples of probabilistic programs include Hidden Markov model (HMM), factorial HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms and stochastic block models.**
+
+⟶ Remarque : quelques exemples de programmes probabilistes incluent parmi d'autres le modèle de Markov caché (en anglais hidden Markov model ou HMM), HMM factoriel, le modèle bayésien naïf (en anglais naive Bayes), l'allocation de Dirichlet latente (en anglais latent Dirichlet allocation ou LDA), le modèle des maladies et symptômes (en anglais diseases and symptoms) et le modèle à blocs stochastiques (en anglais stochastic block model).
+
+
+
+
+**57. Summary ― The table below summarizes the common probabilistic programs as well as their applications:**
+
+⟶ Récapitulatif - La table ci-dessous résume les programmes probabilistes les plus fréquents ainsi que leur champ d'application associé :
+
+
+
+
+**58. [Program, Algorithm, Illustration, Example]**
+
+⟶ [Programme, Algorithme, Illustration, Exemple]
+
+
+
+
+**59. [Markov Model, Hidden Markov Model (HMM), Factorial HMM, Naive Bayes, Latent Dirichlet Allocation (LDA)]**
+
+⟶ [Modèle de Markov, Modèle de Markov caché (HMM), HMM factoriel, Bayésien naïf, Allocation de Dirichlet latente (LDA)]
+
+
+
+
+**60. [Generate, distribution]**
+
+⟶ [Génère, distribution]
+
+
+
+
+**61. [Language modeling, Object tracking, Multiple object tracking, Document classification, Topic modeling]**
+
+⟶ [Modélisation du langage, Suivi d'objet, Suivi de plusieurs objets, Classification de document, Modélisation de sujet]
+
+
+
+
+**62. Inference**
+
+⟶ Inférence
+
+
+
+
+**63. [General probabilistic inference strategy ― The strategy to compute the probability P(Q|E=e) of query Q given evidence E=e is as follows:, Step 1: Remove variables that are not ancestors of the query Q or the evidence E by marginalization, Step 2: Convert Bayesian network to factor graph, Step 3: Condition on the evidence E=e, Step 4: Remove nodes disconnected from the query Q by marginalization, Step 5: Run a probabilistic inference algorithm (manual, variable elimination, Gibbs sampling, particle filtering)]**
+
+⟶ [Stratégie générale pour l'inférence probabiliste - La stratégie que l'on utilise pour calculer la probabilité P(Q|E=e) d'une requête Q étant donnée l'observation E=e est la suivante :, Étape 1 : on enlève les variables qui ne sont pas les ancêtres de la requête Q ou de l'observation E par marginalisation, Étape 2 : on convertit le réseau bayésien en un graphe de facteurs, Étape 3 : on conditionne sur l'observation E=e, Étape 4 : on enlève les noeuds déconnectés de la requête Q par marginalisation, Étape 5 : on lance un algorithme d'inférence probabiliste (manuel, élimination de variables, échantillonnage de Gibbs, filtrage particulaire)]
+
+
+
+
+**64. Forward-backward algorithm ― This algorithm computes the exact value of P(H=hk|E=e) (smoothing query) for any k∈{1,...,L} in the case of an HMM of size L. To do so, we proceed in 3 steps:**
+
+⟶ Algorithme progressif-rétrogressif - L'algorithme progressif-rétrogressif (en anglais forward-backward) calcule la valeur exacte de P(H=hk|E=e) pour chaque k∈{1,...,L} dans le cas d'un HMM de taille L. Pour ce faire, on procède en 3 étapes :
+
+
+
+
+**65. Step 1: for ..., compute ...**
+
+⟶ Étape 1 : pour ..., calculer ...
+
+
+
+
+**66. with the convention F0=BL+1=1. From this procedure and these notations, we get that**
+
+⟶ avec la convention F0=BL+1=1. À partir de cette procédure et avec ces notations, on obtient
+
+
+
+
+**67. Remark: this algorithm interprets each assignment to be a path where each edge hi−1→hi is of weight p(hi|hi−1)p(ei|hi).**
+
+⟶ Remarque : cet algorithme interprète une affectation comme étant un chemin où chaque arête hi−1→hi a un poids p(hi|hi−1)p(ei|hi).
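The path-weight view in item 67 can be sketched in code. This is a minimal illustration, not the cheatsheet's own implementation: the recursions are the standard forward and backward message updates, and the names `start`, `trans`, `emit` and the toy HMM values are assumptions made for the example (the formulas of items 65 and 66 are elided in this template). The initial distribution `start` stands in for the convention F0=1 on the first edge.

```python
def forward_backward(start, trans, emit, evidence):
    """Return the smoothing marginals P(H_k = h | E = e) for every position k."""
    states = list(start)
    L = len(evidence)

    # Forward messages: F[k][h] is the total weight of all partial paths ending in h at step k.
    F = [{} for _ in range(L)]
    for h in states:
        F[0][h] = start[h] * emit[h][evidence[0]]
    for k in range(1, L):
        for h in states:
            F[k][h] = sum(F[k - 1][g] * trans[g][h] for g in states) * emit[h][evidence[k]]

    # Backward messages: B[k][h] is the total weight of all partial paths leaving h at step k.
    # The last message is 1, matching the convention B_{L+1} = 1.
    B = [{h: 1.0 for h in states} for _ in range(L)]
    for k in range(L - 2, -1, -1):
        for h in states:
            B[k][h] = sum(trans[h][g] * emit[g][evidence[k + 1]] * B[k + 1][g] for g in states)

    # Smoothing: multiply the two messages and normalize at each position.
    marginals = []
    for k in range(L):
        weights = {h: F[k][h] * B[k][h] for h in states}
        total = sum(weights.values())
        marginals.append({h: w / total for h, w in weights.items()})
    return marginals


# Hypothetical two-state HMM used only for illustration.
start = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
emit = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}
smoothed = forward_backward(start, trans, emit, ["x", "y", "x"])
```

Each entry of `smoothed` is a distribution over hidden states at one position, so its values sum to 1.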
+
+
+
+
+**68. [Gibbs sampling ― This algorithm is an iterative approximate method that uses a small set of assignments (particles) to represent a large probability distribution. From a random assignment x, Gibbs sampling performs the following steps for i∈{1,...,n} until convergence:, For all u∈Domaini, compute the weight w(u) of assignment x where Xi=u, Sample v from the probability distribution induced by w: v∼P(Xi=v|X−i=x−i), Set Xi=v]**
+
+⟶ [Échantillonnage de Gibbs - L'algorithme d'échantillonnage de Gibbs (en anglais Gibbs sampling) est une méthode itérative et approximative qui utilise un petit ensemble d'affectations (particules) pour représenter une loi de probabilité. Pour une affectation aléatoire x, l'échantillonnage de Gibbs effectue les étapes suivantes pour i∈{1,...,n} jusqu'à convergence :, Pour tout u∈Domaini, on calcule le poids w(u) de l'affectation x où Xi=u, On échantillonne v de la loi de probabilité engendrée par w : v∼P(Xi=v|X−i=x−i), On pose Xi=v]
+
+
+
+
+**69. Remark: X−i denotes X∖{Xi} and x−i represents the corresponding assignment.**
+
+⟶ Remarque : X−i veut dire X∖{Xi} et x−i représente l'affectation correspondante.
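The three Gibbs sampling steps of item 68 can be sketched as follows. This is a hedged illustration under stated assumptions: `weight` is a hypothetical function returning the weight of a full assignment, the toy factor `agreement_weight` is invented for the example, and a fixed number of sweeps replaces the "until convergence" criterion.

```python
import random


def gibbs_sampling(weight, domains, num_sweeps=1000, seed=0):
    """Estimate the distribution over assignments by repeatedly resampling one variable."""
    rng = random.Random(seed)
    x = [rng.choice(domain) for domain in domains]  # random initial assignment
    counts = {}
    for _ in range(num_sweeps):
        for i, domain in enumerate(domains):
            # Weight of the assignment x with X_i set to each candidate value u.
            w = [weight(x[:i] + [u] + x[i + 1:]) for u in domain]
            # Sample v from the distribution induced by w, then set X_i = v.
            x[i] = rng.choices(domain, weights=w)[0]
        key = tuple(x)
        counts[key] = counts.get(key, 0) + 1
    return {assignment: c / num_sweeps for assignment, c in counts.items()}


def agreement_weight(assignment):
    # Hypothetical factor: two binary variables that prefer to agree.
    return 2.0 if assignment[0] == assignment[1] else 1.0


distribution = gibbs_sampling(agreement_weight, [[0, 1], [0, 1]], num_sweeps=5000)
```

With this toy factor, agreeing assignments have exact probability 2/3 in total, so the empirical frequencies in `distribution` should lean towards `(0, 0)` and `(1, 1)`. The sketch assumes at least one candidate value has positive weight at every step.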
+
+
+
+
+**70. [Particle filtering ― This algorithm approximates the posterior density of state variables given the evidence of observation variables by keeping track of K particles at a time. Starting from a set of particles C of size K, we run the following 3 steps iteratively:, Step 1: proposal - For each old particle xt−1∈C, sample x from the transition probability distribution p(x|xt−1) and add x to a set C′., Step 2: weighting - Weigh each x of the set C′ by w(x)=p(et|x), where et is the evidence observed at time t., Step 3: resampling - Sample K elements from the set C′ using the probability distribution induced by w and store them in C: these are the current particles xt.]**
+
+⟶ [Filtrage particulaire - L'algorithme de filtrage particulaire (en anglais particle filtering) approxime la densité postérieure de variables d'états à partir des variables observées en suivant K particules à la fois. En commençant avec un ensemble de particules C de taille K, on répète les 3 étapes suivantes :, Étape 1 : proposition - Pour chaque particule xt−1∈C, on échantillonne x avec loi de probabilité p(x|xt−1) et on ajoute x à un ensemble C′., Étape 2 : pondération - On associe chaque x de l'ensemble C′ au poids w(x)=p(et|x), où et est l'observation vue à l'instant t., Étape 3 : ré-échantillonnage - On échantillonne K éléments de l'ensemble C′ en utilisant la loi de probabilité engendrée par w et on les met dans C : ce sont les particules actuelles xt.]
+
+
+
+
+**71. Remark: a more expensive version of this algorithm also keeps track of past particles in the proposal step.**
+
+⟶ Remarque : une version plus coûteuse de cet algorithme tient aussi compte des particules passées à l'étape de proposition.
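One iteration of the proposal, weighting and resampling steps from item 70 might look like the sketch below. The transition model `random_walk` and the observation weight `sensor_weight` are hypothetical stand-ins for p(x|xt−1) and p(et|x), chosen only so the example runs; they are not part of the cheatsheet.

```python
import random


def particle_filter_step(particles, sample_transition, emission_weight, evidence, rng):
    """Run one proposal / weighting / resampling iteration and return K new particles."""
    # Step 1: proposal, one successor sampled from the transition model per old particle.
    proposed = [sample_transition(x_prev, rng) for x_prev in particles]
    # Step 2: weighting, each proposed particle is scored against the current evidence.
    weights = [emission_weight(evidence, x) for x in proposed]
    # Step 3: resampling, draw K particles with probability proportional to their weight.
    return rng.choices(proposed, weights=weights, k=len(particles))


def random_walk(x_prev, rng):
    # Hypothetical transition: move one cell left, stay, or move one cell right.
    return x_prev + rng.choice([-1, 0, 1])


def sensor_weight(evidence, x):
    # Hypothetical observation weight that decays with the distance to the reading.
    return 0.5 ** abs(evidence - x)


rng = random.Random(0)
particles = [0] * 1000
particles = particle_filter_step(particles, random_walk, sensor_weight, evidence=1, rng=rng)
```

After the step, the particle set still has size K, and positions close to the observed value are over-represented. The weights in this example are strictly positive, which `random.choices` requires in total.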
+
+
+
+
+**72. Maximum likelihood ― If we don't know the local conditional distributions, we can learn them using maximum likelihood.**
+
+⟶ Maximum de vraisemblance - Si l'on ne connaît pas les lois de probabilité locales, on peut les trouver en utilisant le maximum de vraisemblance.
+
+
+
+
+**73. Laplace smoothing ― For each distribution d and partial assignment (xParents(i),xi), add λ to countd(xParents(i),xi), then normalize to get probability estimates.**
+
+⟶ Lissage de Laplace - Pour chaque loi de probabilité d et affectation partielle (xParents(i),xi), on ajoute λ à countd(xParents(i),xi) et on normalise ensuite pour obtenir des probabilités.
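A small sketch of the add-λ counting described in item 73, assuming the data is a list of (parent value, child value) pairs for a single local distribution. The function name `laplace_estimates` and the toy observations are illustrative assumptions.

```python
from collections import Counter


def laplace_estimates(observations, domain, lam=1.0):
    """Estimate p(x | parents) from (parents, x) pairs with add-lambda smoothing."""
    counts = Counter(observations)
    parent_values = {parents for parents, _ in observations}
    estimates = {}
    for parents in parent_values:
        # Add lambda to every count, including values never observed with these parents.
        smoothed = {x: counts[(parents, x)] + lam for x in domain}
        total = sum(smoothed.values())
        # Normalize so that the estimates for each parent value sum to 1.
        estimates[parents] = {x: c / total for x, c in smoothed.items()}
    return estimates


observations = [("rain", "wet"), ("rain", "wet"), ("rain", "dry"), ("sun", "dry")]
probs = laplace_estimates(observations, domain=["wet", "dry"], lam=1.0)
# probs["rain"]["wet"] is (2 + 1) / (3 + 2) = 0.6
# probs["sun"]["wet"] is (0 + 1) / (1 + 2), non-zero although never observed
```

With λ=0 the function falls back to plain maximum likelihood counts, so an unseen pair such as ("sun", "wet") would get probability 0.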
+
+
+
+
+**74. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶ Espérance-maximisation - L'algorithme d'espérance-maximisation (en anglais expectation-maximization ou EM) est une méthode efficace utilisée pour estimer le paramètre θ via l'estimation du maximum de vraisemblance en construisant de manière répétée une borne inférieure de la vraisemblance (étape E) et en optimisant cette borne inférieure (étape M) :
+
+
+
+
+**75. [E-step: Evaluate the posterior probability q(h) that each data point e came from a particular cluster h as follows:, M-step: Use the posterior probabilities q(h) as cluster specific weights on data points e to determine θ through maximum likelihood.]**
+
+⟶ [Étape E : on évalue la probabilité postérieure q(h) que chaque point e vienne d'une partition particulière h avec :, Étape M : on utilise la probabilité postérieure q(h) en tant que poids de la partition h sur les points e pour déterminer θ via maximum de vraisemblance]
+
+
+
+
+**76. [Factor graphs, Arity, Assignment weight, Constraint satisfaction problem, Consistent assignment]**
+
+⟶ [Graphe de facteurs, Arité, Poids, Satisfaction de contraintes, Affectation consistante]
+
+
+
+
+**77. [Dynamic ordering, Dependent factors, Backtracking search, Forward checking, Most constrained variable, Least constrained value]**
+
+⟶ [Mise en ordre dynamique, Facteurs dépendants, Retour sur trace, Vérification en avant, Variable la plus contrainte, Valeur la moins contraignante]
+
+
+
+
+**78. [Approximate methods, Beam search, Iterated conditional modes, Gibbs sampling]**
+
+⟶ [Méthodes approximatives, Recherche en faisceau, Modes conditionnels itérés, Échantillonnage de Gibbs]
+
+
+
+
+**79. [Factor graph transformations, Conditioning, Elimination]**
+
+⟶ [Transformations de graphes de facteurs, Conditionnement, Élimination]
+
+
+
+
+**80. [Bayesian networks, Definition, Locally normalized, Marginalization]**
+
+⟶ [Réseaux bayésiens, Définition, Normalisé localement, Marginalisation]
+
+
+
+
+**81. [Probabilistic program, Concept, Summary]**
+
+⟶ [Programme probabiliste, Concept, Récapitulatif]
+
+
+
+
+**82. [Inference, Forward-backward algorithm, Gibbs sampling, Laplace smoothing]**
+
+⟶ [Inférence, Algorithme progressif-rétrogressif, Échantillonnage de Gibbs, Lissage de Laplace]
+
+
+
+
+**83. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub.
+
+
+
+
+**84. Original authors**
+
+⟶ Auteurs d'origine.
+
+
+
+
+**85. Translated by X, Y and Z**
+
+⟶ Traduit de l'anglais par X, Y et Z.
+
+
+
+
+**86. Reviewed by X, Y and Z**
+
+⟶ Revu par X, Y et Z.
+
+
+
+
+**87. By X and Y**
+
+⟶ De X et Y.
+
+
+
+
+**88. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'intelligence artificielle sont maintenant disponibles en français !
diff --git a/fr/cheatsheet-deep-learning.md b/fr/cs-229-deep-learning.md
similarity index 95%
rename from fr/cheatsheet-deep-learning.md
rename to fr/cs-229-deep-learning.md
index 4045d723c..56073a5e8 100644
--- a/fr/cheatsheet-deep-learning.md
+++ b/fr/cs-229-deep-learning.md
@@ -120,7 +120,7 @@
**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-⟶ Pré-requis de la couche convolutionelle ― Si l'on note W la taille du volume d'entrée, F la taille de la couche de neurones convolutionelle, P la quantité de zero padding, alors le nombre de neurones N qui tient dans un volume donné est tel que :
+⟶ Pré-requis de la couche convolutionnelle ― Si l'on note W la taille du volume d'entrée, F la taille de la couche de neurones convolutionnelle, P la quantité de zero padding, alors le nombre de neurones N qui tient dans un volume donné est tel que :
@@ -132,7 +132,7 @@
**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-⟶ Cela est normalement effectué après une couche fully-connected/couche convolutionelle et avant une couche de non-linéarité et a pour but de permettre un taux d'apprentissage plus grand et de réduire une dépendance trop forte à l'initialisation.
+⟶ Cela est normalement effectué après une couche fully-connected/couche convolutionnelle et avant une couche de non-linéarité et a pour but de permettre un taux d'apprentissage plus grand et de réduire une dépendance trop forte à l'initialisation.
diff --git a/fr/refresher-linear-algebra.md b/fr/cs-229-linear-algebra.md
similarity index 92%
rename from fr/refresher-linear-algebra.md
rename to fr/cs-229-linear-algebra.md
index 37329faa3..f1aea7efd 100644
--- a/fr/refresher-linear-algebra.md
+++ b/fr/cs-229-linear-algebra.md
@@ -42,7 +42,7 @@
**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
-⟶ Matrice identitée ― La matrice identitée I∈Rn×n est une matrice carrée avec des 1 sur sa diagonale et des 0 partout ailleurs :
+⟶ Matrice identité ― La matrice identité I∈Rn×n est une matrice carrée avec des 1 sur sa diagonale et des 0 partout ailleurs :
@@ -150,7 +150,7 @@
**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
-⟶ Trace ― La trace d'une matrice carée A, notée tr(A), est définie comme la somme de ses coefficients diagonaux:
+⟶ Trace ― La trace d'une matrice carrée A, notée tr(A), est définie comme la somme de ses coefficients diagonaux:
@@ -186,7 +186,7 @@
**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
-⟶ Décomposition symmétrique ― Une matrice donnée A peut être exprimée en termes de ses parties symétrique et antisymétrique de la manière suivante :
+⟶ Décomposition symétrique ― Une matrice donnée A peut être exprimée en termes de ses parties symétrique et antisymétrique de la manière suivante :
@@ -252,7 +252,7 @@
**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
-⟶ Remarque : de manière similaire, une matrice A est dite définie positive et est notée A≻0 si elle est semi-définie positive et que pour tout vector x non-nul, on a xTAx>0.
+⟶ Remarque : de manière similaire, une matrice A est dite définie positive et est notée A≻0 si elle est semi-définie positive et que pour tout vecteur x non-nul, on a xTAx>0.
@@ -264,7 +264,7 @@
**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-⟶ Théorème spectral ― Soit A∈Rn×n. Si A est symmétrique, alors A est diagonalisable par une matrice orthogonale réelle U∈Rn×n. En notant Λ=diag(λ1,...,λn), on a :
+⟶ Théorème spectral ― Soit A∈Rn×n. Si A est symétrique, alors A est diagonalisable par une matrice orthogonale réelle U∈Rn×n. En notant Λ=diag(λ1,...,λn), on a :
@@ -300,7 +300,7 @@
**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
-⟶ Hessienne ― Soit f:Rn→R une fonction et x∈Rn un vecteur. La hessienne de f par rapport à x est une matrice symmetrique n×n, notée ∇2xf(x), telle que :
+⟶ Hessienne ― Soit f:Rn→R une fonction et x∈Rn un vecteur. La hessienne de f par rapport à x est une matrice symétrique n×n, notée ∇2xf(x), telle que :
diff --git a/fr/cheatsheet-machine-learning-tips-and-tricks.md b/fr/cs-229-machine-learning-tips-and-tricks.md
similarity index 99%
rename from fr/cheatsheet-machine-learning-tips-and-tricks.md
rename to fr/cs-229-machine-learning-tips-and-tricks.md
index d74182df0..2adf1db50 100644
--- a/fr/cheatsheet-machine-learning-tips-and-tricks.md
+++ b/fr/cs-229-machine-learning-tips-and-tricks.md
@@ -198,7 +198,7 @@
**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-⟶ [Réduit les coefficients à 0, Bon pour la sélection de variables, Rend les coefficients plus petits, Compromis entre la selection de variables et la réduction de coefficients]
+⟶ [Réduit les coefficients à 0, Bon pour la sélection de variables, Rend les coefficients plus petits, Compromis entre la sélection de variables et la réduction de coefficients]
diff --git a/fr/refresher-probability.md b/fr/cs-229-probability.md
similarity index 98%
rename from fr/refresher-probability.md
rename to fr/cs-229-probability.md
index fe4562f80..8e407b9b2 100644
--- a/fr/refresher-probability.md
+++ b/fr/cs-229-probability.md
@@ -36,7 +36,7 @@
**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-⟶ Axiome 2 ― La probabilité qu'au moins un des évènements élementaires de tout l'univers se produise est 1, i.e.
+⟶ Axiome 2 ― La probabilité qu'au moins un des évènements élémentaires de tout l'univers se produise est 1, i.e.
@@ -120,7 +120,7 @@
**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
-⟶ Variable aléatoire ― Une variable aléatoire, souvent notée X, est une fonction qui associe chaque élement de l'univers de probabilité à la droite des réels.
+⟶ Variable aléatoire ― Une variable aléatoire, souvent notée X, est une fonction qui associe chaque élément de l'univers de probabilité à la droite des réels.
diff --git a/fr/cheatsheet-supervised-learning.md b/fr/cs-229-supervised-learning.md
similarity index 96%
rename from fr/cheatsheet-supervised-learning.md
rename to fr/cs-229-supervised-learning.md
index 2f4850d1f..b79583323 100644
--- a/fr/cheatsheet-supervised-learning.md
+++ b/fr/cs-229-supervised-learning.md
@@ -42,7 +42,7 @@
**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-⟶ [Modèle discriminatif, Modèle génératif, But, Ce qui est appris, Illustration, Exemples]
+⟶ [Modèle discriminant, Modèle génératif, But, Ce qui est appris, Illustration, Exemples]
@@ -66,7 +66,7 @@
**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-⟶ Fonction de loss ― Une fonction de loss est une fonction L:(z,y)∈R×Y⟼L(z,y)∈R prennant comme entrée une valeur prédite z correspondant à une valeur réelle y, et nous renseigne sur la ressemblance de ces deux valeurs. Les fonctions de loss courantes sont récapitulées dans le tableau ci-dessous :
+⟶ Fonction de loss ― Une fonction de loss est une fonction L:(z,y)∈R×Y⟼L(z,y)∈R prenant comme entrée une valeur prédite z correspondant à une valeur réelle y, et nous renseigne sur la ressemblance de ces deux valeurs. Les fonctions de loss courantes sont récapitulées dans le tableau ci-dessous :
@@ -138,7 +138,7 @@
**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**
-⟶ Équations normales ― En notant X la matrice de design, la valeur de θ qui minimize la fonction de cost a une solution de forme fermée tel que :
+⟶ Équations normales ― En notant X la matrice de design, la valeur de θ qui minimise la fonction de cost a une solution de forme fermée telle que :
@@ -186,7 +186,7 @@
**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-⟶ Régression softmax ― Une régression softmax, aussi appelée un régression logistique multiclasse, est utilisée pour généraliser la régression logistique lorsqu'il y a plus de 2 classes à prédire. Par convention, on fixe θK=0, ce qui oblige le paramètre de Bernoulli ϕi de chaque classe i à être égal à :
+⟶ Régression softmax ― Une régression softmax, aussi appelée une régression logistique multi-classe, est utilisée pour généraliser la régression logistique lorsqu'il y a plus de 2 classes à prédire. Par convention, on fixe θK=0, ce qui oblige le paramètre de Bernoulli ϕi de chaque classe i à être égal à :
@@ -210,7 +210,7 @@
**36. Here are the most common exponential distributions summed up in the following table:**
-⟶ Les distributions exponentielles les plus communémment rencontrées sont récapitulées dans le tableau ci-dessous :
+⟶ Les distributions exponentielles les plus communément rencontrées sont récapitulées dans le tableau ci-dessous :
@@ -324,7 +324,7 @@
**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-⟶ Un modèle génératif essaie d'abord d'apprendre comment les données sont générées en estimant P(x|y), nous permettant ensuite d'estimer P(y|x) par le biais du théorème de Bayes.
+⟶ Un modèle génératif essaie d'abord d'apprendre comment les données sont générées en estimant P(x|y), nous permettant ensuite d'estimer P(y|x) par le biais du théorème de Bayes.
diff --git a/fr/cheatsheet-unsupervised-learning.md b/fr/cs-229-unsupervised-learning.md
similarity index 95%
rename from fr/cheatsheet-unsupervised-learning.md
rename to fr/cs-229-unsupervised-learning.md
index f64268a4b..7757f9539 100644
--- a/fr/cheatsheet-unsupervised-learning.md
+++ b/fr/cs-229-unsupervised-learning.md
@@ -12,7 +12,7 @@
**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-⟶ Motivation ― Le but de l'apprentissage non-supervisé est de trouver des formes cachées dans un jeu de données non-labelées {x(1),...,x(m)}.
+⟶ Motivation ― Le but de l'apprentissage non-supervisé est de trouver des formes cachées dans un jeu de données non annotées {x(1),...,x(m)}.
@@ -66,7 +66,7 @@
**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-⟶ M-step : Utiliser les probabilités postérieures Qi(z(i)) en tant que coefficients propres aux partitions sur les points x(i) pour ré-estimer séparemment chaque modèle de partition de la manière suivante :
+⟶ M-step : Utiliser les probabilités postérieures Qi(z(i)) en tant que coefficients propres aux partitions sur les points x(i) pour ré-estimer séparément chaque modèle de partition de la manière suivante :
@@ -102,7 +102,7 @@
**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-⟶ Fonction de distortion ― Pour voir si l'algorithme converge, on regarde la fonction de distortion définie de la manière suivante :
+⟶ Fonction de distorsion ― Pour voir si l'algorithme converge, on regarde la fonction de distorsion définie de la manière suivante :
@@ -192,7 +192,7 @@
**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-⟶ Théorème spectral ― Soit A∈Rn×n. Si A est symmétrique, alors A est diagonalisable par une matrice réelle orthogonale U∈Rn×n. En notant Λ=diag(λ1,...,λn), on a :
+⟶ Théorème spectral ― Soit A∈Rn×n. Si A est symétrique, alors A est diagonalisable par une matrice réelle orthogonale U∈Rn×n. En notant Λ=diag(λ1,...,λn), on a :
@@ -222,7 +222,7 @@
**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-⟶ Étape 2 : Calculer Σ=1mm∑i=1x(i)x(i)T∈Rn×n, qui est symmétrique et aux valeurs propres réelles.
+⟶ Étape 2 : Calculer Σ=1mm∑i=1x(i)x(i)T∈Rn×n, qui est symétrique et aux valeurs propres réelles.
@@ -264,7 +264,7 @@
**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-⟶ Hypothèses ― On suppose que nos données x ont été générées par un vecteur source à n dimensions s=(s1,...,sn), où les si sont des variables aléatoires indépendantes, par le biais d'une matrice de mélange et inversible A de la manière suivante :
+⟶ Hypothèses ― On suppose que nos données x ont été générées par un vecteur source à n dimensions s=(s1,...,sn), où les si sont des variables aléatoires indépendantes, par le biais d'une matrice de mélange et inversible A de la manière suivante :
@@ -294,4 +294,4 @@
**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-⟶ Par conséquent, l'algorithme du gradient stochastique est tel que pour chaque example de ensemble d'apprentissage x(i), on met à jour W de la manière suivante :
+⟶ Par conséquent, l'algorithme du gradient stochastique est tel que pour chaque exemple de l'ensemble d'apprentissage x(i), on met à jour W de la manière suivante :
diff --git a/fr/cs-230-convolutional-neural-networks.md b/fr/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..29cca030e
--- /dev/null
+++ b/fr/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,716 @@
+**Convolutional Neural Networks translation**
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶ Pense-bête de réseaux de neurones convolutionnels
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Apprentissage profond
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶ [Vue d'ensemble, Structure de l'architecture]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶ [Types de couche, Convolution, Pooling, Fully connected]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶ [Paramètres du filtre, Dimensions, Stride, Padding]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶ [Réglage des paramètres, Compatibilité des paramètres, Complexité du modèle, Champ récepteur]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶ [Fonction d'activation, Unité linéaire rectifiée, Softmax]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶ [Détection d'objet, Types de modèle, Détection, Intersection sur union, Suppression non-max, YOLO, R-CNN]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶ [Vérification/reconnaissance de visage, Apprentissage par coup, Réseau siamois, Loss triple]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶ [Transfert de style de neurones, Activation, Matrice de style, Fonction de coût de style/contenu]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶ [Architectures à astuces calculatoires, Generative Adversarial Net, ResNet, Inception Network]
+
+
+
+
+**12. Overview**
+
+⟶ Vue d'ensemble
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶ Architecture d'un CNN traditionnel ― Les réseaux de neurones convolutionnels (en anglais Convolutional neural networks), aussi connus sous le nom de CNNs, sont un type spécifique de réseaux de neurones qui sont généralement composés des couches suivantes :
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶ La couche convolutionnelle et la couche de pooling peuvent être ajustées en utilisant des paramètres qui sont décrits dans les sections suivantes.
+
+
+
+
+**15. Types of layer**
+
+⟶ Types de couche
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶ Couche convolutionnelle (CONV) ― La couche convolutionnelle (en anglais convolution layer) (CONV) utilise des filtres qui scannent l'entrée I suivant ses dimensions en effectuant des opérations de convolution. Elle peut être réglée en ajustant la taille du filtre F et le stride S. La sortie O de cette opération est appelée *feature map* ou aussi *activation map*.
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶ Remarque : l'étape de convolution peut aussi être généralisée dans les cas 1D et 3D.
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶ Pooling (POOL) ― La couche de pooling (en anglais pooling layer) (POOL) est une opération de sous-échantillonnage typiquement appliquée après une couche convolutionnelle, qui permet une certaine invariance spatiale. En particulier, les types de pooling les plus populaires sont le max et l'average pooling, où les valeurs maximales et moyennes sont prises, respectivement.
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶ [Type, But, Illustration, Commentaires]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶ [Max pooling, Average pooling, Chaque opération de pooling sélectionne la valeur maximale de la surface, Chaque opération de pooling calcule la moyenne des valeurs de la surface]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶ [Garde les caractéristiques détectées, Plus communément utilisé, Sous-échantillonne la feature map, Utilisé dans LeNet]
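The difference between the two pooling types in items 20 and 21 can be illustrated with a short sketch. It assumes a single-channel feature map stored as nested lists and a square window; the helper names `pool2d` and `mean` are hypothetical.

```python
def pool2d(feature_map, size, stride, reduce_fn):
    """Slide a size x size window over a 2D feature map and reduce each view."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values of the current view.
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 11, 10, 12],
    [13, 15, 14, 16],
]
max_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=max)   # [[7, 8], [15, 16]]
avg_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=mean)  # [[4.0, 5.0], [12.0, 13.0]]
```

Both variants halve each spatial dimension here; max pooling keeps the strongest activation of each view, while average pooling smooths it.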
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶ Fully Connected (FC) ― La couche de fully connected (en anglais fully connected layer) (FC) s'applique sur une entrée préalablement aplatie où chaque entrée est connectée à tous les neurones. Les couches de fully connected sont typiquement présentes à la fin des architectures de CNN et peuvent être utilisées pour optimiser des objectifs tels que les scores de classe.
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶ Paramètres du filtre
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.**
+
+⟶ La couche convolutionnelle contient des filtres pour lesquels il est important de savoir comment ajuster ses paramètres.
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶ Dimensions d'un filtre ― Un filtre de taille F×F appliqué à une entrée contenant C canaux est un volume de taille F×F×C qui effectue des convolutions sur une entrée de taille I×I×C et qui produit une feature map de sortie (aussi appelée activation map) de taille O×O×1.
+
+
+
+
+**26. Filter**
+
+⟶ Filtre
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶ Remarque : appliquer K filtres de taille F×F engendre une feature map de sortie de taille O×O×K.
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶ Stride ― Dans le contexte d'une opération de convolution ou de pooling, la stride S est un paramètre qui dénote le nombre de pixels par lesquels la fenêtre se déplace après chaque opération.
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶ Zero-padding ― Le zero-padding est une technique consistant à ajouter P zéros de chaque côté des frontières de l'entrée. Cette valeur peut être spécifiée soit manuellement, soit automatiquement par le biais d'une des trois configurations détaillées ci-dessous :
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶ [Configuration, Valeur, Illustration, But, Valide, Pareil, Total]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶ [Pas de padding, Enlève la dernière opération de convolution si les dimensions ne collent pas, Le padding tel que la feature map est de taille ⌈I/S⌉, La taille de sortie est mathématiquement satisfaisante, Aussi appelé 'demi' padding, Padding maximum tel que les dernières convolutions sont appliquées sur les bords de l'entrée, Le filtre 'voit' l'entrée du début à la fin]
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶ Ajuster les paramètres
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶ Compatibilité des paramètres dans la couche convolutionnelle ― En notant I le côté du volume d'entrée, F la taille du filtre, P la quantité de zero-padding, S la stride, la taille O de la feature map de sortie suivant cette dimension est telle que :
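
A minimal sketch of the output-size rule described in item 33. The formula itself is not reproduced in this file, so the commonly used form O = ⌊(I − F + P_start + P_end)/S⌋ + 1 is assumed here; the function name `conv_output_size` is illustrative only.

```python
def conv_output_size(i, f, p_start, p_end, s):
    """Output length O along one dimension of a convolution.

    Assumed formula: O = floor((I - F + P_start + P_end) / S) + 1.
    """
    if s <= 0:
        raise ValueError("stride must be a positive integer")
    padded = i + p_start + p_end
    if padded < f:
        raise ValueError("filter is larger than the padded input")
    # Integer division drops the last window when the dimensions do not match.
    return (padded - f) // s + 1


print(conv_output_size(32, 5, 0, 0, 1))  # no padding
print(conv_output_size(32, 5, 2, 2, 1))  # symmetric padding P = 2
```

With symmetric padding, `p_start` and `p_end` are equal and their sum plays the role of 2P mentioned in item 35.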
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶ [Entrée, Filtre, Sortie]
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶ Remarque : on a souvent Pstart=Pend≜P, auquel cas on remplace Pstart+Pend par 2P dans la formule au-dessus.
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶ Comprendre la complexité du modèle ― Pour évaluer la complexité d'un modèle, il est souvent utile de déterminer le nombre de paramètres que l'architecture va avoir. Dans une couche donnée d'un réseau de neurones convolutionnels, on a :
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶ [Illustration, Taille d'entrée, Taille de sortie, Nombre de paramètres, Remarques]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶ [L'opération de pooling est effectuée pour chaque canal, Dans la plupart des cas, S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶ [L'entrée est aplatie, Un paramètre de biais par neurone, Le choix du nombre de neurones de FC est libre]
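
The table summarized by items 37 to 40 is an image and is not reproduced here. As a hedged sketch, the parameter counts below use the usual conventions, consistent with "one bias parameter per filter" and "one bias parameter per neuron": (F·F·C + 1)·K for a convolution layer, 0 for a pooling layer, and (N_in + 1)·N_out for a fully connected layer. The function names are illustrative.

```python
def conv_params(f, c, k):
    """K filters of size F x F x C, plus one bias per filter."""
    return (f * f * c + 1) * k


def pool_params():
    """Pooling has no learnable parameters."""
    return 0


def fc_params(n_in, n_out):
    """One weight per (input, neuron) pair, plus one bias per neuron."""
    return (n_in + 1) * n_out


print(conv_params(3, 3, 64))
print(fc_params(4096, 1000))
```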
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶ Champ récepteur ― Le champ récepteur à la couche k est la surface notée Rk×Rk de l'entrée que chaque pixel de la k-ième activation map peut 'voir'. En notant Fj la taille du filtre de la couche j et Si la valeur de stride de la couche i et avec la convention S0=1, le champ récepteur à la couche k peut être calculé de la manière suivante :
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶ Dans l'exemple ci-dessous, on a F1=F2=3 et S1=S2=1, ce qui donne R2=1+2⋅1+2⋅1=5.
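
A short sketch of the receptive-field computation from items 41 and 42. The formula is not shown in this file; the form assumed here is R_k = 1 + Σ_{j=1..k} (F_j − 1) · Π_{i=0..j−1} S_i with S_0 = 1, which reproduces the worked example R_2 = 5. The function name is illustrative.

```python
def receptive_field(filter_sizes, strides):
    """Receptive field R_k after k layers.

    filter_sizes[j-1] is F_j and strides[j-1] is S_j, for j = 1..k.
    """
    r = 1
    jump = 1  # running product S_0 * S_1 * ... * S_{j-1}, with S_0 = 1
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r


print(receptive_field([3, 3], [1, 1]))
```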
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶ Fonctions d'activation communément utilisées
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶ Unité linéaire rectifiée ― La couche d'unité linéaire rectifiée (en anglais rectified linear unit layer) (ReLU) est une fonction d'activation g qui est utilisée sur tous les éléments du volume. Elle a pour but d'introduire des complexités non-linéaires au réseau. Ses variantes sont récapitulées dans le tableau suivant :
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶ [ReLU, Leaky ReLU, ELU, avec]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶ [Complexités non-linéaires interprétables d'un point de vue biologique, Répond au problème de dying ReLU, Dérivable partout]
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶ Softmax ― L'étape softmax peut être vue comme une généralisation de la fonction logistique qui prend comme argument un vecteur de scores x∈Rn et qui renvoie un vecteur de probabilités p∈Rn à travers une fonction softmax à la fin de l'architecture. Elle est définie de la manière suivante :
+
+
+
+
+**48. where**
+
+⟶ où
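
As an illustration of the softmax step in items 47 and 48, the sketch below assumes the standard definition p_i = exp(x_i) / Σ_j exp(x_j). Subtracting the maximum score before exponentiating is an implementation choice for numerical stability and does not change the result.

```python
import math


def softmax(scores):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(scores)  # stability trick, cancels out in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```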
+
+
+
+
+**49. Object detection**
+
+⟶ Détection d'objet
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶ Types de modèles ― Il y a 3 principaux types d'algorithme de reconnaissance d'objet, pour lesquels la nature de ce qui est prédit est différente. Ils sont décrits dans la table ci-dessous :
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶ [Classification d'image, Classification avec localisation, Détection]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶ [Ours en peluche, Livre]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶ [Classifie une image, Prédit la probabilité d'un objet, Détecte un objet dans une image, Prédit la probabilité de présence d'un objet et où il est situé, Peut détecter plusieurs objets dans une image, Prédit les probabilités de présence des objets et où ils sont situés]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶ [CNN traditionnel, YOLO simplifié, R-CNN, YOLO, R-CNN]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶ Détection ― Dans le contexte de la détection d'objet, des méthodes différentes sont utilisées selon si l'on veut juste localiser l'objet ou alors détecter une forme plus complexe dans l'image. Les deux méthodes principales sont résumées dans le tableau ci-dessous :
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶ [Détection de zone délimitante, Détection de forme complexe]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶ [Détecte la partie de l'image où l'objet est situé, Détecte la forme ou les caractéristiques d'un objet (e.g. yeux), Plus granulaire]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶ [Zone de centre (bx,by), hauteur bh et largeur bw, Points de référence (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶ Intersection sur Union ― Intersection sur Union (en anglais Intersection over Union), aussi appelé IoU, est une fonction qui quantifie à quel point la zone délimitante prédite Bp est correctement positionnée par rapport à la zone délimitante vraie Ba. Elle est définie de la manière suivante :
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶ Remarque : on a toujours IoU∈[0,1]. Par convention, la prédiction Bp d'une zone délimitante est considérée comme étant satisfaisante si l'on a IoU(Bp,Ba)⩾0.5.
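
A minimal sketch of IoU as described in items 59 and 60, assuming the usual definition: area of the intersection divided by area of the union. For simplicity, boxes are given here as corner coordinates `(x1, y1, x2, y2)` rather than the center/height/width form of item 58; that representation is an assumption of this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Under the convention of item 60, a prediction is reasonably good if IoU >= 0.5.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```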
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶ Zone d'accroche ― La technique des zones d'accroche (en anglais anchor boxing) sert à prédire des zones délimitantes qui se chevauchent. En pratique, on permet au réseau de prédire plus d'une zone délimitante simultanément, où chaque zone prédite doit respecter une forme géométrique particulière. Par exemple, la première prédiction peut potentiellement être une zone rectangulaire d'une forme donnée, tandis qu'une seconde prédiction doit être une zone rectangulaire d'une autre forme.
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶ Suppression non-max ― La technique de suppression non-max (en anglais non-max suppression) a pour but d'enlever des zones délimitantes qui se chevauchent et qui prédisent un seul et même objet, en sélectionnant les zones les plus représentatives. Après avoir enlevé toutes les zones ayant une probabilité prédite de moins de 0.6, les étapes suivantes sont répétées tant qu'il reste des zones :
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶ [Pour une classe donnée, Étape 1 : Choisir la zone ayant la plus grande probabilité de prédiction., Étape 2 : Enlever toute zone ayant IoU⩾0.5 avec la zone choisie précédemment.]
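
The two steps above can be sketched as follows for a single class. This is an illustrative implementation under stated assumptions, not the cheatsheet's own code: boxes are corner tuples `(x1, y1, x2, y2)`, and the thresholds 0.6 and 0.5 are the ones quoted in items 62 and 63.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the kept (box, score) pairs for one class."""
    # Preliminary filter: drop boxes with a probability lower than the threshold.
    remaining = [(b, s) for b, s in zip(boxes, scores) if s >= score_threshold]
    remaining.sort(key=lambda pair: pair[1], reverse=True)
    kept = []
    while remaining:
        # Step 1: pick the box with the largest prediction probability.
        best = remaining.pop(0)
        kept.append(best)
        # Step 2: discard any box with IoU >= iou_threshold with that box.
        remaining = [p for p in remaining if iou(p[0], best[0]) < iou_threshold]
    return kept


boxes = [(0, 0, 2, 2), (0, 0, 2, 2.2), (5, 5, 7, 7), (0, 0, 1, 1)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))
```

In a multi-class detector the same routine would be run once per class, as the "for a given class" wording in item 63 suggests.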
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶ [Zones prédites, Sélection de la zone de probabilité maximum, Suppression des chevauchements d'une même classe, Zones délimitantes finales]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶ YOLO ― L'algorithme You Only Look Once (YOLO) est un algorithme de détection d'objet qui fonctionne de la manière suivante :
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶ [Étape 1 : Diviser l'image d'entrée en une grille de taille G×G., Étape 2 : Pour chaque cellule, faire tourner un CNN qui prédit y de la forme suivante :, répété k fois]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶ où pc est la probabilité de détecter un objet, bx,by,bh,bw sont les propriétés de la zone délimitante détectée, c1,...,cp est une représentation binaire (en anglais one-hot representation) de l'une des p classes détectée, et k est le nombre de zones d'accroche.
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶ Étape 3 : Faire tourner l'algorithme de suppression non-max pour enlever les zones délimitantes redondantes qui se chevauchent.
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶ [Image originale, Division en une grille de taille GxG, Prédiction de zone délimitante, Suppression non-max]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶ Remarque : lorsque pc=0, le réseau ne détecte aucun objet. Dans ce cas, les prédictions correspondantes bx,...,cp doivent être ignorées.
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+⟶ R-CNN ― L'algorithme de région avec des réseaux de neurones convolutionnels (en anglais Region with Convolutional Neural Networks) (R-CNN) est un algorithme de détection d'objet qui segmente d'abord l'image d'entrée pour trouver des zones délimitantes potentiellement pertinentes, puis fait tourner un algorithme de détection pour trouver les objets les plus probables dans ces zones délimitantes.
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶ [Image originale, Segmentation, Prédiction de zone délimitante, Suppression non-max]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶ Remarque : bien que l'algorithme original soit lent et coûteux en temps de calcul, de nouvelles architectures ont permis de faire tourner l'algorithme plus rapidement, telles que le Fast R-CNN et le Faster R-CNN.
+
+
+
+
+**74. Face verification and recognition**
+
+⟶ Vérification et reconnaissance de visage
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in the table below:**
+
+⟶ Types de modèles ― Deux principaux types de modèle sont récapitulés dans le tableau ci-dessous :
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶ [Vérification de visage, Reconnaissance de visage, Requête, Référence, Base de données]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶ [Est-ce la bonne personne ?, Recherche un-à-un, Est-ce une des K personnes dans la base de données ?, Recherche un-à-plusieurs]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶ Apprentissage par coup ― L'apprentissage par coup (en anglais One Shot Learning) est un algorithme de vérification de visage qui utilise un training set de petite taille pour apprendre une fonction de similarité qui quantifie à quel point deux images données sont différentes. La fonction de similarité appliquée à deux images est souvent notée d(image 1,image 2).
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶ Réseaux siamois ― Les réseaux siamois (en anglais Siamese Networks) ont pour but d'apprendre comment encoder des images pour quantifier le degré de différence de deux images données. Pour une image d'entrée donnée x(i), l'encodage de sortie est souvent noté f(x(i)).
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶ Loss triple ― Le loss triple (en anglais triplet loss) ℓ est une fonction de loss calculée sur une représentation encodée d'un triplet d'images A (accroche), P (positif), et N (négatif). L'exemple d'accroche et l'exemple positif appartiennent à la même classe, tandis que l'exemple négatif appartient à une autre. En notant α∈R+ le paramètre de marge, le loss est défini de la manière suivante :
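
A hedged sketch of the triplet loss from item 80. The formula is not included in this file; the standard form ℓ(A, P, N) = max(d(A, P) − d(A, N) + α, 0) is assumed, with d taken here as the Euclidean distance between embeddings. Both the choice of distance and the function names are assumptions of this example.

```python
import math


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """max(d(A, P) - d(A, N) + margin, 0) on already-encoded images."""
    return max(distance(anchor, positive) - distance(anchor, negative) + margin, 0.0)


# The positive is close to the anchor and the negative is far: loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=0.2))
# Swapping positive and negative violates the margin: loss is positive.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.2))
```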
+
+
+
+
+**81. Neural style transfer**
+
+⟶ Transfert de style neuronal
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶ Motivation ― Le but du transfert de style neuronal (en anglais neural style transfer) est de générer une image G à partir d'un contenu C et d'un style S.
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶ [Contenu C, Style S, Image générée G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶ Activation ― Dans une couche l donnée, l'activation est notée a[l] et est de dimensions nH×nw×nc
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶ Fonction de coût de contenu ― La fonction de coût de contenu (en anglais content cost function), notée Jcontenu(C,G), est utilisée pour quantifier à quel point l'image générée G diffère de l'image de contenu original C. Elle est définie de la manière suivante :
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶ Matrice de style ― La matrice de style (en anglais style matrix) G[l] d'une couche l donnée est une matrice de Gram dans laquelle chacun des éléments G[l]kk′ quantifie le degré de corrélation des canaux k et k′. Elle est définie en fonction des activations a[l] de la manière suivante :
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶ Remarque : les matrices de style de l'image de style et de l'image générée sont notées G[l] (S) et G[l] (G) respectivement.
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶ Fonction de coût de style ― La fonction de coût de style (en anglais style cost function), notée Jstyle(S,G), est utilisée pour quantifier à quel point l'image générée G diffère de l'image de style S. Elle est définie de la manière suivante :
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶ Fonction de coût totale ― La fonction de coût totale (en anglais overall cost function) est définie comme étant une combinaison linéaire des fonctions de coût de contenu et de style, pondérées par les paramètres α,β, de la manière suivante :
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶ Remarque : plus α est grand, plus le modèle privilégiera le contenu et plus β est grand, plus le modèle sera fidèle au style.
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶ Architectures utilisant des astuces de calcul
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+⟶ Réseau antagoniste génératif ― Les réseaux antagonistes génératifs (en anglais generative adversarial networks), aussi connus sous le nom de GANs, sont composés d'un modèle génératif et d'un modèle discriminatif, où le modèle génératif a pour but de générer des prédictions aussi réalistes que possibles, qui seront ensuite envoyées dans un modèle discriminatif qui aura pour but de différencier une image générée d'une image réelle.
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶ [Training, Bruit, Image réelle, Générateur, Discriminant, Vrai faux]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶ Remarque : les GANs sont utilisés dans des applications pouvant aller de la génération et la synthèse de musique à la génération d'image à partir de texte.
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ ResNet ― L'architecture du réseau résiduel (en anglais Residual Network), aussi appelé ResNet, utilise des blocs résiduels avec un nombre élevé de couches et a pour but de réduire l'erreur de training. Le bloc résiduel est caractérisé par l'équation suivante :
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶ Inception Network ― Cette architecture utilise des modules d'inception et a pour but de tester toute sorte de configuration de convolution pour améliorer sa performance en diversifiant ses attributs. En particulier, elle utilise l'astuce de la convolution 1×1 pour limiter sa complexité de calcul.
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'apprentissage profond sont maintenant disponibles en français.
+
+
+
+
+**98. Original authors**
+
+⟶ Auteurs
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶ Traduit par X, Y et Z
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶ Relu par X, Y et Z
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub
+
+
+
+
+**102. By X and Y**
+
+⟶ Par X et Y
+
+
diff --git a/fr/cs-230-deep-learning-tips-and-tricks.md b/fr/cs-230-deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..4c84b51f4
--- /dev/null
+++ b/fr/cs-230-deep-learning-tips-and-tricks.md
@@ -0,0 +1,457 @@
+**Deep Learning Tips and Tricks translation**
+
+
+
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+⟶ Pense-bête de petites astuces d'apprentissage profond
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Apprentissage profond
+
+
+
+
+**3. Tips and tricks**
+
+⟶ Petites astuces
+
+
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+⟶ [Traitement des données, Augmentation des données, Normalisation de lot]
+
+
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+⟶ [Entraînement d'un réseau de neurones, Epoch, Mini-lot, Entropie croisée, Rétropropagation du gradient, Algorithme du gradient, Mise à jour des coefficients, Vérification de gradient]
+
+
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+⟶ [Ajustement de paramètres, Initialisation de Xavier, Apprentissage par transfert, Taux d'apprentissage, Taux d'apprentissage adaptatifs]
+
+
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+⟶ [Régularisation, Abandon, Régularisation des coefficients, Arrêt prématuré]
+
+
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+⟶ [Bonnes pratiques, Surapprentissage d'un mini-lot, Vérification de gradient]
+
+
+
+
+**9. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub
+
+
+
+
+**10. Data processing**
+
+⟶ Traitement des données
+
+
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+⟶ Augmentation des données ― Les modèles d'apprentissage profond ont typiquement besoin de beaucoup de données afin d'être entraînés convenablement. Il est souvent utile de générer plus de données à partir de celles déjà existantes à l'aide de techniques d'augmentation de données. Celles les plus souvent utilisées sont résumées dans le tableau ci-dessous. À partir d'une image, voici les techniques que l'on peut utiliser :
+
+
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+⟶ [Original, Symétrie axiale, Rotation, Recadrage aléatoire]
+
+
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+⟶ [Image sans aucune modification, Symétrie par rapport à un axe pour lequel le sens de l'image est conservé, Rotation avec un petit angle, Reproduit une calibration imparfaite de l'horizon, Concentration aléatoire sur une partie de l'image, Plusieurs rognements aléatoires peuvent être faits à la suite]
+
+
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+⟶ [Changement de couleur, Addition de bruit, Perte d'information, Changement de contraste]
+
+
+
+
+**15. [Nuances of RGB are slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposure due to time of day]**
+
+⟶ [Les nuances de RGB sont légèrement changées, Capture le bruit qui peut survenir avec l'exposition lumineuse, Addition de bruit, Plus de tolérance envers la variation de la qualité de l'entrée, Parties de l'image ignorées, Imite des pertes potentielles de parties de l'image, Changement de luminosité, Contrôle la différence d'exposition due à l'heure de la journée]
+
+
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+⟶ Remarque : les données sont normalement augmentées à la volée durant l'étape de training.
+
+
+
+
+**17. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+⟶ Normalisation de lot ― La normalisation de lot (en anglais batch normalization) est une étape qui normalise le lot {xi} avec un choix de paramètres γ,β. En notant μB,σ2B la moyenne et la variance de ce que l'on veut corriger du lot, on a :
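
A minimal sketch of the normalization step in item 17. The formula is not reproduced in this file; the usual form x_i ← γ · (x_i − μ_B) / √(σ_B² + ε) + β is assumed, where ε is a small constant added for numerical stability. The default value of `eps` and the function name are assumptions of this example.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then rescale by gamma and shift by beta."""
    m = len(batch)
    mu = sum(batch) / m                          # batch mean
    var = sum((x - mu) ** 2 for x in batch) / m  # batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


print(batch_norm([1.0, 2.0, 3.0]))
```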
+
+
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶ Elle est couramment appliquée après une couche fully connected ou de convolution et avant une couche non-linéaire. Elle vise à permettre de plus grands taux d'apprentissage et à réduire la forte dépendance à l'initialisation.
+
+
+
+
+**19. Training a neural network**
+
+⟶ Entraîner un réseau de neurones
+
+
+
+
+**20. Definitions**
+
+⟶ Définitions
+
+
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+⟶ Epoch ― Dans le contexte de l'entraînement d'un modèle, l'epoch est un terme utilisé pour référer à une itération où le modèle voit tout le training set pour mettre à jour ses coefficients.
+
+
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually based neither on the whole training set at once, due to computational cost, nor on a single data point, due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+⟶ Gradient descent sur mini-lots ― Durant la phase d'entraînement, la mise à jour des coefficients n'est souvent basée ni sur tout le training set d'un coup à cause de temps de calculs coûteux, ni sur un seul point à cause de bruits potentiels. À la place de cela, l'étape de mise à jour est faite sur des mini-lots, où le nombre de points dans un lot est un paramètre que l'on peut régler.
+
+
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+⟶ Fonction de loss ― Pour pouvoir quantifier la performance d'un modèle donné, la fonction de loss (en anglais loss function) L est utilisée pour évaluer la mesure dans laquelle les sorties vraies y sont correctement prédites par les prédictions du modèle z.
+
+
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶ Entropie croisée ― Dans le contexte de la classification binaire d'un réseau de neurones, l'entropie croisée (en anglais cross-entropy loss) L(z,y) est couramment utilisée et est définie par :
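
For illustration, the binary cross-entropy of item 24 can be sketched as below, assuming the standard definition L(z, y) = −[y · log(z) + (1 − y) · log(1 − z)] with z strictly between 0 and 1. The function name is illustrative.

```python
import math


def cross_entropy(z, y):
    """Binary cross-entropy for a prediction z in (0, 1) and a label y in {0, 1}."""
    return -(y * math.log(z) + (1 - y) * math.log(1 - z))


# A confident correct prediction costs less than a confident wrong one.
print(cross_entropy(0.9, 1), cross_entropy(0.1, 1))
```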
+
+
+
+
+**25. Finding optimal weights**
+
+⟶ Recherche de coefficients optimaux
+
+
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+⟶ Backpropagation ― La backpropagation est une méthode de mise à jour des coefficients d'un réseau de neurones qui prend en compte la sortie obtenue et la sortie désirée. La dérivée par rapport à chaque coefficient w est calculée en utilisant la règle de la chaîne.
+
+
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+⟶ En utilisant cette méthode, chaque coefficient est mis à jour par :
+
+
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶ Mettre à jour les coefficients ― Dans un réseau de neurones, les coefficients sont mis à jour par :
+
+
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+⟶ [Étape 1 : Prendre un lot de training data et effectuer une forward propagation pour calculer le loss, Étape 2 : Backpropaguer le loss pour obtenir le gradient du loss par rapport à chaque coefficient, Étape 3 : Utiliser les gradients pour mettre à jour les coefficients du réseau.]
+
+
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+⟶ [Forward propagation, Backpropagation, Mise à jour des coefficients]
+
+
+
+
+**31. Parameter tuning**
+
+⟶ Réglage des paramètres
+
+
+
+
+**32. Weights initialization**
+
+⟶ Initialisation des coefficients
+
+
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.**
+
+⟶ Initialisation de Xavier ― Au lieu de laisser les coefficients s'initialiser de manière purement aléatoire, l'initialisation de Xavier permet d'avoir des coefficients initiaux qui prennent en compte les caractéristiques uniques de l'architecture.
+
+
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+⟶ Apprentissage de transfert ― Entraîner un modèle d'apprentissage profond requiert beaucoup de données et surtout beaucoup de temps. Il est souvent utile de profiter de coefficients pré-entraînés sur des jeux de données énormes, dont l'entraînement a pris des jours ou des semaines, et de les exploiter pour notre cas d'usage. Selon la quantité de données que l'on a sous la main, voici différentes manières d'utiliser cette méthode :
+
+
+
+
+**35. [Training size, Illustration, Explanation]**
+
+⟶ [Taille du training, Illustration, Explication]
+
+
+
+
+**36. [Small, Medium, Large]**
+
+⟶ [Petit, Moyen, Grand]
+
+
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+⟶ [Gèle toutes les couches, entraîne les coefficients du softmax, Gèle la plupart des couches, entraîne les coefficients des dernières couches et du softmax, Entraîne les coefficients des couches et du softmax en initialisant les coefficients sur ceux qui ont été pré-entraînés]
+
+
+
+
+**38. Optimizing convergence**
+
+⟶ Optimisation de la convergence
+
+
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶ Taux d'apprentissage ― Le taux d'apprentissage (en anglais learning rate), souvent noté α ou η, indique la vitesse à laquelle les coefficients sont mis à jour. Il peut être fixe ou variable. La méthode actuelle la plus populaire est appelée Adam, qui est une méthode faisant varier le taux d'apprentissage.
+
+
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+⟶ Taux d'apprentissage adaptatifs ― Laisser le taux d'apprentissage varier pendant la phase d'entraînement du modèle peut réduire le temps d'entraînement et améliorer la qualité de la solution numérique optimale. Bien que la méthode d'Adam soit la plus utilisée, d'autres peuvent aussi être utiles. Les différentes méthodes sont récapitulées dans le tableau ci-dessous :
+
+
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+⟶ [Méthode, Explication, Mise à jour de w, Mise à jour de b]
+
+
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+⟶ [Momentum, Amortit les oscillations, Amélioration par rapport à la méthode SGD, 2 paramètres à régler]
+
+
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+⟶ [RMSprop, Root Mean Square propagation, Accélère l'algorithme d'apprentissage en contrôlant les oscillations]
+
+
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+⟶ [Adam, Adaptive Moment estimation, Méthode la plus populaire, 4 paramètres à régler]
+
+
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+⟶ Remarque : parmi les autres méthodes existantes, on trouve Adadelta, Adagrad et SGD.
+
+
+
+
+**46. Regularization**
+
+⟶ Régularisation
+
+
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+⟶ Dropout ― Le dropout est une technique qui est destinée à empêcher le sur-ajustement sur les données de training en abandonnant des unités dans un réseau de neurones avec une probabilité p>0. Cela force le modèle à éviter de trop s'appuyer sur un ensemble particulier de features.
+
+
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+⟶ Remarque : la plupart des frameworks d'apprentissage profond paramètrent le dropout à travers le paramètre 'keep' 1−p.
+
+
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+⟶ Régularisation des coefficients ― Pour s'assurer que les coefficients ne sont pas trop grands et que le modèle ne sur-ajuste pas sur le training set, on applique généralement des techniques de régularisation aux coefficients du modèle. Les techniques principales sont résumées dans le tableau suivant :
+
+
+
+
+**50. [LASSO, Ridge, Elastic Net]**
+
+⟶ [LASSO, Ridge, Elastic Net]
+
+
+
+**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Réduit les coefficients à 0, Bon pour la sélection de variables, Rend les coefficients plus petits, Compromis entre la sélection de variables et la réduction de la taille des coefficients]
+
+
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+⟶ Arrêt prématuré ― L'arrêt prématuré (en anglais early stopping) est une technique de régularisation qui consiste à stopper l'étape d'entraînement dès que le loss de validation atteint un plateau ou commence à augmenter.
+
+
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+⟶ [Erreur, Validation, Training, arrêt prématuré, Epochs]
+
+
+
+
+**53. Good practices**
+
+⟶ Bonnes pratiques
+
+
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+⟶ Sur-ajuster un mini-lot ― Lorsque l'on débugge un modèle, il est souvent utile de faire de petits tests pour voir s'il y a un gros souci avec l'architecture du modèle lui-même. En particulier, pour s'assurer que le modèle peut être entraîné correctement, un mini-lot est passé dans le réseau pour voir s'il peut sur-ajuster sur lui. Si le modèle ne peut pas le faire, cela signifie qu'il est soit trop complexe, soit pas assez complexe pour sur-ajuster ne serait-ce qu'un mini-lot, et encore moins un training set de taille normale.
+
+
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+⟶ Gradient checking ― La méthode de gradient checking est utilisée durant l'implémentation d'un backward pass d'un réseau de neurones. Elle compare la valeur du gradient analytique par rapport au gradient numérique au niveau de certains points et joue un rôle de vérification élémentaire.
+
+
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+⟶ [Type, Gradient numérique, Gradient analytique]
+
+
+
+
+**57. [Formula, Comments]**
+
+⟶ [Formule, Commentaires]
+
+
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+⟶ [Coûteux; le loss doit être calculé deux fois par dimension, Utilisé pour vérifier l'exactitude d'une implémentation analytique, Compromis dans le choix de h entre pas trop petit (instabilité numérique) et pas trop grand (estimation du gradient approximative)]
+
+
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+⟶ [Résultat 'exact', Calcul direct, Utilisé dans l'implémentation finale]
+
+
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'apprentissage profond sont maintenant disponibles en français.
+
+
+
+**61. Original authors**
+
+⟶ Auteurs
+
+
+
+**62. Translated by X, Y and Z**
+
+⟶ Traduit par X, Y et Z
+
+
+
+**63. Reviewed by X, Y and Z**
+
+⟶ Relu par X, Y et Z
+
+
+
+**64. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub
+
+
+
+**65. By X and Y**
+
+⟶ Par X et Y
+
+
diff --git a/fr/cs-230-recurrent-neural-networks.md b/fr/cs-230-recurrent-neural-networks.md
new file mode 100644
index 000000000..e7d8f5343
--- /dev/null
+++ b/fr/cs-230-recurrent-neural-networks.md
@@ -0,0 +1,678 @@
+**Recurrent Neural Networks translation**
+
+
+
+**1. Recurrent Neural Networks cheatsheet**
+
+⟶ Pense-bête de réseaux de neurones récurrents
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Apprentissage profond
+
+
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+⟶ [Vue d'ensemble, Structure d'architecture, Applications des RNNs, Fonction de loss, Backpropagation]
+
+
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+⟶ [Dépendances à long terme, Fonctions d'activation communes, Gradient qui disparait/explose, Coupure de gradient, GRU/LSTM, Types de porte, RNN bi-directionnel, RNN profond]
+
+
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+⟶ [Apprentissage de la représentation de mots, Notations, Matrice de représentation, Word2vec, Skip-gram, Échantillonnage négatif, GloVe]
+
+
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+⟶ [Comparaison des mots, Similarité cosinus, t-SNE]
+
+
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+⟶ [Modèle de langage, n-gram, Perplexité]
+
+
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+⟶ [Traduction machine, Recherche en faisceau, Normalisation de longueur, Analyse d'erreur, Score bleu]
+
+
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+⟶ [Attention, Modèle d'attention, Coefficients d'attention]
+
+
+
+
+**10. Overview**
+
+⟶ Vue d'ensemble
+
+
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+⟶ Architecture d'un RNN traditionnel ― Les réseaux de neurones récurrents (en anglais recurrent neural networks), aussi appelés RNNs, sont une classe de réseaux de neurones qui permettent aux prédictions antérieures d'être utilisées comme entrées, par le biais d'états cachés (en anglais hidden states). Ils sont de la forme suivante :
+
+
+
+
+**12. For each timestep t, the activation a and the output y are expressed as follows:**
+
+⟶ À l'instant t, l'activation a et la sortie y sont de la forme suivante :
+
+
+
+
+**13. and**
+
+⟶ et
+
+
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+⟶ où Wax,Waa,Wya,ba,by sont des coefficients indépendants du temps et où g1,g2 sont des fonctions d'activation.
+
+
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+⟶ Les avantages et inconvénients des architectures de RNN traditionnelles sont résumés dans le tableau ci-dessous :
+
+
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+⟶ [Avantages, Possibilité de prendre en compte des entrées de toute taille, La taille du modèle n'augmente pas avec la taille de l'entrée, Les calculs prennent en compte les informations antérieures, Les coefficients sont indépendants du temps]
+
+
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+⟶ [Inconvénients, Le temps de calcul est long, Difficulté d'accéder à des informations d'un passé lointain, Impossibilité de prendre en compte des informations futures pour un état donné]
+
+
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+⟶ Applications des RNNs ― Les modèles RNN sont surtout utilisés dans les domaines du traitement automatique du langage naturel et de la reconnaissance vocale. Le tableau suivant détaille les applications principales à retenir :
+
+
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+⟶ [Type de RNN, Illustration, Exemple]
+
+
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+⟶ [Un à un, Un à plusieurs, Plusieurs à un, Plusieurs à plusieurs]
+
+
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Named entity recognition, Machine translation]**
+
+⟶ [Réseau de neurones traditionnel, Génération de musique, Classification de sentiment, Reconnaissance d'entités nommées, Traduction machine]
+
+
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+⟶ Fonction de loss ― Dans le contexte des réseaux de neurones récurrents, la fonction de loss L prend en compte le loss à chaque temps T de la manière suivante :
+
+
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+⟶ Backpropagation temporelle ― L'étape de backpropagation est appliquée dans la dimension temporelle. À l'instant T, la dérivée du loss L par rapport à la matrice de coefficients W est donnée par :
+
+
+
+
+**24. Handling long term dependencies**
+
+⟶ Dépendances à long terme
+
+
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+⟶ Fonctions d'activation communément utilisées ― Les fonctions d'activation les plus utilisées dans les RNNs sont décrites ci-dessous :
+
+
+
+
+**26. [Sigmoid, Tanh, RELU]**
+
+⟶ [Sigmoïde, Tanh, RELU]
+
+
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
+
+⟶ Gradient qui disparait/explose ― Les phénomènes de gradient qui disparait et qui explose (en anglais vanishing gradient et exploding gradient) sont souvent rencontrés dans le contexte des RNNs. Ceci est dû au fait qu'il est difficile de capturer des dépendances à long terme à cause du gradient multiplicatif qui peut décroître/croître de manière exponentielle en fonction du nombre de couches.
+
+
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+⟶ Coupure de gradient ― Cette technique est utilisée pour atténuer le phénomène de gradient qui explose qui peut être rencontré lors de l'étape de backpropagation. En plafonnant la valeur qui peut être prise par le gradient, ce phénomène est maîtrisé en pratique.
+
+
+
+
+**29. clipped**
+
+⟶ coupé
+
+
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+⟶ Types de porte ― Pour remédier au problème du gradient qui disparait, certains types de porte sont spécifiquement utilisés dans des variantes de RNNs et ont un but bien défini. Les portes sont souvent notées Γ et sont telles que :
+
+
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+⟶ où W,U,b sont des coefficients spécifiques à la porte et σ est une sigmoïde. Les portes à retenir sont récapitulées dans le tableau ci-dessous :
+
+
+
+
+**32. [Type of gate, Role, Used in]**
+
+⟶ [Type de porte, Rôle, Utilisée dans]
+
+
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+⟶ [Porte d'actualisation, Porte de pertinence, Porte d'oubli, Porte de sortie]
+
+
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+⟶ [Dans quelle mesure le passé devrait-il compter maintenant ?, Abandonner les informations précédentes ?, Effacer une cellule ou non ?, Combien devrait-on révéler d'une cellule ?]
+
+
+
+
+**35. [LSTM, GRU]**
+
+⟶ [LSTM, GRU]
+
+
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+⟶ GRU/LSTM ― Les unités de porte récurrente (en anglais Gated Recurrent Unit) (GRU) et les unités de mémoire à long/court terme (en anglais Long Short-Term Memory units) (LSTM) atténuent le problème du gradient qui disparait rencontré par les RNNs traditionnels, le LSTM pouvant être vu comme une généralisation du GRU. Le tableau ci-dessous résume les équations caractéristiques de chacune de ces architectures :
+
+
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+⟶ [Caractérisation, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dépendances]
+
+
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+⟶ Remarque : le signe ⋆ dénote le produit de Hadamard entre deux vecteurs.
+
+
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+⟶ Variantes des RNNs ― Le tableau ci-dessous récapitule les autres architectures RNN communément utilisées :
+
+
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+⟶ [Bi-directionnel (BRNN), Profond (DRNN)]
+
+
+
+
+**41. Learning word representation**
+
+⟶ Apprentissage de la représentation de mots
+
+
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+⟶ Dans cette section, on note V le vocabulaire et |V| sa taille.
+
+
+
+
+**43. Motivation and notations**
+
+⟶ Motivation et notations
+
+
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+⟶ Techniques de représentation ― Les deux manières principales de représenter des mots sont décrites dans le tableau suivant :
+
+
+
+
+**45. [1-hot representation, Word embedding]**
+
+⟶ [Représentation binaire, Représentation du mot]
+
+
+
+
+**46. [teddy bear, book, soft]**
+
+⟶ [ours en peluche, livre, doux]
+
+
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]**
+
+⟶ [Noté ow, Approche naïve, pas d'information de similarité, Noté ew, Prend en compte la similarité des mots]
+
+
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+⟶ Matrice de représentation ― Pour un mot donné w, la matrice de représentation (en anglais embedding matrix) E est une matrice qui relie une représentation binaire ow à sa représentation correspondante ew de la manière suivante :
+
+
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+⟶ Remarque : l'apprentissage d'une matrice de représentation peut être effectué en utilisant des modèles probabilistes de cible/contexte.
+
+
+
+
+**50. Word embeddings**
+
+⟶ Représentation de mots
+
+
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+⟶ Word2vec ― Word2vec est un ensemble de techniques visant à apprendre comment représenter les mots en estimant la probabilité qu'un mot donné a d'être entouré par d'autres mots. Le skip-gram, l'échantillonnage négatif et le CBOW font partie des modèles les plus populaires.
+
+
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+⟶ [Un ours en peluche mignon est en train de lire, ours en peluche, doux, poésie persane, art]
+
+
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+⟶ [Entraîner le réseau sur une tâche proxy, Extraire une représentation de haut niveau, Calculer une représentation des mots]
+
+
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+⟶ Skip-gram ― Le skip-gram est un modèle word2vec de type supervisé qui apprend comment représenter les mots en évaluant la probabilité qu'un mot cible t donné apparaisse avec un mot contexte c. En notant θt le paramètre associé à t, la probabilité P(t|c) est donnée par :
+
+
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+⟶ Remarque : le fait d'additionner tout le vocabulaire dans le dénominateur du softmax rend le modèle coûteux en temps de calcul. CBOW est un autre modèle utilisant les mots avoisinants pour prédire un mot donné.
+
+
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+⟶ Échantillonnage négatif ― Cette méthode utilise un ensemble de classifieurs binaires à base de régressions logistiques, qui visent à évaluer dans quelle mesure un mot contexte et un mot cible donnés sont susceptibles d'apparaître simultanément, les modèles étant entraînés sur des ensembles de k exemples négatifs et 1 exemple positif. Étant donnés un mot contexte c et un mot cible t, la prédiction est donnée par :
+
+
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+⟶ Remarque : cette méthode est moins coûteuse en calcul par rapport au modèle skip-gram.
+
+
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+⟶ GloVe ― Le modèle GloVe (en anglais global vectors for word representation) est une technique de représentation des mots qui utilise une matrice de co-occurrence X où chaque Xi,j correspond au nombre de fois qu'une cible i se produit avec un contexte j. Sa fonction de coût J est telle que :
+
+
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
+Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+⟶ où f est une fonction de pondération telle que Xi,j=0⟹f(Xi,j)=0.
+Étant donné la symétrie des rôles joués par e et θ dans ce modèle, la représentation finale du mot e(final)w est donnée par :
+
+
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+⟶ Remarque : les composantes individuelles des représentations de mots apprises ne sont pas nécessairement interprétables.
+
+
+
+
+**60. Comparing words**
+
+⟶ Comparaison de mots
+
+
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+⟶ Similarité cosinus ― La similarité cosinus (en anglais cosine similarity) entre les mots w1 et w2 est donnée par :
+
+
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+⟶ Remarque : θ est l'angle entre les mots w1 et w2.
+
+
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+⟶ t-SNE ― La méthode t-SNE (en anglais t-distributed Stochastic Neighbor Embedding) est une technique visant à réduire des représentations de haute dimension en un espace de plus faible dimension. En pratique, elle est couramment utilisée pour visualiser les vecteurs de mots dans un espace 2D.
+
+
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+⟶ [littérature, art, livre, culture, poème, lecture, connaissance, divertissant, aimable, enfance, gentil, ours en peluche, doux, câlin, mignon, adorable]
+
+
+
+
+**65. Language model**
+
+⟶ Modèle de langage
+
+
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+⟶ Vue d'ensemble ― Un modèle de langage vise à estimer la probabilité d'une phrase P(y).
+
+
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.**
+
+⟶ Modèle n-gram ― Ce modèle consiste en une approche naïve qui vise à quantifier la probabilité qu'une expression apparaisse dans un corpus en comptant son nombre d'apparitions dans le training data.
+
+
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+⟶ Perplexité ― Les modèles de langage sont communément évalués en utilisant la perplexité, aussi notée PP, qui peut être interprétée comme étant la probabilité inverse des données normalisée par le nombre de mots T. Plus la perplexité est faible, mieux c'est. Elle est définie de la manière suivante :
+
+
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+⟶ Remarque : PP est souvent utilisée dans le cadre du t-SNE.
+
+
+
+
+**70. Machine translation**
+
+⟶ Traduction machine
+
+
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+⟶ Vue d'ensemble ― Un modèle de traduction machine est similaire à un modèle de langage, à la différence qu'un réseau encodeur est placé en amont. Pour cette raison, ce modèle est parfois appelé modèle de langage conditionnel. Le but est de trouver une phrase y telle que :
+
+
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+⟶ Recherche en faisceau ― Cette technique (en anglais beam search) est un algorithme de recherche heuristique, utilisé dans le cadre de la traduction machine et de la reconnaissance vocale, qui vise à trouver la phrase la plus probable y sachant l'entrée x.
+
+
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+⟶ [Étape 1 : Trouver les B mots les plus probables y<1>, Étape 2 : Calculer les probabilités conditionnelles y<k>|x,y<1>,...,y<k−1>, Étape 3 : Garder les B combinaisons les plus probables x,y<1>,...,y<k>, Arrêter la procédure à un mot stop]
+
+
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+⟶ Remarque : si la largeur du faisceau est prise égale à 1, alors ceci est équivalent à un algorithme glouton.
+
+
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+⟶ Largeur du faisceau ― La largeur du faisceau (en anglais beam width) B est un paramètre de la recherche en faisceau. De grandes valeurs de B conduisent à de meilleurs résultats, mais au prix d'un temps de calcul plus long et d'une consommation de mémoire plus élevée. De faibles valeurs de B conduisent à de moins bons résultats, mais avec un coût de calcul plus faible. Une valeur standard de B est d'environ 10.
+
+
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+⟶ Normalisation de longueur ― Pour que la stabilité numérique puisse être améliorée, la recherche en faisceau utilise un objectif normalisé, souvent appelé l'objectif de log-probabilité normalisé, défini par :
+
+
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+⟶ Remarque : le paramètre α peut être vu comme un adoucisseur, et sa valeur est généralement comprise entre 0.5 et 1.
+
+
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+⟶ Analyse d'erreur ― Lorsque l'on obtient une mauvaise traduction prédite ˆy, on peut se demander la raison pour laquelle l'algorithme n'a pas obtenu une bonne traduction y∗ en faisant une analyse d'erreur de la manière suivante :
+
+
+
+
+**79. [Case, Root cause, Remedies]**
+
+⟶ [Cas, Cause, Remèdes]
+
+
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+⟶ [Recherche en faisceau défectueuse, RNN défectueux, Augmenter la largeur du faisceau, Essayer une architecture différente, Régulariser, Obtenir plus de données]
+
+
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+⟶ Score bleu ― Le score bleu (en anglais bilingual evaluation understudy) a pour but de quantifier à quel point une traduction est bonne en calculant un score de similarité basé sur une précision n-gram. Il est défini de la manière suivante :
+
+
+
+
+**82. where pn is the bleu score on n-gram only defined as follows:**
+
+⟶ où pn est le score bleu uniquement basé sur les n-gram, défini par :
+
+
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+⟶ Remarque : une pénalité de brièveté peut être appliquée aux traductions prédites courtes pour empêcher que le score bleu soit artificiellement haut.
+
+
+
+
+**84. Attention**
+
+⟶ Attention
+
+
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**
+
+⟶ Modèle d'attention ― Le modèle d'attention (en anglais attention model) permet au RNN de mettre en valeur des parties spécifiques de l'entrée qui peuvent être considérées comme étant importantes, ce qui améliore la performance du modèle final en pratique. En notant α la quantité d'attention que la sortie y devrait porter à l'activation a, et c le contexte à l'instant t, on a :
+
+
+
+
+**86. with**
+
+⟶ avec
+
+
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+⟶ Remarque : les scores d'attention sont communément utilisés dans la génération de légendes d'images ainsi que dans la traduction automatique.
+
+
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+⟶ Un ours en peluche mignon est en train de lire de la littérature persane.
+
+
+
+
+**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:**
+
+⟶ Coefficient d'attention ― La quantité d'attention que la sortie y devrait porter à l'activation a est donnée par α, qui est calculé de la manière suivante :
+
+
+
+
+**90. Remark: computation complexity is quadratic with respect to Tx.**
+
+⟶ Remarque : la complexité de calcul est quadratique par rapport à Tx.
+
+
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Les pense-bêtes d'apprentissage profond sont maintenant disponibles en français.
+
+
+
+**92. Original authors**
+
+⟶ Auteurs d'origine
+
+
+
+**93. Translated by X, Y and Z**
+
+⟶ Traduit par X, Y et Z
+
+
+
+**94. Reviewed by X, Y and Z**
+
+⟶ Relu par X, Y et Z
+
+
+
+**95. View PDF version on GitHub**
+
+⟶ Voir la version PDF sur GitHub
+
+
+
+**96. By X and Y**
+
+⟶ Par X et Y
+
+
diff --git a/he/cheatsheet-deep-learning.md b/he/cheatsheet-deep-learning.md
deleted file mode 100644
index a5aa3756c..000000000
--- a/he/cheatsheet-deep-learning.md
+++ /dev/null
@@ -1,321 +0,0 @@
-**1. Deep Learning cheatsheet**
-
-⟶
-
-
-
-**2. Neural Networks**
-
-⟶
-
-
-
-**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
-
-⟶
-
-
-
-**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
-
-⟶
-
-
-
-**5. [Input layer, hidden layer, output layer]**
-
-⟶
-
-
-
-**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-
-⟶
-
-
-
-**7. where we note w, b, z the weight, bias and output respectively.**
-
-⟶
-
-
-
-**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-
-⟶
-
-
-
-**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
-
-⟶
-
-
-
-**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-⟶
-
-
-
-**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-
-⟶
-
-
-
-**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
-
-⟶
-
-
-
-**13. As a result, the weight is updated as follows:**
-
-⟶
-
-
-
-**14. Updating weights ― In a neural network, weights are updated as follows:**
-
-⟶
-
-
-
-**15. Step 1: Take a batch of training data.**
-
-⟶
-
-
-
-**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
-
-⟶
-
-
-
-**17. Step 3: Backpropagate the loss to get the gradients.**
-
-⟶
-
-
-
-**18. Step 4: Use the gradients to update the weights of the network.**
-
-⟶
-
-
-
-**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
-
-⟶
-
-
-
-**20. Convolutional Neural Networks**
-
-⟶
-
-
-
-**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-
-⟶
-
-
-
-**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
-
-⟶
-
-
-
-**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-⟶
-
-
-
-**24. Recurrent Neural Networks**
-
-⟶
-
-
-
-**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-
-⟶
-
-
-
-**26. [Input gate, forget gate, gate, output gate]**
-
-⟶
-
-
-
-**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
-
-⟶
-
-
-
-**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-
-⟶
-
-
-
-**29. Reinforcement Learning and Control**
-
-⟶
-
-
-
-**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
-
-⟶
-
-
-
-**33. S is the set of states**
-
-⟶
-
-
-
-**34. A is the set of actions**
-
-⟶
-
-
-
-**35. {Psa} are the state transition probabilities for s∈S and a∈A**
-
-⟶
-
-
-
-**36. γ∈[0,1[ is the discount factor**
-
-⟶
-
-
-
-**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
-
-⟶
-
-
-
-**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-
-⟶
-
-
-
-**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
-
-⟶
-
-
-
-**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
-
-⟶
-
-
-
-**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
-
-⟶
-
-
-
-**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
-
-⟶
-
-
-
-**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
-
-⟶
-
-
-
-**44. 1) We initialize the value:**
-
-⟶
-
-
-
-**45. 2) We iterate the value based on the values before:**
-
-⟶
-
-
-
-**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-
-⟶
-
-
-
-**47. times took action a in state s and got to s′**
-
-⟶
-
-
-
-**48. times took action a in state s**
-
-⟶
-
-
-
-**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-
-⟶
-
-
-
-**50. View PDF version on GitHub**
-
-⟶
-
-
-
-**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-
-⟶
-
-
-
-**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-
-⟶
-
-
-
-**53. [Recurrent Neural Networks, Gates, LSTM]**
-
-⟶
-
-
-
-**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-
-⟶
diff --git a/he/cheatsheet-machine-learning-tips-and-tricks.md b/he/cheatsheet-machine-learning-tips-and-tricks.md
deleted file mode 100644
index 9712297b8..000000000
--- a/he/cheatsheet-machine-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,285 +0,0 @@
-**1. Machine Learning tips and tricks cheatsheet**
-
-⟶
-
-
-
-**2. Classification metrics**
-
-⟶
-
-
-
-**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-
-⟶
-
-
-
-**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-
-⟶
-
-
-
-**5. [Predicted class, Actual class]**
-
-⟶
-
-
-
-**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-
-⟶
-
-
-
-**7. [Metric, Formula, Interpretation]**
-
-⟶
-
-
-
-**8. Overall performance of model**
-
-⟶
-
-
-
-**9. How accurate the positive predictions are**
-
-⟶
-
-
-
-**10. Coverage of actual positive sample**
-
-⟶
-
-
-
-**11. Coverage of actual negative sample**
-
-⟶
-
-
-
-**12. Hybrid metric useful for unbalanced classes**
-
-⟶
-
-
-
-**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are are summed up in the table below:**
-
-⟶
-
-
-
-**14. [Metric, Formula, Equivalent]**
-
-⟶
-
-
-
-**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-
-⟶
-
-
-
-**16. [Actual, Predicted]**
-
-⟶
-
-
-
-**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-
-⟶
-
-
-
-**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-
-⟶
-
-
-
-**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-
-⟶
-
-
-
-**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-
-⟶
-
-
-
-**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
-
-⟶
-
-
-
-**22. Model selection**
-
-⟶
-
-
-
-**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-
-⟶
-
-
-
-**24. [Training set, Validation set, Testing set]**
-
-⟶
-
-
-
-**25. [Model is trained, Model is assessed, Model gives predictions]**
-
-⟶
-
-
-
-**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-
-⟶
-
-
-
-**27. [Also called hold-out or development set, Unseen data]**
-
-⟶
-
-
-
-**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-
-⟶
-
-
-
-**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-
-⟶
-
-
-
-**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-
-⟶
-
-
-
-**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-
-⟶
-
-
-
-**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-
-⟶
-
-
-
-**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-
-⟶
-
-
-
-**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-⟶
-
-
-
-**35. Diagnostics**
-
-⟶
-
-
-
-**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-
-⟶
-
-
-
-**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-
-⟶
-
-
-
-**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-
-⟶
-
-
-
-**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-
-⟶
-
-
-
-**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-
-⟶
-
-
-
-**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-
-⟶
-
-
-
-**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-
-⟶
-
-
-
-**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-
-⟶
-
-
-
-**44. Regression metrics**
-
-⟶
-
-
-
-**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-⟶
-
-
-
-**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-
-⟶
-
-
-
-**47. [Model selection, cross-validation, regularization]**
-
-⟶
-
-
-
-**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-⟶
diff --git a/he/cheatsheet-supervised-learning.md b/he/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/he/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Supervised Learning**
-
-⟶
-
-
-
-**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-⟶
-
-
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-⟶
-
-
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-⟶
-
-
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-⟶
-
-
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-⟶
-
-
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-⟶
-
-
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-⟶
-
-
-
-**10. Notations and general concepts**
-
-⟶
-
-
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-⟶
-
-
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-⟶
-
-
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-⟶
-
-
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-⟶
-
-
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-⟶
-
-
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-⟶
-
-
-
-**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
-
-⟶
-
-
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-⟶
-
-
-
-**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-⟶
-
-
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-⟶
-
-
-
-**21. Linear models**
-
-⟶
-
-
-
-**22. Linear regression**
-
-⟶
-
-
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-⟶
-
-
-
-**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**
-
-⟶
-
-
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-⟶
-
-
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-⟶
-
-
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-⟶
-
-
-
-**28. Classification and logistic regression**
-
-⟶
-
-
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-⟶
-
-
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-⟶
-
-
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-⟶
-
-
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-⟶
-
-
-
-**33. Generalized Linear Models**
-
-⟶
-
-
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-⟶
-
-
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-⟶
-
-
-
-**36. Here are the most common exponential distributions summed up in the following table:**
-
-⟶
-
-
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-⟶
-
-
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:**
-
-⟶
-
-
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-⟶
-
-
-
-**40. Support Vector Machines**
-
-⟶
-
-
-
-**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
-
-⟶
-
-
-
-**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-⟶
-
-
-
-**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
-
-⟶
-
-
-
-**44. such that**
-
-⟶
-
-
-
-**45. support vectors**
-
-⟶
-
-
-
-**46. Remark: the line is defined as wTx−b=0.**
-
-⟶
-
-
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-⟶
-
-
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
-
-⟶
-
-
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**
-
-⟶
-
-
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-⟶
-
-
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-⟶
-
-
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-⟶
-
-
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-⟶
-
-
-
-**54. Generative Learning**
-
-⟶
-
-
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-⟶
-
-
-
-**56. Gaussian Discriminant Analysis**
-
-⟶
-
-
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-⟶
-
-
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-⟶
-
-
-
-**59. Naive Bayes**
-
-⟶
-
-
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-⟶
-
-
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-⟶
-
-
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-⟶
-
-
-
-**63. Tree-based and ensemble methods**
-
-⟶
-
-
-
-**64. These methods can be used for both regression and classification problems.**
-
-⟶
-
-
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
-
-⟶
-
-
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-⟶
-
-
-
-**67. Remark: random forests are a type of ensemble methods.**
-
-⟶
-
-
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-⟶
-
-
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-⟶
-
-
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-⟶
-
-
-
-**71. Weak learners trained on remaining errors**
-
-⟶
-
-
-
-**72. Other non-parametric approaches**
-
-⟶
-
-
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-⟶
-
-
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-⟶
-
-
-
-**75. Learning Theory**
-
-⟶
-
-
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-⟶
-
-
-
-**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
-
-⟶
-
-
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-⟶
-
-
-
-**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
-
-⟶
-
-
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
-
-⟶
-
-
-
-**81: the training and testing sets follow the same distribution **
-
-⟶
-
-
-
-**82. the training examples are drawn independently**
-
-⟶
-
-
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-⟶
-
-
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-⟶
-
-
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-⟶
-
-
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-⟶
-
-
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-⟶
-
-
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-⟶
-
-
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-⟶
-
-
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-⟶
-
-
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-⟶
-
-
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-⟶
-
-
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-⟶
-
-
-
-**94. [Other methods, k-NN]**
-
-⟶
-
-
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-⟶
diff --git a/he/cheatsheet-unsupervised-learning.md b/he/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index 40724eb28..000000000
--- a/he/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Unsupervised Learning**
-
-⟶
-
-
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-⟶
-
-
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-⟶
-
-
-
-**5. Clustering**
-
-⟶
-
-
-
-**6. Expectation-Maximization**
-
-⟶
-
-
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-⟶
-
-
-
-**8. [Setting, Latent variable z, Comments]**
-
-⟶
-
-
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-⟶
-
-
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-⟶
-
-
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-⟶
-
-
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-⟶
-
-
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-⟶
-
-
-
-**14. k-means clustering**
-
-⟶
-
-
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-⟶
-
-
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-⟶
-
-
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-⟶
-
-
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-⟶
-
-
-
-**19. Hierarchical clustering**
-
-⟶
-
-
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
-
-⟶
-
-
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
-
-⟶
-
-
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-⟶
-
-
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
-
-⟶
-
-
-
-**24. Clustering assessment metrics**
-
-⟶
-
-
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-⟶
-
-
-
-**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-⟶
-
-
-
-**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-⟶
-
-
-
-**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-⟶
-
-
-
-**29. Dimension reduction**
-
-⟶
-
-
-
-**30. Principal component analysis**
-
-⟶
-
-
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-⟶
-
-
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**34. diagonal**
-
-⟶
-
-
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-⟶
-
-
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-⟶
-
-
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-⟶
-
-
-
-**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-
-⟶
-
-
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-⟶
-
-
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-⟶
-
-
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-⟶
-
-
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-⟶
-
-
-
-**43. Independent component analysis**
-
-⟶
-
-
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-⟶
-
-
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-⟶
-
-
-
-**46. The goal is to find the unmixing matrix W=A−1.**
-
-⟶
-
-
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-⟶
-
-
-
-**48. Write the probability of x=As=W−1s as:**
-
-⟶
-
-
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-⟶
-
-
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-⟶
-
-
-
-**51. The Machine Learning cheatsheets are now available in Hebrew.**
-
-⟶
-
-
-
-**52. Original authors**
-
-⟶
-
-
-
-**53. Translated by X, Y and Z**
-
-⟶
-
-
-
-**54. Reviewed by X, Y and Z**
-
-⟶
-
-
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-**57. [Dimension reduction, PCA, ICA]**
-
-⟶
diff --git a/he/refresher-linear-algebra.md b/he/refresher-linear-algebra.md
deleted file mode 100644
index a6b440d1e..000000000
--- a/he/refresher-linear-algebra.md
+++ /dev/null
@@ -1,339 +0,0 @@
-**1. Linear Algebra and Calculus refresher**
-
-⟶
-
-
-
-**2. General notations**
-
-⟶
-
-
-
-**3. Definitions**
-
-⟶
-
-
-
-**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
-
-⟶
-
-
-
-**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
-
-⟶
-
-
-
-**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
-
-⟶
-
-
-
-**7. Main matrices**
-
-⟶
-
-
-
-**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
-
-⟶
-
-
-
-**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
-
-⟶
-
-
-
-**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
-
-⟶
-
-
-
-**11. Remark: we also note D as diag(d1,...,dn).**
-
-⟶
-
-
-
-**12. Matrix operations**
-
-⟶
-
-
-
-**13. Multiplication**
-
-⟶
-
-
-
-**14. Vector-vector ― There are two types of vector-vector products:**
-
-⟶
-
-
-
-**15. inner product: for x,y∈Rn, we have:**
-
-⟶
-
-
-
-**16. outer product: for x∈Rm,y∈Rn, we have:**
-
-⟶
-
-
-
-**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:**
-
-⟶
-
-
-
-**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
-
-⟶
-
-
-
-**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:**
-
-⟶
-
-
-
-**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
-
-⟶
-
-
-
-**21. Other operations**
-
-⟶
-
-
-
-**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
-
-⟶
-
-
-
-**23. Remark: for matrices A,B, we have (AB)T=BTAT**
-
-⟶
-
-
-
-**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
-
-⟶
-
-
-
-**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
-
-⟶
-
-
-
-**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
-
-⟶
-
-
-
-**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
-
-⟶
-
-
-
-**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
-
-⟶
-
-
-
-**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
-
-⟶
-
-
-
-**30. Matrix properties**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
-
-⟶
-
-
-
-**33. [Symmetric, Antisymmetric]**
-
-⟶
-
-
-
-**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
-
-⟶
-
-
-
-**35. N(ax)=|a|N(x) for a scalar**
-
-⟶
-
-
-
-**36. if N(x)=0, then x=0**
-
-⟶
-
-
-
-**37. For x∈V, the most commonly used norms are summed up in the table below:**
-
-⟶
-
-
-
-**38. [Norm, Notation, Definition, Use case]**
-
-⟶
-
-
-
-**39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
-
-⟶
-
-
-
-**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
-
-⟶
-
-
-
-**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
-
-⟶
-
-
-
-**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
-
-⟶
-
-
-
-**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
-
-⟶
-
-
-
-**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**46. diagonal**
-
-⟶
-
-
-
-**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
-
-⟶
-
-
-
-**48. Matrix calculus**
-
-⟶
-
-
-
-**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
-
-⟶
-
-
-
-**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
-
-⟶
-
-
-
-**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
-
-⟶
-
-
-
-**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
-
-⟶
-
-
-
-**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
-
-⟶
-
-
-
-**54. [General notations, Definitions, Main matrices]**
-
-⟶
-
-
-
-**55. [Matrix operations, Multiplication, Other operations]**
-
-⟶
-
-
-
-**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
-
-⟶
-
-
-
-**57. [Matrix calculus, Gradient, Hessian, Operations]**
-
-⟶
diff --git a/he/refresher-probability.md b/he/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/he/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-⟶
-
-
-
-**2. Introduction to Probability and Combinatorics**
-
-⟶
-
-
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-⟶
-
-
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-⟶
-
-
-
-**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.**
-
-⟶
-
-
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-⟶
-
-
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-⟶
-
-
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-⟶
-
-
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-⟶
-
-
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-⟶
-
-
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-⟶
-
-
-
-**12. Conditional Probability**
-
-⟶
-
-
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-⟶
-
-
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-⟶
-
-
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-⟶
-
-
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-⟶
-
-
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-⟶
-
-
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-⟶
-
-
-
-**19. Random Variables**
-
-⟶
-
-
-
-**20. Definitions**
-
-⟶
-
-
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
-
-⟶
-
-
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-⟶
-
-
-
-**23. Remark: we have P(a
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-⟶
-
-
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-⟶
-
-
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-⟶
-
-
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-⟶
-
-
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-⟶
-
-
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-⟶
-
-
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-⟶
-
-
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-⟶
-
-
-
-**32. Probability Distributions**
-
-⟶
-
-
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-⟶
-
-
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-⟶
-
-
-
-**35. [Type, Distribution]**
-
-⟶
-
-
-
-**36. Jointly Distributed Random Variables**
-
-⟶
-
-
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-⟶
-
-
-
-**38. [Case, Marginal density, Cumulative function]**
-
-⟶
-
-
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-⟶
-
-
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-⟶
-
-
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-⟶
-
-
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-⟶
-
-
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-⟶
-
-
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-⟶
-
-
-
-**45. Parameter estimation**
-
-⟶
-
-
-
-**46. Definitions**
-
-⟶
-
-
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-⟶
-
-
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-⟶
-
-
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-⟶
-
-
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-⟶
-
-
-
-**51. Estimating the mean**
-
-⟶
-
-
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
-
-⟶
-
-
-
-**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
-
-⟶
-
-
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-⟶
-
-
-
-**55. Estimating the variance**
-
-⟶
-
-
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-⟶
-
-
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-⟶
-
-
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-⟶
-
-
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-⟶
-
-
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-⟶
-
-
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-⟶
-
-
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-⟶
-
-
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-⟶
-
-
-
-**64. [Parameter estimation, Mean, Variance]**
-
-⟶
diff --git a/hi/cheatsheet-deep-learning.md b/hi/cheatsheet-deep-learning.md
deleted file mode 100644
index a5aa3756c..000000000
--- a/hi/cheatsheet-deep-learning.md
+++ /dev/null
@@ -1,321 +0,0 @@
-**1. Deep Learning cheatsheet**
-
-⟶
-
-
-
-**2. Neural Networks**
-
-⟶
-
-
-
-**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
-
-⟶
-
-
-
-**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
-
-⟶
-
-
-
-**5. [Input layer, hidden layer, output layer]**
-
-⟶
-
-
-
-**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-
-⟶
-
-
-
-**7. where we note w, b, z the weight, bias and output respectively.**
-
-⟶
-
-
-
-**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-
-⟶
-
-
-
-**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
-
-⟶
-
-
-
-**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-⟶
-
-
-
-**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-
-⟶
-
-
-
-**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
-
-⟶
-
-
-
-**13. As a result, the weight is updated as follows:**
-
-⟶
-
-
-
-**14. Updating weights ― In a neural network, weights are updated as follows:**
-
-⟶
-
-
-
-**15. Step 1: Take a batch of training data.**
-
-⟶
-
-
-
-**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
-
-⟶
-
-
-
-**17. Step 3: Backpropagate the loss to get the gradients.**
-
-⟶
-
-
-
-**18. Step 4: Use the gradients to update the weights of the network.**
-
-⟶
-
-
-
-**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
-
-⟶
-
-
-
-**20. Convolutional Neural Networks**
-
-⟶
-
-
-
-**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-
-⟶
-
-
-
-**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
-
-⟶
-
-
-
-**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-⟶
-
-
-
-**24. Recurrent Neural Networks**
-
-⟶
-
-
-
-**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-
-⟶
-
-
-
-**26. [Input gate, forget gate, gate, output gate]**
-
-⟶
-
-
-
-**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
-
-⟶
-
-
-
-**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-
-⟶
-
-
-
-**29. Reinforcement Learning and Control**
-
-⟶
-
-
-
-**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
-
-⟶
-
-
-
-**33. S is the set of states**
-
-⟶
-
-
-
-**34. A is the set of actions**
-
-⟶
-
-
-
-**35. {Psa} are the state transition probabilities for s∈S and a∈A**
-
-⟶
-
-
-
-**36. γ∈[0,1[ is the discount factor**
-
-⟶
-
-
-
-**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
-
-⟶
-
-
-
-**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-
-⟶
-
-
-
-**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
-
-⟶
-
-
-
-**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
-
-⟶
-
-
-
-**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
-
-⟶
-
-
-
-**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
-
-⟶
-
-
-
-**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
-
-⟶
-
-
-
-**44. 1) We initialize the value:**
-
-⟶
-
-
-
-**45. 2) We iterate the value based on the values before:**
-
-⟶
-
-
-
-**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-
-⟶
-
-
-
-**47. times took action a in state s and got to s′**
-
-⟶
-
-
-
-**48. times took action a in state s**
-
-⟶
-
-
-
-**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-
-⟶
-
-
-
-**50. View PDF version on GitHub**
-
-⟶
-
-
-
-**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-
-⟶
-
-
-
-**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-
-⟶
-
-
-
-**53. [Recurrent Neural Networks, Gates, LSTM]**
-
-⟶
-
-
-
-**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-
-⟶
diff --git a/hi/cheatsheet-unsupervised-learning.md b/hi/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index d07b74750..000000000
--- a/hi/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Unsupervised Learning**
-
-⟶
-
-
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-⟶
-
-
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-⟶
-
-
-
-**5. Clustering**
-
-⟶
-
-
-
-**6. Expectation-Maximization**
-
-⟶
-
-
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-⟶
-
-
-
-**8. [Setting, Latent variable z, Comments]**
-
-⟶
-
-
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-⟶
-
-
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-⟶
-
-
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-⟶
-
-
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-⟶
-
-
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-⟶
-
-
-
-**14. k-means clustering**
-
-⟶
-
-
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-⟶
-
-
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-⟶
-
-
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-⟶
-
-
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-⟶
-
-
-
-**19. Hierarchical clustering**
-
-⟶
-
-
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
-
-⟶
-
-
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
-
-⟶
-
-
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-⟶
-
-
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
-
-⟶
-
-
-
-**24. Clustering assessment metrics**
-
-⟶
-
-
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-⟶
-
-
-
-**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-⟶
-
-
-
-**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-⟶
-
-
-
-**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-⟶
-
-
-
-**29. Dimension reduction**
-
-⟶
-
-
-
-**30. Principal component analysis**
-
-⟶
-
-
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-⟶
-
-
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**34. diagonal**
-
-⟶
-
-
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-⟶
-
-
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-⟶
-
-
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-⟶
-
-
-
-**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-
-⟶
-
-
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-⟶
-
-
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-⟶
-
-
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-⟶
-
-
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-⟶
-
-
-
-**43. Independent component analysis**
-
-⟶
-
-
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-⟶
-
-
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-⟶
-
-
-
-**46. The goal is to find the unmixing matrix W=A−1.**
-
-⟶
-
-
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-⟶
-
-
-
-**48. Write the probability of x=As=W−1s as:**
-
-⟶
-
-
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-⟶
-
-
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-⟶
-
-
-
-**51. The Machine Learning cheatsheets are now available in Hindi.**
-
-⟶
-
-
-
-**52. Original authors**
-
-⟶
-
-
-
-**53. Translated by X, Y and Z**
-
-⟶
-
-
-
-**54. Reviewed by X, Y and Z**
-
-⟶
-
-
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-**57. [Dimension reduction, PCA, ICA]**
-
-⟶
diff --git a/id/cs-230-convolutional-neural-networks.md b/id/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..c72c81736
--- /dev/null
+++ b/id/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,715 @@
+**Convolutional Neural Networks translation**
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶Cheatsheet Convolutional Neural Network
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶CS 230 - Deep Learning
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶[Intisari, Struktur arsitektur]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶[Jenis-jenis layer, Konvolusi, Pooling, Fully connected]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶[Hiperparameter filter, Dimensi, Stride, Padding]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶[Pengaturan hiperparameter, Kompatibilitas parameter, Kompleksitas model, Receptive field]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶[Fungsi-fungsi aktivasi, Rectified Linear Unit, Softmax]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶[Deteksi objek, Tipe-tipe model, Deteksi, Perbandingan Irisan terhadap Gabungan, Non-max suppression, YOLO, R-CNN]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶[Verifikasi/pengenalan wajah, Pembelajaran Satu-kali, Jaringan Siamese, Rugi-rugi triplet]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶[Transfer neural style, Aktivasi, Matriks style, Fungsi kos style/konten]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶[Arsitektur trik komputasional, Generative Adversarial Net, ResNet, Inception Network]
+
+
+
+
+**12. Overview**
+
+⟶Ringkasan
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶Arsitektur dari sebuah CNN tradisional - Convolutional neural network (jaringan saraf konvolusional), juga dikenal sebagai CNN, adalah sebuah tipe khusus dari jaringan saraf tiruan (neural network) yang secara umum terdiri dari lapisan-lapisan (layer) seperti berikut ini:
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶Lapisan konvolusi dan lapisan pooling dapat disesuaikan terhadap hiperparameter yang dijelaskan pada bagian selanjutnya.
+
+
+
+
+**15. Types of layer**
+
+⟶Jenis-jenis lapisan
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶Lapisan konvolusi (CONV) - Lapisan konvolusi (CONV) menggunakan banyak filter dalam proses (operasi) konvolusi ketika CONV memindai masukan (input) I dengan memperhatikan dimensinya. Hiperparameter dari CONV meliputi ukuran filter F dan stride S. Luaran yang dihasilkan O disebut peta ciri (feature map) ataupun peta aktivasi (activation map).
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶Catatan: tahap konvolusi dapat digeneralisasi juga dalam kasus 1D dan 3D.
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶Pooling (POOL) - Lapisan pooling (POOL) adalah sebuah operasi downsampling, biasanya dilakukan setelah lapisan konvolusi, yang menghasilkan invarians spasial. Secara khusus, pooling maks dan pooling rata-rata adalah jenis pooling yang masing-masing mengambil nilai maksimum dan nilai rata-rata.
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶[Jenis, Tujuan, Ilustrasi, Komentar]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶[Pooling maks, Pooling Rerata, Setiap operasi pooling memilih nilai maksimum dari tampilan terkini, Setiap operasi pooling mencari nilai rata-rata dari tampilan terkini]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶[Mempertahankan ciri (fitur) yang terdeteksi, Paling sering digunakan, Melakukan downsampling pada peta ciri, Dipakai di LeNet]
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶Fully Connected (FC) - Lapisan fully connected (FC) digunakan untuk masukan yang telah disusun menjadi 1D, sehingga setiap masukan terhubung ke seluruh neuron. Bila ada, lapisan FC biasanya diletakkan pada akhir dari arsitektur CNN dan dapat digunakan untuk mengoptimalkan objektif (tujuan) seperti skor (nilai) pada kelas (pada kasus klasifikasi).
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶Hiperparameter pada filter
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.**
+
+⟶Lapisan konvolusi berisi filter, sehingga penting untuk memahami makna dari hiperparameter filter tersebut.
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶Dimensi dari sebuah filter ― Sebuah filter berukuran F×F yang diterapkan pada sebuah masukan (input) dengan C channel adalah sebuah volume F×F×C yang melakukan konvolusi pada masukan berukuran I×I×C dan menghasilkan sebuah peta ciri (feature map) luaran, juga disebut peta aktivasi (activation map), berukuran O×O×1.
+
+
+
+
+**26. Filter**
+
+⟶Filter
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶Catatan: penggunaan filter sejumlah K dengan ukuran F×F akan menghasilkan sebuah luaran berupa peta ciri dengan ukuran O×O×K.
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶Stride ― Dalam operasi konvolusi atau pooling, stride S melambangkan banyaknya pixel yang dilewati oleh jendela (window) setelah setiap operasi.
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶Zero-padding ― Zero-padding melambangkan proses penambahan nilai 0 sebanyak P pada setiap sisi tepi masukan. Nilai ini dapat ditentukan secara manual atau diatur secara otomatis melalui salah satu dari tiga mode yang dijelaskan di bawah ini:
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶[Mode, Nilai, Ilustrasi, Tujuan, Valid, Sama, Penuh]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶[Tanpa padding, Membuang konvolusi terakhir jika dimensi tidak sesuai, Padding sedemikian sehingga peta ciri (feature map) memiliki ukuran ⌈I/S⌉, Ukuran luaran mudah dihitung secara matematis, Juga disebut 'half' padding, Padding maksimum sehingga konvolusi di ujung diterapkan pada batas masukan, Filter 'melihat' masukan dari ujung ke ujung]
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶Pengaturan hiperparameter
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶Kompatibilitas parameter pada lapisan konvolusi ― Dengan menotasikan I sebagai panjang dari ukuran volume masukan, F sebagai panjang dari filter, P sebagai banyaknya zero padding, dan S sebagai stride, maka ukuran luaran O dari peta ciri (feature map) pada dimensi tersebut diberikan oleh:
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶[Masukan, Filter, Luaran]
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶Catatan: sering kali, Pstart=Pend≜P, pada kasus tersebut kita dapat mengganti Pstart+Pend dengan 2P pada formula di atas.
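Reviewer note on items 33 to 35: the formula itself is not part of the translated strings, so the sketch below assumes the usual relation O = (I − F + Pstart + Pend)/S + 1. The function name `conv_output_size` and the use of floor division when the stride does not divide evenly are assumptions made for illustration only.

```python
def conv_output_size(I: int, F: int, S: int = 1, P_start: int = 0, P_end: int = 0) -> int:
    """Output length O along one dimension.

    Assumed relation: O = floor((I - F + P_start + P_end) / S) + 1.
    The floor mirrors the 'valid' behaviour of dropping the last window
    when the dimensions do not match.
    """
    if S < 1:
        raise ValueError("stride S must be at least 1")
    return (I - F + P_start + P_end) // S + 1


print(conv_output_size(I=32, F=5))                      # 28
print(conv_output_size(I=32, F=3, P_start=1, P_end=1))  # 32
print(conv_output_size(I=7, F=3, S=2))                  # 3
```

With Pstart = Pend = P, the second call is the 2P case mentioned in item 35.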
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶Memahami kompleksitas dari model ― Untuk menilai kompleksitas dari sebuah model, sering kali berguna untuk menentukan banyaknya parameter yang dimiliki arsitekturnya. Pada sebuah lapisan dari convolutional neural network, hal tersebut dilakukan sebagai berikut:
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶[Ilustrasi, Ukuran masukan, Ukuran luaran, Banyaknya parameter, Catatan]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶[Operasi pooling yang dilakukan pada tiap kanal (channel-wise), Pada banyak kasus, S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶[Masukan diratakan, satu parameter bias untuk setiap neuron, Jumlah dari neuron FC adalah terbebas dari batasan struktural]
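Reviewer note on items 36 to 40: the table cells with the actual counts are not among the strings, so the following is a minimal sketch using the standard counts implied by the remarks (one bias per filter, no learnable pooling parameters, one bias per FC neuron). The helper names are hypothetical.

```python
def conv_params(F: int, C: int, K: int) -> int:
    """K filters of size F x F x C, plus one bias parameter per filter."""
    return (F * F * C + 1) * K


def pool_params() -> int:
    """Pooling is applied channel-wise and has no learnable parameters."""
    return 0


def fc_params(n_in: int, n_out: int) -> int:
    """One weight per (flattened input, neuron) pair, plus one bias per neuron."""
    return (n_in + 1) * n_out


print(conv_params(F=3, C=3, K=16))    # (3*3*3 + 1) * 16 = 448
print(pool_params())                  # 0
print(fc_params(n_in=128, n_out=10))  # (128 + 1) * 10 = 1290
```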
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶Receptive field ― Receptive field pada lapisan k adalah area dari masukan, dinotasikan Rk×Rk, yang dapat 'dilihat' oleh setiap pixel dari peta aktivasi ke-k. Dengan menyebut Fj sebagai ukuran filter dari lapisan j dan Si sebagai nilai stride dari lapisan i, serta dengan konvensi S0=1, receptive field pada lapisan k dapat dihitung dengan formula:
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶Pada contoh di bawah ini, kita memiliki F1=F2=3 dan S1=S2=1, yang menghasilkan R2=1+2⋅1+2⋅1=5.
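Reviewer note on items 41 and 42: the formula is not included in the strings. The sketch below assumes the form R_k = 1 + Σ_{j=1..k} (F_j − 1) · Π_{i=0..j−1} S_i, which reproduces the worked example in item 42. The function name is hypothetical.

```python
def receptive_field(filter_sizes, strides):
    """Receptive field R_k after k layers.

    filter_sizes[j-1] is F_j and strides[j-1] is S_j for layers j = 1..k.
    Assumed formula: R_k = 1 + sum_j (F_j - 1) * prod_{i<j} S_i, with S_0 = 1.
    """
    r = 1
    jump = 1  # running product S_0 * S_1 * ... * S_{j-1}
    for F_j, S_j in zip(filter_sizes, strides):
        r += (F_j - 1) * jump
        jump *= S_j
    return r


print(receptive_field([3, 3], [1, 1]))  # 1 + 2*1 + 2*1 = 5
print(receptive_field([3, 3], [2, 1]))  # 1 + 2*1 + 2*2 = 7
```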
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶Fungsi-fungsi aktivasi yang biasa dipakai
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶Rectified Linear Unit ― Lapisan rectified linear unit (ReLU) adalah sebuah fungsi aktivasi g yang digunakan pada seluruh elemen volume. Lapisan ini bertujuan untuk memasukkan non-linearitas ke dalam jaringan. Variasi-variasi ReLU dirangkum pada tabel di bawah ini:
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶[ReLU, Leaky ReLU, ELU, dengan]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶[Kompleksitas non-linearitas yang dapat ditafsirkan secara biologis, Mengatasi masalah dying ReLU untuk nilai negatif, Dapat diturunkan (differentiable) di mana pun]
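Reviewer note on items 44 to 46: the table formulas are not in the strings, so this sketch uses the commonly cited piecewise definitions. The default values of `eps` and `alpha` are illustrative assumptions, not values taken from the cheatsheet.

```python
import math


def relu(z: float) -> float:
    """ReLU: g(z) = max(0, z)."""
    return max(0.0, z)


def leaky_relu(z: float, eps: float = 0.01) -> float:
    """Leaky ReLU: g(z) = max(eps * z, z) with a small eps > 0."""
    return max(eps * z, z)


def elu(z: float, alpha: float = 1.0) -> float:
    """ELU (standard piecewise form): z for z > 0, alpha * (e^z - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), elu(z))
```

The small negative slope of the leaky variant is what keeps a gradient flowing for negative inputs, which is the "dying ReLU" remark in item 46.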
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶Softmax ― Langkah softmax dapat dilihat sebagai sebuah fungsi logistik yang digeneralisasi, yang menerima masukan berupa vektor skor x∈Rn dan mengeluarkan vektor probabilitas p∈Rn melalui sebuah fungsi softmax pada akhir arsitektur. Softmax didefinisikan sebagai berikut:
+
+
+
+
+**48. where**
+
+⟶Di mana
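Reviewer note on items 47 and 48: a minimal sketch of the softmax step, assuming the usual definition p_i = e^{x_i} / Σ_j e^{x_j}. Subtracting the maximum score before exponentiating is a numerical-stability choice added here and does not change the result.

```python
import math


def softmax(x):
    """Map a list of scores x to a list of probabilities p with sum(p) == 1."""
    m = max(x)                           # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([2.0, 1.0, 0.1])
print(p)       # the largest score receives the largest probability
print(sum(p))  # approximately 1.0
```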
+
+
+
+
+**49. Object detection**
+
+⟶Deteksi objek
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶Tipe-tipe model - Ada tiga tipe utama dari algoritma rekognisi objek, yang mana hakikat yang diprediksi tersebut berbeda. Tipe-tipe tersebut dijelaskan pada tabel di bawah ini:
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶[Klasifikasi gambar, Klasifikasi dengan lokalisasi, Deteksi]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶[Boneka beruang, Buku]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶[Mengklasifikasikan sebuah gambar, Memprediksi probabilitas dari objek, Mendeteksi objek pada sebuah gambar, Memprediksi probabilitas dari objek dan lokasinya pada gambar, Mendeteksi hingga beberapa objek pada sebuah gambar, Memprediksi probabilitas dari objek-objek dan dimana lokasi mereka]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶[CNN tradisional, Simplified YOLO, R-CNN, YOLO, R-CNN]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶Deteksi ― Dalam konteks deteksi objek, metode yang berbeda digunakan tergantung pada apakah kita hanya ingin mengetahui lokasi objek atau mendeteksi bentuk yang lebih rumit pada gambar. Dua metode utama dirangkum pada tabel di bawah ini:
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶[Deteksi bounding box, Deteksi landmark]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶[Mendeteksi bagian dari gambar di mana objek berada, Mendeteksi bentuk atau karakteristik dari sebuah objek (contoh: mata), Lebih granular]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶[Box dengan pusat (bx,by), tinggi bh dan lebar bw, Titik referensi (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶Intersection over Union ― Intersection over Union, juga dikenal sebagai IoU, adalah sebuah fungsi yang mengkuantifikasi seberapa tepat posisi sebuah bounding box prediksi Bp terhadap bounding box yang sebenarnya Ba. IoU didefinisikan sebagai berikut:
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶Perlu diperhatikan: kita selalu memiliki nilai IoU∈[0,1]. Berdasarkan konvensi, sebuah bounding box prediksi Bp dianggap cukup bagus jika IoU(Bp,Ba)⩾0.5.
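Reviewer note on items 59 and 60: a minimal sketch of IoU as intersection area divided by union area. It assumes boxes given as corner tuples `(x1, y1, x2, y2)`, which is a convenience for this example; the cheatsheet describes boxes by center, height and width.

```python
def iou(box_p, box_a):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_p[0], box_a[0])
    y1 = max(box_p[1], box_a[1])
    x2 = min(box_p[2], box_a[2])
    y2 = min(box_p[3], box_a[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    union = area_p + area_a - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0
```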
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶Anchor boxes ― Anchor boxing adalah sebuah teknik yang digunakan untuk memprediksi bounding box yang overlap. Pada pengaplikasiannya, network diperbolehkan untuk memprediksi lebih dari satu box secara bersamaan, dimana setiap prediksi box dibatasi untuk memiliki kumpulan properti geometri. Contohnya, prediksi pertama dapat berupa sebuah box persegi panjang untuk sebuah bentuk, sedangkan prediksi kedua adalah persegi panjang lainnya dengan bentuk geometri yang berbeda.
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶Non-max suppression ― Teknik non-max suppression bertujuan untuk menghapus duplikasi bounding box yang saling overlap dari sebuah objek yang sama dengan memilih box yang paling representatif. Setelah menghapus seluruh box dengan probabilitas prediksi lebih kecil dari 0.6, langkah-langkah berikut diulang selama masih terdapat box yang tersisa:
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶[Untuk sebuah kelas, Langkah 1: Pilih box dengan probabilitas prediksi tertinggi., Langkah 2: Singkirkan box mana pun yang memiliki IoU⩾0.5 dengan box yang dipilih pada langkah 1.]
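Reviewer note on items 62 and 63: the two steps can be sketched as follows for a single class, using the 0.6 and 0.5 thresholds quoted in the strings. The corner box format `(x1, y1, x2, y2)` and the function names are assumptions for illustration.

```python
def iou(b1, b2):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    w = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    h = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = w * h
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept for one class."""
    # Preliminary filter: drop boxes with probability lower than the threshold.
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    while candidates:
        best = candidates.pop(0)  # Step 1: box with the largest probability
        kept.append(best)
        # Step 2: discard boxes with IoU >= threshold against the picked box
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 9, 9)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In the example, box 1 overlaps box 0 with IoU of about 0.68 and is discarded, while box 3 never passes the 0.6 probability filter.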
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶[Prediksi-prediksi box, Seleksi box dari probabilitas tertinggi, Penghapusan overlap pada kelas yang sama, Bounding box akhir]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶YOLO - You Only Look Once (YOLO) adalah sebuah algoritma deteksi objek yang melakukan langkah-langkah berikut:
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶[Langkah 1: Bagi gambar masukan ke dalam sebuah grid berukuran G×G., Langkah 2: Untuk setiap sel grid, jalankan sebuah CNN yang memprediksi y dengan bentuk sebagai berikut:, diulang sebanyak k kali]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶di mana pc adalah probabilitas terdeteksinya sebuah objek, bx,by,bh,bw adalah properti dari bounding box yang terdeteksi, c1,...,cp adalah representasi one-hot dari kelas mana di antara p kelas yang terdeteksi, dan k adalah jumlah anchor box.
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶Langkah 3: Jalankan algoritma non-max suppression untuk menghapus bounding box duplikat yang mungkin saling overlap.
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶[Gambar asli, Pembagian ke dalam grid berukuran GxG, Prediksi bounding box, Non-max suppression]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶Perlu diperhatikan: ketika pc=0, maka network tidak mendeteksi objek apa pun. Pada kasus seperti itu, prediksi yang bersangkutan bx,...,cp harus diabaikan.
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+⟶R-CNN ― Region with Convolutional Neural Networks (R-CNN) adalah sebuah algoritma deteksi objek yang pertama-tama mensegmentasi gambar untuk menemukan bounding box yang berpotensi relevan, kemudian menjalankan algoritma deteksi untuk menemukan objek yang paling mungkin pada bounding box tersebut.
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶[Gambar asli, Segmentasi, Prediksi bounding box, Non-max suppression]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶Perlu diperhatikan: meskipun algoritma asli membutuhkan sumber daya komputasi yang besar dan lambat, arsitektur yang lebih baru, seperti Fast R-CNN dan Faster R-CNN, memungkinkan algoritma ini berjalan lebih cepat.
+
+
+
+
+**74. Face verification and recognition**
+
+⟶Verifikasi dan rekognisi wajah
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in the table below:**
+
+⟶Jenis-jenis model ― Dua jenis model utama dirangkum pada tabel di bawah ini:
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶[Verifikasi wajah, Rekognisi wajah, Query, Referensi, Database]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶[Apakah ini adalah orang yang sesuai?, One-to-one lookup, Apakah ini salah satu dari K orang pada database?, One-to-many lookup]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶One Shot Learning ― One Shot Learning adalah sebuah algoritma verifikasi wajah yang menggunakan sebuah training set yang terbatas untuk belajar fungsi kemiripan yang mengkuantifikasi seberapa berbeda dua gambar yang diberikan. Fungsi kemiripan yang diaplikasikan pada dua gambar sering dinotasikan sebagai d(image 1,image 2).
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶Siamese Network ― Siamese Networks didesain untuk mengkodekan gambar dan mengkuantifikasi seberapa berbeda dua buah gambar. Untuk sebuah gambar masukan x(i), keluaran yang dikodekan sering dinotasikan sebagai f(x(i)).
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶Triplet loss ― Triplet loss ℓ adalah sebuah fungsi loss yang dihitung pada representasi embedding dari sebuah triplet gambar A (anchor), P (positif) dan N (negatif). Sampel anchor dan positif berasal dari kelas yang sama, sedangkan sampel negatif berasal dari kelas yang lain. Dengan menuliskan α∈R+ sebagai parameter margin, fungsi loss ini didefinisikan sebagai berikut:
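Reviewer note on item 80: the formula is not in the strings. This sketch assumes the usual form ℓ(A,P,N) = max(d(A,P) − d(A,N) + α, 0), with d taken as the squared Euclidean distance between embeddings. Both the choice of d and the default margin are assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(d(A, P) - d(A, N) + alpha, 0) on embeddings f(A), f(P), f(N)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # 0.0: negative is far enough
print(triplet_loss(anchor, positive, [0.0, 1.0]))  # 0.2: negative as close as positive
```

The loss reaches zero only once the negative is farther from the anchor than the positive by at least the margin α.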
+
+
+
+
+**81. Neural style transfer**
+
+⟶Transfer neural style
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶Motivasi ― Tujuan dari neural style transfer adalah untuk menghasilkan sebuah gambar G berdasarkan sebuah konten C dan sebuah style S.
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶[Konten C, Style S, gambar yang dihasilkan G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶Aktivasi ― Pada sebuah layer l, aktivasi dinotasikan sebagai a[l] dan berdimensi nH×nw×nc
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶Fungsi cost konten ― Fungsi cost konten Jcontent(C,G) digunakan untuk menentukan perbedaan antara gambar yang dihasilkan G dan gambar konten asli C. Fungsi ini didefinisikan sebagai berikut:
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶Matriks style ― Matriks style G[l] dari sebuah layer l adalah sebuah matriks Gram di mana setiap elemennya G[l]kk′ mengkuantifikasi seberapa besar korelasi antara channel k dan k′. Matriks style didefinisikan terhadap aktivasi a[l] sebagai berikut:
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶Perlu diperhatikan: matriks style untuk gambar style dan gambar yang dihasilkan masing-masing dituliskan sebagai G[l] (S) dan G[l] (G).
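Reviewer note on items 86 and 87: a minimal sketch of the Gram matrix, assuming the usual definition G[k][k'] = Σ_i Σ_j a[i][j][k] · a[i][j][k'] over the nH×nw spatial positions. The activation is represented as a nested list indexed `[height][width][channel]`, which is an illustrative choice.

```python
def style_matrix(a):
    """Gram matrix of an activation volume a with shape (n_H, n_w, n_c)."""
    n_h, n_w, n_c = len(a), len(a[0]), len(a[0][0])
    G = [[0.0] * n_c for _ in range(n_c)]
    for i in range(n_h):
        for j in range(n_w):
            for k in range(n_c):
                for kp in range(n_c):
                    # Accumulate the product of channels k and k' at position (i, j).
                    G[k][kp] += a[i][j][k] * a[i][j][kp]
    return G


activation = [[[1.0, 2.0], [3.0, 4.0]]]  # n_H = 1, n_w = 2, n_c = 2
print(style_matrix(activation))          # [[10.0, 14.0], [14.0, 20.0]]
```

The same function would be applied to the activations of the style image and of the generated image to obtain the two matrices mentioned in item 87.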
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶Fungsi cost style ― Fungsi cost style Jstyle(S,G) digunakan untuk menentukan perbedaan antara gambar yang dihasilkan G dengan style S. Fungsi tersebut didefinisikan sebagai berikut:
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶Fungsi cost keseluruhan ― Fungsi cost keseluruhan didefinisikan sebagai sebuah kombinasi dari fungsi cost konten dan style, yang dibobotkan oleh parameter α,β, sebagai berikut:
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶Perlu diperhatikan: semakin tinggi nilai α akan membuat model lebih memperhatikan konten, sedangkan semakin tinggi nilai β akan membuat model lebih memperhatikan style.
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶Arsitektur yang menggunakan trik komputasi
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+⟶Generative Adversarial Network ― Generative adversarial networks, juga dikenal sebagai GANs, terdiri dari sebuah model generatif dan sebuah model diskriminatif, di mana model generatif bertujuan menghasilkan keluaran yang semirip mungkin dengan aslinya untuk diberikan kepada model diskriminatif, yang bertujuan membedakan gambar hasil generasi dari gambar yang sebenarnya.
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶[Training, Noise, Gambar real-world, Generator, Discriminator, Real Fake]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶Perlu diperhatikan: penggunaan dari variasi GANs meliputi pengubahan teks ke gambar, serta pembuatan dan sintesis musik.
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ResNet ― Arsitektur Residual Network (juga disebut ResNet) menggunakan blok-blok residual dengan jumlah layer yang banyak untuk mengurangi training error. Blok residual memiliki persamaan karakteristik sebagai berikut:
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶Inception Network ― Arsitektur ini menggunakan modul inception dan bertujuan mencoba berbagai konvolusi yang berbeda untuk meningkatkan performanya melalui diversifikasi fitur. Secara khusus, arsitektur ini menggunakan trik konvolusi 1×1 untuk membatasi beban komputasi.
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶Deep Learning cheatsheet sekarang tersedia di [Bahasa Indonesia]
+
+
+
+
+**98. Original authors**
+
+⟶Penulis asli
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶Diterjemahkan oleh X, Y dan Z
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶Diulas oleh X, Y dan Z
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶Lihat versi PDF pada GitHub
+
+
+
+
+**102. By X and Y**
+
+⟶Oleh X dan Y
+
+
diff --git a/id/refresher-linear-algebra.md b/id/refresher-linear-algebra.md
new file mode 100644
index 000000000..ecc0222db
--- /dev/null
+++ b/id/refresher-linear-algebra.md
@@ -0,0 +1,339 @@
+**1. Linear Algebra and Calculus refresher**
+
+⟶Ulasan Aljabar Linier dan Kalkulus
+
+
+
+**2. General notations**
+
+⟶Notasi-notasi umum
+
+
+
+**3. Definitions**
+
+⟶Definisi-definisi
+
+
+
+**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
+
+⟶Vektor ― Kita mendefinisikan x∈Rn sebagai sebuah vektor yang memiliki n elemen, dengan xi∈R adalah elemen ke-i:
+
+
+
+**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
+
+⟶Matriks ― Kita mendefinisikan A∈Rm×n sebagai sebuah matriks yang memiliki baris sebanyak m dan kolom sebanyak n, dengan Ai,j∈R adalah sebuah elemen yang berada pada baris ke-i dan kolom ke-j:
+
+
+
+**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
+
+⟶Catatan: vektor x yang didefinisikan di atas dapat juga dianggap sebagai matriks n×1 dan secara lebih khusus disebut vektor kolom.
+
+
+
+**7. Main matrices**
+
+⟶Matriks-matriks utama
+
+
+
+**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
+
+⟶Matriks identitas ― Matriks identitas I∈Rn×n adalah matriks bujursangkar dengan elemen diagonal bernilai 1 dan sisanya bernilai 0:
+
+
+
+**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
+
+⟶Catatan: untuk semua matriks A∈Rn×n, kita memiliki A×I=I×A=A.
+
+
+
+**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
+
+⟶Matriks diagonal ― Matriks diagonal D∈Rn×n adalah sebuah matriks bujursangkar dengan elemen diagonal bernilai tidak nol dan sisanya bernilai nol:
+
+
+
+**11. Remark: we also note D as diag(d1,...,dn).**
+
+⟶Catatan: kita juga melambangkan D sebagai diag(d1,...,dn).
+
+
+
+**12. Matrix operations**
+
+⟶Operasi matriks
+
+
+
+**13. Multiplication**
+
+⟶Perkalian
+
+
+
+**14. Vector-vector ― There are two types of vector-vector products:**
+
+⟶Vektor-vektor ― Terdapat dua tipe perkalian dari vektor-vektor:
+
+
+
+**15. inner product: for x,y∈Rn, we have:**
+
+⟶inner product (hasil kali dalam): untuk x,y∈Rn, kita memiliki:
+
+
+
+**16. outer product: for x∈Rm,y∈Rn, we have:**
+
+⟶outer product (hasil kali luar): untuk x∈Rm,y∈Rn, kita memiliki:
+
+
+
+**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
+
+⟶Matriks-vektor ― Hasil kali dari matriks A∈Rm×n dan vektor x∈Rn adalah sebuah vektor dengan ukuran Rm, sehingga:
+
+
+
+**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
+
+⟶di mana aTr,i adalah baris vektor dan ac,j adalah kolom vektor dari matriks A, dan xi adalah elemen dari x.
+
+
+
+**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
+
+⟶Matriks-matriks ― Hasil kali dari matriks A∈Rm×n dan matriks B∈Rn×p adalah sebuah matriks dengan ukuran Rm×p, sehingga:
+
+
+
+**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
+
+⟶di mana aTr,i,bTr,i adalah baris vektor dan ac,j,bc,j adalah kolom vektor dari masing-masing matriks A dan B
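Reviewer note on items 17 to 20: a minimal sketch of the matrix-matrix product on nested lists, included mainly to make the result shape explicit: multiplying an m×n matrix by an n×p matrix gives an m×p matrix. A matrix-vector product is the special case p = 1. The function name `matmul` is hypothetical.

```python
def matmul(A, B):
    """Product of A (m x n) and B (n x p), returned as an m x p nested list."""
    m, n, p = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions do not match")
    # Entry (i, j) is the inner product of row i of A with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]         # 2 x 3
x = [[1], [0], [2]]     # 3 x 1 column vector
print(matmul(A, x))     # [[7], [16]], a 2 x 1 result
```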
+
+
+
+**21. Other operations**
+
+⟶Operasi lainnya
+
+
+
+**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
+
+⟶Transpos ― Transpos matriks A∈Rm×n, yang dilambangkan dengan AT, adalah matriks yang elemennya terbalik:
+
+
+
+**23. Remark: for matrices A,B, we have (AB)T=BTAT**
+
+⟶Catatan: untuk matriks A,B, kita memiliki (AB)T=BTAT
+
+
+
+**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
+
+⟶Invers ― Invers dari sebuah matriks bujursangkar terbalikkan A dilambangkan dengan A−1 dan merupakan satu-satunya matriks yang memenuhi:
+
+
+
+**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
+
+⟶Catatan: tidak semua matriks bujursangkar terbalikkan. Selain itu, untuk matriks A,B, kita memiliki (AB)−1=B−1A−1
+
+
+
+**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
+
+⟶Runutan ― Runutan matriks bujursangkar A, dilambangkan dengan tr(A), adalah jumlah dari entri diagonal tersebut:
+
+
+
+**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
+
+⟶Catatan: untuk matriks A,B, kita memiliki tr(AT)=tr(A) dan tr(AB)=tr(BA)
+
+
+
+**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
+
+⟶Determinan - Determinan matriks bujursangkar A∈Rn×n, yang dilambangkan dengan |A| atau det(A) dinyatakan secara rekursif dalam konteks A∖i,∖j, yang mana matriks A tanpa baris ke-i dan kolom ke-j, sebagai berikut:
+
+
+
+**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
+
+⟶Catatan: A adalah terbalikkan jika dan hanya jika |A|≠0. Serta, |AB|=|A||B| dan |AT|=|A|.
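Reviewer note on items 28 and 29: the recursive definition can be sketched as a cofactor expansion along the first row, where the minor is A without its first row and j-th column. Expanding along the first row is an assumption about the exact form of the omitted formula. This approach is exponential in n and is meant only to illustrate the definition.

```python
def det(A):
    """Determinant of a square nested-list matrix by cofactor expansion."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        # Minor: remove row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total


print(det([[1, 2], [3, 4]]))                   # -2
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))  # 24
print(det([[1, 2], [2, 4]]))                   # 0, so this matrix is not invertible
```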
+
+
+
+**30. Matrix properties**
+
+⟶Sifat-sifat matriks
+
+
+
+**31. Definitions**
+
+⟶Definisi
+
+
+
+**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
+
+⟶Dekomposisi simetrik ― Sebuah matriks A dapat dinyatakan dari segi simetrik dan antisimetrik ke dalam bentuk sebagai berikut:
+
+
+
+**33. [Symmetric, Antisymmetric]**
+
+⟶[Simetrik, Antisimetrik]
+
+
+
+**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+
+⟶Norma ― Norma adalah sebuah fungsi N:V⟶[0,+∞[ di mana V adalah sebuah ruang vektor, sedemikian sehingga untuk semua x,y∈V, kita memiliki:
+
+
+
+**35. N(ax)=|a|N(x) for a scalar**
+
+⟶N(ax)=|a|N(x) untuk sebuah skalar
+
+
+
+**36. if N(x)=0, then x=0**
+
+⟶Jika N(x)=0, maka x=0
+
+
+
+**37. For x∈V, the most commonly used norms are summed up in the table below:**
+
+⟶Untuk x∈V, norma yang paling sering digunakan, dijabarkan pada tabel di bawah ini:
+
+
+
+**38. [Norm, Notation, Definition, Use case]**
+
+⟶[Norma, Notasi, Definisi, Use case]
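Reviewer note on items 37 and 38: the table body is not among the strings, so this sketch assumes it covers the usual p-norms (with p = 1 and p = 2 as the Manhattan and Euclidean cases) and the infinity norm. The function names are hypothetical.

```python
def norm_p(x, p):
    """p-norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def norm_inf(x):
    """Infinity norm: max_i |x_i|."""
    return max(abs(v) for v in x)


x = [3.0, -4.0]
print(norm_p(x, 1))  # 7.0 (Manhattan)
print(norm_p(x, 2))  # 5.0 (Euclidean)
print(norm_inf(x))   # 4.0
```

Scaling the input by a scalar a scales each of these by |a|, which is the homogeneity property listed in item 35.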
+
+
+
+**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
+
+⟶Tak-Bebas Linier - Sebuah himpunan vektor disebut tak-bebas linier jika salah satu vektor dalam himpunan vektor tersebut dapat didefinisikan sebagai kombinasi linier dari vektor-vektor lainnya.
+
+
+
+**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
+
+⟶Catatan: jika tidak ada vektor yang dapat ditulis seperti ini, maka vektor-vektor tersebut disebut bebas linier.
+
+
+
+**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
+
+⟶Pangkat matriks ― Pangkat dari sebuah matriks A dinotasikan dengan rank(A) dan merupakan dimensi dari ruang vektor yang dibangkitkan oleh kolom-kolomnya. Hal ini setara dengan jumlah maksimum kolom A yang bebas linier.
+
+
+
+**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
+
+⟶Matriks semi definit positif ― Sebuah matriks A∈Rn×n adalah semi definit positif (PSD) dan dinotasikan dengan A⪰0 jika kita memiliki:
+
+
+
+**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
+
+⟶Catatan: demikian pula, sebuah matriks A dikategorikan sebagai matriks definit positif, dan dinotasikan dengan A≻0, jika matriks tersebut adalah matriks PSD yang memenuhi ketentuan untuk seluruh vektor non-zero x, xTAx>0.
+
+
+
+**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶Eigennilai, eigenvektor - Untuk sebuah matriks A∈Rn×n, λ dikategorikan sebagai sebuah eigennilai dari A jika terdapat sebuah vektor z∈Rn∖{0}, disebut eigenvektor, sehingga kita mendapatkan:
+
+
+
+**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶Teorema spektral ― Diketahui A∈Rn×n. Jika A simetris, maka A terdiagonalkan oleh sebuah matriks ortogonal real U∈Rn×n. Dengan menotasikan Λ=diag(λ1,...,λn), kita mendapatkan:
+
+
+
+**46. diagonal**
+
+⟶diagonal
+
+
+
+**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
+
+⟶Dekomposisi nilai tunggal ― Untuk sebuah matriks A berdimensi m×n, dekomposisi nilai tunggal (SVD) adalah sebuah teknik pemfaktoran yang menjamin keberadaan matriks uniter U m×m, matriks diagonal Σ m×n dan matriks uniter V n×n, sehingga:
+
+
+
+**48. Matrix calculus**
+
+⟶Kalkulus matriks
+
+
+
+**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
+
+⟶Gradien - Diketahui f:Rm×n→R sebagai sebuah fungsi dan A∈Rm×n sebagai sebuah matriks. Gradien f terhadap A adalah matriks berdimensi m×n, dinotasikan ∇Af(A), sehingga:
+
+
+
+**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
+
+⟶Catatan: gradien f hanya dapat didefinisikan ketika f adalah sebuah fungsi yang mengembalikan sebuah skalar.
+
+
+
+**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
+
+⟶Hesse ― Diketahui f:Rn→R adalah sebuah fungsi dan x∈Rn adalah sebuah vektor. Matriks Hesse dari f terhadap x adalah sebuah matriks simetris n×n, dinotasikan ∇2xf(x), sehingga:
+
+
+
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+
+⟶Catatan: matriks Hesse dari f hanya dapat didefinisikan ketika f adalah sebuah fungsi yang mengembalikan sebuah skalar
+
+
+
+**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
+
+⟶Operasi gradien - Untuk matriks A, B, C, properti-properti gradient berikut layak untuk diingat:
+
+
+
+**54. [General notations, Definitions, Main matrices]**
+
+⟶[Notasi umum, Definisi, Matriks-matriks Utama]
+
+
+
+**55. [Matrix operations, Multiplication, Other operations]**
+
+⟶[Operasi matriks, Perkalian, Operasi lainnya]
+
+
+
+**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
+
+⟶[Properti matriks, Norm, Eigenvalue/Eigenvektor, Dekomposisi singular-value]
+
+
+
+**57. [Matrix calculus, Gradient, Hessian, Operations]**
+
+⟶[Kalkulus matriks, Gradien, Hesse, Operasi]
diff --git a/id/refresher-probability.md b/id/refresher-probability.md
new file mode 100644
index 000000000..92c190930
--- /dev/null
+++ b/id/refresher-probability.md
@@ -0,0 +1,381 @@
+**1. Probabilities and Statistics refresher**
+
+⟶Review Probabilitas dan Statistika
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶Pengenalan Probabilitas dan Kombinatorik
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶Ruang sampel - Himpunan dari semua hasil yang mungkin muncul dalam sebuah percobaan dikenal sebagai ruang sampel dari percobaan dan dinotasikan sebagai S.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶Kejadian - Setiap himpunan bagian E dari suatu ruang sampel disebut sebagai sebuah Kejadian. Dengan demikian, sebuah kejadian adalah sebuah himpunan yang berisikan hasil yang mungkin muncul dalam suatu percobaan. Jika suatu hasil percobaan termasuk di dalam E, maka dapat dikatakan bahwa kejadian E telah terjadi.
+
+
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶Aksioma dalam probabilitas. Untuk setiap kejadian E, kita notasikan P(E) sebagai probabilitas terjadinya kejadian E.
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+⟶Aksioma 1 - Setiap probabilitas bernilai antara (dan termasuk) 0 dan 1, yaitu:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+⟶Aksioma 2 - Probabilitas bahwa setidaknya satu dari kejadian elementer dalam keseluruhan ruang sampel akan terjadi, adalah 1, yaitu:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶Aksioma 3 - Untuk sebarang deretan kejadian yang saling eksklusif (terpisah) yaitu E1,...,En, kita memiliki:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶Permutasi - Sebuah permutasi adalah sebuah penyusunan sejumlah r objek yang berasal dari kumpulan n objek (dengan memperhatikan urutan). Banyaknya penyusunan tersebut dapat dituliskan sebagai P(n,r), yang didefinisikan sebagai:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶Kombinasi - Sebuah kombinasi adalah sebuah penyusunan sejumlah r objek yang berasal dari kumpulan n objek, tanpa memperhatikan urutan. Banyaknya penyusunan tersebut dapat dituliskan sebagai C(n,r), yang didefinisikan sebagai:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶Perlu diperhatikan: untuk 0⩽r⩽n, kita memiliki P(n,r)⩾C(n,r)
+
+
+
+**12. Conditional Probability**
+
+⟶Probabilitas Bersyarat
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶Aturan Bayes - Untuk kejadian A dan B, dengan P(B)>0, kita memiliki:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶Perlu diperhatikan: kita memiliki P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶Partisi - Diketahui {Ai,i∈[[1,n]]} sehingga untuk semua i berlaku, Ai≠∅. Kita menyatakan bahwa {Ai} adalah sebuah partisi jika kita memiliki:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+⟶Perlu diperhatikan: untuk sebarang kejadian B dalam ruang sampel, kita memiliki P(B)=n∑i=1P(B|Ai)P(Ai).
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶Bentuk lanjutan dari Aturan Bayes - Diketahui {Ai,i∈[[1,n]]} adalah sebuah partisi dari ruang sampel. Kita memiliki:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶Kebebasan - Dua kejadian A dan B adalah saling bebas, jika dan hanya jika kita memiliki:
+
+
+
+**19. Random Variables**
+
+⟶Variabel Acak
+
+
+
+**20. Definitions**
+
+⟶Definisi-definisi
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+⟶Variabel acak - Sebuah variabel acak, sering dituliskan sebagai X, adalah sebuah fungsi yang memetakan setiap elemen (unsur) dalam ruang sampel ke sebuah bilangan ril.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶Fungsi distribusi kumulatif (CDF) - Fungsi distribusi kumulatif F, yang tidak menurun secara monoton sehingga limx→−∞F(x)=0 dan limx→+∞F(x)=1, didefinisikan sebagai:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶Fungsi densitas probabilitas (PDF) - Fungsi densitas probabilitas f adalah probabilitas bahwa X memiliki nilai di antara dua realisasi yang berdekatan dari variabel acak tersebut.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶Hubungan antara PDF dan CDF - Berikut ini adalah properti (karakteristik) penting yang harus diketahui pada kasus diskrit (D) dan kontinu (C).
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶[Kasus, CDF F, PDF f, Properti dari PDF]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶Ekspektasi dan Momen dari Distribusi - Berikut ini adalah ekspresi dari nilai ekspektasi E[X], bentuk umum nilai ekspektasi E[g(X)], momen ke-k E[Xk] dan fungsi karakteristik ψ(ω) untuk kasus diskrit dan kontinu:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶Varians - Varians dari sebuah variabel acak, sering dituliskan sebagai Var(X) atau σ2, adalah sebuah ukuran penyebaran dari fungsi distribusinya. Varians ditentukan sebagai berikut:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶Standar deviasi - Standar deviasi dari sebuah variabel acak, sering dinyatakan sebagai σ, adalah sebuah ukuran penyebaran dari fungsi distribusi yang sama dengan satuan (unit) dari variabel acak tersebut. Standar deviasi ditentukan sebagai berikut:
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶Transformasi dari variabel acak - Diketahui bahwa variabel X dan Y dihubungkan oleh suatu fungsi. Dengan mendefinisikan fX dan fY sebagai fungsi distribusi dari X dan Y secara berurutan, kita memiliki:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶Aturan integral Leibniz - Diketahui g adalah sebuah fungsi dari x serta secara potensi c, dan batas a, b yang mungkin tergantung pada c. Kita memiliki:
+
+
+
+**32. Probability Distributions**
+
+⟶Distribusi probabilitas
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶Ketaksamaan Chebyshev - Misal X adalah sebuah variabel acak dengan nilai ekspektasi μ. Untuk k,σ>0, kita memiliki ketaksamaan sebagai berikut:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶Distribusi-distribusi yang utama - Berikut adalah distribusi-distribusi yang utama dan perlu diingat:
+
+
+
+**35. [Type, Distribution]**
+
+⟶[Tipe, Distribusi]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶Variabel Acak yang Terdistribusi Bersamaan
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+⟶Densitas marjinal dan distribusi kumulatif - Dari fungsi probabilitas densitas gabungan fXY, kita mendapatkan
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶[Kasus, Densitas marjinal, Fungsi kumulatif]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶Densitas Bersyarat - Densitas bersyarat dari X terhadap Y, sering dinotasikan sebagai fX|Y, didefinisikan sebagai berikut:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶Kebebasan - Dua variabel acak X dan Y disebut saling bebas jika kita memiliki:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶Kovarians - Kita definisikan kovarians dari dua variabel acak X dan Y, yang dinotasikan sebagai σ2XY atau lebih umumnya Cov(X,Y), sebagai berikut:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶Korelasi - Dengan σX,σY sebagai standar deviasi dari X dan Y, kita mendefinisikan korelasi antara variabel acak X dan Y, dinotasikan ρXY, sebagai berikut:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶Catatan 1: kita menyatakan bahwa untuk sebarang variabel acak X, Y, kita memiliki ρXY∈[−1,1].
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶Catatan 2: Jika X dan Y bersifat saling bebas, maka ρXY=0.
+
+
+
+**45. Parameter estimation**
+
+⟶Estimasi parameter
+
+
+
+**46. Definitions**
+
+⟶Definisi-definisi
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶Sampel acak - sebuah sampel acak adalah koleksi dari n variabel acak X1,...,Xn yang independen dan terdistribusi secara identik dengan X.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶Estimator - Sebuah estimator adalah sebuah fungsi dari data yang digunakan untuk menduga nilai dari sebuah parameter yang tidak diketahui pada sebuah model statistik.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶Bias - Bias dari sebuah estimator ^θ didefinisikan sebagai selisih antara nilai ekspektasi dari distribusi ^θ dan nilai yang sesungguhnya, yaitu:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶Perlu diperhatikan: sebuah estimator dikatakan tidak bias ketika kita memiliki E[^θ]=θ.
+
+
+
+**51. Estimating the mean**
+
+⟶Mengestimasi nilai rata-rata
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+⟶Rata-rata dari sampel - Nilai rata-rata sampel dari sebuah sampel acak digunakan untuk mengestimasi nilai rata-rata sesungguhnya μ dari sebuah distribusi, sering dinotasikan sebagai ¯¯¯¯¯X dan didefinisikan sebagai berikut:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
+
+⟶Perlu diperhatikan: nilai rata-rata sampel tidak bias, yaitu E[¯¯¯¯¯X]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶Teorema Limit Terpusat - Diketahui sebuah sampel acak X1, ..., Xn yang berasal dari distribusi dengan nilai rata-rata μ dan nilai varians σ2, maka kita memiliki:
+
+
+
+**55. Estimating the variance**
+
+⟶Mengestimasi nilai varians
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶Varians sampel - Varians dari sebuah sampel acak digunakan untuk mengestimasi nilai varians sesungguhnya σ2 dari sebuah distribusi, sering dituliskan sebagai s2 atau ^σ2 dan didefinisikan sebagai berikut:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶Perlu diperhatikan: varians sampel tidak bias, yaitu E[s2]=σ2.
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶Relasi Chi-Squared dengan varians sampel - Diketahui s2 adalah varians dari sebuah sampel acak. Kita memiliki:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶[Pendahuluan, Ruang sampel, Kejadian, Permutasi]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶[Probabilitas Bersyarat, Aturan Bayes, Kebebasan]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶[Variabel acak, definisi-definisi, Ekspektasi, Varians]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶[Distribusi-distribusi probabilitas, Ketaksamaan Chebyshev, Distribusi-distribusi utama]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶[Variabel acak yang terdistribusi bersamaan, Densitas, Kovarians, Korelasi]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶[Estimasi parameter, Rata-rata, Varians]
diff --git a/it/cs-229-linear-algebra.md b/it/cs-229-linear-algebra.md
new file mode 100644
index 000000000..4d05b1fc4
--- /dev/null
+++ b/it/cs-229-linear-algebra.md
@@ -0,0 +1,346 @@
+**Linear Algebra and Calculus translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-algebra-calculus)
+
+⟶ Algebra lineare e Analisi
+
+
+
+**1. Linear Algebra and Calculus refresher**
+
+⟶ Ripasso di Algebra lineare e Analisi
+
+
+
+**2. General notations**
+
+⟶ Notazione generale
+
+
+
+**3. Definitions**
+
+⟶ Definizioni
+
+
+
+**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
+
+⟶ Vettore ― Definiamo x∈Rn un vettore con n elementi, dove xi∈R è l'i-esimo elemento:
+
+
+
+**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
+
+⟶ Matrice ― Definiamo A∈Rm×n una matrice con m righe e n colonne, dove Ai,j∈R è l'elemento posizionato alla i-esima riga e j-esima colonna:
+
+
+
+**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
+
+⟶ Osservazione: il vettore x, definito precedentemente, può essere visto come una matrice n×1 ed è chiamato, più precisamente, vettore colonna.
+
+
+
+**7. Main matrices**
+
+⟶ Matrici principali
+
+
+
+**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
+
+⟶ matrice identità ― La matrice identità I∈Rn×n è una matrice quadrata con tutti 1 sulla diagonale principale e 0 in tutte le altre posizioni:
+
+
+
+**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
+
+⟶ Osservazione: per tutte le matrici A∈Rn×n, abbiamo che A×I=I×A=A.
+
+
+
+**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
+
+⟶ matrice diagonale ― Una matrice diagonale D∈Rn×n è una matrice quadrata con valori diversi da zero sulla diagonale principale e zero in tutte le altre posizioni:
+
+
+
+**11. Remark: we also note D as diag(d1,...,dn).**
+
+⟶ Osservazione: definiamo, inoltre, D come diag(d1,...,dn)
+
+
+
+**12. Matrix operations**
+
+⟶ Operazioni sulle matrici
+
+
+
+**13. Multiplication**
+
+⟶ Moltiplicazione
+
+
+
+**14. Vector-vector ― There are two types of vector-vector products:**
+
+⟶ Vettore-vettore ― Ci sono due tipi di prodotto vettore-vettore:
+
+
+
+**15. inner product: for x,y∈Rn, we have:**
+
+⟶ prodotto scalare: per x,y∈Rn, abbiamo che:
+
+
+
+**16. outer product: for x∈Rm,y∈Rn, we have:**
+
+⟶ prodotto esterno: per x∈Rm,y∈Rn, abbiamo che:
+
+
+
+**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
+
+⟶ Matrice-vettore ― Il prodotto di una matrice A∈Rm×n ed un vettore x∈Rn è un vettore di dimensione Rm, tale che:
+
+
+
+**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
+
+⟶ dove aTr,i sono i vettori riga, ac,j sono i vettori colonna di A e xi sono gli elementi di x.
+
+
+
+**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
+
+⟶ Matrice-matrice ― Il prodotto delle matrici A∈Rm×n e B∈Rn×p è una matrice di dimensione Rm×p, tale che:
+
+
+
+**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
+
+⟶ dove aTr,i,bTr,i sono i vettori riga e ac,j,bc,j sono i vettori colonna rispettivamente di A e di B
+
+
+
+**21. Other operations**
+
+⟶ Altre operazioni
+
+
+
+**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
+
+⟶ Trasposta — La trasposta di una matrice A∈Rm×n, indicata con AT, è tale che i suoi elementi sono scambiati:
+
+
+
+**23. Remark: for matrices A,B, we have (AB)T=BTAT**
+
+⟶ Osservazione: per le matrici A,B abbiamo che (AB)T=BTAT
+
+
+
+**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
+
+⟶ Inversa ― L'inversa di una matrice quadrata invertibile A è indicata con A−1 ed è l'unica matrice tale che:
+
+
+
+**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
+
+⟶ Osservazione: non tutte le matrici quadrate sono invertibili. Inoltre, per le matrici A,B, abbiamo che (AB)−1=B−1A−1
+
+
+
+**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
+
+⟶ Traccia — La traccia di una matrice quadrata A, indicata con tr(A), è la somma degli elementi sulla diagonale principale:
+
+
+
+**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
+
+⟶ Osservazione: per le matrici A,B, abbiamo che tr(AT)=tr(A) e tr(AB)=tr(BA)
+
+
+
+**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
+
+⟶ Determinante — Il determinante di una matrice quadrata A∈Rn×n, indicata con |A| o det(A) è espresso ricorsivamente rispetto a A\i,\j, che è la matrice A senza l'i-esima riga e la j-esima colonna, come segue:
+
+
+
+**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
+
+⟶ Osservazione: A è invertibile se e solo se |A|≠0. Inoltre, |AB|=|A||B| e |AT|=|A|.
+
+
+
+**30. Matrix properties**
+
+⟶ Proprietà delle matrici
+
+
+
+**31. Definitions**
+
+⟶ Definizioni
+
+
+
+**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
+
+⟶ Decomposizione simmetrica — Una matrice A può essere espressa tramite la sua componente simmetrica ed antisimmetrica come segue:
+
+
+
+**33. [Symmetric, Antisymmetric]**
+
+⟶ [Simmetrica, Antisimmetrica]
+
+
+
+**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+
+⟶ Norma ― Una norma è una funzione N:V⟶[0,+∞[ dove V è uno spazio vettoriale, tale che per ogni
+x,y∈V, abbiamo che:
+
+
+
+**35. N(ax)=|a|N(x) for a scalar**
+
+⟶ N(ax)=|a|N(x) per uno scalare
+
+
+
+**36. if N(x)=0, then x=0**
+
+⟶ se N(x)=0, allora x=0
+
+
+
+**37. For x∈V, the most commonly used norms are summed up in the table below:**
+
+⟶ Per x∈V, le norme più usate comunemente sono riassunte nella tabella seguente:
+
+
+
+**38. [Norm, Notation, Definition, Use case]**
+
+⟶ [Norma, Notazione, Definizione, Caso d'uso]
+
+
+
+**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
+
+⟶ Dipendenza lineare ― Un insieme di vettori è detto linearmente dipendente se uno dei vettori dell'insieme può essere definito come combinazione lineare degli altri.
+
+
+
+**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
+
+⟶ Osservazione: se nessun vettore può essere scritto in questo modo, allora i vettori sono detti linearmente indipendenti
+
+
+
+**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
+
+⟶ Rango di una matrice — Il rango di una data matrice A si indica rg(A) ed è la dimensione dello spazio vettoriale generato dalle sue colonne. Questo è equivalente al numero massimo di colonne linearmente indipendenti di A.
+
+
+
+**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
+
+⟶ Matrice semidefinita positiva — Una matrice A∈Rn×n è semidefinita positiva (PSD) ed è indicata da A⪰0, se abbiamo che:
+
+
+
+**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
+
+⟶ Osservazione: analogamente, una matrice A è detta definita positiva, ed è indicata con A≻0, se è una matrice PSD che soddisfa per ogni vettore x non nullo, xTAx>0.
+
+
+
+**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ Autovalore, autovettore — Data una matrice A∈Rn×n, si dice che λ è un autovalore di A, se esiste un vettore z∈Rn∖{0}, chiamato autovettore, tale che abbiamo:
+
+
+
+**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ Teorema spettrale ― Sia A∈Rn×n. Se A è simmetrica, allora A è diagonalizzabile da una matrice reale ortogonale U∈Rn×n. Chiamando Λ=diag(λ1,...,λn), abbiamo che:
+
+
+
+**46. diagonal**
+
+⟶ diagonale
+
+
+
+**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
+
+⟶ Decomposizione ai valori singolari — Per una data matrice A di dimensione m×n, la decomposizione ai valori singolari (SVD) è una tecnica di fattorizzazione che garantisce l'esistenza della matrice unitaria U m×m, della matrice diagonale Σ m×n e della matrice unitaria V n×n, tale che:
+
+
+
+**48. Matrix calculus**
+
+⟶ Calcolo matriciale
+
+
+
+**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
+
+⟶ Gradiente — Sia f:Rm×n→R una funzione e A∈Rm×n una matrice. Il gradiente di f in funzione di A è una matrice m×n, indicata con ∇Af(A), tale che:
+
+
+
+**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
+
+⟶ Osservazione: il gradiente di f è definito solamente quando f è una funzione che restituisce uno scalare.
+
+
+
+**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
+
+⟶ Matrice Hessiana — Sia f:Rn→R una funzione e x∈Rn un vettore. La matrice hessiana di f in funzione di x è una matrice simmetrica n×n, indicata con ∇2xf(x), tale che:
+
+
+
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+
+⟶ Osservazione: la matrice Hessiana di f è definita solamente quando f è una funzione che restituisce uno scalare
+
+
+
+**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
+
+⟶ Operazioni del gradiente — Per le matrici A,B,C, vale la pena ricordare le seguenti proprietà del gradiente:
+
+
+
+**54. [General notations, Definitions, Main matrices]**
+
+⟶ [Notazione generale, Definizioni, Matrici principali]
+
+
+
+**55. [Matrix operations, Multiplication, Other operations]**
+
+⟶ [Operazioni tra matrici, Moltiplicazione, Altre operazioni]
+
+
+
+**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
+
+⟶ [Proprietà delle matrici, Norma, Autovalore/Autovettore, Decomposizione ai valori singolari]
+
+
+
+**57. [Matrix calculus, Gradient, Hessian, Operations]**
+
+⟶ [Calcolo matriciale, Gradiente, Matrice Hessiana, Operazioni]
diff --git a/it/cs-229-probability.md b/it/cs-229-probability.md
new file mode 100644
index 000000000..7165a859e
--- /dev/null
+++ b/it/cs-229-probability.md
@@ -0,0 +1,385 @@
+**Probabilities and Statistics translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-probabilities-statistics)
+
+
+
+**1. Probabilities and Statistics refresher**
+
+⟶ Ripasso di Probabilità e Statistica
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶ Introduzione alla Probabilità e al Calcolo Combinatorio
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶ Spazio campionario ― L'insieme di tutti i risultati possibili di un esperimento è noto come spazio campionario dell'esperimento ed è denotato da S.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶ Evento ― Ogni sottoinsieme E dello spazio campionario è chiamato evento. Un evento è quindi un insieme di possibili risultati dell'esperimento. Se il risultato dell'esperimento è contenuto in E, diciamo che E è accaduto.
+
+
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶ Assiomi della probabilità ― Per ogni evento E, chiamiamo P(E) la probabilità che E accada.
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+⟶ Assioma 1 ― Ogni probabilità ha valore tra 0 e 1 inclusi, quindi:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+⟶ Assioma 2 ― La probabilità che almeno uno degli eventi elementari dell'intero spazio campionario avvenga è 1, quindi:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶ Assioma 3 ― Per ogni sequenza di eventi mutuamente esclusivi E1,...,En, abbiamo:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶ Permutazione ― Una permutazione è un raggruppamento di r oggetti fra n disponibili, in un ordine dato. Il numero di tali raggruppamenti è dato da P(n,r), definito come:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶ Combinazione ― Una combinazione è un raggruppamento di r oggetti fra n disponibili dove l'ordine non importa. Il numero di tali raggruppamenti è dato da C(n,r) definito come:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶ Osservazione: notiamo che per 0⩽r⩽n abbiamo che P(n,r)⩾C(n,r)
+
+
+
+**12. Conditional Probability**
+
+⟶ Probabilità Condizionata
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶ Teorema di Bayes ― Dati due eventi A e B tali che P(B)>0, abbiamo che:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶ Osservazione: abbiamo che P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶ Partizione ― Sia {Ai,i∈[[1,n]]} tale che per ogni i, Ai≠∅. Diciamo che {Ai} è una partizione se abbiamo che:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+⟶ Osservazione: per ogni evento B nello spazio campionario, abbiamo che P(B)=n∑i=1P(B|Ai)P(Ai).
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶ Forma estesa del teorema di Bayes ― Sia {Ai,i∈[[1,n]]} una partizione dello spazio campionario. Abbiamo che:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶ Indipendenza ― Due eventi A e B sono indipendenti se e solo se abbiamo che:
+
+
+
+**19. Random Variables**
+
+⟶ Variabili Aleatorie
+
+
+
+**20. Definitions**
+
+⟶ Definizioni
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+⟶ Variabile aleatoria ― Una variabile aleatoria, spesso chiamata X, è una funzione che associa ogni elemento dello spazio campionario a un reale.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶ Funzione di ripartizione (cumulativa) ― La funzione di ripartizione F, che è monotona non-decrescente e tale che limx→−∞F(x)=0 e limx→+∞F(x)=1, è definita come:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶ Funzione di densità ― La funzione di densità f è la probabilità che X assuma un valore tra due realizzazioni consecutive della variabile aleatoria.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶ Relazioni tra funzione di densità e di ripartizione ― Sono riportate le proprietà importanti da sapere nel caso discreto (D) e continuo (C).
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶ [Caso, funzione di ripartizione F, funzione di densità f, Proprietà della funzione di densità]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶ Valore atteso e Momenti della Distribuzione ― Sono riportate le espressioni del valore atteso E[X], valore atteso generalizzato E[g(X)], momento k-esimo E[Xk] e funzione caratteristica ψ(ω) per il caso discreto e continuo:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶ Varianza ― La varianza di una variabile aleatoria, spesso denotata da Var(X) o σ2, è una misura della variabilità della funzione di distribuzione. È determinata nel modo seguente:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶ Deviazione standard ― La deviazione standard di una variabile aleatoria, spesso denotata da σ, è una misura della variabilità della funzione di distribuzione che è compatibile con l'unità di misura della variabile aleatoria. È determinata nel modo seguente:
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶ Trasformazione di una variabile aleatoria ― Siano X e Y variabili collegate da qualche funzione. Siano fX e fY le funzioni di distribuzione di X e Y, rispettivamente. Abbiamo che:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶ Regola di integrazione di Leibniz ― Sia g una funzione di x e potenzialmente di c, e siano a e b estremi di integrazione che possono dipendere da c. Abbiamo che:
+
+
+
+**32. Probability Distributions**
+
+⟶ Distribuzioni di Probabilità
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶ Disuguaglianza di Chebyshev ― Sia X una variabile aleatoria con valore atteso μ. Per k,σ>0 abbiamo la seguente disuguaglianza:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶ Distribuzioni principali ― Di seguito le distribuzioni principali da tenere a mente:
+
+
+
+**35. [Type, Distribution]**
+
+⟶ [Tipologia, Distribuzione]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶ Distribuzione congiunta di variabili aleatorie
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+⟶ Densità marginale e distribuzione cumulativa ― Dalla funzione di densità congiunta fXY abbiamo che:
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶ [Caso, Densità marginale, Funzione cumulativa]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶ Densità condizionata ― La densità condizionata di X rispetto a Y, spesso denotata come fX|Y, è definita come:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶ Indipendenza ― Due variabili aleatorie X e Y si dicono indipendenti se:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶ Covarianza ― Si definisce la covarianza di due variabili aleatorie X e Y, denotata da σ2XY o più comunemente Cov(X,Y), come segue:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶ Correlazione ― Date σX,σY le deviazioni standard di X e Y, definiamo la correlazione tra le variabili aleatorie X e Y, denotata da ρXY, come segue:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶ Osservazione 1: notiamo che per ogni coppia di variabili aleatorie X,Y, abbiamo che ρXY∈[−1,1].
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶ Osservazione 2: Se X e Y sono indipendenti, allora ρXY=0.
+
+
+
+**45. Parameter estimation**
+
+⟶ Stima dei parametri
+
+
+
+**46. Definitions**
+
+⟶ Definizioni
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶ Campione casuale ― Un campione casuale è un insieme di n variabili aleatorie X1,...,Xn indipendenti e identicamente distribuite come X.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶ Stimatore ― Uno stimatore è una funzione dei dati usata per dedurre il valore di un parametro sconosciuto in un modello statistico.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶ Distorsione ― La distorsione di uno stimatore ^θ è definita come la differenza tra il valore atteso della distribuzione di ^θ e il vero valore, ossia:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶ Osservazione: uno stimatore si dice non distorto quando abbiamo E[^θ]=θ.
+
+
+
+**51. Estimating the mean**
+
+⟶ Stima della media
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+⟶ Media campionaria ― La media campionaria di un campione casuale è usata per stimare la vera media μ di una distribuzione, è spesso denotata da ¯¯¯¯¯X ed è definita come segue:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
+
+⟶ Osservazione: la media campionaria non è distorta, ossia E[¯¯¯¯¯X]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶ Teorema del Limite Centrale ― Sia X1,...,Xn un campione casuale che segue una data distribuzione di media μ e varianza σ2, di conseguenza abbiamo che:
+
+
+
+**55. Estimating the variance**
+
+⟶ Stima della varianza
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶ Varianza campionaria ― La varianza campionaria di un campione casuale è usata per stimare il vero valore della varianza σ2 di una distribuzione, è spesso denotata da s2 o ^σ2 ed è definita come segue:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶ Osservazione: la varianza campionaria non è distorta, ossia E[s2]=σ2.
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶ Relazione tra Chi-Quadro e la varianza campionaria ― Sia s2 la varianza campionaria di un campione casuale. Abbiamo che:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶ [Introduzione, Spazio campionario, Evento, Permutazione]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶ [Probabilità condizionata, Teorema di Bayes, Indipendenza]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶ [Variabili aleatorie, Definizioni, Valore atteso, Varianza]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶ [Distribuzioni di probabilità, Disuguaglianza di Chebyshev, Distribuzioni principali]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶ [Distribuzione congiunta di variabili aleatorie, Densità, Covarianza, Correlazione]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶ [Stima dei parametri, Media, Varianza]
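Review note on items 16 and 17 of the probability file above (law of total probability and extended form of Bayes' rule): the two statements can be sanity-checked numerically. The sketch below is illustrative only. The three-event partition, its probabilities, and the function names `total_probability` and `bayes_posterior` are assumptions made for this note and do not come from the cheatsheet.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """Return P(B) = sum_i P(B|A_i) * P(A_i) for a partition {A_i}."""
    return sum(l * p for l, p in zip(likelihoods, priors))


def bayes_posterior(priors, likelihoods):
    """Return [P(A_k|B) for each k] using the extended form of Bayes' rule."""
    evidence = total_probability(priors, likelihoods)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]


# Hypothetical partition A_1, A_2, A_3 of the sample space (priors sum to 1).
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
# Hypothetical conditional probabilities P(B|A_i).
likelihoods = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]

evidence = total_probability(priors, likelihoods)
posterior = bayes_posterior(priors, likelihoods)
print(evidence, posterior)
```

Exact fractions are used so that the posterior probabilities can be seen to sum to exactly 1, which is what the partition assumption guarantees.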
diff --git a/ja/cs-229-deep-learning.md b/ja/cs-229-deep-learning.md
new file mode 100644
index 000000000..50557a63f
--- /dev/null
+++ b/ja/cs-229-deep-learning.md
@@ -0,0 +1,321 @@
+**1. Deep Learning cheatsheet**
+
+⟶ ディープラーニングチートシート
+
+
+
+**2. Neural Networks**
+
+⟶ ニューラルネットワーク
+
+
+
+**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
+
+⟶ ニューラルネットワークとは複数の層を用いて構成されたモデルの種類を指します。一般的に利用されるネットワークとして畳み込みニューラルネットワークとリカレントニューラルネットワークが挙げられます。
+
+
+
+**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
+
+⟶ アーキテクチャ - ニューラルネットワークに関する用語は以下の図により説明されます:
+
+
+
+**5. [Input layer, hidden layer, output layer]**
+
+⟶ [入力層, 隠れ層, 出力層]
+
+
+
+**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶ iをネットワーク上のi層目の層とし、隠れ層のj個目のユニットをjとすると:
+
+
+
+**7. where we note w, b, z the weight, bias and output respectively.**
+
+⟶ この場合重みをw、バイアス項をb、出力をzとします。
+
+
+
+**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
+
+⟶ 活性化関数 ー 活性化関数はモデルに非線形性を与えるために隠れユニットの最後で利用されます。一般的には以下の関数がよく使われます:
+
+
+
+**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
+
+⟶ [Sigmoid(シグモイド関数), Tanh(双曲線正接関数), ReLU(正規化線形ユニット), Leaky ReLU(漏洩ReLU)]
+
+
+
+**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶ 交差エントロピーロス ー ニューラルネットにおいて交差エントロピーロスL(z,y)は一般的に使われ、以下のように定義されています:
+
+
+
+**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶ 学習率 ー 学習率は多くの場合α、時にはηで表記され、勾配法による重みの更新の速度を表します。学習率は固定することも、適応的に変更することもできます。現在最も一般的に使われている学習法はAdam(アダム)であり、学習率を適応的に変更する方法です。
+
+
+
+**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
+
+⟶ 誤差逆伝播法(backpropagation)ー 誤差逆伝播法はニューラルネットにおいて実際の出力と期待される出力との差異を考慮して重みを更新する方法の一つです。重みwに関する導関数は連鎖律を使用して計算され、次の形式で表されます:
+
+
+
+**13. As a result, the weight is updated as follows:**
+
+⟶ 結果として、重みは以下のように更新されます:
+
+
+
+**14. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶ 重みの更新 ー ニューラルネットでは以下のように重みが更新されます:
+
+
+
+**15. Step 1: Take a batch of training data.**
+
+⟶ ステップ1: 訓練データを1バッチ用意する。
+
+
+
+**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
+
+⟶ ステップ2: 順伝播を行いそれに対する誤差を求める。
+
+
+
+**17. Step 3: Backpropagate the loss to get the gradients.**
+
+⟶ ステップ3: 誤差を逆伝播し、勾配を計算する。
+
+
+
+**18. Step 4: Use the gradients to update the weights of the network.**
+
+⟶ ステップ4: 勾配を使いネットワークの重みを更新する。
+
+
+
+**19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.**
+
+⟶ ドロップアウト ー ドロップアウトはニューラルネット内の一部のユニットを無効にすることで学習データへの過学習を防ぐテクニックです。実際には、ニューロンは確率pで無効、確率1-pで有効のどちらかになります。
+
+
+
+**20. Convolutional Neural Networks**
+
+⟶ 畳み込みニューラルネットワーク
+
+
+
+**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
+
+⟶ 畳み込みレイヤーの条件 ー Wを入力サイズ、Fを畳み込み層のニューロンのサイズ、Pをゼロ埋めの量とすると、これらに対応するニューロンの数Nは次のようになります。
+
+
+
+**22. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+⟶ バッチ正規化 ー バッチ{xi}を正規化するハイパーパラメータγ、βのステップです。補正を行うバッチの平均と分散をμB,σ2Bとすると、正規化は以下のように行われます:
+
+
+
+**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶ これは通常、学習率を高め、初期値への依存性を減らすことを目的でFully Connected層と畳み込み層の後、非線形化を行う前に行われます。
+
+
+
+**24. Recurrent Neural Networks**
+
+⟶ リカレントニューラルネットワーク (RNN)
+
+
+
+**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
+
+⟶ ゲートの種類 ー 典型的なRNNに使われるゲートです:
+
+
+
+**26. [Input gate, forget gate, gate, output gate]**
+
+⟶ [入力ゲート, 忘却ゲート, ゲート, 出力ゲート]
+
+
+
+**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
+
+⟶ [セルに追加するべき?, セルを削除するべき?, 情報をどれだけセルに追加するべき?, セルをどの程度通すべき?]
+
+
+
+**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
+
+⟶ LSTM - 長・短期記憶 (long short-term memory; LSTM) ネットワークは、忘却ゲートを追加することで勾配消失問題を回避するRNNモデルの一種です。
+
+
+
+**29. Reinforcement Learning and Control**
+
+⟶ 強化学習と制御
+
+
+
+**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
+
+⟶ 強化学習の目標は、エージェントがある環境内でどのように行動していくかを学習することです。
+
+
+
+**31. Definitions**
+
+⟶ 定義
+
+
+
+**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
+
+⟶ マルコフ決定過程 ー マルコフ決定過程(Markov decision process; MDP)を5タプル(S,A,{Psa},γ,R)としたとき:
+
+
+
+**33. S is the set of states**
+
+⟶ Sは状態の集合
+
+
+
+**34. A is the set of actions**
+
+⟶ Aは行動の集合
+
+
+
+**35. {Psa} are the state transition probabilities for s∈S and a∈A**
+
+⟶ {Psa}は状態s∈Sと行動a∈Aの状態遷移確率
+
+
+
+**36. γ∈[0,1[ is the discount factor**
+
+⟶ γ∈[0,1[は割引因子
+
+
+
+**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
+
+⟶ R:S×A⟶R または R:S⟶R はアルゴリズムが最大化したい報酬関数
+
+
+
+**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
+
+⟶ 方策 - 方策πは状態を行動に写像する関数π:S⟶A
+
+
+
+**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
+
+⟶ 備考: 状態sを与えられた際に行動a=π(s)を行うことを方策πを実行すると言います。
+
+
+
+**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
+
+⟶ 価値関数 - ある方策πとある状態sにおける価値関数Vπを以下のように定義します:
+
+
+
+**41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:**
+
+⟶ ベルマン方程式 - 最適ベルマン方程式は最適方策π∗の価値関数Vπ∗を特徴づけます:
+
+
+
+**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
+
+⟶ 備考: 与えられた状態sに対する最適方策π*はこのようになります:
+
+
+
+**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
+
+⟶ 価値反復法アルゴリズム - 価値反復法アルゴリズムは2段階で行われます:
+
+
+
+**44. 1) We initialize the value:**
+
+⟶ 1) 値を初期化する:
+
+
+
+**45. 2) We iterate the value based on the values before:**
+
+⟶ 2) 前の値を元に値を反復的に更新する:
+
+
+
+**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
+
+⟶ 最尤推定 ー 状態遷移確率の最尤推定(maximum likelihood estimate; MLE):
+
+
+
+**47. times took action a in state s and got to s′**
+
+⟶ 状態sで行動aを行い状態s′に遷移した回数
+
+
+
+**48. times took action a in state s**
+
+⟶ 状態sで行動aを行った回数
+
+
+
+**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
+
+⟶ Q学習 ― Q学習はモデルフリーのQ値の推定であり、以下のように行われます:
+
+
+
+**50. View PDF version on GitHub**
+
+⟶ GitHubでPDF版を見る
+
+
+
+**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
+
+⟶ [ニューラルネットワーク, アーキテクチャ, 活性化関数, 誤差逆伝播法, ドロップアウト]
+
+
+
+**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
+
+⟶ [畳み込みニューラルネットワーク, 畳み込み層, バッチ正規化]
+
+
+
+**53. [Recurrent Neural Networks, Gates, LSTM]**
+
+⟶ [リカレントニューラルネットワーク, ゲート, LSTM]
+
+
+
+**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
+
+⟶ [強化学習, マルコフ決定過程, 価値/方策反復, 近似動的計画法, 方策探索]
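Review note on items 43 to 45 of the deep learning file above (value iteration): the formulas themselves are not part of this diff, so the sketch below assumes the common form V(s) := R(s) + max over a of γ · Σ Psa(s′) · V(s′), with values initialized to 0. The two-state MDP, its rewards, and the name `value_iteration` are hypothetical and exist only to make the two steps concrete.

```python
def value_iteration(states, actions, P, R, gamma, n_iters=100):
    """Run a fixed number of value-iteration sweeps.

    P[s][a][s2] is the probability of reaching s2 from s under action a,
    R[s] is the reward of state s, gamma is the discount factor in [0, 1).
    """
    # Step 1: initialize the value of every state to 0.
    V = {s: 0.0 for s in states}
    # Step 2: repeatedly update each value from the previous values.
    for _ in range(n_iters):
        new_V = {}
        for s in states:
            best = max(
                sum(P[s][a][s2] * V[s2] for s2 in states) for a in actions
            )
            new_V[s] = R[s] + gamma * best
        V = new_V
    return V


# Hypothetical deterministic MDP: "stay" keeps the state, "move" switches it.
states = ["A", "B"]
actions = ["stay", "move"]
P = {
    "A": {"stay": {"A": 1.0, "B": 0.0}, "move": {"A": 0.0, "B": 1.0}},
    "B": {"stay": {"A": 0.0, "B": 1.0}, "move": {"A": 1.0, "B": 0.0}},
}
R = {"A": 0.0, "B": 1.0}

V = value_iteration(states, actions, P, R, gamma=0.5)
print(V)
```

With these made-up numbers the fixed point can be verified by hand: staying in B gives V(B) = 1 + 0.5·V(B), so V(B) = 2, and moving from A to B gives V(A) = 0.5·V(B) = 1.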
diff --git a/de/refresher-linear-algebra.md b/ja/cs-229-linear-algebra.md
similarity index 51%
rename from de/refresher-linear-algebra.md
rename to ja/cs-229-linear-algebra.md
index a6b440d1e..c806cb4ca 100644
--- a/de/refresher-linear-algebra.md
+++ b/ja/cs-229-linear-algebra.md
@@ -1,339 +1,342 @@
**1. Linear Algebra and Calculus refresher**
⟶
-
+線形代数と微積分の復習
**2. General notations**
⟶
-
+一般表記
**3. Definitions**
⟶
-
+定義
**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
⟶
-
+ベクトル - x∈Rn はn個の要素を持つベクトルを表し、xi∈R はi番目の要素を表します。
**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
⟶
-
+行列 - m行n列の行列を A∈Rm×n と表記し、Ai,j∈R はi行目のj列目の要素を指します。
**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
⟶
-
+備考:上記で定義されたベクトル x は n×1 の行列と見なすことができ、列ベクトルと呼ばれます。
**7. Main matrices**
⟶
-
+主な行列の種類
**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
⟶
-
+単位行列 - 単位行列 I∈Rn×n は、対角成分に 1 が並び、他は全て 0 となる正方行列です。
**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
⟶
-
+備考:すべての行列 A∈Rn×n に対して、A×I=I×A=A となります。
**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
⟶
-
+対角行列 - 対角行列 D∈Rn×n は、対角成分の値が 0 以外で、それ以外は 0 である正方行列です。
**11. Remark: we also note D as diag(d1,...,dn).**
⟶
-
+備考:Dをdiag(d1,...,dn) とも表記します。
**12. Matrix operations**
⟶
-
+行列演算
**13. Multiplication**
⟶
-
+行列乗算
**14. Vector-vector ― There are two types of vector-vector products:**
⟶
-
+ベクトル-ベクトル - ベクトル-ベクトル積には2種類あります。
**15. inner product: for x,y∈Rn, we have:**
⟶
-
+内積: x,y∈Rn に対して、内積の定義は下記の通りです:
**16. outer product: for x∈Rm,y∈Rn, we have:**
⟶
-
+外積: x∈Rm,y∈Rn に対して、外積の定義は下記の通りです:
**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:**
⟶
-
+行列-ベクトル - 行列 A∈Rm×n とベクトル x∈Rn の積は以下の条件を満たすようなサイズ Rm のベクトルです。
**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
⟶
-
+上記 aTr,i は A の行ベクトルで、ac,j は A の列ベクトルです。 xi は x の要素です。
**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:**
⟶
-
+行列-行列 - 行列 A∈Rm×n と B∈Rn×p の積は以下の条件を満たすようなサイズ Rm×p の行列です。 (There is a typo in the original: Rn×p)
**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
⟶
-
+aTr,i,bTr,i は A と B の行ベクトルで ac,j,bc,j は A と B の列ベクトルです。
**21. Other operations**
⟶
-
+その他の演算
**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
⟶
-
+転置 ― A∈Rm×n の転置行列は AT と表記し、A の行と列を入れ替えた行列です。
**23. Remark: for matrices A,B, we have (AB)T=BTAT**
⟶
-
+備考: 行列AとBの場合、(AB)T=BTAT となります。
**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
⟶
-
+逆行列 ― 可逆正方行列 A の逆行列は A−1 と表記し、以下の条件を満たす唯一の行列です。
**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
⟶
-
+備考: すべての正方行列が可逆とは限りません。また、行列 A,B については、(AB)−1=B−1A−1 となります。
**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
⟶
-
+跡 - 正方行列 A の跡は、tr(A) と表記し、その対角成分の要素の和です。
**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
⟶
-
+備考: 行列 A,B の場合: tr(AT)=tr(A) と tr(AB)=tr(BA) となります。
**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
⟶
-
+行列式 ― 正方行列 A∈Rn×n の行列式は |A| または det(A) と表記し、
+ A から i番目の行とj番目の列を除いた行列 A∖i,∖j を用いて、次のように再帰的に表現されます。
**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
⟶
-
+備考: |A|≠0 の場合に限り、A は可逆です。また |AB|=|A||B| と |AT|=|A| が成り立ちます。
**30. Matrix properties**
⟶
-
+行列の性質
**31. Definitions**
⟶
-
+定義
**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
⟶
-
+対称分解 ― 行列Aは次のように対称および反対称的な部分で表現できます。
**33. [Symmetric, Antisymmetric]**
⟶
-
+[対称、反対称]
**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
⟶
-
+ノルム - ノルムとは、V をベクトル空間としたとき、すべての x,y∈V に対して以下の条件を満たす関数 N:V⟶[0,+∞[ のことです。
+
**35. N(ax)=|a|N(x) for a scalar**
⟶
-
+スカラー a に対して N(ax)=|a|N(x)
**36. if N(x)=0, then x=0**
⟶
-
+N(x)= 0ならば x = 0
**37. For x∈V, the most commonly used norms are summed up in the table below:**
⟶
-
+x∈Vに対して、最も多用されているノルムは、以下の表にまとめられています。
**38. [Norm, Notation, Definition, Use case]**
⟶
-
+[ノルム、表記法、定義、使用事例]
**39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
⟶
-
+線形従属 ― ベクトルの集合に対して、少なくともどれか一つのベクトルを他のベクトルの線形結合として定義できる場合、その集合が線形従属であるといいます。
**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
⟶
-
+備考:どのベクトルもこのように表すことができない場合、それらのベクトルは線形独立であるといいます。
**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
⟶
-
+行列の階数 ― 行列Aの階数は rank(A) と表記し、列空間の次元を表します。これは、Aの線形独立の列の最大数に相当します。
**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
⟶
-
+半正定値行列 ― 行列A、A∈Rn×nに対して、以下の式が成り立つならば、 Aを半正定値(PSD)といい、A⪰0 と表記します。
**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
⟶
-
+備考: 同様に、半正定値行列 A がすべての非ゼロベクトル x に対して xTAx>0 を満たすとき、A を正定値行列といい、A≻0 と表記します。
**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
⟶
-
+固有値、固有ベクトル ― 行列A、A∈Rn×n に対して、以下の条件を満たすようなベクトルz、z∈Rn∖{0} が存在するならば、λ は固有値といい、z は固有ベクトルといいます。
**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
⟶
-
+スペクトル定理 ― A∈Rn×n とします。A が対称ならば、A は実直交行列 U∈Rn×n によって対角化可能です。Λ=diag(λ1,...,λn) と表記すると、次のように表現できます。
**46. diagonal**
⟶
-
+対角
**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
⟶
-
+特異値分解 ― A を m×n の行列とします。特異値分解(SVD)は、ユニタリ行列 U m×m、Σ m×n の対角行列、およびユニタリ行列 V n×n の存在を保証する因数分解手法で、以下の条件を満たします。
**48. Matrix calculus**
⟶
-
+行列微積分
**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
⟶
-
+勾配 ― f:Rm×n→R を関数とし、A∈Rm×n を行列とします。 A に対する f の勾配は m×n 行列で、∇Af(A) と表記し、次の条件を満たします。
**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
⟶
-
+備考: f の勾配は、f がスカラーを返す関数であるときに限り存在します。
**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
⟶
-
+ヘッセ行列 ― f:Rn→R を関数とし、x∈Rn をベクトルとします。 x に対する f のヘッセ行列は、n×n 対称行列で ∇2xf(x) と表記し、以下の条件を満たします。
**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
⟶
-
+備考: f のヘッセ行列は、f がスカラーを返す関数である場合に限り存在します。
**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
⟶
-
+勾配演算 ― 行列 A,B,C に対して、以下の勾配の性質は覚えておく価値があります。
**54. [General notations, Definitions, Main matrices]**
⟶
-
+[一般表記, 定義, 主な行列の種類]
**55. [Matrix operations, Multiplication, Other operations]**
⟶
-
+[行列演算, 乗算, その他の演算]
**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
⟶
-
+[行列の性質, ノルム, 固有値/固有ベクトル, 特異値分解]
**57. [Matrix calculus, Gradient, Hessian, Operations]**
⟶
+[行列微積分, 勾配, ヘッセ行列, 演算]
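Review note on items 17 and 19 of the linear algebra file above: the translator flags that the product of A∈Rm×n and B∈Rn×p has size m×p, and the same reasoning gives a vector of size m for the matrix-vector product. The shapes can be checked with a small pure-Python sketch. The helper names `matmul` and `matvec` and the example matrices are assumptions made for this note only.

```python
def matmul(A, B):
    """Multiply an m×n matrix A by an n×p matrix B (lists of rows)."""
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError("inner dimensions must match")
    p = len(B[0])
    # Entry (i, j) is the inner product of row i of A and column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]


def matvec(A, x):
    """Multiply an m×n matrix A by a vector x with n entries."""
    if len(x) != len(A[0]):
        raise ValueError("vector length must equal the number of columns")
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


A = [[1, 2, 3], [4, 5, 6]]                      # 2×3
B = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]  # 3×4
C = matmul(A, B)                                # 2×4
y = matvec(A, [1, 1, 1])                        # 2 entries
print(len(C), len(C[0]), len(y))
```

The result has as many rows as A and as many columns as B, which is the m×p shape used in the Japanese translation of item 19.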
diff --git a/ja/cs-229-machine-learning-tips-and-tricks.md b/ja/cs-229-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..214cee2e8
--- /dev/null
+++ b/ja/cs-229-machine-learning-tips-and-tricks.md
@@ -0,0 +1,285 @@
+**1. Machine Learning tips and tricks cheatsheet**
+
+⟶ 機械学習のアドバイスやコツのチートシート
+
+
+
+**2. Classification metrics**
+
+⟶ 分類評価指標
+
+
+
+**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+⟶ 二値分類において、モデルの性能を評価する際の主要な指標として次のものがあります。
+
+
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+⟶ 混同行列 ― 混同行列はモデルの性能を評価する際に、より完全に理解するために用いられます。次のように定義されます:
+
+
+
+**5. [Predicted class, Actual class]**
+
+⟶ [予測したクラス, 実際のクラス]
+
+
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+⟶ 主要な評価指標 ― 分類モデルの性能を評価するために、一般的に次の指標が用いられます。
+
+
+
+**7. [Metric, Formula, Interpretation]**
+
+⟶ [評価指標,式,解釈]
+
+
+
+**8. Overall performance of model**
+
+⟶ モデルの全体的な性能
+
+
+
+**9. How accurate the positive predictions are**
+
+⟶ 陽性判定がどれくらい正確か
+
+
+
+**10. Coverage of actual positive sample**
+
+⟶ 実際に陽性であるサンプルの網羅率
+
+
+
+**11. Coverage of actual negative sample**
+
+⟶ 実際に陰性であるサンプルの網羅率
+
+
+
+**12. Hybrid metric useful for unbalanced classes**
+
+⟶ 不均衡データに対する有用な複合指標
+
+
+
+**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
+
+⟶ ROC曲線 ― 受信者動作特性曲線(ROC)は閾値を変えていく際のFPRに対するTPRのグラフです。これらの指標は下表の通りまとめられます。
+
+
+
+**14. [Metric, Formula, Equivalent]**
+
+⟶ [評価指標,式,等価な指標]
+
+
+
+**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+⟶ AUC ― ROC曲線下面積(AUC,AUROC)は次の図に示される通りROC曲線の下側面積のことです。
+
+
+
+**16. [Actual, Predicted]**
+
+⟶ [実際,予測]
+
+
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+⟶ 基本的な評価指標 ― 回帰モデルfが与えられたとき,次のような評価指標がモデルの性能を評価するために一般的に用いられます。
+
+
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+⟶ [全平方和,回帰平方和,残差平方和]
+
+
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+⟶ 決定係数 ― よくR2やr2と書かれる決定係数は,実際の結果がモデルによってどの程度よく再現されているかを測る評価指標であり,次のように定義されます。
+
+
+
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+⟶ 主要な評価指標 ― 次の評価指標は説明変数の数を考慮して回帰モデルの性能を評価するために,一般的に用いられています。
+
+
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+⟶ ここでLは尤度であり,ˆσ2は各応答に対する誤差分散の推定値です。
+
+
+
+**22. Model selection**
+
+⟶ モデル選択
+
+
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶ 用語 ― モデルを選択するときには,次のようにデータの種類を異なる3つに区別します。
+
+
+
+**24. [Training set, Validation set, Testing set]**
+
+⟶ [訓練セット,検証セット,テストセット]
+
+
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+⟶ [モデルを学習させる,モデルを評価する,モデルが予測する]
+
+
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+⟶ [通常はデータセットの80%,通常はデータセットの20%]
+
+
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+⟶ [ホールドアウトセットや開発セットとも呼ばれる,未知のデータ]
+
+
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶ 一度モデルが選択されると,学習にはデータセット全体が用いられ,テストには未知のテストセットが使用されます。これらは次の図のように表されます。
+
+
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+⟶ 交差検証 ― 交差検証(CV)は,初期の学習データセットに強く依存しないようにモデル選択を行う方法です。2つの方法を下表にまとめました。
+
+
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+⟶ [k-1群で学習し残りの1群で評価,n-p個の観測値で学習し残りのp個で評価]
+
+
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+⟶ [一般的にはk=5または10,p=1の場合は一個抜き交差検証と呼ばれます]
+
+
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+⟶ 最も一般的に用いられている方法はk分割交差検証法です。学習データをk群に分けた後,1群を検証に使用し残りのk-1群を学習に使用するという操作を順番にk回繰り返します。求められた検証誤差はk群すべてにわたって平均化されます。この平均された誤差のことを交差検証誤差と呼びます。
+
+
+
+**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶ 正則化 ― 正則化はモデルの過学習状態を回避することが目的であり,したがってハイバリアンス問題(オーバーフィット問題)に対処できます。一般的に使用されるいくつかの正則化法を下表にまとめました。
+
+
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [係数を0にする,変数選択に適する,係数を小さくする,変数選択と係数を小さくすることのトレードオフ]
+
+
+
+**35. Diagnostics**
+
+⟶ 診断方法
+
+
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+⟶ バイアス ― ある標本値群を予測する際の期待値と正しいモデルの結果との差異のことです。
+
+
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+⟶ バリアンス ― モデルのバリアンスとは,ある標本値群に対するモデルの予測値のばらつきのことです。
+
+
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+⟶ バイアス・バリアンストレードオフ ― よりシンプルなモデルではバイアスが高くなり,より複雑なモデルはバリアンスが高くなります。
+
+
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+⟶ [症状,回帰モデルでの図,分類モデルでの図,深層学習での図,可能な解決策]
+
+
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+⟶ [高い訓練誤差,訓練誤差がテスト誤差に近い,高いバイアス,訓練誤差がテスト誤差より少しだけ小さい,極端に小さい訓練誤差,訓練誤差がテスト誤差に比べて非常に小さい,高いバリアンス]
+
+
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+⟶ [より複雑なモデルを試す,特徴量を増やす,より長く学習する,正則化を導入する,データ数を増やす]
+
+
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+⟶ エラー分析 ― エラー分析は現在のモデルと完璧なモデル間の性能差の主要な要因を分析することです。
+
+
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+⟶ アブレーション分析 ― アブレーション分析は,ベースライン・モデルと現在のモデル間で発生したパフォーマンスの差異の原因を分析することです。
+
+
+
+**44. Regression metrics**
+
+⟶ 回帰評価指標
+
+
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+⟶ [分類評価指標,混同行列,正解率,適合率,再現率,F値,ROC曲線]
+
+
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+⟶ [回帰評価指標,R二乗,マローズのCp,AIC,BIC]
+
+
+
+**47. [Model selection, cross-validation, regularization]**
+
+⟶ [モデルの選択,交差検証,正則化]
+
+
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+⟶ [診断方法,バイアス・バリアンストレードオフ,エラー・アブレーション分析]
diff --git a/ja/cs-229-probability.md b/ja/cs-229-probability.md
new file mode 100644
index 000000000..16fca9ea5
--- /dev/null
+++ b/ja/cs-229-probability.md
@@ -0,0 +1,381 @@
+**1. Probabilities and Statistics refresher**
+
+⟶確率と統計の復習
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶確率と組合せの導入
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶標本空間 - ある試行のすべての起こりうる結果の集合はその試行の標本空間として知られ、Sと表します。
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶事象 - 標本空間の任意の部分集合Eを事象と言います。つまり、ある事象はある試行の起こりうる結果により構成された集合です。ある試行結果がEに含まれるなら、Eが起きたと言います。
+
+
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶確率の公理 - 各事象Eに対して、事象Eが起こる確率をP(E)と書きます。
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+⟶公理1 - すべての確率は0と1を含んでその間にあります。すなわち:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+⟶公理2 - 標本空間全体において少なくとも一つの根元事象が起こる確率は1です。すなわち:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶公理3 - 互いに排反な事象の任意の数列E1,...,Enに対し、次が成り立ちます:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶順列(Permutation) - 順列とはn個のものの中からr個をある順序で並べた配列です。このような配列の数はP(n,r)と表し、次のように定義します:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶組合せ(Combination) - 組合せはn個の中からr個の順番を勘案しない配列です。このような配列の数はC(n,r)と表し、次のように定義します:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶注釈: 0⩽r⩽nのとき、P(n,r)⩾C(n,r)となります。
+
+
+
+**12. Conditional Probability**
+
+⟶条件付き確率
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶ベイズの定理 - P(B)>0であるような事象A, Bに対して、次が成り立ちます:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶注釈: P(A∩B)=P(A)P(B|A)=P(A|B)P(B)となります。
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶分割(Partition) - {Ai,i∈[[1,n]]}はすべてのiに対してAi≠∅としましょう。次が成り立つとき、{Ai}は分割であると言います:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=∑_{i=1}^{n} P(B|Ai)P(Ai).**
+
+⟶注釈: 標本空間において任意の事象Bに対して、P(B)=∑_{i=1}^{n} P(B|Ai)P(Ai)が成り立ちます。
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶ベイズの定理の応用 - {Ai,i∈[[1,n]]}を標本空間の分割とすると、次が成り立ちます:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶独立性 - 次が成り立ちかつその場合に限り(必要十分)、2つの事象AとBは独立であるといいます:
+
+
+
+**19. Random Variables**
+
+⟶確率変数
+
+
+
+**20. Definitions**
+
+⟶定義
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+⟶確率変数 - 確率変数は、よくXと表記され、ある標本空間のすべての要素を実数直線に対応させる関数です。
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶累積分布関数(CDF) - 累積分布関数Fは、単調非減少かつlimx→−∞F(x)=0およびlimx→+∞F(x)=1であり、次のように定義されます:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶確率密度関数(PDF) - 確率密度関数fは確率変数Xが2つの隣接する実現値の間の値をとる確率です。
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶PDFとCDFについての関係性 - 離散値(D)と連続値(C)のそれぞれの場合について知っておくべき重要な特性をここに挙げます。
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶[種類、CDF F、PDF f、PDFの特性]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶分布の期待値と積率 - 離散値と連続値のそれぞれの場合における期待値E[X]、一般化した期待値E[g(X)]、k次の積率E[Xk]と特性関数ψ(ω)をここに挙げます:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶分散(Variance) - 確率変数の分散は、よくVar(X)またはσ2と表記され、その確率変数の分布関数のばらつきの尺度です。次のように計算されます。
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶標準偏差(Standard deviation) - 確率変数の標準偏差は、よくσと表記され、その確率変数の分布関数のばらつきの尺度であり、その確率変数の単位に則ったものです。次のように計算されます。
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶確率変数の変換 - 変数XとYはなんらかの関数により関連づけられているとします。fXとfYをそれぞれXとYの分布関数として表記すると次が成り立ちます:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶ライプニッツの積分則 - gをxと潜在的にcの関数とし、a,bをcに従属的な境界とすると、次が成り立ちます。
+
+
+
+**32. Probability Distributions**
+
+⟶確率分布
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶チェビシェフの不等式 - Xを期待値μの確率変数とします。k,σ>0のとき次の不等式が成り立ちます:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶主な分布 - 覚えておくべき主な分布をここに挙げます:
+
+
+
+**35. [Type, Distribution]**
+
+⟶[種類、分布]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶同時分布の確率変数
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+⟶周辺密度と累積分布 - 同時確率密度関数fXYから次が成り立ちます。
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶[種類、周辺密度、累積関数]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶条件付き密度(Conditional density) - Yに対するXの条件付き密度はよくfX|Yと表記され、次のように定義されます:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶独立性(Independence) - 2つの確率変数XとYは次が成り立つとき、独立であると言います:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶共分散(Covariance) - 2つの確率変数XとYの共分散を、σ2XYまたはより一般的にはCov(X,Y)と表記し、次のように定義します:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶相関係数(Correlation) - X, Yの標準偏差をσX,σYと表記し、確率変数X,Yの相関関係をρXYと表記し、次のように定義します:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶注釈 1: 任意の確率変数X,Yに対してρXY∈[−1,1]が成り立ちます。
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶注釈 2: XとYが独立ならば、ρXY=0です。
+
+
+
+**45. Parameter estimation**
+
+⟶母数推定
+
+
+
+**46. Definitions**
+
+⟶定義
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶確率標本(Random sample) - 確率標本とはXに従う独立同分布のn個の確率変数X1,...,Xnの集合です。
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶推定量(Estimator) - 推定量とは統計モデルにおける未知のパラメータの値を推定するのに用いられるデータの関数です。
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶偏り(Bias) - 推定量^θの偏りは^θの分布の期待値と真の値との差として定義されます。すなわち:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶注釈: E[^θ]=θが成り立つとき、推定量は不偏であるといいます。
+
+
+
+**51. Estimating the mean**
+
+⟶平均の推定
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+⟶標本平均(Sample mean) - 確率標本の標本平均は、ある分布の真の平均μを推定するのに用いられ、よく¯¯¯¯¯Xと表記され、次のように定義されます:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
+
+⟶注釈: 標本平均は不偏です。すなわちE[¯¯¯¯¯X]=μが成り立ちます。
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶中心極限定理 - 確率標本X1,...,Xnが平均μと分散σ2を持つある分布に従うとすると、次が成り立ちます:
+
+
+
+**55. Estimating the variance**
+
+⟶分散の推定
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶標本分散 - 確率標本の標本分散は、ある分布の真の分散σ2を推定するのに用いられ、よくs2または^σ2と表記され、次のように定義されます:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶注釈: 標本分散は不偏です。すなわちE[s2]=σ2が成り立ちます。
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶標本分散とカイ二乗分布との関係 - 確率標本の標本分散をs2とすると、次が成り立ちます:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶[導入、標本空間、事象、順列]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶[条件付き確率、ベイズの定理、独立]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶[確率変数、定義、期待値、分散]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶[確率分布、チェビシェフの不等式、主な分布]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶[同時分布の確率変数、密度、共分散、相関係数]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶[母数推定、平均、分散]
diff --git a/ja/cs-229-supervised-learning.md b/ja/cs-229-supervised-learning.md
new file mode 100644
index 000000000..71f63afdd
--- /dev/null
+++ b/ja/cs-229-supervised-learning.md
@@ -0,0 +1,567 @@
+**1. Supervised Learning cheatsheet**
+
+⟶教師あり学習チートシート
+
+
+
+**2. Introduction to Supervised Learning**
+
+⟶教師あり学習入門
+
+
+
+**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+⟶入力が{x(1),...,x(m)}、出力が{y(1),...,y(m)}であるとき、xからyを予測する分類器を構築したい。
+
+
+
+**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+⟶予測の種類 ― 様々な種類の予測モデルは下表に集約される:
+
+
+
+**5. [Regression, Classifier, Outcome, Examples]**
+
+⟶回帰、分類、出力、例
+
+
+
+**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+⟶連続値、クラス、線形回帰、ロジスティック回帰、SVM、ナイーブベイズ
+
+
+
+**7. Type of model ― The different models are summed up in the table below:**
+
+⟶モデルの種類 ― 様々な種類のモデルは下表に集約される:
+
+
+
+**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+⟶判別モデル、生成モデル、目的、学習対象、イメージ図、例
+
+
+
+**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+⟶P(y|x)の直接推定、後にP(y|x)を推測するためのP(x|y)の推定、決定境界、データの確率分布、回帰、SVM、GDA、ナイーブベイズ
+
+
+
+**10. Notations and general concepts**
+
+⟶記法と全般的な概念
+
+
+
+**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+⟶仮説 ― 仮説はhθと表され、選択されたモデルのことである。与えられた入力x(i)に対して、モデルの予測結果はhθ(x(i))である。
+
+
+
+**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+⟶損失関数 ― 損失関数とは(z,y)∈R×Y⟼L(z,y)∈Rを満たす関数Lで、予測値zとそれに対応する正解データ値yを入力とし、その誤差を出力するものである。一般的な損失関数は次表に集約される:
+
+
+
+**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+⟶最小2乗誤差、ロジスティック損失、ヒンジ損失、交差エントロピー
+
+
+
+**14. [Linear regression, Logistic regression, SVM, Neural Network]**
+
+⟶線形回帰、ロジスティック回帰、SVM、ニューラルネットワーク
+
+
+
+**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+⟶コスト関数 ― コスト関数Jは一般的にモデルの性能を評価するために用いられ、損失関数をLとして次のように定義される:
+
+
+
+**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+⟶勾配降下法 ― 学習率をα∈Rとし、勾配降下法における更新ルールは学習率とコスト関数Jを用いて次のように表される:
+
+
+
+**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
+
+⟶備考:確率的勾配降下法(SGD)は個々の学習標本ごとにパラメータを更新し、バッチ勾配降下法は学習標本のバッチごとに更新する。
+
+
+
+**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+⟶尤度 ― パラメータをθとすると、あるモデルの尤度L(θ)を最大にすることにより最適なパラメータを求められる。実際には、最適化しやすい対数尤度ℓ(θ)=log(L(θ))を用いる。すなわち:
+
+
+
+**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+⟶ニュートン法 ― ニュートン法とはℓ′(θ)=0となるθを求める数値法である。その更新ルールは次の通りである:
+
+
+
+**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+⟶備考:多次元一般化またはニュートン-ラフソン法の更新ルールは次の通りである:
+
+
+
+**21. Linear models**
+
+⟶線形モデル
+
+
+
+**22. Linear regression**
+
+⟶線形回帰
+
+
+
+**23. We assume here that y|x;θ∼N(μ,σ2)**
+
+⟶ここでy|x;θ∼N(μ,σ2)であるとする。
+
+
+
+**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+⟶正規方程式 ― Xを計画行列とすると、コスト関数を最小化するθの値は次のような閉形式の解である:
+
+
+
+**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+⟶最小2乗法 ― 学習率をαとすると、m個のデータ点からなる学習データに対する最小2乗法(LMSアルゴリズム)による更新ルールは、ウィドロウ-ホフの学習規則としても知られており、次の通りである:
+
+
+
+**26. Remark: the update rule is a particular case of the gradient ascent.**
+
+⟶備考:この更新ルールは勾配上昇法の特殊な例である。
+
+
+
+**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+⟶局所重み付き回帰 ― 局所重み付き回帰は、LWRとも呼ばれ、線形回帰の派生形である。パラメータをτ∈Rとして次のように定義されるw(i)(x)により、個々の学習標本をそのコスト関数において重み付けする:
+
+
+
+**28. Classification and logistic regression**
+
+⟶分類とロジスティック回帰
+
+
+
+**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+⟶シグモイド関数 ― シグモイド関数gは、ロジスティック関数とも呼ばれ、次のように定義される:
+
+
+
+**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+⟶ロジスティック回帰 ― ここでy|x;θ∼Bernoulli(ϕ)であるとすると、次の形式を得る:
+
+
+
+**31. Remark: there is no closed form solution for the case of logistic regressions.**
+
+⟶備考:ロジスティック回帰については閉形式の解は存在しない。
+
+
+
+**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+⟶ソフトマックス回帰 ― ソフトマックス回帰は、多クラス分類ロジスティック回帰とも呼ばれ、3個以上の結果クラスがある場合にロジスティック回帰を一般化するためのものである。慣習的に、θK=0とすると、各クラスiのベルヌーイ分布のパラメータϕiは次と等しくなる:
+
+
+
+**33. Generalized Linear Models**
+
+⟶一般化線形モデル
+
+
+
+**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+⟶指数分布族 ― ある分布の集合は指数分布族と呼ばれ、正準パラメータまたはリンク関数とも呼ばれる自然パラメータη、十分統計量T(y)及び対数分配関数a(η)を用いて、次のように表される:
+
+
+
+**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+⟶備考:T(y)=yとすることが多い。また、exp(−a(η))は確率の合計が1になることを保証する正規化定数と見なせる。
+
+
+
+**36. Here are the most common exponential distributions summed up in the following table:**
+
+⟶最も一般的な指数分布族は下表に集約される:
+
+
+
+**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+⟶分布、ベルヌーイ、ガウス、ポワソン、幾何
+
+
+
+**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+⟶GLMの仮定 ― 一般化線形モデル(GLM)はランダムな変数yをx∈Rn+1の関数として予測することを目的とし、次の3つの仮定に依拠する:
+
+
+
+**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+⟶備考:最小2乗回帰とロジスティック回帰は一般化線形モデルの特殊な例である。
+
+
+
+**40. Support Vector Machines**
+
+⟶サポートベクターマシン
+
+
+
+**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+
+⟶サポートベクターマシンの目的は、データ点からの最短距離が最大となる境界線を求めることである。
+
+
+
+**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+⟶最適マージン分類器 ― 最適マージン分類器hは次のようなものである:
+
+
+
+**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+⟶ここで、(w,b)∈Rn×Rは次の最適化問題の解である:
+
+
+
+**44. such that**
+
+⟶ただし
+
+
+
+**45. support vectors**
+
+⟶サポートベクター
+
+
+
+**46. Remark: the line is defined as wTx−b=0.**
+
+⟶備考:直線はwTx−b=0と定義する。
+
+
+
+**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+⟶ヒンジ損失 ― ヒンジ損失はSVMの設定に用いられ、次のように定義される:
+
+
+
+**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
+
+⟶カーネル ― 特徴写像をϕとすると、カーネルKは次のように定義される:
+
+
+
+**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||^2/(2σ^2)) is called the Gaussian kernel and is commonly used.**
+
+⟶実際には、K(x,z)=exp(−||x−z||^2/(2σ^2))と定義され、ガウシアンカーネルと呼ばれるカーネルKがよく使われる。
+
+
+
+**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+⟶非線形分離問題、カーネル写像の適用、元の空間における決定境界
+
+
+
+**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+⟶備考:カーネルを用いてコスト関数を計算する「カーネルトリック」を用いる。なぜなら、明示的な写像ϕを実際には知る必要はないし、それはしばしば非常に複雑になってしまうからである。代わりに、K(x,z)の値のみが必要である。
+
+
+
+**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+⟶ラグランジアン ― ラグランジアンL(w,b)を次のように定義する:
+
+
+
+**53. Remark: the coefficients βi are called the Lagrange multipliers.**
+
+⟶備考:係数βiはラグランジュ乗数と呼ばれる。
+
+
+
+**54. Generative Learning**
+
+⟶生成学習
+
+
+
+**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+⟶生成モデルは、P(x|y)を推定することによりデータがどのように生成されるのかを学習しようとする。それはベイズの定理を用いてP(y|x)を推定するために使える。
+
+
+
+**56. Gaussian Discriminant Analysis**
+
+⟶ガウシアン判別分析
+
+
+
+**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+⟶前提条件 ― ガウシアン判別分析はyとx|y=0とx|y=1は次のようであることを前提とする:
+
+
+
+**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+⟶推定 ― 尤度を最大にすると得られる推定量は下表に集約される:
+
+
+
+**59. Naive Bayes**
+
+⟶ナイーブベイズ
+
+
+
+**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+⟶仮定 ― ナイーブベイズモデルは、個々のデータ点の特徴量が全て独立であると仮定する:
+
+
+
+**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+⟶解 ― 対数尤度を最大にすると次の解を得る。ただし、k∈{0,1},l∈[[1,L]]とする。
+
+
+
+**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+⟶備考:ナイーブベイズはテキスト分類やスパム検知に幅広く使われている。
+
+
+
+**63. Tree-based and ensemble methods**
+
+⟶決定木とアンサンブル学習
+
+
+
+**64. These methods can be used for both regression and classification problems.**
+
+⟶これらの方法は回帰と分類問題の両方に使える。
+
+
+
+**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
+
+⟶CART ― 分類・回帰木 (CART)は、一般には決定木として知られ、二分木として表される。非常に解釈しやすいという利点がある。
+
+
+
+**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+⟶ランダムフォレスト ― これは決定木をベースにしたもので、ランダムに選択された特徴量の集合から構築された多数の決定木を用いる。単純な決定木と異なり、非常に解釈しにくいが、一般的に良い性能が出るのでよく使われるアルゴリズムである。
+
+
+
+**67. Remark: random forests are a type of ensemble methods.**
+
+⟶備考:ランダムフォレストはアンサンブル学習の一種である。
+
+
+
+**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+⟶ブースティング ― ブースティングの考え方は、複数の弱い学習器を束ねることで1つのより強い学習器を作るというものである。主なものは次の表に集約される:
+
+
+
+**69. [Adaptive boosting, Gradient boosting]**
+
+⟶[適応的ブースティング、勾配ブースティング]
+
+
+
+**70. High weights are put on errors to improve at the next boosting step**
+
+⟶次のブースティングステップにて改善すべき誤分類に大きい重みが課される。
+
+
+
+**71. Weak learners trained on remaining errors**
+
+⟶残っている誤分類を弱い学習器が学習する。
+
+
+
+**72. Other non-parametric approaches**
+
+⟶他のノンパラメトリックな手法
+
+
+
+**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶k近傍法 ― k近傍法は、一般的にk-NNとして知られ、あるデータ点の応答はそのk個の最近傍点の性質によって決まるノンパラメトリックな手法である。分類と回帰の両方に用いることができる。
+
+
+
+**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶備考:パラメータkが大きくなるほど、バイアスが大きくなる。パラメータkが小さくなるほど、分散が大きくなる。
+
+
+
+**75. Learning Theory**
+
+⟶学習理論
+
+
+
+**76. Union bound ― Let A1,...,Ak be k events. We have:**
+
+⟶和集合上界 ― A1,...,Akというk個の事象があるとき、次が成り立つ:
+
+
+
+**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+⟶ヘフディング不等式 ― パラメータϕのベルヌーイ分布から得られるm個の独立同分布変数をZ1,..,Zmとする。その標本平均をˆϕとし、γは正の定数であるとすると、次が成り立つ:
+
+
+
+**78. Remark: this inequality is also known as the Chernoff bound.**
+
+⟶備考:この不等式はチェルノフ上界としても知られる。
+
+
+
+**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+⟶学習誤差 ― ある分類器hに対して、学習誤差、あるいは経験損失か経験誤差としても知られるˆϵ(h)を次のように定義する:
+
+
+
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
+
+⟶確率的に近似的に正しい (PAC) ― PACとは、その下で学習理論に関する様々な業績が証明されてきたフレームワークであり、次の前提がある:
+
+
+
+**81: the training and testing sets follow the same distribution**
+
+⟶学習データと検証データは同じ分布に従う。
+
+
+
+**82. the training examples are drawn independently**
+
+⟶学習標本は独立に取得される。
+
+
+
+**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+⟶細分化 ― 集合S={x(1),...,x(d)}と分類器の集合Hがあるとき、もし任意のラベル{y(1),...,y(d)}の集合に対して次が成り立つとき、HはSを細分化する:
+
+
+
+**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+⟶上界定理 ― Hを|H|=kで有限の仮説集合とし、δとサンプルサイズmは定数とする。そのとき、少なくとも1-δの確率で次が成り立つ:
+
+
+
+**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+⟶VC次元 ― ある仮説集合Hのヴァプニク・チェルヴォーネンキス次元 (VC)は、VC(H)と表記され、それはHによって細分化される最大の集合のサイズである。
+
+
+
+**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+⟶備考:2次元の線形分類器の集合であるHのVC次元は3である。
+
+
+
+**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+⟶定理(ヴァプニク) ― あるHについてVC(H)=dであり、mを学習標本の数とする。少なくとも1−δの確率で次が成り立つ:
+
+
+
+**88. [Introduction, Type of prediction, Type of model]**
+
+⟶[導入、予測の種類、モデルの種類]
+
+
+
+**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
+
+⟶[記法と全般的な概念、損失関数、勾配降下、尤度]
+
+
+
+**90. [Linear models, linear regression, logistic regression, generalized linear models]**
+
+⟶[線形モデル、線形回帰、ロジスティック回帰、一般化線形モデル]
+
+
+
+**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
+
+⟶[サポートベクターマシン、最適マージン分類器、ヒンジ損失、カーネル]
+
+
+
+**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
+
+⟶[生成学習、ガウシアン判別分析、ナイーブベイズ]
+
+
+
+**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
+
+⟶[ツリーとアンサンブル学習、CART、ランダムフォレスト、ブースティング]
+
+
+
+**94. [Other methods, k-NN]**
+
+⟶[他の手法、k近傍法]
+
+
+
+**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
+
+⟶[学習理論、ヘフディング不等式、PAC、VC次元]
diff --git a/ja/cs-229-unsupervised-learning.md b/ja/cs-229-unsupervised-learning.md
new file mode 100644
index 000000000..cc8111e7c
--- /dev/null
+++ b/ja/cs-229-unsupervised-learning.md
@@ -0,0 +1,339 @@
+**1. Unsupervised Learning cheatsheet**
+
+⟶教師なし学習チートシート
+
+
+
+**2. Introduction to Unsupervised Learning**
+
+⟶教師なし学習の概要
+
+
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+⟶モチベーション - 教師なし学習の目的はラベルのないデータ{x(1),...,x(m)}に隠されたパターンを探すことです。
+
+
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+⟶イェンセンの不等式 - fを凸関数、Xを確率変数とすると、次の不等式が成り立ちます:
+
+
+
+**5. Clustering**
+
+⟶クラスタリング
+
+
+
+**6. Expectation-Maximization**
+
+⟶期待値最大化
+
+
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+⟶潜在変数 - 潜在変数は推定問題を困難にする隠れた/観測されていない変数であり、多くの場合zで示されます。潜在変数がある最も一般的な設定は次のとおりです:
+
+
+
+**8. [Setting, Latent variable z, Comments]**
+
+⟶[設定、潜在変数z、コメント]
+
+
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+⟶[k個のガウス分布の混合、因子分析]
+
+
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶アルゴリズム - EMアルゴリズムは次のように尤度の下限の構築(E-ステップ)と、その下限の最適化(M-ステップ)を繰り返し行うことによる最尤推定によりパラメーターθを推定する効率的な方法を提供します:
+
+
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+⟶E-ステップ: 各データポイントx(i)が特定クラスターz(i)に由来する事後確率Qi(z(i))を次のように評価します:
+
+
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+⟶M-ステップ: 事後確率Qi(z(i))をデータポイントx(i)のクラスター固有の重みとして使い、次のように各クラスターモデルを個別に再推定します:
+
+
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+⟶[ガウス分布初期化、期待値ステップ、最大化ステップ、収束]
+
+
+
+**14. k-means clustering**
+
+⟶k平均法
+
+
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+⟶データポイントiのクラスタをc(i)、クラスタjの中心をμjと表記します。
+
+
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶アルゴリズム - クラスターの重心μ1,μ2,...,μk∈Rnをランダムに初期化後、k-meansアルゴリズムが収束するまで次のようなステップを繰り返します:
+
+
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶ [平均の初期化、クラスター割り当て、平均の更新、収束]
+
+
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+⟶ひずみ関数 - アルゴリズムが収束するかどうかを確認するため、次のように定義されたひずみ関数を参照します:
+
+
+
+**19. Hierarchical clustering**
+
+⟶ 階層的クラスタリング
+
+
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
+
+⟶アルゴリズム - これは入れ子になったクラスタを逐次的に構築する凝集階層アプローチによるクラスタリングアルゴリズムです。
+
+
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
+
+⟶ 種類 ― 様々な目的関数を最適化するための様々な種類の階層クラスタリングアルゴリズムが以下の表にまとめられています。
+
+
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+⟶ [ウォードリンケージ、平均リンケージ、完全リンケージ]
+
+
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
+
+⟶ [クラスター内の距離最小化、クラスターペア間の平均距離の最小化、クラスターペア間の最大距離の最小化]
+
+
+
+**24. Clustering assessment metrics**
+
+⟶ クラスタリング評価指標
+
+
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+⟶ 教師なし学習では、教師あり学習の場合のような正解ラベルがないため、モデルの性能を評価することが困難な場合が多くあります。
+
+
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+⟶ シルエット係数 ― ある1つのサンプルと同じクラス内のその他全ての点との平均距離をa、そのサンプルから最も近いクラスタ内の全ての点との平均距離をbと表記すると、そのサンプルのシルエット係数sは次のように定義されます:
+
+
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+⟶ Calinski-Harabazインデックス ― クラスタの数をkと表記すると、クラスタ間およびクラスタ内の分散行列であるBkおよびWkはそれぞれ以下のように定義されます。
+
+
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+⟶ Calinski-Harabazインデックスs(k)はクラスタリングモデルが各クラスタをどの程度適切に定義しているかを示します。つまり、スコアが高いほど、各クラスタはより密で、十分に分離されています。それは次のように定義されます:
+
+
+
+**29. Dimension reduction**
+
+⟶ 次元削減
+
+
+
+**30. Principal component analysis**
+
+⟶ 主成分分析
+
+
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+⟶ これは分散を最大にするデータの射影方向を見つける次元削減手法です。
+
+
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ 固有値、固有ベクトル - 行列 A∈Rn×nが与えられたとき、次の式を満たすベクトルz∈Rn∖{0}(固有ベクトルと呼ばれます)が存在するならば、λはAの固有値と呼ばれます。
+
+
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ スペクトル定理 - A∈Rn×nとする。Aが対称のとき、Aは実直交行列U∈Rn×nを用いて対角化可能です。Λ=diag(λ1,...,λn)と表記することで、次の式を得ます。
+
+
+
+**34. diagonal**
+
+⟶ diagonal
+
+
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.**
+
+⟶ 注釈: 最大固有値に対応する固有ベクトルは行列Aの主固有ベクトルと呼ばれます。
+
+
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
+
+⟶ アルゴリズム ― 主成分分析(PCA)の過程は、次のようにデータの分散を最大化することによりデータをk次元に射影する次元削減の技術です。
+
+
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶ ステップ1:平均が0で標準偏差が1となるようにデータを正規化します。
+
+
+
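+**38. Step 2: Compute $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^{T}\in\mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.**
+
+⟶ ステップ2:実固有値をもつ対称行列である$\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^{T}\in\mathbb{R}^{n\times n}$を計算します。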
+
+
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+⟶ ステップ3:Σのk個の直交する主固有ベクトルu1,...,uk∈Rn、すなわち上位k個の最大固有値に対応する直交固有ベクトルを計算します。
+
+
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+⟶ ステップ4:データをspanR(u1,...,uk)に射影します。
+
+
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶ この過程は全てのk次元空間の間の分散を最大化します。
+
+
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶ [特徴空間内のデータ、主成分の探索、主成分空間内のデータ]
+
+
+
+**43. Independent component analysis**
+
+⟶ 独立成分分析
+
+
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+⟶ 隠れた生成源を見つけることを意図した技術です。
+
+
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+⟶ 仮定 ― データxは、n次元の信号源ベクトルs=(s1,...,sn)から、非特異な混合行列Aを介して次のように生成されると仮定します。ただしsiは互いに独立な確率変数です:
+
+
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+⟶ 非混合行列W=A−1を見つけることが目的です。
+
+
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+⟶ BellとSejnowskiのICAアルゴリズム ― このアルゴリズムは非混合行列Wを次のステップによって見つけます:
+
+
+
+**48. Write the probability of x=As=W−1s as:**
+
+⟶ x=As=W−1sの確率を次のように表します:
+
+
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+⟶ 学習データを{x(i),i∈[[1,m]]}、シグモイド関数をgとし、対数尤度を次のように表します:
+
+
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+⟶ そのため、確率的勾配上昇法の学習規則は、学習サンプルx(i)に対して次のようにWを更新するものです:
+
+
+
+**51. The Machine Learning cheatsheets are now available in [target language].**
+
+⟶ 機械学習チートシートは日本語で読めます。
+
+
+
+**52. Original authors**
+
+⟶ 原著者
+
+
+
+**53. Translated by X, Y and Z**
+
+⟶ X・Y・Z 訳
+
+
+
+**54. Reviewed by X, Y and Z**
+
+⟶ X・Y・Z 校正
+
+
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+⟶ [導入、動機、イェンセンの不等式]
+
+
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+⟶[クラスタリング、期待値最大化法、k-means、階層クラスタリング、指標]
+
+
+
+**57. [Dimension reduction, PCA, ICA]**
+
+⟶ [次元削減、PCA、ICA]
diff --git a/ja/cs-230-convolutional-neural-networks.md b/ja/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..178592414
--- /dev/null
+++ b/ja/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,717 @@
+**Convolutional Neural Networks translation**
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶ 畳み込みニューラルネットワーク チートシート
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - ディープラーニング
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶ [概要、アーキテクチャ構造]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶ [層の種類、畳み込み、プーリング、全結合]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶ [フィルタハイパーパラメータ、次元、ストライド、パディング]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶ [ハイパーパラメータの調整、パラメータの互換性、モデルの複雑さ、受容野]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶ [活性化関数、正規化線形ユニット、ソフトマックス]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶ [物体検出、モデルの種類、検出、IoU、非極大抑制、YOLO、R-CNN]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶ [顔認証/認識、One shot学習、シャムネットワーク、トリプレット損失]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶ [ニューラルスタイル変換、活性化、スタイル行列、スタイル/コンテンツコスト関数]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶ [計算トリックアーキテクチャ、敵対的生成ネットワーク、ResNet、インセプションネットワーク]
+
+
+
+
+**12. Overview**
+
+⟶ 概要
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶ 伝統的な畳み込みニューラルネットワークのアーキテクチャ - CNNとしても知られる畳み込みニューラルネットワークは一般的に次の層で構成される特定種類のニューラルネットワークです。
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶ 畳み込み層とプーリング層は次のセクションで説明されるハイパーパラメータに関してファインチューニングできます。
+
+
+
+
+**15. Types of layer**
+
+⟶ 層の種類
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶ 畳み込み層 (CONV) - 畳み込み層 (CONV)は入力Iを各次元に関して走査する時に、畳み込み演算を行うフィルタを使用します。畳み込み層のハイパーパラメータにはフィルタサイズFとストライドSが含まれます。結果出力Oは特徴マップまたは活性化マップと呼ばれます。
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶ 注: 畳み込みステップは1次元や3次元の場合にも一般化できます。
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶ プーリング (POOL) - プーリング層 (POOL)は位置不変性をもつ縮小操作で、通常は畳み込み層の後に適用されます。特に、最大及び平均プーリングはそれぞれ最大と平均値が取られる特別な種類のプーリングです。
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶ [種類、目的、図、コメント]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶ [最大プーリング、平均プーリング、各プーリング操作は現在のビューの中から最大値を選ぶ、各プーリング操作は現在のビューに含まれる値を平均する]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶ [検出された特徴の保持、最も一般的な利用、特徴マップをダウンサンプリング、LeNetでの利用]
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶ 全結合 (FC) - 全結合 (FC) 層は平坦化された入力に対して演算を行い、各入力は全てのニューロンに接続されます。FC層が存在する場合、通常はCNNアーキテクチャの終盤に配置され、クラススコアなどの目的を最適化するために利用できます。
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶ フィルタハイパーパラメータ
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.**
+
+⟶ 畳み込み層にはハイパーパラメータの背後にある意味を知ることが重要なフィルタが含まれています。
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶ フィルタの次元 - C個のチャネルを含む入力に適用されるF×Fサイズのフィルタの体積はF×F×Cで、それはI×I×Cサイズの入力に対して畳み込みを実行してO×O×1サイズの特徴マップ(活性化マップとも呼ばれる)出力を生成します。
+
+
+
+
+
+**26. Filter**
+
+⟶ フィルタ
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶ 注: F×FサイズのK個のフィルタを適用すると、O×O×Kサイズの特徴マップの出力を得られます。
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶ ストライド - 畳み込みまたはプーリング操作において、ストライドSは各操作の後にウィンドウを移動させるピクセル数を表します。
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶ ゼロパディング - ゼロパディングとは入力の各境界に対してP個のゼロを追加するプロセスを意味します。この値は手動で指定することも、以下に詳述する3つのモードのいずれかを使用して自動的に設定することもできます。
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶ [モード、値、図、目的、Valid、Same、Full]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶ [パディングなし、次元が合わない場合は最後の畳み込みを破棄、特徴マップのサイズが⌈I/S⌉になるようなパディング、出力サイズは数学的に扱いやすい、「ハーフ」パディングとも呼ばれる、入力の一番端まで畳み込みが適用されるような最大パディング、フィルタは入力を端から端まで「見る」]
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶ ハイパーパラメータの調整
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶ 畳み込み層内のパラメータ互換性 - Iを入力ボリュームサイズの長さ、Fをフィルタの長さ、Pをゼロパディングの量, Sをストライドとすると、その次元に沿った特徴マップの出力サイズOは次式で与えられます:
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶ [入力、フィルタ、出力]
+
+
+
+
+**35. Remark: oftentimes, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶ 注: 多くの場合Pstart=Pend≜Pであり、上記の式のPstart+Pendを2Pに置き換える事ができます。
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶ モデルの複雑さを理解する - モデルの複雑さを評価するために、モデルのアーキテクチャが持つパラメータの数を測定することがしばしば有用です。畳み込みニューラルネットワークの各層では、以下のように行なわれます:
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶ [図、入力サイズ、出力サイズ、パラメータの数、備考]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S<F, A common choice for K is 2C]**
+
+⟶
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶ [チャネルごとに行われるプーリング操作、ほとんどの場合、S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶ [入力は平坦化される、ニューロンごとにひとつのバイアスパラメータ、FCのニューロンの数には構造的制約がない]
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶ 受容野 - 層kにおける受容野は、k番目の活性化マップの各ピクセルが「見る」ことができる入力のRk×Rkの領域です。層jのフィルタサイズをFj、層iのストライド値をSiとし、慣例に従ってS0=1とすると、層kでの受容野は次の式で計算されます:
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶ 下記の例のようにF1=F2=3、S1=S2=1とすると、R2=1+2⋅1+2⋅1=5となります。
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶ よく使われる活性化関数
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶ 正規化線形ユニット - 正規化線形ユニット層(ReLU)はボリュームの全ての要素に利用される活性化関数gです。ReLUの目的は非線型性をネットワークに導入することです。変種は以下の表でまとめられています:
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶[ReLU、Leaky ReLU、ELU、ただし]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶ [生物学的に解釈可能な非線形複雑性、負の値に対してReLUが死んでいる問題への対処、どこでも微分可能]
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶ ソフトマックス - ソフトマックスのステップは入力としてスコアx∈Rnのベクトルを取り、アーキテクチャの最後にあるソフトマックス関数を通じて確率p∈Rnのベクトルを出力する一般化されたロジスティック関数として見ることができます。次のように定義されます:
+
+
+
+
+**48. where**
+
+⟶ ここで
+
+
+
+
+**49. Object detection**
+
+⟶ 物体検出
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶ モデルの種類 - 物体認識アルゴリズムは主に3つの種類があり、予測されるものの性質は異なります。次の表で説明されています:
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶ [画像分類、位置特定を伴う分類、検出]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶ [テディベア、本]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶ [画像の分類、物体の確率の予測, 画像内の物体の検出、物体の確率とその位置の予測、画像内の複数の物体の検出、複数の物体の確率と位置の予測]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶ [伝統的なCNN、単純化されたYOLO、R-CNN、YOLO、R-CNN]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶ 検出 - 物体検出の文脈では、画像内の物体の位置を特定したいだけなのかあるいは複雑な形状を検出したいのかによって、異なる方法が使用されます。二つの主なものは次の表でまとめられています:
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶ [バウンディングボックス検出、ランドマーク検出]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶ [物体が位置する画像の部分の検出、物体(たとえば目)の形状または特徴の検出、より細かい粒度]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶ [中心(bx, by)、高さbh、幅bwのボックス、参照点(l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶ Intersection over Union - Intersection over Union (IoUとしても知られる)は予測された境界ボックスBpが実際の境界ボックスBaに対してどれだけ正しく配置されているかを定量化する関数です。次のように定義されます:
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶ 注:常にIoU∈[0,1]となります。慣例では、IoU(Bp,Ba)⩾0.5の場合、予測された境界ボックスBpはそこそこ良いと見なされます。
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶ アンカーボックス - アンカーボクシングは重なり合う境界ボックスを予測するために使用される手法です。 実際には、ネットワークは同時に複数のボックスを予測することを許可されており、各ボックスの予測は特定の幾何学的属性の組み合わせを持つように制約されます。例えば、最初の予測は特定の形式の長方形のボックスになる可能性があり、2番目の予測は異なる幾何学的形式の別の長方形のボックスになります。
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶ 非極大抑制 - 非極大抑制技術のねらいは、最も代表的なものを選択することによって、同じ物体の重複した重なり合う境界ボックスを除去することです。0.6未満の予測確率を持つボックスを全て除去した後、残りのボックスがある間、以下の手順が繰り返されます:
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶ [特定のクラスに対して、ステップ1: 最大の予測確率を持つボックスを選ぶ。ステップ2: そのボックスに対してIoU⩾0.5となる全てのボックスを破棄する。]
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶ [ボックス予測、最大確率のボックス選択、同じクラスの重複除去、最終的な境界ボックス]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶ YOLO - You Only Look Once (YOLO)は次の手順を実行する物体検出アルゴリズムです:
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶ [ステップ1: 入力画像をG×Gグリッドに分割する。、ステップ2: 各グリッドセルに対して次の形式のyを予測するCNNを実行する:、k回繰り返す]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶ ここで、pcは物体を検出する確率、bx,by,bh,bwは検出された境界ボックスの属性、c1, ..., cpはp個のクラスのうちどれが検出されたかのOne-hot表現、kはアンカーボックスの数です。
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶ ステップ3: 重複する可能性のある重なり合う境界ボックスを全て除去するため、非極大抑制アルゴリズムを実行する。
+
+
+
+
+**69. [Original image, Division in G×G grid, Bounding box prediction, Non-max suppression]**
+
+⟶ [元の画像、G×Gグリッドでの分割、境界ボックス予測、非極大抑制]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶ 注: pc=0のとき、ネットワークは物体を検出しません。その場合には、対応する予測 bx, ..., cpは無視する必要があります。
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+⟶ R-CNN - Region with Convolutional Neural Networks (R-CNN)は物体検出アルゴリズムで、最初に画像をセグメント化して潜在的に関連する境界ボックスを見つけ、次に検出アルゴリズムを実行してそれらの境界ボックス内で最も可能性の高い物体を見つけます。
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶ [元の画像、セグメンテーション、境界ボックス予測、非極大抑制]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶ 注: 元のアルゴリズムは計算コストが高くて遅いですが、Fast R-CNNやFaster R-CNNなどの、より新しいアーキテクチャではアルゴリズムをより速く実行できます。
+
+
+
+
+**74. Face verification and recognition**
+
+⟶ 顔認証及び認識
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in the table below:**
+
+⟶ モデルの種類 - 2種類の主要なモデルが次の表にまとめられています:
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶ [顔認証、顔認識、クエリ、参照、データベース]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶ [これは正しい人ですか?、1対1検索、これはデータベース内のK人のうちの1人ですか?、1対多検索]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶ ワンショット学習 - ワンショット学習は限られた学習セットを利用して、2つの与えられた画像の違いを定量化する類似度関数を学習する顔認証アルゴリズムです。2つの画像に適用される類似度関数はしばしばd(画像1, 画像2)と記されます。
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶ シャムネットワーク - シャムネットワークは画像のエンコード方法を学習して2つの画像の違いを定量化することを目的としています。与えられた入力画像x(i)に対してエンコードされた出力はしばしばf(x(i))と記されます。
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶ トリプレット損失 - トリプレット損失ℓは3つ組の画像A(アンカー)、P(ポジティブ)、N(ネガティブ)の埋め込み表現で計算される損失関数です。アンカーとポジティブ例は同じクラスに属し、ネガティブ例は別のクラスに属します。マージンパラメータをα∈R+と呼ぶことによってこの損失は次のように定義されます:
+
+
+
+
+**81. Neural style transfer**
+
+⟶ ニューラルスタイル変換
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶ モチベーション - ニューラルスタイル変換の目的は与えられたコンテンツCとスタイルSに基づく画像Gを生成することです。
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶ [コンテンツC、スタイルS、生成された画像G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶ 活性化 - 層lにおける活性化はa[l]と表記され、次元はnH×nw×ncです。
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶ コンテンツコスト関数 - Jcontent(C, G)というコンテンツコスト関数は生成された画像Gと元のコンテンツ画像Cとの違いを測定するため利用されます。以下のように定義されます:
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶ スタイル行列 - 与えられた層lのスタイル行列G[l]はグラム行列で、各要素G[l]kk′がチャネルkとk′の相関関係を定量化します。活性化a[l]に関して次のように定義されます:
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶ 注: スタイル画像及び生成された画像に対するスタイル行列はそれぞれG[l] (S)、G[l] (G)と表記されます。
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶ スタイルコスト関数 - スタイルコスト関数Jstyle(S,G)は生成された画像GとスタイルSとの違いを測定するため利用されます。以下のように定義されます:
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶ 全体のコスト関数 - 全体のコスト関数は以下のようにパラメータα,βによって重み付けされたコンテンツ及びスタイルコスト関数の組み合わせとして定義されます:
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶ 注: αの値を大きくするとモデルはコンテンツを重視し、βの値を大きくするとスタイルを重視します。
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶ 計算トリックを使うアーキテクチャ
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+⟶ 敵対的生成ネットワーク - 敵対的生成ネットワーク(GANsとも呼ばれる)は生成モデルと識別モデルで構成されます。生成モデルの目的は、生成された画像と本物の画像を区別することを目的とする識別モデルに与えられる、最も本物らしい出力を生成することです。
+
+
+
+
+**93. [Training set, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶ [学習セット、ノイズ、現実世界の画像、生成器、識別器、真偽]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶ 注: GANsの変種を使用するユースケースにはテキストからの画像生成, 音楽生成及び合成があります。
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ ResNet - Residual Networkアーキテクチャ(ResNetとも呼ばれる)は学習誤差を減らすため多数の層がある残差ブロックを使用します。残差ブロックは次の式で特徴付けられます:
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶ インセプションネットワーク - このアーキテクチャはインセプションモジュールを利用し、特徴量の多様化を通じてパフォーマンスを向上させるため、様々な畳み込みを試すことを目的としています。特に、計算負荷を限定するため1×1畳み込みトリックを使います。
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ ディープラーニングのチートシートが日本語で利用可能になりました。
+
+
+
+
+**98. Original authors**
+
+⟶ 原著者
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶ X・Y・Z 訳
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶ X・Y・Z 校正
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶ GitHubでPDF版を見る
+
+
+
+
+**102. By X and Y**
+
+⟶ X・Y 著
+
+
diff --git a/ja/cs-230-deep-learning-tips-and-tricks.md b/ja/cs-230-deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..a7de15349
--- /dev/null
+++ b/ja/cs-230-deep-learning-tips-and-tricks.md
@@ -0,0 +1,457 @@
+**Deep Learning Tips and Tricks translation**
+
+
+
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+⟶深層学習(ディープラーニング)のアドバイスやコツのチートシート
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶CS 230 - 深層学習
+
+
+
+
+**3. Tips and tricks**
+
+⟶アドバイスやコツ
+
+
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+⟶データ処理、Data augmentation (データ拡張)、Batch normalization (バッチ正規化)
+
+
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+⟶ニューラルネットワークの学習、エポック、ミニバッチ、交差エントロピー誤差、誤差逆伝播法、勾配降下法、重み更新、勾配チェック
+
+
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+⟶パラメータチューニング、Xavier初期化、転移学習、学習率、適応学習率
+
+
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+⟶正則化、Dropout (ドロップアウト)、重みの正則化、Early stopping (早期終了)
+
+
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+⟶おすすめの技法、小さいバッチの過学習、勾配チェック
+
+
+
+
+**9. View PDF version on GitHub**
+
+⟶GitHubでPDF版を見る
+
+
+
+
+**10. Data processing**
+
+⟶データ処理
+
+
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+⟶Data augmentation (データ拡張) - 大抵の場合、深層学習のモデルを適切に訓練するには大量のデータが必要です。Data augmentation という技術を用いて既存のデータからデータを増やすことがしばしば有用です。主な手法を以下の表にまとめます。より正確には、以下の入力画像に対して、下記の技術を適用できます。
+
+
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+⟶元の画像、反転、回転、ランダムな切り抜き
+
+
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+⟶何も変更されていない画像、画像の意味が変わらない軸における反転、わずかな角度の回転、不正確な水平線の校正(calibration)をシミュレートする、画像の一部へのランダムなフォーカス、連続して数回のランダムな切り抜きが可能
+
+
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+⟶カラーシフト、ノイズの付加、情報損失、コントラストの変更
+
+
+
+
+**15. [Nuances of RGB are slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposure due to time of day]**
+
+⟶RGBのわずかな修正、照らされ方によるノイズを捉える、ノイズの付加、入力画像の品質のばらつきへの耐性の強化、画像の一部を無視、画像の一部が欠ける可能性を再現する、明るさの変化、時刻による露出の違いをコントロールする
+
+
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+⟶備考:データ拡張は通常、学習中にその場で(オンザフライで)行われます。
+
+
+
+
+**17. Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting $\mu_B$, $\sigma_B^2$ the mean and variance of the batch that we want to correct, it is done as follows:**
+
+⟶Batch normalization (バッチ正規化) - ハイパーパラメータ γ、β によってバッチ {xi} を正規化するステップです。補正したいバッチの平均と分散をそれぞれ$\mu_B$、$\sigma_B^2$と表記すると、以下のように行えます。
+
+
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶より高い学習率を利用可能にし初期化への強い依存を減らすことを目的として、基本的には全結合層・畳み込み層のあとで非線形層の前に行います。
+
+
+
+
+**19. Training a neural network**
+
+⟶ニューラルネットワークの学習
+
+
+
+
+**20. Definitions**
+
+⟶定義
+
+
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+⟶エポック - モデル学習においてエポックとは学習の繰り返しの中の1回を指す用語で、1エポックの間にモデルは全学習データからその重みを更新します。
+
+
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+⟶ミニバッチ勾配降下法 - 学習段階では、計算が複雑になりすぎるため通常は全データを一度に使って重みを更新することはありません。またノイズが問題になるため1つのデータポイントだけを使って重みを更新することもありません。代わりに、更新はミニバッチごとに行われます。各バッチに含まれるデータポイントの数は調整可能なハイパーパラメータです。
+
+
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+⟶損失関数 - 得られたモデルの性能を数値化するために、モデルの出力zが実際の出力yをどの程度正確に予測できているかを評価する損失関数Lが通常使われます。
+
+
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶交差エントロピー誤差 - ニューラルネットワークにおける二項分類では、交差エントロピー誤差L(z,y)が一般的に使用されており、以下のように定義されています。
+
+
+
+
+**25. Finding optimal weights**
+
+⟶最適な重みの探索
+
+
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+⟶誤差逆伝播法 - 実際の出力と期待される出力の差に基づいてニューラルネットワークの重みを更新する手法です。各重みwに関する微分は連鎖律を用いて計算されます。
+
+
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+⟶この方法を使用することで、それぞれの重みは次のルールにしたがって更新されます。
+
+
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶重みの更新 - ニューラルネットワークでは、以下の方法にしたがって重みが更新されます。
+
+
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+⟶ステップ1:訓練データのバッチを用いて順伝播で損失を計算します。ステップ2:損失を逆伝播させて各重みに関する損失の勾配を求めます。ステップ3:求めた勾配を用いてネットワークの重みを更新します。
+
+
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+⟶順伝播、逆伝播、重みの更新
+
+
+
+
+**31. Parameter tuning**
+
+⟶パラメータチューニング
+
+
+
+
+**32. Weights initialization**
+
+⟶重みの初期化
+
+
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization makes it possible to have initial weights that take into account characteristics that are unique to the architecture.**
+
+⟶Xavier初期化 - 完全にランダムな方法で重みを初期化するのではなく、そのアーキテクチャのユニークな特徴を考慮に入れて重みを初期化する方法です。
+
+
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+⟶転移学習 - 深層学習のモデルを学習させるには大量のデータと何よりも時間が必要です。膨大なデータセットから数日・数週間をかけて構築した学習済みモデルを利用し、自身のユースケースに活かすことは有益であることが多いです。手元にあるデータ量次第ではありますが、これを利用する以下の方法があります。
+
+
+
+
+**35. [Training size, Illustration, Explanation]**
+
+⟶学習サイズ、図、解説
+
+
+
+
+**36. [Small, Medium, Large]**
+
+⟶小、中、大
+
+
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+⟶全層を凍結し、softmaxの重みを学習させる、大半の層を凍結し、最終層とsoftmaxの重みを学習させる、学習済みの重みで初期化して各層とsoftmaxの重みを学習させる
+
+
+
+
+**38. Optimizing convergence**
+
+⟶収束の最適化
+
+
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+
+⟶学習率 - 多くの場合α、時にηと表記される学習率は、重みが更新される速さを表します。学習率は固定することも、適応的に変更することもできます。現在もっとも広く使われている手法は、学習率を適応的に調整するAdamと呼ばれる手法です。
+
+
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+⟶適応学習率法 - モデルを学習させる際に学習率を変動させると、学習時間を短縮し、数値的な最適解を改善できます。Adamがもっとも一般的に使用されている手法ですが、他の手法も役立つことがあります。それらの手法を下記の表にまとめました。
+
+
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+⟶手法、解説、wの更新、bの更新
+
+
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+⟶Momentum(運動量)、振動を抑制する、SGDの改良、チューニングするパラメータは2つ
+
+
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+⟶RMSprop, 二乗平均平方根のプロパゲーション、振動をコントロールすることで学習アルゴリズムを高速化する
+
+
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+⟶Adam, Adaptive Moment estimation, もっとも人気のある手法、チューニングするパラメータは4つ
+
+
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+⟶備考:他にAdadelta, Adagrad, SGD などの手法があります。
+
+
+
+
+**46. Regularization**
+
+⟶正則化
+
+
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+⟶ドロップアウト - ドロップアウトとは、ニューラルネットワークで過学習を避けるためにp>0の確率でノードをドロップアウト(無効化)する手法です。モデルが特定の特徴量に依存しすぎることを避けるよう強制します。
+
+
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+⟶備考:ほとんどの深層学習のフレームワークでは、ドロップアウトを'keep'というパラメータ(1-p)でパラメータ化します。
+
+
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+⟶重みの正則化 - 重みが大きくなりすぎず、モデルが過学習しないようにするため、モデルの重みに対して正則化を行います。主な正則化手法は以下の表にまとめられています。
+
+
+
+
+**50. [LASSO, Ridge, Elastic Net]**
+
+⟶LASSO, Ridge, Elastic Net
+
+
+
+**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶係数を0へ縮小する、変数選択に適している、係数を小さくする、変数選択と小さい係数のトレードオフ
+
+
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+⟶Early stopping - バリデーションの損失が変化しなくなるか、あるいは増加し始めたらすぐに学習を止める正則化手法です。
+
+
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+⟶誤差、検証、訓練、early stopping、エポック
+
+
+
+
+**53. Good practices**
+
+⟶おすすめの技法
+
+
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+⟶小さいバッチの過学習 - モデルをデバッグするとき、モデル自体の構造に大きな問題がないか確認するため簡易的なテストが役に立つことが多いです。特に、モデルを正しく学習できることを確認するため、ミニバッチをネットワークに渡してそれを過学習できるかを見ます。もしできなければ、モデルは複雑すぎるか単純すぎるかのいずれかであることを意味し、普通サイズの学習データセットはもちろん、小さいバッチですら過学習できないのです。
+
+
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+⟶Gradient checking (勾配チェック) - Gradient checking とは、ニューラルネットワークの逆伝播を実装する際に用いられる手法です。特定の点で解析的勾配と数値的勾配とを比較する手法で、逆伝播の実装が正しいことを確認できます。
+
+
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+⟶種類、数値的勾配、解析的勾配
+
+
+
+
+**57. [Formula, Comments]**
+
+⟶公式、コメント
+
+
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+⟶計算コストが高い;損失を次元ごとに2回計算する必要がある、解析的実装が正しいかのチェックに用いられる、hを選ぶ時に小さすぎると数値不安定になり、大きすぎると勾配近似が不正確になるというトレードオフがある
+
+
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+⟶「厳密な」結果、直接的な計算、最終的な実装で使われる
+
+
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶深層学習のチートシートは[対象言語]で利用可能になりました。
+
+
+**61. Original authors**
+
+⟶原著者
+
+
+
+**62. Translated by X, Y and Z**
+
+⟶X・Y・Z 訳
+
+
+
+**63. Reviewed by X, Y and Z**
+
+⟶X・Y・Z 校正
+
+
+
+**64. View PDF version on GitHub**
+
+⟶GitHubでPDF版を見る
+
+
+
+**65. By X and Y**
+
+⟶X・Y 著
+
+
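Note on entries 55 to 58 (gradient checking): the comparison they describe can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions. The toy quadratic loss, the step size `h=1e-5`, and every function name below are hypothetical and do not come from the cheatsheet.

```python
# Gradient-check sketch. The loss, the step size h, and all names are
# illustrative assumptions, not part of the translated cheatsheet.

def loss(w):
    """Toy quadratic loss: sum over i of (w_i - i)^2."""
    return sum((wi - i) ** 2 for i, wi in enumerate(w))


def analytical_grad(w):
    """Hand-derived gradient of the toy loss: 2 * (w_i - i)."""
    return [2.0 * (wi - i) for i, wi in enumerate(w)]


def numerical_grad(f, w, h=1e-5):
    """Centered finite differences: two loss evaluations per dimension."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        w_minus = list(w)
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad


def max_abs_diff(a, b):
    """Largest coordinate-wise gap between two gradients."""
    return max(abs(x - y) for x, y in zip(a, b))


w = [0.5, -1.0, 3.0]
gap = max_abs_diff(analytical_grad(w), numerical_grad(loss, w))
print("max gap between analytical and numerical gradient:", gap)
```

A small gap suggests the analytical gradient is implemented correctly. The loop also shows why the numerical version is expensive, since it calls the loss twice per dimension.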
diff --git a/ja/cs-230-recurrent-neural-networks.md b/ja/cs-230-recurrent-neural-networks.md
new file mode 100644
index 000000000..e366a86de
--- /dev/null
+++ b/ja/cs-230-recurrent-neural-networks.md
@@ -0,0 +1,678 @@
+**Recurrent Neural Networks translation**
+
+
+
+**1. Recurrent Neural Networks cheatsheet**
+
+⟶リカレントニューラルネットワーク チートシート
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ディープラーニング
+
+
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+⟶[概要、アーキテクチャの構造、RNNの応用、損失関数、逆伝播]
+
+
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+⟶[長期依存関係の処理、一般的な活性化関数、勾配消失/勾配爆発、勾配クリッピング、GRU/LSTM、ゲートの種類、双方向RNN、ディープRNN]
+
+
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+⟶[単語表現の学習、表記、埋め込み行列、Word2vec、スキップグラム、ネガティブサンプリング、GloVe]
+
+
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+⟶[単語の比較、コサイン類似度、t-SNE]
+
+
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+⟶[言語モデル、n-gramモデル、パープレキシティ]
+
+
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+⟶[機械翻訳、ビームサーチ、単語長の正規化、エラー分析、BLEUスコア(機械翻訳比較スコア)]
+
+
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+⟶[アテンション、アテンションモデル、アテンションウェイト]
+
+
+
+
+**10. Overview**
+
+⟶概要
+
+
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+⟶一般的なRNNのアーキテクチャ - RNNとして知られるリカレントニューラルネットワークは、隠れ層の状態を利用して、前の出力を次の入力として取り扱うことを可能にするニューラルネットワークの一種です。一般的なモデルは下記のようになります:
+
+
+
+
+**12. For each timestep t, the activation a and the output y are expressed as follows:**
+
+⟶各時刻 t において、活性化 a と出力 y は以下のように表されます:
+
+
+
+
+**13. and**
+
+⟶そして
+
+
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+⟶ここで、Wax,Waa,Wya,ba,by は全ての時点で共有される係数であり、g1,g2 は活性化関数です。
+
+
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+⟶一般的なRNNのアーキテクチャ利用の長所・短所については下記の表にまとめられています。
+
+
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+⟶[長所、任意の長さの入力の処理可能性、入力サイズに応じて大きくならないモデルサイズ、時系列情報を考慮した計算、全ての時点で共有される重み]
+
+
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+⟶[短所、計算が遅い、遠い過去の情報の利用が困難、現在の状態に対して将来の入力を考慮できない]
+
+
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+⟶RNNの応用 - RNNモデルは主に自然言語処理と音声認識の分野で使用されます。さまざまな応用例が以下の表にまとめられています:
+
+
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+⟶[RNNの種類、図、例]
+
+
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+⟶[一対一、一対多、多対一、多対多]
+
+
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Named entity recognition, Machine translation]**
+
+⟶[伝統的なニューラルネットワーク、音楽生成、感情分類、固有表現認識、機械翻訳]
+
+
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+⟶損失関数 - リカレントニューラルネットワークの場合、時間軸全体での損失関数Lは、各時点での損失に基づき、次のように定義されます:
+
+
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+⟶時間軸での誤差逆伝播法 - 誤差逆伝播(バックプロパゲーション)が各時点で行われます。時刻 T における、重み行列 W に関する損失 L の導関数は以下のように表されます:
+
+
+
+
+**24. Handling long term dependencies**
+
+⟶長期依存関係の処理
+
+
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+⟶一般的に使用される活性化関数 - RNNモジュールで使用される最も一般的な活性化関数を以下に説明します:
+
+
+
+
+**26. [Sigmoid, Tanh, RELU]**
+
+⟶[シグモイド、ハイパボリックタンジェント、RELU]
+
+
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
+
+⟶勾配消失と勾配爆発について - 勾配消失と勾配爆発の現象は、RNNでよく見られます。これらの現象が起こる理由は、掛け算の勾配が層の数に対して指数関数的に減少/増加する可能性があるため、長期の依存関係を捉えるのが難しいからです。
+
+
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+⟶勾配クリッピング - 誤差逆伝播法を実行するときに時折発生する勾配爆発問題に対処するために使用される手法です。勾配の上限値を定義することで、実際にこの現象が抑制されます。
+
+
+
+
+**29. clipped**
+
+⟶clipped
+
+
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+⟶ゲートの種類 - 勾配消失問題を解決するために、特定のゲートがいくつかのRNNで使用され、通常明確に定義された目的を持っています。それらは通常Γと記され、以下のように定義されます:
+
+
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+⟶ここで、W、U、bはゲート固有の係数、σはシグモイド関数です。主なものは以下の表にまとめられています:
+
+
+
+
+**32. [Type of gate, Role, Used in]**
+
+⟶[ゲートの種類、役割、下記で使用]
+
+
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+⟶[更新ゲート、関連ゲート、忘却ゲート、出力ゲート]
+
+
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+⟶[過去情報はどのくらい重要ですか?、前の情報を削除しますか?、セルを消去しますか?しませんか?、セルをどのくらい見せますか?]
+
+
+
+
+**35. [LSTM, GRU]**
+
+⟶[LSTM、GRU]
+
+
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+⟶GRU/LSTM - ゲート付きリカレントユニット(GRU)およびロングショートタームメモリユニット(LSTM)は、従来のRNNが直面した勾配消失問題を解決しようとします。LSTMはGRUを一般化したものです。各アーキテクチャを特徴づける式を以下の表にまとめます:
+
+
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+⟶[特徴づけ、ゲート付きリカレントユニット(GRU)、ロングショートタームメモリ(LSTM)、依存関係]
+
+
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+⟶備考:記号 ⋆ は2つのベクトル間の要素ごとの乗算を表します。
+
+
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+⟶RNNの変種 - 一般的に使用されている他のRNNアーキテクチャを以下の表にまとめます:
+
+
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+⟶[双方向(BRNN)、ディープ(DRNN)]
+
+
+
+
+**41. Learning word representation**
+
+⟶単語表現の学習
+
+
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+⟶この節では、Vは語彙、そして|V|は語彙のサイズを表します。
+
+
+
+
+**43. Motivation and notations**
+
+⟶動機と表記
+
+
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+⟶表現のテクニック - 単語を表現する2つの主な方法は、以下の表にまとめられています。
+
+
+
+
+**45. [1-hot representation, Word embedding]**
+
+⟶[1-hot表現、単語埋め込み(単語分散表現)]
+
+
+
+
+**46. [teddy bear, book, soft]**
+
+⟶[テディベア、本、柔らかい]
+
+
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]**
+
+⟶[owと表記、素朴なアプローチ、類似性の情報を持たない、ewと表記、単語の類似性を考慮する]
+
+
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+⟶埋め込み行列(分散表現行列) - 与えられた単語wに対して、埋め込み行列Eは、1-hot表現owを以下のように埋め込みewに写像する行列です:
+
+
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+⟶注:埋め込み行列は、ターゲット/コンテキスト尤度モデルを使用して学習できます。
+
+
+
+
+**50. Word embeddings**
+
+⟶単語の埋め込み
+
+
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+⟶Word2vec - Word2vecは、ある単語が他の単語の周辺にある可能性を推定することで、単語の埋め込みの重みを学習することを目的としたフレームワークです。人気のあるモデルは、スキップグラム、ネガティブサンプリング、およびCBOWです。
+
+
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+⟶[かわいいテディベアが読んでいる、テディベア、柔らかい、ペルシャ詩、芸術]
+
+
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+⟶[代理タスクでのネットワークの訓練、高水準表現の抽出、単語埋め込み重みの計算]
+
+
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+⟶スキップグラム - スキップグラムword2vecモデルは、あるターゲット単語tがコンテキスト単語cと一緒に出現する確率を評価することで単語の埋め込みを学習する教師付き学習タスクです。tに関するパラメータをθtと表記すると、その確率P(t|c) は以下の式で与えられます:
+
+
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+⟶注:softmax部分の分母の語彙全体を合計するため、このモデルの計算コストは高くなります。 CBOWは、ある単語を予測するため周辺単語を使用する別のタイプのword2vecモデルです。
+
+
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+⟶ネガティブサンプリング - ロジスティック回帰を使用したバイナリ分類器のセットで、特定の文脈とあるターゲット単語が同時に出現する確率を評価することを目的としています。モデルはk個のネガティブな例と1つのポジティブな例のセットで訓練されます。コンテキスト単語cとターゲット単語tが与えられると、予測は次のように表現されます。
+
+
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+⟶注:この方法の計算コストは、スキップグラムモデルよりも少ないです。
+
+
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+⟶GloVe - GloVeモデルは、単語表現のためのグローバルベクトルの略で、共起行列Xを使用する単語の埋め込み手法です。ここで、各Xi,jは、ターゲットiがコンテキストjで発生した回数を表します。そのコスト関数Jは以下の通りです:
+
+
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
+Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+⟶ここで、fはXi,j =0⟹f(Xi,j)= 0となるような重み関数です。このモデルでeとθが果たす対称性を考えると、最後の単語の埋め込みe(final)wは以下のようになります:
+
+
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+⟶注:学習された単語の埋め込みの個々の要素は、必ずしも解釈可能ではありません。
+
+
+
+
+**60. Comparing words**
+
+⟶単語の比較
+
+
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+⟶コサイン類似度 - 単語w1とw2のコサイン類似度は次のように表されます
+
+
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+⟶注:θは単語w1とw2の間の角度です。
+
+
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+⟶t-SNE - t-SNE(t分布型確率的近傍埋め込み)は、高次元の埋め込みをより低次元の空間へ削減することを目的とした手法です。実際には、2次元空間で単語ベクトルを視覚化するためによく使用されます。
+
+
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+⟶[文学、芸術、本、文化、詩、読書、知識、面白い、愛らしい、幼年期、親切、テディベア、柔らかい、抱擁、かわいい、愛らしい]
+
+
+
+
+**65. Language model**
+
+⟶言語モデル
+
+
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+⟶概要 - 言語モデルは文の確率P(y)を推定することを目的としています。
+
+
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearances in the training data.**
+
+⟶n-gramモデル - このモデルは、トレーニングデータでの出現数を数えることによって、ある表現がコーパスに出現する確率を定量化することを目的とした単純なアプローチです。
+
+
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+⟶パープレキシティ - 言語モデルは一般的に、PPとも呼ばれるパープレキシティメトリックを使用して評価されます。これは、単語数Tにより正規化されたデータセットの逆確率と解釈できます。パープレキシティは低いほど良く、次のように定義されます:
+(訳注:パープレキシティの数値はより低いものがより選択しやすい単語として評価されます。10であれば10個の中から1つ、10000であれば10000個の中から1つ選択されます。)
+
+
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+⟶注:PPはt-SNEで一般的に使用されています。
+
+
+
+
+**70. Machine translation**
+
+⟶機械翻訳
+
+
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+⟶概要 - 機械翻訳モデルは、エンコーダーネットワークのロジックが最初に付加されている以外は、言語モデルと似ています。このため、条件付き言語モデルと呼ばれることもあります。目的は次のような文yを見つけることです:
+
+
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+⟶ビーム検索 - 入力xが与えられたとき最も可能性の高い文yを見つけるために、機械翻訳と音声認識で使用されるヒューリスティック探索アルゴリズムです。
+
+
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+⟶[ステップ1:上位B個の高い確率を持つ単語y<1>を見つける、ステップ2:条件付き確率y<k>|x,y<1>,...,y<k−1>を計算する、ステップ3:上位B個の組み合わせx,y<1>,...,y<k>を保持する、ストップワードでプロセスを終了する]
+
+
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+⟶注意:ビーム幅が1に設定されている場合、これは単純な貪欲法と同等です。
+
+
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+⟶ビーム幅 - ビーム幅Bはビーム検索のパラメータです。 Bの値を大きくするとより良い結果が得られますが、探索パフォーマンスは低下し、メモリ使用量が増加します。 Bの値が小さいと結果が悪くなりますが、計算量は少なくなります。 Bの標準値は10前後です。
+
+
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+⟶文章の長さの正規化 - 数値の安定性を向上させるために、ビーム検索は通常、正規化(対数尤度正規化)された目的関数に対して適用され、次のように定義されます:
+
+
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+⟶注:パラメータαは緩衝パラメータと見なされ、その値は通常、0.5から1の間です。
+
+
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+⟶エラー分析 - 予測されたˆyの翻訳が良くない場合、以下のようなエラー分析を実行することで、なぜy∗のような良い翻訳を得られなかったのか考えることが可能です:
+
+
+
+
+**79. [Case, Root cause, Remedies]**
+
+⟶[症例、根本原因、改善策]
+
+
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+⟶[ビーム検索の誤り、RNNの誤り、ビーム幅の拡大、さまざまなアーキテクチャを試す、正則化、データをさらに取得]
+
+
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+⟶Bleuスコア - Bleu(Bilingual evaluation understudy)スコアは、n-gramの精度に基づき類似性スコアを計算することで、機械翻訳がどれほど優れているかを定量化します。以下のように定義されています:
+
+
+
+
+**82. where pn is the bleu score on n-gram only defined as follows:**
+
+⟶ここで、pnはn-gramのみに基づくbleuスコアで、以下のように定義されます:
+
+
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+⟶注:bleuスコアが人為的に水増しされるのを防ぐために、短い予測翻訳には簡潔さに関するペナルティ(brevity penalty)が適用される場合があります。
+
+
+
+
+**84. Attention**
+
+⟶アテンション
+
+
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that are considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**
+
+⟶アテンションモデル - このモデルを使用するとRNNは重要であると考えられる入力の特定部分に注目することができ、得られるモデルの性能が実際に向上します。出力yが活性化aに払うべき注意量をα、時刻tにおけるコンテキストをcと表記すると、次のようになります:
+
+
+
+
+**86. with**
+
+⟶および
+
+
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+⟶注:アテンションスコアは、一般的に画像のキャプション作成および機械翻訳で使用されています。
+
+
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+⟶かわいいテディベアがペルシャ文学を読んでいます。
+
+
+
+
+**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:**
+
+⟶アテンションの重み - 出力yが活性化aに払うべき注意量αは次のように計算されます。
+
+
+
+
+**90. Remark: computation complexity is quadratic with respect to Tx.**
+
+⟶注:この計算の複雑さはTxに関して2次です。
+
+
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ディープラーニングのチートシートが[日本語]で利用可能になりました。
+
+
+
+**92. Original authors**
+
+⟶原著者
+
+
+
+**93. Translated by X, Y and Z**
+
+⟶X・Y・Z 訳
+
+
+
+**94. Reviewed by X, Y and Z**
+
+⟶X・Y・Z 校正
+
+
+
+**95. View PDF version on GitHub**
+
+⟶GitHubでPDF版を見る
+
+
+
+**96. By X and Y**
+
+⟶X・Y 著
+
+
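Note on entries 61 and 68 (cosine similarity and perplexity): both quantities are short enough to sketch in plain Python. This is a minimal illustration, not the cheatsheet's own code. The embedding values, the variable names, and the choice to compute perplexity in log space are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def perplexity(probs):
    """Inverse probability of the data, normalized by the number of words T.

    `probs` holds the probability the model assigned to each correct word.
    The product is evaluated in log space to avoid underflow on long inputs.
    """
    T = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / T)


# Made-up three-dimensional embeddings, for illustration only.
e_bear = [1.0, 2.0, 0.0]
e_soft = [2.0, 4.0, 0.0]   # parallel to e_bear
e_book = [0.0, 0.0, 3.0]   # orthogonal to e_bear

sim_parallel = cosine_similarity(e_bear, e_soft)
sim_orthogonal = cosine_similarity(e_bear, e_book)

# A model that spreads its mass uniformly over 10 candidate words.
pp_uniform = perplexity([0.1] * 5)

print(sim_parallel, sim_orthogonal, pp_uniform)
```

Parallel vectors give a similarity close to 1 and orthogonal vectors give 0. A model that is uniformly unsure among 10 words has a perplexity of about 10, which is one way to read "the lower, the better".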
diff --git a/ar/refresher-linear-algebra.md b/ko/cs-229-linear-algebra.md
similarity index 50%
rename from ar/refresher-linear-algebra.md
rename to ko/cs-229-linear-algebra.md
index a6b440d1e..6414d6db4 100644
--- a/ar/refresher-linear-algebra.md
+++ b/ko/cs-229-linear-algebra.md
@@ -1,339 +1,340 @@
**1. Linear Algebra and Calculus refresher**
-⟶
+⟶ 선형대수와 미적분학 복습
**2. General notations**
-⟶
+⟶ 일반적인 표기법
**3. Definitions**
-⟶
+⟶ 정의
**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
-⟶
+⟶ 벡터 - x∈Rn는 n개의 요소를 가진 벡터이고, xi∈R는 i번째 요소이다.
**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
-⟶
+⟶ 행렬 - A∈Rm×n는 m개의 행과 n개의 열을 가진 행렬이고, Ai,j∈R는 i번째 행, j번째 열에 있는 원소이다.
**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
-⟶
+⟶ 비고 : 위에서 정의된 벡터 x는 n×1행렬로 볼 수 있으며, 열벡터라고도 불린다.
**7. Main matrices**
-⟶
+⟶ 주요 행렬
**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
-⟶
+⟶ 단위행렬 - 단위행렬 I∈Rn×n는 대각성분이 모두 1이고 대각성분이 아닌 성분은 모두 0인 정사각행렬이다.
**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
-⟶
+⟶ 비고 : 모든 행렬 A∈Rn×n에 대하여, A×I=I×A=A를 만족한다.
**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
-⟶
+⟶ 대각행렬 - 대각행렬 D∈Rn×n는 대각성분은 모두 0이 아니고, 대각성분이 아닌 성분은 모두 0인 정사각행렬이다.
**11. Remark: we also note D as diag(d1,...,dn).**
-⟶
+⟶ 비고 : D를 diag(d1,...,dn)라고도 표시한다.
**12. Matrix operations**
-⟶
+⟶ 행렬 연산
**13. Multiplication**
-⟶
+⟶ 곱셈
**14. Vector-vector ― There are two types of vector-vector products:**
-⟶
+⟶ 벡터-벡터 – 벡터 간 연산에는 두 가지 종류가 있다.
**15. inner product: for x,y∈Rn, we have:**
-⟶
+⟶ 내적 : x,y∈Rn에 대하여,
**16. outer product: for x∈Rm,y∈Rn, we have:**
-⟶
+⟶ 외적 : x∈Rm,y∈Rn에 대하여,
**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
-⟶
+⟶ 행렬-벡터 - 행렬 A∈Rm×n와 벡터 x∈Rn의 곱은 다음을 만족하는 Rm크기의 벡터이다.
**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
-⟶
+⟶ aTr,i는 A의 벡터행, ac,j는 A의 벡터열, xi는 x의 성분이다.
**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
-⟶
+⟶ 행렬-행렬 - 행렬 A∈Rm×n와 행렬 B∈Rn×p의 곱은 다음을 만족하는 Rm×p크기의 행렬이다.
**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
-⟶
+⟶ aTr,i,bTr,i는 A,B의 벡터행, ac,j,bc,j는 A,B의 벡터열이다.
**21. Other operations**
-⟶
+⟶ 그 외 연산
**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
-⟶
+⟶ 전치 - 행렬 A∈Rm×n의 전치 AT는 모든 성분을 뒤집은 것이다.
**23. Remark: for matrices A,B, we have (AB)T=BTAT**
-⟶
+⟶ 비고 : 행렬 A,B에 대하여, (AB)T=BTAT가 성립한다.
**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
-⟶
+⟶ 역행렬 - 가역 정사각행렬 A의 역행렬은 A−1로 표기하며, 다음을 만족하는 유일한 행렬이다.
**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
-⟶
+⟶ 비고 : 모든 정사각행렬이 역행렬을 갖는 것은 아니다. 또한 행렬 A,B에 대하여 (AB)−1=B−1A−1가 성립한다.
**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
-⟶
+⟶ 대각합 – 정사각행렬 A의 대각합 tr(A)는 대각성분의 합이다.
**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
-⟶
+⟶ 비고 : 행렬 A,B에 대하여, tr(AT)=tr(A)와 tr(AB)=tr(BA)가 성립한다.
**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
-⟶
+⟶ 행렬식 - 정사각행렬 A∈Rn×n의 행렬식 |A| 또는 det(A)는 i번째 행과 j번째 열이 없는 행렬 A인 A∖i,∖j에 대해 재귀적으로 표현된다.
**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
-⟶
+⟶ 비고 : A가 가역일 필요충분조건은 |A|≠0이다. 또한 |AB|=|A||B|와 |AT|=|A|가 성립한다.
**30. Matrix properties**
-⟶
+⟶ 행렬의 성질
**31. Definitions**
-⟶
+⟶ 정의
**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
-⟶
+⟶ 대칭 분해 - 주어진 행렬 A는 다음과 같이 대칭 부분과 반대칭 부분으로 표현될 수 있다.
**33. [Symmetric, Antisymmetric]**
-⟶
+⟶ [대칭, 반대칭]
-**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+**34. Norm ― A norm is a function N:V⟶[0,+∞] where V is a vector space, and such that for all x,y∈V, we have:**
-⟶
+⟶ 노름 – V는 벡터공간일 때, 노름은 모든 x,y∈V에 대해 다음을 만족하는 함수 N:V⟶[0,+∞]이다.
**35. N(ax)=|a|N(x) for a scalar**
-⟶
+⟶ scalar a에 대해서 N(ax)=|a|N(x)를 만족한다.
**36. if N(x)=0, then x=0**
-⟶
+⟶ N(x)=0이면 x=0이다.
**37. For x∈V, the most commonly used norms are summed up in the table below:**
-⟶
+⟶ x∈V에 대해, 가장 일반적으로 사용되는 노름이 아래 표에 요약되어 있다.
**38. [Norm, Notation, Definition, Use case]**
-⟶
+⟶ [노름, 표기법, 정의, 사용 사례]
**39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
-⟶
+⟶ 일차 종속 - 집합 내의 벡터 중 하나가 다른 벡터들의 선형결합으로 정의될 수 있으면, 그 벡터 집합은 일차 종속이라고 한다.
**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
-⟶
+⟶ 비고 : 어느 벡터도 이런 방식으로 표현될 수 없다면, 그 벡터들은 일차 독립이라고 한다.
**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
-⟶
+⟶ 행렬 랭크 - 주어진 행렬 A의 랭크는 열에 의해 생성된 벡터공간의 차원이고, rank(A)라고 쓴다. 이는 A의 선형독립인 열의 최대 수와 동일하다.
**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
-⟶
+⟶ 양의 준정부호 행렬 – 행렬 A∈Rn×n는 다음을 만족하면 양의 준정부호(PSD)라고 하고 A⪰0라고 쓴다.
**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
-⟶
+⟶ 비고 : 마찬가지로 PSD 행렬이 모든 0이 아닌 벡터 x에 대하여 xTAx>0를 만족하면 행렬 A를 양의 정부호라고 말하고 A≻0라고 쓴다.
**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-⟶
+⟶ 고유값, 고유벡터 - 주어진 행렬 A∈Rn×n에 대하여, 다음을 만족하는 벡터 z∈Rn∖{0}가 존재하면, z를 고유벡터라고 부르고, λ를 A의 고유값이라고 부른다.
**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-⟶
+⟶ 스펙트럼 정리 – A∈Rn×n라고 하자. A가 대칭이면, A는 실수 직교행렬 U∈Rn×n에 의해 대각화 가능하다. Λ=diag(λ1,...,λn)라고 표기하면, 다음을 만족한다.
**46. diagonal**
-⟶
+⟶ 대각
**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
-⟶
+⟶ 특이값 분해 – 주어진 m×n차원 행렬 A에 대하여, 특이값 분해(SVD)는 다음과 같이 U m×m 유니터리와 Σ m×n 대각 및 V n×n 유니터리 행렬의 존재를 보증하는 인수분해 기술이다.
**48. Matrix calculus**
-⟶
+⟶ 행렬 미적분
**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
-⟶
+⟶ 그라디언트 – f:Rm×n→R는 함수이고 A∈Rm×n는 행렬이라 하자. A에 대한 f의 그라디언트 ∇Af(A)는 다음을 만족하는 m×n 행렬이다.
**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
-⟶
+⟶ 비고 : f의 그라디언트는 f가 스칼라를 반환하는 함수일 때만 정의된다.
**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
-⟶
+⟶ 헤시안 – f:Rn→R는 함수이고 x∈Rn는 벡터라고 하자. x에 대한 f의 헤시안 ∇2xf(x)는 다음을 만족하는 n×n 대칭행렬이다.
-**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
-⟶
+⟶ 비고 : f의 헤시안은 f가 스칼라를 반환하는 함수일 때만 정의된다.
**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
-⟶
+⟶ 그라디언트 연산 – 행렬 A,B,C에 대하여, 다음 그라디언트 성질을 염두에 두는 것이 좋다.
**54. [General notations, Definitions, Main matrices]**
-⟶
+⟶ [일반적인 표기법, 정의, 주요 행렬]
**55. [Matrix operations, Multiplication, Other operations]**
-⟶
+⟶ [행렬 연산, 곱셈, 다른 연산]
**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
-⟶
+⟶ [행렬 성질, 노름, 고유값/고유벡터, 특이값 분해]
**57. [Matrix calculus, Gradient, Hessian, Operations]**
-⟶
+⟶ [행렬 미적분, 그라디언트, 헤시안, 연산]
+
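Note on entries 17 to 27 (matrix products, transpose, trace): the shape rules and the identities (AB)T=BTAT and tr(AB)=tr(BA) can be checked numerically on a small example. The sketch below uses plain Python lists. The helper names and the sample matrices are assumptions chosen for illustration.

```python
def matmul(A, B):
    """Multiply an m x n matrix by an n x p matrix; the result is m x p."""
    n, p = len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must match"
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(p)]
            for row in A]


def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]


def trace(A):
    """Sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]             # 3 x 2
x = [[1], [0], [2]]      # column vector in R^3, stored as a 3 x 1 matrix

Ax = matmul(A, x)        # 2 x 1: one entry per row of A
AB = matmul(A, B)        # 2 x 2
BA = matmul(B, A)        # 3 x 3

print(Ax, AB, trace(AB), trace(BA))
```

Here A has m=2 rows, so Ax has 2 entries and AB has 2 rows. The traces of AB and BA agree even though the two products have different sizes.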
diff --git a/ko/cs-229-machine-learning-tips-and-tricks.md b/ko/cs-229-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..d6732e145
--- /dev/null
+++ b/ko/cs-229-machine-learning-tips-and-tricks.md
@@ -0,0 +1,285 @@
+**1. Machine Learning tips and tricks cheatsheet**
+
+⟶머신러닝 팁과 트릭 치트시트
+
+
+
+**2. Classification metrics**
+
+⟶분류 측정 항목
+
+
+
+**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+⟶이진 분류 상황에서 모델의 성능을 평가하기 위해 눈 여겨 봐야하는 주요 측정 항목이 여기에 있습니다.
+
+
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+⟶혼동 행렬 ― 혼동 행렬은 모델의 성능을 평가할 때, 보다 큰 그림을 보기위해 사용됩니다. 이는 다음과 같이 정의됩니다.
+
+
+
+**5. [Predicted class, Actual class]**
+
+⟶[예측된 클래스, 실제 클래스]
+
+
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+⟶주요 측정 항목들 ― 다음 측정 항목들은 주로 분류 모델의 성능을 평가할 때 사용됩니다.
+
+
+
+**7. [Metric, Formula, Interpretation]**
+
+⟶[측정 항목, 공식, 해석]
+
+
+
+**8. Overall performance of model**
+
+⟶전반적인 모델의 성능
+
+
+
+**9. How accurate the positive predictions are**
+
+⟶예측된 양성이 정확한 정도
+
+
+
+**10. Coverage of actual positive sample**
+
+⟶실제 양성의 예측 정도
+
+
+
+**11. Coverage of actual negative sample**
+
+⟶실제 음성의 예측 정도
+
+
+
+**12. Hybrid metric useful for unbalanced classes**
+
+⟶불균형 클래스에 유용한 하이브리드 측정 항목
+
+
+
+**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
+
+⟶ROC(Receiver Operating Curve) ― ROC 곡선은 임계값의 변화에 따른 TPR 대 FPR의 플롯입니다. 이 측정 항목은 아래 표에 요약되어 있습니다:
+
+
+
+**14. [Metric, Formula, Equivalent]**
+
+⟶[측정 항목, 공식, 같은 측도]
+
+
+
+**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+⟶AUC(Area Under the receiving operating Curve) ― AUC 또는 AUROC라고도 하는 이 측정 항목은 다음 그림과 같이 ROC 곡선 아래의 영역입니다:
+
+
+
+**16. [Actual, Predicted]**
+
+⟶[실제값, 예측된 값]
+
+
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+⟶기본 측정 항목 ― 회귀 모델 f가 주어졌을때, 다음의 측정 항목들은 모델의 성능을 평가할 때 주로 사용됩니다:
+
+
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+⟶[총 제곱합, 설명된 제곱합, 잔차 제곱합]
+
+
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+⟶결정 계수 ― 종종 R2 또는 r2로 표시되는 결정 계수는 관측된 결과가 모델에 의해 얼마나 잘 재현되는지를 측정하는 측도로서 다음과 같이 정의됩니다:
+
+
+
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+⟶주요 측정 항목들 ― 다음 측정 항목들은 주로 변수의 수를 고려하여 회귀 모델의 성능을 평가할 때 사용됩니다:
+
+
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+⟶여기서 L은 가능도이고 ^σ2는 각각의 반응과 관련된 분산의 추정값입니다.
+
+
+
+**22. Model selection**
+
+⟶모델 선택
+
+
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶어휘 ― 모델을 선택할 때 우리는 다음과 같이 가지고 있는 데이터를 세 부분으로 구분합니다:
+
+
+
+**24. [Training set, Validation set, Testing set]**
+
+⟶[학습 세트, 검증 세트, 테스트 세트]
+
+
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+⟶[모델 훈련, 모델 평가, 모델 예측]
+
+
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+⟶[주로 데이터 세트의 80%, 주로 데이터 세트의 20%]
+
+
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+⟶[홀드아웃 또는 개발 세트라고도하는, 보지 않은 데이터]
+
+
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶모델이 선택되면 전체 데이터 세트에 대해 학습을 하고 보지 않은 데이터에서 테스트합니다. 이는 아래 그림에 나타나있습니다.
+
+
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+⟶교차-검증 ― CV라고도하는 교차-검증은 초기의 학습 세트에 지나치게 의존하지 않는 모델을 선택하는데 사용되는 방법입니다. 다양한 유형이 아래 표에 요약되어 있습니다:
+
+
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+⟶[k-1 폴드에 대한 학습과 나머지 1폴드에 대한 평가, n-p개 관측치에 대한 학습과 나머지 p개 관측치에 대한 평가]
+
+
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+⟶[일반적으로 k=5 또는 10, p=1인 케이스는 leave-one-out]
+
+
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+⟶가장 일반적으로 사용되는 방법은 k-폴드 교차-검증이라고하며 이는 학습 데이터를 k개의 폴드로 분할하고, 그 중 k-1개의 폴드로 모델을 학습하는 동시에 나머지 1개의 폴드로 모델을 검증합니다. 이 작업을 k번 수행합니다. 오류는 k 폴드에 대해 평균화되고 교차-검증 오류라고 부릅니다.
+
+
+
+**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶정규화 ― 정규화 절차는 데이터에 대한 모델의 과적합을 피하고 분산이 커지는 문제를 처리하는 것을 목표로 합니다. 다음의 표는 일반적으로 사용되는 정규화 기법의 여러 유형을 요약한 것입니다:
+
+
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶[계수를 0으로 축소, 변수 선택에 좋음, 계수를 작게 함, 변수 선택과 작은 계수 간의 트래이드오프]
+
+
+
+**35. Diagnostics**
+
+⟶진단
+
+
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+⟶편향 ― 모델의 편향은 기대되는 예측과 주어진 데이터 포인트에 대해 예측하려고하는 올바른 모델 간의 차이입니다.
+
+
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+⟶분산 ― 모델의 분산은 주어진 데이터 포인트에 대한 모델 예측의 가변성입니다.
+
+
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+⟶편향/분산 트래이드오프 ― 모델이 간단할수록 편향이 높아지고 모델이 복잡할수록 분산이 커집니다.
+
+
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+⟶[증상, 회귀 일러스트레이션, 분류 일러스트레이션, 딥러닝 일러스트레이션, 가능한 처리방법]
+
+
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+⟶[높은 학습 오류, 테스트 오류에 가까운 학습 오류, 높은 편향, 테스트 에러 보다 약간 낮은 학습 오류, 매우 낮은 학습 오류, 테스트 오류보다 훨씬 낮은 학습 오류, 높은 분산]
+
+
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+⟶[모델 복잡화, 특징 추가, 학습 증대, 정규화 수행, 추가 데이터 수집]
+
+
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+⟶오류 분석 ― 오류 분석은 현재 모델과 완벽한 모델 간의 성능 차이의 근본 원인을 분석합니다.
+
+
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+⟶애블러티브 분석 ― 애블러티브 분석은 현재 모델과 베이스라인 모델 간의 성능 차이의 근본 원인을 분석합니다.
+
+
+
+**44. Regression metrics**
+
+⟶회귀 측정 항목
+
+
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+⟶[분류 측정 항목, 혼동 행렬, 정확도, 정밀도, 리콜, F1 스코어, ROC]
+
+
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+⟶[회귀 측정 항목, R 스퀘어, 맬로우의 CP, AIC, BIC]
+
+
+
+**47. [Model selection, cross-validation, regularization]**
+
+⟶[모델 선택, 교차-검증, 정규화]
+
+
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+⟶[진단, 편향/분산 트래이드오프, 오류/애블러티브 분석]
diff --git a/ko/cs-229-probability.md b/ko/cs-229-probability.md
new file mode 100644
index 000000000..53ec90c53
--- /dev/null
+++ b/ko/cs-229-probability.md
@@ -0,0 +1,381 @@
+
+**1. Probabilities and Statistics refresher**
+
+⟶확률과 통계
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶확률과 조합론 소개
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶표본 공간 ― 시행의 가능한 모든 결과 집합은 시행의 표본 공간으로 알려져 있으며 S로 표기합니다.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶사건 ― 표본 공간의 모든 부분 집합 E를 사건이라고 합니다. 즉, 사건은 시행 가능한 결과로 구성된 집합입니다. 시행 결과가 E에 포함된다면, E가 발생했다고 이야기합니다.
+
+
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶확률의 공리 ― 각 사건 E에 대하여, 우리는 사건 E가 발생할 확률을 P(E)로 나타냅니다.
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+⟶공리 1 ― 모든 확률은 0과 1사이에 포함됩니다, 즉:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+⟶공리 2 ― 전체 표본 공간에서 적어도 하나의 근원 사건이 발생할 확률은 1입니다. 즉:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶공리 3 ― 서로 배반인 어떤 연속적인 사건 E1,...,En 에 대하여, 우리는 다음을 가집니다:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶순열(Permutation) ― 순열은 n개의 객체들로부터 r개의 객체들의 순서를 고려한 배열입니다. 그러한 배열의 수는 P (n, r)에 의해 주어지며, 다음과 같이 정의됩니다:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶조합(Combination) ― 조합은 n개의 객체들로부터 r개의 객체들의 순서를 고려하지 않은 배열입니다. 그러한 배열의 수는 다음과 같이 정의되는 C(n, r)에 의해 주어집니다:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶비고 : 0⩽r⩽n에 대해, 우리는 P(n,r)⩾C(n,r)를 가집니다.
+
+
+
+**12. Conditional Probability**
+
+⟶조건부 확률
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶베이즈 규칙 ― P(B)>0인 사건 A, B에 대해, 우리는 다음을 가집니다:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶비고 :우리는 P(A∩B)=P(A)P(B|A)=P(A|B)P(B)를 가집니다.
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶파티션(Partition)― {Ai, i∈ [[1, n]]}은 모든 i에 대해 Ai ≠ ∅이라고 해봅시다. 우리는 {Ai}가 다음과 같은 경우 파티션이라고 말합니다.
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+⟶비고 : 표본 공간에서 어떤 사건 B에 대해서 우리는 P(B) = nΣi = 1P (B | Ai) P (Ai)를 가집니다.
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶베이즈 규칙의 확장된 형태 ― {Ai,i∈[[1,n]]}를 표본 공간의 파티션이라고 합시다. 우리는 다음을 가집니다.:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶독립성 ― 다음의 경우에만 두 사건 A, B가 독립적입니다:
+
+
+
+**19. Random Variables**
+
+⟶확률 변수
+
+
+
+**20. Definitions**
+
+⟶정의
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+⟶확률 변수 ― 주로 X라고 표기된 확률 변수는 표본 공간의 모든 요소를 실선에 대응시키는 함수입니다.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶누적 분포 함수 (CDF) ― 단조 비감소 함수이며 limx → -∞F (x) = 0 이고, limx → + ∞F (x) = 1 인 누적 분포 함수 F는 다음과 같이 정의됩니다:
+
+
+
+**23. Remark: we have P(a
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶확률 밀도 함수 (PDF) ― 확률 밀도 함수 f는 인접한 두 확률 변수의 사이에 X가 포함될 확률입니다.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶PDF와 CDF의 관계 ― 이산 (D)과 연속 (C) 예시에서 알아야 할 중요한 특성이 있습니다.
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶[예시, CDF F, PDF f, PDF의 특성]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶분포의 기대값과 적률 ― 이산 혹은 연속일 때, 기대값 E[X], 일반화된 기대값 E[g(X)], k번째 적률 E[Xk] 및 특성 함수 ψ(ω) :
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶분산 (Variance) ― 주로 Var(X) 또는 σ2이라고 표기된 확률 변수의 분산은 분포 함수의 산포(Spread)를 측정한 값입니다. 이는 다음과 같이 결정됩니다:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶표준 편차(Standard Deviation) ― 표준 편차는 실제 확률 변수의 단위를 사용할 수 있는 분포 함수의 산포(Spread)를 측정하는 측도입니다. 이는 다음과 같이 결정됩니다:
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶확률 변수의 변환 ― 변수 X와 Y를 어떤 함수로 연결되도록 해봅시다. fX와 fY에 각각 X와 Y의 분포 함수를 표기하면 다음과 같습니다:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶라이프니츠 적분 규칙 ― g를 x의 함수로, 잠재적으로 c라고 해봅시다. 그리고 c에 종속적인 경계 a, b에 대해 우리는 다음을 가집니다:
+
+
+
+**32. Probability Distributions**
+
+⟶확률 분포
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶체비쇼프 부등식 ― X를 기대값 μ의 확률 변수라고 해봅시다. k,σ>0에 대하여 다음과 같은 부등식을 가집니다:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶주요 분포들― 기억해야 할 주요 분포들이 여기 있습니다:
+
+
+
+**35. [Type, Distribution]**
+
+⟶[타입(Type), 분포]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶결합 분포 확률 변수
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+⟶주변 밀도와 누적 분포 ― 결합 밀도 확률 함수 fXY로부터 우리는 다음을 가집니다
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶[예시, 주변 밀도, 누적 함수]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶조건부 밀도 ― 주로 fX|Y로 표기되는 Y에 대한 X의 조건부 밀도는 다음과 같이 정의됩니다:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶독립성 ― 두 확률 변수 X와 Y는 다음과 같은 경우에 독립적이라고 합니다:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶공분산 ― 다음과 같이 두 확률 변수 X와 Y의 공분산을 σ2XY 혹은 더 일반적으로는 Cov(X,Y)로 정의합니다:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶상관관계 ― σX, σY로 X와 Y의 표준 편차를 표기함으로써 ρXY로 표기된 임의의 변수 X와 Y 사이의 상관관계를 다음과 같이 정의합니다:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶비고 1 : 우리는 임의의 확률 변수 X, Y에 대해 ρXY∈ [-1,1]를 가진다고 말합니다.
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶비고 2 : X와 Y가 독립이라면 ρXY=0입니다.
+
+
+
+**45. Parameter estimation**
+
+⟶모수 추정
+
+
+
+**46. Definitions**
+
+⟶정의
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶확률 표본 ― 확률 표본은 X와 독립적으로 동일하게 분포하는 n개의 확률 변수 X1, ..., Xn의 모음입니다.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶추정량 ― 추정량은 통계 모델에서 알 수 없는 모수의 값을 추론하는 데 사용되는 데이터의 함수입니다.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶편향 ― 추정량 ^θ의 편향은 ^θ 분포의 기대값과 실제값 사이의 차이로 정의됩니다. 즉,:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶비고 : 추정량은 E [^ θ]=θ 일 때, 비 편향적이라고 말합니다.
+
+
+
+**51. Estimating the mean**
+
+⟶평균 추정
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+⟶표본 평균 ― 랜덤 표본의 표본 평균은 분포의 실제 평균 μ를 추정하는 데 사용되며 종종 ¯¯¯¯¯X로 표기되고 다음과 같이 정의됩니다:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
+
+⟶비고 : 표본 평균은 비 편향적입니다, 즉 E[¯¯¯¯¯X]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶중심 극한 정리 ― 평균 μ와 분산 σ2를 갖는 주어진 분포를 따르는 랜덤 표본 X1, ..., Xn을 가정해 봅시다 그러면 우리는 다음을 가집니다:
+
+
+
+**55. Estimating the variance**
+
+⟶분산 추정
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶표본 분산 ― 랜덤 표본의 표본 분산은 분포의 실제 분산 σ2를 추정하는 데 사용되며 종종 s2 또는 ^σ2로 표기되며 다음과 같이 정의됩니다:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶비고 : 표본 분산은 비 편향적입니다, 즉 E[s2]=σ2.
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶표본 분산과 카이 제곱의 관계 ― s2를 랜덤 표본의 표본 분산이라고 합시다. 우리는 다음을 가집니다:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶[소개, 표본 공간, 사건, 순열]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶[조건부 확률, 베이즈 규칙, 독립]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶[확률 변수, 정의, 기대값, 분산]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶[확률 분포, 체비쇼프 부등식, 주요 분포]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶[결합 분포의 확률 변수, 밀도, 공분산, 상관관계]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶[모수 추정, 평균, 분산]
diff --git a/ko/cs-229-unsupervised-learning.md b/ko/cs-229-unsupervised-learning.md
new file mode 100644
index 000000000..e961a88cc
--- /dev/null
+++ b/ko/cs-229-unsupervised-learning.md
@@ -0,0 +1,340 @@
+**1. Unsupervised Learning cheatsheet**
+
+⟶ 비지도 학습 cheatsheet
+
+
+
+**2. Introduction to Unsupervised Learning**
+
+⟶ 비지도 학습 소개
+
+
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+⟶ 동기부여 - 비지도학습의 목표는 {x(1),...,x(m)}와 같이 라벨링이 되어있지 않은 데이터 내의 숨겨진 패턴을 찾는것이다.
+
+
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+⟶ 옌센 부등식 - f를 볼록함수로 하며 X는 확률변수로 두고 아래와 같은 부등식을 따르도록 하자.
+
+
+
+**5. Clustering**
+
+⟶ 군집화
+
+
+
+**6. Expectation-Maximization**
+
+⟶ 기댓값 최대화
+
+
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+⟶ 잠재변수 - 잠재변수들은 숨겨져있거나 관측되지 않는 변수들을 말하며, 이러한 변수들은 추정문제의 어려움을 가져온다. 그리고 잠재변수는 종종 z로 표기되어진다. 일반적인 잠재변수로 구성되어져있는 형태들을 살펴보자
+
+
+
+**8. [Setting, Latent variable z, Comments]**
+
+⟶ 표기형태, 잠재변수 z, 주석
+
+
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+⟶ 가우시안 혼합모델, 요인분석
+
+
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶ 알고리즘 - 기댓값 최대화 (EM) 알고리즘은 모수 θ를 추정하는 효율적인 방법을 제공해준다. 모수 θ의 추정은 아래와 같이 우도의 아래 경계지점을 구성하는(E-step)과 그 우도의 아래 경계지점을 최적화하는(M-step)들의 반복적인 최대우도측정을 통해 추정된다.
+
+
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+⟶ E-step : 각각의 데이터 포인트 x(i)은 특정 클러스터 z(i)로 부터 발생한 후 사후확률Qi(z(i))를 평가한다. 아래의 식 참조
+
+
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+⟶ M-step : 데이터 포인트 x(i)에 대한 클러스트의 특정 가중치로 사후확률 Qi(z(i))을 사용, 각 클러스트 모델을 개별적으로 재평가한다. 아래의 식 참조
+
+
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+⟶ Gaussians 초기값, 기대 단계, 최대화 단계, 수렴
+
+
+
+**14. k-means clustering**
+
+⟶ k-평균 군집화
+
+
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+⟶ c(i)는 데이터 포인트 i가 속한 군집을, μj는 군집 j의 중심을 나타낸다.
+
+
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶ 알고리즘 - 군집 중앙에 μ1,μ2,...,μk∈Rn 와 같이 무작위로 초기값을 잡은 후, k-평균 알고리즘이 수렴될때 까지 아래와 같은 단계를 반복한다.
+
+
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶ 평균 초기값, 군집분할, 평균 재조정, 수렴
+
+
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+⟶ 왜곡 함수 - 알고리즘이 수렴하는지를 확인하기 위해서는 아래와 같은 왜곡함수를 정의해야 한다.
+
+
+
+**19. Hierarchical clustering**
+
+⟶ 계층적 군집분석
+
+
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
+
+⟶ 알고리즘 - 연속적 방식으로 중첩된 클러스트를 구축하는 결합형 계층적 접근방식을 사용하는 군집 알고리즘이다.
+
+
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
+
+⟶ 종류 - 다양한 목적함수의 최적화를 목표로하는 다양한 종류의 계층적 군집분석 알고리즘들이 있으며, 아래 표와 같이 요약되어있다.
+
+
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+⟶ Ward 연결법, 평균 연결법, 완전 연결법
+
+
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
+
+⟶ 군집 거리 내에서의 최소화, 한쌍의 군집간 평균거리의 최소화, 한쌍의 군집간 최대거리의 최소화
+
+
+
+**24. Clustering assessment metrics**
+
+⟶ 군집화 평가 metrics
+
+
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+⟶ 비지도학습 환경에서는, 지도학습 환경과는 다르게 실측자료에 라벨링이 없기 때문에 종종 모델에 대한 성능평가가 어렵다.
+
+
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+⟶ 실루엣 계수 - a와 b를 같은 클래스의 다른 모든점과 샘플 사이의 평균거리와 다음 가장 가까운 군집의 다른 모든 점과 샘플사이의 평균거리로 표기하면 단일 샘플에 대한 실루엣 계수 s는 다음과 같이 정의할 수 있다.
+
+
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+⟶ Calinski-Harabaz 색인 - 군집의 수를 k로 표기하고, 군집 간 분산 행렬 Bk와 군집 내 분산 행렬 Wk를 각각 다음과 같이 정의하면
+
+
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+⟶ Calinski-Harabaz 색인 s(k)는 군집모델이 군집화를 얼마나 잘 정의하는지를 나타낸다. 가령 높은 점수일수록 군집이 더욱 밀도있으며 잘 분리되는 형태이다. 아래와 같은 정의를 따른다.
+
+
+
+**29. Dimension reduction**
+
+⟶ 차원 축소
+
+
+
+**30. Principal component analysis**
+
+⟶ 주성분 분석
+
+
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+⟶ 데이터를 투영할 분산 최대화 방향을 찾는 차원 축소 기술이다.
+
+
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ 고유값, 고유벡터 - A∈Rn×n 행렬이 주어질 때, 다음을 만족하는 고유벡터라 불리는 벡터 z∈Rn∖{0}가 존재하면 λ를 A의 고유값이라고 한다.
+
+
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ 스펙트럼 정리 - A∈Rn×n 이라고 하자 만약 A가 대칭이라면, A는 실수 직교 행렬 U∈Rn×n에 의해 대각행렬로 만들 수 있다.
+
+
+
+**34. diagonal**
+
+⟶ 대각선
+
+
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+⟶ 참조: 가장 큰 고유값과 연관된 고유 벡터를 행렬 A의 주요 고유벡터라고 부른다
+
+
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
+
+⟶ 알고리즘 - 주성분 분석(PCA) 절차는 데이터 분산을 최대화하여 k 차원의 데이터를 투영하는 차원 축소 기술로 다음과 같이 따른다.
+
+
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶ 1단계: 평균을 0으로 표준편차가 1이되도록 데이터를 표준화한다.
+
+
+
+**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
+
+⟶ 2단계: 실수 고유값을 갖는 대칭행렬 Σ=1mm∑i=1x(i)x(i)T∈Rn×n를 계산한다.
+
+
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+⟶ 3단계: Σ의 k개의 직교 주 고유벡터 u1,...,uk∈Rn, 즉 가장 큰 k개의 고유값에 대응하는 직교 고유벡터를 계산한다.
+
+
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+⟶ 4단계: spanR(u1,...,uk)에 데이터를 투영한다.
+
+
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶ 해당 절차는 모든 k-차원의 공간들 사이에 분산을 최대화 하는것이다.
+
+
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶ 변수공간의 데이터, 주요성분들 찾기, 주요성분공간의 데이터
+
+
+
+**43. Independent component analysis**
+
+⟶ 독립성분분석
+
+
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+⟶ 근원적인 생성원을 찾기위한 기술을 의미한다.
+
+
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+⟶ 가정 - 다음과 같이 우리는 데이터 x가 n차원의 소스벡터 s=(s1,...,sn)에서부터 생성되었음을 가정한다. 이때 si는 독립적인 확률변수에서 나왔으며, 혼합 및 비특이 행렬 A를 통해 생성된다고 가정한다.
+
+
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+⟶ 비혼합 행렬 W=A−1를 찾는 것을 목표로 한다.
+
+
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+⟶ Bell과 Sejnowski 독립성분분석(ICA) 알고리즘 - 다음의 단계들을 따르는 비혼합 행렬 W를 찾는 알고리즘이다.
+
+
+
+**48. Write the probability of x=As=W−1s as:**
+
+⟶ x=As=W−1s의 확률을 다음과 같이 기술한다.
+
+
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+⟶ 주어진 학습데이터 {x(i),i∈[[1,m]]}에 로그우도를 기술하고 시그모이드 함수 g를 다음과 같이 표기한다.
+
+
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+⟶ 그러므로, 확률적 경사상승 학습 규칙은 각 학습예제 x(i)에 대해서 다음과 같이 W를 업데이트하는 것과 같다.
+
+
+
+**51. The Machine Learning cheatsheets are now available in Korean.**
+
+⟶ 머신러닝 cheatsheets는 현재 한국어로 제공된다.
+
+
+
+**52. Original authors**
+
+⟶ 원저자
+
+
+
+**53. Translated by X, Y and Z**
+
+⟶ X,Y,Z에 의해 번역되다.
+
+
+
+**54. Reviewed by X, Y and Z**
+
+⟶ X,Y,Z에 의해 검토되다.
+
+
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+⟶ 소개, 동기부여, 옌센 부등식
+
+
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+⟶ 군집화, 기댓값-최대화, k-means, 계층적 군집화, 측정지표
+
+
+
+**57. [Dimension reduction, PCA, ICA]**
+
+⟶ 차원축소, 주성분분석(PCA), 독립성분분석(ICA)
diff --git a/pt/cheatsheet-deep-learning.md b/pt/cs-229-deep-learning.md
similarity index 81%
rename from pt/cheatsheet-deep-learning.md
rename to pt/cs-229-deep-learning.md
index 2e3e63879..6d7c083f4 100644
--- a/pt/cheatsheet-deep-learning.md
+++ b/pt/cs-229-deep-learning.md
@@ -24,25 +24,25 @@
**5. [Input layer, hidden layer, output layer]**
-⟶ [Camada de entrada, camada escondida, camada de saída]
+⟶ [Camada de entrada, camada oculta, camada de saída]
**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-⟶ Dado que i é a i-ésima camada da rede e j a j-ésima unidade escondida da camada, nós temos:
+⟶ Dado que i é a i-ésima camada da rede e j a j-ésima unidade oculta da camada, nós temos:
**7. where we note w, b, z the weight, bias and output respectively.**
-⟶ onde é definido que w, b, z, o peso, o viés e a saída respectivamente.
+⟶ onde é definido que w, b, z, representam o peso, o viés e a saída, respectivamente.
**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-⟶ Função de ativação - Funções de ativação são usadas no fim de uma unidade escondida para introduzir complexidades não lineares ao modelo. Aqui estão as mais comuns:
+⟶ Função de ativação - Funções de ativação são usadas no fim de uma unidade oculta para introduzir complexidades não lineares ao modelo. Aqui estão as mais comuns:
@@ -108,7 +108,7 @@
**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
-⟶ Abandono (Dropout) - Abandono (Dropout) é uma técnica que pretende prevenir o sobreajuste dos dados de treinamente abandonando unidades na rede neural. Na prática, neurônios são ou abandonados com a propabilidade p ou mantidos com a propabilidade 1-p
+⟶ Abandono (Dropout) - Abandono (Dropout) é uma técnica que pretende evitar o sobreajuste (overfitting) dos dados de treinamento abandonando unidades na rede neural. Na prática, os neurônios são ou abandonados com a probabilidade p, ou mantidos com a probabilidade 1-p
@@ -120,7 +120,7 @@
**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-⟶ Requisito de camada convolucional - Dado que W é o tamanho do volume de entrada, F o tamanho dos neurônios da camada convolucional, P a quantidade de preenchimento de zeros, então o número de neurônios N que cabem em um dado volume é tal que:
+⟶ Requisito da camada convolucional - Dado que W é o tamanho do volume de entrada, F o tamanho dos neurônios da camada convolucional, P a quantidade de preenchimento de zeros, então o número de neurônios N que cabem em um dado volume é tal que:
@@ -132,7 +132,7 @@
**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-⟶ Isso é usualmente feito após de uma totalmente conectada/camada concolucional e antes de uma camada não linear e objetiva permitir maiores taxas de apredizado e reduzir a forte dependência na inicialização.
+⟶ Isso é geralmente feito após uma camada totalmente conectada/convolucional e antes de uma camada não-linear, e objetiva permitir maiores taxas de aprendizado e reduzir a forte dependência na inicialização.
@@ -144,7 +144,7 @@
**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-⟶ Tipos de portas (gates) - Aqui estão os diferentes tipos de portas (gates) que encontramos em uma rede neural recorrente típica:
+⟶ Tipos de portas (gates) - Aqui estão os diferentes tipos de portas (gates) que encontramos em uma típica rede neural recorrente:
@@ -162,19 +162,19 @@
**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-⟶ LSTM - Uma rede de memória de longo prazo (LSTM) é um tipo de modelo de rede neural recorretne (RNN) que evita o problema do desaparecimento da gradiente adicionando portas de 'esquecimento'.
+⟶ LSTM - Uma rede de memória de longo prazo (LSTM) é um tipo de modelo de rede neural recorrente (RNN) que evita o problema do desaparecimento do gradiente adicionando portas de 'esquecimento' (forget gate).
**29. Reinforcement Learning and Control**
-⟶ Aprendizado e Controle Reforçado
+⟶ Controle e Aprendizado por Reforço
**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-⟶ O objetivo do aprendizado reforçado é fazer um agente aprender como evoluir em um ambiente.
+⟶ O objetivo do aprendizado por reforço é fazer um agente aprender como evoluir em um ambiente.
@@ -204,7 +204,7 @@
**35. {Psa} are the state transition probabilities for s∈S and a∈A**
-⟶ Psa são as probabilidade de transição de estado para s∈S e a∈A
+⟶ {Psa} são as probabilidades de transição de estado para s∈S e a∈A
@@ -222,7 +222,7 @@
**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-⟶ Diretriz - Uma diretriz π é a função π:S⟶A que mapeia os estados a ações.
+⟶ Diretriz - Uma diretriz π é a função π:S⟶A que mapeia os estados em ações.
@@ -240,13 +240,13 @@
**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
-⟶ Equação de Bellman - As equações de Bellman ótimas caracterizam a função de valor Vπ∗ para a ótima diretriz π∗:
+⟶ Equação de Bellman - As equações de Bellman ótimas descrevem a função de valor Vπ∗ a partir da diretriz ótima π∗:
**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
-⟶ Observação: definimos que a ótima diretriz π∗ para um dado estado s é tal que:
+⟶ Observe: definimos que a diretriz ótima π∗ para um dado estado s é tal que:
@@ -270,7 +270,7 @@
**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-⟶ Máxima probabilidade estimada - A máxima probabildiade estima para o estado de transição de probabilidades como se segue:
+⟶ Estimador de Máxima Verossimilhança - Os estimadores de máxima verossimilhança para as probabilidades de transição de estados são os seguintes:
@@ -288,7 +288,7 @@
**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-⟶ Aprendizado Q - Aprendizado Q é um modelo livre de estimativa de Q, o qual é feito como se segue:
+⟶ Aprendizado-Q (Q-learning) - Aprendizado-Q é uma estimativa de Q livre de modelo (model-free), feita da seguinte forma:
@@ -306,16 +306,16 @@
**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-⟶ [Redes Neurais Convolucionais, Camada convolucional, Normalização em lote]
+⟶ [Redes Neurais Convolucionais (CNN), Camada Convolucional, Normalização em lote]
**53. [Recurrent Neural Networks, Gates, LSTM]**
-⟶[Redes Nerais Recorrentes, Portas (Gates), LSTM]
+⟶[Redes Neurais Recorrentes (RNN), Portas (Gates), LSTM]
**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-⟶ [Aprendizado reforçado, Processo de decisão de Markov, Iteração de valor/diretriz, Programação dinâmica aproximada, Busca de diretriz]
+⟶ [Aprendizado por Reforço, Processo de Decisão de Markov, Iteração de valor/diretriz, Programação dinâmica aproximada, Busca de diretriz]
diff --git a/pt/refresher-linear-algebra.md b/pt/cs-229-linear-algebra.md
similarity index 100%
rename from pt/refresher-linear-algebra.md
rename to pt/cs-229-linear-algebra.md
diff --git a/pt/cheatsheet-machine-learning-tips-and-tricks.md b/pt/cs-229-machine-learning-tips-and-tricks.md
similarity index 100%
rename from pt/cheatsheet-machine-learning-tips-and-tricks.md
rename to pt/cs-229-machine-learning-tips-and-tricks.md
diff --git a/pt/refresher-probability.md b/pt/cs-229-probability.md
similarity index 100%
rename from pt/refresher-probability.md
rename to pt/cs-229-probability.md
diff --git a/pt/cheatsheet-supervised-learning.md b/pt/cs-229-supervised-learning.md
similarity index 100%
rename from pt/cheatsheet-supervised-learning.md
rename to pt/cs-229-supervised-learning.md
diff --git a/pt/cheatsheet-unsupervised-learning.md b/pt/cs-229-unsupervised-learning.md
similarity index 100%
rename from pt/cheatsheet-unsupervised-learning.md
rename to pt/cs-229-unsupervised-learning.md
diff --git a/pt/cs-230-convolutional-neural-networks.md b/pt/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..4934d7c2f
--- /dev/null
+++ b/pt/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,718 @@
+**Convolutional Neural Networks translation**
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶ Dicas de Redes Neurais Convolucionais
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Aprendizagem profunda
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶ [Visão geral, Estrutura arquitetural]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶ [Tipos de camadas, Convolução, Pooling, Totalmente conectada]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶ [Hiperparâmetros de filtro, Dimensões, Passo, Preenchimento]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶[Ajustando hiperparâmetros, Compatibilidade de parâmetros, Complexidade de modelo, Campo receptivo]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶ [Funções de Ativação, Unidade Linear Retificada, Softmax]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶[Detecção de objetos, Tipos de modelos, Detecção, Intersecção por União, Supressão não-máxima, YOLO, R-CNN]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶ [Verificação / reconhecimento facial, Aprendizado de disparo único, Rede siamesa, Perda tripla]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶ [Transferência de estilo neural, Ativação, Matriz de estilo, Função de custo de estilo/conteúdo]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶ [Arquiteturas de truques computacionais, Rede Adversarial Generativa, ResNet, Rede de Iniciação]
+
+
+
+
+**12. Overview**
+
+⟶ Visão geral
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶ Arquitetura de uma RNC tradicional (CNN) - Redes neurais convolucionais, também conhecidas como CNN (em inglês), são tipos específicos de redes neurais que geralmente são compostas pelas seguintes camadas:
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶ A camada convolucional e a camada de pooling podem ter um ajuste fino considerando os hiperparâmetros que estão descritos nas próximas seções.
+
+
+
+
+**15. Types of layer**
+
+⟶ Tipos de camadas
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶ Camada convolucional (CONV) - A camada convolucional (CONV) usa filtros que realizam operações de convolução conforme eles escaneiam a entrada I com relação a suas dimensões. Seus hiperparâmetros incluem o tamanho do filtro F e o passo S. O resultado O é chamado de mapa de recursos (feature map) ou mapa de ativação.
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶ Observação: o passo de convolução também pode ser generalizado para os casos 1D e 3D.
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶ Pooling (POOL) - A camada de pooling (POOL) é uma operação de subamostragem (downsampling), tipicamente aplicada depois de uma camada convolucional, que introduz alguma invariância espacial. Em particular, pooling máximo e médio são casos especiais de pooling em que o valor máximo e o valor médio são obtidos, respectivamente.
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶ [Tipo, Propósito, Ilustração, Comentários]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶ [Pooling máximo, Pooling médio, Cada operação de pooling seleciona o valor máximo da exibição atual, Cada operação de pooling calcula a média dos valores da exibição atual]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶ [Preserva os recursos detectados, Mais comumente usado, Reduz a amostragem do mapa de recursos (downsampling), Usado no LeNet]
+
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶ Totalmente Conectado (FC) - A camada totalmente conectada (FC) opera em uma entrada achatada, onde cada entrada é conectada a todos os neurônios. Se estiverem presentes, as camadas FC geralmente são encontradas no final das arquiteturas de CNN e podem ser usadas para otimizar objetivos, como pontuações de classes.
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶ Hiperparâmetros de filtros
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.**
+
+⟶ A camada de convolução contém filtros para os quais é importante conhecer o significado por trás de seus hiperparâmetros.
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶ Dimensões de um filtro - Um filtro de tamanho F×F aplicado a uma entrada contendo C canais é um volume de tamanho F×F×C que executa convoluções em uma entrada de tamanho I×I×C e produz um mapa de recursos (também chamado de mapa de ativação) da saída de tamanho O×O×1.
+
+
+
+
+**26. Filter**
+
+⟶ Filtros
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶ Observação: a aplicação de K filtros de tamanho F×F resulta em um mapa de recursos de saída de tamanho O×O×K.
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶ Passo - Para uma operação convolucional ou de pooling, o passo S denota o número de pixels que a janela se move após cada operação.
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶ Zero preenchimento (Zero-padding) - Zero preenchimento denota o processo de adicionar P zeros em cada lado das fronteiras da entrada. Esse valor pode ser especificado manualmente ou automaticamente ajustado através de um dos três modos detalhados abaixo:
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶ [Modo, Valor, Ilustração, Propósito, Válido, Idêntico, Completo]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶ [Sem preenchimento, Descarta a última convolução se as dimensões não corresponderem, Preenchimento de tal forma que o mapa de recursos tenha tamanho ⌈I/S⌉, Tamanho da saída é matematicamente conveniente, Também chamado de 'meio' preenchimento, Preenchimento máximo de tal forma que convoluções finais são aplicadas nos limites da entrada, Filtro 'vê' a entrada de ponta a ponta]
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶ Ajuste de hiperparâmetros
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶ Compatibilidade de parâmetro na camada convolucional - Considerando I o comprimento do tamanho do volume da entrada, F o tamanho do filtro, P a quantidade de preenchimento de zero (zero-padding) e S o tamanho do passo, então o tamanho de saída O do mapa de recursos ao longo dessa dimensão é dado por:
+
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶ [Entrada, Filtro, Saída]
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶ Observação: frequentemente, Pstart=Pend≜P, caso em que podemos substituir Pstart+Pend por 2P na fórmula acima.
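Reviewer note on items 33 to 35: the formula image is not reproduced in this translation file. The sketch below assumes the standard form O = ⌊(I − F + Pstart + Pend) / S⌋ + 1, which is consistent with the Pstart+Pend remark above. The function and argument names are hypothetical and chosen only for illustration.

```python
def conv_output_size(i, f, s, p_start=0, p_end=0):
    """Output length O of a feature map along one dimension.

    Assumed formula (not shown in this file):
        O = floor((I - F + P_start + P_end) / S) + 1
    """
    if s <= 0:
        raise ValueError("stride S must be a positive integer")
    # Integer floor division drops the last window when it does not fit.
    return (i - f + p_start + p_end) // s + 1


# 'Same'-style padding with stride 1: P = (F - 1) / 2 on each side keeps O = I.
same_size = conv_output_size(32, 5, 1, p_start=2, p_end=2)
# No padding, stride 2.
valid_size = conv_output_size(7, 3, 2)
```

With P on both sides equal, passing `p_start=p, p_end=p` corresponds to the 2P substitution mentioned in item 35.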
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶ Entendendo a complexidade do modelo - Para avaliar a complexidade de um modelo, é geralmente útil determinar o número de parâmetros que a arquitetura deverá ter. Em uma determinada camada de uma rede neural convolucional, ela é dada da seguinte forma:
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶ [Ilustração, Tamanho da entrada, Tamanho da saída, Número de parâmetros, Observações]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶ [Operação de pooling feita canal a canal, Na maior parte dos casos, S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶ [Entrada é achatada, Um parâmetro de viés (bias parameter) por neurônio, O número de neurônios FC está livre de restrições estruturais]
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶ Campo receptivo - O campo receptivo na camada k é a área denotada por Rk×Rk da entrada que cada pixel do k-ésimo mapa de ativação pode 'ver'. Ao chamar Fj o tamanho do filtro da camada j e Si o valor do passo da camada i e com a convenção S0=1, o campo receptivo na camada k pode ser calculado com a fórmula:
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶ No exemplo abaixo, temos que F1=F2=3 e S1=S2=1, o que resulta em R2=1+2⋅1+2⋅1=5.
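Reviewer note on items 41 and 42: the formula image is not part of this file. The sketch below assumes the usual expression R_k = 1 + Σ_{j=1..k} (F_j − 1) · Π_{i=0..j−1} S_i with S_0 = 1, which reproduces the R2 = 5 example from item 42. The function name is hypothetical.

```python
def receptive_field(filter_sizes, strides):
    """Receptive field R_k after k layers.

    filter_sizes: [F_1, ..., F_k]
    strides:      [S_1, ..., S_k]  (S_0 = 1 by convention)
    """
    r = 1
    jump = 1  # running product S_0 * S_1 * ... * S_{j-1}
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r


# Item 42: F1 = F2 = 3 and S1 = S2 = 1 give R2 = 1 + 2*1 + 2*1 = 5.
r2 = receptive_field([3, 3], [1, 1])
```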
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶ Funções de ativação comumente usadas
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶ Unidade Linear Retificada (Rectified Linear Unit) - A camada unitária linear retificada (ReLU) é uma função de ativação g que é usada em todos os elementos do volume. Tem como objetivo introduzir não linearidades na rede. Suas variantes estão resumidas na tabela abaixo:
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶ [ReLU, Leaky ReLU, ELU, com]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶ [Complexidades de não-linearidade biologicamente interpretáveis, Trata o problema de 'dying ReLU' para valores negativos, Diferenciável em todos os lugares]
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶ Softmax - O passo de softmax pode ser visto como uma função logística generalizada que pega como entrada um vetor de pontuações x∈Rn e retorna um vetor de probabilidades p∈Rn através de uma função softmax no final da arquitetura. É definida como:
+
+
+
+
+**48. where**
+
+⟶ onde
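Reviewer note on items 47 and 48: the softmax formula is an image in the original and is not reproduced here. The sketch below assumes the standard definition p_i = exp(x_i) / Σ_j exp(x_j). Subtracting the maximum score first is an addition for numerical stability; it does not change the result.

```python
import math


def softmax(scores):
    """Map a vector of scores x in R^n to probabilities p in R^n."""
    m = max(scores)  # stability shift (assumption, not in the source text)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

The outputs are positive and sum to 1, and the largest score receives the largest probability.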
+
+
+
+
+**49. Object detection**
+
+⟶ Detecção de objeto
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶ Tipos de modelos - Existem 3 tipos principais de algoritmos de reconhecimento de objetos, para os quais a natureza do que é previsto é diferente. Eles estão descritos na tabela abaixo:
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶ [Classificação de imagem, Classificação com localização, Detecção]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶ [Urso de pelúcia, Livro]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶ [Classifica uma imagem, Prevê a probabilidade de um objeto, Detecta um objeto em uma imagem, Prevê a probabilidade de objeto e onde ele está localizado, Detecta vários objetos em uma imagem, Prevê probabilidades de objetos e onde eles estão localizados]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶ [CNN tradicional, YOLO simplificado, R-CNN, YOLO, R-CNN]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶ Detecção - No contexto da detecção de objetos, diferentes métodos são usados dependendo se apenas queremos localizar o objeto ou detectar uma forma mais complexa na imagem. Os dois principais são resumidos na tabela abaixo:
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶ [Detecção de caixa limite, Detecção de marco]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶ [Detecta parte da imagem onde o objeto está localizado, Detecta a forma ou característica de um objeto (e.g. olhos), Mais granular]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶ [Caixa de centro (bx,by), altura bh e largura bw, Pontos de referência (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶ Interseção sobre União (Intersection over Union) - Interseção sobre União, também conhecida como IoU, é uma função que quantifica quão corretamente posicionada uma caixa de delimitação predita Bp está sobre a caixa de delimitação real Ba. É definida por:
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶ Observação: temos que IoU∈[0,1]. Por convenção, uma caixa de delimitação predita Bp é considerada razoavelmente boa se IoU(Bp,Ba)⩾0.5.
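Reviewer note on items 59 and 60: a minimal sketch of IoU as intersection area divided by union area. It assumes axis-aligned boxes given as corner tuples `(x1, y1, x2, y2)`, which is a hypothetical encoding; the cheatsheet itself describes boxes by center, height and width.

```python
def iou(box_p, box_a):
    """Intersection over Union of a predicted box and an actual box.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed encoding).
    """
    px1, py1, px2, py2 = box_p
    ax1, ay1, ax2, ay2 = box_a
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(px2, ax2) - max(px1, ax1))
    inter_h = max(0.0, min(py2, ay2) - max(py1, ay1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (ax2 - ax1) * (ay2 - ay1) - inter
    return inter / union if union > 0 else 0.0


# By the convention in item 60, a prediction counts as reasonably good
# when iou(box_p, box_a) >= 0.5.
overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))
```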
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶ Caixas de ancoragem (Anchor boxes) - Caixas de ancoragem é uma técnica usada para predizer caixas de delimitação que se sobrepõem. Na prática, a rede tem permissão para predizer mais de uma caixa simultaneamente, onde cada caixa prevista é restrita a ter um dado conjunto de propriedades geométricas. Por exemplo, a primeira predição pode ser potencialmente uma caixa retangular de uma determinada forma, enquanto a segunda pode ser outra caixa retangular de uma forma geométrica diferente.
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶ Supressão não máxima (Non-max suppression) - A técnica de supressão não máxima visa remover caixas de delimitação de um mesmo objeto que estão duplicadas e se sobrepõem, selecionando as mais representativas. Depois de ter removido todas as caixas que contêm uma predição de probabilidade menor que 0.6, os seguintes passos são repetidos enquanto existirem caixas remanescentes:
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶ [Para uma dada classe, Passo 1: Pegue a caixa com a maior probabilidade de predição., Passo 2: Descarte todas as caixas que têm IoU⩾0.5 com a caixa anterior.]
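Reviewer note on items 62 and 63: the steps above can be sketched for a single class as follows. The thresholds 0.6 and 0.5 come from the text; the corner-tuple box encoding `(x1, y1, x2, y2)` and all function names are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) (assumed encoding)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, probs, prob_threshold=0.6, iou_threshold=0.5):
    """Return the kept (probability, box) pairs for one class."""
    # Preliminary filter: drop boxes with prediction probability below 0.6.
    candidates = [(p, b) for p, b in zip(probs, boxes) if p >= prob_threshold]
    candidates.sort(key=lambda pb: pb[0], reverse=True)
    kept = []
    while candidates:
        # Step 1: pick the box with the largest prediction probability.
        best = candidates.pop(0)
        kept.append(best)
        # Step 2: discard any box with IoU >= 0.5 with the box just picked.
        candidates = [c for c in candidates if iou(c[1], best[1]) < iou_threshold]
    return kept


boxes = [(0, 0, 2, 2), (0.1, 0, 2.1, 2), (5, 5, 7, 7), (0, 0, 2, 2)]
probs = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, probs)
```

In a multi-class detector the same routine would be run once per class.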
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶ [Predição de caixa, Seleção de caixa com máxima probabilidade, Remoção de sobreposições da mesma classe, Caixas de delimitação final]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶ YOLO - Você Apenas Vê Uma Vez (You Only Look Once - YOLO) é um algoritmo de detecção de objeto que realiza os seguintes passos:
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶ [Passo 1: Divide a imagem de entrada em uma grade G×G., Passo 2: Para cada célula da grade, roda uma CNN que prevê o valor y da seguinte forma:, repita k vezes]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶ onde pc é a probabilidade de detecção do objeto, bx,by,bh,bw são as propriedades da caixa delimitadora detectada, c1,...,cp é uma representação única (one-hot representation) de quais das p classes foram detectadas, e k é o número de caixas de ancoragem.
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶ Passo 3: Rode o algoritmo de supressão não máxima para remover qualquer caixa delimitadora duplicada e sobreposta.
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶ [Imagem original, Divisão em uma grade GxG, Caixa delimitadora prevista, Supressão não máxima]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶ Observação: Quando pc=0, então a rede não detecta nenhum objeto. Nesse caso, as predições correspondentes bx,...,cp devem ser ignoradas.
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes.**
+
+⟶ R-CNN - Região com Redes Neurais Convolucionais (R-CNN) é um algoritmo de detecção de objetos que primeiro segmenta a imagem para encontrar potenciais caixas de delimitação relevantes e então roda o algoritmo de detecção para encontrar os objetos mais prováveis dentro das caixas de delimitação.
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶ [Imagem original, Segmentação, Predição da caixa delimitadora, Supressão não-máxima]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶ Observação: embora o algoritmo original seja computacionalmente caro e lento, arquiteturas mais recentes, como o Fast R-CNN e o Faster R-CNN, permitiram que o algoritmo fosse executado mais rapidamente.
+
+
+
+
+**74. Face verification and recognition**
+
+⟶ Verificação e reconhecimento facial
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in table below:**
+
+⟶ Tipos de modelos - Os dois principais tipos de modelos são resumidos na tabela abaixo:
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶ [Verificação facial, Reconhecimento facial, Consulta, Referência, Banco de dados]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶ [Esta é a pessoa correta?, Pesquisa um-para-um, Esta é uma das K pessoas no banco de dados?, Pesquisa um-para-muitos]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶ Aprendizado de Disparo Único (One Shot Learning) - One Shot Learning é um algoritmo de verificação facial que utiliza um conjunto de treinamento limitado para aprender uma função de similaridade que quantifica o quão diferentes são as duas imagens. A função de similaridade aplicada a duas imagens é frequentemente denotada como d(imagem 1, imagem 2).
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶ Rede Siamesa (Siamese Network) - Siamese Networks buscam aprender como codificar imagens para depois quantificar quão diferentes são as duas imagens. Para uma imagem de entrada x(i), o resultado codificado é normalmente denotado como f(x(i)).
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶ Perda tripla (Triplet loss) - A perda tripla ℓ é uma função de perda (loss function) computada na representação de incorporação (embedding) de três imagens A (âncora), P (positiva) e N (negativa). O exemplo da âncora e o positivo pertencem à mesma classe, enquanto o exemplo negativo pertence a uma classe diferente. Chamando o parâmetro de margem de α∈R+, essa função de perda é definida da seguinte forma:
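Reviewer note on item 80: the formula image is not in this file. The sketch below assumes the common definition ℓ(A,P,N) = max(d(A,P) − d(A,N) + α, 0), with d taken as the squared Euclidean distance between embeddings. Both the choice of d and the default margin value are assumptions, and the embeddings are assumed to be already computed.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss on embeddings of A (anchor), P (positive), N (negative)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    # Zero loss once the negative is farther than the positive by at least alpha.
    return max(gap + alpha, 0.0)


easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], alpha=0.5)
```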
+
+
+
+
+**81. Neural style transfer**
+
+⟶ Transferência de estilo neural
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶ Motivação - O objetivo da transferência de estilo neural é gerar uma imagem G baseada num dado conteúdo C com um estilo S.
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶ [Conteúdo C, Estilo S, Imagem gerada G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶ Ativação - Em uma dada camada l, a ativação é denotada como a[l] e suas dimensões são nH×nw×nc
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶ Função de custo de conteúdo (Content cost function) - A função de custo de conteúdo Jcontent(C,G) é usada para determinar como a imagem gerada G difere da imagem de conteúdo original C. Ela é definida da seguinte forma:
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶ Matriz de estilo - A matriz de estilo G[l] de uma determinada camada l é uma matriz de Gram em que cada um dos seus elementos G[l]kk′ quantifica quão correlacionados são os canais k e k′. Ela é definida com respeito às ativações a[l] da seguinte forma:
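Reviewer note on item 86: the defining formula is an image in the original. The sketch below assumes the usual Gram matrix G[l]kk′ = Σ_i Σ_j a[l]ijk · a[l]ijk′, summing over the nH×nw spatial positions. It uses plain nested lists so it runs without extra libraries; the function name is hypothetical.

```python
def style_matrix(activation):
    """Gram matrix of an activation indexed as activation[i][j][k].

    i ranges over height, j over width, k over the n_c channels.
    Returns an n_c x n_c list of lists.
    """
    n_c = len(activation[0][0])
    gram = [[0.0] * n_c for _ in range(n_c)]
    for row in activation:
        for pixel in row:
            for k in range(n_c):
                for k2 in range(n_c):
                    gram[k][k2] += pixel[k] * pixel[k2]
    return gram


# A 1 x 2 activation with 2 channels.
g = style_matrix([[[1.0, 2.0], [3.0, 4.0]]])
```

The result is symmetric, and its diagonal holds the per-channel sums of squared activations.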
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶ Observação: as matrizes de estilo para a imagem de estilo e para a imagem gerada são denotadas como G[l] (S) e G[l] (G), respectivamente.
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶ Função de custo de estilo (Style cost function) - A função de custo de estilo Jstyle(S,G) é usada para determinar como a imagem gerada G difere do estilo S. Ela é definida da seguinte forma:
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶ Função de custo geral (Overall cost function) - A função de custo geral é definida como sendo a combinação das funções de custo do conteúdo e do estilo, ponderada pelos parâmetros α,β, como mostrado abaixo:
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶ Observação: um valor de α maior irá fazer com que o modelo se preocupe mais com o conteúdo enquanto um maior valor de β irá fazer com que ele se preocupe mais com o estilo.
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶ Arquiteturas usando truques computacionais
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+⟶ Rede Adversarial Generativa (Generative Adversarial Network) - As Generative Adversarial Networks, também conhecidas como GANs, são compostas de um modelo generativo e um modelo discriminativo, onde o modelo generativo visa gerar a saída mais verdadeira, que será alimentada no modelo discriminativo, que visa diferenciar a imagem gerada da imagem verdadeira.
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶ [Treinamento, Ruído, Imagem real, Gerador, Discriminador, Real Falsa]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶ Observação: casos de uso usando variações de GANs incluem texto para imagem, geração de música e síntese.
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ ResNet - A arquitetura de Rede Residual (também chamada de ResNet) usa blocos residuais com um alto número de camadas para diminuir o erro de treinamento. O bloco residual possui a seguinte equação caracterizadora:
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶ Rede de Iniciação - Esta arquitetura utiliza módulos de iniciação e visa experimentar diferentes convoluções, a fim de aumentar seu desempenho através da diversificação de recursos. Em particular, ela usa o truque de convolução 1×1 para limitar a carga computacional.
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Os resumos de Aprendizagem Profunda estão agora disponíveis em português.
+
+
+
+
+**98. Original authors**
+
+⟶ Autores Originais
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶ Traduzido por Leticia Portella
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶ Revisado por Gabriel Fonseca
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶ Ver versão em PDF no GitHub.
+
+
+
+
+**102. By X and Y**
+
+⟶ Por X e Y
+
+
diff --git a/ru/cheatsheet-deep-learning.md b/ru/cheatsheet-deep-learning.md
deleted file mode 100644
index a5aa3756c..000000000
--- a/ru/cheatsheet-deep-learning.md
+++ /dev/null
@@ -1,321 +0,0 @@
-**1. Deep Learning cheatsheet**
-
-⟶
-
-
-
-**2. Neural Networks**
-
-⟶
-
-
-
-**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
-
-⟶
-
-
-
-**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
-
-⟶
-
-
-
-**5. [Input layer, hidden layer, output layer]**
-
-⟶
-
-
-
-**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-
-⟶
-
-
-
-**7. where we note w, b, z the weight, bias and output respectively.**
-
-⟶
-
-
-
-**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-
-⟶
-
-
-
-**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
-
-⟶
-
-
-
-**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-⟶
-
-
-
-**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-
-⟶
-
-
-
-**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
-
-⟶
-
-
-
-**13. As a result, the weight is updated as follows:**
-
-⟶
-
-
-
-**14. Updating weights ― In a neural network, weights are updated as follows:**
-
-⟶
-
-
-
-**15. Step 1: Take a batch of training data.**
-
-⟶
-
-
-
-**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
-
-⟶
-
-
-
-**17. Step 3: Backpropagate the loss to get the gradients.**
-
-⟶
-
-
-
-**18. Step 4: Use the gradients to update the weights of the network.**
-
-⟶
-
-
-
-**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
-
-⟶
-
-
-
-**20. Convolutional Neural Networks**
-
-⟶
-
-
-
-**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-
-⟶
-
-
-
-**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
-
-⟶
-
-
-
-**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-⟶
-
-
-
-**24. Recurrent Neural Networks**
-
-⟶
-
-
-
-**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-
-⟶
-
-
-
-**26. [Input gate, forget gate, gate, output gate]**
-
-⟶
-
-
-
-**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
-
-⟶
-
-
-
-**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-
-⟶
-
-
-
-**29. Reinforcement Learning and Control**
-
-⟶
-
-
-
-**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
-
-⟶
-
-
-
-**33. S is the set of states**
-
-⟶
-
-
-
-**34. A is the set of actions**
-
-⟶
-
-
-
-**35. {Psa} are the state transition probabilities for s∈S and a∈A**
-
-⟶
-
-
-
-**36. γ∈[0,1[ is the discount factor**
-
-⟶
-
-
-
-**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
-
-⟶
-
-
-
-**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-
-⟶
-
-
-
-**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
-
-⟶
-
-
-
-**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
-
-⟶
-
-
-
-**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
-
-⟶
-
-
-
-**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
-
-⟶
-
-
-
-**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
-
-⟶
-
-
-
-**44. 1) We initialize the value:**
-
-⟶
-
-
-
-**45. 2) We iterate the value based on the values before:**
-
-⟶
-
-
-
-**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-
-⟶
-
-
-
-**47. times took action a in state s and got to s′**
-
-⟶
-
-
-
-**48. times took action a in state s**
-
-⟶
-
-
-
-**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-
-⟶
-
-
-
-**50. View PDF version on GitHub**
-
-⟶
-
-
-
-**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-
-⟶
-
-
-
-**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-
-⟶
-
-
-
-**53. [Recurrent Neural Networks, Gates, LSTM]**
-
-⟶
-
-
-
-**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-
-⟶
diff --git a/ru/cheatsheet-machine-learning-tips-and-tricks.md b/ru/cheatsheet-machine-learning-tips-and-tricks.md
deleted file mode 100644
index 9712297b8..000000000
--- a/ru/cheatsheet-machine-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,285 +0,0 @@
-**1. Machine Learning tips and tricks cheatsheet**
-
-⟶
-
-
-
-**2. Classification metrics**
-
-⟶
-
-
-
-**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-
-⟶
-
-
-
-**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-
-⟶
-
-
-
-**5. [Predicted class, Actual class]**
-
-⟶
-
-
-
-**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-
-⟶
-
-
-
-**7. [Metric, Formula, Interpretation]**
-
-⟶
-
-
-
-**8. Overall performance of model**
-
-⟶
-
-
-
-**9. How accurate the positive predictions are**
-
-⟶
-
-
-
-**10. Coverage of actual positive sample**
-
-⟶
-
-
-
-**11. Coverage of actual negative sample**
-
-⟶
-
-
-
-**12. Hybrid metric useful for unbalanced classes**
-
-⟶
-
-
-
-**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are are summed up in the table below:**
-
-⟶
-
-
-
-**14. [Metric, Formula, Equivalent]**
-
-⟶
-
-
-
-**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-
-⟶
-
-
-
-**16. [Actual, Predicted]**
-
-⟶
-
-
-
-**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-
-⟶
-
-
-
-**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-
-⟶
-
-
-
-**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-
-⟶
-
-
-
-**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-
-⟶
-
-
-
-**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
-
-⟶
-
-
-
-**22. Model selection**
-
-⟶
-
-
-
-**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-
-⟶
-
-
-
-**24. [Training set, Validation set, Testing set]**
-
-⟶
-
-
-
-**25. [Model is trained, Model is assessed, Model gives predictions]**
-
-⟶
-
-
-
-**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-
-⟶
-
-
-
-**27. [Also called hold-out or development set, Unseen data]**
-
-⟶
-
-
-
-**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-
-⟶
-
-
-
-**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-
-⟶
-
-
-
-**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-
-⟶
-
-
-
-**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-
-⟶
-
-
-
-**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-
-⟶
-
-
-
-**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-
-⟶
-
-
-
-**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-⟶
-
-
-
-**35. Diagnostics**
-
-⟶
-
-
-
-**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-
-⟶
-
-
-
-**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-
-⟶
-
-
-
-**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-
-⟶
-
-
-
-**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-
-⟶
-
-
-
-**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-
-⟶
-
-
-
-**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-
-⟶
-
-
-
-**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-
-⟶
-
-
-
-**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-
-⟶
-
-
-
-**44. Regression metrics**
-
-⟶
-
-
-
-**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-⟶
-
-
-
-**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-
-⟶
-
-
-
-**47. [Model selection, cross-validation, regularization]**
-
-⟶
-
-
-
-**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-⟶
diff --git a/ru/cheatsheet-supervised-learning.md b/ru/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/ru/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Supervised Learning**
-
-⟶
-
-
-
-**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-⟶
-
-
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-⟶
-
-
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-⟶
-
-
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-⟶
-
-
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-⟶
-
-
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-⟶
-
-
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-⟶
-
-
-
-**10. Notations and general concepts**
-
-⟶
-
-
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-⟶
-
-
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-⟶
-
-
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-⟶
-
-
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-⟶
-
-
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-⟶
-
-
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-⟶
-
-
-
-**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
-
-⟶
-
-
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-⟶
-
-
-
-**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-⟶
-
-
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-⟶
-
-
-
-**21. Linear models**
-
-⟶
-
-
-
-**22. Linear regression**
-
-⟶
-
-
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-⟶
-
-
-
-**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**
-
-⟶
-
-
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-⟶
-
-
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-⟶
-
-
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-⟶
-
-
-
-**28. Classification and logistic regression**
-
-⟶
-
-
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-⟶
-
-
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-⟶
-
-
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-⟶
-
-
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-⟶
-
-
-
-**33. Generalized Linear Models**
-
-⟶
-
-
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-⟶
-
-
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-⟶
-
-
-
-**36. Here are the most common exponential distributions summed up in the following table:**
-
-⟶
-
-
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-⟶
-
-
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:**
-
-⟶
-
-
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-⟶
-
-
-
-**40. Support Vector Machines**
-
-⟶
-
-
-
-**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
-
-⟶
-
-
-
-**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-⟶
-
-
-
-**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
-
-⟶
-
-
-
-**44. such that**
-
-⟶
-
-
-
-**45. support vectors**
-
-⟶
-
-
-
-**46. Remark: the line is defined as wTx−b=0.**
-
-⟶
-
-
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-⟶
-
-
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
-
-⟶
-
-
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**
-
-⟶
-
-
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-⟶
-
-
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-⟶
-
-
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-⟶
-
-
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-⟶
-
-
-
-**54. Generative Learning**
-
-⟶
-
-
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-⟶
-
-
-
-**56. Gaussian Discriminant Analysis**
-
-⟶
-
-
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-⟶
-
-
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-⟶
-
-
-
-**59. Naive Bayes**
-
-⟶
-
-
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-⟶
-
-
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-⟶
-
-
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-⟶
-
-
-
-**63. Tree-based and ensemble methods**
-
-⟶
-
-
-
-**64. These methods can be used for both regression and classification problems.**
-
-⟶
-
-
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
-
-⟶
-
-
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-⟶
-
-
-
-**67. Remark: random forests are a type of ensemble methods.**
-
-⟶
-
-
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-⟶
-
-
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-⟶
-
-
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-⟶
-
-
-
-**71. Weak learners trained on remaining errors**
-
-⟶
-
-
-
-**72. Other non-parametric approaches**
-
-⟶
-
-
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-⟶
-
-
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-⟶
-
-
-
-**75. Learning Theory**
-
-⟶
-
-
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-⟶
-
-
-
-**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
-
-⟶
-
-
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-⟶
-
-
-
-**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
-
-⟶
-
-
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
-
-⟶
-
-
-
-**81: the training and testing sets follow the same distribution **
-
-⟶
-
-
-
-**82. the training examples are drawn independently**
-
-⟶
-
-
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-⟶
-
-
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-⟶
-
-
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-⟶
-
-
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-⟶
-
-
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-⟶
-
-
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-⟶
-
-
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-⟶
-
-
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-⟶
-
-
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-⟶
-
-
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-⟶
-
-
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-⟶
-
-
-
-**94. [Other methods, k-NN]**
-
-⟶
-
-
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-⟶
diff --git a/ru/cheatsheet-unsupervised-learning.md b/ru/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index e18b3f50f..000000000
--- a/ru/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Unsupervised Learning**
-
-⟶
-
-
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-⟶
-
-
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-⟶
-
-
-
-**5. Clustering**
-
-⟶
-
-
-
-**6. Expectation-Maximization**
-
-⟶
-
-
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-⟶
-
-
-
-**8. [Setting, Latent variable z, Comments]**
-
-⟶
-
-
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-⟶
-
-
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-⟶
-
-
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-⟶
-
-
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-⟶
-
-
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-⟶
-
-
-
-**14. k-means clustering**
-
-⟶
-
-
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-⟶
-
-
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-⟶
-
-
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-⟶
-
-
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-⟶
-
-
-
-**19. Hierarchical clustering**
-
-⟶
-
-
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
-
-⟶
-
-
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
-
-⟶
-
-
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-⟶
-
-
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
-
-⟶
-
-
-
-**24. Clustering assessment metrics**
-
-⟶
-
-
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-⟶
-
-
-
-**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-⟶
-
-
-
-**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-⟶
-
-
-
-**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-⟶
-
-
-
-**29. Dimension reduction**
-
-⟶
-
-
-
-**30. Principal component analysis**
-
-⟶
-
-
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-⟶
-
-
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**34. diagonal**
-
-⟶
-
-
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-⟶
-
-
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-⟶
-
-
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-⟶
-
-
-
-**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-
-⟶
-
-
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-⟶
-
-
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-⟶
-
-
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-⟶
-
-
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-⟶
-
-
-
-**43. Independent component analysis**
-
-⟶
-
-
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-⟶
-
-
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-⟶
-
-
-
-**46. The goal is to find the unmixing matrix W=A−1.**
-
-⟶
-
-
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-⟶
-
-
-
-**48. Write the probability of x=As=W−1s as:**
-
-⟶
-
-
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-⟶
-
-
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-⟶
-
-
-
-**51. The Machine Learning cheatsheets are now available in Russian.**
-
-⟶
-
-
-
-**52. Original authors**
-
-⟶
-
-
-
-**53. Translated by X, Y and Z**
-
-⟶
-
-
-
-**54. Reviewed by X, Y and Z**
-
-⟶
-
-
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-**57. [Dimension reduction, PCA, ICA]**
-
-⟶
diff --git a/ru/refresher-linear-algebra.md b/ru/refresher-linear-algebra.md
deleted file mode 100644
index a6b440d1e..000000000
--- a/ru/refresher-linear-algebra.md
+++ /dev/null
@@ -1,339 +0,0 @@
-**1. Linear Algebra and Calculus refresher**
-
-⟶
-
-
-
-**2. General notations**
-
-⟶
-
-
-
-**3. Definitions**
-
-⟶
-
-
-
-**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
-
-⟶
-
-
-
-**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
-
-⟶
-
-
-
-**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
-
-⟶
-
-
-
-**7. Main matrices**
-
-⟶
-
-
-
-**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
-
-⟶
-
-
-
-**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
-
-⟶
-
-
-
-**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
-
-⟶
-
-
-
-**11. Remark: we also note D as diag(d1,...,dn).**
-
-⟶
-
-
-
-**12. Matrix operations**
-
-⟶
-
-
-
-**13. Multiplication**
-
-⟶
-
-
-
-**14. Vector-vector ― There are two types of vector-vector products:**
-
-⟶
-
-
-
-**15. inner product: for x,y∈Rn, we have:**
-
-⟶
-
-
-
-**16. outer product: for x∈Rm,y∈Rn, we have:**
-
-⟶
-
-
-
-**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:**
-
-⟶
-
-
-
-**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
-
-⟶
-
-
-
-**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
-
-⟶
-
-
-
-**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
-
-⟶
-
-
-
-**21. Other operations**
-
-⟶
-
-
-
-**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
-
-⟶
-
-
-
-**23. Remark: for matrices A,B, we have (AB)T=BTAT**
-
-⟶
-
-
-
-**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
-
-⟶
-
-
-
-**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
-
-⟶
-
-
-
-**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
-
-⟶
-
-
-
-**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
-
-⟶
-
-
-
-**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
-
-⟶
-
-
-
-**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
-
-⟶
-
-
-
-**30. Matrix properties**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
-
-⟶
-
-
-
-**33. [Symmetric, Antisymmetric]**
-
-⟶
-
-
-
-**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
-
-⟶
-
-
-
-**35. N(ax)=|a|N(x) for a scalar a**
-
-⟶
-
-
-
-**36. if N(x)=0, then x=0**
-
-⟶
-
-
-
-**37. For x∈V, the most commonly used norms are summed up in the table below:**
-
-⟶
-
-
-
-**38. [Norm, Notation, Definition, Use case]**
-
-⟶
-
-
-
-**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
-
-⟶
-
-
-
-**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
-
-⟶
-
-
-
-**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
-
-⟶
-
-
-
-**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
-
-⟶
-
-
-
-**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
-
-⟶
-
-
-
-**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**46. diagonal**
-
-⟶
-
-
-
-**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
-
-⟶
-
-
-
-**48. Matrix calculus**
-
-⟶
-
-
-
-**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is an m×n matrix, noted ∇Af(A), such that:**
-
-⟶
-
-
-
-**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
-
-⟶
-
-
-
-**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The Hessian of f with respect to x is an n×n symmetric matrix, noted ∇2xf(x), such that:**
-
-⟶
-
-
-
-**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
-
-⟶
-
-
-
-**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
-
-⟶
-
-
-
-**54. [General notations, Definitions, Main matrices]**
-
-⟶
-
-
-
-**55. [Matrix operations, Multiplication, Other operations]**
-
-⟶
-
-
-
-**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
-
-⟶
-
-
-
-**57. [Matrix calculus, Gradient, Hessian, Operations]**
-
-⟶
diff --git a/ru/refresher-probability.md b/ru/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/ru/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-⟶
-
-
-
-**2. Introduction to Probability and Combinatorics**
-
-⟶
-
-
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-⟶
-
-
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-⟶
-
-
-
-**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
-
-⟶
-
-
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-⟶
-
-
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-⟶
-
-
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-⟶
-
-
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-⟶
-
-
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-⟶
-
-
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-⟶
-
-
-
-**12. Conditional Probability**
-
-⟶
-
-
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-⟶
-
-
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-⟶
-
-
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-⟶
-
-
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-⟶
-
-
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-⟶
-
-
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-⟶
-
-
-
-**19. Random Variables**
-
-⟶
-
-
-
-**20. Definitions**
-
-⟶
-
-
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to the real line.**
-
-⟶
-
-
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-⟶
-
-
-
-**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-⟶
-
-
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-⟶
-
-
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-⟶
-
-
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-⟶
-
-
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-⟶
-
-
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-⟶
-
-
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-⟶
-
-
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-⟶
-
-
-
-**32. Probability Distributions**
-
-⟶
-
-
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-⟶
-
-
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-⟶
-
-
-
-**35. [Type, Distribution]**
-
-⟶
-
-
-
-**36. Jointly Distributed Random Variables**
-
-⟶
-
-
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-⟶
-
-
-
-**38. [Case, Marginal density, Cumulative function]**
-
-⟶
-
-
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-⟶
-
-
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-⟶
-
-
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-⟶
-
-
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-⟶
-
-
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-⟶
-
-
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-⟶
-
-
-
-**45. Parameter estimation**
-
-⟶
-
-
-
-**46. Definitions**
-
-⟶
-
-
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-⟶
-
-
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-⟶
-
-
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-⟶
-
-
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-⟶
-
-
-
-**51. Estimating the mean**
-
-⟶
-
-
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted X̄ and is defined as follows:**
-
-⟶
-
-
-
-**53. Remark: the sample mean is unbiased, i.e. E[X̄]=μ.**
-
-⟶
-
-
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-⟶
-
-
-
-**55. Estimating the variance**
-
-⟶
-
-
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-⟶
-
-
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-⟶
-
-
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-⟶
-
-
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-⟶
-
-
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-⟶
-
-
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-⟶
-
-
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-⟶
-
-
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-⟶
-
-
-
-**64. [Parameter estimation, Mean, Variance]**
-
-⟶
diff --git a/template/cheatsheet-deep-learning.md b/template/cheatsheet-deep-learning.md
deleted file mode 100644
index a5aa3756c..000000000
--- a/template/cheatsheet-deep-learning.md
+++ /dev/null
@@ -1,321 +0,0 @@
-**1. Deep Learning cheatsheet**
-
-⟶
-
-
-
-**2. Neural Networks**
-
-⟶
-
-
-
-**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
-
-⟶
-
-
-
-**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
-
-⟶
-
-
-
-**5. [Input layer, hidden layer, output layer]**
-
-⟶
-
-
-
-**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-
-⟶
-
-
-
-**7. where we note w, b, z the weight, bias and output respectively.**
-
-⟶
-
-
-
-**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-
-⟶
-
-
-
-**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
-
-⟶
-
-
-
-**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-⟶
-
-
-
-**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-
-⟶
-
-
-
-**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
-
-⟶
-
-
-
-**13. As a result, the weight is updated as follows:**
-
-⟶
-
-
-
-**14. Updating weights ― In a neural network, weights are updated as follows:**
-
-⟶
-
-
-
-**15. Step 1: Take a batch of training data.**
-
-⟶
-
-
-
-**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
-
-⟶
-
-
-
-**17. Step 3: Backpropagate the loss to get the gradients.**
-
-⟶
-
-
-
-**18. Step 4: Use the gradients to update the weights of the network.**
-
-⟶
-
-
-
-**19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.**
-
-⟶
-
-
-
-**20. Convolutional Neural Networks**
-
-⟶
-
-
-
-**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-
-⟶
-
-
-
-**22. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
-
-⟶
-
-
-
-**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-⟶
-
-
-
-**24. Recurrent Neural Networks**
-
-⟶
-
-
-
-**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-
-⟶
-
-
-
-**26. [Input gate, forget gate, gate, output gate]**
-
-⟶
-
-
-
-**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
-
-⟶
-
-
-
-**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-
-⟶
-
-
-
-**29. Reinforcement Learning and Control**
-
-⟶
-
-
-
-**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
-
-⟶
-
-
-
-**33. S is the set of states**
-
-⟶
-
-
-
-**34. A is the set of actions**
-
-⟶
-
-
-
-**35. {Psa} are the state transition probabilities for s∈S and a∈A**
-
-⟶
-
-
-
-**36. γ∈[0,1[ is the discount factor**
-
-⟶
-
-
-
-**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
-
-⟶
-
-
-
-**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-
-⟶
-
-
-
-**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
-
-⟶
-
-
-
-**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
-
-⟶
-
-
-
-**41. Bellman equation ― The optimal Bellman equation characterizes the value function Vπ∗ of the optimal policy π∗:**
-
-⟶
-
-
-
-**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
-
-⟶
-
-
-
-**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
-
-⟶
-
-
-
-**44. 1) We initialize the value:**
-
-⟶
-
-
-
-**45. 2) We iterate the value based on the values before:**
-
-⟶
-
-
-
-**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-
-⟶
-
-
-
-**47. times took action a in state s and got to s′**
-
-⟶
-
-
-
-**48. times took action a in state s**
-
-⟶
-
-
-
-**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-
-⟶
-
-
-
-**50. View PDF version on GitHub**
-
-⟶
-
-
-
-**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-
-⟶
-
-
-
-**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-
-⟶
-
-
-
-**53. [Recurrent Neural Networks, Gates, LSTM]**
-
-⟶
-
-
-
-**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-
-⟶
diff --git a/template/cheatsheet-machine-learning-tips-and-tricks.md b/template/cheatsheet-machine-learning-tips-and-tricks.md
deleted file mode 100644
index 9712297b8..000000000
--- a/template/cheatsheet-machine-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,285 +0,0 @@
-**1. Machine Learning tips and tricks cheatsheet**
-
-⟶
-
-
-
-**2. Classification metrics**
-
-⟶
-
-
-
-**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-
-⟶
-
-
-
-**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-
-⟶
-
-
-
-**5. [Predicted class, Actual class]**
-
-⟶
-
-
-
-**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-
-⟶
-
-
-
-**7. [Metric, Formula, Interpretation]**
-
-⟶
-
-
-
-**8. Overall performance of model**
-
-⟶
-
-
-
-**9. How accurate the positive predictions are**
-
-⟶
-
-
-
-**10. Coverage of actual positive sample**
-
-⟶
-
-
-
-**11. Coverage of actual negative sample**
-
-⟶
-
-
-
-**12. Hybrid metric useful for unbalanced classes**
-
-⟶
-
-
-
-**13. ROC ― The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:**
-
-⟶
-
-
-
-**14. [Metric, Formula, Equivalent]**
-
-⟶
-
-
-
-**15. AUC ― The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-
-⟶
-
-
-
-**16. [Actual, Predicted]**
-
-⟶
-
-
-
-**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-
-⟶
-
-
-
-**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-
-⟶
-
-
-
-**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-
-⟶
-
-
-
-**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-
-⟶
-
-
-
-**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
-
-⟶
-
-
-
-**22. Model selection**
-
-⟶
-
-
-
-**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-
-⟶
-
-
-
-**24. [Training set, Validation set, Testing set]**
-
-⟶
-
-
-
-**25. [Model is trained, Model is assessed, Model gives predictions]**
-
-⟶
-
-
-
-**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-
-⟶
-
-
-
-**27. [Also called hold-out or development set, Unseen data]**
-
-⟶
-
-
-
-**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-
-⟶
-
-
-
-**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-
-⟶
-
-
-
-**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-
-⟶
-
-
-
-**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-
-⟶
-
-
-
-**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-
-⟶
-
-
-
-**33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-
-⟶
-
-
-
-**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-⟶
-
-
-
-**35. Diagnostics**
-
-⟶
-
-
-
-**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-
-⟶
-
-
-
-**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-
-⟶
-
-
-
-**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-
-⟶
-
-
-
-**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-
-⟶
-
-
-
-**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-
-⟶
-
-
-
-**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-
-⟶
-
-
-
-**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-
-⟶
-
-
-
-**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-
-⟶
-
-
-
-**44. Regression metrics**
-
-⟶
-
-
-
-**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-⟶
-
-
-
-**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-
-⟶
-
-
-
-**47. [Model selection, cross-validation, regularization]**
-
-⟶
-
-
-
-**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-⟶
diff --git a/template/cheatsheet-supervised-learning.md b/template/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/template/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Supervised Learning**
-
-⟶
-
-
-
-**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-⟶
-
-
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-⟶
-
-
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-⟶
-
-
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-⟶
-
-
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-⟶
-
-
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-⟶
-
-
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-⟶
-
-
-
-**10. Notations and general concepts**
-
-⟶
-
-
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-⟶
-
-
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-⟶
-
-
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-⟶
-
-
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-⟶
-
-
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-⟶
-
-
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-⟶
-
-
-
-**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
-
-⟶
-
-
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-⟶
-
-
-
-**19. Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-⟶
-
-
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-⟶
-
-
-
-**21. Linear models**
-
-⟶
-
-
-
-**22. Linear regression**
-
-⟶
-
-
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-⟶
-
-
-
-**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
-
-⟶
-
-
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-⟶
-
-
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-⟶
-
-
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-⟶
-
-
-
-**28. Classification and logistic regression**
-
-⟶
-
-
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-⟶
-
-
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-⟶
-
-
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-⟶
-
-
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-⟶
-
-
-
-**33. Generalized Linear Models**
-
-⟶
-
-
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-⟶
-
-
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-⟶
-
-
-
-**36. Here are the most common exponential distributions summed up in the following table:**
-
-⟶
-
-
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-⟶
-
-
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
-
-⟶
-
-
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-⟶
-
-
-
-**40. Support Vector Machines**
-
-⟶
-
-
-
-**41. The goal of support vector machines is to find the line that maximizes the minimum distance from the data points to the line.**
-
-⟶
-
-
-
-**42. Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-⟶
-
-
-
-**43. where (w,b)∈Rn×R is the solution of the following optimization problem:**
-
-⟶
-
-
-
-**44. such that**
-
-⟶
-
-
-
-**45. support vectors**
-
-⟶
-
-
-
-**46. Remark: the line is defined as wTx−b=0.**
-
-⟶
-
-
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-⟶
-
-
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
-
-⟶
-
-
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||²/(2σ²)) is called the Gaussian kernel and is commonly used.**
-
-⟶
-
-
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-⟶
-
-
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-⟶
-
-
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-⟶
-
-
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-⟶
-
-
-
-**54. Generative Learning**
-
-⟶
-
-
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-⟶
-
-
-
-**56. Gaussian Discriminant Analysis**
-
-⟶
-
-
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-⟶
-
-
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-⟶
-
-
-
-**59. Naive Bayes**
-
-⟶
-
-
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-⟶
-
-
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-⟶
-
-
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-⟶
-
-
-
-**63. Tree-based and ensemble methods**
-
-⟶
-
-
-
-**64. These methods can be used for both regression and classification problems.**
-
-⟶
-
-
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.**
-
-⟶
-
-
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-⟶
-
-
-
-**67. Remark: random forests are a type of ensemble method.**
-
-⟶
-
-
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-⟶
-
-
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-⟶
-
-
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-⟶
-
-
-
-**71. Weak learners trained on remaining errors**
-
-⟶
-
-
-
-**72. Other non-parametric approaches**
-
-⟶
-
-
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-⟶
-
-
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-⟶
-
-
-
-**75. Learning Theory**
-
-⟶
-
-
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-⟶
-
-
-
-**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
-
-⟶
-
-
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-⟶
-
-
-
-**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
-
-⟶
-
-
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
-
-⟶
-
-
-
-**81: the training and testing sets follow the same distribution **
-
-⟶
-
-
-
-**82. the training examples are drawn independently**
-
-⟶
-
-
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-⟶
-
-
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-⟶
-
-
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-⟶
-
-
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-⟶
-
-
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-⟶
-
-
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-⟶
-
-
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-⟶
-
-
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-⟶
-
-
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-⟶
-
-
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-⟶
-
-
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-⟶
-
-
-
-**94. [Other methods, k-NN]**
-
-⟶
-
-
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-⟶
diff --git a/template/cheatsheet-unsupervised-learning.md b/template/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index 827d815a3..000000000
--- a/template/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Unsupervised Learning**
-
-⟶
-
-
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-⟶
-
-
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-⟶
-
-
-
-**5. Clustering**
-
-⟶
-
-
-
-**6. Expectation-Maximization**
-
-⟶
-
-
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-⟶
-
-
-
-**8. [Setting, Latent variable z, Comments]**
-
-⟶
-
-
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-⟶
-
-
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-⟶
-
-
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-⟶
-
-
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-⟶
-
-
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-⟶
-
-
-
-**14. k-means clustering**
-
-⟶
-
-
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-⟶
-
-
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-⟶
-
-
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-⟶
-
-
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-⟶
-
-
-
-**19. Hierarchical clustering**
-
-⟶
-
-
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
-
-⟶
-
-
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
-
-⟶
-
-
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-⟶
-
-
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
-
-⟶
-
-
-
-**24. Clustering assessment metrics**
-
-⟶
-
-
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-⟶
-
-
-
-**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-⟶
-
-
-
-**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-⟶
-
-
-
-**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-⟶
-
-
-
-**29. Dimension reduction**
-
-⟶
-
-
-
-**30. Principal component analysis**
-
-⟶
-
-
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-⟶
-
-
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**34. diagonal**
-
-⟶
-
-
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-⟶
-
-
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-⟶
-
-
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-⟶
-
-
-
-**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-
-⟶
-
-
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-⟶
-
-
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-⟶
-
-
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-⟶
-
-
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-⟶
-
-
-
-**43. Independent component analysis**
-
-⟶
-
-
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-⟶
-
-
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-⟶
-
-
-
-**46. The goal is to find the unmixing matrix W=A−1.**
-
-⟶
-
-
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-⟶
-
-
-
-**48. Write the probability of x=As=W−1s as:**
-
-⟶
-
-
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-⟶
-
-
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-⟶
-
-
-
-**51. The Machine Learning cheatsheets are now available in Japanese.**
-
-⟶
-
-
-
-**52. Original authors**
-
-⟶
-
-
-
-**53. Translated by X, Y and Z**
-
-⟶
-
-
-
-**54. Reviewed by X, Y and Z**
-
-⟶
-
-
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-**57. [Dimension reduction, PCA, ICA]**
-
-⟶
diff --git a/template/cs-221-logic-models.md b/template/cs-221-logic-models.md
new file mode 100644
index 000000000..8be03acc4
--- /dev/null
+++ b/template/cs-221-logic-models.md
@@ -0,0 +1,462 @@
+**Logic-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-logic-models)
+
+
+
+**1. Logic-based models with propositional and first-order logic**
+
+⟶
+
+
+
+
+**2. Basics**
+
+⟶
+
+
+
+
+**3. Syntax of propositional logic ― By noting f,g formulas, and ¬,∧,∨,→,↔ connectives, we can write the following logical expressions:**
+
+⟶
+
+
+
+
+**4. [Name, Symbol, Meaning, Illustration]**
+
+⟶
+
+
+
+
+**5. [Affirmation, Negation, Conjunction, Disjunction, Implication, Biconditional]**
+
+⟶
+
+
+
+
+**6. [not f, f and g, f or g, if f then g, f, that is to say g]**
+
+⟶
+
+
+
+
+**7. Remark: formulas can be built up recursively out of these connectives.**
+
+⟶
+
+
+
+
+**8. Model ― A model w denotes an assignment of binary weights to propositional symbols.**
+
+⟶
+
+
+
+
+**9. Example: the set of truth values w={A:0,B:1,C:0} is one possible model for the propositional symbols A, B and C.**
+
+⟶
+
+
+
+
+**10. Interpretation function ― The interpretation function I(f,w) outputs whether model w satisfies formula f:**
+
+⟶
+
+
+
+
+**11. Set of models ― M(f) denotes the set of models w that satisfy formula f. Mathematically speaking, we define it as follows:**
+
+⟶
+
+
+
+
+**12. Knowledge base**
+
+⟶
+
+
+
+
+**13. Definition ― The knowledge base KB is the conjunction of all formulas that have been considered so far. The set of models of the knowledge base is the intersection of the set of models that satisfy each formula. In other words:**
+
+⟶
+
+
+
+
+**14. Probabilistic interpretation ― The probability that query f is evaluated to 1 can be seen as the proportion of models w of the knowledge base KB that satisfy f, i.e.:**
+
+⟶
+
+
+
+
+**15. Satisfiability ― The knowledge base KB is said to be satisfiable if at least one model w satisfies all its constraints. In other words:**
+
+⟶
+
+
+
+
+**16. satisfiable**
+
+⟶
+
+
+
+
+**17. Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge base.**
+
+⟶
+
+
+
+
+**18. Relation between formulas and knowledge base ― We define the following properties between the knowledge base KB and a new formula f:**
+
+⟶
+
+
+
+
+**19. [Name, Mathematical formulation, Illustration, Notes]**
+
+⟶
+
+
+
+
+**20. [KB entails f, KB contradicts f, f contingent to KB]**
+
+⟶
+
+
+
+
+**21. [f does not bring any new information, Also written KB⊨f, No model satisfies the constraints after adding f, Equivalent to KB⊨¬f, f does not contradict KB, f adds a non-trivial amount of information to KB]**
+
+⟶
+
+
+
+
+**22. Model checking ― A model checking algorithm takes as input a knowledge base KB and outputs whether it is satisfiable or not.**
+
+⟶
+
+
+
+
+**23. Remark: popular model checking algorithms include DPLL and WalkSat.**
+
+⟶
+
+
+
+
+**24. Inference rule ― An inference rule of premises f1,...,fk and conclusion g is written:**
+
+⟶
+
+
+
+
+**25. Forward inference algorithm ― From a set of inference rules Rules, this algorithm goes through all possible f1,...,fk and adds g to the knowledge base KB if a matching rule exists. This process is repeated until no more additions can be made to KB.**
+
+⟶
+
+
+
+
+**26. Derivation ― We say that KB derives f (written KB⊢f) with rules Rules if f is already in KB or gets added during the forward inference algorithm using the set of rules Rules.**
+
+⟶
+
+
+
+
+**27. Properties of inference rules ― A set of inference rules Rules can have the following properties:**
+
+⟶
+
+
+
+
+**28. [Name, Mathematical formulation, Notes]**
+
+⟶
+
+
+
+
+**29. [Soundness, Completeness]**
+
+⟶
+
+
+
+
+**30. [Inferred formulas are entailed by KB, Can be checked one rule at a time, "Nothing but the truth", Formulas entailed by KB are either already in the knowledge base or inferred from it, "The whole truth"]**
+
+⟶
+
+
+
+
+**31. Propositional logic**
+
+⟶
+
+
+
+
+**32. In this section, we will go through logic-based models that use logical formulas and inference rules. The idea here is to balance expressivity and computational efficiency.**
+
+⟶
+
+
+
+
+**33. Horn clause ― By noting p1,...,pk and q propositional symbols, a Horn clause has the form:**
+
+⟶
+
+
+
+
+**34. Remark: when q=false, it is called a "goal clause", otherwise we denote it as a "definite clause".**
+
+⟶
+
+
+
+
+**35. Modus ponens ― For propositional symbols f1,...,fk and p, the modus ponens rule is written:**
+
+⟶
+
+
+
+
+**36. Remark: it takes linear time to apply this rule, as each application generates a clause that contains a single propositional symbol.**
+
+⟶
+
+
+
+
+**37. Completeness ― Modus ponens is complete with respect to Horn clauses if we suppose that KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus ponens will then derive p.**
+
+⟶
+
+
+
+
+**38. Conjunctive normal form ― A conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is a disjunction of atomic formulas.**
+
+⟶
+
+
+
+
+**39. Remark: in other words, CNFs are ∧ of ∨.**
+
+⟶
+
+
+
+
+**40. Equivalent representation ― Every formula in propositional logic can be written into an equivalent CNF formula. The table below presents general conversion properties:**
+
+⟶
+
+
+
+
+**41. [Rule name, Initial, Converted, Eliminate, Distribute, over]**
+
+⟶
+
+
+
+
+**42. Resolution rule ― For propositional symbols f1,...,fn, and g1,...,gm as well as p, the resolution rule is written:**
+
+⟶
+
+
+
+
+**43. Remark: it can take exponential time to apply this rule, as each application generates a clause that has a subset of the propositional symbols.**
+
+⟶
+
+
+
+
+**44. [Resolution-based inference ― The resolution-based inference algorithm follows the following steps:, Step 1: Convert all formulas into CNF, Step 2: Repeatedly apply resolution rule, Step 3: Return unsatisfiable if and only if False, is derived]**
+
+⟶
+
+
+
+
+**45. First-order logic**
+
+⟶
+
+
+
+
+**46. The idea here is to use variables to yield more compact knowledge representations.**
+
+⟶
+
+
+
+
+**47. [Model ― A model w in first-order logic maps:, constant symbols to objects, predicate symbols to tuples of objects]**
+
+⟶
+
+
+
+
+**48. Horn clause ― By noting x1,...,xn variables and a1,...,ak,b atomic formulas, the first-order logic version of a Horn clause has the form:**
+
+⟶
+
+
+
+
+**49. Substitution ― A substitution θ maps variables to terms and Subst[θ,f] denotes the result of substitution θ on f.**
+
+⟶
+
+
+
+
+**50. Unification ― Unification takes two formulas f and g and returns the most general substitution θ that makes them equal:**
+
+⟶
+
+
+
+
+**51. such that**
+
+⟶
+
+
+
+
+**52. Note: Unify[f,g] returns Fail if no such θ exists.**
+
+⟶
+
+
+
+
+**53. Modus ponens ― By noting x1,...,xn variables, a1,...,ak and a′1,...,a′k atomic formulas, and by calling θ=Unify(a′1∧...∧a′k,a1∧...∧ak), the first-order logic version of modus ponens can be written:**
+
+⟶
+
+
+
+
+**54. Completeness ― Modus ponens is complete for first-order logic with only Horn clauses.**
+
+⟶
+
+
+
+
+**55. Resolution rule ― By noting f1,...,fn, g1,...,gm, p, q formulas and by calling θ=Unify(p,q), the first-order logic version of the resolution rule can be written:**
+
+⟶
+
+
+
+
+**56. [Semi-decidability ― First-order logic, even restricted to only Horn clauses, is semi-decidable., if KB⊨f, forward inference on complete inference rules will prove f in finite time, if KB⊭f, no algorithm can show this in finite time]**
+
+⟶
+
+
+
+
+**57. [Basics, Notations, Model, Interpretation function, Set of models]**
+
+⟶
+
+
+
+
+**58. [Knowledge base, Definition, Probabilistic interpretation, Satisfiability, Relationship with formulas, Forward inference, Rule properties]**
+
+⟶
+
+
+
+
+**59. [Propositional logic, Clauses, Modus ponens, Conjunctive normal form, Representation equivalence, Resolution]**
+
+⟶
+
+
+
+
+**60. [First-order logic, Substitution, Unification, Resolution rule, Modus ponens, Resolution, Semi-decidability]**
+
+⟶
+
+
+
+
+**61. View PDF version on GitHub**
+
+⟶
+
+
+
+
+**62. Original authors**
+
+⟶
+
+
+
+
+**63. Translated by X, Y and Z**
+
+⟶
+
+
+
+
+**64. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+
+**65. By X and Y**
+
+⟶
+
+
+
+
+**66. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶
diff --git a/template/cs-221-reflex-models.md b/template/cs-221-reflex-models.md
new file mode 100644
index 000000000..f64a380b0
--- /dev/null
+++ b/template/cs-221-reflex-models.md
@@ -0,0 +1,539 @@
+**Reflex-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-reflex-models)
+
+
+
+**1. Reflex-based models with Machine Learning**
+
+⟶
+
+
+
+
+**2. Linear predictors**
+
+⟶
+
+
+
+
+**3. In this section, we will go through reflex-based models that can improve with experience, by going through samples that have input-output pairs.**
+
+⟶
+
+
+
+
+**4. Feature vector ― The feature vector of an input x is noted ϕ(x) and is such that:**
+
+⟶
+
+
+
+
+**5. Score ― The score s(x,w) of an example (ϕ(x),y)∈Rd×R associated to a linear model of weights w∈Rd is given by the inner product:**
+
+⟶
+
+
+
+
+**6. Classification**
+
+⟶
+
+
+
+
+**7. Linear classifier ― Given a weight vector w∈Rd and a feature vector ϕ(x)∈Rd, the binary linear classifier fw is given by:**
+
+⟶
+
+
+
+
+**8. if**
+
+⟶
+
+
+
+
+**9. Margin ― The margin m(x,y,w)∈R of an example (ϕ(x),y)∈Rd×{−1,+1} associated to a linear model of weights w∈Rd quantifies the confidence of the prediction: larger values are better. It is given by:**
+
+⟶
+
+
+
+
+**10. Regression**
+
+⟶
+
+
+
+
+**11. Linear regression ― Given a weight vector w∈Rd and a feature vector ϕ(x)∈Rd, the output of a linear regression of weights w denoted as fw is given by:**
+
+⟶
+
+
+
+
+**12. Residual ― The residual res(x,y,w)∈R is defined as being the amount by which the prediction fw(x) overshoots the target y:**
+
+⟶
+
+
+
+
+**13. Loss minimization**
+
+⟶
+
+
+
+
+**14. Loss function ― A loss function Loss(x,y,w) quantifies how unhappy we are with the weights w of the model in the prediction task of output y from input x. It is a quantity we want to minimize during the training process.**
+
+⟶
+
+
+
+
+**15. Classification case ― The classification of a sample x of true label y∈{−1,+1} with a linear model of weights w can be done with the predictor fw(x)≜sign(s(x,w)). In this situation, a metric of interest quantifying the quality of the classification is given by the margin m(x,y,w), and can be used with the following loss functions:**
+
+⟶
+
+
+
+
+**16. [Name, Illustration, Zero-one loss, Hinge loss, Logistic loss]**
+
+⟶
+
+
+
+
+**17. Regression case ― The prediction of a sample x of true label y∈R with a linear model of weights w can be done with the predictor fw(x)≜s(x,w). In this situation, a metric of interest quantifying the quality of the regression is given by the residual res(x,y,w), and can be used with the following loss functions:**
+
+⟶
+
+
+
+
+**18. [Name, Squared loss, Absolute deviation loss, Illustration]**
+
+⟶
+
+
+
+
+**19. Loss minimization framework ― In order to train a model, we want to minimize the training loss, which is defined as follows:**
+
+⟶
+
+
+
+
+**20. Non-linear predictors**
+
+⟶
+
+
+
+
+**21. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶
+
+
+
+
+**22. Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶
+
+
+
+
+**23. Neural networks ― Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks. The vocabulary around neural network architectures is described in the figure below:**
+
+⟶
+
+
+
+
+**24. [Input layer, Hidden layer, Output layer]**
+
+⟶
+
+
+
+
+**25. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶
+
+
+
+
+**26. where we note w, b, x, z the weight, bias, input and non-activated output of the neuron respectively.**
+
+⟶
+
+
+
+
+**27. For a more detailed overview of the concepts above, check out the Supervised Learning cheatsheets!**
+
+⟶
+
+
+
+
+**28. Stochastic gradient descent**
+
+⟶
+
+
+
+
+**29. Gradient descent ― By noting η∈R the learning rate (also called step size), the update rule for gradient descent is expressed with the learning rate and the loss function Loss(x,y,w) as follows:**
+
+⟶
+
+
+
+
+**30. Stochastic updates ― Stochastic gradient descent (SGD) updates the parameters of the model one training example (ϕ(x),y)∈Dtrain at a time. This method leads to sometimes noisy, but fast updates.**
+
+⟶
+
+
+
+
+**31. Batch updates ― Batch gradient descent (BGD) updates the parameters of the model one batch of examples (e.g. the entire training set) at a time. This method computes stable update directions, at a greater computational cost.**
+
+⟶
+
+
+
+
+**32. Fine-tuning models**
+
+⟶
+
+
+
+
+**33. Hypothesis class ― A hypothesis class F is the set of possible predictors with a fixed ϕ(x) and varying w:**
+
+⟶
+
+
+
+
+**34. Logistic function ― The logistic function σ, also called the sigmoid function, is defined as:**
+
+⟶
+
+
+
+
+**35. Remark: we have σ′(z)=σ(z)(1−σ(z)).**
+
+⟶
+
+
+
+
+**36. Backpropagation ― The forward pass is done through fi, which is the value for the subexpression rooted at i, while the backward pass is done through gi=∂out/∂fi and represents how fi influences the output.**
+
+⟶
+
+
+
+
+**37. Approximation and estimation error ― The approximation error ϵapprox represents how far the entire hypothesis class F is from the target predictor g∗, while the estimation error ϵest quantifies how good the predictor ^f is with respect to the best predictor f∗ of the hypothesis class F.**
+
+⟶
+
+
+
+
+**38. Regularization ― The regularization procedure aims to keep the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶
+
+
+
+
+**39. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶
+
+
+
+
+**40. Hyperparameters ― Hyperparameters are the properties of the learning algorithm, and include features, regularization parameter λ, number of iterations T, step size η, etc.**
+
+⟶
+
+
+
+
+**41. Sets vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶
+
+
+
+
+**42. [Training set, Validation set, Testing set]**
+
+⟶
+
+
+
+
+**43. [Model is trained, Usually 80% of the dataset, Model is assessed, Usually 20% of the dataset, Also called hold-out or development set, Model gives predictions, Unseen data]**
+
+⟶
+
+
+
+
+**44. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶
+
+
+
+
+**45. [Dataset, Unseen data, train, validation, test]**
+
+⟶
+
+
+
+
+**46. For a more detailed overview of the concepts above, check out the Machine Learning tips and tricks cheatsheets!**
+
+⟶
+
+
+
+
+**47. Unsupervised Learning**
+
+⟶
+
+
+
+
+**48. The class of unsupervised learning methods aims at discovering the structure of the data, which may have rich latent structures.**
+
+⟶
+
+
+
+
+**49. k-means**
+
+⟶
+
+
+
+
+**50. Clustering ― Given a training set of input points Dtrain, the goal of a clustering algorithm is to assign each point ϕ(xi) to a cluster zi∈{1,...,k}.**
+
+⟶
+
+
+
+
+**51. Objective function ― The loss function for one of the main clustering algorithms, k-means, is given by:**
+
+⟶
+
+
+
+
+**52. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶
+
+
+
+
+**53. and**
+
+⟶
+
+
+
+
+**54. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶
+
+
+
+
+**55. Principal Component Analysis**
+
+⟶
+
+
+
+
+**56. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶
+
+
+
+
+**57. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶
+
+
+
+
+**58. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.**
+
+⟶
+
+
+
+
+**59. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
+
+⟶
+
+
+
+
+**60. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶
+
+
+
+
+**61. [where, and]**
+
+⟶
+
+
+
+
+**62. [Step 2: Compute Σ=(1/m)∑_{i=1}^{m} ϕ(xi)ϕ(xi)^T∈Rn×n, which is symmetric with real eigenvalues., Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues., Step 4: Project the data on spanR(u1,...,uk).]**
+
+⟶
+
+
+
+
+**63. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶
+
+
+
+
+**64. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶
+
+
+
+
+**65. For a more detailed overview of the concepts above, check out the Unsupervised Learning cheatsheets!**
+
+⟶
+
+
+
+
+**66. [Linear predictors, Feature vector, Linear classifier/regression, Margin]**
+
+⟶
+
+
+
+
+**67. [Loss minimization, Loss function, Framework]**
+
+⟶
+
+
+
+
+**68. [Non-linear predictors, k-nearest neighbors, Neural networks]**
+
+⟶
+
+
+
+
+**69. [Stochastic gradient descent, Gradient, Stochastic updates, Batch updates]**
+
+⟶
+
+
+
+
+**70. [Fine-tuning models, Hypothesis class, Backpropagation, Regularization, Sets vocabulary]**
+
+⟶
+
+
+
+
+**71. [Unsupervised Learning, k-means, Principal components analysis]**
+
+⟶
+
+
+
+
+**72. View PDF version on GitHub**
+
+⟶
+
+
+
+
+**73. Original authors**
+
+⟶
+
+
+
+
+**74. Translated by X, Y and Z**
+
+⟶
+
+
+
+
+**75. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+
+**76. By X and Y**
+
+⟶
+
+
+
+
+**77. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶
diff --git a/template/cs-221-states-models.md b/template/cs-221-states-models.md
new file mode 100644
index 000000000..e21270f89
--- /dev/null
+++ b/template/cs-221-states-models.md
@@ -0,0 +1,980 @@
+**States-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-states-models)
+
+
+
+**1. States-based models with search optimization and MDP**
+
+⟶
+
+
+
+
+**2. Search optimization**
+
+⟶
+
+
+
+
+**3. In this section, we assume that by accomplishing action a from state s, we deterministically arrive in state Succ(s,a). The goal here is to determine a sequence of actions (a1,a2,a3,a4,...) that starts from an initial state and leads to an end state. In order to solve this kind of problem, our objective will be to find the minimum cost path by using states-based models.**
+
+⟶
+
+
+
+
+**4. Tree search**
+
+⟶
+
+
+
+
+**5. This category of states-based algorithms explores all possible states and actions. It is quite memory efficient, and is suitable for huge state spaces but the runtime can become exponential in the worst cases.**
+
+⟶
+
+
+
+
+**6. [Self-loop, More than a parent, Cycle, More than a root, Valid tree]**
+
+⟶
+
+
+
+
+**7. [Search problem ― A search problem is defined with:, a starting state sstart, possible actions Actions(s) from state s, action cost Cost(s,a) from state s with action a, successor Succ(s,a) of state s after action a, whether an end state was reached IsEnd(s)]**
+
+⟶
+
+
+
+
+**8. The objective is to find a path that minimizes the cost.**
+
+⟶
+
+
+
+
+**9. Backtracking search ― Backtracking search is a naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.**
+
+⟶
+
+
+
+
+**10. Breadth-first search (BFS) ― Breadth-first search is a graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant c⩾0.**
+
+⟶
+
+
+
+
+**11. Depth-first search (DFS) ― Depth-first search is a search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.**
+
+⟶
+
+
+
+
+**12. Iterative deepening ― The iterative deepening trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant c⩾0.**
+
+⟶
+
+
+
+
+**13. Tree search algorithms summary ― By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:**
+
+⟶
+
+
+
+
+**14. [Algorithm, Action costs, Space, Time]**
+
+⟶
+
+
+
+
+**15. [Backtracking search, any, Breadth-first search, Depth-first search, DFS-Iterative deepening]**
+
+⟶
+
+
+
+
+**16. Graph search**
+
+⟶
+
+
+
+
+**17. This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.**
+
+⟶
+
+
+
+
+**18. Graph ― A graph is comprised of a set of vertices V (also called nodes) as well as a set of edges E (also called links).**
+
+⟶
+
+
+
+
+**19. Remark: a graph is said to be acyclic when there is no cycle.**
+
+⟶
+
+
+
+
+**20. State ― A state is a summary of all past actions sufficient to choose future actions optimally.**
+
+⟶
+
+
+
+
+**21. Dynamic programming ― Dynamic programming (DP) is a backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state s to an end state send. It can potentially have exponential savings compared to traditional graph search algorithms, and only works for acyclic graphs. For any given state s, the future cost is computed as follows:**
+
+⟶
+
+
+
+
+**22. [if, otherwise]**
+
+⟶
+
+
+
+
+**23. Remark: the figure above illustrates a bottom-to-top approach whereas the formula provides the intuition of a top-to-bottom problem resolution.**
+
+⟶
+
+
+
+
+**24. Types of states ― The table below presents the terminology when it comes to states in the context of uniform cost search:**
+
+⟶
+
+
+
+
+**25. [State, Explanation]**
+
+⟶
+
+
+
+
+**26. [Explored, Frontier, Unexplored]**
+
+⟶
+
+
+
+
+**27. [States for which the optimal path has already been found, States seen for which we are still figuring out how to get there with the cheapest cost, States not seen yet]**
+
+⟶
+
+
+
+
+**28. Uniform cost search ― Uniform cost search (UCS) is a search algorithm that aims at finding the shortest path from a state sstart to an end state send. It explores states s in increasing order of PastCost(s) and relies on the fact that all action costs are non-negative.**
+
+⟶
+
+
+
+
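A minimal sketch of uniform cost search as described in item 28, using a binary heap as the frontier. The function signature, the `succ_and_cost` callback and the small graph are assumptions for illustration; action costs must be non-negative, as the text requires.

```python
import heapq


def uniform_cost_search(start, is_end, succ_and_cost):
    """Return the minimum total cost from `start` to an end state, or None.

    `succ_and_cost(state)` yields (next_state, cost) pairs with cost >= 0.
    """
    frontier = [(0, start)]   # entries are (PastCost, state)
    explored = set()
    while frontier:
        past_cost, state = heapq.heappop(frontier)
        if state in explored:
            continue          # stale entry: state was already popped at a lower cost
        explored.add(state)
        if is_end(state):
            return past_cost
        for next_state, cost in succ_and_cost(state):
            if next_state not in explored:
                heapq.heappush(frontier, (past_cost + cost, next_state))
    return None


# Hypothetical toy graph: A -> B -> C -> D costs 1 + 1 + 2 = 4.
toy_graph = {
    "A": [("B", 1), ("C", 5)],
    "B": [("C", 1), ("D", 7)],
    "C": [("D", 2)],
    "D": [],
}
best_cost = uniform_cost_search("A", lambda s: s == "D", lambda s: toy_graph[s])
print(best_cost)
```

Pushing duplicate frontier entries and skipping stale ones on pop is one common way to avoid a decrease-key operation; other implementations update priorities in place.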
+**29. Remark 1: the UCS algorithm is logically equivalent to Dijkstra's algorithm.**
+
+⟶
+
+
+
+
+**30. Remark 2: the algorithm would not work for a problem with negative action costs, and adding a positive constant to make them non-negative would not solve the problem since this would end up being a different problem.**
+
+⟶
+
+
+
+
+**31. Correctness theorem ― When a state s is popped from the frontier F and moved to explored set E, its priority is equal to PastCost(s), which is the cost of the minimum cost path from sstart to s.**
+
+⟶
+
+
+
+
+**32. Graph search algorithms summary ― By noting N the number of total states, n of which are explored before the end state send, we have:**
+
+⟶
+
+
+
+
+**33. [Algorithm, Acyclicity, Costs, Time/space]**
+
+⟶
+
+
+
+
+**34. [Dynamic programming, Uniform cost search]**
+
+⟶
+
+
+
+
+**35. Remark: the complexity analysis assumes that the number of possible actions per state is constant.**
+
+⟶
+
+
+
+
+**36. Learning costs**
+
+⟶
+
+
+
+
+**37. Suppose we are not given the values of Cost(s,a) and want to estimate these quantities from a training set of minimum cost paths, each given as a sequence of actions (a1,a2,...,ak).**
+
+⟶
+
+
+
+
+**38. [Structured perceptron ― The structured perceptron is an algorithm aiming at iteratively learning the cost of each state-action pair. At each step, it:, decreases the estimated cost of each state-action of the true minimizing path y given by the training data, increases the estimated cost of each state-action of the current predicted path y' inferred from the learned weights.]**
+
+⟶
+
+
+
+
+**39. Remark: there are several versions of the algorithm, one of which simplifies the problem to only learning the cost of each action a, and the other parametrizes Cost(s,a) to a feature vector of learnable weights.**
+
+⟶
+
+
+
+
+**40. A* search**
+
+⟶
+
+
+
+
+**41. Heuristic function ― A heuristic is a function h over states s, where each h(s) aims at estimating FutureCost(s), the cost of the path from s to send.**
+
+⟶
+
+
+
+
+**42. Algorithm ― A∗ is a search algorithm that aims at finding the shortest path from a state s to an end state send. It explores states s in increasing order of PastCost(s)+h(s). It is equivalent to a uniform cost search with edge costs Cost′(s,a) given by:**
+
+⟶
+
+
+
+
+**43. Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be closer to the end state.**
+
+⟶
+
+
+
+
+**44. [Consistency ― A heuristic h is said to be consistent if it satisfies the two following properties:, For all states s and actions a, The end state verifies the following:]**
+
+⟶
+
+
+
+
+**45. Correctness ― If h is consistent, then A∗ returns the minimum cost path.**
+
+⟶
+
+
+
+
+**46. Admissibility ― A heuristic h is said to be admissible if we have:**
+
+⟶
+
+
+
+
+**47. Theorem ― Let h(s) be a given heuristic. We have:**
+
+⟶
+
+
+
+
+**48. [consistent, admissible]**
+
+⟶
+
+
+
+
+**49. Efficiency ― A* explores all states s satisfying the following equation:**
+
+⟶
+
+
+
+
+**50. Remark: larger values of h(s) are better, as this equation shows that they restrict the set of states s that will be explored.**
+
+⟶
+
+
+
+
+**51. Relaxation**
+
+⟶
+
+
+
+
+**52. It is a framework for producing consistent heuristics. The idea is to find closed-form reduced costs by removing constraints and use them as heuristics.**
+
+⟶
+
+
+
+
+**53. Relaxed search problem ― The relaxation of search problem P with costs Cost is noted Prel with costs Costrel, and satisfies the identity:**
+
+⟶
+
+
+
+
+**54. Relaxed heuristic ― Given a relaxed search problem Prel, we define the relaxed heuristic h(s)=FutureCostrel(s) as the minimum cost path from s to an end state in the graph of costs Costrel(s,a).**
+
+⟶
+
+
+
+
+**55. Consistency of relaxed heuristics ― Let Prel be a given relaxed problem. By theorem, we have:**
+
+⟶
+
+
+
+
+**56. consistent**
+
+⟶
+
+
+
+
+**57. [Tradeoff when choosing heuristic ― We have to balance two aspects in choosing a heuristic:, Computational efficiency: h(s)=FutureCostrel(s) must be easy to compute. It has to produce a closed form, easier search and independent subproblems., Good enough approximation: the heuristic h(s) should be close to FutureCost(s), so we must not remove too many constraints.]**
+
+⟶
+
+
+
+
+**58. Max heuristic ― Let h1(s), h2(s) be two heuristics. We have the following property:**
+
+⟶
+
+
+
+
+**59. Markov decision processes**
+
+⟶
+
+
+
+
+**60. In this section, we assume that performing action a from state s can lead to several states s′1,s′2,... in a probabilistic manner. In order to find our way between an initial state and an end state, our objective will be to find the maximum value policy by using Markov decision processes that help us cope with randomness and uncertainty.**
+
+⟶
+
+
+
+
+**61. Notations**
+
+⟶
+
+
+
+
+**62. [Definition ― The objective of a Markov decision process is to maximize rewards. It is defined with:, a starting state sstart, possible actions Actions(s) from state s, transition probabilities T(s,a,s′) from s to s′ with action a, rewards Reward(s,a,s′) from s to s′ with action a, whether an end state was reached IsEnd(s), a discount factor 0⩽γ⩽1]**
+
+⟶
+
+
+
+
+**63. Transition probabilities ― The transition probability T(s,a,s′) specifies the probability of going to state s′ after action a is taken in state s. Each s′↦T(s,a,s′) is a probability distribution, which means that:**
+
+⟶
+
+
+
+
+**64. states**
+
+⟶
+
+
+
+
+**65. Policy ― A policy π is a function that maps each state s to an action a, i.e.**
+
+⟶
+
+
+
+
+**66. Utility ― The utility of a path (s0,...,sk) is the discounted sum of the rewards on that path. In other words,**
+
+⟶
+
+
+
+
+**67. The figure above is an illustration of the case k=4.**
+
+⟶
+
+
+
+
+**68. Q-value ― The Q-value of a policy π at state s with action a, also noted Qπ(s,a), is the expected utility from state s after taking action a and then following policy π. It is defined as follows:**
+
+⟶
+
+
+
+
+**69. Value of a policy ― The value of a policy π from state s, also noted Vπ(s), is the expected utility by following policy π from state s over random paths. It is defined as follows:**
+
+⟶
+
+
+
+
+**70. Remark: Vπ(s) is equal to 0 if s is an end state.**
+
+⟶
+
+
+
+
+**71. Applications**
+
+⟶
+
+
+
+
+**72. [Policy evaluation ― Given a policy π, policy evaluation is an iterative algorithm that aims at estimating Vπ. It is done as follows:, Initialization: for all states s, we have:, Iteration: for t from 1 to TPE, we have, with]**
+
+⟶
+
+
+
+
+**73. Remark: by noting S the number of states, A the number of actions per state, S′ the number of successors and TPE the number of iterations, the time complexity is O(TPE⋅S⋅S′).**
+
+⟶
+
+
+
+
+**74. Optimal Q-value ― The optimal Q-value Qopt(s,a) of state s with action a is defined to be the maximum Q-value attained by any policy starting from state s and taking action a. It is computed as follows:**
+
+⟶
+
+
+
+
+**75. Optimal value ― The optimal value Vopt(s) of state s is defined as being the maximum value attained by any policy. It is computed as follows:**
+
+⟶
+
+
+
+
+**76. actions**
+
+⟶
+
+
+
+
+**77. Optimal policy ― The optimal policy πopt is defined as being the policy that leads to the optimal values. It is defined by:**
+
+⟶
+
+
+
+
+**78. [Value iteration ― Value iteration is an algorithm that finds the optimal value Vopt as well as the optimal policy πopt. It is done as follows:, Initialization: for all states s, we have:, Iteration: for t from 1 to TVI, we have:, with]**
+
+⟶
+
+
+
+
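The value iteration procedure of item 78 can be sketched as follows. The template omits the update formula, so this example assumes the standard Bellman update V(s) ← max over a of Σ T(s,a,s′)[Reward(s,a,s′)+γV(s′)]. The function signature and the two-state toy MDP are hypothetical.

```python
def value_iteration(states, actions, transitions, is_end, gamma, num_iters):
    """Return (V, policy) after `num_iters` sweeps.

    `transitions(s, a)` returns a list of (next_state, probability, reward).
    """
    V = {s: 0.0 for s in states}  # initialization: V(s) = 0 for all states

    def q_value(s, a):
        return sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))

    for _ in range(num_iters):
        # The new table is built entirely from the previous one, then swapped in.
        V = {s: 0.0 if is_end(s) else max(q_value(s, a) for a in actions(s))
             for s in states}

    policy = {s: max(actions(s), key=lambda a: q_value(s, a))
              for s in states if not is_end(s)}
    return V, policy


# Hypothetical toy MDP: "quit" ends with reward 10; "stay" pays 4 and
# continues with probability 2/3, otherwise the game ends.
def toy_transitions(s, a):
    if a == "quit":
        return [("end", 1.0, 10.0)]
    return [("in", 2 / 3, 4.0), ("end", 1 / 3, 4.0)]


V, policy = value_iteration(
    states=["in", "end"],
    actions=lambda s: ["stay", "quit"],
    transitions=toy_transitions,
    is_end=lambda s: s == "end",
    gamma=1.0,
    num_iters=100,
)
print(V["in"], policy["in"])
```

In this toy MDP the fixed point satisfies V(in) = 4 + (2/3)·V(in), i.e. V(in) = 12, so staying beats quitting.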
+**79. Remark: if we have either γ<1 or the MDP graph being acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.**
+
+⟶
+
+
+
+
+**80. Unknown transitions and rewards**
+
+⟶
+
+
+
+
+**81. Now, let's assume that the transition probabilities and the rewards are unknown.**
+
+⟶
+
+
+
+
+**82. Model-based Monte Carlo ― The model-based Monte Carlo method aims at estimating T(s,a,s′) and Reward(s,a,s′) using Monte Carlo simulation with:**
+
+⟶
+
+
+
+
+**83. [# times (s,a,s′) occurs, and]**
+
+⟶
+
+
+
+
+**84. These estimations will be then used to deduce Q-values, including Qπ and Qopt.**
+
+⟶
+
+
+
+
+**85. Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not depend on the exact policy.**
+
+⟶
+
+
+
+
+**86. Model-free Monte Carlo ― The model-free Monte Carlo method aims at directly estimating Qπ, as follows:**
+
+⟶
+
+
+
+
+**87. Qπ(s,a)=average of ut where st−1=s,at=a**
+
+⟶
+
+
+
+
+**88. where ut denotes the utility starting at step t of a given episode.**
+
+⟶
+
+
+
+
+**89. Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent on the policy π used to generate the data.**
+
+⟶
+
+
+
+
+**90. Equivalent formulation ― By introducing the constant η=1/(1+(#updates to (s,a))) and for each (s,a,u) of the training set, the update rule of model-free Monte Carlo has a convex combination formulation:**
+
+⟶
+
+
+
+
+**91. as well as a stochastic gradient formulation:**
+
+⟶
+
+
+
+
+**92. SARSA ― State-action-reward-state-action (SARSA) is a bootstrapping method estimating Qπ by using both raw data and estimates as part of the update rule. For each (s,a,r,s′,a′), we have:**
+
+⟶
+
+
+
+
+**93. Remark: the SARSA estimate is updated on the fly as opposed to the model-free Monte Carlo one where the estimate can only be updated at the end of the episode.**
+
+⟶
+
+
+
+
+**94. Q-learning ― Q-learning is an off-policy algorithm that produces an estimate for Qopt. On each (s,a,r,s′,a′), we have:**
+
+⟶
+
+
+
+
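A sketch of a single Q-learning update for item 94. The template does not show the update rule, so this assumes the usual form Qopt(s,a) ← (1−η)Qopt(s,a)+η[r+γ·max over a′ of Qopt(s′,a′)]. The function name, the `actions` callback and the numbers are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, eta, gamma):
    """Apply one Q-learning update in place on the table `Q`.

    The target uses the best estimated action in `s_next`, regardless of the
    action the exploration policy actually takes next (off-policy).
    """
    v_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_next)


# Hypothetical example: unseen state-action pairs default to 0.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
q_learning_update(Q, "s0", "right", r=1.0, s_next="s1",
                  actions=lambda s: ["left", "right"], eta=0.5, gamma=0.5)
print(Q[("s0", "right")])
```

The `default=0.0` argument makes the target reduce to the immediate reward when `s_next` has no available actions, which is one way to treat end states.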
+**95. Epsilon-greedy ― The epsilon-greedy policy is an algorithm that balances exploration with probability ϵ and exploitation with probability 1−ϵ. For a given state s, the policy πact is computed as follows:**
+
+⟶
+
+
+
+
+**96. [with probability, random from Actions(s)]**
+
+⟶
+
+
+
+
+**97. Game playing**
+
+⟶
+
+
+
+
+**98. In games (e.g. chess, backgammon, Go), other agents are present and need to be taken into account when constructing our policy.**
+
+⟶
+
+
+
+
+**99. Game tree ― A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.**
+
+⟶
+
+
+
+
+**100. [Two-player zero-sum game ― It is a game where each state is fully observed and such that players take turns. It is defined with:, a starting state sstart, possible actions Actions(s) from state s, successors Succ(s,a) from states s with actions a, whether an end state was reached IsEnd(s), the agent's utility Utility(s) at end state s, the player Player(s) who controls state s]**
+
+⟶
+
+
+
+
+**101. Remark: we will assume that the opponent's utility is the opposite of the agent's utility.**
+
+⟶
+
+
+
+
+**102. [Types of policies ― There are two types of policies:, Deterministic policies, noted πp(s), which are actions that player p takes in state s., Stochastic policies, noted πp(s,a)∈[0,1], which are probabilities that player p takes action a in state s.]**
+
+⟶
+
+
+
+
+**103. Expectimax ― For a given state s, the expectimax value Vexptmax(s) is the maximum expected utility of any agent policy when playing with respect to a fixed and known opponent policy πopp. It is computed as follows:**
+
+⟶
+
+
+
+
+**104. Remark: expectimax is the analog of value iteration for MDPs.**
+
+⟶
+
+
+
+
+**105. Minimax ― The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent's utility. It is done as follows:**
+
+⟶
+
+
+
+
+**106. Remark: we can extract πmax and πmin from the minimax value Vminimax.**
+
+⟶
+
+
+
+
+**107. Minimax properties ― By noting V the value function, there are 3 properties around minimax to have in mind:**
+
+⟶
+
+
+
+
+**108. Property 1: if the agent were to change its policy to any πagent, then the agent would be no better off.**
+
+⟶
+
+
+
+
+**109. Property 2: if the opponent changes its policy from πmin to πopp, then it will be no better off.**
+
+⟶
+
+
+
+
+**110. Property 3: if the opponent is known to be not playing the adversarial policy, then the minimax policy might not be optimal for the agent.**
+
+⟶
+
+
+
+
+**111. In the end, we have the following relationship:**
+
+⟶
+
+
+
+
+**112. Speeding up minimax**
+
+⟶
+
+
+
+
+**113. Evaluation function ― An evaluation function is a domain-specific and approximate estimate of the value Vminimax(s). It is noted Eval(s).**
+
+⟶
+
+
+
+
+**114. Remark: FutureCost(s) is the analog for search problems.**
+
+⟶
+
+
+
+
+**115. Alpha-beta pruning ― Alpha-beta pruning is a domain-general exact method optimizing the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in α for the maximizing player and in β for the minimizing player). At a given step, the condition β<α means that the optimal path is not going to be in the current branch as the earlier player had a better option at their disposal.**
+
+⟶
+
+
+
+
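Alpha-beta pruning as described in item 115 can be sketched on a game tree stored as nested lists, where a number is a leaf utility. The tree encoding and the `visited` argument (used only to show which leaves get evaluated) are assumptions for this example; the pruning test uses the strict condition β<α stated in the text.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Return the minimax value of `node`, skipping branches that cannot
    change the result. Leaves that are evaluated are appended to `visited`."""
    if not isinstance(node, list):      # leaf: its utility is the value
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if beta < alpha:            # the minimizer above already has a better option
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if beta < alpha:                # the maximizer above already has a better option
            break
    return value


# Hypothetical depth-2 tree: the agent (max) moves first, then the opponent (min).
tree = [[3, 5], [2, 9], [0, 1]]
visited = []
value = alphabeta(tree, maximizing=True, visited=visited)
print(value, visited)
```

Here the leaves 9 and 1 are never evaluated: once the opponent can force 2 (or 0) in a branch, that branch cannot beat the 3 already guaranteed by the first one.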
+**116. TD learning ― Temporal difference (TD) learning is used when we don't know the transitions/rewards. The value is based on an exploration policy. To be able to use it, we need to know the rules of the game Succ(s,a). For each (s,a,r,s′), the update is done as follows:**
+
+⟶
+
+
+
+
+**117. Simultaneous games**
+
+⟶
+
+
+
+
+**118. In contrast to turn-based games, there is no ordering on the players' moves.**
+
+⟶
+
+
+
+
+**119. Single-move simultaneous game ― Let there be two players A and B, with given possible actions. We note V(a,b) to be A's utility if A chooses action a, B chooses action b. V is called the payoff matrix.**
+
+⟶
+
+
+
+
+**120. [Strategies ― There are two main types of strategies:, A pure strategy is a single action:, A mixed strategy is a probability distribution over actions:]**
+
+⟶
+
+
+
+
+**121. Game evaluation ― The value of the game V(πA,πB) when player A follows πA and player B follows πB is such that:**
+
+⟶
+
+
+
+
+**122. Minimax theorem ― By noting πA,πB ranging over mixed strategies, for every simultaneous two-player zero-sum game with a finite number of actions, we have:**
+
+⟶
+
+
+
+
+**123. Non-zero-sum games**
+
+⟶
+
+
+
+
+**124. Payoff matrix ― We define Vp(πA,πB) to be the utility for player p.**
+
+⟶
+
+
+
+
+**125. Nash equilibrium ― A Nash equilibrium is (π∗A,π∗B) such that no player has an incentive to change its strategy. We have:**
+
+⟶
+
+
+
+
+**126. and**
+
+⟶
+
+
+
+
+**127. Remark: in any game with a finite number of players and a finite number of actions, there exists at least one Nash equilibrium.**
+
+⟶
+
+
+
+
+**128. [Tree search, Backtracking search, Breadth-first search, Depth-first search, Iterative deepening]**
+
+⟶
+
+
+
+
+**129. [Graph search, Dynamic programming, Uniform cost search]**
+
+⟶
+
+
+
+
+**130. [Learning costs, Structured perceptron]**
+
+⟶
+
+
+
+
+**131. [A star search, Heuristic function, Algorithm, Consistency, correctness, Admissibility, efficiency]**
+
+⟶
+
+
+
+
+**132. [Relaxation, Relaxed search problem, Relaxed heuristic, Max heuristic]**
+
+⟶
+
+
+
+
+**133. [Markov decision processes, Overview, Policy evaluation, Value iteration, Transitions, rewards]**
+
+⟶
+
+
+
+
+**134. [Game playing, Expectimax, Minimax, Speeding up minimax, Simultaneous games, Non-zero-sum games]**
+
+⟶
+
+
+
+
+**135. View PDF version on GitHub**
+
+⟶
+
+
+
+
+**136. Original authors**
+
+⟶
+
+
+
+
+**137. Translated by X, Y and Z**
+
+⟶
+
+
+
+
+**138. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+
+**139. By X and Y**
+
+⟶
+
+
+
+
+**140. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶
diff --git a/template/cs-221-variables-models.md b/template/cs-221-variables-models.md
new file mode 100644
index 000000000..f55ef0270
--- /dev/null
+++ b/template/cs-221-variables-models.md
@@ -0,0 +1,617 @@
+**Variables-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-variables-models)
+
+
+
+**1. Variables-based models with CSP and Bayesian networks**
+
+⟶
+
+
+
+
+**2. Constraint satisfaction problems**
+
+⟶
+
+
+
+
+**3. In this section, our objective is to find maximum weight assignments of variable-based models. One advantage compared to states-based models is that it is more convenient to encode problem-specific constraints with these algorithms.**
+
+⟶
+
+
+
+
+**4. Factor graphs**
+
+⟶
+
+
+
+
+**5. Definition ― A factor graph, also referred to as a Markov random field, is a set of variables X=(X1,...,Xn) where Xi∈Domaini and m factors f1,...,fm with each fj(X)⩾0.**
+
+⟶
+
+
+
+
+**6. Domain**
+
+⟶
+
+
+
+
+**7. Scope and arity ― The scope of a factor fj is the set of variables it depends on. The size of this set is called the arity.**
+
+⟶
+
+
+
+
+**8. Remark: factors of arity 1 and 2 are called unary and binary respectively.**
+
+⟶
+
+
+
+
+**9. Assignment weight ― Each assignment x=(x1,...,xn) yields a weight Weight(x) defined as being the product of all factors fj applied to that assignment. Its expression is given by:**
+
+⟶
+
+
+
+
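The assignment weight of item 9 is the product of all factors evaluated on the assignment. A small sketch, assuming each factor is stored as a (scope, function) pair; the variable names, colors and factors below are made up for illustration.

```python
from math import prod


def weight(assignment, factors):
    """Weight(x): product over all factors f_j of f_j applied to the values
    of the variables in its scope."""
    return prod(f(*(assignment[var] for var in scope)) for scope, f in factors)


# Hypothetical factor graph on three variables with colors "R" and "B".
factors = [
    (("X1", "X2"), lambda a, b: int(a != b)),       # binary factor: X1 and X2 differ
    (("X2", "X3"), lambda b, c: int(b != c)),       # binary factor: X2 and X3 differ
    (("X3",), lambda c: 2 if c == "R" else 1),      # unary factor: X3 prefers "R"
]

w_good = weight({"X1": "R", "X2": "B", "X3": "R"}, factors)
w_bad = weight({"X1": "R", "X2": "R", "X3": "B"}, factors)
print(w_good, w_bad)
```

A single factor equal to 0 drives the whole weight to 0, which is why such assignments can be discarded as soon as they are detected.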
+**10. Constraint satisfaction problem ― A constraint satisfaction problem (CSP) is a factor graph where all factors are binary (i.e. take values in {0,1}); we call them constraints:**
+
+⟶
+
+
+
+
+**11. Here, the constraint j with assignment x is said to be satisfied if and only if fj(x)=1.**
+
+⟶
+
+
+
+
+**12. Consistent assignment ― An assignment x of a CSP is said to be consistent if and only if Weight(x)=1, i.e. all constraints are satisfied.**
+
+⟶
+
+
+
+
+**13. Dynamic ordering**
+
+⟶
+
+
+
+
+**14. Dependent factors ― The set of dependent factors of variable Xi with partial assignment x is called D(x,Xi), and denotes the set of factors that link Xi to already assigned variables.**
+
+⟶
+
+
+
+
+**15. Backtracking search ― Backtracking search is an algorithm used to find maximum weight assignments of a factor graph. At each step, it chooses an unassigned variable and explores its values by recursion. Dynamic ordering (i.e. choice of variables and values) and lookahead (i.e. early elimination of inconsistent options) can be used to explore the graph more efficiently, although the worst-case runtime stays exponential: O(|Domain|^n).**
+
+⟶
+
+
+
+
+**16. [Forward checking ― It is a one-step lookahead heuristic that preemptively removes inconsistent values from the domains of neighboring variables. It has the following characteristics:, After assigning a variable Xi, it eliminates inconsistent values from the domains of all its neighbors., If any of these domains becomes empty, we stop the local backtracking search., If we un-assign a variable Xi, we have to restore the domain of its neighbors.]**
+
+⟶
+
+
+
+
+**17. Most constrained variable ― It is a variable-level ordering heuristic that selects the next unassigned variable that has the fewest consistent values. This has the effect of making inconsistent assignments fail earlier in the search, which enables more efficient pruning.**
+
+⟶
+
+
+
+
+**18. Least constrained value ― It is a value-level ordering heuristic that assigns the next value that yields the highest number of consistent values of neighboring variables. Intuitively, this procedure chooses first the values that are most likely to work.**
+
+⟶
+
+
+
+
+**19. Remark: in practice, this heuristic is useful when all factors are constraints.**
+
+⟶
+
+
+
+
+**20. The example above is an illustration of the 3-color problem with backtracking search coupled with most constrained variable exploration and least constrained value heuristic, as well as forward checking at each step.**
+
+⟶
+
+
+
+
+**21. [Arc consistency ― We say that arc consistency of variable Xl with respect to Xk is enforced when for each xl∈Domainl:, unary factors of Xl are non-zero, there exists at least one xk∈Domaink such that any factor between Xl and Xk is non-zero.]**
+
+⟶
+
+
+
+
+**22. AC-3 ― The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking to all relevant variables. After a given assignment, it performs forward checking and then successively enforces arc consistency with respect to the neighbors of variables whose domains change during the process.**
+
+⟶
+
+
+
+
+**23. Remark: AC-3 can be implemented both iteratively and recursively.**
+
+⟶
+
+
+
+
+**24. Approximate methods**
+
+⟶
+
+
+
+
+**25. Beam search ― Beam search is an approximate algorithm that extends partial assignments of n variables of branching factor b=|Domain| by exploring the K top paths at each step. The beam size K∈{1,...,b^n} controls the tradeoff between efficiency and accuracy. This algorithm has a time complexity of O(n⋅Kb⋅log(Kb)).**
+
+⟶
+
+
+
+
+**26. The example below illustrates a possible beam search of parameters K=2, b=3 and n=5.**
+
+⟶
+
+
+
+
+**27. Remark: K=1 corresponds to greedy search whereas K→+∞ is equivalent to BFS tree search.**
+
+⟶
+
+
+
+
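Beam search as described in item 25 can be sketched as follows: extend each of the K kept partial assignments by every domain value, sort the K·b candidates by weight and keep the top K. The function signature and the toy chain-structured weight are assumptions for illustration.

```python
from math import prod


def beam_search(num_variables, domain, partial_weight, K):
    """Return the K best complete assignments found, best first.

    Assignments are tuples; `partial_weight(p)` scores a partial assignment.
    """
    beam = [()]  # start from the empty partial assignment
    for _ in range(num_variables):
        candidates = [p + (v,) for p in beam for v in domain]   # at most K*b candidates
        candidates.sort(key=partial_weight, reverse=True)        # O(Kb log(Kb)) per step
        beam = candidates[:K]
    return beam


# Hypothetical chain factor graph: each pair of consecutive variables
# contributes a factor of 2 when the second value is the first plus one.
def toy_weight(p):
    return prod(2 if b == a + 1 else 1 for a, b in zip(p, p[1:]))


beam = beam_search(num_variables=3, domain=[0, 1, 2], partial_weight=toy_weight, K=2)
print(beam[0], toy_weight(beam[0]))
```

With K=1 the same function behaves as greedy search, which matches the remark in item 27.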
+**28. Iterated conditional modes ― Iterated conditional modes (ICM) is an iterative approximate algorithm that modifies the assignment of a factor graph one variable at a time until convergence. At step i, we assign to Xi the value v that maximizes the product of all factors connected to that variable.**
+
+⟶
+
+
+
+
+**29. Remark: ICM may get stuck in local optima.**
+
+⟶
+
+
+
+
+**30. [Gibbs sampling ― Gibbs sampling is an iterative approximate method that modifies the assignment of a factor graph one variable at a time until convergence. At step i:, we assign to each element u∈Domaini a weight w(u) that is the product of all factors connected to that variable, we sample v from the probability distribution induced by w and assign it to Xi.]**
+
+⟶
+
+
+
+
+**31. Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the advantage of being able to escape local optima in most cases.**
+
+⟶
+
+
+
+
+**32. Factor graph transformations**
+
+⟶
+
+
+
+
+**33. Independence ― Let A,B be a partitioning of the variables X. We say that A and B are independent if there are no edges between A and B and we write:**
+
+⟶
+
+
+
+
+**34. Remark: independence is the key property that allows us to solve subproblems in parallel.**
+
+⟶
+
+
+
+
+**35. Conditional independence ― We say that A and B are conditionally independent given C if conditioning on C produces a graph in which A and B are independent. In this case, it is written:**
+
+⟶
+
+
+
+
+**36. [Conditioning ― Conditioning is a transformation aiming at making variables independent that breaks up a factor graph into smaller pieces that can be solved in parallel and can use backtracking. In order to condition on a variable Xi=v, we do as follows:, Consider all factors f1,...,fk that depend on Xi, Remove Xi and f1,...,fk, Add gj(x) for j∈{1,...,k} defined as:]**
+
+⟶
+
+
+
+
+**37. Markov blanket ― Let A⊆X be a subset of variables. We define MarkovBlanket(A) to be the neighbors of A that are not in A.**
+
+⟶
+
+
+
+
+**38. Proposition ― Let C=MarkovBlanket(A) and B=X∖(A∪C). Then we have:**
+
+⟶
+
+
+
+
+**39. [Elimination ― Elimination is a factor graph transformation that removes Xi from the graph and solves a small subproblem conditioned on its Markov blanket as follows:, Consider all factors fi,1,...,fi,k that depend on Xi, Remove Xi and fi,1,...,fi,k, Add fnew,i(x) defined as:]**
+
+⟶
+
+
+
+
+**40. Treewidth ― The treewidth of a factor graph is the maximum arity of any factor created by variable elimination with the best variable ordering. In other words,**
+
+⟶
+
+
+
+
+**41. The example below illustrates the case of a factor graph of treewidth 3.**
+
+⟶
+
+
+
+
+**42. Remark: finding the best variable ordering is an NP-hard problem.**
+
+⟶
+
+
+
+
+**43. Bayesian networks**
+
+⟶
+
+
+
+
+**44. In this section, our goal will be to compute conditional probabilities. What is the probability of a query given evidence?**
+
+⟶
+
+
+
+
+**45. Introduction**
+
+⟶
+
+
+
+
+**46. Explaining away ― Suppose causes C1 and C2 influence an effect E. Conditioning on the effect E and on one of the causes (say C1) changes the probability of the other cause (say C2). In this case, we say that C1 has explained away C2.**
+
+⟶
+
+
+
+
+**47. Directed acyclic graph ― A directed acyclic graph (DAG) is a finite directed graph with no directed cycles.**
+
+⟶
+
+
+
+
+**48. Bayesian network ― A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over random variables X=(X1,...,Xn) as a product of local conditional distributions, one for each node:**
+
+⟶
+
+
+
+
+**49. Remark: Bayesian networks are factor graphs imbued with the language of probability.**
+
+⟶
+
+
+
+
+**50. Locally normalized ― For each xParents(i), all factors are local conditional distributions. Hence they have to satisfy:**
+
+⟶
+
+
+
+
+**51. As a result, sub-Bayesian networks and conditional distributions are consistent.**
+
+⟶
+
+
+
+
+**52. Remark: local conditional distributions are the true conditional distributions.**
+
+⟶
+
+
+
+
+**53. Marginalization ― The marginalization of a leaf node yields a Bayesian network without that node.**
+
+⟶
+
+
+
+
+**54. Probabilistic programs**
+
+⟶
+
+
+
+
+**55. Concept ― A probabilistic program randomizes variable assignment. That way, we can write down complex Bayesian networks that generate assignments without us having to explicitly specify associated probabilities.**
+
+⟶
+
+
+
+
+**56. Remark: examples of probabilistic programs include Hidden Markov model (HMM), factorial HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms and stochastic block models.**
+
+⟶
+
+
+
+
+**57. Summary ― The table below summarizes the common probabilistic programs as well as their applications:**
+
+⟶
+
+
+
+
+**58. [Program, Algorithm, Illustration, Example]**
+
+⟶
+
+
+
+
+**59. [Markov Model, Hidden Markov Model (HMM), Factorial HMM, Naive Bayes, Latent Dirichlet Allocation (LDA)]**
+
+⟶
+
+
+
+
+**60. [Generate, distribution]**
+
+⟶
+
+
+
+
+**61. [Language modeling, Object tracking, Multiple object tracking, Document classification, Topic modeling]**
+
+⟶
+
+
+
+
+**62. Inference**
+
+⟶
+
+
+
+
+**63. [General probabilistic inference strategy ― The strategy to compute the probability P(Q|E=e) of query Q given evidence E=e is as follows:, Step 1: Remove variables that are not ancestors of the query Q or the evidence E by marginalization, Step 2: Convert Bayesian network to factor graph, Step 3: Condition on the evidence E=e, Step 4: Remove nodes disconnected from the query Q by marginalization, Step 5: Run a probabilistic inference algorithm (manual, variable elimination, Gibbs sampling, particle filtering)]**
+
+⟶
+
+
+
+
+**64. Forward-backward algorithm ― This algorithm computes the exact value of P(H=hk|E=e) (smoothing query) for any k∈{1,...,L} in the case of an HMM of size L. To do so, we proceed in 3 steps:**
+
+⟶
+
+
+
+
+**65. Step 1: for ..., compute ...**
+
+⟶
+
+
+
+
+**66. with the convention F0=BL+1=1. From this procedure and these notations, we get that**
+
+⟶
+
+
+
+
+**67. Remark: this algorithm interprets each assignment to be a path where each edge hi−1→hi is of weight p(hi|hi−1)p(ei|hi).**
+
+⟶
+
+
+
+
+**68. [Gibbs sampling ― This algorithm is an iterative approximate method that uses a small set of assignments (particles) to represent a large probability distribution. From a random assignment x, Gibbs sampling performs the following steps for i∈{1,...,n} until convergence:, For all u∈Domaini, compute the weight w(u) of assignment x where Xi=u, Sample v from the probability distribution induced by w: v∼P(Xi=v|X−i=x−i), Set Xi=v]**
+
+⟶
+
+
+
+
+**69. Remark: X−i denotes X∖{Xi} and x−i represents the corresponding assignment.**
+
+⟶
+
+
+
+
+**70. [Particle filtering ― This algorithm approximates the posterior density of state variables given the evidence of observation variables by keeping track of K particles at a time. Starting from a set of particles C of size K, we run the following 3 steps iteratively:, Step 1: proposal - For each old particle xt−1∈C, sample x from the transition probability distribution p(x|xt−1) and add x to a set C′., Step 2: weighting - Weigh each x of the set C′ by w(x)=p(et|x), where et is the evidence observed at time t., Step 3: resampling - Sample K elements from the set C′ using the probability distribution induced by w and store them in C: these are the current particles xt.]**
+
+⟶
+
+
+
+
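One iteration of the three particle filtering steps in item 70 (proposal, weighting, resampling) can be sketched as below. The function names and the toy transition and emission models are hypothetical; the emission probabilities are chosen to be strictly positive so that resampling is always well defined.

```python
import random


def particle_filter_step(particles, evidence, sample_transition, emission_prob, rng):
    """Advance K particles by one time step and return K new particles."""
    # Step 1 (proposal): sample a successor for each old particle.
    proposed = [sample_transition(x, rng) for x in particles]
    # Step 2 (weighting): w(x) = p(e_t | x).
    weights = [emission_prob(evidence, x) for x in proposed]
    # Step 3 (resampling): draw K particles with probability proportional to w.
    return rng.choices(proposed, weights=weights, k=len(particles))


# Hypothetical model: the hidden position stays or moves one step to the
# right with equal probability; the sensor reports the true position with
# probability 0.9.
def toy_transition(x, rng):
    return x + rng.choice([0, 1])


def toy_emission(e, x):
    return 0.9 if e == x else 0.1


rng = random.Random(0)
particles = [0] * 100
for e in [1, 1, 2]:
    particles = particle_filter_step(particles, e, toy_transition, toy_emission, rng)
print(sorted(set(particles)))
```

The fraction of particles at each position gives an approximation of the posterior over the hidden state at the latest time step.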
+**71. Remark: a more expensive version of this algorithm also keeps track of past particles in the proposal step.**
+
+⟶
+
+
+
+
+**72. Maximum likelihood ― If we don't know the local conditional distributions, we can learn them using maximum likelihood.**
+
+⟶
+
+
+
+
+**73. Laplace smoothing ― For each distribution d and partial assignment (xParents(i),xi), add λ to countd(xParents(i),xi), then normalize to get probability estimates.**
+
+⟶
+
+
+
+
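Laplace smoothing from item 73 amounts to adding λ to every count before normalizing. A minimal sketch for a single local distribution with a fixed parent assignment; the function name and the coin-flip counts are assumptions for illustration.

```python
def laplace_smoothed_estimate(counts, domain, lam):
    """Return smoothed probability estimates for every value in `domain`.

    `counts` maps observed values to their counts; unseen values count as 0.
    """
    total = sum(counts.get(v, 0) for v in domain) + lam * len(domain)
    return {v: (counts.get(v, 0) + lam) / total for v in domain}


# Hypothetical data: three heads observed, tails never seen.
probs = laplace_smoothed_estimate({"H": 3}, domain=["H", "T"], lam=1)
print(probs)
```

With λ=1 the unseen value "T" receives probability 1/5 instead of 0, and the estimates still sum to 1.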
+**74. Algorithm ― The Expectation-Maximization (EM) algorithm is an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶
+
+
+
+
+**75. [E-step: Evaluate the posterior probability q(h) that each data point e came from a particular cluster h as follows:, M-step: Use the posterior probabilities q(h) as cluster specific weights on data points e to determine θ through maximum likelihood.]**
+
+⟶
+
+
+
+
+**76. [Factor graphs, Arity, Assignment weight, Constraint satisfaction problem, Consistent assignment]**
+
+⟶
+
+
+
+
+**77. [Dynamic ordering, Dependent factors, Backtracking search, Forward checking, Most constrained variable, Least constrained value]**
+
+⟶
+
+
+
+
+**78. [Approximate methods, Beam search, Iterated conditional modes, Gibbs sampling]**
+
+⟶
+
+
+
+
+**79. [Factor graph transformations, Conditioning, Elimination]**
+
+⟶
+
+
+
+
+**80. [Bayesian networks, Definition, Locally normalized, Marginalization]**
+
+⟶
+
+
+
+
+**81. [Probabilistic program, Concept, Summary]**
+
+⟶
+
+
+
+
+**82. [Inference, Forward-backward algorithm, Gibbs sampling, Laplace smoothing]**
+
+⟶
+
+
+
+
+**83. View PDF version on GitHub**
+
+⟶
+
+
+
+
+**84. Original authors**
+
+⟶
+
+
+
+
+**85. Translated by X, Y and Z**
+
+⟶
+
+
+
+
+**86. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+
+**87. By X and Y**
+
+⟶
+
+
+
+
+**88. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶
diff --git a/ar/cheatsheet-deep-learning.md b/template/cs-229-deep-learning.md
similarity index 98%
rename from ar/cheatsheet-deep-learning.md
rename to template/cs-229-deep-learning.md
index a5aa3756c..a7770a048 100644
--- a/ar/cheatsheet-deep-learning.md
+++ b/template/cs-229-deep-learning.md
@@ -1,3 +1,7 @@
+**Deep learning translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning)
+
+
+
**1. Deep Learning cheatsheet**
⟶
diff --git a/hi/refresher-linear-algebra.md b/template/cs-229-linear-algebra.md
similarity index 97%
rename from hi/refresher-linear-algebra.md
rename to template/cs-229-linear-algebra.md
index a6b440d1e..dced85397 100644
--- a/hi/refresher-linear-algebra.md
+++ b/template/cs-229-linear-algebra.md
@@ -1,3 +1,7 @@
+**Linear Algebra and Calculus translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-algebra-calculus)
+
+
+
**1. Linear Algebra and Calculus refresher**
⟶
diff --git a/hi/cheatsheet-machine-learning-tips-and-tricks.md b/template/cs-229-machine-learning-tips-and-tricks.md
similarity index 97%
rename from hi/cheatsheet-machine-learning-tips-and-tricks.md
rename to template/cs-229-machine-learning-tips-and-tricks.md
index 9712297b8..edba03259 100644
--- a/hi/cheatsheet-machine-learning-tips-and-tricks.md
+++ b/template/cs-229-machine-learning-tips-and-tricks.md
@@ -1,3 +1,7 @@
+**Machine Learning tips and tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks)
+
+
+
**1. Machine Learning tips and tricks cheatsheet**
⟶
diff --git a/hi/refresher-probability.md b/template/cs-229-probability.md
similarity index 98%
rename from hi/refresher-probability.md
rename to template/cs-229-probability.md
index 5c9b34656..b8be13004 100644
--- a/hi/refresher-probability.md
+++ b/template/cs-229-probability.md
@@ -1,3 +1,7 @@
+**Probabilities and Statistics translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-probabilities-statistics)
+
+
+
**1. Probabilities and Statistics refresher**
⟶
diff --git a/hi/cheatsheet-supervised-learning.md b/template/cs-229-supervised-learning.md
similarity index 96%
rename from hi/cheatsheet-supervised-learning.md
rename to template/cs-229-supervised-learning.md
index a6b19ea1c..9a0a1901a 100644
--- a/hi/cheatsheet-supervised-learning.md
+++ b/template/cs-229-supervised-learning.md
@@ -1,3 +1,7 @@
+**Supervised Learning translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning)
+
+
+
**1. Supervised Learning cheatsheet**
⟶
@@ -238,19 +242,19 @@
-**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+**41. The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
⟶
-**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
+**42. Optimal margin classifier ― The optimal margin classifier h is such that:**
⟶
-**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
+**43. where (w,b)∈Rn×R is the solution of the following optimization problem:**
⟶
@@ -472,13 +476,13 @@
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
⟶
-**81: the training and testing sets follow the same distribution **
+**81: the training and testing sets follow the same distribution**
⟶
diff --git a/de/cheatsheet-unsupervised-learning.md b/template/cs-229-unsupervised-learning.md
similarity index 96%
rename from de/cheatsheet-unsupervised-learning.md
rename to template/cs-229-unsupervised-learning.md
index 1bf117d72..18fafef8c 100644
--- a/de/cheatsheet-unsupervised-learning.md
+++ b/template/cs-229-unsupervised-learning.md
@@ -1,3 +1,7 @@
+**Unsupervised Learning translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning)
+
+
+
**1. Unsupervised Learning cheatsheet**
⟶
@@ -299,7 +303,7 @@ dimensions by maximizing the variance of the data as follows:**
-**51. The Machine Learning cheatsheets are now available in German.**
+**51. The Machine Learning cheatsheets are now available in [target language].**
⟶
diff --git a/template/cs-230-convolutional-neural-networks.md b/template/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..94006a675
--- /dev/null
+++ b/template/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,716 @@
+**Convolutional Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶
+
+
+
+
+**12. Overview**
+
+⟶
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶
+
+
+
+
+**15. Types of layer**
+
+⟶
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average values are taken, respectively.**
+
+⟶
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind their hyperparameters.**
+
+⟶
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶
+
+
+
+
+**26. Filter**
+
+⟶
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that the feature map has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶
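
The output-size relation of item 33 can be sketched in a few lines. This is a minimal illustration, assuming the usual form O = (I − F + Pstart + Pend)/S + 1 and assuming floor division when the last window does not fit; the function name `conv_output_size` is hypothetical and not part of the cheatsheet.

```python
def conv_output_size(i, f, s=1, p_start=0, p_end=0):
    """Output length along one dimension: O = (I - F + P_start + P_end) / S + 1."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    span = i - f + p_start + p_end
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    # Assumption: floor division, i.e. a final window that does not fit is dropped.
    return span // s + 1


if __name__ == "__main__":
    print(conv_output_size(32, 5))                      # no padding, stride 1
    print(conv_output_size(32, 5, p_start=2, p_end=2))  # padding keeps the input size
```

With Pstart = Pend = P, the two padding arguments simply add up to 2P, as in the remark of item 35.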
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶
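
The worked example of item 42 can be reproduced with a short sketch. It assumes the receptive-field formula has the form Rk = 1 + Σj (Fj − 1) · Πi<j Si with S0 = 1, which agrees with the value R2 = 5 quoted above; the function name `receptive_field` is hypothetical.

```python
def receptive_field(filter_sizes, strides):
    """Receptive field R_k after k layers, given F_1..F_k and S_1..S_k."""
    r = 1
    jump = 1  # running product S_0 * ... * S_(j-1), with the convention S_0 = 1
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r


if __name__ == "__main__":
    # F1 = F2 = 3 and S1 = S2 = 1, as in the example of item 42.
    print(receptive_field([3, 3], [1, 1]))
```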
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶
+
+
+
+
+**48. where**
+
+⟶
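
As an illustration of the softmax step in item 47, here is a minimal sketch assuming the standard definition pi = exp(xi) / Σj exp(xj). Subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it is not part of the cheatsheet text and does not change the result.

```python
import math


def softmax(scores):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(scores)  # stability trick (assumption): avoids overflow in exp
    exps = [math.exp(x - shift) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(softmax([2.0, 1.0, 0.1]))
```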
+
+
+
+
+**49. Object detection**
+
+⟶
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶
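
A minimal sketch of the IoU computation follows, assuming the usual definition IoU = area(Bp ∩ Ba) / area(Bp ∪ Ba). For brevity, boxes are given by corners `(x1, y1, x2, y2)` rather than by the center, height and width used in item 58; that representation is an assumption of this example.

```python
def iou(box_p, box_a):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    ax1, ay1, ax2, ay2 = box_a
    # Width and height of the overlap, clamped at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(px2, ax2) - max(px1, ax1))
    inter_h = max(0.0, min(py2, ay2) - max(py1, ay1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (ax2 - ax1) * (ay2 - ay1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    score = iou((0, 0, 2, 2), (1, 1, 3, 3))
    print(score, "reasonably good" if score >= 0.5 else "poor match")
```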
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of the same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶
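
The two steps above can be sketched for a single class as follows. This is an illustrative implementation only: the corner-based box format `(x1, y1, x2, y2)` and the names `non_max_suppression` and `_iou` are assumptions, while the 0.6 probability threshold and the 0.5 IoU threshold are taken from items 62 and 63.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept for one class."""
    # Preliminary filter: drop boxes whose probability is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)  # Step 1: box with the largest probability
        kept.append(best)
        # Step 2: discard boxes overlapping too much with the selected box.
        candidates = [i for i in candidates
                      if _iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
    scores = [0.9, 0.8, 0.7, 0.3]
    print(non_max_suppression(boxes, scores))
```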
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+⟶
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶
+
+
+
+
+**74. Face verification and recognition**
+
+⟶
+
+
+
+
+**75. Types of models ― The two main types of models are summed up in the table below:**
+
+⟶
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to the same class, while the negative example belongs to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶
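
A small sketch of the triplet loss, assuming the usual form ℓ(A,P,N) = max(d(A,P) − d(A,N) + α, 0). Taking d to be the squared Euclidean distance between embeddings is an assumption of this example, as are the function names.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings (assumed choice of d)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Loss on the embeddings of the anchor, positive and negative images."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


if __name__ == "__main__":
    anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [3.0, 4.0]
    # The negative is far from the anchor, so the margin is already satisfied.
    print(triplet_loss(anchor, positive, negative))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin α.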
+
+
+
+
+**81. Neural style transfer**
+
+⟶
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative model, which aims at differentiating the generated and true image.**
+
+⟶
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at trying different convolutions in order to increase its performance through feature diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶
+
+
+
+
+**98. Original authors**
+
+⟶
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶
+
+
+
+
+**102. By X and Y**
+
+⟶
+
+
diff --git a/template/cs-230-deep-learning-tips-and-tricks.md b/template/cs-230-deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..e1778de36
--- /dev/null
+++ b/template/cs-230-deep-learning-tips-and-tricks.md
@@ -0,0 +1,450 @@
+**Deep Learning Tips and Tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks)
+
+
+
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+⟶
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶
+
+
+
+
+**3. Tips and tricks**
+
+⟶
+
+
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+⟶
+
+
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+⟶
+
+
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+⟶
+
+
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+⟶
+
+
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+⟶
+
+
+
+
+**9. View PDF version on GitHub**
+
+⟶
+
+
+
+
+**10. Data processing**
+
+⟶
+
+
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+⟶
+
+
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+⟶
+
+
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+⟶
+
+
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+⟶
+
+
+
+
+**15. [Nuances of RGB are slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposure due to time of day]**
+
+⟶
+
+
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+⟶
+
+
+
+
+**17. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB and σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+⟶
+
+
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶
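
The normalization step of item 17 can be sketched for a batch of scalars. It assumes the standard form xi ← γ (xi − μB) / sqrt(σ2B + ε) + β; the small constant ε and the function name `batch_norm` are assumptions of this example, and running statistics for inference are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    m = len(batch)
    mean = sum(batch) / m                          # mu_B
    var = sum((x - mean) ** 2 for x in batch) / m  # sigma^2_B (biased estimate)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


if __name__ == "__main__":
    print(batch_norm([1.0, 2.0, 3.0]))
```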
+
+
+
+
+**19. Training a neural network**
+
+⟶
+
+
+
+
+**20. Definitions**
+
+⟶
+
+
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+⟶
+
+
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+⟶
+
+
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+⟶
+
+
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶
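
A minimal sketch of the binary cross-entropy loss, assuming the usual definition L(z,y) = −[y log(z) + (1−y) log(1−z)]. Clamping z away from 0 and 1 with a small `eps` is an assumption added to avoid taking log(0).

```python
import math


def binary_cross_entropy(z, y, eps=1e-12):
    """Cross-entropy between a predicted probability z and a label y in {0, 1}."""
    z = min(max(z, eps), 1.0 - eps)  # assumption: clamp to keep log finite
    return -(y * math.log(z) + (1 - y) * math.log(1.0 - z))


if __name__ == "__main__":
    print(binary_cross_entropy(0.9, 1))  # confident and correct: small loss
    print(binary_cross_entropy(0.1, 1))  # confident and wrong: large loss
```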
+
+
+
+
+**25. Finding optimal weights**
+
+⟶
+
+
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+⟶
+
+
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+⟶
+
+
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶
+
+
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+⟶
+
+
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+⟶
+
+
+
+
+**31. Parameter tuning**
+
+⟶
+
+
+
+
+**32. Weights initialization**
+
+⟶
+
+
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization makes it possible to have initial weights that take into account characteristics that are unique to the architecture.**
+
+⟶
+
+
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+⟶
+
+
+
+
+**35. [Training size, Illustration, Explanation]**
+
+⟶
+
+
+
+
+**36. [Small, Medium, Large]**
+
+⟶
+
+
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+⟶
+
+
+
+
+**38. Optimizing convergence**
+
+⟶
+
+
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶
+
+
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+⟶
+
+
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+⟶
+
+
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+⟶
+
+
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+⟶
+
+
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+⟶
+
+
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+⟶
+
+
+
+
+**46. Regularization**
+
+⟶
+
+
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+⟶
+
+
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+⟶
+
+
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+⟶
+
+
+
+
+**50. [LASSO, Ridge, Elastic Net, Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶
+
+
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+⟶
+
+
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+⟶
+
+
+
+
+**53. Good practices**
+
+⟶
+
+
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+⟶
+
+
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+⟶
+
+
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+⟶
+
+
+
+
+**57. [Formula, Comments]**
+
+⟶
+
+
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+⟶
+
+
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+⟶
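
Gradient checking can be illustrated on a toy loss. The sketch below assumes the central-difference form (f(x+h) − f(x−h)) / (2h) for the numerical gradient, which matches the "two loss evaluations per dimension" comment in item 58; the quadratic loss and all function names are hypothetical.

```python
def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate: two evaluations of f per dimension."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad


def loss(w):
    """Toy loss used for the check: sum of squares."""
    return sum(v * v for v in w)


def analytical_gradient(w):
    """Hand-derived gradient of the toy loss."""
    return [2 * v for v in w]


if __name__ == "__main__":
    w = [1.0, -2.0, 0.5]
    diff = max(abs(n - a) for n, a in
               zip(numerical_gradient(loss, w), analytical_gradient(w)))
    print("max abs difference:", diff)
```

A large difference between the two gradients would point to a bug in the analytical implementation.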
+
+
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶
+
+
+**61. Original authors**
+
+⟶
+
+
+
+**62. Translated by X, Y and Z**
+
+⟶
+
+
+
+**63. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+**64. View PDF version on GitHub**
+
+⟶
+
+
+
+**65. By X and Y**
+
+⟶
+
+
diff --git a/template/cs-230-recurrent-neural-networks.md b/template/cs-230-recurrent-neural-networks.md
new file mode 100644
index 000000000..bd3c638bc
--- /dev/null
+++ b/template/cs-230-recurrent-neural-networks.md
@@ -0,0 +1,677 @@
+**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
+
+
+
+**1. Recurrent Neural Networks cheatsheet**
+
+⟶
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶
+
+
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+⟶
+
+
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+⟶
+
+
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+⟶
+
+
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+⟶
+
+
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+⟶
+
+
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+⟶
+
+
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+⟶
+
+
+
+
+**10. Overview**
+
+⟶
+
+
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+⟶
+
+
+
+
+**12. For each timestep t, the activation a and the output y are expressed as follows:**
+
+⟶
+
+
+
+
+**13. and**
+
+⟶
+
+
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+⟶
+
+
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+⟶
+
+
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+⟶
+
+
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+⟶
+
+
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+⟶
+
+
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+⟶
+
+
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+⟶
+
+
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Named entity recognition, Machine translation]**
+
+⟶
+
+
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+⟶
+
+
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+⟶
+
+
+
+
+**24. Handling long term dependencies**
+
+⟶
+
+
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+⟶
+
+
+
+
+**26. [Sigmoid, Tanh, ReLU]**
+
+⟶
+
+
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of the multiplicative gradient, which can be exponentially decreasing/increasing with respect to the number of layers.**
+
+⟶
+
+
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+⟶
+
+
+
+
+**29. clipped**
+
+⟶
+
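A minimal sketch of gradient clipping, assuming the cap is applied to the L2 norm of the gradient (capping each component separately is another common choice). The function name `clip_by_norm` and the `max_norm` parameter are illustrative and do not come from the cheatsheet.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        # Already within the allowed range: leave the gradient untouched.
        return list(grad)
    scale = max_norm / norm
    # Same direction, shorter length.
    return [g * scale for g in grad]

# A gradient of norm 5 is rescaled to norm 1.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Rescaling by the norm keeps the direction of the update and only limits its size, which is why it is often preferred over component-wise capping.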
+
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+⟶
+
+
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+⟶
+
+
+
+
+**32. [Type of gate, Role, Used in]**
+
+⟶
+
+
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+⟶
+
+
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+⟶
+
+
+
+
+**35. [LSTM, GRU]**
+
+⟶
+
+
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+⟶
+
+
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+⟶
+
+
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+⟶
+
+
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+⟶
+
+
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+⟶
+
+
+
+
+**41. Learning word representation**
+
+⟶
+
+
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+⟶
+
+
+
+
+**43. Motivation and notations**
+
+⟶
+
+
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+⟶
+
+
+
+
+**45. [1-hot representation, Word embedding]**
+
+⟶
+
+
+
+
+**46. [teddy bear, book, soft]**
+
+⟶
+
+
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account word similarity]**
+
+⟶
+
+
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+⟶
+
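As an illustration of the mapping from ow to ew, the sketch below multiplies a small embedding matrix by a 1-hot vector. It assumes E has one column per vocabulary word; the numbers and the helper names `one_hot` and `matvec` are made up for the example.

```python
def one_hot(index, vocab_size):
    """1-hot representation o_w of the word at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def matvec(matrix, vector):
    """Plain matrix-vector product, one output entry per matrix row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

# Hypothetical 2-dimensional embeddings for a vocabulary of 3 words.
E = [[0.1, 0.4, 0.7],
     [0.2, 0.5, 0.8]]

# Multiplying by the 1-hot vector selects the matching column of E.
e_w = matvec(E, one_hot(1, 3))
```

In practice the product is usually replaced by a direct column lookup, since all but one term of each sum is zero.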
+
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+⟶
+
+
+
+
+**50. Word embeddings**
+
+⟶
+
+
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+⟶
+
+
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+⟶
+
+
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+⟶
+
+
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+⟶
+
+
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+⟶
+
+
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how likely a given context word and a given target word are to appear together, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+⟶
+
+
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+⟶
+
+
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+⟶
+
+
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
+Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+⟶
+
+
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+⟶
+
+
+
+
+**60. Comparing words**
+
+⟶
+
+
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+⟶
+
+
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+⟶
+
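A short sketch of cosine similarity on plain Python lists, using the standard definition (dot product divided by the product of the norms). The vectors in the usage lines are arbitrary, not real word embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # exactly 0
```

The value depends only on the angle between the vectors, not on their lengths, so scaling a vector does not change the similarity.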
+
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+⟶
+
+
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+⟶
+
+
+
+
+**65. Language model**
+
+⟶
+
+
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+⟶
+
+
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearances in the training data.**
+
+⟶
+
+
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+⟶
+
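The description above can be sketched as follows, assuming `token_probs` holds the probability the model assigned to each of the T observed words (a hypothetical input format). The product of inverse probabilities is computed in log space to avoid underflow on long texts.

```python
import math

def perplexity(token_probs):
    """Inverse probability of the data, normalized by the number of words T."""
    T = len(token_probs)
    log_likelihood = sum(math.log(p) for p in token_probs)
    # exp(-(1/T) * sum(log p)) equals (prod 1/p) ** (1/T)
    return math.exp(-log_likelihood / T)

# A model that always assigns probability 1/4 has a perplexity of about 4.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k words.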
+
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+⟶
+
+
+
+
+**70. Machine translation**
+
+⟶
+
+
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+⟶
+
+
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+⟶
+
+
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+⟶
+
+
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+⟶
+
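The steps of beam search can be sketched on a toy model. Everything model-related here is an assumption made for illustration: `toy_model` is a hand-written table of next-word probabilities standing in for a trained network, and `"</s>"` is used as the stop word.

```python
import math

def toy_model(prefix):
    """Hypothetical next-word distribution given the words produced so far."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix == ["a"]:
        return {"x": 0.5, "y": 0.5}
    if prefix == ["b"]:
        return {"z": 0.9, "</s>": 0.1}
    return {"</s>": 1.0}

def beam_search(next_probs, beam_width, max_len, stop="</s>"):
    beams = [([], 0.0)]          # (words so far, log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every kept sequence by every possible next word.
        candidates = []
        for words, score in beams:
            for word, p in next_probs(words).items():
                candidates.append((words + [word], score + math.log(p)))
        # Keep only the top B combinations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_width]:
            if words[-1] == stop:
                finished.append((words, score))
            else:
                beams.append((words, score))
        if not beams:
            break
    finished.extend(beams)       # sequences cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_words, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
beam_words, beam_score = beam_search(toy_model, beam_width=2, max_len=5)
```

With a beam width of 1 the search commits to "a" because it is the likeliest first word, and ends with a sequence of probability 0.3. With a beam width of 2 it also keeps "b" alive and finds the sequence "b z" with probability 0.36.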
+
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory usage. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+⟶
+
+
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+⟶
+
+
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+⟶
+
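A small sketch of a length-normalized objective, assuming the common form in which the sum of log-probabilities is divided by the sentence length raised to the power α. The input format (`token_probs`, one probability per generated word) and the default `alpha=0.7` are illustrative choices.

```python
import math

def normalized_log_likelihood(token_probs, alpha=0.7):
    """Sum of log-probabilities divided by (sentence length) ** alpha."""
    T = len(token_probs)
    return sum(math.log(p) for p in token_probs) / (T ** alpha)

short = [0.5, 0.5]                 # 2 words, total probability 0.25
longer = [0.6, 0.6, 0.6, 0.6]      # 4 words, total probability about 0.13

# Without normalization (alpha = 0) the shorter sentence wins.
raw_short = normalized_log_likelihood(short, alpha=0.0)
raw_longer = normalized_log_likelihood(longer, alpha=0.0)

# With normalization the longer sentence, whose words are individually
# more likely, scores higher.
norm_short = normalized_log_likelihood(short)
norm_longer = normalized_log_likelihood(longer)
```

Setting α to 0 recovers the unnormalized log-likelihood and α to 1 gives a plain per-word average, which is why α acts as a softener between the two.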
+
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+⟶
+
+
+
+
+**79. [Case, Root cause, Remedies]**
+
+⟶
+
+
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+⟶
+
+
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+⟶
+
+
+
+
+**82. where pn is the bleu score on n-gram only defined as follows:**
+
+⟶
+
+
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+⟶
+
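The n-gram precision pn can be sketched as below, assuming the usual clipped counting, where an n-gram of the prediction is credited at most as many times as it appears in the reference. Only pn is shown; how the pn are combined into the final score, and the brevity penalty, are left out. The sentences are toy examples.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat".split()
candidate = "the the the the".split()

p1 = clipped_precision(candidate, reference, 1)   # "the" is credited twice out of 4
p2 = clipped_precision(candidate, reference, 2)   # no bigram "the the" in the reference
```

Clipping is what stops a degenerate prediction that repeats a frequent word from obtaining a perfect unigram precision.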
+
+
+
+**84. Attention**
+
+⟶
+
+
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting α<t,t′> the amount of attention that the output y<t> should pay to the activation a<t′> and c<t> the context at time t, we have:**
+
+⟶
+
+
+
+
+**86. with**
+
+⟶
+
+
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+⟶
+
+
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+⟶
+
+
+
+
+**89. Attention weight ― The amount of attention that the output y<t> should pay to the activation a<t′> is given by α<t,t′> computed as follows:**
+
+⟶
+
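A minimal sketch of how attention weights and the context could be computed for one output step, assuming the weights are a softmax over alignment scores and the context is the weighted sum of the activations. The scores would normally come from a small learned network; here they are passed in directly, and all names are illustrative.

```python
import math

def attention_weights(scores):
    """Softmax over the alignment scores of one output step."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, activations):
    """Weighted sum of the activations, one weight per input position."""
    dim = len(activations[0])
    return [sum(w * a[i] for w, a in zip(weights, activations))
            for i in range(dim)]

activations = [[1.0, 0.0], [0.0, 1.0]]    # one activation per input position
weights = attention_weights([0.0, 0.0])   # equal scores give uniform weights
context = context_vector(weights, activations)
```

Since one weight is computed for every pair of output and input positions, the cost grows with the product of the two sequence lengths.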
+
+
+
+**90. Remark: computation complexity is quadratic with respect to Tx.**
+
+⟶
+
+
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶
+
+
+
+**92. Original authors**
+
+⟶
+
+
+
+**93. Translated by X, Y and Z**
+
+⟶
+
+
+
+**94. Reviewed by X, Y and Z**
+
+⟶
+
+
+
+**95. View PDF version on GitHub**
+
+⟶
+
+
+
+**96. By X and Y**
+
+⟶
+
+
diff --git a/template/refresher-linear-algebra.md b/template/refresher-linear-algebra.md
deleted file mode 100644
index a6b440d1e..000000000
--- a/template/refresher-linear-algebra.md
+++ /dev/null
@@ -1,339 +0,0 @@
-**1. Linear Algebra and Calculus refresher**
-
-⟶
-
-
-
-**2. General notations**
-
-⟶
-
-
-
-**3. Definitions**
-
-⟶
-
-
-
-**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
-
-⟶
-
-
-
-**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
-
-⟶
-
-
-
-**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
-
-⟶
-
-
-
-**7. Main matrices**
-
-⟶
-
-
-
-**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
-
-⟶
-
-
-
-**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
-
-⟶
-
-
-
-**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
-
-⟶
-
-
-
-**11. Remark: we also note D as diag(d1,...,dn).**
-
-⟶
-
-
-
-**12. Matrix operations**
-
-⟶
-
-
-
-**13. Multiplication**
-
-⟶
-
-
-
-**14. Vector-vector ― There are two types of vector-vector products:**
-
-⟶
-
-
-
-**15. inner product: for x,y∈Rn, we have:**
-
-⟶
-
-
-
-**16. outer product: for x∈Rm,y∈Rn, we have:**
-
-⟶
-
-
-
-**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
-
-⟶
-
-
-
-**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
-
-⟶
-
-
-
-**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
-
-⟶
-
-
-
-**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
-
-⟶
-
-
-
-**21. Other operations**
-
-⟶
-
-
-
-**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
-
-⟶
-
-
-
-**23. Remark: for matrices A,B, we have (AB)T=BTAT**
-
-⟶
-
-
-
-**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
-
-⟶
-
-
-
-**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
-
-⟶
-
-
-
-**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
-
-⟶
-
-
-
-**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
-
-⟶
-
-
-
-**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
-
-⟶
-
-
-
-**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
-
-⟶
-
-
-
-**30. Matrix properties**
-
-⟶
-
-
-
-**31. Definitions**
-
-⟶
-
-
-
-**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
-
-⟶
-
-
-
-**33. [Symmetric, Antisymmetric]**
-
-⟶
-
-
-
-**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
-
-⟶
-
-
-
-**35. N(ax)=|a|N(x) for a scalar**
-
-⟶
-
-
-
-**36. if N(x)=0, then x=0**
-
-⟶
-
-
-
-**37. For x∈V, the most commonly used norms are summed up in the table below:**
-
-⟶
-
-
-
-**38. [Norm, Notation, Definition, Use case]**
-
-⟶
-
-
-
-**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
-
-⟶
-
-
-
-**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
-
-⟶
-
-
-
-**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
-
-⟶
-
-
-
-**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
-
-⟶
-
-
-
-**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
-
-⟶
-
-
-
-**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**46. diagonal**
-
-⟶
-
-
-
-**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
-
-⟶
-
-
-
-**48. Matrix calculus**
-
-⟶
-
-
-
-**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
-
-⟶
-
-
-
-**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
-
-⟶
-
-
-
-**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
-
-⟶
-
-
-
-**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
-
-⟶
-
-
-
-**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
-
-⟶
-
-
-
-**54. [General notations, Definitions, Main matrices]**
-
-⟶
-
-
-
-**55. [Matrix operations, Multiplication, Other operations]**
-
-⟶
-
-
-
-**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
-
-⟶
-
-
-
-**57. [Matrix calculus, Gradient, Hessian, Operations]**
-
-⟶
diff --git a/template/refresher-probability.md b/template/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/template/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-⟶
-
-
-
-**2. Introduction to Probability and Combinatorics**
-
-⟶
-
-
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-⟶
-
-
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-⟶
-
-
-
-**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
-
-⟶
-
-
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-⟶
-
-
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-⟶
-
-
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-⟶
-
-
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-⟶
-
-
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-⟶
-
-
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-⟶
-
-
-
-**12. Conditional Probability**
-
-⟶
-
-
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-⟶
-
-
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-⟶
-
-
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-⟶
-
-
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-⟶
-
-
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-⟶
-
-
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-⟶
-
-
-
-**19. Random Variables**
-
-⟶
-
-
-
-**20. Definitions**
-
-⟶
-
-
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to the real line.**
-
-⟶
-
-
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-⟶
-
-
-
-**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
-
-⟶
-
-
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-⟶
-
-
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-⟶
-
-
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-⟶
-
-
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-⟶
-
-
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-⟶
-
-
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-⟶
-
-
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-⟶
-
-
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-⟶
-
-
-
-**32. Probability Distributions**
-
-⟶
-
-
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-⟶
-
-
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-⟶
-
-
-
-**35. [Type, Distribution]**
-
-⟶
-
-
-
-**36. Jointly Distributed Random Variables**
-
-⟶
-
-
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-⟶
-
-
-
-**38. [Case, Marginal density, Cumulative function]**
-
-⟶
-
-
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-⟶
-
-
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-⟶
-
-
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-⟶
-
-
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-⟶
-
-
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-⟶
-
-
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-⟶
-
-
-
-**45. Parameter estimation**
-
-⟶
-
-
-
-**46. Definitions**
-
-⟶
-
-
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-⟶
-
-
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-⟶
-
-
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-⟶
-
-
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-⟶
-
-
-
-**51. Estimating the mean**
-
-⟶
-
-
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
-
-⟶
-
-
-
-**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
-
-⟶
-
-
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-⟶
-
-
-
-**55. Estimating the variance**
-
-⟶
-
-
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-⟶
-
-
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-⟶
-
-
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-⟶
-
-
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-⟶
-
-
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-⟶
-
-
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-⟶
-
-
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-⟶
-
-
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-⟶
-
-
-
-**64. [Parameter estimation, Mean, Variance]**
-
-⟶
diff --git a/tr/cheatsheet-machine-learning-tips-and-tricks.md b/tr/cheatsheet-machine-learning-tips-and-tricks.md
deleted file mode 100644
index 9712297b8..000000000
--- a/tr/cheatsheet-machine-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,285 +0,0 @@
-**1. Machine Learning tips and tricks cheatsheet**
-
-⟶
-
-
-
-**2. Classification metrics**
-
-⟶
-
-
-
-**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-
-⟶
-
-
-
-**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-
-⟶
-
-
-
-**5. [Predicted class, Actual class]**
-
-⟶
-
-
-
-**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-
-⟶
-
-
-
-**7. [Metric, Formula, Interpretation]**
-
-⟶
-
-
-
-**8. Overall performance of model**
-
-⟶
-
-
-
-**9. How accurate the positive predictions are**
-
-⟶
-
-
-
-**10. Coverage of actual positive sample**
-
-⟶
-
-
-
-**11. Coverage of actual negative sample**
-
-⟶
-
-
-
-**12. Hybrid metric useful for unbalanced classes**
-
-⟶
-
-
-
-**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
-
-⟶
-
-
-
-**14. [Metric, Formula, Equivalent]**
-
-⟶
-
-
-
-**15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-
-⟶
-
-
-
-**16. [Actual, Predicted]**
-
-⟶
-
-
-
-**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-
-⟶
-
-
-
-**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-
-⟶
-
-
-
-**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-
-⟶
-
-
-
-**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-
-⟶
-
-
-
-**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
-
-⟶
-
-
-
-**22. Model selection**
-
-⟶
-
-
-
-**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-
-⟶
-
-
-
-**24. [Training set, Validation set, Testing set]**
-
-⟶
-
-
-
-**25. [Model is trained, Model is assessed, Model gives predictions]**
-
-⟶
-
-
-
-**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-
-⟶
-
-
-
-**27. [Also called hold-out or development set, Unseen data]**
-
-⟶
-
-
-
-**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-
-⟶
-
-
-
-**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-
-⟶
-
-
-
-**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-
-⟶
-
-
-
-**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-
-⟶
-
-
-
-**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-
-⟶
-
-
-
-**33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-
-⟶
-
-
-
-**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-⟶
-
-
-
-**35. Diagnostics**
-
-⟶
-
-
-
-**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-
-⟶
-
-
-
-**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-
-⟶
-
-
-
-**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-
-⟶
-
-
-
-**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-
-⟶
-
-
-
-**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-
-⟶
-
-
-
-**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-
-⟶
-
-
-
-**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-
-⟶
-
-
-
-**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-
-⟶
-
-
-
-**44. Regression metrics**
-
-⟶
-
-
-
-**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-⟶
-
-
-
-**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-
-⟶
-
-
-
-**47. [Model selection, cross-validation, regularization]**
-
-⟶
-
-
-
-**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-⟶
diff --git a/tr/cheatsheet-supervised-learning.md b/tr/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/tr/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Supervised Learning**
-
-⟶
-
-
-
-**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-⟶
-
-
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-⟶
-
-
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-⟶
-
-
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-⟶
-
-
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-⟶
-
-
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-⟶
-
-
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-⟶
-
-
-
-**10. Notations and general concepts**
-
-⟶
-
-
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-⟶
-
-
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-⟶
-
-
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-⟶
-
-
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-⟶
-
-
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-⟶
-
-
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-⟶
-
-
-
-**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
-
-⟶
-
-
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-⟶
-
-
-
-**19. Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-⟶
-
-
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-⟶
-
-
-
-**21. Linear models**
-
-⟶
-
-
-
-**22. Linear regression**
-
-⟶
-
-
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-⟶
-
-
-
-**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
-
-⟶
-
-
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-⟶
-
-
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-⟶
-
-
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-⟶
-
-
-
-**28. Classification and logistic regression**
-
-⟶
-
-
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-⟶
-
-
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-⟶
-
-
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-⟶
-
-
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-⟶
-
-
-
-**33. Generalized Linear Models**
-
-⟶
-
-
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-⟶
-
-
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-⟶
-
-
-
-**36. Here are the most common exponential distributions summed up in the following table:**
-
-⟶
-
-
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-⟶
-
-
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:**
-
-⟶
-
-
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-⟶
-
-
-
-**40. Support Vector Machines**
-
-⟶
-
-
-
-**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
-
-⟶
-
-
-
-**42: Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-⟶
-
-
-
-**43: where (w,b)∈Rn×R is the solution of the following optimization problem:**
-
-⟶
-
-
-
-**44. such that**
-
-⟶
-
-
-
-**45. support vectors**
-
-⟶
-
-
-
-**46. Remark: the line is defined as wTx−b=0.**
-
-⟶
-
-
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-⟶
-
-
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
-
-⟶
-
-
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**
-
-⟶
-
-
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-⟶
-
-
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-⟶
-
-
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-⟶
-
-
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-⟶
-
-
-
-**54. Generative Learning**
-
-⟶
-
-
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-⟶
-
-
-
-**56. Gaussian Discriminant Analysis**
-
-⟶
-
-
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-⟶
-
-
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-⟶
-
-
-
-**59. Naive Bayes**
-
-⟶
-
-
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-⟶
-
-
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-⟶
-
-
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-⟶
-
-
-
-**63. Tree-based and ensemble methods**
-
-⟶
-
-
-
-**64. These methods can be used for both regression and classification problems.**
-
-⟶
-
-
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
-
-⟶
-
-
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-⟶
-
-
-
-**67. Remark: random forests are a type of ensemble methods.**
-
-⟶
-
-
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-⟶
-
-
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-⟶
-
-
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-⟶
-
-
-
-**71. Weak learners trained on remaining errors**
-
-⟶
-
-
-
-**72. Other non-parametric approaches**
-
-⟶
-
-
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-⟶
-
-
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-⟶
-
-
-
-**75. Learning Theory**
-
-⟶
-
-
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-⟶
-
-
-
-**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
-
-⟶
-
-
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-⟶
-
-
-
-**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
-
-⟶
-
-
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
-
-⟶
-
-
-
-**81: the training and testing sets follow the same distribution **
-
-⟶
-
-
-
-**82. the training examples are drawn independently**
-
-⟶
-
-
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-⟶
-
-
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-⟶
-
-
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-⟶
-
-
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-⟶
-
-
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-⟶
-
-
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-⟶
-
-
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-⟶
-
-
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-⟶
-
-
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-⟶
-
-
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-⟶
-
-
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-⟶
-
-
-
-**94. [Other methods, k-NN]**
-
-⟶
-
-
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-⟶
diff --git a/tr/cheatsheet-unsupervised-learning.md b/tr/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index 5eae29ed8..000000000
--- a/tr/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-⟶
-
-
-
-**2. Introduction to Unsupervised Learning**
-
-⟶
-
-
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-⟶
-
-
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-⟶
-
-
-
-**5. Clustering**
-
-⟶
-
-
-
-**6. Expectation-Maximization**
-
-⟶
-
-
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-⟶
-
-
-
-**8. [Setting, Latent variable z, Comments]**
-
-⟶
-
-
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-⟶
-
-
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-⟶
-
-
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-⟶
-
-
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-⟶
-
-
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-⟶
-
-
-
-**14. k-means clustering**
-
-⟶
-
-
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-⟶
-
-
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-⟶
-
-
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-⟶
-
-
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-⟶
-
-
-
-**19. Hierarchical clustering**
-
-⟶
-
-
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
-
-⟶
-
-
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
-
-⟶
-
-
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-⟶
-
-
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
-
-⟶
-
-
-
-**24. Clustering assessment metrics**
-
-⟶
-
-
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-⟶
-
-
-
-**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-⟶
-
-
-
-**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-⟶
-
-
-
-**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-⟶
-
-
-
-**29. Dimension reduction**
-
-⟶
-
-
-
-**30. Principal component analysis**
-
-⟶
-
-
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-⟶
-
-
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-⟶
-
-
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-⟶
-
-
-
-**34. diagonal**
-
-⟶
-
-
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-⟶
-
-
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-⟶
-
-
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-⟶
-
-
-
-**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
-
-⟶
-
-
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-⟶
-
-
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-⟶
-
-
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-⟶
-
-
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-⟶
-
-
-
-**43. Independent component analysis**
-
-⟶
-
-
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-⟶
-
-
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-⟶
-
-
-
-**46. The goal is to find the unmixing matrix W=A−1.**
-
-⟶
-
-
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-⟶
-
-
-
-**48. Write the probability of x=As=W−1s as:**
-
-⟶
-
-
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-⟶
-
-
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-⟶
-
-
-
-**51. The Machine Learning cheatsheets are now available in Turkish.**
-
-⟶
-
-
-
-**52. Original authors**
-
-⟶
-
-
-
-**53. Translated by X, Y and Z**
-
-⟶
-
-
-
-**54. Reviewed by X, Y and Z**
-
-⟶
-
-
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-**57. [Dimension reduction, PCA, ICA]**
-
-⟶
diff --git a/tr/cs-221-logic-models.md b/tr/cs-221-logic-models.md
new file mode 100644
index 000000000..23476dd86
--- /dev/null
+++ b/tr/cs-221-logic-models.md
@@ -0,0 +1,462 @@
+**Logic-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-logic-models)
+
+
+
+**1. Logic-based models with propositional and first-order logic**
+
+⟶ Önermeli ve birinci dereceden mantık (Lojik) temelli modeller
+
+
+
+
+**2. Basics**
+
+⟶ Temeller
+
+
+
+
+**3. Syntax of propositional logic ― By noting f,g formulas, and ¬,∧,∨,→,↔ connectives, we can write the following logical expressions:**
+
+⟶ Önerme mantığının sözdizimi ― f, g formülleri ve ¬,∧,∨,→,↔ bağlayıcılarını belirterek, aşağıdaki mantıksal ifadeleri yazabiliriz:
+
+
+
+
+**4. [Name, Symbol, Meaning, Illustration]**
+
+⟶ [Ad, Sembol, Anlamı, Gösterim]
+
+
+
+
+**5. [Affirmation, Negation, Conjunction, Disjunction, Implication, Biconditional]**
+
+⟶ [Doğrulama, Dışlayan, Kesişim, Birleşim, Implication, İki koşullu]
+
+
+
+
+**6. [not f, f and g, f or g, if f then g, f, that is to say g]**
+
+⟶ [f değil, f ve g, f veya g, f ise g, f, yani g]
+
+
+
+
+**7. Remark: formulas can be built up recursively out of these connectives.**
+
+⟶ Not: formüller bu bağlayıcılar kullanılarak özyinelemeli olarak oluşturulabilir.
+
+
+
+
+**8. Model ― A model w denotes an assignment of binary weights to propositional symbols.**
+
+⟶ Model - w modeli, ikili ağırlıkların önermeli sembollere atanmasını belirtir.
+
+
+
+
+**9. Example: the set of truth values w={A:0,B:1,C:0} is one possible model to the propositional symbols A, B and C.**
+
+⟶ Örnek: w = {A: 0, B: 1, C: 0} doğruluk değerleri kümesi, A, B ve C önermeli semboller için olası bir modeldir.
+
+
+
+
+**10. Interpretation function ― The interpretation function I(f,w) outputs whether model w satisfies formula f:**
+
+⟶ Yorumlama fonksiyonu ― Yorumlama fonksiyonu I(f,w), w modelinin f formülüne uygun olup olmadığını gösterir:
+
+
+
+
+**11. Set of models ― M(f) denotes the set of models w that satisfy formula f. Mathematically speaking, we define it as follows:**
+
+⟶ Modellerin seti ― M(f), f formülünü sağlayan model setini belirtir. Matematiksel konuşursak, şöyle tanımlarız:
+
+
+
+
+**12. Knowledge base**
+
+⟶ Bilgi temeli
+
+
+
+
+**13. Definition ― The knowledge base KB is the conjunction of all formulas that have been considered so far. The set of models of the knowledge base is the intersection of the set of models that satisfy each formula. In other words:**
+
+⟶ Tanım ― Bilgi temeli (KB-Knowledge Base), şu ana kadar düşünülen tüm formüllerin birleşimidir. Bilgi temelinin model kümesi, her formülü karşılayan model dizisinin kesişimidir. Diğer bir deyişle:
+
+
+
+
+**14. Probabilistic interpretation ― The probability that query f is evaluated to 1 can be seen as the proportion of models w of the knowledge base KB that satisfy f, i.e.:**
+
+⟶ Olasılıksal yorumlama ― f sorgusunun 1 olarak değerlendirilmesi olasılığı, f'yi sağlayan bilgi temeli KB'nin w modellerinin oranı olarak görülebilir, yani:
+
+
+
+
+**15. Satisfiability ― The knowledge base KB is said to be satisfiable if at least one model w satisfies all its constraints. In other words:**
+
+⟶ Gerçeklenebilirlik ― En az bir modelin tüm kısıtlamaları yerine getirmesi durumunda KB'nin bilgi temelinin gerçeklenebilir olduğu söylenir. Diğer bir deyişle:
+
+
+
+
+**16. satisfiable**
+
+⟶ gerçeklenebilir
+
+
+
+
+**17. Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge base.**
+
+⟶ Not: M(KB), bilgi temelinin tüm kısıtları ile uyumlu model kümesini belirtir.
+
+
+
+
+**18. Relation between formulas and knowledge base - We define the following properties between the knowledge base KB and a new formula f:**
+
+⟶ Formüller ve bilgi temeli arasındaki ilişki - Bilgi temeli KB ile yeni bir formül f arasında aşağıdaki özellikleri tanımlarız:
+
+
+
+
+**19. [Name, Mathematical formulation, Illustration, Notes]**
+
+⟶ [Adı, Matematiksel formülü, Gösterim, Notlar]
+
+
+
+
+**20. [KB entails f, KB contradicts f, f contingent to KB]**
+
+⟶ [KB f'yi gerektirir, KB f ile çelişir, f KB'ye göre olumsaldır]
+
+
+
+
+**21. [f does not bring any new information, Also written KB⊨f, No model satisfies the constraints after adding f, Equivalent to KB⊨¬f, f does not contradict KB, f adds a non-trivial amount of information to KB]**
+
+⟶ [f yeni bir bilgi getirmiyor, KB⊨f olarak da yazılır, Hiçbir model f ekledikten sonra kısıtlamaları yerine getirmiyor, KB⊨¬f ile eşdeğer, f KB'ye aykırı değil, f KB'ye önemsiz olmayan miktarda bilgi ekliyor]
+
+
+
+
+**22. Model checking ― A model checking algorithm takes as input a knowledge base KB and outputs whether it is satisfiable or not.**
+
+⟶ Model denetimi - Bir model denetimi algoritması, KB'nin bilgi temelini girdi olarak alır ve bunun gerçeklenebilir/karşılanabilir olup olmadığını çıkarır.
+
+
+
+
+**23. Remark: popular model checking algorithms include DPLL and WalkSat.**
+
+⟶ Not: popüler model kontrol algoritmaları DPLL ve WalkSat'ı içerir.
+
+
+
+
+**24. Inference rule ― An inference rule of premises f1,...,fk and conclusion g is written:**
+
+⟶ Çıkarım kuralı - f1, ..., fk öncülleri ve g sonucundan oluşan bir çıkarım kuralı şöyle yazılır:
+
+
+
+
+**25. Forward inference algorithm ― From a set of inference rules Rules, this algorithm goes through all possible f1,...,fk and adds g to the knowledge base KB if a matching rule exists. This process is repeated until no more additions can be made to KB.**
+
+⟶ İleri çıkarım algoritması - Çıkarım kurallarından Kurallar, bu algoritma mümkün olan tüm f1, ..., fk'den geçer ve eşleşen bir kural varsa, KB bilgi tabanına g ekler. Bu işlem KB'ye daha fazla ekleme yapılamayana kadar tekrar edilir.
+
+
+
+
+**26. Derivation ― We say that KB derives f (written KB⊢f) with rules Rules if f already is in KB or gets added during the forward inference algorithm using the set of rules Rules.**
+
+⟶ Türetme - f'nin KB içerisindeyse veya kurallar kurallarını kullanarak ileri çıkarım algoritması sırasında eklenmişse, KB'nin kurallar ile f (KB⊢f yazılır) türettiğini söylüyoruz.
+
+
+
+
+**27. Properties of inference rules ― A set of inference rules Rules can have the following properties:**
+
+⟶ Çıkarım kurallarının özellikleri - Çıkarım kurallarının kümesi Kurallar aşağıdaki özelliklere sahip olabilir:
+
+
+
+
+**28. [Name, Mathematical formulation, Notes]**
+
+⟶ [Adı, Matematiksel formülü, Notlar]
+
+
+
+
+**29. [Soundness, Completeness]**
+
+⟶ [Sağlamlık, Tamlık]
+
+
+
+
+**30. [Inferred formulas are entailed by KB, Can be checked one rule at a time, "Nothing but the truth", Formulas entailing KB are either already in the knowledge base or inferred from it, "The whole truth"]**
+
+⟶ [Çıkarılan formüller KB tarafından gerektirilir, Her defasında bir kural kontrol edilebilir, "Gerçekten başka bir şey değil", KB'yi gerektiren formüller ya zaten bilgi temelindedir ya da ondan çıkarılır, "Tüm gerçek"]
+
+
+
+
+**31. Propositional logic**
+
+⟶ Önerme mantığı
+
+
+
+
+**32. In this section, we will go through logic-based models that use logical formulas and inference rules. The idea here is to balance expressivity and computational efficiency.**
+
+⟶ Bu bölümde, mantıksal formülleri ve çıkarım kurallarını kullanan mantık tabanlı modelleri inceleyeceğiz. Buradaki fikir ifade ve hesaplamanın verimliliğini dengelemektir.
+
+
+
+
+**33. Horn clause ― By noting p1,...,pk and q propositional symbols, a Horn clause has the form:**
+
+⟶ Horn cümlesi ― p1, ..., pk ve q önerme sembollerini not ederek, bir Horn cümlesi şu şekildedir (Matematiksel mantık ve mantık programlamada, kural gibi özel bir biçime sahip mantıksal formüllere Horn cümlesi denir.):
+
+
+
+
+**34. Remark: when q=false, it is called a "goal clause", otherwise we denote it as a "definite clause".**
+
+⟶ Not: q = false olduğunda, "hedeflenen bir cümle" olarak adlandırılır, aksi takdirde "kesin bir cümle" olarak adlandırırız
+
+
+
+
+**35. Modus ponens ― For propositional symbols f1,...,fk and p, the modus ponens rule is written:**
+
+⟶ Modus ponens - f1, ..., fk ve p önermeli semboller için modus ponens kuralı yazılır (Modus ponens: Önerme mantığında, modus ponens bir çıkarım kuralıdır. "P, Q anlamına gelir ve P'nin doğru olduğu iddia edilir, bu yüzden Q doğru olmalı" şeklinde özetlenebilir. Modus ponens, başka bir geçerli argüman biçimi olan modus tollens ile yakından ilgilidir.):
+
+
+
+
+**36. Remark: it takes linear time to apply this rule, as each application generate a clause that contains a single propositional symbol.**
+
+⟶ Not: Her uygulama tek bir önermeli sembol içeren bir cümle oluşturduğundan, bu kuralın uygulanması doğrusal bir zaman alır.
+
+
+
+
+**37. Completeness ― Modus ponens is complete with respect to Horn clauses if we suppose that KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus ponens will then derive p.**
+
+⟶ Tamlık ― KB'nin yalnızca Horn cümleleri içerdiğini ve p'nin gerektirilen bir önerme sembolü olduğunu varsayarsak, modus ponens Horn cümlelerine göre tamdır. Bu durumda modus ponens uygulanması p'yi türetir.
+
+
+
+
+**38. Conjunctive normal form ― A conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is a disjunction of atomic formulas.**
+
+⟶ Konjunktif (Birleştirici) normal form - Bir konjonktif normal form (CNF) formülü, her bir cümlenin atomik formüllerin bir ayrıntısı olduğu cümle birleşimidir.
+
+
+
+
+**39. Remark: in other words, CNFs are ∧ of ∨.**
+
+⟶ Not: başka bir deyişle, CNF'ler ∨'lerin ∧'sidir.
+
+
+
+
+**40. Equivalent representation ― Every formula in propositional logic can be written into an equivalent CNF formula. The table below presents general conversion properties:**
+
+⟶ Eşdeğer temsil - Önerme mantığındaki her formül eşdeğer bir CNF formülüne yazılabilir. Aşağıdaki tabloda genel dönüşüm özellikleri gösterilmektedir:
+
+
+
+
+**41. [Rule name, Initial, Converted, Eliminate, Distribute, over]**
+
+⟶ [Kural adı, Başlangıç, Dönüştürülmüş, Eleme, Dağıtma, üzerine]
+
+
+
+
+**42. Resolution rule ― For propositional symbols f1,...,fn, and g1,...,gm as well as p, the resolution rule is written:**
+
+⟶ Çözünürlük kuralı - f1, ..., fn ve g1, ..., gm önerme sembolleri için, p, çözümleme kuralı yazılır:
+
+
+
+
+**43. Remark: it can take exponential time to apply this rule, as each application generates a clause that has a subset of the propositional symbols.**
+
+⟶ Not: Her uygulama, önerme sembollerinin bir alt kümesine sahip bir cümle oluşturduğundan, bu kuralı uygulamak üstel zaman alabilir.
+
+
+
+
+**44. [Resolution-based inference ― The resolution-based inference algorithm follows the following steps:, Step 1: Convert all formulas into CNF, Step 2: Repeatedly apply resolution rule, Step 3: Return unsatisfiable if and only if False, is derived]**
+
+⟶ [Çözünürlük tabanlı çıkarım - Çözünürlük tabanlı çıkarım algoritması aşağıdaki adımları izler:, Adım 1: Tüm formülleri CNF'ye dönüştürün, Adım 2: Çözünürlük kuralını tekrar tekrar uygulayın, Adım 3: Yalnızca ve yalnızca False türetilirse gerçeklenemez döndürün]
+
+
+
+
+**45. First-order logic**
+
+⟶ Birinci dereceden mantık
+
+
+
+
+**46. The idea here is to use variables to yield more compact knowledge representations.**
+
+⟶ Buradaki fikir, daha kompakt bilgi sunumları sağlamak için değişkenleri kullanmaktır.
+
+
+
+
+**47. [Model ― A model w in first-order logic maps:, constant symbols to objects, predicate symbols to tuple of objects]**
+
+⟶ [Model ― Birinci dereceden mantıkta bir w modeli şunları eşler:, sabit sembolleri nesnelere, yüklem sembollerini nesne demetlerine]
+
+
+
+
+**48. Horn clause ― By noting x1,...,xn variables and a1,...,ak,b atomic formulas, the first-order logic version of a horn clause has the form:**
+
+⟶ Horn cümlesi - x1, ..., xn değişkenleri ve a1, ..., ak, b atomik formüllerini belirterek, bir Horn cümlesinin birinci dereceden mantık versiyonu aşağıdaki şekildedir:
+
+
+
+
+**49. Substitution ― A substitution θ maps variables to terms and Subst[θ,f] denotes the result of substitution θ on f.**
+
+⟶ Yer değiştirme - Bir yer değiştirme θ, değişkenleri terimlerle eşler ve Subst[θ,f], θ yer değiştirmesinin f üzerindeki sonucunu belirtir.
+
+
+
+
+**50. Unification ― Unification takes two formulas f and g and returns the most general substitution θ that makes them equal:**
+
+⟶ Birleştirme ― Birleştirme f ve g'nin iki formülünü alır ve onları eşit yapan en genel ikameyi θ verir:
+
+
+
+
+**51. such that**
+
+⟶ öyle ki
+
+
+
+
+**52. Note: Unify[f,g] returns Fail if no such θ exists.**
+
+⟶ Not: Unify[f,g], eğer böyle bir θ yoksa Fail döndürür.
+
+
+
+
+**53. Modus ponens ― By noting x1,...,xn variables, a1,...,ak and a′1,...,a′k atomic formulas and by calling θ=Unify(a′1∧...∧a′k,a1∧...∧ak) the first-order logic version of modus ponens can be written:**
+
+⟶ Modus ponens ― x1, ..., xn değişkenleri, a1, ..., ak ve a′1, ..., a′k atomik formüllerine dikkat ederek ve θ=Unify(a′1∧...∧a′k,a1∧...∧ak) modus ponenlerin birinci dereceden mantık versiyonu yazılabilir:
+
+
+
+
+**54. Completeness ― Modus ponens is complete for first-order logic with only Horn clauses.**
+
+⟶ Tamlık - Modus ponens sadece Horn cümleleriyle birinci dereceden mantık için tamamlanmıştır.
+
+
+
+
+**55. Resolution rule ― By noting f1,...,fn, g1,...,gm, p, q formulas and by calling θ=Unify(p,q), the first-order logic version of the resolution rule can be written:**
+
+⟶ Çözünürlük kuralı ― f1,...,fn,g1,...,gm, p, q formüllerini not ederek ve θ=Unify(p,q) ifadesini kullanarak, çözümleme kuralının birinci dereceden mantık sürümü yazılabilir. :
+
+
+
+
+**56. [Semi-decidability ― First-order logic, even restricted to only Horn clauses, is semi-decidable., if KB⊨f, forward inference on complete inference rules will prove f in finite time, if KB⊭f, no algorithm can show this in finite time]**
+
+⟶ [Yarı-karar verilebilirlik ― Birinci dereceden mantık, sadece Horn cümleleriyle sınırlı olsa bile, yarı-karar verilebilirdir., eğer KB⊨f ise, tam çıkarım kuralları üzerinde ileri çıkarım f'yi sonlu zamanda kanıtlar, eğer KB⊭f ise, hiçbir algoritma bunu sonlu zamanda gösteremez]
+
+
+
+
+**57. [Basics, Notations, Model, Interpretation function, Set of models]**
+
+⟶ [Temeller, Notasyon, Model, Yorumlama fonksiyonu, Modellerin kümesi]
+
+
+
+
+**58. [Knowledge base, Definition, Probabilistic interpretation, Satisfiability, Relationship with formulas, Forward inference, Rule properties]**
+
+⟶ [Bilgi temeli, Tanım, Olasılıksal yorumlama, Gerçeklenebilirlik, Formüllerle İlişki, İleri çıkarım, Kural özellikleri]
+
+
+
+
+**59. [Propositional logic, Clauses, Modus ponens, Conjunctive normal form, Representation equivalence, Resolution]**
+
+⟶ [Önerme mantığı, Cümleler, Modus ponens, Eşlenik (Conjunctive) normal form, Temsil eşdeğeri, Çözüm]
+
+
+
+
+**60. [First-order logic, Substitution, Unification, Resolution rule, Modus ponens, Resolution, Semi-decidability]**
+
+⟶ [Birinci derece mantık, Değiştirme, Birleştirme, Çözünürlük kuralı, Modus ponens, Çözünürlük, Yarı-karar verilebilirlik]
+
+
+
+
+**61. View PDF version on GitHub**
+
+⟶ GitHub'da PDF sürümünü görüntüleyin
+
+
+
+
+**62. Original authors**
+
+⟶ Orijinal yazarlar
+
+
+
+
+**63. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından çevrilmiştir
+
+
+
+
+**64. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından gözden geçirilmiştir
+
+
+
+
+**65. By X and Y**
+
+⟶ X ve Y tarafından
+
+
+
+
+**66. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Yapay Zeka el kitabı şimdi [Türkçe] mevcuttur.
diff --git a/tr/cs-221-reflex-models.md b/tr/cs-221-reflex-models.md
new file mode 100644
index 000000000..e1aea4a79
--- /dev/null
+++ b/tr/cs-221-reflex-models.md
@@ -0,0 +1,538 @@
+**Reflex-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-reflex-models)
+
+
+
+**1. Reflex-based models with Machine Learning**
+
+⟶ Makine Öğrenmesi ile Refleks-temelli modeller
+
+
+
+
+**2. Linear predictors**
+
+⟶ Doğrusal öngörücüler
+
+
+
+
+**3. In this section, we will go through reflex-based models that can improve with experience, by going through samples that have input-output pairs.**
+
+⟶ Bu bölümde, girdi-çıktı çiftleri olan örneklerden geçerek, deneyim ile gelişebilecek refleks-temelli modelleri göreceğiz.
+
+
+
+
+**4. Feature vector ― The feature vector of an input x is noted ϕ(x) and is such that:**
+
+⟶ Öznitelik vektörü ― x girişinin öznitelik vektörü ϕ (x) olarak not edilir ve şöyledir:
+
+
+
+
+**5. Score ― The score s(x,w) of an example (ϕ(x),y)∈Rd×R associated to a linear model of weights w∈Rd is given by the inner product:**
+
+⟶ Puan - w∈Rd ağırlıklı doğrusal bir modelle ilişkili bir (ϕ(x),y)∈Rd×R örneğinin s(x,w) puanı iç çarpım ile verilir:
+
+
+
+
+**6. Classification**
+
+⟶ Sınıflandırma
+
+
+
+
+**7. Linear classifier ― Given a weight vector w∈Rd and a feature vector ϕ(x)∈Rd, the binary linear classifier fw is given by:**
+
+⟶ Doğrusal sınıflandırıcı - Bir ağırlık vektörü w∈Rd ve bir öznitelik vektörü ϕ(x)∈Rd verildiğinde, ikili doğrusal sınıflandırıcı fw şöyle verilir:
+
+
+
+
+**8. if**
+
+⟶
+
+
+Eğer
+
+
+**9. Margin ― The margin m(x,y,w)∈R of an example (ϕ(x),y)∈Rd×{−1,+1} associated to a linear model of weights w∈Rd quantifies the confidence of the prediction: larger values are better. It is given by:**
+
+⟶ Marj ― (ϕ(x),y)∈Rd×{−1,+1} örneğinin m(x,y,w)∈R marjları w∈Rd doğrusal ağırlık modeliyle ilişkili olarak, tahminin güvenirliği ölçülür: daha büyük değerler daha iyidir. Şöyle ifade edilir:
+
+
+
+
+**10. Regression**
+
+⟶ Bağlanım (Regression)
+
+
+
+
+**11. Linear regression ― Given a weight vector w∈Rd and a feature vector ϕ(x)∈Rd, the output of a linear regression of weights w denoted as fw is given by:**
+
+⟶ Doğrusal bağlanım (Linear regression) - Bir ağırlık vektörü w∈Rd ve bir öznitelik vektörü ϕ(x)∈Rd verildiğinde, fw olarak belirtilen w ağırlıklı doğrusal bağlanımın çıktısı şöyle verilir:
+
+
+
+
+**12. Residual ― The residual res(x,y,w)∈R is defined as being the amount by which the prediction fw(x) overshoots the target y:**
+
+⟶ Artık (Residual) - Artık res(x,y,w)∈R, fw(x) tahmininin y hedefini aştığı miktar olarak tanımlanır:
+
+
+
+
+**13. Loss minimization**
+
+⟶ Kayıp/Yitim minimizasyonu
+
+
+
+
+**14. Loss function ― A loss function Loss(x,y,w) quantifies how unhappy we are with the weights w of the model in the prediction task of output y from input x. It is a quantity we want to minimize during the training process.**
+
+⟶ Kayıp fonksiyonu - Kayıp fonksiyonu Loss(x,y,w), x girdisinden y çıktısını öngörme görevinde modelin w ağırlıklarından ne kadar memnuniyetsiz olduğumuzu ölçer. Eğitim sürecinde en aza indirmek istediğimiz bir niceliktir.
+
+
+
+
+**15. Classification case - The classification of a sample x of true label y∈{−1,+1} with a linear model of weights w can be done with the predictor fw(x)≜sign(s(x,w)). In this situation, a metric of interest quantifying the quality of the classification is given by the margin m(x,y,w), and can be used with the following loss functions:**
+
+⟶ Sınıflandırma durumu - Doğru etiketi y∈{−1,+1} olan bir x örneğinin w ağırlıklı doğrusal bir modelle sınıflandırılması fw(x)≜sign(s(x,w)) öngörücüsü ile yapılabilir. Bu durumda, sınıflandırma kalitesini ölçen ilgili ölçüt m(x,y,w) marjı ile verilir ve aşağıdaki kayıp fonksiyonlarıyla birlikte kullanılabilir:
+
+
+
+
+**16. [Name, Illustration, Zero-one loss, Hinge loss, Logistic loss]**
+
+⟶ [Ad, Görselleştirme, Sıfır-bir kaybı, Menteşe kaybı, Lojistik kayıp]
+
+
+
+
+**17. Regression case - The prediction of a sample x of true label y∈R with a linear model of weights w can be done with the predictor fw(x)≜s(x,w). In this situation, a metric of interest quantifying the quality of the regression is given by the residual res(x,y,w) and can be used with the following loss functions:**
+
+⟶ Regresyon durumu - Doğru etiketi y∈R olan bir x örneğinin w ağırlıklı doğrusal bir modelle öngörülmesi fw(x)≜s(x,w) öngörücüsü ile yapılabilir. Bu durumda, regresyonun kalitesini ölçen ilgili ölçüt res(x,y,w) artığı ile verilir ve aşağıdaki kayıp fonksiyonlarıyla birlikte kullanılabilir:
+
+
+
+
+**18. [Name, Squared loss, Absolute deviation loss, Illustration]**
+
+⟶ [Ad, Kareler kaybı, Mutlak sapma kaybı, Görselleştirme]
+
+
+
+
+**19. Loss minimization framework ― In order to train a model, we want to minimize the training loss, which is defined as follows:**
+
+⟶ Kayıp minimize etme çerçevesi (framework) - Bir modeli eğitmek için, aşağıdaki gibi tanımlanan eğitim kaybını en aza indirmek istiyoruz:
+
+
+
+
+**20. Non-linear predictors**
+
+⟶ Doğrusal olmayan öngörücüler
+
+
+
+
+**21. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶ k-en yakın komşu - Yaygın olarak k-NN olarak bilinen k-en yakın komşu algoritması, bir veri noktasının tepkisinin eğitim kümesinden k komşularının yapısı tarafından belirlendiği parametrik olmayan bir yaklaşımdır. Hem sınıflandırma hem de regresyon ayarlarında kullanılabilir.
+
+
+
+
+**22. Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶ Not: k parametresi ne kadar yüksekse, önyargı (bias) o kadar yüksek ve k parametresi ne kadar düşükse, varyans o kadar yüksek olur.
+
+
+
+
+**23. Neural networks ― Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks. The vocabulary around neural networks architectures is described in the figure below:**
+
+⟶ Yapay sinir ağları - Yapay sinir ağları katmanlarla oluşturulmuş bir model sınıfıdır. Yaygın olarak kullanılan sinir ağları, evrişimli ve tekrarlayan sinir ağlarını içerir. Yapay sinir ağları mimarisi etrafındaki kelime bilgisi aşağıdaki şekilde tanımlanmıştır:
+
+
+
+
+**24. [Input layer, Hidden layer, Output layer]**
+
+⟶ [Giriş katmanı, Gizli katman, Çıkış katmanı]
+
+
+
+
+**25. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶ i, ağın i. katmanı ve j, katmanın j. gizli birimi olacak şekilde aşağıdaki gibi ifade edilir:
+
+
+
+
+**26. where we note w, b, x, z the weight, bias, input and non-activated output of the neuron respectively.**
+
+⟶ burada w, b, x, z sırasıyla nöronun ağırlığını, önyargısını (bias), girdisini ve aktive edilmemiş çıkışını ifade eder.
+
+
+
+
+**27. For a more detailed overview of the concepts above, check out the Supervised Learning cheatsheets!**
+
+⟶ Yukarıdaki kavramlara daha ayrıntılı bir bakış için, Gözetimli Öğrenme el kitabına göz atın!
+
+
+
+
+**28. Stochastic gradient descent**
+
+⟶ Stokastik gradyan inişi (Bayır inişi)
+
+
+
+
+**29. Gradient descent ― By noting η∈R the learning rate (also called step size), the update rule for gradient descent is expressed with the learning rate and the loss function Loss(x,y,w) as follows:**
+
+⟶ Gradyan inişi (Bayır inişi) - η∈R öğrenme oranını (aynı zamanda adım boyutu olarak da bilinir) dikkate alınarak, gradyan inişine ilişkin güncelleme kuralı, öğrenme oranı ve Loss(x,y,w) kayıp fonksiyonu ile aşağıdaki şekilde ifade edilir:
+
+
+
+
+**30. Stochastic updates ― Stochastic gradient descent (SGD) updates the parameters of the model one training example (ϕ(x),y)∈Dtrain at a time. This method leads to sometimes noisy, but fast updates.**
+
+⟶ Stokastik güncellemeler - Stokastik gradyan inişi (SGİ / SGD), modelin parametrelerini her seferinde bir eğitim örneği (ϕ(x),y)∈Dtrain kullanarak günceller. Bu yöntem bazen gürültülü, ancak hızlı güncellemelere yol açar.
+
+
+
+
+**31. Batch updates ― Batch gradient descent (BGD) updates the parameters of the model one batch of examples (e.g. the entire training set) at a time. This method computes stable update directions, at a greater computational cost.**
+
+⟶ Yığın/küme güncellemeler - Yığın gradyan inişi (YGİ / BGD), modelin parametrelerini her seferinde bir grup örnek (örneğin, tüm eğitim kümesi) kullanarak günceller. Bu yöntem, daha yüksek bir hesaplama maliyetiyle kararlı güncelleme yönleri hesaplar.
+
+
+
+
+**32. Fine-tuning models**
+
+⟶ Modellerin ince ayarı (Fine-tuning)
+
+
+
+
+**33. Hypothesis class ― A hypothesis class F is the set of possible predictors with a fixed ϕ(x) and varying w:**
+
+⟶ Hipotez sınıfı - Bir hipotez sınıfı F, sabit bir ϕ (x) ve değişken w ile olası öngörücü kümesidir:
+
+
+
+
+**34. Logistic function ― The logistic function σ, also called the sigmoid function, is defined as:**
+
+⟶ Lojistik fonksiyon - Ayrıca sigmoid fonksiyon olarak da adlandırılan lojistik fonksiyon σ, şöyle tanımlanır:
+
+
+
+
+**35. Remark: we have σ′(z)=σ(z)(1−σ(z)).**
+
+⟶ Not: σ′(z)=σ(z)(1−σ(z)) şeklinde ifade edilir.
+
+
+
+
+**36. Backpropagation ― The forward pass is done through fi, which is the value for the subexpression rooted at i, while the backward pass is done through gi=∂out/∂fi and represents how fi influences the output.**
+
+⟶ Geri yayılım - İleriye geçiş, i'de köklenen alt ifadenin değeri olan fi ile yapılırken, geriye doğru geçiş gi=∂out/∂fi aracılığıyla yapılır ve fi'nin çıkışı nasıl etkilediğini gösterir.
+
+
+
+
+**37. Approximation and estimation error ― The approximation error ϵapprox represents how far the entire hypothesis class F is from the target predictor g∗, while the estimation error ϵest quantifies how good the predictor ^f is with respect to the best predictor f∗ of the hypothesis class F.**
+
+⟶ Yaklaşım ve kestirim hatası - Yaklaşım hatası ϵapprox, tüm F hipotez sınıfının hedef öngörücü g∗'den ne kadar uzak olduğunu gösterirken, kestirim hatası ϵest, ^f öngörücüsünün F hipotez sınıfının en iyi öngörücüsü f∗'a göre ne kadar iyi olduğunu ölçer.
+
+
+
+**38. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶ Düzenlileştirme (Regularization) - Düzenlileştirme prosedürü, modelin verileri aşırı öğrenmesini (overfit) önlemeyi amaçlar ve böylece yüksek varyans sorunlarıyla ilgilenir. Aşağıdaki tablo, yaygın olarak kullanılan düzenlileştirme tekniklerinin farklı türlerini özetlemektedir:
+
+
+
+
+**39. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Katsayıları 0'a düşürür, Değişken seçimi için iyi, Katsayıları daha küçük yapar, Değişken seçimi ile küçük katsayılar arasında ödünleşim]
+
+
+
+
+**40. Hyperparameters ― Hyperparameters are the properties of the learning algorithm, and include features, regularization parameter λ, number of iterations T, step size η, etc.**
+
+⟶ Hiperparametreler - Hiperparametreler öğrenme algoritmasının özellikleridir; öznitelikleri, düzenlileştirme parametresi λ'yı, yineleme sayısı T'yi, adım büyüklüğü η'yı vb. içerir.
+
+
+
+
+**41. Sets vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶ Kümeler - Bir model seçerken, veriyi aşağıdaki gibi 3 farklı parçaya ayırırız:
+
+
+
+
+**42. [Training set, Validation set, Testing set]**
+
+⟶ [Eğitim kümesi, Doğrulama kümesi, Test kümesi]
+
+
+
+
+**43. [Model is trained, Usually 80% of the dataset, Model is assessed, Usually 20% of the dataset, Also called hold-out or development set, Model gives predictions, Unseen data]**
+
+⟶ [Model eğitilir, Veri kümesinin genellikle %80'i, Model değerlendirilir, Veri kümesinin genellikle %20'si, Ayrıca tutma veya geliştirme kümesi olarak da adlandırılır, Model tahminlerini verir, Görünmeyen veriler]
+
+
+
+
+**44. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶ Model seçildikten sonra, tüm veri kümesi üzerinde eğitilir ve görünmeyen test kümesinde test edilir. Bunlar aşağıdaki şekilde gösterilmektedir:
+
+
+
+
+**45. [Dataset, Unseen data, train, validation, test]**
+
+⟶ [Veri kümesi, Görünmeyen veriler, eğitim, doğrulama, test]
+
+
+
+
+**46. For a more detailed overview of the concepts above, check out the Machine Learning tips and tricks cheatsheets!**
+
+⟶ Yukarıdaki kavramlara daha ayrıntılı bir bakış için, Makine Öğrenmesi ipuçları ve püf noktaları el kitabına göz atın!
+
+
+
+
+**47. Unsupervised Learning**
+
+⟶ Gözetimsiz Öğrenme
+
+
+
+
+**48. The class of unsupervised learning methods aims at discovering the structure of the data, which may have rich latent structures.**
+
+⟶ Gözetimsiz öğrenme yöntemlerinin sınıfı, zengin gizli yapılara sahip olabilecek verilerin yapısını keşfetmeyi amaçlamaktadır.
+
+
+
+
+**49. k-means**
+
+⟶ k-ortalama
+
+
+
+
+**50. Clustering ― Given a training set of input points Dtrain, the goal of a clustering algorithm is to assign each point ϕ(xi) to a cluster zi∈{1,...,k}**
+
+⟶ Kümeleme - Dtrain giriş noktalarından oluşan bir eğitim kümesi göz önüne alındığında, kümeleme algoritmasının amacı, her bir ϕ(xi) noktasını zi∈{1,...,k} kümesine atamaktır.
+
+
+
+
+**51. Objective function ― The loss function for one of the main clustering algorithms, k-means, is given by:**
+
+⟶ Amaç fonksiyonu - Ana kümeleme algoritmalarından biri olan k-ortalama için kayıp fonksiyonu şöyle ifade edilir:
+
+
+
+
+**52. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶ Algoritma - Küme merkezlerini μ1,μ2,...,μk∈Rn kümesini rasgele başlattıktan sonra, k-ortalama algoritması yakınsayana kadar aşağıdaki adımı tekrarlar:
+
+
+
+
+**53. and**
+
+⟶ ve
+
+
+
+
+**54. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶ [Ortalamaların başlatılması, Küme ataması, Ortalamaların güncellenmesi, Yakınsama]
+
+
+
+
+**55. Principal Component Analysis**
+
+⟶ Temel Bileşenler Analizi
+
+
+
+
+**56. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ Özdeğer, özvektör - Bir A∈Rn×n matrisi verildiğinde, özvektör olarak adlandırılan bir z∈Rn∖{0} vektörü aşağıdaki eşitliği sağlayacak şekilde mevcutsa, λ'nın A'nın bir özdeğeri olduğu söylenir:
+
+
+
+
+**57. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ Spektral teorem - A∈Rn×n olsun. A simetrik ise, A gerçek bir ortogonal matris U∈Rn×n ile köşegenleştirilebilir. Λ=diag(λ1,...,λn) olarak gösterilirse:
+
+
+
+
+**58. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+⟶ Not: En büyük özdeğerle ilişkilendirilen özvektör, A matrisinin temel özvektörüdür.
+
+
+
+
+**59. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
+
+⟶ Algoritma - Temel Bileşenler Analizi (PCA) prosedürü, verilerin varyansını en üst düzeye çıkararak verileri k boyuta izdüşüren bir boyut küçültme tekniğidir:
+
+
+
+
+**60. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶ Adım 1: Verileri ortalama 0 ve 1 standart sapma olacak şekilde normalize edin.
+
+
+
+
+**61. [where, and]**
+
+⟶ [koşul, ve]
+
+
+
+
+**62. [Step 2: Compute Σ=(1/m)∑_{i=1}^{m}ϕ(xi)ϕ(xi)^T∈Rn×n, which is symmetric with real eigenvalues., Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues., Step 4: Project the data on spanR(u1,...,uk).]**
+
+⟶ [Adım 2: Σ=(1/m)∑_{i=1}^{m}ϕ(xi)ϕ(xi)^T∈Rn×n matrisini hesaplayın; bu matris simetriktir ve özdeğerleri gerçektir., Adım 3: Σ'nın k ortogonal temel özvektörünü, yani en büyük k özdeğere karşılık gelen ortogonal özvektörleri u1,...,uk∈Rn hesaplayın., Adım 4: Verileri spanR(u1,...,uk) üzerine izdüşürün.]
+
+
+
+
+**63. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶ Bu prosedür, tüm k boyutlu uzaylar arasında varyansı en üst düzeye çıkarır.
+
+
+
+
+**64. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶ [Öznitelik uzayındaki veriler, Temel bileşenleri bulma, Temel bileşenler uzayındaki veriler]
+
+
+
+
+**65. For a more detailed overview of the concepts above, check out the Unsupervised Learning cheatsheets!**
+
+⟶ Yukarıdaki kavramlara daha ayrıntılı bir genel bakış için, Gözetimsiz Öğrenme el kitaplarına göz atın!
+
+
+
+
+**66. [Linear predictors, Feature vector, Linear classifier/regression, Margin]**
+
+⟶ [Doğrusal öngörücüler, Öznitelik vektörü, Doğrusal sınıflandırıcı/regresyon, Marj]
+
+
+
+
+**67. [Loss minimization, Loss function, Framework]**
+
+⟶ [Kayıp minimizasyonu, Kayıp fonksiyonu, Çerçeve (Framework)]
+
+
+
+
+**68. [Non-linear predictors, k-nearest neighbors, Neural networks]**
+
+⟶ [Doğrusal olmayan öngörücüler, k-en yakın komşular, Yapay sinir ağları]
+
+
+
+
+**69. [Stochastic gradient descent, Gradient, Stochastic updates, Batch updates]**
+
+⟶ [Stokastik gradyan inişi (Bayır inişi), Gradyan, Stokastik güncellemeler, Yığın/Küme (Batch) güncellemeler]
+
+
+
+
+**70. [Fine-tuning models, Hypothesis class, Backpropagation, Regularization, Sets vocabulary]**
+
+⟶ [Modellerin ince ayarı, Hipotez sınıfı, Geri yayılım, Düzenlileştirme (Regularization), Kümeler]
+
+
+
+
+**71. [Unsupervised Learning, k-means, Principal components analysis]**
+
+⟶ [Gözetimsiz Öğrenme, k-ortalama, Temel bileşenler analizi]
+
+
+
+
+**72. View PDF version on GitHub**
+
+⟶ GitHub'da PDF sürümünü görüntüleyin
+
+
+
+
+**73. Original authors**
+
+⟶ Orijinal yazarlar
+
+
+
+
+**74. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından çevrilmiştir
+
+
+
+
+**75. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından gözden geçirilmiştir
+
+
+
+
+**76. By X and Y**
+
+⟶ X ve Y tarafından
+
+
+
+
+**77. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Yapay Zeka el kitabı şimdi [hedef dilde] mevcuttur.
diff --git a/tr/cs-221-states-models.md b/tr/cs-221-states-models.md
new file mode 100644
index 000000000..bceddce2b
--- /dev/null
+++ b/tr/cs-221-states-models.md
@@ -0,0 +1,980 @@
+**States-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-states-models)
+
+
+
+**1. States-based models with search optimization and MDP**
+
+⟶ Arama optimizasyonu ve Markov karar sürecine (MDP) sahip durum-temelli modeller
+
+
+
+
+**2. Search optimization**
+
+⟶ Arama optimizasyonu
+
+
+
+
+**3. In this section, we assume that by accomplishing action a from state s, we deterministically arrive in state Succ(s,a). The goal here is to determine a sequence of actions (a1,a2,a3,a4,...) that starts from an initial state and leads to an end state. In order to solve this kind of problem, our objective will be to find the minimum cost path by using states-based models.**
+
+⟶ Bu bölümde, s durumunda a eylemini gerçekleştirdiğimizde, deterministik olarak Succ(s,a) durumuna varacağımızı varsayıyoruz. Burada amaç, bir başlangıç durumundan başlayıp bir bitiş durumuna götüren bir eylem dizisi (a1,a2,a3,a4,...) belirlemektir. Bu tür bir problemi çözmek için amacımız, durum-temelli modelleri kullanarak asgari (minimum) maliyet yolunu bulmak olacaktır.
+
+
+
+
+**4. Tree search**
+
+⟶ Ağaç arama
+
+
+
+
+**5. This category of states-based algorithms explores all possible states and actions. It is quite memory efficient, and is suitable for huge state spaces but the runtime can become exponential in the worst cases.**
+
+⟶ Bu durum-temelli algoritmalar, olası bütün durum ve eylemleri araştırırlar. Oldukça bellek verimli ve büyük durum uzayları için uygundurlar ancak çalışma zamanı en kötü durumlarda üstel olabilir.
+
+
+
+
+**6. [Self-loop, More than a parent, Cycle, More than a root, Valid tree]**
+
+⟶ [Kendinden-Döngü(Self-loop), Bir ebeveynden (parent) daha fazlası, Çevrim, Bir kökten daha fazlası, Geçerli ağaç]
+
+
+
+
+**7. [Search problem ― A search problem is defined with:, a starting state sstart, possible actions Actions(s) from state s, action cost Cost(s,a) from state s with action a, successor Succ(s,a) of state s after action a, whether an end state was reached IsEnd(s)]**
+
+⟶ [Arama problemi ― Bir arama problemi aşağıdaki şekilde tanımlanmaktadır:, bir başlangıç durumu sstart, s durumunda gerçekleşebilecek olası eylemler Actions(s), s durumunda gerçekleşen a eyleminin eylem maliyeti Cost(s,a), a eyleminden sonraki varılacak durum Succ(s,a), son duruma ulaşılıp ulaşılamadığı IsEnd(s)]
+
+
+
+
+**8. The objective is to find a path that minimizes the cost.**
+
+⟶ Amaç, maliyeti en aza indiren bir yol bulmaktır.
+
+
+
+
+**9. Backtracking search ― Backtracking search is a naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.**
+
+⟶ Geri izleme araması ― Geri izleme araması, asgari (minimum) maliyet yolunu bulmak için tüm olasılıkları deneyen saf (naive) bir özyinelemeli algoritmadır. Burada, eylem maliyetleri pozitif ya da negatif olabilir.
+
+
+
+
+**10. Breadth-first search (BFS) ― Breadth-first search is a graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant c⩾0.**
+
+⟶ Genişlik öncelikli arama (Breadth-first search-BFS) ― Genişlik öncelikli arama, seviye seviye arama yapan bir çizge arama algoritmasıdır. Gelecekte her adımda ziyaret edilecek düğümleri tutan bir kuyruk yardımıyla yinelemeli olarak gerçekleyebiliriz. Bu algoritma için, eylem maliyetlerinin belirli bir sabite c⩾0 eşit olduğunu kabul edebiliriz.
+
+
+
+
+**11. Depth-first search (DFS) ― Depth-first search is a search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.**
+
+⟶ Derinlik öncelikli arama (Depth-first search-DFS) ― Derinlik öncelikli arama, her bir yolu olabildiğince derin bir şekilde takip ederek çizgeyi dolaşan bir arama algoritmasıdır. Bu algoritmayı özyinelemeli (recursively) olarak ya da her adımda gelecekte ziyaret edilecek düğümleri saklayan bir yığın yardımıyla tekrarlı (iteratively) olarak uygulayabiliriz. Bu algoritma için eylem maliyetlerinin 0 olduğu varsayılmaktadır.
+
+
+
+
+**12. Iterative deepening ― The iterative deepening trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant c⩾0.**
+
+⟶ Tekrarlı derinleşme ― Tekrarlı derinleşme hilesi, derinlik öncelikli arama algoritmasının belirli bir derinliğe ulaştıktan sonra duracak şekilde değiştirilmiş halidir; bu da tüm eylem maliyetleri eşit olduğunda en iyiliği (optimality) garanti eder. Burada, eylem maliyetlerinin c⩾0 gibi sabit bir değere eşit olduğunu varsayıyoruz.
+
+
+
+
+**13. Tree search algorithms summary ― By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:**
+
+⟶ Ağaç arama algoritmaları özeti ― b durum başına eylem sayısını, d çözüm derinliğini ve D en yüksek (maksimum) derinliği ifade ederse, o zaman:
+
+
+
+
+**14. [Algorithm, Action costs, Space, Time]**
+
+⟶ [Algoritma, Eylem maliyetleri, Alan, Zaman]
+
+
+
+
+**15. [Backtracking search, any, Breadth-first search, Depth-first search, DFS-Iterative deepening]**
+
+⟶ [Geri izleme araması, herhangi, Genişlik öncelikli arama, Derinlik öncelikli arama, DFS - Tekrarlı derinleşme]
+
+
+
+
+**16. Graph search**
+
+⟶ Çizge arama
+
+
+
+
+**17. This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.**
+
+⟶ Bu durum-temelli algoritmalar kategorisi, üstel tasarruf sağlayan en iyi (optimal) yolları oluşturmayı amaçlar. Bu bölümde, dinamik programlama ve tek tip maliyet araması üzerinde duracağız.
+
+
+
+
+**18. Graph ― A graph is comprised of a set of vertices V (also called nodes) as well as a set of edges E (also called links).**
+
+⟶ Çizge ― Bir çizge, V köşeler (düğüm olarak da adlandırılır) kümesi ile E kenarlar (bağlantı olarak da adlandırılır) kümesinden oluşur.
+
+
+
+
+**19. Remark: a graph is said to be acyclic when there is no cycle.**
+
+⟶ Not: çevrim olmadığında, bir çizgenin asiklik (çevrimsiz) olduğu söylenir.
+
+
+
+
+**20. State ― A state is a summary of all past actions sufficient to choose future actions optimally.**
+
+⟶ Durum ― Bir durum gelecekteki eylemleri en iyi (optimal) şekilde seçmek için, yeterli tüm geçmiş eylemlerin özetidir.
+
+
+
+
+**21. Dynamic programming ― Dynamic programming (DP) is a backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state s to an end state send. It can potentially have exponential savings compared to traditional graph search algorithms, and has the property to only work for acyclic graphs. For any given state s, the future cost is computed as follows:**
+
+⟶ Dinamik programlama ― Dinamik programlama (DP), amacı s durumundan bitiş durumu olan send'e kadar asgari(minimum) maliyet yolunu bulmak olan hatırlamalı (memoization) (başka bir deyişle kısmi sonuçlar kaydedilir) bir geri izleme (backtracking) arama algoritmasıdır. Geleneksel çizge arama algoritmalarına kıyasla üstel olarak tasarruf sağlayabilir ve yalnızca asiklik (çevrimsiz) çizgeler ile çalışma özelliğine sahiptir. Herhangi bir durum için gelecekteki maliyet aşağıdaki gibi hesaplanır:
+
+
+
+
+**22. [if, otherwise]**
+
+⟶ [eğer, aksi taktirde]
+
+
+
+
+**23. Remark: the figure above illustrates a bottom-to-top approach whereas the formula provides the intuition of a top-to-bottom problem resolution.**
+
+⟶ Not: Yukarıdaki şekil, aşağıdan yukarıya bir yaklaşımı sergilerken, formül ise yukarıdan aşağıya bir önsezi ile problem çözümü sağlar.
+
+
+
+
+**24. Types of states ― The table below presents the terminology when it comes to states in the context of uniform cost search:**
+
+⟶ Durum türleri ― Tek tip maliyet araması bağlamındaki durumlara ilişkin terminoloji aşağıdaki tabloda sunulmaktadır:
+
+
+
+
+**25. [State, Explanation]**
+
+⟶ [Durum, Açıklama]
+
+
+
+
+**26. [Explored, Frontier, Unexplored]**
+
+⟶ [Keşfedilmiş, Sırada (Frontier), Keşfedilmemiş]
+
+
+
+
+**27. [States for which the optimal path has already been found, States seen for which we are still figuring out how to get there with the cheapest cost, States not seen yet]**
+
+⟶ [En iyi (optimal) yolun daha önce bulunduğu durumlar, Görülen ancak hala en ucuza nasıl gidileceği hesaplanmaya çalışılan durumlar, Daha önce görülmeyen durumlar]
+
+
+
+
+**28. Uniform cost search ― Uniform cost search (UCS) is a search algorithm that aims at finding the shortest path from a state sstart to an end state send. It explores states s in increasing order of PastCost(s) and relies on the fact that all action costs are non-negative.**
+
+⟶ Tek tip maliyet araması ― Tek tip maliyet araması (Uniform cost search - UCS), bir sstart başlangıç durumundan bir send bitiş durumuna giden en kısa yolu bulmayı amaçlayan bir arama algoritmasıdır. Bu algoritma s durumlarını artan PastCost(s) sırasıyla araştırır ve tüm eylem maliyetlerinin negatif olmadığı gerçeğine dayanır.
+
+
+
+
+**29. Remark 1: the UCS algorithm is logically equivalent to Dijkstra's algorithm.**
+
+⟶ Not 1: UCS algoritması mantıksal olarak Dijkstra algoritması ile aynıdır.
+
+
+
+
+**30. Remark 2: the algorithm would not work for a problem with negative action costs, and adding a positive constant to make them non-negative would not solve the problem since this would end up being a different problem.**
+
+⟶ Not 2: Algoritma, negatif eylem maliyetleriyle ilgili bir problem için çalışmaz ve negatif olmayan bir hale getirmek için pozitif bir sabit eklemek problemi çözmez, çünkü problem farklı bir problem haline gelmiş olur.
+
+
+
+
+**31. Correctness theorem ― When a state s is popped from the frontier F and moved to explored set E, its priority is equal to PastCost(s) which is the minimum cost path from sstart to s.**
+
+⟶ Doğruluk teoremi ― Bir s durumu sıradaki (frontier) F kümesinden çıkarılıp keşfedilmiş E kümesine taşındığında, önceliği, sstart'tan s'ye giden asgari (minimum) maliyetli yolun maliyeti olan PastCost(s)'e eşittir.
+
+
+
+
+**32. Graph search algorithms summary ― By noting N the number of total states, n of which are explored before the end state send, we have:**
+
+⟶ Çizge arama algoritmaları özeti ― N toplam durum sayısını, n ise bunlardan send bitiş durumundan önce keşfedilenlerin sayısını ifade ederse:
+
+
+
+
+**33. [Algorithm, Acyclicity, Costs, Time/space]**
+
+⟶ [Algoritma, Asiklik (Çevrimsizlik), Maliyetler, Zaman/alan]
+
+
+
+
+**34. [Dynamic programming, Uniform cost search]**
+
+⟶ [Dinamik programlama, Tek tip maliyet araması]
+
+
+
+
+**35. Remark: the complexity countdown supposes the number of possible actions per state to be constant.**
+
+⟶ Not: Karmaşıklık geri sayımı, her durum için olası eylemlerin sayısını sabit olarak kabul eder.
+
+
+
+
+**36. Learning costs**
+
+⟶ Öğrenme maliyetleri
+
+
+
+
+**37. Suppose we are not given the values of Cost(s,a) and want to estimate these quantities from a training set of minimizing-cost-path sequences of actions (a1,a2,...,ak).**
+
+⟶ Diyelim ki Cost(s,a) değerleri verilmedi ve biz bu değerleri, maliyeti en aza indiren yolların eylem dizilerinden (a1,a2,...,ak) oluşan bir eğitim kümesinden tahmin etmek istiyoruz.
+
+
+
+
+**38. [Structured perceptron ― The structured perceptron is an algorithm aiming at iteratively learning the cost of each state-action pair. At each step, it:, decreases the estimated cost of each state-action of the true minimizing path y given by the training data, increases the estimated cost of each state-action of the current predicted path y' inferred from the learned weights.]**
+
+⟶ [Yapılandırılmış algılayıcı ― Yapılandırılmış algılayıcı, her bir durum-eylem çiftinin maliyetini tekrarlı (iteratively) olarak öğrenmeyi amaçlayan bir algoritmadır. Her bir adımda, algılayıcı:, eğitim verilerinden elde edilen gerçek asgari (minimum) y yolunun her bir durum-eylem çiftinin tahmini (estimated) maliyetini azaltır, öğrenilen ağırlıklardan elde edilen şimdiki tahmini (predicted) y' yolunun her bir durum-eylem çiftinin tahmini maliyetini artırır.]
+
+
+
+
+**39. Remark: there are several versions of the algorithm, one of which simplifies the problem to only learning the cost of each action a, and the other parametrizes Cost(s,a) to a feature vector of learnable weights.**
+
+⟶ Not: Algoritmanın birkaç sürümü vardır, bunlardan biri problemi sadece her bir a eyleminin maliyetini öğrenmeye indirger, bir diğeri ise öğrenilebilir ağırlık öznitelik vektörünü, Cost(s,a)'nın parametresi haline getirir.
+
+
+
+
+**40. A* search**
+
+⟶ A* arama
+
+
+
+
+**41. Heuristic function ― A heuristic is a function h over states s, where each h(s) aims at estimating FutureCost(s), the cost of the path from s to send.**
+
+⟶ Sezgisel işlev(Heuristic function) ― Sezgisel, s durumu üzerinde işlem yapan bir h fonksiyonudur, burada her bir h(s), s ile send arasındaki yol maliyeti olan FutureCost(s)'yi tahmin etmeyi amaçlar.
+
+
+
+
+**42. Algorithm ― A∗ is a search algorithm that aims at finding the shortest path from a state s to an end state send. It explores states s in increasing order of PastCost(s)+h(s). It is equivalent to a uniform cost search with edge costs Cost′(s,a) given by:**
+
+⟶ Algoritma ― A∗, bir s durumundan bir send bitiş durumuna giden en kısa yolu bulmayı amaçlayan bir arama algoritmasıdır. s durumlarını artan PastCost(s)+h(s) sırasıyla araştırır. Kenar maliyetleri Cost′(s,a) aşağıdaki gibi verilen bir tek tip maliyet aramasına eşdeğerdir:
+
+
+
+
+**43. Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be closer to the end state.**
+
+⟶ Not: Bu algoritma, son duruma yakın olduğu tahmin edilen durumları araştıran tek tip maliyet aramasının taraflı bir sürümü olarak görülebilir.
+
+
+
+
+**44. [Consistency ― A heuristic h is said to be consistent if it satisfies the two following properties:, For all states s and actions a, The end state verifies the following:]**
+
+⟶ [Tutarlılık ― Bir sezgisel h, aşağıdaki iki özelliği sağlaması durumunda tutarlıdır denilebilir:, Bütün s durumları ve a eylemleri için, bitiş durumu aşağıdakileri doğrular:]
+
+
+
+
+**45. Correctness ― If h is consistent, then A∗ returns the minimum cost path.**
+
+⟶ Doğruluk ― Eğer h tutarlı ise o zaman A∗ algoritması asgari (minimum) maliyet yolunu döndürür.
+
+
+
+
+**46. Admissibility ― A heuristic h is said to be admissible if we have:**
+
+⟶ Kabul edilebilirlik ― Bir sezgisel h kabul edilebilirdir eğer:
+
+
+
+
+**47. Theorem ― Let h(s) be a given heuristic. We have:**
+
+⟶ Teorem ― h(s) sezgisel olsun ve:
+
+
+
+
+**48. [consistent, admissible]**
+
+⟶ [tutarlı, kabul edilebilir]
+
+
+
+
+**49. Efficiency ― A* explores all states s satisfying the following equation:**
+
+⟶ Verimlilik ― A* algoritması aşağıdaki eşitliği sağlayan bütün s durumlarını araştırır:
+
+
+
+
+**50. Remark: larger values of h(s) are better, as this equation shows they restrict the set of states s that will be explored.**
+
+⟶ Not: h(s)'nin yüksek değerleri, bu eşitliğin araştırılacak olan s durum kümesini kısıtlayacak olması nedeniyle daha iyidir.
+
+
+
+
+**51. Relaxation**
+
+⟶ Rahatlama
+
+
+
+
+**52. It is a framework for producing consistent heuristics. The idea is to find closed-form reduced costs by removing constraints and use them as heuristics.**
+
+⟶ Bu tutarlı sezgisel için bir altyapıdır (framework). Buradaki fikir, kısıtlamaları kaldırarak kapalı şekilli (closed-form) düşük maliyetler bulmak ve bunları sezgisel olarak kullanmaktır.
+
+
+
+
+**53. Relaxed search problem ― The relaxation of search problem P with costs Cost is noted Prel with costs Costrel, and satisfies the identity:**
+
+⟶ Rahat arama problemi (Relaxed search problem) ― Cost maliyetli bir P arama probleminin rahatlaması, Costrel maliyetli Prel ile ifade edilir ve şu özdeşliği sağlar:
+
+
+
+
+**54. Relaxed heuristic ― Given a relaxed search problem Prel, we define the relaxed heuristic h(s)=FutureCostrel(s) as the minimum cost path from s to an end state in the graph of costs Costrel(s,a).**
+
+⟶ Rahat sezgisel (Relaxed heuristic) ― Bir Prel rahat arama problemi verildiğinde, h(s)=FutureCostrel(s) rahat sezgisel eşitliğini Costrel(s,a) maliyet çizgesindeki s durumu ile bir bitiş durumu arasındaki asgari(minimum) maliyet yolu olarak tanımlarız.
+
+
+
+
+**55. Consistency of relaxed heuristics ― Let Prel be a given relaxed problem. By theorem, we have:**
+
+⟶ Rahat sezgisel tutarlılığı ― Prel bir rahat problem olarak verilmiş olsun. Teoreme göre:
+
+
+
+
+**56. consistent**
+
+⟶ tutarlı
+
+
+
+
+**57. [Tradeoff when choosing heuristic ― We have to balance two aspects in choosing a heuristic:, Computational efficiency: h(s)=FutureCostrel(s) must be easy to compute. It has to produce a closed form, easier search and independent subproblems., Good enough approximation: the heuristic h(s) should be close to FutureCost(s) and we have thus to not remove too many constraints.]**
+
+⟶ [Sezgisel seçiminde ödünleşim (tradeoff) ― Sezgisel seçiminde iki yönü dengelemeliyiz:, Hesaplamalı verimlilik: h(s)=FutureCostrel(s) eşitliği kolay hesaplanabilir olmalıdır. Kapalı bir şekil, daha kolay arama ve bağımsız alt problemler üretmesi gerekir., Yeterince iyi yaklaşım: sezgisel h(s), FutureCost(s) işlevine yakın olmalı ve bu nedenle çok fazla kısıtlamayı ortadan kaldırmamalıyız.]
+
+
+
+
+**58. Max heuristic ― Let h1(s), h2(s) be two heuristics. We have the following property:**
+
+⟶ En yüksek sezgisel ― h1(s) ve h2(s) aşağıdaki özelliklere sahip iki adet sezgisel olsun:
+
+
+
+
+**59. Markov decision processes**
+
+⟶ Markov karar süreçleri
+
+
+
+
+**60. In this section, we assume that performing action a from state s can lead to several states s′1,s′2,... in a probabilistic manner. In order to find our way between an initial state and an end state, our objective will be to find the maximum value policy by using Markov decision processes that help us cope with randomness and uncertainty.**
+
+⟶ Bu bölümde, s durumunda a eyleminin gerçekleştirilmesinin olasılıksal olarak birden fazla durum (s′1,s′2,...) ile sonuçlanabileceğini kabul ediyoruz. Başlangıç durumu ile bitiş durumu arasındaki yolu bulmak için amacımız, rastgelelik ve belirsizlik ile başa çıkmamıza yardımcı olan Markov karar süreçlerini kullanarak en yüksek değerli politikayı bulmak olacaktır.
+
+
+
+
+**61. Notations**
+
+⟶ Gösterimler
+
+
+
+
+**62. [Definition ― The objective of a Markov decision process is to maximize rewards. It is defined with:, a starting state sstart, possible actions Actions(s) from state s, transition probabilities T(s,a,s′) from s to s′ with action a, rewards Reward(s,a,s′) from s to s′ with action a, whether an end state was reached IsEnd(s), a discount factor 0⩽γ⩽1]**
+
+⟶ [Tanım ― Markov karar sürecinin amacı ödülleri en yüksek seviyeye çıkarmaktır. Markov karar süreci aşağıdaki bileşenlerden oluşmaktadır:, başlangıç durumu sstart, s durumunda gerçekleştirilebilecek olası eylemler Actions(s), s durumunda a eyleminin gerçekleştirilmesi ile s′ durumuna geçiş olasılıkları T(s,a,s′), s durumunda a eyleminin gerçekleştirilmesi ile elde edilen ödüller Reward(s,a,s′), bitiş durumuna ulaşılıp ulaşılamadığı IsEnd(s), indirim faktörü 0⩽γ⩽1]
+
+
+
+
+**63. Transition probabilities ― The transition probability T(s,a,s′) specifies the probability of going to state s′ after action a is taken in state s. Each s′↦T(s,a,s′) is a probability distribution, which means that:**
+
+⟶ Geçiş olasılıkları ― Geçiş olasılığı T(s,a,s′) s durumundayken gerçekleştirilen a eylemi neticesinde s′ durumuna gitme olasılığını belirtir. Her bir s′↦T(s,a,s′) aşağıda belirtildiği gibi bir olasılık dağılımıdır:
+
+
+
+
+**64. states**
+
+⟶ durumlar
+
+
+
+
+**65. Policy ― A policy π is a function that maps each state s to an action a, i.e.**
+
+⟶ Politika ― Bir π politikası her s durumunu bir a eylemi ile ilişkilendiren bir işlevdir.
+
+
+
+
+**66. Utility ― The utility of a path (s0,...,sk) is the discounted sum of the rewards on that path. In other words,**
+
+⟶ Fayda ― Bir (s0,...,sk) yolunun faydası, o yol üzerindeki ödüllerin indirimli toplamıdır. Diğer bir deyişle,
+
+
+
+
+**67. The figure above is an illustration of the case k=4.**
+
+⟶ Yukarıdaki şekil k=4 durumunun bir gösterimidir.
+
+
+
+
+**68. Q-value ― The Q-value of a policy π at state s with action a, also noted Qπ(s,a), is the expected utility from state s after taking action a and then following policy π. It is defined as follows:**
+
+⟶ Q-değeri ― s durumunda gerçekleştirilen bir a eylemi için π politikasının Q-değeri, Qπ(s,a) olarak da gösterilir; a eylemini gerçekleştirip sonrasında π politikasını izleyerek s durumundan elde edilen beklenen faydadır. Q-değeri aşağıdaki şekilde tanımlanmaktadır:
+
+
+
+
+**69. Value of a policy ― The value of a policy π from state s, also noted Vπ(s), is the expected utility by following policy π from state s over random paths. It is defined as follows:**
+
+⟶ Bir politikanın değeri ― s durumundaki π politikasının değeri, Vπ(s) olarak da gösterilir; rastgele yollar üzerinde s durumundan itibaren π politikasını izleyerek elde edilen beklenen faydadır. s durumundaki π politikasının değeri aşağıdaki gibi tanımlanır:
+
+
+
+
+**70. Remark: Vπ(s) is equal to 0 if s is an end state.**
+
+⟶ Not: Eğer s bitiş durumu ise Vπ(s) sıfıra eşittir.
+
+
+
+
+**71. Applications**
+
+⟶ Uygulamalar
+
+
+
+
+**72. [Policy evaluation ― Given a policy π, policy evaluation is an iterative algorithm that aims at estimating Vπ. It is done as follows:, Initialization: for all states s, we have:, Iteration: for t from 1 to TPE, we have, with]**
+
+⟶ [Politika değerlendirme ― Bir π politikası verildiğinde, politika değerlendirme Vπ'yi tahmin etmeyi amaçlayan tekrarlı (iterative) bir algoritmadır. Politika değerlendirme aşağıdaki gibi yapılmaktadır:, İlklendirme: bütün s durumları için:, Tekrar: 1'den TPE'ye kadar her t için, ile]
+
+
+
+
+**73. Remark: by noting S the number of states, A the number of actions per state, S′ the number of successors and TPE the number of iterations, the time complexity is O(TPESS′).**
+
+⟶ Not: S durum sayısını, A her bir durum için eylem sayısını, S′ ardılların (successors) sayısını ve TPE yineleme sayısını gösterdiğinde, zaman karmaşıklığı O(TPESS′) olur.
+
+
+
+
+**74. Optimal Q-value ― The optimal Q-value Qopt(s,a) of state s with action a is defined to be the maximum Q-value attained by any policy. It is computed as follows:**
+
+⟶ En iyi Q-değeri ― s durumunda a eylemi gerçekleştirildiğinde en iyi Q-değeri Qopt(s,a), herhangi bir politika ile elde edilen en yüksek Q-değeri olarak tanımlanmaktadır. En iyi Q-değeri aşağıdaki gibi hesaplanmaktadır:
+
+
+
+
+**75. Optimal value ― The optimal value Vopt(s) of state s is defined as being the maximum value attained by any policy. It is computed as follows:**
+
+⟶ En iyi değer ― s durumunun en iyi değeri olan Vopt(s), herhangi bir politika ile elde edilen en yüksek değer olarak tanımlanmaktadır. En iyi değer aşağıdaki gibi hesaplanmaktadır:
+
+
+
+
+**76. actions**
+
+⟶ eylemler
+
+
+
+
+**77. Optimal policy ― The optimal policy πopt is defined as being the policy that leads to the optimal values. It is defined by:**
+
+⟶ En iyi politika ― En iyi politika olan πopt, en iyi değerlere götüren politika olarak tanımlanmaktadır. En iyi politika aşağıdaki gibi tanımlanmaktadır:
+
+
+
+
+**78. [Value iteration ― Value iteration is an algorithm that finds the optimal value Vopt as well as the optimal policy πopt. It is done as follows:, Initialization: for all states s, we have:, Iteration: for t from 1 to TVI, we have:, with]**
+
+⟶ [Değer tekrarı (iteration) ― Değer tekrarı (iteration), en iyi değer Vopt'un yanı sıra en iyi politika πopt'u da bulan bir algoritmadır. Değer tekrarı (iteration) aşağıdaki gibi yapılmaktadır:, İlklendirme: bütün s durumları için:, Tekrar: 1'den TVI'ya kadar her bir t için:, ile]
+
+
+
+
+**79. Remark: if we have either γ<1 or the MDP graph being acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.**
+
+⟶ Not: Eğer γ<1 ya da Markov karar süreci (Markov Decision Process - MDP) asiklik (çevrimsiz) olursa, o zaman değer tekrarı algoritmasının doğru cevaba yakınsayacağı garanti edilir.
+
+
+
+
+**80. When transitions and rewards are unknown**
+
+⟶ Bilinmeyen geçişler ve ödüller
+
+
+
+
+**81. Now, let's assume that the transition probabilities and the rewards are unknown.**
+
+⟶ Şimdi, geçiş olasılıklarının ve ödüllerin bilinmediğini varsayalım.
+
+
+
+
+**82. Model-based Monte Carlo ― The model-based Monte Carlo method aims at estimating T(s,a,s′) and Reward(s,a,s′) using Monte Carlo simulation with:**
+
+⟶ Model-temelli Monte Carlo ― Model-temelli Monte Carlo yöntemi, T(s,a,s′) ve Reward(s,a,s′) işlevlerini Monte Carlo benzetimi kullanarak aşağıdaki formüllere uygun bir şekilde tahmin etmeyi amaçlar:
+
+
+
+
+**83. [# times (s,a,s′) occurs, and]**
+
+⟶ [# kere (s,a,s′) gerçekleşme sayısı, ve]
+
+
+
+
+**84. These estimates will then be used to deduce Q-values, including Qπ and Qopt.**
+
+⟶ Bu tahminler daha sonra Qπ ve Qopt'yi içeren Q-değerleri çıkarımı için kullanılacaktır.
+
+
+
+
+**85. Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not depend on the exact policy.**
+
+⟶ Not: model-temelli Monte Carlo'nun politika dışı (off-policy) olduğu söylenir, çünkü tahmin kesin politikaya bağlı değildir.
+
+
+
+
+**86. Model-free Monte Carlo ― The model-free Monte Carlo method aims at directly estimating Qπ, as follows:**
+
+⟶ Model içermeyen Monte Carlo ― Model içermeyen Monte Carlo yöntemi aşağıdaki şekilde doğrudan Qπ'yi tahmin etmeyi amaçlar:
+
+
+
+
+**87. Qπ(s,a)=average of ut where st−1=s,at=a**
+
+⟶ Qπ(s,a)= ortalama ut , st−1=s ve at=a olduğunda
+
+
+
+
+**88. where ut denotes the utility starting at step t of a given episode.**
+
+⟶ ut belirli bir bölümün t anında başlayan faydayı ifade etmektedir.
+
+
+
+
+**89. Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent on the policy π used to generate the data.**
+
+⟶ Not: model içermeyen Monte Carlo'nun politikaya dahil olduğu söyleniyor, çünkü tahmini değer veriyi üretmek için kullanılan π politikasına bağlıdır.
+
+
+
+
+**90. Equivalent formulation ― By introducing the constant η=1/(1+(#updates to (s,a))) and for each (s,a,u) of the training set, the update rule of model-free Monte Carlo has a convex combination formulation:**
+
+⟶ Eşdeğer formülasyon ― Sabit tanımı η=1/(1+(#güncelleme sayısı (s,a))) ve eğitim kümesinin her bir (s,a,u) üçlüsü için, model içermeyen Monte Carlo'nun güncelleme kuralı dışbükey bir kombinasyon formülasyonuna sahiptir:
+
+
+
+
+**91. as well as a stochastic gradient formulation:**
+
+⟶ olasılıksal bayır formülasyonu yanında:
+
+
+
+
+**92. SARSA ― State-action-reward-state-action (SARSA) is a bootstrapping method estimating Qπ by using both raw data and estimates as part of the update rule. For each (s,a,r,s′,a′), we have:**
+
+⟶ SARSA ― Durum-eylem-ödül-durum-eylem (State-Action-Reward-State-Action - SARSA), hem ham verileri hem de güncelleme kuralının bir parçası olarak tahminleri kullanarak Qπ'yi tahmin eden bir destekleme yöntemidir. Her bir (s,a,r,s′,a′) için:
+
+
+
+
+**93. Remark: the SARSA estimate is updated on the fly as opposed to the model-free Monte Carlo one where the estimate can only be updated at the end of the episode.**
+
+⟶ Not: SARSA tahmini, tahminin yalnızca bölüm sonunda güncellenebildiği model içermeyen Monte Carlo yönteminin aksine anında güncellenir.
+
+
+
+
+**94. Q-learning ― Q-learning is an off-policy algorithm that produces an estimate for Qopt. On each (s,a,r,s′,a′), we have:**
+
+⟶ Q-öğrenme ― Q-öğrenme, Qopt için tahmin üreten politikaya dahil olmayan bir algoritmadır. Her bir (s,a,r,s′,a′) için:
+
+
+
+
+**95. Epsilon-greedy ― The epsilon-greedy policy is an algorithm that balances exploration with probability ϵ and exploitation with probability 1−ϵ. For a given state s, the policy πact is computed as follows:**
+
+⟶ Epsilon-açgözlü ― Epsilon-açgözlü politika, ϵ olasılıkla araştırmayı ve 1−ϵ olasılıkla sömürüyü dengeleyen bir algoritmadır. Her bir s durumu için, πact politikası aşağıdaki şekilde hesaplanır:
+
+
+
+
+**96. [with probability, random from Actions(s)]**
+
+⟶ [olasılıkla, Actions(s) eylem kümesi içinden rastgele]
+
+
+
+
+**97. Game playing**
+
+⟶ Oyun oynama
+
+
+
+
+**98. In games (e.g. chess, backgammon, Go), other agents are present and need to be taken into account when constructing our policy.**
+
+⟶ Oyunlarda (örneğin satranç, tavla, Go), başka oyuncular vardır ve politikamızı oluştururken göz önünde bulundurulması gerekir.
+
+
+
+
+**99. Game tree ― A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.**
+
+⟶ Oyun ağacı ― Oyun ağacı, bir oyunun olasılıklarını tarif eden bir ağaçtır. Özellikle, her bir düğüm, oyuncu için bir karar noktasıdır ve her bir kökten (root) yaprağa (leaf) giden yol oyunun olası bir sonucudur.
+
+
+
+
+**100. [Two-player zero-sum game ― It is a game where each state is fully observed and such that players take turns. It is defined with:, a starting state sstart, possible actions Actions(s) from state s, successors Succ(s,a) from states s with actions a, whether an end state was reached IsEnd(s), the agent's utility Utility(s) at end state s, the player Player(s) who controls state s]**
+
+⟶ [İki oyunculu sıfır toplamlı oyun ― Her durumun tamamen gözlendiği ve oyuncuların sırayla oynadığı bir oyundur. Aşağıdaki gibi tanımlanır:, bir başlangıç durumu sstart, s durumunda gerçekleştirilebilecek olası eylemler Actions(s), s durumunda a eylemi gerçekleştirildiğindeki ardıllar Succ(s,a), bir bitiş durumuna ulaşılıp ulaşılmadığı IsEnd(s), s bitiş durumunda etmenin elde ettiği fayda Utility(s), s durumunu kontrol eden oyuncu Player(s)]
+
+
+
+
+**101. Remark: we will assume that the utility of the agent has the opposite sign of the one of the opponent.**
+
+⟶ Not: Oyuncu faydasının işaretinin, rakibinin faydasının tersi olacağını varsayacağız.
+
+
+
+
+**102. [Types of policies ― There are two types of policies:, Deterministic policies, noted πp(s), which are actions that player p takes in state s., Stochastic policies, noted πp(s,a)∈[0,1], which are probabilities that player p takes action a in state s.]**
+
+⟶ [Politika türleri ― İki tane politika türü vardır:, πp(s) olarak gösterilen belirlenimci politikalar , p oyuncusunun s durumunda gerçekleştirdiği eylemler., πp(s,a)∈[0,1] olarak gösterilen olasılıksal politikalar, p oyuncusunun s durumunda a eylemini gerçekleştirme olasılıkları.]
+
+
+
+
+**103. Expectimax ― For a given state s, the expectimax value Vexptmax(s) is the maximum expected utility of any agent policy when playing with respect to a fixed and known opponent policy πopp. It is computed as follows:**
+
+⟶ En yüksek beklenen değer(Expectimax) ― Belirli bir s durumu için, en yüksek beklenen değer olan Vexptmax(s), sabit ve bilinen bir rakip politikası olan πopp'a göre oynarken, bir oyuncu politikasının en yüksek beklenen faydasıdır. En yüksek beklenen değer(Expectimax) aşağıdaki gibi hesaplanmaktadır:
+
+
+
+
+**104. Remark: expectimax is the analog of value iteration for MDPs.**
+
+⟶ Not: En yüksek beklenen değer(Expectimax), MDP'ler için değer yinelemenin analog halidir.
+
+
+
+
+**105. Minimax ― The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent's utility. It is done as follows:**
+
+⟶ En küçük-en büyük (minimax) ― En küçük-en büyük (minimax) politikaların amacı en kötü durumu kabul ederek, diğer bir deyişle; rakip, oyuncunun faydasını en aza indirmek için her şeyi yaparken, rakibe karşı en iyi politikayı bulmaktır. En küçük-en büyük (minimax) aşağıdaki şekilde yapılır:
+
+
+
+
+**106. Remark: we can extract πmax and πmin from the minimax value Vminimax.**
+
+⟶ Not: πmax ve πmin değerleri, en küçük-en büyük olan Vminimax'dan elde edilebilir.
+
+
+
+
+**107. Minimax properties ― By noting V the value function, there are 3 properties around minimax to have in mind:**
+
+⟶ En küçük-en büyük (minimax) özellikleri ― V değer fonksiyonunu ifade ederse, En küçük-en büyük (minimax) ile ilgili aklımızda bulundurmamız gereken 3 özellik vardır:
+
+
+
+
+**108. Property 1: if the agent were to change its policy to any πagent, then the agent would be no better off.**
+
+⟶ Özellik 1: Oyuncu politikasını herhangi bir πagent ile değiştirecek olsaydı, o zaman oyuncu daha iyi olmazdı.
+
+
+
+
+**109. Property 2: if the opponent changes its policy from πmin to πopp, then he will be no better off.**
+
+⟶ Özellik 2: Eğer rakip oyuncu politikasını πmin'den πopp'a değiştirecek olsaydı, o zaman rakip oyuncu daha iyi olamazdı.
+
+
+
+
+**110. Property 3: if the opponent is known to be not playing the adversarial policy, then the minimax policy might not be optimal for the agent.**
+
+⟶ Özellik 3: Eğer rakip oyuncunun muhalif (adversarial) politikayı oynamadığı biliniyorsa, o zaman en küçük-en büyük (minimax) politika oyuncu için en iyi (optimal) olmayabilir.
+
+
+
+
+**111. In the end, we have the following relationship:**
+
+⟶ Sonunda, aşağıda belirtildiği gibi bir ilişkiye sahip oluruz:
+
+
+
+
+**112. Speeding up minimax**
+
+⟶ En küçük-en büyük (minimax) hızlandırma
+
+
+
+
+**113. Evaluation function ― An evaluation function is a domain-specific and approximate estimate of the value Vminimax(s). It is noted Eval(s).**
+
+⟶ Değerlendirme işlevi ― Değerlendirme işlevi, alana özgü (domain-specific) ve Vminimax(s) değerinin yaklaşık bir tahminidir. Eval(s) olarak ifade edilmektedir.
+
+
+
+
+**114. Remark: FutureCost(s) is an analogy for search problems.**
+
+⟶ Not: FutureCost(s) arama problemleri için bir benzetmedir(analogy).
+
+
+
+
+**115. Alpha-beta pruning ― Alpha-beta pruning is a domain-general exact method optimizing the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in α for the maximizing player and in β for the minimizing player). At a given step, the condition β<α means that the optimal path is not going to be in the current branch as the earlier player had a better option at their disposal.**
+
+⟶ Alfa-beta budama ― Alfa-beta budama, oyun ağacının bazı parçalarının gereksiz yere keşfedilmesini önleyerek en küçük-en büyük (minimax) algoritmasını en iyileyen (optimize eden), alana özgü olmayan, genel ve kesin (exact) bir yöntemdir. Bunu yapmak için, her oyuncu ümit edebileceği en iyi değeri takip eder (maksimize eden oyuncu için α'da ve minimize eden oyuncu için β'de saklanır). Belirli bir adımda, β<α koşulu, önceki oyuncunun elinde daha iyi bir seçenek bulunması nedeniyle en iyi (optimal) yolun mevcut dalda olmayacağı anlamına gelir.
+
+
+
+
+**116. TD learning ― Temporal difference (TD) learning is used when we don't know the transitions/rewards. The value is based on exploration policy. To be able to use it, we need to know rules of the game Succ(s,a). For each (s,a,r,s′), the update is done as follows:**
+
+⟶ TD öğrenme ― Zamansal fark (Temporal difference - TD) öğrenmesi, geçişleri/ödülleri bilmediğimiz zaman kullanılır. Değer, keşif politikasına dayanır. Bunu kullanabilmek için, oyunun kurallarını, Succ(s,a), bilmemiz gerekir. Her bir (s,a,r,s′) için, güncelleme aşağıdaki şekilde yapılır:
+
+
+
+
+**117. Simultaneous games**
+
+⟶ Eşzamanlı oyunlar
+
+
+
+
+**118. This is the opposite of turn-based games: there is no ordering on the players' moves.**
+
+⟶ Bu, sıra temelli oyunların tam tersidir; oyuncuların hamleleri arasında bir sıralama yoktur.
+
+
+
+
+**119. Single-move simultaneous game ― Let there be two players A and B, with given possible actions. We note V(a,b) to be A's utility if A chooses action a, B chooses action b. V is called the payoff matrix.**
+
+⟶ Tek-hamleli eşzamanlı oyun ― Olası eylemleri verilmiş A ve B iki oyuncu olsun. V(a,b), A'nın a eylemini ve B'nin b eylemini seçtiği durumda A'nın faydasını ifade eder. V, getiri matrisi olarak adlandırılır.
+
+
+
+
+**120. [Strategies ― There are two main types of strategies:, A pure strategy is a single action:, A mixed strategy is a probability distribution over actions:]**
+
+⟶ [Stratejiler ― İki tane ana strateji türü vardır:, Saf strateji, tek bir eylemdir:, Karışık strateji, eylemler üzerindeki bir olasılık dağılımıdır:]
+
+
+
+
+**121. Game evaluation ― The value of the game V(πA,πB) when player A follows πA and player B follows πB is such that:**
+
+⟶ Oyun değerlendirme ― oyuncu A πA'yı ve oyuncu B de πB'yi izlediğinde, Oyun değeri V(πA,πB):
+
+
+
+
+**122. Minimax theorem ― By noting πA,πB ranging over mixed strategies, for every simultaneous two-player zero-sum game with a finite number of actions, we have:**
+
+⟶ En küçük-en büyük (minimax) teoremi ― πA ve πB karma stratejiler üzerinde değişmek üzere, sonlu sayıda eylem içeren her eşzamanlı iki oyunculu sıfır toplamlı oyun için:
+
+
+
+
+**123. Non-zero-sum games**
+
+⟶ Sıfır toplamı olmayan oyunlar
+
+
+
+
+**124. Payoff matrix ― We define Vp(πA,πB) to be the utility for player p.**
+
+⟶ Getiri matrisi ― Vp(πA,πB)'yi oyuncu p'nin faydası olarak tanımlıyoruz.
+
+
+
+
+**125. Nash equilibrium ― A Nash equilibrium is (π∗A,π∗B) such that no player has an incentive to change its strategy. We have:**
+
+⟶ Nash dengesi ― Nash dengesi, hiçbir oyuncunun stratejisini değiştirmeye teşvik edilmediği bir (π∗A,π∗B) çiftidir:
+
+
+
+
+**126. and**
+
+⟶ ve
+
+
+
+
+**127. Remark: in any finite-player game with a finite number of actions, there exists at least one Nash equilibrium.**
+
+⟶ Not: sonlu sayıda eylem olan herhangi bir sonlu oyunculu oyunda, en azından bir tane Nash dengesi mevcuttur.
+
+
+
+
+**128. [Tree search, Backtracking search, Breadth-first search, Depth-first search, Iterative deepening]**
+
+⟶ [Ağaç arama, Geri izleme araması, Genişlik öncelikli arama, Derinlik öncelikli arama, Tekrarlı (Iterative) derinleşme]
+
+
+
+
+**129. [Graph search, Dynamic programming, Uniform cost search]**
+
+⟶ [Çizge arama, Dinamik programlama, Tek tip maliyet araması]
+
+
+
+
+**130. [Learning costs, Structured perceptron]**
+
+⟶ [Öğrenme maliyetleri, Yapısal algılayıcı]
+
+
+
+
+**131. [A star search, Heuristic function, Algorithm, Consistency, correctness, Admissibility, efficiency]**
+
+⟶ [A yıldız arama, Sezgisel işlev, Algoritma, Tutarlılık, doğruluk, kabul edilebilirlik, verimlilik]
+
+
+
+
+**132. [Relaxation, Relaxed search problem, Relaxed heuristic, Max heuristic]**
+
+⟶ [Rahatlama, Rahat arama problemi, Rahat sezgisel, En yüksek sezgisel]
+
+
+
+
+**133. [Markov decision processes, Overview, Policy evaluation, Value iteration, Transitions, rewards]**
+
+⟶ [Markov karar süreçleri, Genel bakış, Politika değerlendirme, Değer yineleme, Geçişler, ödüller]
+
+
+
+
+**134. [Game playing, Expectimax, Minimax, Speeding up minimax, Simultaneous games, Non-zero-sum games]**
+
+⟶ [Oyun oynama, En yüksek beklenti, En küçük-en büyük, En küçük-en büyük hızlandırma, Eşzamanlı oyunlar, Sıfır toplamı olmayan oyunlar]
+
+
+
+
+**135. View PDF version on GitHub**
+
+⟶ GitHub'da PDF sürümünü görüntüleyin
+
+
+
+
+**136. Original authors**
+
+⟶ Asıl yazarlar
+
+
+
+
+**137. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından tercüme edilmiştir.
+
+
+
+
+**138. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından gözden geçirilmiştir.
+
+
+
+
+**139. By X and Y**
+
+⟶ X ve Y ile
+
+
+
+
+**140. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Yapay Zeka el kitapları artık [hedef dilde] mevcuttur.
diff --git a/tr/cs-221-variables-models.md b/tr/cs-221-variables-models.md
new file mode 100644
index 000000000..aac242e96
--- /dev/null
+++ b/tr/cs-221-variables-models.md
@@ -0,0 +1,617 @@
+**Variables-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-variables-models)
+
+
+
+**1. Variables-based models with CSP and Bayesian networks**
+
+⟶ 1. CSP ve Bayesçi ağlar ile değişken-temelli modeller
+
+
+
+
+**2. Constraint satisfaction problems**
+
+⟶ 2. Kısıt memnuniyet problemleri
+
+
+
+
+**3. In this section, our objective is to find maximum weight assignments of variable-based models. One advantage compared to states-based models is that these algorithms are more convenient to encode problem-specific constraints.**
+
+⟶ 3. Bu bölümde hedefimiz değişken-temelli modellerin maksimum ağırlık seçimlerini bulmaktır. Durum temelli modellerle kıyaslandığında, bu algoritmaların probleme özgü kısıtları kodlamak için daha uygun olmaları bir avantajdır.
+
+
+
+
+**4. Factor graphs**
+
+⟶ 4. Faktör grafikleri
+
+
+
+
+**5. Definition ― A factor graph, also referred to as a Markov random field, is a set of variables X=(X1,...,Xn) where Xi∈Domaini and m factors f1,...,fm with each fj(X)⩾0.**
+
+⟶ 5. Tanımlama - Markov rastgele alanı olarak da adlandırılan faktör grafiği, Xi∈Domaini olmak üzere X=(X1,...,Xn) değişkenler kümesi ve her biri fj(X)⩾0 olan m adet f1,...,fm faktöründen oluşur.
+
+
+
+
+**6. Domain**
+
+⟶ 6. Etki Alanı (Domain)
+
+
+
+
+**7. Scope and arity ― The scope of a factor fj is the set of variables it depends on. The size of this set is called the arity.**
+
+⟶ 7. Kapsam ve ilişki derecesi - fj faktörünün kapsamı, bağlı olduğu değişkenler kümesidir. Bu kümenin boyutuna ilişki derecesi (arity) denir.
+
+
+
+
+**8. Remark: factors of arity 1 and 2 are called unary and binary respectively.**
+
+⟶ 8. Not: Faktörlerin ilişki derecesi 1 ve 2 olanlarına sırasıyla tek ve ikili denir.
+
+
+
+
+**9. Assignment weight ― Each assignment x=(x1,...,xn) yields a weight Weight(x) defined as being the product of all factors fj applied to that assignment. Its expression is given by:**
+
+⟶ 9. Atama ağırlığı - Her x=(x1,...,xn) ataması, o atamaya uygulanan tüm fj faktörlerinin çarpımı olarak tanımlanan bir Weight(x) ağırlığı verir. Şöyle ifade edilir:
+
+
+
+
+**10. Constraint satisfaction problem ― A constraint satisfaction problem (CSP) is a factor graph where all factors are binary; we call them constraints:**
+
+⟶ 10. Kısıt memnuniyet problemi - Kısıt memnuniyet problemi (constraint satisfaction problem - CSP), tüm faktörlerin ikili olduğu bir faktör grafiğidir; bunları kısıt olarak adlandırıyoruz:
+
+
+
+
+**11. Here, the constraint j with assignment x is said to be satisfied if and only if fj(x)=1.**
+
+⟶ 11. Burada, x atamasıyla j kısıtının ancak ve ancak fj(x)=1 olduğunda sağlandığı (satisfied) söylenir.
+
+
+
+
+**12. Consistent assignment ― An assignment x of a CSP is said to be consistent if and only if Weight(x)=1, i.e. all constraints are satisfied.**
+
+⟶ 12. Tutarlı atama - Bir CSP'nin bir x atamasının, ancak ve ancak Weight(x)=1 olduğunda, yani tüm kısıtlar sağlandığında tutarlı olduğu söylenir.
+
+
+
+
+**13. Dynamic ordering**
+
+⟶ 13. Dinamik düzenleşim (Dynamic ordering)
+
+
+
+
+**14. Dependent factors ― The set of dependent factors of variable Xi with partial assignment x is called D(x,Xi), and denotes the set of factors that link Xi to already assigned variables.**
+
+⟶ 14. Bağımlı faktörler - Kısmi x atamasıyla Xi değişkeninin bağımlı faktörleri kümesi D(x,Xi) ile gösterilir ve Xi'yi önceden atanmış değişkenlere bağlayan faktörler kümesini belirtir.
+
+
+
+
+**15. Backtracking search ― Backtracking search is an algorithm used to find maximum weight assignments of a factor graph. At each step, it chooses an unassigned variable and explores its values by recursion. Dynamic ordering (i.e. choice of variables and values) and lookahead (i.e. early elimination of inconsistent options) can be used to explore the graph more efficiently, although the worst-case runtime stays exponential: O(|Domain|^n).**
+
+⟶ 15. Geri izleme araması - Geri izleme araması, bir faktör grafiğinin maksimum ağırlık atamalarını bulmak için kullanılan bir algoritmadır. Her adımda, atanmamış bir değişken seçer ve değerlerini özyineleme ile araştırır. Dinamik düzenleşim (yani değişkenlerin ve değerlerin seçimi) ve ileriye bakış (yani tutarsız seçeneklerin erken elenmesi), grafiği daha verimli araştırmak için kullanılabilir; ancak en kötü durum çalışma süresi üstel kalır: O(|Domain|^n).
+
+
+
+
+**16. [Forward checking ― It is a one-step lookahead heuristic that preemptively removes inconsistent values from the domains of neighboring variables. It has the following characteristics:, After assigning a variable Xi, it eliminates inconsistent values from the domains of all its neighbors., If any of these domains becomes empty, we stop the local backtracking search., If we un-assign a variable Xi, we have to restore the domain of its neighbors.]**
+
+⟶ 16. [İleri kontrol - Komşu değişkenlerin etki alanlarından tutarsız değerleri önceden eleyen tek adımlı bir ileriye bakış sezgiselidir. Aşağıdaki özelliklere sahiptir:, Bir Xi değişkenini atadıktan sonra, tüm komşularının etki alanlarından tutarsız değerleri eler., Bu etki alanlarından herhangi biri boş olursa, yerel geri izleme araması durdurulur., Bir Xi değişkeninin atamasını geri alırsak, komşularının etki alanlarını eski haline getirmek zorundayız.]
+
+
+
+
+**17. Most constrained variable ― It is a variable-level ordering heuristic that selects the next unassigned variable that has the fewest consistent values. This has the effect of making inconsistent assignments fail earlier in the search, which enables more efficient pruning.**
+
+⟶ 17. En kısıtlı değişken - En az tutarlı değere sahip bir sonraki atanmamış değişkeni seçen, değişken seviyeli sezgisel düzenleşimdir. Bu, tutarsız atamaların aramada daha erken başarısız olmasını sağlar ve böylece daha verimli budamaya olanak tanır.
+
+
+
+
+**18. Least constrained value ― It is a value-level ordering heuristic that assigns the next value that yields the highest number of consistent values of neighboring variables. Intuitively, this procedure chooses first the values that are most likely to work.**
+
+⟶ 18. En düşük kısıtlı değer - Komşu değişkenler için en fazla sayıda tutarlı değer bırakan değeri bir sonraki değer olarak atayan, değer seviyeli sezgisel düzenleşimdir. Sezgisel olarak, bu prosedür önce çalışması en muhtemel olan değerleri seçer.
+
+
+
+
+**19. Remark: in practice, this heuristic is useful when all factors are constraints.**
+
+⟶ 19. Not: Uygulamada, bu sezgisel yaklaşım tüm faktörler kısıt olduğunda kullanışlıdır.
+
+
+
+
+**20. The example above is an illustration of the 3-color problem with backtracking search coupled with most constrained variable exploration and least constrained value heuristic, as well as forward checking at each step.**
+
+⟶ 20. Yukarıdaki örnek, en kısıtlı değişken keşfi ve sezgisel en düşük kısıtlı değerin yanı sıra, her adımda ileri kontrol ile birleştirilmiş geri izleme arama ile 3 renk probleminin bir gösterimidir.
+
+
+
+
+**21. [Arc consistency ― We say that arc consistency of variable Xl with respect to Xk is enforced when for each xl∈Domainl:, unary factors of Xl are non-zero, there exists at least one xk∈Domaink such that any factor between Xl and Xk is non-zero.]**
+
+⟶ 21. [Ark tutarlılığı (Arc consistency) - Xl değişkeninin Xk'ye göre ark tutarlılığının, her bir xl∈Domainl için şu koşullar geçerli olduğunda sağlandığı söylenir:, Xl'in tekli faktörleri sıfır değildir, Xl ve Xk arasındaki her faktörün sıfır olmadığı en az bir xk∈Domaink vardır.]
+
+
+
+
+**22. AC-3 ― The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking to all relevant variables. After a given assignment, it performs forward checking and then successively enforces arc consistency with respect to the neighbors of variables for which the domain changes during the process.**
+
+⟶ 22. AC-3 - AC-3 algoritması, tüm ilgili değişkenlere ileri kontrol uygulayan çok adımlı sezgisel bir bakış açısıdır. Belirli bir atamadan sonra ileri kontrol yapar ve ardından işlem sırasında etki alanı değişen değişkenlerin komşularına göre ark tutarlılığını ardı ardına uygular.
+
+
+
+
+**23. Remark: AC-3 can be implemented both iteratively and recursively.**
+
+⟶ 23. Not: AC-3, tekrarlı ve özyinelemeli olarak uygulanabilir.
+
+
+
+
+**24. Approximate methods**
+
+⟶ 24. Yaklaşık yöntemler (Approximate methods)
+
+
+
+
+**25. Beam search ― Beam search is an approximate algorithm that extends partial assignments of n variables of branching factor b=|Domain| by exploring the K top paths at each step. The beam size K∈{1,...,bn} controls the tradeoff between efficiency and accuracy. This algorithm has a time complexity of O(n⋅Kblog(Kb)).**
+
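+⟶ 25. Işın araması (Beam search) - Işın araması, dallanma faktörü b=|Domain| olan n değişkenin kısmi atamalarını, her adımda en iyi K yolu keşfederek genişleten yaklaşık bir algoritmadır. Işın boyutu K∈{1,...,bn}, verimlilik ile doğruluk arasındaki dengeyi kontrol eder. Bu algoritmanın zaman karmaşıklığı O(n⋅Kblog(Kb))'dir.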
+
+
+
+
+**26. The example below illustrates a possible beam search of parameters K=2, b=3 and n=5.**
+
+⟶ 26. Aşağıdaki örnek, K = 2, b = 3 ve n = 5 parametreleri ile muhtemel ışın aramasını (beam search) göstermektedir.
+
+
+
+
+**27. Remark: K=1 corresponds to greedy search whereas K→+∞ is equivalent to BFS tree search.**
+
+⟶ 27. Not: K = 1 açgözlü aramaya (greedy search) karşılık gelirken K → + ∞, BFS ağaç aramasına eşdeğerdir.
+
+
+
+
+**28. Iterated conditional modes ― Iterated conditional modes (ICM) is an iterative approximate algorithm that modifies the assignment of a factor graph one variable at a time until convergence. At step i, we assign to Xi the value v that maximizes the product of all factors connected to that variable.**
+
+⟶ 28. Tekrarlanmış koşullu modlar - Tekrarlanmış koşullu modlar (Iterated conditional modes-ICM), bir faktör grafiğinin atamasını yakınsamaya kadar her seferinde bir değişken olmak üzere değiştiren yinelemeli bir yaklaşık algoritmadır. i adımında, Xi'ye, bu değişkene bağlı tüm faktörlerin çarpımını maksimize eden v değeri atanır.
+
+
+
+
+**29. Remark: ICM may get stuck in local minima.**
+
+⟶ 29. Not: ICM yerel minimumda takılıp kalabilir.
+
+
+
+
+**30. [Gibbs sampling ― Gibbs sampling is an iterative approximate method that modifies the assignment of a factor graph one variable at a time until convergence. At step i:, we assign to each element u∈Domaini a weight w(u) that is the product of all factors connected to that variable, we sample v from the probability distribution induced by w and assign it to Xi.]**
+
+⟶ 30. [Gibbs örneklemesi - Gibbs örneklemesi, bir faktör grafiğinin atamasını yakınsamaya kadar her seferinde bir değişken olmak üzere değiştiren yinelemeli bir yaklaşık yöntemdir. i adımında:, her bir u∈Domaini öğesine, bu değişkene bağlı tüm faktörlerin çarpımı olan bir w(u) ağırlığı atanır, w tarafından indüklenen olasılık dağılımından v örneklenir ve Xi'ye atanır.]
+
+
+
+
+**31. Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the advantage of being able to escape local minima in most cases.**
+
+⟶ 31. Not: Gibbs örneklemesi, ICM'nin olasılıksal karşılığı olarak görülebilir. Çoğu durumda yerel minimumlardan kaçabilme avantajına sahiptir.
+
+
+
+
+**32. Factor graph transformations**
+
+⟶ 32. Faktör grafiği dönüşümleri
+
+
+
+
+**33. Independence ― Let A,B be a partitioning of the variables X. We say that A and B are independent if there are no edges between A and B and we write:**
+
+⟶ 33. Bağımsızlık - A ve B, X değişkenlerinin bir bölüntüsü olsun. A ve B arasında kenar yoksa A ve B'nin bağımsız olduğu söylenir ve şöyle ifade edilir:
+
+
+
+
+**34. Remark: independence is the key property that allows us to solve subproblems in parallel.**
+
+⟶ 34. Not: bağımsızlık, alt sorunları paralel olarak çözmemize olanak sağlayan bir kilit özelliktir.
+
+
+
+
+**35. Conditional independence ― We say that A and B are conditionally independent given C if conditioning on C produces a graph in which A and B are independent. In this case, it is written:**
+
+⟶ 35. Koşullu bağımsızlık - Eğer C üzerinde koşullandırma, A ve B'nin bağımsız olduğu bir grafik üretiyorsa, C verildiğinde A ve B'nin koşullu olarak bağımsız olduğu söylenir. Bu durumda şöyle yazılır:
+
+
+
+
+**36. [Conditioning ― Conditioning is a transformation aiming at making variables independent that breaks up a factor graph into smaller pieces that can be solved in parallel and can use backtracking. In order to condition on a variable Xi=v, we do as follows:, Consider all factors f1,...,fk that depend on Xi, Remove Xi and f1,...,fk, Add gj(x) for j∈{1,...,k} defined as:]**
+
+⟶ 36. [Koşullandırma - Koşullandırma, bir faktör grafiğini paralel olarak çözülebilen ve geriye doğru izlemeyi kullanabilen daha küçük parçalara bölen değişkenleri bağımsız kılmayı amaçlayan bir dönüşümdür. Xi = v değişkeninde koşullandırmak için aşağıdakileri yaparız: Xi'ye bağlı tüm f1, ..., fk faktörlerini göz önünde bulundurun, Xi ve f1, ..., fk öğelerini kaldırın, j∈ {1, ..., k} için gj (x) ekleyin:]
+
+
+
+
+**37. Markov blanket ― Let A⊆X be a subset of variables. We define MarkovBlanket(A) to be the neighbors of A that are not in A.**
+
+⟶ 37. Markov blanket - A⊆X değişkenlerin bir alt kümesi olsun. MarkovBlanket(A)'yı, A'nın A'da olmayan komşuları olarak tanımlıyoruz.
+
+
+
+
+**38. Proposition ― Let C=MarkovBlanket(A) and B=X∖(A∪C). Then we have:**
+
+⟶ 38. Önerme - C = MarkovBlanket(A) ve B = X∖(A∪C) olsun. Bu durumda:
+
+
+
+
+**39. [Elimination ― Elimination is a factor graph transformation that removes Xi from the graph and solves a small subproblem conditioned on its Markov blanket as follows:, Consider all factors fi,1,...,fi,k that depend on Xi, Remove Xi
+and fi,1,...,fi,k, Add fnew,i(x) defined as:]**
+
+⟶ 39. [Eliminasyon - Eliminasyon, Xi'yi grafikten çıkaran ve Markov blanket üzerinde koşullandırılmış küçük bir alt problemi aşağıdaki gibi çözen bir faktör grafiği dönüşümüdür:, Xi'ye bağlı tüm fi,1,...,fi,k faktörlerini göz önünde bulundurun, Xi ve fi,1,...,fi,k öğelerini kaldırın, Şu şekilde tanımlanan fnew,i(x)'i ekleyin:]
+
+
+
+
+**40. Treewidth ― The treewidth of a factor graph is the maximum arity of any factor created by variable elimination with the best variable ordering. In other words,**
+
+⟶ 40. Ağaç genişliği (Treewidth) - Bir faktör grafiğinin ağaç genişliği, en iyi değişken sıralamasıyla yapılan değişken elemenin oluşturduğu herhangi bir faktörün maksimum ilişki derecesidir. Diğer bir deyişle,
+
+
+
+
+**41. The example below illustrates the case of a factor graph of treewidth 3.**
+
+⟶ 41. Aşağıdaki örnek, ağaç genişliği 3 olan faktör grafiğini gösterir.
+
+
+
+
+**42. Remark: finding the best variable ordering is an NP-hard problem.**
+
+⟶ 42. Not: en iyi değişken sıralamasını bulmak NP-zor (NP-hard) bir problemdir.
+
+
+
+
+**43. Bayesian networks**
+
+⟶ 43. Bayesçi ağlar
+
+
+
+
+**44. In this section, our goal will be to compute conditional probabilities. What is the probability of a query given evidence?**
+
+⟶ 44. Bu bölümün amacı koşullu olasılıkları hesaplamak olacaktır. Kanıt verildiğinde bir sorgunun olasılığı nedir?
+
+
+
+
+**45. Introduction**
+
+⟶ 45. Giriş
+
+
+
+
+**46. Explaining away ― Suppose causes C1 and C2 influence an effect E. Conditioning on the effect E and on one of the causes (say C1) changes the probability of the other cause (say C2). In this case, we say that C1 has explained away C2.**
+
+⟶ 46. Açıklayarak devre dışı bırakma (Explaining away) - C1 ve C2 sebeplerinin bir E sonucunu etkilediğini varsayalım. E sonucu ve sebeplerden biri (C1 olduğunu varsayalım) üzerinde koşullandırma, diğer sebebin (C2 olduğunu varsayalım) olasılığını değiştirir. Bu durumda, C1'in C2'yi açıklayarak devre dışı bıraktığı söylenir.
+
+
+
+
+**47. Directed acyclic graph ― A directed acyclic graph (DAG) is a finite directed graph with no directed cycles.**
+
+⟶ 47. Yönlü çevrimsiz çizge - Yönlü çevrimsiz bir çizge (Directed acyclic graph-DAG), yönlendirilmiş çevrimleri olmayan sonlu bir yönlü çizgedir.
+
+
+
+
+**48. Bayesian network ― A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over random variables X=(X1,...,Xn) as a product of local conditional distributions, one for each node:**
+
+⟶ 48. Bayesçi ağ - Her düğüm için bir tane olmak üzere, yerel koşullu dağılımların bir çarpımı olarak, X = (X1, ..., Xn) rasgele değişkenleri üzerindeki bir ortak dağılımı belirten yönlü bir çevrimsiz çizgedir:
+
+
+
+
+**49. Remark: Bayesian networks are factor graphs imbued with the language of probability.**
+
+⟶ 49. Not: Bayesçi ağlar olasılık diliyle bütünleşik faktör grafikleridir.
+
+
+
+
+**50. Locally normalized ― For each xParents(i), all factors are local conditional distributions. Hence they have to satisfy:**
+
+⟶ 50. Yerel olarak normalleştirilmiş - Her xParents(i) için tüm faktörler yerel koşullu dağılımlardır. Bu nedenle şu koşulu sağlamak zorundadırlar:
+
+
+
+
+**51. As a result, sub-Bayesian networks and conditional distributions are consistent.**
+
+⟶ 51. Sonuç olarak, alt-Bayesçi ağlar ve koşullu dağılımlar tutarlıdır.
+
+
+
+
+**52. Remark: local conditional distributions are the true conditional distributions.**
+
+⟶ 52. Not: Yerel koşullu dağılımlar gerçek koşullu dağılımlardır.
+
+
+
+
+**53. Marginalization ― The marginalization of a leaf node yields a Bayesian network without that node.**
+
+⟶ 53. Marjinalleştirme - Bir yaprak düğümünün marjinalleştirilmesi, o düğümü içermeyen bir Bayesçi ağ verir.
+
+
+
+
+**54. Probabilistic programs**
+
+⟶ 54. Olasılık programları
+
+
+
+
+**55. Concept ― A probabilistic program randomizes variable assignment. That way, we can write down complex Bayesian networks that generate assignments without us having to explicitly specify associated probabilities.**
+
+⟶ 55. Konsept - Olasılıklı bir program değişkenlerin atanmasını randomize eder. Bu şekilde, ilişkili olasılıkları açıkça belirtmek zorunda kalmadan atamalar üreten karmaşık Bayesçi ağlar yazılabilir.
+
+
+
+
+**56. Remark: examples of probabilistic programs include Hidden Markov model (HMM), factorial HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms and stochastic block models.**
+
+⟶ 56. Not: Olasılık programlarına örnekler arasında Gizli Markov modeli (Hidden Markov model-HMM), faktöriyel HMM, naif Bayes (naive Bayes), gizli Dirichlet tahsisi (latent Dirichlet allocation), hastalıklar ve semptomlar ile stokastik blok modelleri bulunmaktadır.
+
+
+
+
+**57. Summary ― The table below summarizes the common probabilistic programs as well as their applications:**
+
+⟶ 57. Özet - Aşağıdaki tablo, yaygın olasılıklı programları ve bunların uygulamalarını özetlemektedir:
+
+
+
+
+**58. [Program, Algorithm, Illustration, Example]**
+
+⟶ 58. [Program, Algoritma, Gösterim, Örnek]
+
+
+
+
+**59. [Markov Model, Hidden Markov Model (HMM), Factorial HMM, Naive Bayes, Latent Dirichlet Allocation (LDA)]**
+
+⟶ 59. [Markov Modeli, Gizli Markov Modeli (HMM), Faktöriyel HMM, Naif Bayes, Gizli Dirichlet Tahsisi (Latent Dirichlet Allocation-LDA)]
+
+
+
+
+**60. [Generate, distribution]**
+
+⟶ 60. [Üretim, Dağılım]
+
+
+
+
+**61. [Language modeling, Object tracking, Multiple object tracking, Document classification, Topic modeling]**
+
+⟶ 61. [Dil modelleme, Nesne izleme, Çoklu nesne izleme, Belge sınıflandırma, Konu modelleme]
+
+
+
+
+**62. Inference**
+
+⟶ 62. Çıkarım
+
+
+
+
+**63. [General probabilistic inference strategy ― The strategy to compute the probability P(Q|E=e) of query Q given evidence E=e is as follows:, Step 1: Remove variables that are not ancestors of the query Q or the evidence E by marginalization, Step 2: Convert Bayesian network to factor graph, Step 3: Condition on the evidence E=e, Step 4: Remove nodes disconnected from the query Q by marginalization, Step 5: Run a probabilistic inference algorithm (manual, variable elimination, Gibbs sampling, particle filtering)]**
+
+⟶ 63. [Genel olasılıksal çıkarım stratejisi - E = e kanıtı verilen Q sorgusunun P(Q|E=e) olasılığını hesaplama stratejisi aşağıdaki gibidir:, Adım 1: Q sorgusunun veya E kanıtının atası olmayan değişkenleri marjinalleştirme yoluyla silin, Adım 2: Bayesçi ağı faktör grafiğine dönüştürün, Adım 3: E = e kanıtı üzerinde koşullandırın, Adım 4: Q sorgusu ile bağlantısı olmayan düğümleri marjinalleştirme yoluyla silin, Adım 5: Olasılıksal bir çıkarım algoritması çalıştırın (elle, değişken eleme, Gibbs örneklemesi, parçacık filtreleme)]
+
+
+
+
+**64. Forward-backward algorithm ― This algorithm computes the exact value of P(H=hk|E=e) (smoothing query) for any k∈{1,...,L} in the case of an HMM of size L. To do so, we proceed in 3 steps:**
+
+⟶ 64. İleri-geri algoritması - Bu algoritma, L boyutunda bir HMM durumunda herhangi bir k∈{1,...,L} için P(H=hk|E=e) (yumuşatma sorgusu) kesin değerini hesaplar. Bunu yapmak için 3 adımda ilerlenir:
+
+
+
+
+**65. Step 1: for ..., compute ...**
+
+⟶ 65. Adım 1: ... için (for), hesapla ...
+
+
+
+
+**66. with the convention F0=BL+1=1. From this procedure and these notations, we get that**
+
+⟶ 66. F0=BL+1=1 kuralı ile. Bu prosedürden ve bu notasyonlardan şunu elde ederiz
+
+
+
+
+**67. Remark: this algorithm interprets each assignment to be a path where each edge hi−1→hi is of weight p(hi|hi−1)p(ei|hi).**
+
+⟶ 67. Not: bu algoritma, her bir atamayı, her bir hi−1→hi kenarının ağırlığının p(hi|hi−1)p(ei|hi) olduğu bir yol olarak yorumlar.
+
+
+
+
+**68. [Gibbs sampling ― This algorithm is an iterative approximate method that uses a small set of assignments (particles) to represent a large probability distribution. From a random assignment x, Gibbs sampling performs the following steps for i∈{1,...,n} until convergence:, For all u∈Domaini, compute the weight w(u) of assignment x where Xi=u, Sample v from the probability distribution induced by w: v∼P(Xi=v|X−i=x−i), Set Xi=v]**
+
+⟶ 68. [Gibbs örneklemesi - Bu algoritma, büyük bir olasılık dağılımını temsil etmek için küçük bir atama (parçacık) kümesi kullanan yinelemeli bir yaklaşık yöntemdir. Rasgele bir x atamasından başlayarak Gibbs örneklemesi, i∈{1,...,n} için yakınsamaya kadar aşağıdaki adımları uygular:, Tüm u∈Domaini için, Xi=u olan x atamasının w(u) ağırlığını hesaplayın, w tarafından indüklenen olasılık dağılımından v'yi örnekleyin: v∼P(Xi=v|X−i=x−i), Xi=v olarak ayarlayın]
+
+
+
+
+**69. Remark: X−i denotes X∖{Xi} and x−i represents the corresponding assignment.**
+
+⟶ 69. Not: X−i, X∖{Xi} kümesini belirtir ve x−i karşılık gelen atamayı temsil eder.
+
+
+
+
+**70. [Particle filtering ― This algorithm approximates the posterior density of state variables given the evidence of observation variables by keeping track of K particles at a time. Starting from a set of particles C of size K, we run the following 3 steps iteratively:, Step 1: proposal - For each old particle xt−1∈C, sample x from the transition probability distribution p(x|xt−1) and add x to a set C′., Step 2: weighting - Weigh each x of the set C′ by w(x)=p(et|x), where et is the evidence observed at time t., Step 3: resampling - Sample K elements from the set C′ using the probability distribution induced by w and store them in C: these are the current particles xt.]**
+
+⟶ 70. [Parçacık filtreleme - Bu algoritma, bir seferde K parçacığı takip ederek, gözlem değişkenlerinin kanıtı verildiğinde durum değişkenlerinin sonsal yoğunluğunu yaklaşık olarak hesaplar. K boyutunda bir C parçacık kümesinden başlayarak, aşağıdaki 3 adım tekrarlı olarak çalıştırılır:, Adım 1: teklif - Her eski parçacık xt−1∈C için, p(x|xt−1) geçiş olasılığı dağılımından x'i örnekleyin ve x'i bir C′ kümesine ekleyin., Adım 2: ağırlıklandırma - C′ kümesinin her x değerini w(x)=p(et|x) ile ağırlıklandırın, burada et, t zamanında gözlemlenen kanıttır., Adım 3: yeniden örnekleme - w ile indüklenen olasılık dağılımını kullanarak C′ kümesinden K eleman örnekleyin ve bunları C'de saklayın: bunlar şu anki xt parçacıklarıdır.]
+
+
+
+
+**71. Remark: a more expensive version of this algorithm also keeps track of past particles in the proposal step.**
+
+⟶ 71. Not: Bu algoritmanın daha pahalı bir versiyonu, teklif adımında geçmiş parçacıkların da kaydını tutar.
+
+
+
+
+**72. Maximum likelihood ― If we don't know the local conditional distributions, we can learn them using maximum likelihood.**
+
+⟶ 72. Maksimum olabilirlik - Yerel koşullu dağılımları bilmiyorsak, maksimum olabilirlik kullanarak bunları öğrenebiliriz.
+
+
+
+
+**73. Laplace smoothing ― For each distribution d and partial assignment (xParents(i),xi), add λ to countd(xParents(i),xi), then normalize to get probability estimates.**
+
+⟶ 73. Laplace yumuşatma - Her d dağılımı ve (xParents (i), xi) kısmi ataması için, countd(xParents (i), xi)'a λ ekleyin, ardından olasılık tahminlerini almak için normalleştirin.
+
+
+
+
+**74. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶ 74. Algoritma - Beklenti-Maksimizasyon (EM) algoritması, olabilirlik üzerinde tekrar tekrar bir alt sınır oluşturarak (E-adımı) ve bu alt sınırı optimize ederek (M-adımı), θ parametresini maksimum olabilirlik tahmini ile tahmin etmek için aşağıdaki gibi etkin bir yöntem sunar:
+
+
+
+
+**75. [E-step: Evaluate the posterior probability q(h) that each data point e came from a particular cluster h as follows:, M-step: Use the posterior probabilities q(h) as cluster specific weights on data points e to determine θ through maximum likelihood.]**
+
+⟶ 75. [E-adımı: Her bir e veri noktasının belirli bir h kümesinden gelmiş olmasının sonsal olasılığını q(h) şu şekilde değerlendirin:, M-adımı: θ'yı maksimum olabilirlik ile belirlemek için, sonsal olasılıkları q(h) e veri noktaları üzerinde kümeye özgü ağırlıklar olarak kullanın.]
+
+
+
+
+**76. [Factor graphs, Arity, Assignment weight, Constraint satisfaction problem, Consistent assignment]**
+
+⟶ 76. [Faktör grafikleri, İlişki derecesi, Atama ağırlığı, Kısıt sağlama problemi, Tutarlı atama]
+
+
+
+
+**77. [Dynamic ordering, Dependent factors, Backtracking search, Forward checking, Most constrained variable, Least constrained value]**
+
+⟶ 77. [Dinamik düzenleşim, Bağımlı faktörler, Geri izleme araması, İleriye dönük kontrol, En kısıtlı değişken, En düşük kısıtlanmış değer]
+
+
+
+
+**78. [Approximate methods, Beam search, Iterated conditional modes, Gibbs sampling]**
+
+⟶ 78. [Yaklaşık yöntemler, Işın araması, Tekrarlanmış koşullu modlar, Gibbs örneklemesi]
+
+
+
+
+**79. [Factor graph transformations, Conditioning, Elimination]**
+
+⟶ 79. [Faktör grafiği dönüşümleri, Koşullandırma, Eleme]
+
+
+
+
+**80. [Bayesian networks, Definition, Locally normalized, Marginalization]**
+
+⟶ 80. [Bayesçi ağlar, Tanım, Yerel normalleştirme, Marjinalleşme]
+
+
+
+
+**81. [Probabilistic program, Concept, Summary]**
+
+⟶ 81. [Olasılık programı, Kavram, Özet]
+
+
+
+
+**82. [Inference, Forward-backward algorithm, Gibbs sampling, Laplace smoothing]**
+
+⟶ 82. [Çıkarım, İleri-geri algoritması, Gibbs örneklemesi, Laplace yumuşatması]
+
+
+
+
+**83. View PDF version on GitHub**
+
+⟶ 83. GitHub'da PDF versiyonunu görüntüleyin
+
+
+
+
+**84. Original authors**
+
+⟶ 84. Orijinal yazarlar
+
+
+
+
+**85. Translated by X, Y and Z**
+
+⟶ 85. X, Y ve Z tarafından çevrilmiştir.
+
+
+
+
+**86. Reviewed by X, Y and Z**
+
+⟶ 86. X, Y ve Z tarafından kontrol edilmiştir.
+
+
+
+
+**87. By X and Y**
+
+⟶ 87. X ve Y tarafından
+
+
+
+
+**88. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ 88. Yapay Zeka el kitapları artık [hedef dilde] mevcuttur.
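
Note on item 73 (Laplace smoothing) in the file above: the rule "add λ to each count, then normalize" can be illustrated with a short sketch for a single local distribution, that is, for one fixed parent assignment. The function name `laplace_estimate`, the explicit `domain` argument, and the toy observations are assumptions made for this illustration only; they do not come from the cheatsheet.

```python
from collections import Counter


def laplace_estimate(observations, domain, lam=1.0):
    """Estimate p(x) over `domain` from observed values of one variable.

    Hypothetical helper: adds `lam` to every count (including values never
    observed) and then normalizes so that the estimates sum to 1.
    """
    counts = Counter(observations)  # missing values count as 0
    smoothed = {value: counts[value] + lam for value in domain}
    total = sum(smoothed.values())
    return {value: weight / total for value, weight in smoothed.items()}


# Toy data (assumed): "c" is never observed but still gets non-zero mass.
probs = laplace_estimate(["a", "a", "b"], domain=["a", "b", "c"], lam=1.0)
# Smoothed counts are a: 3, b: 2, c: 1, so the estimates are 3/6, 2/6, 1/6.
```

With λ=0 the same function reduces to the plain maximum likelihood estimate of item 72, as long as at least one observation exists.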
diff --git a/tr/cheatsheet-deep-learning.md b/tr/cs-229-deep-learning.md
similarity index 92%
rename from tr/cheatsheet-deep-learning.md
rename to tr/cs-229-deep-learning.md
index da5226222..7c8b3e29e 100644
--- a/tr/cheatsheet-deep-learning.md
+++ b/tr/cs-229-deep-learning.md
@@ -24,7 +24,7 @@
**5. [Input layer, hidden layer, output layer]**
-⟶ [Giriş katmanı, gizli katman, ürün katmanı]
+⟶ [Giriş katmanı, gizli katman, çıkış katmanı]
@@ -60,7 +60,7 @@
**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-⟶ Öğrenme derecesi ― Öğrenme derecesi, sıklıkla α veya bazen η olarak belirtilir, ağırlıkların hangi tempoda güncellendiğini gösterir. Bu derece sabit olabilir veya uyarlamalı olarak değişebilir. Mevcut en gözde yöntem Adam olarak adlandırılan ve öğrenme oranını uyarlayan bir yöntemdir.
+⟶ Öğrenme oranı ― Öğrenme oranı, sıklıkla α veya bazen η olarak belirtilir, ağırlıkların hangi tempoda güncellendiğini gösterir. Bu oran sabit olabilir veya uyarlamalı olarak değişebilir. Mevcut en gözde yöntem Adam olarak adlandırılan ve öğrenme oranını uyarlayan bir yöntemdir.
@@ -150,7 +150,7 @@
**26. [Input gate, forget gate, gate, output gate]**
-⟶ [Girdi kapısı, unutma kapısı, kapı, ürün kapısı]
+⟶ [Girdi kapısı, unutma kapısı, kapı, çıktı kapısı]
@@ -294,28 +294,28 @@
**50. View PDF version on GitHub**
-⟶
+⟶ GitHub'da PDF sürümünü görüntüle
**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-⟶
+⟶ [Yapay Sinir Ağları, Mimari, Aktivasyon fonksiyonu, Geri yayılım, Seyreltme]
**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-⟶
+⟶ [Evrişimsel Sinir Ağları, Evrişim katmanı, Toplu normalizasyon]
**53. [Recurrent Neural Networks, Gates, LSTM]**
-⟶
+⟶ [Yinelenen Sinir Ağları, Kapılar, LSTM]
**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-⟶
+⟶ [Pekiştirmeli öğrenme, Markov karar süreçleri, Değer/politika iterasyonu, Yaklaşık dinamik programlama, Politika araması]
diff --git a/tr/refresher-linear-algebra.md b/tr/cs-229-linear-algebra.md
similarity index 100%
rename from tr/refresher-linear-algebra.md
rename to tr/cs-229-linear-algebra.md
diff --git a/tr/cs-229-machine-learning-tips-and-tricks.md b/tr/cs-229-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..b12670229
--- /dev/null
+++ b/tr/cs-229-machine-learning-tips-and-tricks.md
@@ -0,0 +1,290 @@
+**1. Machine Learning tips and tricks cheatsheet**
+
+⟶ Makine Öğrenmesi ipuçları ve püf noktaları el kitabı
+
+
+
+**2. Classification metrics**
+
+⟶ Sınıflandırma metrikleri
+
+
+
+**3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+⟶ İkili bir sınıflandırma durumunda, modelin performansını değerlendirmek için gerekli olan ana metrikler aşağıda verilmiştir.
+
+
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+⟶ Karışıklık matrisi - Karışıklık matrisi, bir modelin performansını değerlendirirken daha eksiksiz bir sonuca sahip olmak için kullanılır. Aşağıdaki şekilde tanımlanmıştır:
+
+
+
+**5. [Predicted class, Actual class]**
+
+⟶ [Tahmini sınıf, Gerçek sınıf]
+
+
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+⟶ Ana metrikler - Sınıflandırma modellerinin performansını değerlendirmek için aşağıda verilen metrikler yaygın olarak kullanılmaktadır:
+
+
+
+**7. [Metric, Formula, Interpretation]**
+
+⟶ [Metrik, Formül, Açıklama]
+
+
+
+**8. Overall performance of model**
+
+⟶ Modelin genel performansı
+
+
+
+**9. How accurate the positive predictions are**
+
+⟶ Pozitif tahminlerin ne kadar doğru olduğu
+
+
+
+**10. Coverage of actual positive sample**
+
+⟶ Gerçek pozitif örneklerin oranı
+
+
+
+**11. Coverage of actual negative sample**
+
+⟶ Gerçek negatif örneklerin oranı
+
+
+
+**12. Hybrid metric useful for unbalanced classes**
+
+⟶ Dengesiz sınıflar için yararlı hibrit metrik
+
+
+
+**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
+
+⟶ İşlem Karakteristik Eğrisi (ROC) ― İşlem Karakteristik Eğrisi (receiver operating curve), eşik değeri değiştirilerek Doğru Pozitif Oranı-Yanlış Pozitif Oranı grafiğidir. Bu metrikler aşağıdaki tabloda özetlenmiştir:
+
+
+
+**14. [Metric, Formula, Equivalent]**
+
+⟶ [Metrik, Formül, Eşdeğer]
+
+
+
+**15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+⟶ Eğri Altında Kalan Alan (AUC) ― Aynı zamanda AUC veya AUROC olarak belirtilen işlem karakteristik eğrisi altındaki alan, aşağıdaki şekilde gösterildiği gibi İşlem Karakteristik Eğrisi (ROC)'nin altındaki alandır:
+
+
+
+**16. [Actual, Predicted]**
+
+⟶ [Gerçek, Tahmin Edilen]
+
+
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+⟶ Temel metrikler - Bir f regresyon modeli verildiğinde aşağıdaki metrikler genellikle modelin performansını değerlendirmek için kullanılır:
+
+
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+⟶ [Toplam kareler toplamı, Açıklanan kareler toplamı, Artık kareler toplamı]
+
+
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+⟶ Belirleme katsayısı - Genellikle R2 veya r2 olarak belirtilen belirleme katsayısı, gözlemlenen sonuçların model tarafından ne kadar iyi kopyalandığının bir ölçütüdür ve aşağıdaki gibi tanımlanır:
+
+
+
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+⟶ Ana metrikler - Aşağıdaki metrikler, göz önüne aldıkları değişken sayısını dikkate alarak regresyon modellerinin performansını değerlendirmek için yaygın olarak kullanılır:
+
+
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+⟶ burada L olabilirlik ve ˆσ2, her bir yanıtla ilişkili varyansın bir tahminidir.
+
+
+
+**22. Model selection**
+
+⟶ Model seçimi
+
+
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶ Kelime Bilgisi - Bir model seçerken, aşağıdaki gibi sahip olduğumuz verileri 3 farklı parçaya ayırırız:
+
+
+
+**24. [Training set, Validation set, Testing set]**
+
+⟶ [Eğitim seti, Doğrulama seti, Test seti]
+
+
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+⟶ [Model eğitilir, Model değerlendirilir, Model tahminler verir]
+
+
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+⟶ [Genelde veri kümesinin %80'i, Genelde veri kümesinin %20'si]
+
+
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+⟶ [Ayrıca doğrulama için bir kısmını bekletme veya geliştirme seti olarak da bilinir, Görülmemiş veri]
+
+
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶ Model bir kere seçildikten sonra, tüm veri seti üzerinde eğitilir ve görülmemiş test setinde test edilir. Bunlar aşağıdaki şekilde gösterilmiştir:
+
+
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+⟶ Çapraz doğrulama ― CV olarak da belirtilen çapraz doğrulama, başlangıçtaki eğitim setine çok fazla bağımlı olmayan bir modeli seçmek için kullanılan bir yöntemdir. Farklı tipleri aşağıdaki tabloda özetlenmiştir:
+
+
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+⟶ [k − 1 katı üzerinde eğitim ve geriye kalanlar üzerinde değerlendirme, n − p gözlemleri üzerine eğitim ve kalan p üzerinde değerlendirme]
+
+
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+⟶ [Genel olarak k=5 veya 10, p=1 durumu bir tanesini dışarıda bırak (leave-one-out) olarak adlandırılır]
+
+
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+⟶ En yaygın olarak kullanılan yöntem k-kat çapraz doğrulama olarak adlandırılır ve eğitim verilerini k kata ayırır; model k−1 kat üzerinde eğitilirken kalan bir kat üzerinde doğrulanır ve bu işlem k kez tekrarlanır. Hata daha sonra k kat üzerinden ortalanır ve çapraz doğrulama hatası olarak adlandırılır.
+
+
+
+**33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶ Düzenlileştirme (Regularization) - Düzenlileştirme prosedürü, modelin verileri aşırı öğrenmesinden kaçınılmasını ve dolayısıyla yüksek varyans sorunları ile ilgilenmeyi amaçlamaktadır. Aşağıdaki tablo, yaygın olarak kullanılan düzenlileştirme tekniklerinin farklı türlerini özetlemektedir:
+
+
+
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Katsayıları 0'a kadar küçültür, Değişken seçimi için iyi, Katsayıları daha küçük yapar, Değişken seçimi ile küçük katsayılar arasındaki çelişki]
+
+
+
+
+**35. Diagnostics**
+
+⟶ Tanı
+
+
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+⟶ Önyargı - Bir modelin önyargısı, beklenen tahmin ve verilen veri noktaları için tahmin etmeye çalıştığımız doğru model arasındaki farktır.
+
+
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+⟶ Varyans - Bir modelin varyansı, belirli veri noktaları için model tahmininin değişkenliğidir.
+
+
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+⟶ Önyargı/varyans çelişkisi - Model ne kadar basitse önyargı o kadar yüksek, model ne kadar karmaşıksa varyans o kadar yüksek olur.
+
+
+
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+⟶ [Belirtiler, Regresyon illüstrasyonu, sınıflandırma illüstrasyonu, derin öğrenme illüstrasyonu, olası çareler]
+
+
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+⟶ [Yüksek eğitim hatası, Test hatasına yakın eğitim hatası, Yüksek önyargı, Test hatasından biraz daha düşük eğitim hatası, Çok düşük eğitim hatası, Eğitim hatası test hatasının çok altında, Yüksek varyans]
+
+
+
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+⟶ [Modeli karmaşıklaştır, Daha fazla özellik ekle, Daha uzun süre eğit, Düzenlileştirme gerçekleştir, Daha fazla veri topla]
+
+
+
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+⟶ Hata analizi - Hata analizinde mevcut ve mükemmel modeller arasındaki performans farkının temel nedeni analiz edilir.
+
+
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+⟶ Ablatif analiz - Ablatif analizde mevcut ve başlangıç modelleri arasındaki performans farkının temel nedeni analiz edilir.
+
+
+
+**44. Regression metrics**
+
+⟶ Regresyon metrikleri
+
+
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+⟶ [Sınıflandırma metrikleri, karışıklık matrisi, doğruluk, kesinlik, geri çağırma, F1 skoru, ROC]
+
+
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+⟶ [Regresyon metrikleri, R karesi, Mallow'un CP'si, AIC, BIC]
+
+
+
+**47. [Model selection, cross-validation, regularization]**
+
+⟶ [Model seçimi, çapraz doğrulama, düzenlileştirme]
+
+
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+⟶ [Tanı, Önyargı/varyans çelişkisi, hata/ablatif analiz]
diff --git a/tr/cs-229-probability.md b/tr/cs-229-probability.md
new file mode 100644
index 000000000..5e30fe358
--- /dev/null
+++ b/tr/cs-229-probability.md
@@ -0,0 +1,381 @@
+**1. Probabilities and Statistics refresher**
+
+⟶ Olasılık ve İstatistik hatırlatma
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶ Olasılık ve Kombinatoriğe Giriş
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶ Örnek uzay - Bir deneyin olası tüm sonuçlarının kümesi, deneyin örnek uzayı olarak bilinir ve S ile gösterilir.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶ Olay - Örnek uzayın herhangi bir E alt kümesi, olay olarak bilinir. Yani bir olay, deneyin olası sonuçlarından oluşan bir kümedir. Deneyin sonucu E'nin içindeyse, E'nin gerçekleştiğini söyleriz.
+
+
+
+**5. Axioms of probability: For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶ Olasılık aksiyomları: Her E olayı için, E olayının meydana gelme olasılığı P(E) olarak ifade edilir.
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+⟶ Aksiyom 1 - Her olasılık 0 ve 1 de dahil olmak üzere 0 ve 1 arasındadır, yani:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+⟶ Aksiyom 2 - Tüm örnek uzayındaki temel olaylardan en az birinin ortaya çıkma olasılığı 1'dir, yani:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶ Aksiyom 3 - Birbirini dışlayan olayların herhangi bir E1,...,En dizisi için:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶ Permütasyon - Permütasyon, n nesnelik bir havuzdan r nesnenin belirli bir sırayla düzenlenmesidir. Bu tür düzenlemelerin sayısı P(n,r) ile gösterilir ve aşağıdaki gibi tanımlanır:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶ Kombinasyon - Kombinasyon, n nesnelik bir havuzdan r nesnenin, sıranın önemli olmadığı bir düzenlemesidir. Bu tür düzenlemelerin sayısı C(n,r) ile gösterilir ve aşağıdaki gibi tanımlanır:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶ Not: 0⩽r⩽n için P(n,r)⩾C(n,r) olduğuna dikkat edin.
+
+
+
+**12. Conditional Probability**
+
+⟶ Koşullu Olasılık
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶ Bayes kuralı - P(B)>0 olacak şekilde A ve B olayları için:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶ Not: P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶ Bölüntü - Tüm i değerleri için Ai≠∅ olmak üzere {Ai,i∈[[1,n]]} verilsin. Aşağıdakiler sağlanıyorsa {Ai} bir bölüntüdür deriz:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=∑_{i=1}^{n} P(B|Ai)P(Ai).**
+
+⟶ Not: Örnek uzaydaki herhangi bir B olayı için P(B)=∑_{i=1}^{n} P(B|Ai)P(Ai) olur.
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶ Bayes kuralının genişletilmiş formu - {Ai,i∈[[1,n]]} örnek uzayın bir bölüntüsü olsun. Bu durumda:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶ Bağımsızlık - İki olay A ve B birbirinden bağımsızdır ancak ve ancak eğer:
+
+
+
+**19. Random Variables**
+
+⟶ Rastgele Değişkenler
+
+
+
+**20. Definitions**
+
+⟶ Tanımlamalar
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+⟶ Rastgele değişken - Genellikle X olarak ifade edilen bir rastgele değişken, bir örnek uzaydaki her öğeyi gerçel sayı doğrusuna eşleyen bir fonksiyondur.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶ Kümülatif dağılım fonksiyonu (KDF/ Cumulative distribution function-CDF) - Monotonik olarak azalmayan ve limx→−∞F(x)=0 ve limx→+∞F(x)=1 olacak şekilde kümülatif dağılım fonksiyonu F şu şekilde tanımlanır:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+⟶ Not: P(a<X⩽b)=F(b)−F(a).
+
+
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶ Olasılık yoğunluğu fonksiyonu (OYF/Probability density function-PDF) - Olasılık yoğunluğu fonksiyonu f, X'in rastgele değişkenin iki bitişik gerçekleşmesi arasındaki değerleri alması ihtimalidir.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶ OYF ve KDF'yi içeren ilişkiler - Ayrık (D) ve sürekli (C) durumlarda bilinmesi gereken önemli özellikler aşağıdadır.
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶ [Durum, KDF F, OYF f, OYF Özellikleri]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶ Beklenti ve Dağılım Momentleri - Burada, ayrık ve sürekli durumlar için beklenen değer E[X], genelleştirilmiş beklenen değer E[g(X)], k. moment E[Xk] ve karakteristik fonksiyon ψ(ω) ifadeleri verilmiştir:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶ Varyans - Genellikle Var(X) veya σ2 olarak ifade edilen rastgele değişkenin varyansı, dağılım fonksiyonunun yayılmasının bir ölçüsüdür. Aşağıdaki şekilde belirlenir:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶ Standart sapma - Genellikle σ olarak ifade edilen rastgele bir değişkenin standart sapması, gerçek rastgele değişkenin birimleriyle uyumlu olan dağılım fonksiyonunun yayılmasının bir ölçüsüdür. Aşağıdaki şekilde belirlenir:
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶ Rastgele değişkenlerin dönüşümü - X ve Y değişkenleri bir fonksiyonla birbirine bağlı olsun. fX ve fY sırasıyla X ve Y'nin dağılım fonksiyonları olmak üzere:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶ Leibniz integral kuralı - g, x'in ve muhtemelen c'nin bir fonksiyonu; a ve b ise c'ye bağlı olabilecek sınırlar olsun. Bu durumda:
+
+
+
+**32. Probability Distributions**
+
+⟶ Olasılık Dağılımları
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶ Chebyshev eşitsizliği - X, beklenen değeri μ olan bir rastgele değişken olsun. k,σ>0 için aşağıdaki eşitsizlik elde edilir:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶ Ana dağılımlar - İşte akılda tutulması gereken ana dağılımlar:
+
+
+
+**35. [Type, Distribution]**
+
+⟶ [Tür, Dağılım]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶ Ortak Dağılımlı Rastgele Değişkenler
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+⟶ Marjinal yoğunluk ve kümülatif dağılım - fXY ortak yoğunluk olasılık fonksiyonundan,
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶ [Durum, Marjinal yoğunluk, Kümülatif fonksiyon]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶ Koşullu yoğunluk - Genellikle fX|Y olarak gösterilen, X'in Y'ye göre koşullu yoğunluğu şöyle tanımlanır:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶ Bağımsızlık - Aşağıdaki koşul sağlanıyorsa X ve Y rastgele değişkenlerinin bağımsız olduğu söylenir:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶ Kovaryans - σ2XY veya daha yaygın olarak Cov(X,Y) ile gösterdiğimiz, X ve Y rastgele değişkenlerinin kovaryansını aşağıdaki gibi tanımlarız:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶ Korelasyon - σX ve σY, X ve Y'nin standart sapmaları olmak üzere, ρXY ile gösterilen X ve Y rastgele değişkenleri arasındaki korelasyonu şu şekilde tanımlarız:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶ Not 1: Herhangi X, Y rastgele değişkenleri için ρXY∈[−1,1] olduğuna dikkat edin.
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶ Not 2: Eğer X ve Y bağımsızsa, ρXY = 0 olur.
+
+
+
+**45. Parameter estimation**
+
+⟶ Parametre tahmini (kestirimi)
+
+
+
+**46. Definitions**
+
+⟶ Tanımlamalar
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶ Rastgele örnek - Rastgele bir örnek, X ile bağımsız ve aynı şekilde dağılmış n adet X1,...,Xn rastgele değişkeninden oluşan bir topluluktur.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶ Tahminci (Kestirimci) - Tahmin edici, istatistiksel bir modelde bilinmeyen bir parametrenin değerini ortaya çıkarmak için kullanılan verilerin bir fonksiyonudur.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶ Önyargı - Bir ^θ tahmin edicisinin önyargısı, ^θ dağılımının beklenen değeri ile gerçek değer arasındaki fark olarak tanımlanır, yani:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶ Not: E[^θ]=θ olduğunda bir tahmincinin tarafsız olduğu söylenir.
+
+
+
+**51. Estimating the mean**
+
+⟶ Ortalamayı tahmin etme
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted X̄ and is defined as follows:**
+
+⟶ Örnek ortalaması - Rastgele bir örneğin örnek ortalaması, bir dağılımın gerçek ortalaması μ'yü tahmin etmek için kullanılır, genellikle X̄ olarak belirtilir ve şöyle tanımlanır:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[X̄]=μ.**
+
+⟶ Not: örnek ortalaması tarafsızdır, yani E[X̄]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶ Merkezi Limit Teoremi - Ortalaması μ ve varyansı σ2 olan belirli bir dağılımı izleyen X1,...,Xn rastgele örneğimiz olsun. Bu durumda:
+
+
+
+**55. Estimating the variance**
+
+⟶ Varyansı tahmin etmek
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶ Örnek varyansı - Rastgele bir örneğin örnek varyansı, bir dağılımın gerçek varyansı σ2'yi tahmin etmek için kullanılır, genellikle s2 veya ^σ2 olarak gösterilir ve aşağıdaki gibi tanımlanır:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶ Not: örnek varyansı yansızdır, yani E[s2]=σ2.
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶ Örnek varyansı ile ki-kare ilişkisi - s2, rastgele bir örneğin örnek varyansı olsun. Elde edilir:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶ [Giriş, Örnek uzay, Olay, Permütasyon]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶ [Koşullu olasılık, Bayes kuralı, Bağımsızlık]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶ [Rastgele değişkenler, Tanımlamalar, Beklenti, Varyans]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶ [Olasılık dağılımları, Chebyshev eşitsizliği, Ana dağılımlar]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶ [Ortak dağılımlı rastgele değişkenler, Yoğunluk, Kovaryans, Korelasyon]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶ [Parametre tahmini, Ortalama, Varyans]
diff --git a/tr/cs-229-supervised-learning.md b/tr/cs-229-supervised-learning.md
new file mode 100644
index 000000000..90d816803
--- /dev/null
+++ b/tr/cs-229-supervised-learning.md
@@ -0,0 +1,567 @@
+**1. Supervised Learning cheatsheet**
+
+⟶ Gözetimli Öğrenme El kitabı
+
+
+
+**2. Introduction to Supervised Learning**
+
+⟶ Gözetimli Öğrenmeye Giriş
+
+
+
+**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+⟶ {y(1),...,y(m)} çıktı kümesi ile ilişkili olan {x(1),...,x(m)} veri noktalarının kümesi göz önüne alındığında, x'ten y'yi nasıl tahmin edeceğini öğrenen bir sınıflandırıcı tasarlamak istiyoruz.
+
+
+
+**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+⟶ Tahmin türü ― Farklı tahmin modelleri aşağıdaki tabloda özetlenmiştir:
+
+
+
+**5. [Regression, Classifier, Outcome, Examples]**
+
+⟶ [Regresyon, Sınıflandırıcı, Çıktı, Örnekler]
+
+
+
+**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+⟶ [Sürekli, Sınıf, Lineer regresyon (bağlanım), Lojistik regresyon (bağlanım), Destek Vektör Makineleri (DVM), Naive Bayes]
+
+
+
+**7. Type of model ― The different models are summed up in the table below:**
+
+⟶ Model türleri ― Farklı modeller aşağıdaki tabloda özetlenmiştir:
+
+
+
+**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+⟶ [Ayırt edici model, Üretici model, Amaç, Öğrenilenler, İllüstrasyon, Örnekler]
+
+
+
+**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+⟶ [P(y|x)'i doğrudan tahmin et, P(y|x)'i çıkarsamak için önce P(x|y)'i tahmin et, Karar sınırı, Verilerin olasılık dağılımları, Regresyonlar, Destek Vektör Makineleri, Gauss Diskriminant Analizi, Naive Bayes]
+
+
+
+**10. Notations and general concepts**
+
+⟶ Gösterimler ve genel konsept
+
+
+
+**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+⟶ Hipotez ― Hipotez hθ olarak belirtilir ve bizim seçtiğimiz modeldir. Verilen bir x(i) girdi verisi için modelin tahmin ettiği çıktı hθ(x(i))'dir.
+
+
+
+**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+⟶ Kayıp fonksiyonu ― Kayıp fonksiyonu, y gerçek veri değerine karşılık gelen tahmin değeri z'yi girdi olarak alan ve bunların ne kadar farklı olduklarını çıktı olarak veren L:(z,y)∈R×Y⟼L(z,y)∈R şeklinde bir fonksiyondur. Yaygın kayıp fonksiyonları aşağıdaki tabloda özetlenmiştir:
+
+
+
+**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+⟶ [En küçük kareler hatası, Lojistik yitimi (kaybı), Menteşe yitimi (kaybı), Çapraz entropi]
+
+
+
+**14. [Linear regression, Logistic regression, SVM, Neural Network]**
+
+⟶ [Lineer regresyon (bağlanım), Lojistik regresyon (bağlanım), Destek Vektör Makineleri, Sinir Ağı]
+
+
+
+**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+⟶ Maliyet fonksiyonu ― J maliyet fonksiyonu genellikle bir modelin performansını değerlendirmek için kullanılır ve L kayıp fonksiyonu ile aşağıdaki gibi tanımlanır:
+
+
+
+**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+⟶ Bayır inişi ― α∈R öğrenme oranı olmak üzere, bayır inişi için güncelleme kuralı, öğrenme oranı ve J maliyet fonksiyonu ile aşağıdaki gibi ifade edilir:
+
+
+
+**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
+
+⟶ Not: Stokastik bayır inişi (Stochastic gradient descent-SGD) parametreyi her bir eğitim örneğine dayanarak günceller, yığın bayır inişi ise bir yığın eğitim örneğine dayanarak günceller.
+
+
+
+**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+⟶ Olabilirlik - θ parametreleri verilen bir modelin olabilirliği L(θ), olabilirliği maksimize ederek en uygun θ parametrelerini bulmak için kullanılır. Uygulamada, optimize edilmesi daha kolay olan log-olabilirlik ℓ(θ)=log(L(θ)) kullanılır. Buna göre:
+
+
+
+**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+⟶ Newton'un algoritması - ℓ′(θ)=0 olacak şekilde bir θ bulan nümerik bir yöntemdir. Güncelleme kuralı aşağıdaki gibidir:
+
+
+
+**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+⟶ Not: Newton-Raphson yöntemi olarak da bilinen çok boyutlu genelleme aşağıdaki güncelleme kuralına sahiptir:
+
+
+
+**21. Linear models**
+
+⟶ Lineer modeller
+
+
+
+**22. Linear regression**
+
+⟶ Lineer regresyon
+
+
+
+**23. We assume here that y|x;θ∼N(μ,σ2)**
+
+⟶ Burada y|x;θ∼N(μ,σ2) olduğunu varsayıyoruz
+
+
+
+**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+⟶ Normal denklemler - X tasarım matrisi olmak üzere, maliyet fonksiyonunu en aza indiren θ değeri aşağıdaki kapalı formlu çözümdür:
+
+
+
+**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+⟶ En Küçük Ortalama Kareler algoritması (Least Mean Squares-LMS) - α öğrenme oranı olmak üzere, m veri noktası içeren bir eğitim kümesi için, Widrow-Hoff öğrenme kuralı olarak da bilinen En Küçük Ortalama Kareler algoritmasının güncelleme kuralı aşağıdaki gibidir:
+
+
+
+**26. Remark: the update rule is a particular case of the gradient ascent.**
+
+⟶ Not: güncelleme kuralı, bayır yükselişinin özel bir halidir.
+
+
+
+**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+⟶ Yerel Ağırlıklı Regresyon (Locally Weighted Regression-LWR) - LWR olarak da bilinen Yerel Ağırlıklı Regresyon, maliyet fonksiyonunda her eğitim örneğini w(i)(x) ile ağırlıklandıran bir doğrusal regresyon çeşididir. w(i)(x), τ∈R parametresi ile aşağıdaki gibi tanımlanır:
+
+
+
+**28. Classification and logistic regression**
+
+⟶ Sınıflandırma ve lojistik regresyon
+
+
+
+**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+⟶ Sigmoid fonksiyonu - Lojistik fonksiyonu olarak da bilinen sigmoid fonksiyonu g, aşağıdaki gibi tanımlanır:
+
+
+
+**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+⟶ Lojistik regresyon - y|x;θ∼Bernoulli(ϕ) olduğunu varsayıyoruz. Aşağıdaki forma sahibiz:
+
+
+
+**31. Remark: there is no closed form solution for the case of logistic regressions.**
+
+⟶ Not: Lojistik regresyon durumunda kapalı form çözümü yoktur.
+
+
+
+**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+⟶ Softmax regresyonu - Çok sınıflı lojistik regresyon olarak da adlandırılan softmax regresyonu, 2'den fazla çıktı sınıfı olduğunda lojistik regresyonu genelleştirmek için kullanılır. Geleneksel olarak θK=0 alınır; bu da her i sınıfının Bernoulli parametresi ϕi'yi şuna eşit yapar:
+
+
+
+**33. Generalized Linear Models**
+
+⟶ Genelleştirilmiş Lineer Modeller
+
+
+
+**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+⟶ Üstel aile - Bir dağılım sınıfının, kanonik parametre veya bağlantı fonksiyonu olarak da adlandırılan doğal parametre η, yeterli istatistik T(y) ve log-partition fonksiyonu a(η) cinsinden aşağıdaki gibi yazılabiliyorsa üstel ailede olduğu söylenir:
+
+
+
+**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+⟶ Not: Çoğu zaman T(y)=y olur. Ayrıca exp(−a(η)), olasılıkların toplamının bir olmasını sağlayan normalleştirme parametresi olarak görülebilir.
+
+
+
+**36. Here are the most common exponential distributions summed up in the following table:**
+
+⟶ Aşağıdaki tabloda özetlenen en yaygın üstel dağılımlar:
+
+
+
+**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+⟶ [Dağılım, Bernoulli, Gauss, Poisson, Geometrik]
+
+
+
+**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+⟶ Genelleştirilmiş Lineer Modellerin (Generalized Linear Models-GLM) Varsayımları - Genelleştirilmiş Lineer Modeller, rastgele bir y değişkenini x∈Rn+1'in bir fonksiyonu olarak tahmin etmeyi hedefler ve aşağıdaki 3 varsayıma dayanır:
+
+
+
+**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+⟶ Not: sıradan en küçük kareler ve lojistik regresyon, genelleştirilmiş doğrusal modellerin özel durumlarıdır.
+
+
+
+**40. Support Vector Machines**
+
+⟶ Destek Vektör Makineleri
+
+
+
+**41. The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+
+⟶ Destek Vektör Makinelerinin amacı, doğruya olan minimum mesafeyi maksimuma çıkaran doğruyu bulmaktır.
+
+
+
+**42. Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+⟶ Optimal marj sınıflandırıcısı - h optimal marj sınıflandırıcısı şöyledir:
+
+
+
+**43. where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+⟶ burada (w,b)∈Rn×R, aşağıdaki optimizasyon probleminin çözümüdür:
+
+
+
+**44. such that**
+
+⟶ öyle ki
+
+
+
+**45. support vectors**
+
+⟶ destek vektörleri
+
+
+
+**46. Remark: the line is defined as wTx−b=0.**
+
+⟶ Not: doğru wTx−b=0 şeklinde tanımlanır.
+
+
+
+**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+⟶ Menteşe yitimi (kaybı) - Menteşe yitimi Destek Vektör Makineleri bağlamında kullanılır ve aşağıdaki gibi tanımlanır:
+
+
+
+**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
+
+⟶ Çekirdek - Bir ϕ özellik eşlemesi verildiğinde, K çekirdeğini şöyle tanımlarız:
+
+
+**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||²/(2σ²)) is called the Gaussian kernel and is commonly used.**
+
+⟶ Uygulamada, K(x,z)=exp(−||x−z||²/(2σ²)) ile tanımlanan K çekirdeği, Gauss çekirdeği olarak adlandırılır ve yaygın olarak kullanılır.
+
+
+
+**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+⟶ [Lineer olmayan ayrılabilirlik, Çekirdek haritalamasının kullanımı, Orijinal uzayda karar sınırı]
+
+
+
+**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+⟶ Not: Çekirdeği kullanarak maliyet fonksiyonunu hesaplamak için "çekirdek numarası"nı kullandığımızı söylüyoruz, çünkü genellikle çok karmaşık olan açık ϕ haritalamasını bilmemize aslında gerek yoktur. Bunun yerine yalnızca K(x,z) değerlerine ihtiyaç vardır.
+
+
+
+**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+⟶ Lagranj - Lagranj L(w,b)'yi şöyle tanımlarız:
+
+
+
+**53. Remark: the coefficients βi are called the Lagrange multipliers.**
+
+⟶ Not: βi katsayılarına Lagranj çarpanları denir.
+
+
+
+**54. Generative Learning**
+
+⟶ Üretici Öğrenme
+
+
+
+**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+⟶ Üretici bir model, önce P(x|y)'yi tahmin ederek verilerin nasıl üretildiğini öğrenmeye çalışır; bunu daha sonra Bayes kuralını kullanarak P(y|x)'i tahmin etmek için kullanabiliriz.
+
+
+
+**56. Gaussian Discriminant Analysis**
+
+⟶ Gauss Diskriminant (Ayırtaç) Analizi
+
+
+
+**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+⟶ Yöntem - Gauss Diskriminant Analizi y, x|y=0 ve x|y=1'in şu şekilde olduğunu varsayar:
+
+
+
+**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+⟶ Tahmin - Aşağıdaki tablo, olabilirliği maksimize ederken bulduğumuz tahminleri özetlemektedir:
+
+
+
+**59. Naive Bayes**
+
+⟶ Naive Bayes
+
+
+
+**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+⟶ Varsayım - Naive Bayes modeli, her veri noktasının özelliklerinin tamamen bağımsız olduğunu varsayar:
+
+
+
+**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+⟶ Çözümler - Log-olabilirliğin maksimize edilmesi, k∈{0,1},l∈[[1,L]] olmak üzere aşağıdaki çözümleri verir:
+
+
+
+**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+⟶ Not: Naive Bayes, metin sınıflandırması ve spam tespitinde yaygın olarak kullanılır.
+
+
+
+**63. Tree-based and ensemble methods**
+
+⟶ Ağaç temelli ve topluluk yöntemleri
+
+
+
+**64. These methods can be used for both regression and classification problems.**
+
+⟶ Bu yöntemler hem regresyon hem de sınıflandırma problemleri için kullanılabilir.
+
+
+
+**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
+
+⟶ CART - Genellikle karar ağaçları olarak bilinen Sınıflandırma ve Regresyon Ağaçları (Classification and Regression Trees-CART), ikili ağaçlar olarak temsil edilebilir. Çok yorumlanabilir olma avantajına sahiptirler.
+
+
+
+**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+⟶ Rastgele orman - Rastgele seçilmiş özellik kümelerinden oluşturulan çok sayıda karar ağacı kullanan ağaç tabanlı bir tekniktir. Basit karar ağacının tersine, oldukça yorumlanamaz bir yapıdadır ancak genel olarak iyi performansı onu popüler bir algoritma yapar.
+
+
+
+**67. Remark: random forests are a type of ensemble methods.**
+
+⟶ Not: Rastgele ormanlar topluluk yöntemlerindendir.
+
+
+
+**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+⟶ Artırım - Artırım yöntemlerinin temel fikri bazı zayıf öğrenicileri biraraya getirerek güçlü bir öğrenici oluşturmaktır. Temel yöntemler aşağıdaki tabloda özetlenmiştir:
+
+
+
+**69. [Adaptive boosting, Gradient boosting]**
+
+⟶ [Adaptif artırma, Gradyan artırma]
+
+
+
+**70. High weights are put on errors to improve at the next boosting step**
+
+⟶ Bir sonraki artırma adımında iyileştirmek için hatalara yüksek ağırlıklar verilir
+
+
+
+**71. Weak learners trained on remaining errors**
+
+⟶ Zayıf öğreniciler kalan hatalar üzerinde eğitilir
+
+
+
+**72. Other non-parametric approaches**
+
+⟶ Diğer parametrik olmayan yaklaşımlar
+
+
+
+**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶ k-en yakın komşular - Genellikle k-NN olarak adlandırılan k-en yakın komşular algoritması, bir veri noktasının tepkisinin eğitim kümesindeki k komşusunun doğası ile belirlendiği parametrik olmayan bir yaklaşımdır. Hem sınıflandırma hem de regresyon için kullanılabilir.
+
+
+
+**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶ Not: k parametresi ne kadar yüksekse, yanlılık o kadar yüksek ve k parametresi ne kadar düşükse, varyans o kadar yüksek olur.
+
+
+
+**75. Learning Theory**
+
+⟶ Öğrenme Teorisi
+
+
+
+**76. Union bound ― Let A1,...,Ak be k events. We have:**
+
+⟶ Birleşim sınırı - A1,...,Ak k adet olay olsun. Bu durumda:
+
+
+
+**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+⟶ Hoeffding eşitsizliği - Z1,..,Zm, ϕ parametreli bir Bernoulli dağılımından çekilen m adet bağımsız ve aynı dağılımlı (iid) değişken olsun. ˆϕ bunların örnek ortalaması ve γ>0 sabit olsun. Bu durumda:
+
+
+
+**78. Remark: this inequality is also known as the Chernoff bound.**
+
+⟶ Not: Bu eşitsizlik, Chernoff sınırı olarak da bilinir.
+
+
+
+**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+⟶ Eğitim hatası ― Belirli bir h sınıflandırıcısı için, ampirik risk veya ampirik hata olarak da bilinen eğitim hatası ˆϵ(h)'yi şöyle tanımlarız:
+
+
+
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: **
+
+⟶ Olası Yaklaşık Doğru (Probably Approximately Correct (PAC)) ― PAC, öğrenme teorisi üzerine çok sayıda sonucun kanıtlandığı ve aşağıdaki varsayımlara sahip olan bir çerçevedir:
+
+
+
+**81: the training and testing sets follow the same distribution **
+
+⟶ eğitim ve test kümeleri aynı dağılımı takip eder
+
+
+
+**82. the training examples are drawn independently**
+
+⟶ eğitim örnekleri bağımsız olarak çekilir
+
+
+
+**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+⟶ Parçalanma ― S={x(1),...,x(d)} kümesi ve bir H sınıflandırıcılar kümesi verildiğinde, herhangi bir {y(1),...,y(d)} etiket kümesi için aşağıdaki sağlanıyorsa H'nin S'yi parçaladığını söyleriz:
+
+
+
+**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+⟶ Üst sınır teoremi ― H, |H|=k olacak şekilde sonlu bir hipotez sınıfı olsun; δ ve örneklem büyüklüğü m sabit olsun. Bu durumda, en az 1−δ olasılıkla:
+
+
+
+**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+⟶ VC boyutu ― VC(H) olarak ifade edilen belirli bir sonsuz H hipotez sınıfının Vapnik-Chervonenkis (VC) boyutu, H tarafından parçalanan en büyük kümenin boyutudur.
+
+
+
+**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+⟶ Not: H = {2 boyutta doğrusal sınıflandırıcılar kümesi}'nin VC boyutu 3'tür.
+
+
+
+**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+⟶ Teorem (Vapnik) ― VC(H)=d olan bir H ve eğitim örneği sayısı m verilmiş olsun. En az 1−δ olasılıkla:
+
+
+
+**88. [Introduction, Type of prediction, Type of model]**
+
+⟶ [Giriş, Tahmin türü, Model türü]
+
+
+
+**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
+
+⟶ [Notasyonlar ve genel kavramlar, kayıp fonksiyonu, bayır inişi, olabilirlik]
+
+
+
+**90. [Linear models, linear regression, logistic regression, generalized linear models]**
+
+⟶ [Lineer modeller, Lineer regresyon, lojistik regresyon, genelleştirilmiş lineer modeller]
+
+
+
+**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
+
+⟶ [Destek vektör makineleri, optimal marj sınıflandırıcı, Menteşe yitimi, Çekirdek]
+
+
+
+**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
+
+⟶ [Üretici öğrenme, Gauss Diskriminant Analizi, Naive Bayes]
+
+
+
+**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
+
+⟶ [Ağaçlar ve topluluk yöntemleri, CART, Rastgele orman, Artırma]
+
+
+
+**94. [Other methods, k-NN]**
+
+⟶ [Diğer yöntemler, k-NN]
+
+
+
+**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
+
+⟶ [Öğrenme teorisi, Hoeffding eşitsizliği, PAC, VC boyutu]
diff --git a/tr/cs-229-unsupervised-learning.md b/tr/cs-229-unsupervised-learning.md
new file mode 100644
index 000000000..c6392c414
--- /dev/null
+++ b/tr/cs-229-unsupervised-learning.md
@@ -0,0 +1,340 @@
+**1. Unsupervised Learning cheatsheet**
+
+⟶ Gözetimsiz Öğrenme El Kitabı
+
+
+
+**2. Introduction to Unsupervised Learning**
+
+⟶ Gözetimsiz Öğrenmeye Giriş
+
+
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+⟶ Motivasyon ― Gözetimsiz öğrenmenin amacı, etiketlenmemiş {x(1),...,x(m)} verilerindeki gizli örüntüleri bulmaktır.
+
+
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+⟶ Jensen eşitsizliği ― f bir konveks fonksiyon ve X bir rastgele değişken olsun. Aşağıdaki eşitsizlik geçerlidir:
+
+
+
+**5. Clustering**
+
+⟶ Kümeleme
+
+
+
+**6. Expectation-Maximization**
+
+⟶ Beklenti-Ençoklama (Maksimizasyon)
+
+
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+⟶ Gizli değişkenler ― Gizli değişkenler, tahmin problemlerini zorlaştıran ve çoğunlukla z ile gösterilen gizli/gözlemlenmemiş değişkenlerdir. Gizli değişkenlerin bulunduğu en yaygın durumlar şunlardır:
+
+
+
+**8. [Setting, Latent variable z, Comments]**
+
+⟶ [Durum, Gizli değişken z, Açıklamalar]
+
+
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+⟶ [k Gauss karışımı, Faktör analizi]
+
+
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶ Algoritma ― Beklenti-Ençoklama (Maksimizasyon) (BE) algoritması, olabilirlik için art arda bir alt sınır oluşturup (E-adımı) bu alt sınırı optimize ederek (M-adımı) θ parametresini maksimum olabilirlik kestirimiyle tahmin etmek için aşağıdaki gibi etkin bir yöntem sunar:
+
+
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+⟶ E-adımı: Her bir x(i) veri noktasının belirli bir z(i) kümesinden geldiğine ilişkin Qi(z(i)) sonsal olasılığını aşağıdaki gibi hesaplayın:
+
+
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+⟶ M-adımı: Her bir küme modelini ayrı ayrı yeniden tahmin etmek için Qi(z(i)) sonsal olasılıklarını x(i) veri noktaları üzerinde kümeye özgü ağırlıklar olarak aşağıdaki gibi kullanın:
+
+
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+⟶ [Gauss ilklendirme, Beklenti adımı, Maksimizasyon adımı, Yakınsaklık]
+
+
+
+**14. k-means clustering**
+
+⟶ k-ortalamalar (k-means) kümeleme
+
+
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+⟶ c(i) ile i veri noktasının kümesini, μj ile j kümesinin merkezini gösteririz.
+
+
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶ Algoritma ― Küme merkezleri μ1,μ2,...,μk∈Rn rastgele olarak ilklendirildikten sonra, k-ortalamalar algoritması yakınsayana kadar aşağıdaki adımı tekrar eder:
+
+
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶ [Ortalamaların ilklendirilmesi, Küme atama, Ortalamaların güncellenmesi, Yakınsama]
+
+
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+⟶ Bozulma fonksiyonu ― Algoritmanın yakınsayıp yakınsamadığını görmek için aşağıdaki gibi tanımlanan bozulma fonksiyonuna bakarız:
+
+
+
+**19. Hierarchical clustering**
+
+⟶ Hiyerarşik kümeleme
+
+
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
+
+⟶ Algoritma ― İç içe geçmiş kümeleri ardışık olarak oluşturan, birleştirici (agglomerative) hiyerarşik yaklaşıma sahip bir kümeleme algoritmasıdır.
+
+
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
+
+⟶ Türler - Aşağıdaki tabloda özetlenen farklı amaç fonksiyonlarını optimize etmeyi amaçlayan farklı hiyerarşik kümeleme algoritmaları vardır:
+
+
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+⟶ [Ward bağlantı, Ortalama bağlantı, Tam bağlantı]
+
+
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
+
+⟶ [Küme içi uzaklığı en aza indirin, Küme çiftleri arasındaki ortalama uzaklığı en aza indirin, Küme çiftleri arasındaki maksimum uzaklığı en aza indirin]
+
+
+
+**24. Clustering assessment metrics**
+
+⟶ Kümeleme değerlendirme metrikleri
+
+
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+⟶ Gözetimsiz bir öğrenme ortamında, bir modelin performansını değerlendirmek çoğu zaman zordur, çünkü gözetimli öğrenme ortamında olduğu gibi, gerçek referans etiketlere sahip değiliz.
+
+
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+⟶ Siluet katsayısı ― Bir örnek ile aynı sınıftaki diğer tüm noktalar arasındaki ortalama uzaklığı a, bir örnek ile bir sonraki en yakın kümedeki diğer tüm noktalar arasındaki ortalama uzaklığı b ile göstererek, tek bir örnek için siluet katsayısı s aşağıdaki gibi tanımlanır:
+
+
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+⟶ Calinski-Harabaz indeksi ― k küme sayısını göstermek üzere, Bk ve Wk sırasıyla kümeler arası ve küme içi dağılım matrisleri olup aşağıdaki gibi tanımlanır
+
+
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+⟶ Calinski-Harabaz indeksi s(k), bir kümeleme modelinin kümelerini ne kadar iyi tanımladığını gösterir; skor ne kadar yüksekse kümeler o kadar yoğun ve iyi ayrılmıştır. Aşağıdaki şekilde tanımlanır:
+
+
+
+**29. Dimension reduction**
+
+⟶ Boyut küçültme
+
+
+
+**30. Principal component analysis**
+
+⟶ Temel bileşenler analizi
+
+
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+⟶ Verilerin izdüşürüleceği, varyansı en büyük yapan yönleri bulan bir boyut küçültme tekniğidir.
+
+
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ Özdeğer, özvektör ― Bir A∈Rn×n matrisi verildiğinde, aşağıdaki eşitliği sağlayan ve özvektör olarak adlandırılan bir z∈Rn∖{0} vektörü varsa, λ'ya A'nın bir özdeğeri denir:
+
+
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ Spektral teorem ― A∈Rn×n olsun. Eğer A simetrik ise, A gerçek bir ortogonal matris U∈Rn×n ile diyagonalleştirilebilir. Λ=diag(λ1,...,λn) gösterimiyle:
+
+
+
+**34. diagonal**
+
+⟶ diyagonal
+
+
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+⟶ Not: En büyük özdeğerle ilişkili özvektör, A matrisinin temel özvektörü olarak adlandırılır.
+
+
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
+
+⟶ Algoritma ― Temel Bileşen Analizi (TBA) yöntemi, verinin varyansını en büyük yaparak veriyi k boyuta izdüşüren bir boyut küçültme tekniğidir ve aşağıdaki gibi uygulanır:
+
+
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶ Adım 1: Verileri ortalama 0 ve standart sapma 1 olacak şekilde normalleştirin.
+
+
+
+**38. Step 2: Compute Σ=(1/m)∑_{i=1}^{m} x(i)x(i)^T∈Rn×n, which is symmetric with real eigenvalues.**
+
+⟶ Adım 2: Gerçek özdeğerlere sahip ve simetrik olan Σ=(1/m)∑_{i=1}^{m} x(i)x(i)^T∈Rn×n matrisini hesaplayın.
+
+
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+⟶ Adım 3: Σ'nın k ortogonal temel özvektörünü, yani en büyük k özdeğere karşılık gelen ortogonal özvektörleri u1,...,uk∈Rn hesaplayın.
+
+
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+⟶ Adım 4: Verileri spanR(u1,...,uk) üzerine izdüşürün.
+
+
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶ Bu yöntem tüm k-boyutlu uzaylar arasındaki varyansı en üst düzeye çıkarır.
+
+
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶ [Öznitelik uzayındaki veri, Temel bileşenleri bul, Temel bileşenler uzayındaki veri]
+
+
+
+**43. Independent component analysis**
+
+⟶ Bağımsız bileşen analizi
+
+
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+⟶ Veriyi üreten temel kaynakları bulmayı amaçlayan bir tekniktir.
+
+
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+⟶ Varsayımlar ― x verisinin, si'ler bağımsız rastgele değişkenler olmak üzere n boyutlu kaynak vektörü s=(s1,...,sn) tarafından, tekil olmayan bir karıştırma matrisi A aracılığıyla aşağıdaki gibi üretildiğini varsayıyoruz:
+
+
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+⟶ Amaç, ayrıştırma (unmixing) matrisi W=A−1'i bulmaktır.
+
+
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+⟶ Bell ve Sejnowski ICA algoritması ― Bu algoritma, aşağıdaki adımları izleyerek ayrıştırma (unmixing) matrisi W'yi bulur:
+
+
+
+**48. Write the probability of x=As=W−1s as:**
+
+⟶ x=As=W−1s olasılığını aşağıdaki gibi yazın:
+
+
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+⟶ Eğitim verimiz {x(i),i∈[[1,m]]} verildiğinde ve g ile sigmoid fonksiyonunu göstererek log-olabilirliği aşağıdaki gibi yazın:
+
+
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+⟶ Bu nedenle, rassal (stokastik) gradyan yükselişi öğrenme kuralına göre, her bir x(i) eğitim örneği için W'yi aşağıdaki gibi güncelleriz:
+
+
+
+**51. The Machine Learning cheatsheets are now available in Turkish.**
+
+⟶ Makine Öğrenmesi el kitapları artık Türkçe olarak mevcuttur.
+
+
+
+**52. Original authors**
+
+⟶ Orijinal yazarlar
+
+
+
+**53. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından çevrilmiştir
+
+
+
+**54. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından gözden geçirilmiştir
+
+
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+⟶ [Giriş, Motivasyon, Jensen eşitsizliği]
+
+
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+⟶ [Kümeleme, Beklenti-Ençoklama (Maksimizasyon), k-ortalamalar, Hiyerarşik kümeleme, Metrikler]
+
+
+
+**57. [Dimension reduction, PCA, ICA]**
+
+⟶ [Boyut küçültme, TBA(PCA), BBA(ICA)]
diff --git a/tr/cs-230-convolutional-neural-networks.md b/tr/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..e1fd03e51
--- /dev/null
+++ b/tr/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,712 @@
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶ Evrişimli Sinir Ağları el kitabı
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Derin Öğrenme
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶ [Genel bakış, Mimari yapı]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶ [Katman tipleri, Evrişim, Ortaklama, Tam bağlantı]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶ [Filtre hiperparametreleri, Boyut, Adım aralığı/Adım kaydırma, Ekleme/Doldurma]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶ [Hiperparametrelerin ayarlanması, Parametre uyumluluğu, Model karmaşıklığı, Alıcı alan (receptive field)]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶ [Aktivasyon fonksiyonları, Düzeltilmiş Doğrusal Birim, Softmax]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶ [Nesne algılama, Model tipleri, Algılama, Kesiştirilmiş Bölgeler, Maksimum olmayan bastırma, YOLO, R-CNN]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶ [Yüz doğrulama/tanıma, Tek atış öğrenme, Siamese ağ, Üçlü yitim/kayıp]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶ [Sinirsel stil aktarımı, Aktivasyon, Stil matrisi, Stil/içerik maliyet fonksiyonu]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶ [İşlemsel püf nokta mimarileri, Çekişmeli Üretici Ağ, ResNet, Inception Ağı]
+
+
+
+
+**12. Overview**
+
+⟶ Genel bakış
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶ Geleneksel bir CNN (Evrişimli Sinir Ağı) mimarisi - CNN'ler olarak da bilinen evrişimli sinir ağları, genellikle aşağıdaki katmanlardan oluşan belirli bir tür sinir ağıdır:
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶ Evrişim katmanı ve ortaklama katmanı, sonraki bölümlerde açıklanan hiperparametreler ile ince ayar (fine-tuned) yapılabilir.
+
+
+
+
+**15. Types of layer**
+
+⟶ Katman tipleri
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶ Evrişim katmanı (CONV) ― Evrişim katmanı (CONV) evrişim işlemlerini gerçekleştiren filtreleri, I girişini boyutlarına göre tararken kullanır. Hiperparametreleri F filtre boyutunu ve S adımını içerir. Elde edilen çıktı O, öznitelik haritası veya aktivasyon haritası olarak adlandırılır.
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶ Not: evrişim adımı, 1B ve 3B durumlarda da genelleştirilebilir (B: boyut).
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶ Ortaklama (POOL) ― Ortaklama katmanı (POOL), tipik olarak bir evrişim katmanından sonra uygulanan ve bir miktar uzamsal değişmezlik sağlayan bir alt örnekleme (downsampling) işlemidir. Özellikle, maksimum ve ortalama ortaklama, sırasıyla maksimum ve ortalama değerin alındığı özel ortaklama türleridir.
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶ [Tip, Amaç, Görsel Açıklama, Açıklama]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶ [Maksimum ortaklama, Ortalama ortaklama, Her ortaklama işlemi geçerli görünümün maksimum değerini seçer, Her ortaklama işlemi geçerli görünümün değerlerinin ortalamasını alır]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶ [Algılanan öznitelikleri korur, En yaygın kullanılan, Öznitelik haritasını alt örnekler, LeNet'te kullanılmıştır]
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶ Tam Bağlantı (FC) ― Tam bağlı katman (FC), her girişin tüm nöronlara bağlı olduğu düzleştirilmiş bir giriş üzerinde çalışır. Eğer varsa, FC katmanları genellikle CNN mimarilerinin sonuna doğru bulunur ve sınıf skorları gibi hedefleri optimize etmek için kullanılabilir.
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶ Filtre hiperparametreleri
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.**
+
+⟶ Evrişim katmanı, hiperparametrelerinin ardındaki anlamı bilmenin önemli olduğu filtreler içerir.
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶ Bir filtrenin boyutları ― C kanal içeren bir girişe uygulanan F×F boyutunda bir filtre, I×I×C boyutundaki bir giriş üzerinde evrişim gerçekleştiren ve O×O×1 boyutunda bir çıkış öznitelik haritası (aktivasyon haritası olarak da adlandırılır) üreten F×F×C boyutunda bir hacimdir.
+
+
+
+
+**26. Filter**
+
+⟶ Filtre
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶ Not: F×F boyutunda K adet filtrenin uygulanması, O×O×K boyutunda bir çıkış öznitelik haritası üretir.
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶ Adım aralığı ― Bir evrişim veya ortaklama işlemi için, S adım aralığı, her işlemden sonra pencerenin hareket ettiği piksel sayısını belirtir.
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶ Sıfır ekleme/doldurma ― Sıfır ekleme/doldurma, girişin sınırlarının her bir tarafına P sıfır ekleme işlemini belirtir. Bu değer manuel olarak belirlenebilir veya aşağıda detaylandırılan üç moddan biri ile otomatik olarak ayarlanabilir:
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶ [Mod, Değer, Görsel Açıklama, Amaç, Geçerli, Aynı, Tam]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶ [Ekleme/doldurma yok, Boyutlar uyuşmuyorsa son evrişimi düşürür, Öznitelik haritasının boyutu ⌈I/S⌉ olacak şekilde ekleme/doldurma, Çıktı boyutu matematiksel olarak uygundur, 'Yarım' ekleme olarak da bilinir, Son evrişimlerin giriş sınırlarına uygulandığı maksimum ekleme, Filtre girişi uçtan uca 'görür']
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶ Hiperparametreleri ayarlama
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶ Evrişim katmanında parametre uyumu ― I giriş hacminin uzunluğunu, F filtrenin uzunluğunu, P sıfır ekleme miktarını, S adım aralığını göstermek üzere, öznitelik haritasının o boyut boyunca çıkış büyüklüğü O şu şekilde verilir:
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶ [Giriş, Filtre, Çıktı]
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶ Not: çoğunlukla, Pstart=Pend≜P, bu durumda Pstart+Pend'i yukarıdaki formülde 2P ile değiştirebiliriz.
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶ Modelin karmaşıklığını anlama ― Bir modelin karmaşıklığını değerlendirmek için mimarisinin sahip olacağı parametre sayısını belirlemek genellikle yararlıdır. Bir evrişimli sinir ağının belirli bir katmanında bu, aşağıdaki şekilde yapılır:
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶ [Görsel Açıklama, Giriş boyutu, Çıkış boyutu, Parametre sayısı, Not]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶ [Ortaklama işlemi kanal bazında yapılır, Çoğu durumda S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶ [Giriş düzleştirilir, Nöron başına bir bias parametresi, Tam bağlantı (FC) nöronlarının sayısı yapısal kısıtlamalardan bağımsızdır]
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶ Alıcı alan (receptive field) ― k katmanındaki alıcı alan, k-inci aktivasyon haritasının her bir pikselinin 'görebildiği', girişin Rk×Rk ile gösterilen alanıdır. Fj j katmanının filtre boyutunu, Si i katmanının adım aralığı değerini göstermek üzere ve S0=1 kabulüyle, k katmanındaki alıcı alan şu formülle hesaplanabilir:
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶ Aşağıdaki örnekte, F1=F2=3 ve S1=S2=1 için R2=1+2⋅1+2⋅1=5 sonucu elde edilir.
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶ Yaygın olarak kullanılan aktivasyon fonksiyonları
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶ Düzeltilmiş Doğrusal Birim ― Düzeltilmiş doğrusal birim katmanı (ReLU), hacmin tüm elemanlarına uygulanan bir g aktivasyon fonksiyonudur. Ağa doğrusal olmayan özellikler kazandırmayı amaçlar. Çeşitleri aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶ [ReLU, Sızıntılı ReLU, ELU, ile]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶ [Doğrusal olmama karmaşıklığı biyolojik olarak yorumlanabilir, Negatif değerler için ölen ReLU sorununu giderir, Her yerde türevlenebilir]
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶ Softmax ― Softmax adımı, x∈Rn skorlarının bir vektörünü girdi olarak alan ve mimarinin sonunda softmax fonksiyonundan p∈Rn çıkış olasılık vektörünü oluşturan genelleştirilmiş bir lojistik fonksiyon olarak görülebilir. Aşağıdaki gibi tanımlanır:
+
+
+
+
+**48. where**
+
+⟶ burada
+
+
+
+
+**49. Object detection**
+
+⟶ Nesne algılama
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶ Model tipleri ― Tahmin edilen şeyin doğasının farklı olduğu 3 ana nesne tanıma algoritması türü vardır. Bunlar aşağıdaki tabloda açıklanmıştır:
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶ [Görüntü sınıflandırma, Sınıflandırma ve lokalizasyon (konumlama), Algılama]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶ [Oyuncak ayı, Kitap]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶ [Bir görüntüyü sınıflandırır, Nesnenin olasılığını tahmin eder, Görüntüdeki bir nesneyi algılar/tanır, Nesnenin olasılığını ve bulunduğu yeri tahmin eder, Bir görüntüdeki birden fazla nesneyi algılar, Nesnelerin olasılıklarını ve nerede olduklarını tahmin eder]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶ [Geleneksel CNN, Basitleştirilmiş YOLO (You-Only-Look-Once), R-CNN (R: Region - Bölge), YOLO, R-CNN]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶ Algılama ― Nesne algılama bağlamında, yalnızca nesneyi konumlandırmak mı yoksa görüntüdeki daha karmaşık bir şekli tespit etmek mi istediğimize bağlı olarak farklı yöntemler kullanılır. İki ana yöntem aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶ [Sınırlayıcı kutu ile tespit, Karakteristik nokta algılama]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶ [Görüntüde nesnenin bulunduğu yeri algılar, Bir nesnenin şeklini veya özelliklerini algılar (örneğin gözler), Daha ayrıntılı]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶ [Kutu merkezi (bx,by), yükseklik bh ve genişlik bw, Referans noktalar (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶ Kesiştirilmiş Bölgeler ― IoU (Intersection over Union) olarak da bilinen Kesiştirilmiş Bölgeler, tahmin edilen sınırlama kutusu Bp'nin gerçek sınırlama kutusu Ba üzerinde ne kadar doğru konumlandırıldığını ölçen bir fonksiyondur. Şu şekilde tanımlanır:
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶ Not: her zaman IoU∈[0,1] olur. Geleneksel olarak, tahmin edilen bir sınırlama kutusu Bp, IoU(Bp,Ba)⩾0.5 ise makul derecede iyi kabul edilir.
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶ Öneri (Anchor) kutular ― Öneri kutuları, örtüşen sınırlayıcı kutuları tahmin etmek için kullanılan bir tekniktir. Uygulamada, ağın aynı anda birden fazla kutuyu tahmin etmesine izin verilir; her kutu tahmini belirli bir geometrik özellik kümesine sahip olacak şekilde kısıtlanır. Örneğin, ilk tahmin belirli bir biçimde dikdörtgen bir kutu olabilirken, ikincisi farklı bir geometrik biçimde başka bir dikdörtgen kutu olacaktır.
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶ Maksimum olmayan bastırma ― Maksimum olmayan bastırma tekniği, en uygun temsilleri seçerek aynı nesneye ait yinelenen ve örtüşen sınırlama kutularını kaldırmayı amaçlar. Olasılık tahmini 0.6'dan düşük olan tüm kutular çıkarıldıktan sonra, kutu kaldığı sürece aşağıdaki adımlar tekrarlanır:
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶ [Verilen bir sınıf için, Adım 1: En büyük tahmin olasılığı olan kutuyu seçin., Adım 2: Önceki kutuyla IoU⩾0.5 olan herhangi bir kutuyu çıkarın.]
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶ [Kutu tahmini/kestirimi, Maksimum olasılığa göre kutu seçimi, Aynı sınıf için örtüşme kaldırma, Son sınırlama kutuları]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶ YOLO ― You Only Look Once (YOLO), aşağıdaki adımları uygulayan bir nesne algılama algoritmasıdır:
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶ [Adım 1: Giriş görüntüsünü G×G kare parçalara (hücrelere) bölün., Adım 2: Her bir hücre için, aşağıdaki formdaki y'yi öngören bir CNN çalıştırın:, k kez tekrarlanır]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶ burada pc bir nesneyi algılama olasılığı, bx,by,bh,bw tespit edilen sınırlayıcı kutunun özellikleri, c1,...,cp p sınıftan hangisinin tespit edildiğinin one-hot temsili ve k öneri (anchor) kutularının sayısıdır.
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶ Adım 3: Potansiyel olarak yinelenen ve çakışan sınırlayıcı kutuları kaldırmak için maksimum olmayan bastırma algoritmasını çalıştırın.
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶ [Orijinal görüntü, GxG kare parçalara (hücrelere) bölünmesi, Sınırlayıcı kutu kestirimi, Maksimum olmayan bastırma]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶ Not: pc=0 olduğunda, ağ herhangi bir nesne algılamamaktadır. Bu durumda, ilgili bx, ..., cp tahminleri dikkate alınmamalıdır.
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes.**
+
+⟶ R-CNN ― Evrişimli Sinir Ağları ile Bölge Bulma (R-CNN), potansiyel ilgili sınırlayıcı kutuları bulmak için önce görüntüyü bölütleyen (segmente eden) ve daha sonra bu sınırlayıcı kutulardaki en olası nesneleri bulmak için algılama algoritmasını çalıştıran bir nesne algılama algoritmasıdır.
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶ [Orijinal görüntü, Bölütleme (Segmentasyon), Sınırlayıcı kutu kestirimi, Maksimum olmayan bastırma]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶ Not: Orijinal algoritma hesaplama açısından maliyetli ve yavaş olmasına rağmen, Fast R-CNN ve Faster R-CNN gibi yeni mimariler algoritmanın daha hızlı çalışmasını sağlamıştır.
+
+
+
+
+**74. Face verification and recognition**
+
+⟶ Yüz doğrulama ve tanıma
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in table below:**
+
+⟶ Model tipleri ― İki temel model aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶ [Yüz doğrulama, Yüz tanıma, Sorgu, Kaynak, Veri tabanı]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶ [Bu doğru kişi mi?, Bire bir arama, Veritabanındaki K kişilerden biri mi?, Bire-çok arama]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶ Tek Atış (One-Shot) Öğrenme ― Tek Atış Öğrenme, verilen iki görüntünün ne kadar farklı olduğunu ölçen bir benzerlik fonksiyonunu öğrenmek için sınırlı bir eğitim seti kullanan bir yüz doğrulama algoritmasıdır. İki görüntüye uygulanan benzerlik fonksiyonu genellikle d(resim 1, resim 2) olarak gösterilir.
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶ Siyam (Siamese) Ağı - Siyam Ağı, iki görüntünün ne kadar farklı olduğunu ölçmek için görüntülerin nasıl kodlanacağını öğrenmeyi amaçlar. Belirli bir giriş görüntüsü x(i) için kodlanmış çıkış genellikle f(x(i)) olarak alınır.
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶ Üçlü kayıp ― Üçlü kayıp ℓ, A (öneri), P (pozitif) ve N (negatif) görüntü üçlüsünün gömme gösterimi üzerinde hesaplanan bir kayıp fonksiyonudur. Öneri ve pozitif örnek aynı sınıfa aitken, negatif örnek başka bir sınıfa aittir. α∈R+ marj parametresi olmak üzere, bu kayıp aşağıdaki gibi tanımlanır:
+
+
+
+
+**81. Neural style transfer**
+
+⟶ Sinirsel stil transferi (aktarımı)
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶ Motivasyon ― Sinirsel stil transferinin amacı, verilen bir C içeriğine ve verilen bir S stiline dayanan bir G görüntüsü oluşturmaktır.
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶ [İçerik C, Stil S, Oluşturulan görüntü G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶ Aktivasyon ― Belirli bir l katmanında, aktivasyon a[l] olarak gösterilir ve nH×nw×nc boyutlarındadır
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶ İçerik maliyeti fonksiyonu ― İçerik maliyeti fonksiyonu Jcontent(C,G), oluşturulan G görüntüsünün orijinal C içerik görüntüsünden ne kadar farklı olduğunu belirlemek için kullanılır. Aşağıdaki gibi tanımlanır:
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶ Stil matrisi ― Belirli bir l katmanının stil matrisi G[l], her bir G[l]kk′ elemanının k ve k′ kanallarının ne kadar ilişkili olduğunu ölçtüğü bir Gram matrisidir. a[l] aktivasyonlarına göre aşağıdaki gibi tanımlanır:
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶ Not: Stil görüntüsü ve oluşturulan görüntü için stil matrisi, sırasıyla G[l] (S) ve G[l] (G) olarak belirtilmiştir.
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶ Stil maliyeti fonksiyonu - Stil maliyeti fonksiyonu Jstyle(S,G), oluşturulan G görüntüsünün S stilinden ne kadar farklı olduğunu belirlemek için kullanılır. Aşağıdaki gibi tanımlanır:
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶ Genel maliyet fonksiyonu - Genel maliyet fonksiyonu, α, β parametreleriyle ağırlıklandırılan içerik ve stil maliyet fonksiyonlarının bir kombinasyonu olarak tanımlanır:
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶ Not: yüksek bir α değeri modelin içeriğe daha fazla önem vermesini sağlarken, yüksek bir β değeri de stile önem verir.
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶ Hesaplama ipuçları kullanan mimariler
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+⟶ Çekişmeli Üretici Ağlar ― GAN olarak da bilinen çekişmeli üretici ağlar, üretici ve ayırt edici bir modelden oluşur; üretici model, üretilen görüntü ile gerçek görüntüyü ayırt etmeyi amaçlayan ayırıcıya beslenecek en gerçekçi çıktıyı oluşturmayı amaçlar.
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶ [Eğitim, Gürültü, Gerçek dünya görüntüsü, Üretici, Ayırıcı, Gerçek Sahte]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶ Not: GAN türevlerinin kullanım alanları arasında yazıdan görüntü oluşturma, müzik üretimi ve sentezi yer alır.
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ ResNet ― Artık Ağ mimarisi (ResNet olarak da bilinir), eğitim hatasını azaltmak için çok sayıda katman içeren artık bloklar kullanır. Artık blok aşağıdaki karakterizasyon denklemine sahiptir:
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶ Inception Ağ ― Bu mimari inception modüllerini kullanır ve özellik çeşitlendirme yoluyla performansını artırmak için farklı evrişimleri denemeyi amaçlamaktadır. Özellikle, hesaplama yükünü sınırlamak için 1×1 evrişim hilesini kullanır.
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Derin Öğrenme el kitapları artık [hedef dilde] mevcuttur.
+
+
+
+
+**98. Original authors**
+
+⟶ Orijinal yazarlar
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından çevrildi
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından kontrol edildi
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶ GitHub'da PDF sürümünü görüntüleyin
+
+
+
+
+**102. By X and Y**
+
+⟶ X ve Y tarafından
+
+
diff --git a/tr/cs-230-deep-learning-tips-and-tricks.md b/tr/cs-230-deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..8bc96d387
--- /dev/null
+++ b/tr/cs-230-deep-learning-tips-and-tricks.md
@@ -0,0 +1,450 @@
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+⟶ Derin öğrenme püf noktaları ve ipuçları el kitabı
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Derin Öğrenme
+
+
+
+
+**3. Tips and tricks**
+
+⟶ Püf noktaları ve ipuçları
+
+
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+⟶ [Veri işleme, Veri artırma, Küme normalizasyonu]
+
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+⟶ [Bir sinir ağının eğitilmesi, Dönem (Epok), Mini-küme, Çapraz-entropi yitimi (kaybı), Geriye yayılım, Gradyan (Bayır) iniş, Ağırlıkların güncellenmesi, Gradyan (Bayır) kontrolü]
+
+
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+⟶ [Parametrelerin ayarlanması, Xavier başlatma, Transfer öğrenme, Öğrenme oranı, Uyarlamalı öğrenme oranları]
+
+
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+⟶ [Düzenlileştirme, Seyreltme, Ağırlık düzenlileştirme, Erken durdurma]
+
+
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+⟶ [İyi uygulamalar, Küçük kümelerin ezberlenmesi, Gradyan kontrolü]
+
+
+
+
+**9. View PDF version on GitHub**
+
+⟶ GitHub'da PDF sürümünü görüntüleyin
+
+
+
+
+**10. Data processing**
+
+⟶ Veri işleme
+
+
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+⟶ Veri artırma ― Derin öğrenme modelleri genellikle uygun şekilde eğitilmek için çok fazla veriye ihtiyaç duyar. Veri artırma tekniklerini kullanarak mevcut verilerden daha fazla veri üretmek genellikle yararlıdır. Temel işlemler aşağıdaki tabloda özetlenmiştir. Daha doğrusu, aşağıdaki girdi görüntüsüne bakıldığında, uygulayabileceğimiz teknikler şunlardır:
+
+
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+⟶ [Orijinal, Çevirme, Rotasyon (Yönlendirme), Rastgele kırpma/kesme]
+
+
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+⟶ [Herhangi bir değişiklik yapılmamış görüntü, Görüntünün anlamının korunduğu bir eksene göre çevrilmiş görüntü, Hafif açılı döndürme, Yanlış yatay kalibrasyonu simüle eder, Görüntünün bir bölümüne rastgele odaklanma, Arka arkaya birkaç rastgele kesme yapılabilir]
+
+
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+⟶ [Renk değişimi, Gürültü ekleme, Bilgi kaybı, Kontrast değişimi]
+
+
+
+
+**15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]**
+
+⟶ [RGB nüansları biraz değiştirilir, Işığa maruz kalma sırasında oluşabilecek gürültüyü yakalar, Gürültü ekleme, Girdilerin kalite değişkenliğine karşı daha fazla tolerans, Görüntünün bazı bölümleri yok sayılır, Görüntünün bazı bölümlerinin olası kaybını taklit eder, Parlaklık değişimleri, Günün saatine bağlı pozlama farklılıklarını kontrol eder]
+
+
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+⟶ Not: Veriler genellikle eğitim sırasında anında (on the fly) artırılır.
+
+
+
+
+**17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
+
+⟶ Küme normalleştirme ― {xi} kümesini normalleştiren, γ,β hiperparametrelerine sahip bir adımdır. Kümenin düzeltmek istediğimiz ortalaması ve varyansı μB,σ2B ile gösterilirse, şu şekilde yapılır:
+
+
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶ Genellikle tam bağlı/evrişimli bir katmandan sonra ve doğrusal olmayan bir katmandan önce yapılır. Daha yüksek öğrenme oranlarına izin vermeyi ve başlangıç değerlerine olan güçlü bağımlılığı azaltmayı amaçlar.
+
+
+
+
+**19. Training a neural network**
+
+⟶ Bir sinir ağının eğitilmesi
+
+
+
+
+**20. Definitions**
+
+⟶ Tanımlamalar
+
+
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+⟶ Dönem (Epok/Epoch) ― Bir modelin eğitimi kapsamında dönem, modelin ağırlıklarını güncellemek için tüm eğitim setini gördüğü bir yinelemeyi ifade etmek için kullanılan bir terimdir.
+
+
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+⟶ Mini-küme gradyan (bayır) iniş ― Eğitim aşamasında, ağırlıkların güncellenmesi genellikle hesaplama karmaşıklığı nedeniyle bir kerede tüm eğitim setine veya gürültü sorunları nedeniyle tek bir veri noktasına dayanmaz. Bunun yerine güncelleme adımı, bir kümedeki veri noktası sayısının ayarlayabileceğimiz bir hiperparametre olduğu mini-kümeler üzerinde yapılır.
+
+
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+⟶ Yitim fonksiyonu ― Belirli bir modelin nasıl bir performans gösterdiğini ölçmek için, L yitim (kayıp) fonksiyonu genellikle y gerçek çıktıların, z model çıktıları tarafından ne kadar doğru tahmin edildiğini değerlendirmek için kullanılır.
+
+
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶ Çapraz-entropi kaybı ― Yapay sinir ağlarında ikili sınıflandırma bağlamında, çapraz entropi kaybı L (z, y) yaygın olarak kullanılır ve şöyle tanımlanır:
+
+
+
+
+**25. Finding optimal weights**
+
+⟶ Optimum ağırlıkların bulunması
+
+
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+⟶ Geriye yayılım ― Geri yayılım, asıl çıktıyı ve istenen çıktıyı dikkate alarak sinir ağındaki ağırlıkları güncellemek için kullanılan bir yöntemdir. Her bir ağırlığa göre türev, zincir kuralı kullanılarak hesaplanır.
+
+
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+⟶ Bu yöntemi kullanarak, her ağırlık şu kurala göre güncellenir:
+
+
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶ Ağırlıkların güncellenmesi ― Bir sinir ağında, ağırlıklar aşağıdaki gibi güncellenir:
+
+
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+⟶ [Adım 1: Bir küme eğitim verisi alın ve kaybı hesaplamak için ileri yayılım gerçekleştirin, Adım 2: Her ağırlığa göre kaybın gradyanını elde etmek için kaybı geriye doğru yayın, Adım 3: Ağın ağırlıklarını güncellemek için gradyanları kullanın.]
+
+
+
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+⟶ [İleri yayılım, Geriye yayılım, Ağırlıkların güncellenmesi]
+
+
+
+
+**31. Parameter tuning**
+
+⟶ Parametre ayarlama
+
+
+
+
+**32. Weights initialization**
+
+⟶ Ağırlıkların başlangıçlandırılması
+
+
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.**
+
+⟶ Xavier başlangıcı (ilklendirme) ― Ağırlıkları tamamen rastgele bir şekilde başlatmak yerine, Xavier başlangıcı, mimariye özgü özellikleri dikkate alan ilk ağırlıkların alınmasını sağlar.
+
+
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+⟶ Transfer öğrenme ― Bir derin öğrenme modelini eğitmek çok fazla veri ve daha da önemlisi çok zaman gerektirir. Eğitilmesi günler/haftalar süren dev veri setleri üzerinde önceden eğitilmiş ağırlıklardan yararlanmak ve bunları kendi kullanım durumumuza uyarlamak genellikle yararlıdır. Elimizdeki veri miktarına bağlı olarak, bundan yararlanmanın farklı yolları şunlardır:
+
+
+
+
+**35. [Training size, Illustration, Explanation]**
+
+⟶ [Eğitim boyutu, Görselleştirme, Açıklama]
+
+
+
+
+**36. [Small, Medium, Large]**
+
+⟶ [Küçük, Orta, Büyük]
+
+
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+⟶ [Tüm katmanları dondurur, softmax'taki ağırlıkları eğitir, Çoğu katmanı dondurur, son katmanlardaki ve softmax'taki ağırlıkları eğitir, Ağırlıkları önceden eğitilmiş olanlarla başlatarak katmanlardaki ve softmax'taki ağırlıkları eğitir]
+
+
+
+
+**38. Optimizing convergence**
+
+⟶ Yakınsamayı optimize etmek
+
+
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶ Öğrenme oranı (adımı) ― Genellikle α veya bazen η olarak belirtilen öğrenme oranı, ağırlıkların hangi hızda güncellendiğini belirler. Sabitlenebilir veya uyarlanabilir şekilde değiştirilebilir. Mevcut en popüler yöntemin adı Adam'dır ve öğrenme oranını uyarlayan bir yöntemdir.
+
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+⟶ Uyarlanabilir öğrenme oranları ― Bir modelin eğitilmesi sırasında öğrenme oranının değişmesine izin vermek eğitim süresini kısaltabilir ve sayısal optimum çözümü iyileştirebilir. Adam optimizasyonu yöntemi en çok kullanılan teknik olmasına rağmen, diğer yöntemler de faydalı olabilir. Bunlar aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+⟶ [Yöntem, Açıklama, w'ların güncellenmesi, b'nin güncellenmesi]
+
+
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+⟶ [Momentum, Osilasyonların azaltılması/yumuşatılması, SGD (Stokastik Gradyan/Bayır İniş) iyileştirmesi, Ayarlanacak 2 parametre]
+
+
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+⟶ [RMSprop, Kök Ortalama Kare yayılımı, Osilasyonları kontrol ederek öğrenme algoritmasını hızlandırır]
+
+
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+⟶ [Adam, Uyarlamalı Moment tahmini/kestirimi, En popüler yöntem, Ayarlanacak 4 parametre]
+
+
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+⟶ Not: diğer yöntemler arasında Adadelta, Adagrad ve SGD yer alır.
+
+
+
+
+**46. Regularization**
+
+⟶ Düzenlileştirme
+
+
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+⟶ Seyreltme ― Seyreltme, sinir ağlarında, nöronları p>0 olasılıkla devre dışı bırakarak eğitim verilerine aşırı uyumu (overfitting) önlemek için kullanılan bir tekniktir. Modeli, belirli özellik kümelerine çok fazla güvenmekten kaçınmaya zorlar.
+
+
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+⟶ Not: Çoğu derin öğrenme kütüphanesi, seyreltmeyi 'keep' ('tutma') parametresi 1−p aracılığıyla parametrize eder.
+
+
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+⟶ Ağırlık düzenlileştirme ― Ağırlıkların çok büyük olmadığından ve modelin eğitim setine aşırı uyum göstermediğinden emin olmak için, genellikle model ağırlıklarına düzenlileştirme teknikleri uygulanır. Temel olanlar aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**50. [LASSO, Ridge, Elastic Net]**
+
+⟶ [LASSO, Ridge, Elastic Net]
+
+
+
+**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Katsayıları 0'a düşürür, Değişken seçimi için iyi, Katsayıları daha küçük yapar, Değişken seçimi ile küçük katsayılar arasında ödünleşim sağlar]
+
+
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+⟶ Erken durdurma ― Bu düzenlileştirme tekniği, doğrulama kaybı bir platoya ulaştığında veya artmaya başladığında eğitim sürecini hemen durdurur.
+
+
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+⟶ [Hata, Geçerleme/Doğrulama, Eğitim, erken durdurma, Dönemler (Epok)]
+
+
+
+
+**53. Good practices**
+
+⟶ İyi uygulamalar
+
+
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+⟶ Küçük kümelerin ezberlenmesi ― Bir modelde hata ayıklama yaparken, modelin mimarisinde büyük bir sorun olup olmadığını görmek için hızlı testler yapmak genellikle yararlıdır. Özellikle, modelin uygun şekilde eğitilebildiğinden emin olmak için, ağa bir mini-küme verilir ve ağın bunu ezberleyip ezberleyemediğine bakılır. Ezberleyemiyorsa, bu, modelin normal boyutta bir eğitim seti bir yana, küçük bir kümeyi bile ezberleyemeyecek kadar karmaşık olduğu ya da yeterince karmaşık olmadığı anlamına gelir.
+
+
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+⟶ Gradyan kontrolü ― Gradyan kontrolü, bir sinir ağının geriye doğru geçişinin uygulanması sırasında kullanılan bir yöntemdir. Analitik gradyanın değerini verilen noktalardaki sayısal gradyanla karşılaştırır ve doğruluk için bir sağlama kontrolü rolü oynar.
+
+
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+⟶ [Tip, Sayısal gradyan, Analitik gradyan]
+
+
+
+
+**57. [Formula, Comments]**
+
+⟶ [Formül, Açıklamalar]
+
+
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+⟶ [Maliyetli; kayıp, boyut başına iki kere hesaplanmalı, Analitik uygulamanın doğruluğunu kontrol etmek için kullanılır, h seçiminde ödünleşim: ne çok küçük (sayısal kararsızlık) ne de çok büyük (zayıf gradyan yaklaşımı) olmalı]
+
+
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+⟶ ['Kesin' sonuç, Doğrudan hesaplama, Son uygulamada kullanılır]
+
+
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Derin Öğrenme el kitapları artık [hedef dilde] mevcuttur.
+
+**61. Original authors**
+
+⟶ Orijinal yazarlar
+
+
+
+**62. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından çevrildi
+
+
+
+**63. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından gözden geçirildi
+
+
+
+**64. View PDF version on GitHub**
+
+⟶ GitHub'da PDF sürümünü görüntüleyin
+
+
+
+**65. By X and Y**
+
+⟶ X ve Y tarafından
+
+
diff --git a/tr/cs-230-recurrent-neural-networks.md b/tr/cs-230-recurrent-neural-networks.md
new file mode 100644
index 000000000..17536b665
--- /dev/null
+++ b/tr/cs-230-recurrent-neural-networks.md
@@ -0,0 +1,674 @@
+**1. Recurrent Neural Networks cheatsheet**
+
+⟶ Tekrarlayan Yapay Sinir Ağları (Recurrent Neural Networks-RNN) El Kitabı
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Derin Öğrenme
+
+
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+⟶ [Genel bakış, Mimari yapı, RNN'lerin uygulamaları, Kayıp fonksiyonu, Geriye Yayılım]
+
+
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+⟶ [Uzun vadeli bağımlılıkların ele alınması, Yaygın aktivasyon fonksiyonları, Gradyanın kaybolması / patlaması, Gradyan kırpma, GRU / LSTM, Kapı tipleri, Çift Yönlü RNN, Derin RNN]
+
+
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+⟶ [Kelime gösterimini öğrenme, Notasyonlar, Gömme matrisi, Word2vec, Skip-gram, Negatif örnekleme, GloVe]
+
+
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+⟶ [Kelimeleri karşılaştırmak, Kosinüs benzerliği, t-SNE]
+
+
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+⟶ [Dil modeli, n-gram, Karışıklık]
+
+
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+⟶ [Makine çevirisi, Işın araması, Uzunluk normalizasyonu, Hata analizi, Bleu skoru]
+
+
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+⟶ [Dikkat, Dikkat modeli, Dikkat ağırlıkları]
+
+
+
+
+**10. Overview**
+
+⟶ Genel Bakış
+
+
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+⟶ Geleneksel bir RNN mimarisi ― RNN'ler olarak da bilinen tekrarlayan sinir ağları, gizli durumlara sahipken önceki çıktıların girdi olarak kullanılmasına izin veren bir sinir ağları sınıfıdır. Tipik olarak aşağıdaki gibidirler:
+
+
+
+
+**12. For each timestep t, the activation a and the output y are expressed as follows:**
+
+⟶ Her bir t zamanında, a aktivasyonu ve y çıktısı aşağıdaki gibi ifade edilir:
+
+
+
+
+**13. and**
+
+⟶ ve
+
+
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+⟶ burada Wax,Waa,Wya,ba,by zamansal olarak paylaşılan katsayılardır ve g1,g2 aktivasyon fonksiyonlarıdır.
+
+
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+⟶ Tipik bir RNN mimarisinin artıları ve eksileri aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+⟶ [Avantajlar, Herhangi bir uzunluktaki girdilerin işlenmesi imkanı, Girdi büyüklüğüyle artmayan model boyutu, Geçmiş bilgileri dikkate alarak hesaplama, Zaman içinde paylaşılan ağırlıklar]
+
+
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+⟶ [Dezavantajları, Yavaş hesaplama, Uzun zaman önceki bilgiye erişme zorluğu, Mevcut durum için gelecekteki herhangi bir girdinin düşünülememesi]
+
+
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+⟶ RNN'lerin Uygulamaları ― RNN modelleri çoğunlukla doğal dil işleme ve konuşma tanıma alanlarında kullanılır. Farklı uygulamalar aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+⟶ [RNN Türü, Gösterim, Örnek]
+
+
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+⟶ [Bire bir, Bire çok, Çoka bir, Çoka çok]
+
+
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]**
+
+⟶ [Geleneksel sinir ağı, Müzik üretimi, Duygu sınıflandırma, İsim varlık tanıma, Makine çevirisi]
+
+
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+⟶ Kayıp fonksiyonu ― Tekrarlayan bir sinir ağı olması durumunda, tüm zaman dilimlerindeki L kayıp fonksiyonu, her zaman dilimindeki kayıbı temel alınarak aşağıdaki gibi tanımlanır:
+
+
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+⟶ Zamanla geri yayılım ― Geriye yayılım zamanın her noktasında yapılır. T zaman diliminde, ağırlık matrisi W'ye göre L kaybının türevi aşağıdaki gibi ifade edilir:
+
+
+
+
+**24. Handling long term dependencies**
+
+⟶ Uzun vadeli bağımlılıkların ele alınması
+
+
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+⟶ Yaygın olarak kullanılan aktivasyon fonksiyonları ― RNN modüllerinde kullanılan en yaygın aktivasyon fonksiyonları aşağıda açıklanmıştır:
+
+
+
+
+**26. [Sigmoid, Tanh, RELU]**
+
+⟶ [Sigmoid, Tanh, RELU]
+
+
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
+
+⟶ Kaybolan / patlayan gradyan ― Kaybolan ve patlayan gradyan fenomenlerine RNN'ler bağlamında sıklıkla rastlanır. Bunların olmasının nedeni, katman sayısına göre katlanarak azalan / artan olabilen çarpımsal gradyan nedeniyle uzun vadeli bağımlılıkları yakalamanın zor olmasıdır.
+
+
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+⟶ Gradyan kırpma ― Geri yayılım işlemi sırasında bazen karşılaşılan patlayan gradyan sorunuyla başa çıkmak için kullanılan bir tekniktir. Gradyan için maksimum değeri sınırlayarak, bu durum pratikte kontrol edilir.
+
+
+
+
+**29. clipped**
+
+⟶ kırpılmış
+
+
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+⟶ Kapı türleri ― Kaybolan gradyan problemini çözmek için bazı RNN türlerinde belirli kapılar kullanılır ve bunlar genellikle iyi tanımlanmış bir amaca sahiptir. Genellikle Γ ile gösterilirler ve şuna eşittirler:
+
+
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+⟶ burada W, U, b kapıya özgü katsayılardır ve σ ise sigmoid fonksiyondur. Temel olanlar aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**32. [Type of gate, Role, Used in]**
+
+⟶ [Kapının tipi, Rol, Kullanılan]
+
+
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+⟶ [Güncelleme kapısı, Uygunluk kapısı, Unutma kapısı, Çıkış kapısı]
+
+
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+⟶ [Şimdi ne kadar geçmiş olması gerekir?, Önceki bilgiyi bırak?, Bir hücreyi sil ya da silme?, Bir hücreyi ortaya çıkarmak için ne kadar?]
+
+
+
+
+**35. [LSTM, GRU]**
+
+⟶ [LSTM, GRU]
+
+
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+⟶ GRU/LSTM ― Geçitli Tekrarlayan Birim (Gated Recurrent Unit-GRU) ve Uzun Kısa Süreli Bellek Birimleri (Long Short-Term Memory-LSTM), geleneksel RNN'lerin karşılaştığı kaybolan gradyan problemini ele alır, LSTM ise GRU'nun genelleştirilmiş halidir. Her bir mimarinin karakterizasyon denklemlerini özetleyen tablo aşağıdadır:
+
+
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+⟶ [Karakterizasyon, Geçitli Tekrarlayan Birim (GRU), Uzun Kısa Süreli Bellek (LSTM), Bağımlılıklar]
+
+
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+⟶ Not: ⋆ işareti iki vektör arasındaki birimsel çarpımı belirtir.
+
+
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+⟶ RNN varyantları ― Aşağıdaki tablo, diğer yaygın kullanılan RNN mimarilerini özetlemektedir:
+
+
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+⟶ [Çift Yönlü (Bidirectional-BRNN), Derin (Deep-DRNN)]
+
+
+
+
+**41. Learning word representation**
+
+⟶ Kelime temsilini öğrenme
+
+
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+⟶ Bu bölümde V kelime dağarcığını, |V| ise bu dağarcığın boyutunu ifade eder.
+
+
+
+
+**43. Motivation and notations**
+
+⟶ Motivasyon ve notasyon
+
+
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+⟶ Temsil etme teknikleri ― Kelimeleri temsil etmenin iki temel yolu aşağıdaki tabloda özetlenmiştir:
+
+
+
+
+**45. [1-hot representation, Word embedding]**
+
+⟶ [1-hot gösterim, Kelime gömme]
+
+
+
+
+**46. [teddy bear, book, soft]**
+
+⟶ [oyuncak ayı, kitap, yumuşak]
+
+
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]**
+
+⟶ [ow not edildi, Naive yaklaşım, benzerlik bilgisi yok, ew not edildi, kelime benzerliği dikkate alınır]
+
+
+
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+⟶ Gömme matrisi ― Belirli bir w kelimesi için E gömme matrisi, kelimenin 1-hot temsili ow'yu ew gömmesine aşağıdaki gibi eşleyen bir matristir:
+
+
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+⟶ Not: Gömme matrisinin öğrenilmesi hedef / içerik olabilirlik modelleri kullanılarak yapılabilir.
+
+
+
+
+**50. Word embeddings**
+
+⟶ Kelime gömmeleri
+
+
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+⟶ Word2vec ― Word2vec, belirli bir kelimenin diğer kelimelerle çevrili olma olasılığını tahmin ederek kelime gömmelerini öğrenmeyi amaçlayan bir çerçevedir. Popüler modeller arasında skip-gram, negatif örnekleme ve CBOW bulunur.
+
+
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+⟶ [Sevimli ayıcık okuyor, ayıcık, yumuşak, Farsça şiir, sanat]
+
+
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+⟶ [Proxy görevinde ağı eğitme, üst düzey gösterimi çıkartme, Kelime gömme hesaplama]
+
+
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+⟶ Skip-gram ― Skip-gram word2vec modeli, verilen herhangi bir t hedef kelimesinin c gibi bir bağlam kelimesi ile gerçekleşme olasılığını değerlendirerek kelime gömmelerini öğrenen denetimli bir öğrenme görevidir. θt, t ile ilişkili bir parametre olmak üzere P(t|c) olasılığı şöyle ifade edilir:
+
+
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+⟶ Not: Softmax bölümünün paydasındaki tüm kelime dağarcığını toplamak, bu modeli hesaplama açısından maliyetli kılar. CBOW, verilen bir kelimeyi tahmin etmek için çevreleyen kelimeleri kullanan başka bir word2vec modelidir.
+
+
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+⟶ Negatif örnekleme - Belirli bir bağlamın ve belirli bir hedef kelimenin eşzamanlı olarak ortaya çıkmasının muhtemel olup olmadığının değerlendirilmesini, modellerin k negatif örnek kümeleri ve 1 pozitif örnek kümesinde eğitilmesini hedefleyen, lojistik regresyon kullanan bir ikili sınıflandırma kümesidir. Bağlam sözcüğü c ve hedef sözcüğü t göz önüne alındığında, tahmin şöyle ifade edilir:
+
+
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+⟶ Not: Bu yöntem, hesaplama açısından skip-gram modelinden daha az maliyetlidir.
+
+
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+⟶ GloVe ― Kelime gösterimi için global vektörler (global vectors for word representation) ifadesinin kısaltması olan GloVe modeli, her bir Xi,j'nin bir i hedefinin bir j bağlamıyla birlikte kaç kez görüldüğünü belirttiği bir eş-oluşum matrisi X kullanan bir kelime gömme tekniğidir. Maliyet fonksiyonu J aşağıdaki gibidir:
+
+
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
+Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+⟶ f, Xi,j=0⟹f(Xi,j)=0 olacak şekilde bir ağırlıklandırma fonksiyonudur.
+Bu modelde e ve θ'nin oynadığı simetrik rol göz önüne alındığında, nihai kelime gömmesi e(final)w şöyle ifade edilir:
+
+
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+⟶ Not: Öğrenilen kelime gömmelerinin tek tek bileşenleri her zaman yorumlanabilir olmak zorunda değildir.
+
+
+
+
+**60. Comparing words**
+
+⟶ Kelimelerin karşılaştırılması
+
+
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+⟶ Kosinüs benzerliği ― w1 ve w2 kelimeleri arasındaki kosinüs benzerliği şu şekilde ifade edilir:
+
+
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+⟶ Not: θ, w1 ve w2 kelimeleri arasındaki açıdır.
+
+
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+⟶ t-SNE ― t-SNE (t-dağılımlı Stokastik Komşu Gömme), yüksek boyutlu gömmeleri daha düşük boyutlu bir uzaya indirgemeyi amaçlayan bir tekniktir. Uygulamada, kelime vektörlerini 2B uzayda görselleştirmek için yaygın olarak kullanılır.
+
+
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+⟶ [edebiyat, sanat, kitap, kültür, şiir, okuma, bilgi, eğlendirici, sevimli, çocukluk, kibar, ayıcık, yumuşak, sarılmak, sevimli, sevimli]
+
+
+
+
+**65. Language model**
+
+⟶ Dil modeli
+
+
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+⟶ Genel bakış ― Bir dil modeli, bir cümlenin P(y) olasılığını tahmin etmeyi amaçlar.
+
+
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.**
+
+⟶ n-gram modeli ― Bu model, eğitim verilerindeki görünüm sayısını sayarak bir ifadenin bir korpusta ortaya çıkma olasılığını ölçmeyi amaçlayan naif bir yaklaşımdır.
+
+
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+⟶ Karışıklık ― Dil modelleri yaygın olarak, PP olarak da bilinen karışıklık metriği kullanılarak değerlendirilir; bu metrik, veri setinin kelime sayısı T ile normalize edilmiş ters olasılığı olarak yorumlanabilir. Karışıklık ne kadar düşükse o kadar iyidir ve şöyle tanımlanır:
+
+
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+⟶ Not: PP, t-SNE'de yaygın olarak kullanılır.
+
+
+
+
+**70. Machine translation**
+
+⟶ Makine çevirisi
+
+
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+⟶ Genel bakış ― Bir makine çeviri modeli, daha önce yerleştirilmiş bir kodlayıcı ağına sahip olması dışında, bir dil modeline benzer. Bu nedenle, bazen koşullu dil modeli olarak da adlandırılır. Amaç şu şekilde bir cümle bulmaktır:
+
+
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+⟶ Işın arama ― Makine çevirisinde ve konuşma tanımada kullanılan ve x girişi verilen en olası cümleyi bulmak için kullanılan sezgisel bir arama algoritmasıdır.
+
+
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+⟶ [Adım 1: En olası B kelimeyi bulun y<1>, Adım 2: Koşullu olasılıkları hesaplayın y<k>|x,y<1>,...,y<k−1>, Adım 3: En olası B kombinasyonu koruyun x,y<1>,...,y<k>, İşlemi bir durak kelimesinde sonlandırın]
+
+
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+⟶ Not: Eğer ışın genişliği 1 olarak ayarlanmışsa, bu naif (naive) bir açgözlü (greedy) aramaya eşdeğerdir.
+
+
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+⟶ Işın genişliği ― Işın genişliği B, ışın araması için bir parametredir. Daha yüksek B değerleri daha iyi sonuç elde edilmesini sağlar fakat daha düşük performans ve daha yüksek hafıza ile. Küçük B değerleri daha kötü sonuçlara neden olur, ancak hesaplama açısından daha az yoğundur. B için standart bir değer 10 civarındadır.
+
+
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+⟶ Uzunluk normalizasyonu ― Sayısal stabiliteyi arttırmak için, ışın arama genellikle, aşağıdaki gibi tanımlanan normalize edilmiş log-olabilirlik amacı olarak adlandırılan normalize edilmiş hedefe uygulanır:
+
+
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+⟶ Not: α parametresi yumuşatıcı olarak görülebilir ve değeri genellikle 0,5 ile 1 arasındadır.
+
+
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+⟶ Hata analizi ― Kötü bir çeviri elde edildiğinde, aşağıdaki hata analizini yaparak neden iyi bir çeviri almadığımızı araştırabiliriz:
+
+
+
+
+**79. [Case, Root cause, Remedies]**
+
+⟶ [Durum, Ana neden, Çözümler]
+
+
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+⟶ [Işın arama hatası, RNN hatası, Işın genişliğini artırma, Farklı mimariyi deneme, Düzenlileştirme, Daha fazla veri toplama]
+
+
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+⟶ Bleu puanı ― İki dilli değerlendirme alt ölçeği (bleu) puanı, makine çevirisinin ne kadar iyi olduğunu, n-gram hassasiyetine dayalı bir benzerlik puanı hesaplayarak belirler. Aşağıdaki gibi tanımlanır:
+
+
+
+
+**82. where pn is the bleu score on n-gram only defined as follows:**
+
+⟶ burada pn, yalnızca n-gram üzerindeki bleu skorudur ve aşağıdaki gibi tanımlanır:
+
+
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+⟶ Not: Yapay olarak şişirilmiş bir bleu skorunu önlemek için kısa tahmin edilen çevirilere bir kısalık cezası uygulanabilir.
+
+
+
+
+**84. Attention**
+
+⟶ Dikkat
+
+
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**
+
+⟶ Dikkat modeli ― Bu model, bir RNN'in girdinin önemli olduğu düşünülen belirli kısımlarına dikkat etmesine olanak sağlar; bu da ortaya çıkan modelin pratikteki performansını artırır. y çıktısının a aktivasyonuna göstermesi gereken dikkat miktarını α ile, t anındaki bağlamı c ile gösterirsek:
+
+
+
+
+**86. with**
+
+⟶ ile
+
+
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+⟶ Not: Dikkat skorları, görüntü altyazılama ve makine çevirisinde yaygın olarak kullanılır.
+
+
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+⟶ Sevimli bir oyuncak ayı Fars edebiyatı okuyor.
+
+
+
+
+**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:**
+
+⟶ Dikkat ağırlığı ― Y çıktısının a aktivasyonuna vermesi gereken dikkat miktarı, aşağıdaki gibi hesaplanan α ile verilir:
+
+
+
+
+**90. Remark: computation complexity is quadratic with respect to Tx.**
+
+⟶ Not: hesaplama karmaşıklığı Tx'e göre ikinci derecedendir.
+
+
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Derin Öğrenme el kitapları şimdi [hedef dilde] mevcuttur.
+
+
+
+**92. Original authors**
+
+⟶ Orijinal yazarlar
+
+
+
+**93. Translated by X, Y and Z**
+
+⟶ X, Y ve Z tarafından çevrilmiştir.
+
+
+
+**94. Reviewed by X, Y and Z**
+
+⟶ X, Y ve Z tarafından gözden geçirilmiştir.
+
+
+
+**95. View PDF version on GitHub**
+
+⟶ GitHub'da PDF versiyonunu görüntüleyin.
+
+
+
+**96. By X and Y**
+
+⟶ X ve Y tarafından
+
+
diff --git a/tr/refresher-probability.md b/tr/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/tr/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-⟶
-
-
-
-**2. Introduction to Probability and Combinatorics**
-
-⟶
-
-
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-⟶
-
-
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-⟶
-
-
-
-**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
-
-⟶
-
-
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-⟶
-
-
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-⟶
-
-
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-⟶
-
-
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-⟶
-
-
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-⟶
-
-
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-⟶
-
-
-
-**12. Conditional Probability**
-
-⟶
-
-
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-⟶
-
-
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-⟶
-
-
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-⟶
-
-
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-⟶
-
-
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-⟶
-
-
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-⟶
-
-
-
-**19. Random Variables**
-
-⟶
-
-
-
-**20. Definitions**
-
-⟶
-
-
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
-
-⟶
-
-
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-⟶
-
-
-
-**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-⟶
-
-
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-⟶
-
-
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-⟶
-
-
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-⟶
-
-
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-⟶
-
-
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-⟶
-
-
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-⟶
-
-
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-⟶
-
-
-
-**32. Probability Distributions**
-
-⟶
-
-
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-⟶
-
-
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-⟶
-
-
-
-**35. [Type, Distribution]**
-
-⟶
-
-
-
-**36. Jointly Distributed Random Variables**
-
-⟶
-
-
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-⟶
-
-
-
-**38. [Case, Marginal density, Cumulative function]**
-
-⟶
-
-
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-⟶
-
-
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-⟶
-
-
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-⟶
-
-
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-⟶
-
-
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-⟶
-
-
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-⟶
-
-
-
-**45. Parameter estimation**
-
-⟶
-
-
-
-**46. Definitions**
-
-⟶
-
-
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-⟶
-
-
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-⟶
-
-
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-⟶
-
-
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-⟶
-
-
-
-**51. Estimating the mean**
-
-⟶
-
-
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
-
-⟶
-
-
-
-**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
-
-⟶
-
-
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-⟶
-
-
-
-**55. Estimating the variance**
-
-⟶
-
-
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-⟶
-
-
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-⟶
-
-
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-⟶
-
-
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-⟶
-
-
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-⟶
-
-
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-⟶
-
-
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-⟶
-
-
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-⟶
-
-
-
-**64. [Parameter estimation, Mean, Variance]**
-
-⟶
diff --git a/uk/cs-229-probability.md b/uk/cs-229-probability.md
new file mode 100644
index 000000000..a09ab965d
--- /dev/null
+++ b/uk/cs-229-probability.md
@@ -0,0 +1,381 @@
+**1. Probabilities and Statistics refresher**
+
+⟶ Швидке повторення з теорії ймовірностей та статистики.
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶ Вступ до теорії ймовірностей та комбінаторики.
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶ Простір елементарних подій ― Множина всіх можливих результатiв експерименту називається простором елементарних подій і позначається літерою S.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶ Випадкова подія - будь-яка підмножина E, що належить до певного простору елементарних подій, називається подією. Таким чином, подія це множина, що містить можливі результати експерименту. Якщо результати експерименту містяться в Е, тоді ми говоримо що Е відбулася.
+
+
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶ Аксіоми теорії ймовірностей. Для кожної події Е, P(E) є ймовірністю події Е.
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
+
+⟶ Аксіома 1 - Кожна ймовірність лежить між 0 та 1 включно, тобто:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
+
+⟶ Аксіома 2 - Ймовірність що як мінімум одна подія з простору елементарних подій відбудеться дорівнює 1.
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶ Аксіома 3 - Для будь-якої послідовності взаємновиключних подій E1,...,En, ми маємо:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶ Підстановка - підстановка це спосіб вибору r об'єктів з набору n об'єктів в певному порядку. Кількість таких способів вибору задається через P(n,r):
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶ Комбiнацiя - комбiнацiя це спосіб вибору r об'єктів з набору n об'єктів, де порядок не має значення. Кількість таких способів вибору задається через C(n,r):
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶ Примітка: ми зауважуємо що для 0⩽r⩽n, ми маємо P(n,r)⩾C(n,r)
+
+
+
+**12. Conditional Probability**
+
+⟶ Умовна ймовірність
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶ Теорема Баєса - Для подій А і В таких що P(B)>0, маємо:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶ Примітка: P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶ Поділ множини - Нехай {Ai,i∈[[1,n]]} буде таким для всіх i, Ai≠∅. Ми називаємо {Ai} поділом множини якщо:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+⟶ Примітка: для будь-якої події В в просторі елементарних подій, маємо P(B)=n∑i=1P(B|Ai)P(Ai).
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶ Розгорнута форма теореми Баєса - Нехай {Ai,i∈[[1,n]]} буде поділом множини простору елементарних подій. Маємо:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶ Незалежність - Дві події А і В є незалежними якщо і тільки якщо ми маємо:
+
+
+
+**19. Random Variables**
+
+⟶ Випадкові змінні
+
+
+
+**20. Definitions**
+
+⟶ Означення
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
+
+⟶ Випадкова змінна - Випадкова змінна, яку часто позначають X, є функцією, що відображає кожен елемент простору елементарних подій на дійсну пряму.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶ Функція розподілу ймовірностей (CDF) - Функція розподілу ймовірностей F, що є монотонно неспадною і такою, що limx→−∞F(x)=0 та limx→+∞F(x)=1, задається як:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶ Функція густини імовірності (PDF) - Функція густини імовірності f є імовірністю того, що X набуває значень між двома сусідніми реалізаціями випадкової змінної.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶ Співвідношення між PDF та CDF - Ось важливі властивості, які варто знати, у дискретному (D) та неперервному (C) випадках.
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶ [Випадок, CDF F, PDF f, Властивості PDF]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶ Математичне сподівання і моменти розподілу - Ось вирази математичного сподівання E[X], узагальненого математичного сподівання E[g(X)], k-го моменту E[Xk] та характеристичної функції ψ(ω) для дискретного та неперервного випадків:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶ Дисперсія - Дисперсія випадкової змінної, яку часто позначають Var(X) або σ2, є мірою розкиду її функції розподілу. Вона визначається так:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶ Стандартне відхилення - Стандартне відхилення випадкової змінної, яке часто позначають σ, є мірою розкиду її функції розподілу, сумісною з одиницями самої випадкової змінної. Воно визначається так:
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶ Перетворення випадкових змінних - Нехай змінні X та Y пов'язані певною функцією. Позначивши fX та fY функції розподілу X та Y відповідно, маємо:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶ Інтегральне правило Лейбніца - Нехай g буде функцією x і, можливо, c, а a,b будуть межами, що можуть залежати від c. Маємо:
+
+
+
+**32. Probability Distributions**
+
+⟶ Розподіли ймовірностей
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶ Нерівність Чебишова ― Нехай X буде випадковою змінною з математичним сподіванням μ. Для k,σ>0 маємо таку нерівність:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶ Головні розподіли - Ось кілька найважливіших розподілів які варто знати:
+
+
+
+**35. [Type, Distribution]**
+
+⟶ [Тип, Розподіл]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶ Спільно розподілені випадкові величини
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
+
+⟶ Маргінальна густина та функція розподілу - Виходячи зі спільної функції густини ймовірності fXY, маємо:
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶ [Випадок, Маргінальна густина, Функція розподілу]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶ Умовна густина ― Умовна густина X відносно Y, яку часто позначають fX|Y, визначається так:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶ Незалежність - Дві випадкові змінні X та Y називаються незалежними, якщо:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶ Коваріація ― Коваріація двох випадкових змінних X та Y, яку позначають σ2XY або частіше Cov(X,Y), визначається так:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶ Кореляція ― Позначивши σX,σY стандартні відхилення X та Y, ми визначаємо кореляцію між випадковими змінними X та Y, позначену ρXY, у такий спосіб:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶ Примітка 1: ми зазначаємо що для будь-яких випадкових змінних X, Y, маємо ρXY∈[−1,1].
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶ Примітка 2: Якщо X та Y є незалежними, тоді ρXY=0.
+
+
+
+**45. Parameter estimation**
+
+⟶ Оцінювання параметрів
+
+
+
+**46. Definitions**
+
+⟶ Означення
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶ Випадкова вибірка ― Випадкова вибірка це набір n випадкових змінних X1,...,Xn, які є незалежними та однаково розподіленими з X.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶ Статистична оцінка - Статистична оцінка це функція даних що використовується щоб визначити невідомий параметр статистичної моделі.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶ Систематична похибка ― Систематична похибка статистичної оцінки ^θ визначається як різниця між математичним сподіванням розподілу ^θ і справжнім значенням, тобто:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶ Примітка: оцінку називають незміщеною, якщо E[^θ]=θ.
+
+
+
+**51. Estimating the mean**
+
+⟶ Оцінка середнього значення
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+⟶ Середнє значення вибірки ― Середнє значення випадкової вибірки використовується для оцінки справжнього середнього μ розподілу, часто позначається ¯¯¯¯¯X і визначається так:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
+
+⟶ Примітка: середнє значення вибірки є незміщеним, тобто E[¯¯¯¯¯X]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶ Центральна гранична теорема ― Нехай маємо випадкову вибірку X1,...,Xn із заданого розподілу з середнім значенням μ та дисперсією σ2, тоді:
+
+
+
+**55. Estimating the variance**
+
+⟶ Оцінка дисперсії
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶ Дисперсія вибірки ― Дисперсія випадкової вибірки використовується для оцінки справжньої дисперсії σ2 розподілу, часто позначається s2 або ^σ2 і визначається так:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶ Примітка: дисперсія вибірки не має похибки, тобто E[s2]=σ2.
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶ Розподіл хі-квадрат та дисперсія вибірки ― Нехай s2 буде дисперсією випадкової вибірки. Маємо:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶ [Вступ, Простір елементарних подій, Подія, Перестановка]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶ [Умовна ймовірність, Теорема Баєса, Незалежність]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶ [Випадкові змінні, Означення, Очікування, Дисперсія]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶ [Розподіли ймовірності, Нерівність Чебишова, Головні розподіли]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶ [Спільно розподілені випадкові величини, Густина, Коваріація, Кореляція]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶ [Оцінювання параметрів, Середнє значення, Дисперсія]
diff --git a/vi/cs-221-logic-models.md b/vi/cs-221-logic-models.md
new file mode 100644
index 000000000..94057d8d2
--- /dev/null
+++ b/vi/cs-221-logic-models.md
@@ -0,0 +1,462 @@
+**Logic-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-logic-models)
+
+
+
+**1. Logic-based models with propositional and first-order logic**
+
+⟶ Các mô hình dựa trên logic với logic mệnh đề và logic bậc nhất
+
+
+
+
+**2. Basics**
+
+⟶ Cơ bản
+
+
+
+
+**3. Syntax of propositional logic ― By noting f,g formulas, and ¬,∧,∨,→,↔ connectives, we can write the following logical expressions:**
+
+⟶ Cú pháp của logic mệnh đề ― Kí hiệu f,g là các công thức, và ¬,∧,∨,→,↔ là các phép nối logic, chúng ta có thể viết các biểu thức logic sau:
+
+
+
+
+**4. [Name, Symbol, Meaning, Illustration]**
+
+⟶ [Tên, Kí hiệu, Ý nghĩa, Miêu tả]
+
+
+
+
+**5. [Affirmation, Negation, Conjunction, Disjunction, Implication, Biconditional]**
+
+⟶ [Khẳng định, Phủ định, Phép hội, Phép tuyển, Phép kéo theo, Phép tương đương]
+
+
+
+
+**6. [not f, f and g, f or g, if f then g, f, that is to say g]**
+
+⟶ [không f, f và g, f hoặc g, nếu f thì g, f, nghĩa là g]
+
+
+
+
+**7. Remark: formulas can be built up recursively out of these connectives.**
+
+⟶ Ghi chú: công thức có thể được xây dựng đệ quy từ các phép nối này.
+
+
+
+
+**8. Model ― A model w denotes an assignment of binary weights to propositional symbols.**
+
+⟶ Mô hình - Một mô hình w biểu thị việc gán trọng số nhị phân cho các ký hiệu mệnh đề.
+
+
+
+
+**9. Example: the set of truth values w={A:0,B:1,C:0} is one possible model to the propositional symbols A, B and C.**
+
+⟶ Ví dụ: tập hợp các giá trị chân lý w={A:0,B:1,C:0} là một mô hình có thể có cho các ký hiệu mệnh đề A, B và C.
+
+
+
+
+**10. Interpretation function ― The interpretation function I(f,w) outputs whether model w satisfies formula f:**
+
+⟶ Hàm diễn giải - Hàm diễn giải I(f,w) cho biết mô hình w có thỏa mãn công thức f hay không:
+
+
+
+
+**11. Set of models ― M(f) denotes the set of models w that satisfy formula f. Mathematically speaking, we define it as follows:**
+
+⟶ Tập hợp các mô hình - M(f) biểu thị tập hợp các mô hình w thỏa mãn công thức f. Về mặt toán học, chúng ta định nghĩa nó như sau:
+
+
+
+
+**12. Knowledge base**
+
+⟶ Cơ sở tri thức
+
+
+
+
+**13. Definition ― The knowledge base KB is the conjunction of all formulas that have been considered so far. The set of models of the knowledge base is the intersection of the set of models that satisfy each formula. In other words:**
+
+⟶ Định nghĩa - Cơ sở tri thức KB là phép hội của tất cả các công thức đã được xem xét cho đến nay. Tập hợp các mô hình của cơ sở tri thức là giao của các tập hợp mô hình thỏa mãn từng công thức. Nói cách khác:
+
+
+
+
+**14. Probabilistic interpretation ― The probability that query f is evaluated to 1 can be seen as the proportion of models w of the knowledge base KB that satisfy f, i.e.:**
+
+⟶ Diễn giải xác suất - Xác suất để truy vấn f được đánh giá là 1 có thể được xem là tỷ lệ các mô hình w của cơ sở tri thức KB thỏa mãn f, tức là:
+
+
+
+
+**15. Satisfiability ― The knowledge base KB is said to be satisfiable if at least one model w satisfies all its constraints. In other words:**
+
+⟶ Tính thỏa được - Cơ sở tri thức KB được gọi là thỏa được nếu có ít nhất một mô hình w thỏa mãn tất cả các ràng buộc của nó. Nói cách khác:
+
+
+
+
+**16. satisfiable**
+
+⟶ thỏa được
+
+
+
+
+**17. Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge base.**
+
+⟶ Ghi chú: M(KB) biểu thị tập hợp các mô hình tương thích với tất cả các ràng buộc của cơ sở tri thức.
+
+
+
+
+**18. Relation between formulas and knowledge base - We define the following properties between the knowledge base KB and a new formula f:**
+
+⟶ Mối liên hệ giữa công thức và cơ sở tri thức - Chúng ta định nghĩa các thuộc tính sau giữa cơ sở tri thức KB và một công thức mới f:
+
+
+
+
+**19. [Name, Mathematical formulation, Illustration, Notes]**
+
+⟶ [Tên, Công thức toán học, Minh họa, Ghi chú]
+
+
+
+
+**20. [KB entails f, KB contradicts f, f contingent to KB]**
+
+⟶ [KB kéo theo f, KB mâu thuẫn với f, f tùy thuộc vào KB]
+
+
+
+
+**21. [f does not bring any new information, Also written KB⊨f, No model satisfies the constraints after adding f, Equivalent to KB⊨¬f, f does not contradict KB, f adds a non-trivial amount of information to KB]**
+
+⟶ [f không mang lại bất kỳ thông tin mới nào, Còn được viết là KB⊨f, Không có mô hình nào thỏa mãn các ràng buộc sau khi thêm f, Tương đương với KB⊨¬f, f không mâu thuẫn với KB, f thêm một lượng thông tin không tầm thường vào KB]
+
+
+
+
+**22. Model checking ― A model checking algorithm takes as input a knowledge base KB and outputs whether it is satisfiable or not.**
+
+⟶ Kiểm tra mô hình - Thuật toán kiểm tra mô hình nhận đầu vào là cơ sở tri thức KB và cho biết nó có thỏa được hay không.
+
+
+
+
+**23. Remark: popular model checking algorithms include DPLL and WalkSat.**
+
+⟶ Ghi chú: các thuật toán kiểm tra mô hình phổ biến bao gồm DPLL và WalkSat.
+
+
+
+
+**24. Inference rule ― An inference rule of premises f1,...,fk and conclusion g is written:**
+
+⟶ Quy tắc suy luận - Một quy tắc suy luận với các tiền đề f1,...,fk và kết luận g được viết:
+
+
+
+
+**25. Forward inference algorithm ― From a set of inference rules Rules, this algorithm goes through all possible f1,...,fk and adds g to the knowledge base KB if a matching rule exists. This process is repeated until no more additions can be made to KB.**
+
+⟶ Thuật toán suy luận tiến - Từ một tập hợp các quy tắc suy luận Rules, thuật toán này duyệt qua tất cả các f1,...,fk có thể và thêm g vào cơ sở tri thức KB nếu tồn tại quy tắc phù hợp. Quá trình này được lặp lại cho đến khi không thể bổ sung thêm gì vào KB.
+
+
+
+
+**26. Derivation ― We say that KB derives f (written KB⊢f) with rules Rules if f already is in KB or gets added during the forward inference algorithm using the set of rules Rules.**
+
+⟶ Dẫn xuất - Chúng ta nói rằng KB dẫn xuất f (viết là KB⊢f) với các quy tắc Rules nếu f đã có trong KB hoặc được thêm vào trong quá trình chạy thuật toán suy luận tiến sử dụng tập quy tắc Rules.
+
+
+
+
+**27. Properties of inference rules ― A set of inference rules Rules can have the following properties:**
+
+⟶ Thuộc tính của quy tắc suy luận - Một tập hợp các quy tắc suy luận Rules có thể có các thuộc tính sau:
+
+
+
+
+**28. [Name, Mathematical formulation, Notes]**
+
+⟶ [Tên, Công thức toán học, Ghi chú]
+
+
+
+
+**29. [Soundness, Completeness]**
+
+⟶ [Tính đúng đắn, Tính đầy đủ]
+
+
+
+
+**30. [Inferred formulas are entailed by KB, Can be checked one rule at a time, "Nothing but the truth", Formulas entailing KB are either already in the knowledge base or inferred from it, "The whole truth"]**
+
+⟶ [Các công thức suy ra được đều được KB kéo theo, Có thể kiểm tra từng quy tắc một, "Không có gì ngoài sự thật", Các công thức được KB kéo theo đều đã có trong cơ sở tri thức hoặc được suy ra từ đó, "Toàn bộ sự thật"]
+
+
+
+
+**31. Propositional logic**
+
+⟶ Logic mệnh đề
+
+
+
+
+**32. In this section, we will go through logic-based models that use logical formulas and inference rules. The idea here is to balance expressivity and computational efficiency.**
+
+⟶ Trong phần này, chúng ta sẽ đi qua các mô hình dựa trên logic sử dụng các công thức logic và quy tắc suy luận. Ý tưởng ở đây là cân bằng giữa khả năng biểu đạt và hiệu quả tính toán.
+
+
+
+
+**33. Horn clause ― By noting p1,...,pk and q propositional symbols, a Horn clause has the form:**
+
+⟶ Mệnh đề Horn - Kí hiệu p1,...,pk và q là các ký hiệu mệnh đề, mệnh đề Horn có dạng:
+
+
+
+
+**34. Remark: when q=false, it is called a "goal clause", otherwise we denote it as a "definite clause".**
+
+⟶ Ghi chú: khi q = false, nó được gọi là "mệnh đề mục tiêu", nếu không, chúng ta biểu thị nó là "mệnh đề xác định".
+
+
+
+
+**35. Modus ponens ― For propositional symbols f1,...,fk and p, the modus ponens rule is written:**
+
+⟶ Modus ponens - Đối với các ký hiệu mệnh đề f1,...,fk và p, quy tắc modus ponens được viết:
+
+
+
+
+**36. Remark: it takes linear time to apply this rule, as each application generates a clause that contains a single propositional symbol.**
+
+⟶ Lưu ý: phải mất thời gian tuyến tính để áp dụng quy tắc này, vì mỗi lần áp dụng tạo ra một mệnh đề chứa một ký hiệu mệnh đề duy nhất.
+
+
+
+
+**37. Completeness ― Modus ponens is complete with respect to Horn clauses if we suppose that KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus ponens will then derive p.**
+
+⟶ Tính đầy đủ - Modus ponens là đầy đủ đối với các mệnh đề Horn nếu chúng ta giả sử rằng KB chỉ chứa các mệnh đề Horn và p là một ký hiệu mệnh đề được kéo theo. Khi đó áp dụng modus ponens sẽ dẫn xuất được p.
+
+
+
+
+**38. Conjunctive normal form ― A conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is a disjunction of atomic formulas.**
+
+⟶ Dạng chuẩn hội - Một công thức dạng chuẩn hội (CNF) là phép hội của các mệnh đề, trong đó mỗi mệnh đề là phép tuyển của các công thức nguyên tử.
+
+
+
+
+**39. Remark: in other words, CNFs are ∧ of ∨.**
+
+⟶ Ghi chú: nói cách khác, CNF là ∧ của ∨.
+
+
+
+
+**40. Equivalent representation ― Every formula in propositional logic can be written into an equivalent CNF formula. The table below presents general conversion properties:**
+
+⟶ Biểu diễn tương đương - Mọi công thức trong logic mệnh đề có thể được viết thành một công thức CNF tương đương. Bảng dưới đây trình bày các thuộc tính chuyển đổi chung:
+
+
+
+
+**41. [Rule name, Initial, Converted, Eliminate, Distribute, over]**
+
+⟶ [Tên quy tắc, Ban đầu, Sau chuyển đổi, Loại bỏ, Phân phối, đối với]
+
+
+
+
+**42. Resolution rule ― For propositional symbols f1,...,fn, and g1,...,gm as well as p, the resolution rule is written:**
+
+⟶ Quy tắc phân giải - Đối với các ký hiệu mệnh đề f1,...,fn và g1,...,gm cũng như p, quy tắc phân giải được viết:
+
+
+
+
+**43. Remark: it can take exponential time to apply this rule, as each application generates a clause that has a subset of the propositional symbols.**
+
+⟶ Lưu ý: có thể mất thời gian theo cấp số mũ để áp dụng quy tắc này, vì mỗi lần áp dụng tạo ra một mệnh đề chứa một tập hợp con của các ký hiệu mệnh đề.
+
+
+
+
+**44. [Resolution-based inference ― The resolution-based inference algorithm follows the following steps:, Step 1: Convert all formulas into CNF, Step 2: Repeatedly apply resolution rule, Step 3: Return unsatisfiable if and only if False, is derived]**
+
+⟶ [Suy luận dựa trên phép phân giải - Thuật toán suy luận dựa trên phép phân giải tuân theo các bước sau:, Bước 1: Chuyển đổi tất cả các công thức thành CNF, Bước 2: Áp dụng lặp lại quy tắc phân giải, Bước 3: Trả về không thỏa được khi và chỉ khi False, được dẫn xuất]
+
+
+
+
+**45. First-order logic**
+
+⟶ Logic bậc nhất
+
+
+
+
+**46. The idea here is to use variables to yield more compact knowledge representations.**
+
+⟶ Ý tưởng ở đây là sử dụng các biến để mang lại các biểu diễn tri thức nhỏ gọn hơn.
+
+
+
+
+**47. [Model ― A model w in first-order logic maps:, constant symbols to objects, predicate symbols to tuple of objects]**
+
+⟶ [Mô hình - Một mô hình w trong logic bậc nhất ánh xạ:, các ký hiệu hằng tới các đối tượng, các ký hiệu vị từ tới bộ các đối tượng]
+
+
+
+
+**48. Horn clause ― By noting x1,...,xn variables and a1,...,ak,b atomic formulas, the first-order logic version of a horn clause has the form:**
+
+⟶ Mệnh đề Horn - Kí hiệu x1,...,xn là các biến và a1,...,ak,b là các công thức nguyên tử, phiên bản logic bậc nhất của mệnh đề Horn có dạng:
+
+
+
+
+**49. Substitution ― A substitution θ maps variables to terms and Subst[θ,f] denotes the result of substitution θ on f.**
+
+⟶ Phép thế - Một phép thế θ ánh xạ các biến thành các hạng thức và Subst[θ,f] biểu thị kết quả của phép thế θ trên f.
+
+
+
+
+**50. Unification ― Unification takes two formulas f and g and returns the most general substitution θ that makes them equal:**
+
+⟶ Phép hợp nhất - Phép hợp nhất nhận hai công thức f và g và trả về phép thế tổng quát nhất θ làm cho chúng bằng nhau:
+
+
+
+
+**51. such that**
+
+⟶ sao cho
+
+
+
+
+**52. Note: Unify[f,g] returns Fail if no such θ exists.**
+
+⟶ Lưu ý: Unify[f,g] trả về Fail nếu không tồn tại θ như vậy.
+
+
+
+
+**53. Modus ponens ― By noting x1,...,xn variables, a1,...,ak and a′1,...,a′k atomic formulas and by calling θ=Unify(a′1∧...∧a′k,a1∧...∧ak) the first-order logic version of modus ponens can be written:**
+
+⟶ Modus ponens - Kí hiệu x1,...,xn là các biến, a1,...,ak và a′1,...,a′k là các công thức nguyên tử, và đặt θ=Unify(a′1∧...∧a′k,a1∧...∧ak), phiên bản logic bậc nhất của modus ponens có thể được viết:
+
+
+
+
+**54. Completeness ― Modus ponens is complete for first-order logic with only Horn clauses.**
+
+⟶ Tính đầy đủ - Modus ponens là đầy đủ đối với logic bậc nhất chỉ gồm các mệnh đề Horn.
+
+
+
+
+**55. Resolution rule ― By noting f1,...,fn, g1,...,gm, p, q formulas and by calling θ=Unify(p,q), the first-order logic version of the resolution rule can be written:**
+
+⟶ Quy tắc phân giải - Kí hiệu f1,...,fn, g1,...,gm, p, q là các công thức và đặt θ=Unify(p,q), phiên bản logic bậc nhất của quy tắc phân giải có thể được viết:
+
+
+
+
+**56. [Semi-decidability ― First-order logic, even restricted to only Horn clauses, is semi-decidable., if KB⊨f, forward inference on complete inference rules will prove f in finite time, if KB⊭f, no algorithm can show this in finite time]**
+
+⟶ [Tính bán quyết định được - Logic bậc nhất, ngay cả khi chỉ giới hạn ở các mệnh đề Horn, là bán quyết định được., nếu KB⊨f, suy luận tiến trên các quy tắc suy luận đầy đủ sẽ chứng minh f trong thời gian hữu hạn, nếu KB⊭f, không thuật toán nào có thể chỉ ra điều này trong thời gian hữu hạn]
+
+
+
+
+**57. [Basics, Notations, Model, Interpretation function, Set of models]**
+
+⟶ [Khái niệm cơ bản, Ký hiệu, Mô hình, Hàm diễn giải, Tập hợp các mô hình]
+
+
+
+
+**58. [Knowledge base, Definition, Probabilistic interpretation, Satisfiability, Relationship with formulas, Forward inference, Rule properties]**
+
+⟶ [Cơ sở tri thức, Định nghĩa, Diễn giải xác suất, Tính thỏa được, Mối quan hệ với các công thức, Suy luận tiến, Thuộc tính quy tắc]
+
+
+
+
+**59. [Propositional logic, Clauses, Modus ponens, Conjunctive normal form, Representation equivalence, Resolution]**
+
+⟶ [Logic mệnh đề, Mệnh đề, Modus ponens, Dạng chuẩn hội, Biểu diễn tương đương, Phép phân giải]
+
+
+
+
+**60. [First-order logic, Substitution, Unification, Resolution rule, Modus ponens, Resolution, Semi-decidability]**
+
+⟶ [Logic bậc nhất, Phép thế, Phép hợp nhất, Quy tắc phân giải, Modus ponens, Phép phân giải, Tính bán quyết định được]
+
+
+
+
+**61. View PDF version on GitHub**
+
+⟶ Xem bản PDF trên GitHub
+
+
+
+
+**62. Original authors**
+
+⟶ Các tác giả
+
+
+
+
+**63. Translated by X, Y and Z**
+
+⟶ Dịch bởi X, Y và Z
+
+
+
+
+**64. Reviewed by X, Y and Z**
+
+⟶ Đánh giá bởi X, Y và Z
+
+
+
+
+**65. By X and Y**
+
+⟶ Bởi X và Y
+
+
+
+
+**66. The Artificial Intelligence cheatsheets are now available in [target language].**
+
+⟶ Các cheatsheet Trí tuệ nhân tạo hiện đã có bản [Tiếng Việt].
diff --git a/vi/cs-229-deep-learning.md b/vi/cs-229-deep-learning.md
new file mode 100644
index 000000000..e03a3f0ca
--- /dev/null
+++ b/vi/cs-229-deep-learning.md
@@ -0,0 +1,321 @@
+**1. Deep Learning cheatsheet**
+
+⟶ Deep Learning cheatsheet
+
+
+
+**2. Neural Networks**
+
+⟶ Mạng Neural
+
+
+
+**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
+
+⟶ Mạng Neural là 1 lớp của các mô hình (models) được xây dựng với các tầng (layers). Các loại mạng Neural thường được sử dụng bao gồm: Mạng Neural tích chập (Convolutional Neural Networks) và Mạng Neural hồi quy (Recurrent Neural Networks).
+
+
+
+**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
+
+⟶ Kiến trúc - Các thuật ngữ xoay quanh kiến trúc của mạng neural được mô tả như hình phía dưới:
+
+
+
+**5. [Input layer, hidden layer, output layer]**
+
+⟶ [Tầng đầu vào, tầng ẩn, tầng đầu ra]
+
+
+
+**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+⟶ Bằng việc kí hiệu i là tầng thứ i của mạng, j là hidden unit (đơn vị ẩn) thứ j của tầng, ta có:
+
+
+
+**7. where we note w, b, z the weight, bias and output respectively.**
+
+⟶ Chúng ta kí hiệu w, b, z tương ứng với trọng số (weights), bias và đầu ra.
+
+
+
+**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
+
+⟶ Hàm kích hoạt (Activation function) - Hàm kích hoạt được sử dụng ở phần cuối của đơn vị ẩn để đưa ra độ phức tạp phi tuyến tính (non-linear) cho mô hình (model). Đây là những trường hợp phổ biến nhất:
+
+
+
+**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
+
+⟶ [Sigmoid, Tanh, ReLU, Leaky ReLU]
+
+
+
+**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶ Lỗi (loss) Cross-entropy - Trong bối cảnh của mạng neural, hàm lỗi cross-entropy L(z, y) thường được sử dụng và định nghĩa như sau:
+
+
+
+**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶ Tốc độ học (Learning rate) - Tốc độ học, thường được kí hiệu bởi α hoặc đôi khi là η, chỉ ra tốc độ mà trọng số được cập nhật. Thông số này có thể là cố định hoặc được thay đổi một cách thích ứng. Phương thức (method) phổ biến nhất hiện tại là Adam, đó là phương thức điều chỉnh tốc độ học một cách thích ứng.
+
+
+
+**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
+
+⟶ Backpropagation (Lan truyền ngược) - Backpropagation là phương thức dùng để cập nhật trọng số trong mạng neural bằng cách xét đến đầu ra thực tế và đầu ra mong muốn. Đạo hàm theo trọng số w được tính bằng cách sử dụng quy tắc chuỗi (chain rule) và có dạng như sau:
+
+
+
+**13. As a result, the weight is updated as follows:**
+
+⟶ Do đó, trọng số được cập nhật như sau:
+
+
+
+**14. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶ Cập nhật trọng số - Trong mạng neural, trọng số được cập nhật như sau:
+
+
+
+**15. Step 1: Take a batch of training data.**
+
+⟶ Bước 1: Lấy một mẻ (batch) dữ liệu huấn luyện (training data).
+
+
+
+**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
+
+⟶ Bước 2: Thực thi lan truyền tiến (forward propagation) để lấy được lỗi (loss) tương ứng.
+
+
+
+**17. Step 3: Backpropagate the loss to get the gradients.**
+
+⟶ Bước 3: Lan truyền ngược lỗi để lấy được gradients (độ dốc).
+
+
+
+**18. Step 4: Use the gradients to update the weights of the network.**
+
+⟶ Bước 4: Sử dụng gradients để cập nhật trọng số của mạng (network).
+
+
+
+**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
+
+⟶ Dropout - Dropout là một kĩ thuật nhằm tránh overfitting tập dữ liệu huấn luyện bằng việc bỏ đi các đơn vị trong mạng neural. Trong thực tế, các neuron hoặc là bị bỏ đi với xác suất p hoặc được giữ lại với xác suất 1−p
+
+
+
+**20. Convolutional Neural Networks**
+
+⟶ Mạng neural tích chập (Convolutional Neural Networks)
+
+
+
+**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
+
+⟶ Yêu cầu của tầng tích chập (Convolutional layer) - Bằng việc kí hiệu W là kích cỡ của volume đầu vào, F là kích cỡ của các neuron thuộc convolutional layer, P là số lượng zero padding, khi đó số lượng neuron N vừa với volume cho trước sẽ như sau:
+
+
+
+**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
+
+⟶ Batch normalization (chuẩn hoá) - Đây là bước mà các hyperparameter γ,β chuẩn hoá batch {xi}. Bằng việc kí hiệu μB,σ2B là giá trị trung bình, phương sai mà ta muốn gán cho batch, nó được thực hiện như sau:
+
+
+
+**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶ Nó thường được tính sau fully connected/convolutional layer và trước non-linearity layer và mục tiêu là cho phép tốc độ học cao hơn cũng như giảm đi sự phụ thuộc mạnh mẽ vào việc khởi tạo.
+
+
+
+**24. Recurrent Neural Networks**
+
+⟶ Mạng neural hồi quy (Recurrent Neural Networks)
+
+
+
+**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
+
+⟶ Các loại cổng - Đây là các loại cổng (gate) khác nhau mà chúng ta sẽ gặp ở một mạng neural hồi quy điển hình:
+
+
+
+**26. [Input gate, forget gate, gate, output gate]**
+
+⟶ [Cổng đầu vào, cổng quên, cổng, cổng đầu ra]
+
+
+
+**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
+
+⟶ [Ghi vào cell hay không?, Xoá cell hay không?, Ghi bao nhiêu vào cell?, Cần tiết lộ bao nhiêu về cell?]
+
+
+
+**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
+
+⟶ LSTM - Mạng bộ nhớ ngắn hạn dài (LSTM) là 1 loại RNN model tránh vấn đề vanishing gradient (gradient biến mất) bằng cách thêm vào cổng 'quên' ('forget' gates).
+
+
+
+**29. Reinforcement Learning and Control**
+
+⟶ Reinforcement Learning (Học tăng cường) và điều khiển
+
+
+
+**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
+
+⟶ Mục tiêu của reinforcement learning đó là cho tác tử (agent) học cách tiến triển trong một môi trường.
+
+
+
+**31. Definitions**
+
+⟶ Định nghĩa
+
+
+
+**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
+
+⟶ Tiến trình quyết định Markov (Markov decision processes) - Tiến trình quyết định Markov (MDP) là một bộ 5 (5-tuple) (S,A,{Psa},γ,R) mà ở đó:
+
+
+
+**33. S is the set of states**
+
+⟶ S là tập hợp các trạng thái (states)
+
+
+
+**34. A is the set of actions**
+
+⟶ A là tập hợp các hành động (actions)
+
+
+
+**35. {Psa} are the state transition probabilities for s∈S and a∈A**
+
+⟶ {Psa} là xác suất chuyển tiếp trạng thái cho s∈S và a∈A
+
+
+
+**36. γ∈[0,1[ is the discount factor**
+
+⟶ γ∈[0,1[ là discount factor
+
+
+
+**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
+
+⟶ R:S×A⟶R hoặc R:S⟶R là reward function (hàm định nghĩa phần thưởng) mà giải thuật muốn tối đa hoá.
+
+
+
+**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
+
+⟶ Policy - Policy π là 1 hàm π:S⟶A có nhiệm vụ ánh xạ states tới actions
+
+
+
+**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
+
+⟶ Chú ý: Ta quy ước rằng ta thực thi policy π cho trước nếu cho trước state s ta có action a=π(s)
+
+
+
+**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
+
+⟶ Hàm giá trị (Value function) - Với policy cho trước π và state s, ta định nghĩa value function Vπ như sau:
+
+
+
+**41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:**
+
+⟶ Phương trình Bellman - Phương trình tối ưu Bellman đặc trưng hoá value function Vπ∗ của policy tối ưu (optimal policy) π∗:
+
+
+
+**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
+
+⟶ Chú ý: ta quy ước optimal policy π∗ đối với state s cho trước như sau:
+
+
+
+**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
+
+⟶ Giải thuật duyệt giá trị (Value iteration) - Giải thuật duyệt giá trị gồm 2 bước:
+
+
+
+**44. 1) We initialize the value:**
+
+⟶ 1) Ta khởi tạo giá trị (value):
+
+
+
+**45. 2) We iterate the value based on the values before:**
+
+⟶ 2) Ta duyệt qua giá trị dựa theo giá trị phía trước:
+
+
+
+**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
+
+⟶ Ước lượng khả năng tối đa (Maximum likelihood estimate) - Ước lượng khả năng tối đa cho xác suất chuyển tiếp trạng thái (state) sẽ như sau:
+
+
+
+**47. times took action a in state s and got to s′**
+
+⟶ số lần thực hiện hành động a ở state s và chuyển tới s′
+
+
+
+**48. times took action a in state s**
+
+⟶ số lần thực hiện hành động a ở state (trạng thái) s
+
+
+
+**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
+
+⟶ Q-learning ― Q-learning là 1 dạng phán đoán phi mô hình (model-free) của Q, được thực hiện như sau:
+
+
+
+**50. View PDF version on GitHub**
+
+⟶ Xem bản PDF trên GitHub
+
+
+
+**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
+
+⟶ [Mạng neural, Kiến trúc, Hàm kích hoạt, Lan truyền ngược, Dropout]
+
+
+
+**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
+
+⟶ [Mạng neural tích chập, Tầng tích chập, Chuẩn hoá batch]
+
+
+
+**53. [Recurrent Neural Networks, Gates, LSTM]**
+
+⟶ [Mạng neural hồi quy, Gates, LSTM]
+
+
+
+**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
+
+⟶ [Học tăng cường (Reinforcement learning), Tiến trình quyết định Markov, Lặp Giá trị/policy, Lập trình động xấp xỉ, Tìm kiếm Policy]
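Reviewer note on items 43-45 (value iteration): the two steps can be sketched in code as follows. This is a minimal illustration, not part of the translated file. It assumes the usual update V_{i+1}(s) = R(s) + max_a γ Σ_{s'} P_sa(s') V_i(s'), which is not spelled out in the strings above, and the two-state MDP, the names `value_iteration` and `greedy_policy`, and all numbers are hypothetical.

```python
def value_iteration(states, actions, P, R, gamma, n_iters=200):
    """P[s][a] maps next states to probabilities; R[s] is the reward of state s."""
    V = {s: 0.0 for s in states}  # step 1: initialize V_0(s) = 0
    for _ in range(n_iters):
        # step 2: update every state from the previous values
        V = {
            s: R[s] + max(
                gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
    return V


def greedy_policy(states, actions, P, V):
    """Pick, in each state, the action with the best expected next value."""
    return {
        s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in states
    }


# Hypothetical toy MDP: only state "B" is rewarded, transitions are deterministic.
states = ["A", "B"]
actions = ["stay", "move"]
P = {
    "A": {"stay": {"A": 1.0}, "move": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "move": {"A": 1.0}},
}
R = {"A": 0.0, "B": 1.0}

V = value_iteration(states, actions, P, R, gamma=0.5)
policy = greedy_policy(states, actions, P, V)
print(V, policy)  # V is close to {'A': 1.0, 'B': 2.0}; policy moves from A and stays in B
```

With γ=0.5 the fixed point satisfies V(B)=1+0.5·V(B)=2 and V(A)=0.5·V(B)=1, which is what the loop converges to.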
diff --git a/vi/cs-229-linear-algebra.md b/vi/cs-229-linear-algebra.md
new file mode 100644
index 000000000..53d7f2ff4
--- /dev/null
+++ b/vi/cs-229-linear-algebra.md
@@ -0,0 +1,345 @@
+**Linear Algebra and Calculus translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-algebra-calculus)
+
+
+
+**1. Linear Algebra and Calculus refresher**
+
+⟶ Đại số tuyến tính và Giải tích cơ bản
+
+
+
+**2. General notations**
+
+⟶ Kí hiệu chung
+
+
+
+**3. Definitions**
+
+⟶ Định nghĩa
+
+
+
+**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
+
+⟶ Vectơ ― Chúng ta kí hiệu x∈Rn là một vectơ với n phần tử, với xi∈R là phần tử thứ i:
+
+
+
+**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
+
+⟶ Ma trận ― Kí hiệu A∈Rm×n là một ma trận với m hàng và n cột, Ai,j∈R là phần tử nằm ở hàng thứ i, cột j:
+
+
+
+**6. Remark: the vector x defined above can be viewed as an n×1 matrix and is more particularly called a column-vector.**
+
+⟶ Ghi chú: vectơ x được xác định ở trên có thể coi như một ma trận n×1 và được gọi là vectơ cột.
+
+
+
+**7. Main matrices**
+
+⟶ Ma trận chính
+
+
+
+**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
+
+⟶ Ma trận đơn vị ― Ma trận đơn vị I∈Rn×n là một ma trận vuông với các phần tử trên đường chéo chính bằng 1 và các phần tử còn lại bằng 0:
+
+
+
+**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
+
+⟶ Ghi chú: với mọi ma trận vuông A∈Rn×n, ta có A×I=I×A=A.
+
+
+
+**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
+
+⟶ Ma trận đường chéo ― Ma trận đường chéo D∈Rn×n là một ma trận vuông với các phần tử trên đường chéo chính khác 0 và các phần tử còn lại bằng 0:
+
+
+
+**11. Remark: we also note D as diag(d1,...,dn).**
+
+⟶ Ghi chú: chúng ta kí hiệu D là diag(d1,...,dn).
+
+
+
+**12. Matrix operations**
+
+⟶ Các phép toán ma trận
+
+
+
+**13. Multiplication**
+
+⟶ Phép nhân
+
+
+
+**14. Vector-vector ― There are two types of vector-vector products:**
+
+⟶ Vectơ/vectơ ― Có hai loại phép nhân vectơ/vectơ:
+
+
+
+**15. inner product: for x,y∈Rn, we have:**
+
+⟶ phép nhân inner: với x,y∈Rn, ta có:
+
+
+
+**16. outer product: for x∈Rm,y∈Rn, we have:**
+
+⟶ phép nhân outer: với x∈Rm,y∈Rn, ta có:
+
+
+
+**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
+
+⟶ Ma trận/vectơ ― Phép nhân giữa ma trận A∈Rm×n và vectơ x∈Rn là một vectơ có kích thước Rm, sao cho:
+
+
+
+**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
+
+⟶ với aTr,i là các vectơ hàng và ac,j là các vectơ cột của A, và xi là các phần tử của x.
+
+
+
+**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
+
+⟶ Ma trận/ma trận ― Phép nhân giữa ma trận A∈Rm×n và B∈Rn×p là một ma trận kích thước Rm×p, sao cho:
+
+
+
+**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
+
+⟶ với aTr,i,bTr,i là các vectơ hàng và ac,j,bc,j lần lượt là các vectơ cột của A và B.
+
+
+
+**21. Other operations**
+
+⟶ Một số phép toán khác
+
+
+
+**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
+
+⟶ Chuyển vị ― Chuyển vị của một ma trận A∈Rm×n, kí hiệu AT, khi các phần tử hàng cột hoán đổi vị trí cho nhau:
+
+
+
+**23. Remark: for matrices A,B, we have (AB)T=BTAT**
+
+⟶ Ghi chú: với ma trận A,B, ta có (AB)T=BTAT
+
+
+
+**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
+
+⟶ Nghịch đảo ― Nghịch đảo của ma trận vuông khả đảo A được kí hiệu là A−1 và là ma trận duy nhất sao cho:
+
+
+
+**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
+
+⟶ Ghi chú: không phải tất cả các ma trận vuông đều khả đảo. Ngoài ra, với ma trận A,B, ta có (AB)−1=B−1A−1
+
+
+
+**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
+
+⟶ Truy vết ― Truy vết của ma trận vuông A, kí hiệu tr(A), là tổng của các phần tử trên đường chéo chính của nó:
+
+
+
+**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
+
+⟶ Ghi chú: với ma trận A,B, chúng ta có tr(AT)=tr(A) và tr(AB)=tr(BA)
+
+
+
+**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
+
+⟶ Định thức ― Định thức của một ma trận vuông A∈Rn×n, kí hiệu |A| hay det(A) được tính đệ quy với A∖i,∖j, ma trận A xóa đi hàng thứ i và cột thứ j:
+
+
+
+**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
+
+⟶ Ghi chú: A khả đảo nếu và chỉ nếu |A|≠0. Ngoài ra, |AB|=|A||B| và |AT|=|A|.
+
+
+
+**30. Matrix properties**
+
+⟶ Những tính chất của ma trận
+
+
+
+**31. Definitions**
+
+⟶ Định nghĩa
+
+
+
+**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
+
+⟶ Phân rã đối xứng - Một ma trận A đã cho có thể được biểu diễn dưới dạng các phần đối xứng và phản đối xứng của nó như sau:
+
+
+
+**33. [Symmetric, Antisymmetric]**
+
+⟶ [Đối xứng, Phản đối xứng]
+
+
+
+**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+
+⟶ Chuẩn (norm) ― Một chuẩn (norm) là một hàm N:V⟶[0,+∞[ mà V là một không gian vectơ, và với mọi x,y∈V, ta có:
+
+
+
+**35. N(ax)=|a|N(x) for a scalar**
+
+⟶ N(ax)=|a|N(x) với a là một số
+
+
+
+**36. if N(x)=0, then x=0**
+
+⟶ nếu N(x)=0, thì x=0
+
+
+
+**37. For x∈V, the most commonly used norms are summed up in the table below:**
+
+⟶ Với x∈V, các chuẩn thường dùng được tổng hợp ở bảng dưới đây:
+
+
+
+**38. [Norm, Notation, Definition, Use case]**
+
+⟶ [Chuẩn, Kí hiệu, Định nghĩa, Trường hợp dùng]
+
+
+
+**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
+
+⟶ Sự phụ thuộc tuyến tính ― Một tập hợp các vectơ được cho là phụ thuộc tuyến tính nếu một trong các vectơ trong tập hợp có thể được biểu diễn bởi một tổ hợp tuyến tính của các vectơ khác.
+
+
+
+**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
+
+⟶ Ghi chú: nếu không có vectơ nào có thể được viết theo cách này, thì các vectơ được cho là độc lập tuyến tính
+
+
+
+**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
+
+⟶ Hạng ma trận (rank) ― Hạng của một ma trận A kí hiệu rank(A) và là số chiều của không gian vectơ được tạo bởi các cột của nó. Điều này tương đương với số cột độc lập tuyến tính tối đa của A.
+
+
+
+**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
+
+⟶ Ma trận bán xác định dương - Ma trận A∈Rn×n là bán xác định dương (PSD) kí hiệu A⪰0 nếu chúng ta có:
+
+
+
+**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies xTAx>0 for every non-zero vector x.**
+
+⟶ Ghi chú: tương tự, một ma trận A được cho là xác định dương và được kí hiệu A≻0, nếu đó là ma trận PSD thỏa mãn cho tất cả các vectơ khác không x, xTAx>0.
+
+
+
+**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ Giá trị riêng, vectơ riêng - Cho ma trận A∈Rn×n, λ được gọi là giá trị riêng của A nếu tồn tại một vectơ z∈Rn∖{0}, được gọi là vectơ riêng, sao cho:
+
+
+
+**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ Định lý phổ - Cho A∈Rn×n. Nếu A đối xứng, thì A có thể chéo hóa bởi một ma trận trực giao thực U∈Rn×n. Bằng cách kí hiệu Λ=diag(λ1,...,λn), chúng ta có:
+
+
+
+**46. diagonal**
+
+⟶ đường chéo
+
+
+
+**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
+
+⟶ Phân tích giá trị suy biến - Đối với một ma trận A có kích thước m×n, Phân tích giá trị suy biến (SVD) là một kỹ thuật phân tích nhân tố nhằm đảm bảo sự tồn tại của đơn vị U m×m, đường chéo Σm×n và đơn vị V n×n ma trận, sao cho:
+
+
+
+**48. Matrix calculus**
+
+⟶ Giải tích ma trận
+
+
+
+**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
+
+⟶ Gradient ― Cho f:Rm×n→R là một hàm và A∈Rm×n là một ma trận. Gradient của f đối với A là ma trận m×n, được kí hiệu là ∇Af(A), sao cho:
+
+
+
+
+
+**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
+
+⟶ Ghi chú: gradient của f chỉ được xác định khi f là hàm trả về một số.
+
+
+
+**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
+
+⟶ Hessian ― Cho f:Rn→R là một hàm và x∈Rn là một vectơ. Hessian của f đối với x là một ma trận đối xứng n×n, được kí hiệu là ∇2xf(x), sao cho:
+
+
+
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+
+⟶ Ghi chú: hessian của f chỉ được xác định khi f là hàm trả về một số.
+
+
+
+**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
+
+⟶ Các phép toán của gradient ― Đối với ma trận A,B,C, các thuộc tính gradient sau cần để lưu ý:
+
+
+
+**54. [General notations, Definitions, Main matrices]**
+
+⟶ [Kí hiệu chung, Định nghĩa, Ma trận chính]
+
+
+
+**55. [Matrix operations, Multiplication, Other operations]**
+
+⟶ [Phép toán ma trận, Phép nhân, Các phép toán khác]
+
+
+
+**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
+
+⟶ [Các thuộc tính ma trận, Chuẩn, Giá trị riêng/Vectơ riêng, Phân tích giá trị suy biến]
+
+
+
+**57. [Matrix calculus, Gradient, Hessian, Operations]**
+
+⟶ [Giải tích ma trận, Gradient, Hessian, Phép tính]
diff --git a/vi/cs-229-machine-learning-tips-and-tricks.md b/vi/cs-229-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..d08b7cd9a
--- /dev/null
+++ b/vi/cs-229-machine-learning-tips-and-tricks.md
@@ -0,0 +1,285 @@
+**1. Machine Learning tips and tricks cheatsheet**
+
+⟶ Các mẹo và thủ thuật trong Machine Learning (Học máy)
+
+
+
+**2. Classification metrics**
+
+⟶ Độ đo phân loại
+
+
+
+**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+⟶ Trong bối cảnh phân loại nhị phân (binary classification), dưới đây là các độ đo chính cần theo dõi (track) để đánh giá hiệu năng của mô hình (model).
+
+
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+⟶ Ma trận nhầm lẫn (Confusion matrix) - Confusion matrix được sử dụng để có kết quả hoàn chỉnh hơn khi đánh giá hiệu năng của model. Nó được định nghĩa như sau:
+
+
+
+**5. [Predicted class, Actual class]**
+
+⟶ [Lớp dự đoán, lớp thực sự]
+
+
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+⟶ Độ đo chính - Các độ đo sau thường được sử dụng để đánh giá hiệu năng của mô hình phân loại:
+
+
+
+**7. [Metric, Formula, Interpretation]**
+
+⟶ [Độ đo, Công thức, Diễn giải]
+
+
+
+**8. Overall performance of model**
+
+⟶ Hiệu năng tổng thể của mô hình
+
+
+
+**9. How accurate the positive predictions are**
+
+⟶ Độ chính xác của các dự đoán positive
+
+
+
+**10. Coverage of actual positive sample**
+
+⟶ Bao phủ các mẫu thử chính xác (positive) thực sự
+
+
+
+**11. Coverage of actual negative sample**
+
+⟶ Bao phủ các mẫu thử sai (negative) thực sự
+
+
+
+**12. Hybrid metric useful for unbalanced classes**
+
+⟶ Độ đo Hybrid hữu ích cho các lớp không cân bằng (unbalanced classes)
+
+
+
+**13. ROC ― The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
+
+⟶ ROC - Đường cong thao tác nhận, được kí hiệu là ROC, là minh hoạ của TPR với FPR bằng việc thay đổi ngưỡng (threshold). Các độ đo này được tổng kết ở bảng bên dưới:
+
+
+
+**14. [Metric, Formula, Equivalent]**
+
+⟶ [Độ đo, Công thức, Tương đương]
+
+
+
+**15. AUC ― The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+⟶ AUC - Khu vực phía dưới đường cong thao tác nhận, còn được gọi tắt là AUC hoặc AUROC, là khu vực phía dưới ROC như hình minh hoạ phía dưới:
+
+
+
+**16. [Actual, Predicted]**
+
+⟶ [Thực sự, Dự đoán]
+
+
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+⟶ Độ đo cơ bản - Cho trước mô hình hồi quy f, độ đo sau được sử dụng phổ biến để đánh giá hiệu năng của mô hình:
+
+
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+⟶ [Tổng của tổng các bình phương, Mô hình tổng bình phương, Tổng bình phương dư]
+
+
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+⟶ Hệ số quyết định - Hệ số quyết định, thường được kí hiệu là R2 hoặc r2, cung cấp độ đo mức độ tốt của kết quả quan sát đầu ra (được nhân rộng bởi mô hình), và được định nghĩa như sau:
+
+
+
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+⟶ Độ đo chính - Độ đo sau đây thường được sử dụng để đánh giá hiệu năng của mô hình hồi quy, bằng cách tính số lượng các biến n mà độ đo đó sẽ cân nhắc:
+
+
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+⟶ trong đó L là khả năng và ˆσ2 là giá trị ước tính của phương sai tương ứng với mỗi response (hồi đáp).
+
+
+
+**22. Model selection**
+
+⟶ Lựa chọn model (mô hình)
+
+
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+⟶ Vocabulary - Khi lựa chọn mô hình, chúng ta chia tập dữ liệu thành 3 tập con như sau:
+
+
+
+**24. [Training set, Validation set, Testing set]**
+
+⟶ [Tập huấn luyện, Tập xác thực, Tập kiểm tra (testing)]
+
+
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+⟶ [Mô hình được huấn luyện, mô hình được xác thực, mô hình đưa ra dự đoán]
+
+
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+⟶ [Thường là 80% tập dữ liệu, Thường là 20% tập dữ liệu]
+
+
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+⟶ [Cũng được gọi là hold-out hoặc development set (tập phát triển), Dữ liệu chưa được biết]
+
+
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+⟶ Khi mô hình đã được chọn, nó sẽ được huấn luyện trên toàn bộ tập dữ liệu và được test trên tập test chưa từng thấy. Tất cả được minh hoạ ở hình bên dưới:
+
+
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+⟶ Cross-validation - Cross-validation, còn được gọi là CV, một phương thức được sử dụng để chọn ra một mô hình không dựa quá nhiều vào tập dữ liệu huấn luyện ban đầu. Các loại khác nhau được tổng kết ở bảng bên dưới:
+
+
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+⟶ [Huấn luyện trên k-1 phần và đánh giá trên 1 phần còn lại, Huấn luyện trên n-p phần và đánh giá trên p phần còn lại]
+
+
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+⟶ [Thường thì k=5 hoặc 10, Trường hợp p=1 được gọi là leave-one-out]
+
+
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+⟶ Phương thức hay được sử dụng được gọi là k-fold cross-validation và chia dữ liệu huấn luyện thành k phần, đánh giá mô hình trên 1 phần trong khi huấn luyện mô hình trên k-1 phần còn lại, tất cả k lần. Lỗi sau đó được tính trung bình trên k phần và được đặt tên là cross-validation error.
+
+
+
+**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+⟶ Chuẩn hoá - Mục đích của thủ tục chuẩn hoá là tránh cho mô hình bị overfit với dữ liệu, qua đó xử lý các vấn đề phương sai cao. Bảng sau đây sẽ tổng kết các loại kĩ thuật chuẩn hoá khác nhau hay được sử dụng:
+
+
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Giảm hệ số xuống còn 0, Tốt cho việc lựa chọn biến, Làm cho hệ số nhỏ hơn, Thay đổi giữa chọn biến và hệ số nhỏ hơn]
+
+
+
+**35. Diagnostics**
+
+⟶ Chẩn đoán (Diagnostics)
+
+
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+⟶ Bias - Bias của mô hình là sai số giữa dự đoán mong đợi và dự đoán của mô hình trên các điểm dữ liệu cho trước.
+
+
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+⟶ Phương sai - Phương sai của một mô hình là sự thay đổi dự đoán của mô hình trên các điểm dữ liệu cho trước.
+
+
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+⟶ Đánh đổi Bias/phương sai - Mô hình càng đơn giản bias càng lớn, mô hình càng phức tạp phương sai càng cao.
+
+
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+⟶ [Symptoms, Minh hoạ hồi quy, Minh hoạ phân loại, Minh hoạ deep learning (học sâu), Biện pháp khắc phục có thể dùng]
+
+
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+⟶ [Lỗi huấn luyện cao, Lỗi huấn luyện tiến gần tới lỗi test, Bias cao, Lỗi huấn luyện thấp hơn một chút so với lỗi test, Lỗi huấn luyện rất thấp, Lỗi huấn luyện thấp hơn lỗi test rất nhiều, Phương sai cao]
+
+
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+⟶ [Tăng độ phức tạp của mô hình, Thêm nhiều đặc trưng, Huấn luyện lâu hơn, Thực hiện chuẩn hóa, Lấy nhiều dữ liệu hơn]
+
+
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+⟶ Phân tích lỗi - Phân tích lỗi là phân tích nguyên nhân của sự khác biệt trong hiệu năng giữa mô hình hiện tại và mô hình lí tưởng.
+
+
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+⟶ Phân tích Ablative - Phân tích Ablative là phân tích nguyên nhân của sự khác biệt giữa hiệu năng của mô hình hiện tại và mô hình cơ sở.
+
+
+
+**44. Regression metrics**
+
+⟶ Độ đo hồi quy
+
+
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+⟶ [Độ đo phân loại, Ma trận nhầm lẫn, accuracy (độ chính xác), precision, recall, Điểm F1, ROC]
+
+
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+⟶ [Độ đo hồi quy, Bình phương R, CP của Mallow, AIC, BIC]
+
+
+
+**47. [Model selection, cross-validation, regularization]**
+
+⟶ [Lựa chọn mô hình, cross-validation, Chuẩn hoá (regularization)]
+
+
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+⟶ [Chẩn đoán, Đánh đổi Bias/phương sai, Phân tích lỗi/ablative]
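Reviewer note on items 6-12 and 29-32: the classification metrics and the k-fold split can be sketched as below. The formulas are the standard confusion-matrix definitions, which the strings above reference but do not spell out. The function names `classification_metrics` and `k_fold_indices` and the counts are hypothetical, and contiguous, unshuffled folds are a simplifying assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # overall performance
        "precision": tp / (tp + fp),    # how accurate positive predictions are
        "recall": tp / (tp + fn),       # coverage of actual positives
        "specificity": tn / (tn + fp),  # coverage of actual negatives
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
folds = list(k_fold_indices(n=10, k=5))
print(m["accuracy"], m["precision"], m["specificity"])  # 0.7 0.8 0.75
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

In practice the data would normally be shuffled before splitting, and the cross-validation error is the average of the k validation errors.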
diff --git a/vi/cs-229-probability.md b/vi/cs-229-probability.md
new file mode 100644
index 000000000..4cedb5ed5
--- /dev/null
+++ b/vi/cs-229-probability.md
@@ -0,0 +1,385 @@
+**Probabilities and Statistics translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-probabilities-statistics)
+
+
+
+**1. Probabilities and Statistics refresher**
+
+⟶ Xác suất và Thống kê cơ bản
+
+
+
+**2. Introduction to Probability and Combinatorics**
+
+⟶ Giới thiệu về Xác suất và Tổ hợp
+
+
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+⟶ Không gian mẫu - Một tập hợp các kết cục có thể xảy ra của một phép thử được gọi là không gian mẫu của phép thử và được kí hiệu là S.
+
+
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+⟶ Sự kiện (hay còn gọi là biến cố) - Bất kỳ một tập hợp con E nào của không gian mẫu đều được gọi là một sự kiện. Một sự kiện là một tập các kết cục có thể xảy ra của phép thử. Nếu kết quả của phép thử chứa trong E, chúng ta nói sự kiện E đã xảy ra.
+
+
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+⟶ Tiên đề của xác suất - Với mỗi sự kiện E, chúng ta kí hiệu P(E) là xác suất sự kiện E xảy ra.
+
+
+
+**6. Axiom 1 ― Every probability is between 0 and 1 inclusive, i.e.:**
+
+⟶ Tiên đề 1 - Mọi xác suất bất kì đều nằm trong khoảng 0 đến 1:
+
+
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e.:**
+
+⟶ Tiên đề 2 - Xác suất xảy ra của ít nhất một phần tử trong toàn bộ không gian mẫu là 1:
+
+
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+⟶ Tiên đề 3 - Với một chuỗi các biến cố xung khắc E1,...,En, ta có:
+
+
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+⟶ Hoán vị - Hoán vị là một cách sắp xếp r phần tử từ một nhóm n phần tử, theo một thứ tự nhất định. Số lượng cách sắp xếp như vậy là P(n,r), được định nghĩa như sau:
+
+
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+⟶ Tổ hợp - Một tổ hợp là một cách sắp xếp r phần tử từ n phần tử, không quan trọng thứ tự. Số lượng cách sắp xếp như vậy là C(n,r), được định nghĩa như sau:
+
+
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+⟶ Ghi chú: Chúng ta lưu ý rằng với 0⩽r⩽n, ta có P(n,r)⩾C(n,r)
+
+
+
+**12. Conditional Probability**
+
+⟶ Xác suất có điều kiện
+
+
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+⟶ Định lí Bayes - Với các sự kiện A và B sao cho P(B)>0, ta có:
+
+
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+⟶ Ghi chú: ta có P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
+
+
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+⟶ Phân vùng ― Cho {Ai,i∈[[1,n]]} sao cho với mỗi i, Ai≠∅. Chúng ta nói rằng {Ai} là một phân vùng nếu có:
+
+
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+⟶ Ghi chú: với bất cứ sự kiện B nào trong không gian mẫu, ta có P(B)=n∑i=1P(B|Ai)P(Ai).
+
+
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+⟶ Định lý Bayes mở rộng - Cho {Ai,i∈[[1,n]]} là một phân vùng của không gian mẫu. Ta có:
+
+
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+⟶ Sự kiện độc lập - Hai sự kiện A và B được coi là độc lập khi và chỉ khi ta có:
+
+
+
+**19. Random Variables**
+
+⟶ Biến ngẫu nhiên
+
+
+
+**20. Definitions**
+
+⟶ Định nghĩa
+
+
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to the real line.**
+
+⟶ Biến ngẫu nhiên - Một biến ngẫu nhiên, thường được kí hiệu là X, là một hàm nối mỗi phần tử trong một không gian mẫu thành một số thực.
+
+
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+⟶ Hàm phân phối tích lũy (CDF) ― Hàm phân phối tích lũy F, là một hàm đơn điệu không giảm, sao cho limx→−∞F(x)=0 và limx→+∞F(x)=1, được định nghĩa là:
+
+
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+⟶ Ghi chú: ta có P(a<X⩽b)=F(b)−F(a).
+
+
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+⟶ Hàm mật độ xác suất (PDF) - Hàm mật độ xác suất f là xác suất mà X nhận các giá trị giữa hai giá trị thực liền kề của biến ngẫu nhiên.
+
+
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+⟶ Mối quan hệ liên quan giữa PDF và CDF - Dưới đây là các thuộc tính quan trọng cần biết trong trường hợp rời rạc (D) và liên tục (C).
+
+
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+⟶ [Trường hợp, CDF F, PDF f, Thuộc tính của PDF]
+
+
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+⟶ Kỳ vọng và moment của phân phối - Dưới đây là các biểu thức của giá trị kì vọng E[X], giá trị kì vọng tổng quát E[g(X)], moment bậc k E[Xk] và hàm đặc trưng ψ(ω) cho các trường hợp rời rạc và liên tục:
+
+
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+⟶ Phương sai - Phương sai của một biến ngẫu nhiên, thường được kí hiệu là Var(X) hoặc σ2, là một độ đo mức độ phân tán của hàm phân phối. Nó được xác định như sau:
+
+
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+⟶ Độ lệch chuẩn - Độ lệch chuẩn của một biến ngẫu nhiên, thường được kí hiệu σ, là thước đo mức độ phân tán của hàm phân phối của nó, có cùng đơn vị với biến ngẫu nhiên thực tế. Nó được xác định như sau:
+
+
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+⟶ Biến đổi các biến ngẫu nhiên - Đặt các biến X và Y được liên kết với nhau bởi một hàm. Kí hiệu fX và fY lần lượt là các phân phối của X và Y, ta có:
+
+
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+⟶ Quy tắc tích phân Leibniz - Gọi g là hàm của x và có thể của cả c, và a,b là các cận có thể phụ thuộc vào c. Chúng ta có:
+
+
+
+**32. Probability Distributions**
+
+⟶ Phân bố xác suất
+
+
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+⟶ Bất đẳng thức Chebyshev - Gọi X là biến ngẫu nhiên có giá trị kỳ vọng μ. Với k,σ>0, chúng ta có bất đẳng thức sau:
+
+
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+⟶ Các phân phối chính - Dưới là các phân phối chính cần ghi nhớ:
+
+
+
+**35. [Type, Distribution]**
+
+⟶ [Loại, Phân phối]
+
+
+
+**36. Jointly Distributed Random Variables**
+
+⟶ Phân phối đồng thời biến ngẫu nhiên
+
+
+
+**37. Marginal density and cumulative distribution ― From the joint probability density function fXY, we have:**
+
+⟶ Mật độ biên và phân phối tích lũy - Từ hàm phân phối mật độ đồng thời fXY, ta có
+
+
+
+**38. [Case, Marginal density, Cumulative function]**
+
+⟶ [Trường hợp, Mật độ biên, Hàm tích lũy]
+
+
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+⟶ Mật độ có điều kiện - Mật độ có điều kiện của X với Y, thường được kí hiệu là fX|Y, được định nghĩa như sau:
+
+
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+⟶ Tính chất độc lập - Hai biến ngẫu nhiên X và Y độc lập nếu ta có:
+
+
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+⟶ Hiệp phương sai - Chúng ta xác định hiệp phương sai của hai biến ngẫu nhiên X và Y, thường được kí hiệu σ2XY hay Cov(X,Y), như sau:
+
+
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+⟶ Hệ số tương quan ― Kí hiệu σX,σY là độ lệch chuẩn của X và Y, chúng ta xác định hệ số tương quan giữa X và Y, kí hiệu ρXY, như sau:
+
+
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+⟶ Ghi chú 1: chúng ta lưu ý rằng với bất cứ biến ngẫu nhiên X,Y nào, ta luôn có ρXY∈[−1,1].
+
+
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+⟶ Ghi chú 2: Nếu X và Y độc lập với nhau thì ρXY=0.
+
+
+
+**45. Parameter estimation**
+
+⟶ Ước lượng tham số
+
+
+
+**46. Definitions**
+
+⟶ Định nghĩa
+
+
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+⟶ Mẫu ngẫu nhiên - Mẫu ngẫu nhiên là tập hợp của n biến ngẫu nhiên X1,...,Xn độc lập và được phân phối giống hệt với X.
+
+
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+⟶ Công cụ ước tính (estimator) - Công cụ ước tính (estimator) là một hàm của dữ liệu được sử dụng để suy ra giá trị của một tham số chưa biết trong mô hình thống kê.
+
+
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+⟶ Thiên vị (bias) - Thiên vị (bias) của estimator ^θ được định nghĩa là chênh lệch giữa giá trị kì vọng của phân phối ^θ và giá trị thực, tức là:
+
+
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+⟶ Ghi chú: một công cụ ước tính (estimator) được cho là không thiên vị (unbiased) khi chúng ta có E[^θ]=θ.
+
+
+
+**51. Estimating the mean**
+
+⟶ Ước lượng trung bình
+
+
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted X̄ and is defined as follows:**
+
+⟶ Giá trị trung bình mẫu - Giá trị trung bình mẫu của mẫu ngẫu nhiên được sử dụng để ước tính giá trị trung bình thực μ của phân phối, thường được kí hiệu X̄ và được định nghĩa như sau:
+
+
+
+**53. Remark: the sample mean is unbiased, i.e. E[X̄]=μ.**
+
+⟶ Ghi chú: Trung bình mẫu là không thiên vị (unbiased), nghĩa là E[X̄]=μ.
+
+
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+⟶ Định lý giới hạn trung tâm - Giả sử chúng ta có một mẫu ngẫu nhiên X1,...,Xn theo một phân phối nhất định với trung bình μ và phương sai σ2, khi đó chúng ta có:
+
+
+
+**55. Estimating the variance**
+
+⟶ Ước lượng phương sai
+
+
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+⟶ Phương sai mẫu - Phương sai mẫu của mẫu ngẫu nhiên được sử dụng để ước lượng phương sai thực sự σ2 của phân phối, thường được kí hiệu là s2 hoặc ^σ2 và được định nghĩa như sau:
+
+
+
+**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
+
+⟶ Ghi chú: phương sai mẫu không thiên vị (unbiased), nghĩa là E[s2]=σ2.
+
+
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+⟶ Quan hệ Chi-Squared với phương sai mẫu - Với s2 là phương sai mẫu của một mẫu ngẫu nhiên, ta có:
+
+
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+⟶ [Giới thiệu, Không gian mẫu, Sự kiện, Hoán vị]
+
+
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+⟶ [Xác suất có điều kiện, Định lý Bayes, Sự độc lập]
+
+
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+⟶ [Biến ngẫu nhiên, Định nghĩa, Kì vọng, Phương sai]
+
+
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+⟶ [Phân bố xác suất, Bất đẳng thức Chebyshev, Các phân phối chính]
+
+
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+⟶ [Các biến ngẫu nhiên đồng thời, Mật độ, Hiệp phương sai, Hệ số tương quan]
+
+
+
+**64. [Parameter estimation, Mean, Variance]**
+
+⟶ [Ước lượng tham số, Trung bình, Phương sai]
diff --git a/vi/cs-229-supervised-learning.md b/vi/cs-229-supervised-learning.md
new file mode 100644
index 000000000..3bdac042a
--- /dev/null
+++ b/vi/cs-229-supervised-learning.md
@@ -0,0 +1,567 @@
+**1. Supervised Learning cheatsheet**
+
+⟶ Cheatsheet học có giám sát
+
+
+
+**2. Introduction to Supervised Learning**
+
+⟶ Giới thiệu về học có giám sát
+
+
+
+**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+⟶ Cho một tập hợp các điểm dữ liệu {x(1),...,x(m)} tương ứng với đó là tập các đầu ra {y(1),...,y(m)}, chúng ta muốn xây dựng một bộ phân loại học được cách dự đoán y từ x.
+
+
+
+**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+⟶ Loại dự đoán - Các loại mô hình dự đoán được tổng kết trong bảng bên dưới:
+
+
+
+**5. [Regression, Classifier, Outcome, Examples]**
+
+⟶ [Hồi quy, Phân loại, Đầu ra, Các ví dụ]
+
+
+
+**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+⟶ [Liên tục, Lớp, Hồi quy tuyến tính, Hồi quy Logistic, SVM, Naive Bayes]
+
+
+
+**7. Type of model ― The different models are summed up in the table below:**
+
+⟶ Loại mô hình - Các mô hình khác nhau được tổng kết trong bảng bên dưới:
+
+
+
+**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+⟶ [Mô hình phân biệt, Mô hình sinh, Mục tiêu, Những gì học được, Hình minh hoạ, Các ví dụ]
+
+
+
+**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+⟶ [Ước lượng trực tiếp P(y|x), Ước lượng P(x|y) để tiếp tục suy luận P(y|x), Biên quyết định, Phân bố xác suất của dữ liệu, Hồi quy, SVMs, GDA, Naive Bayes]
+
+
+
+**10. Notations and general concepts**
+
+⟶ Các kí hiệu và khái niệm tổng quát
+
+
+
+**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+⟶ Hypothesis - Hypothesis được kí hiệu là hθ, là một mô hình mà chúng ta chọn. Với dữ liệu đầu vào cho trước x(i), mô hình dự đoán đầu ra là hθ(x(i)).
+
+
+
+**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+⟶ Hàm mất mát - Hàm mất mát là một hàm số dạng: L:(z,y)∈R×Y⟼L(z,y)∈R lấy đầu vào là giá trị dự đoán được z tương ứng với đầu ra thực tế là y, hàm có đầu ra là sự khác biệt giữa hai giá trị này. Các hàm mất mát phổ biến được tổng kết ở bảng dưới đây:
+
+
+
+**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+⟶ [Least squared error, Mất mát Logistic, Mất mát Hinge, Cross-entropy]
+
+
+
+**14. [Linear regression, Logistic regression, SVM, Neural Network]**
+
+⟶ [Hồi quy tuyến tính, Hồi quy Logistic, SVM, Mạng neural]
+
+
+
+**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+⟶ Hàm giá trị (Cost function) - Cost function J thường được sử dụng để đánh giá hiệu năng của mô hình và được định nghĩa với hàm mất mát L như sau:
+
+
+
+**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+⟶ Gradient descent - Kí hiệu α∈R là tốc độ học, quy tắc cập nhật của gradient descent được biểu diễn theo tốc độ học và cost function J như sau:
+
+
+
+**17. Remark: Stochastic gradient descent (SGD) updates the parameters based on each training example, whereas batch gradient descent updates them based on a batch of training examples.**
+
+⟶ Chú ý: Stochastic gradient descent (SGD) là việc cập nhật tham số dựa theo mỗi ví dụ huấn luyện, và batch gradient descent là dựa trên một lô (batch) các ví dụ huấn luyện.
+
+
+
+**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+⟶ Likelihood - Likelihood của một mô hình L(θ) với tham số θ được sử dụng để tìm tham số tối ưu θ thông qua việc cực đại hoá likelihood. Trong thực tế, chúng ta sử dụng log-likelihood ℓ(θ)=log(L(θ)) để dễ dàng hơn trong việc tối ưu hoá. Ta có:
+
+
+
+**19. Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+⟶ Giải thuật Newton - Giải thuật Newton là một phương thức số tìm θ thoả mãn điều kiện ℓ′(θ)=0. Quy tắc cập nhật của nó là như sau:
+
+
+
+**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+⟶ Chú ý: Tổng quát hoá đa chiều, còn được biết đến như là phương thức Newton-Raphson, có quy tắc cập nhật như sau:
+
+
+
+**21. Linear models**
+
+⟶ Các mô hình tuyến tính
+
+
+
+**22. Linear regression**
+
+⟶ Hồi quy tuyến tính
+
+
+
+**23. We assume here that y|x;θ∼N(μ,σ2)**
+
+⟶ Chúng ta giả sử ở đây rằng y|x;θ∼N(μ,σ2)
+
+
+
+**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+⟶ Phương trình chuẩn - Bằng việc kí hiệu X là ma trận thiết kế, giá trị của θ làm cực tiểu hoá cost function là một nghiệm dạng đóng như sau:
+
+
+
+**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+⟶ Giải thuật LMS - Bằng việc kí hiệu α là tốc độ học, quy tắc cập nhật của giải thuật Least Mean Squares (LMS) cho tập huấn luyện của m điểm dữ liệu, còn được biết như là quy tắc học Widrow-Hoff, là như sau:
+
+
+
+**26. Remark: the update rule is a particular case of the gradient ascent.**
+
+⟶ Chú ý: Luật cập nhật là một trường hợp đặc biệt của gradient ascent.
+
+
+
+**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+⟶ LWR - Hồi quy trọng số cục bộ, còn được biết với cái tên LWR, là biến thể của hồi quy tuyến tính, nó sẽ đánh trọng số cho mỗi ví dụ huấn luyện trong cost function của nó bởi w(i)(x), được định nghĩa với tham số τ∈R như sau:
+
+
+
+**28. Classification and logistic regression**
+
+⟶ Phân loại và hồi quy logistic
+
+
+
+**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+⟶ Hàm Sigmoid - Hàm sigmoid g, còn được biết đến như là hàm logistic, được định nghĩa như sau:
+
+
+
+**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+⟶ Hồi quy logistic - Chúng ta giả sử ở đây rằng y|x;θ∼Bernoulli(ϕ). Ta có công thức như sau:
+
+
+
+**31. Remark: there is no closed form solution for the case of logistic regressions.**
+
+⟶ Chú ý: không có giải pháp dạng đóng cho trường hợp của hồi quy logistic.
+
+
+
+**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+⟶ Hồi quy Softmax - Hồi quy softmax, còn được gọi là hồi quy logistic đa lớp, được sử dụng để tổng quát hoá hồi quy logistic khi có nhiều hơn 2 lớp đầu ra. Theo quy ước, chúng ta thiết lập θK=0, làm cho tham số Bernoulli ϕi của mỗi lớp i bằng với:
+
+
+
+**33. Generalized Linear Models**
+
+⟶ Mô hình tuyến tính tổng quát
+
+
+
+**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+⟶ Họ số mũ - Một lớp phân phối được cho là thuộc về họ số mũ nếu nó có thể được viết theo một tham số tự nhiên η, còn được gọi là tham số chính tắc (canonical parameter) hoặc hàm liên kết (link function), một thống kê đủ T(y) và một hàm log-partition a(η) như sau:
+
+
+
+**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+⟶ Chú ý: chúng ta thường có T(y)=y. Đồng thời, exp(−a(η)) có thể được xem như là tham số chuẩn hoá sẽ đảm bảo rằng tổng các xác suất là một.
+
+
+
+**36. Here are the most common exponential distributions summed up in the following table:**
+
+⟶ Dưới đây là các phân phối mũ phổ biến nhất được tổng kết ở bảng bên dưới:
+
+
+
+**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+⟶ [Phân phối, Bernoulli, Gaussian, Poisson, Geometric]
+
+
+
+**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+⟶ Giả thuyết GLMs - Mô hình tuyến tính tổng quát (GLM) với mục đích là dự đoán một biến ngẫu nhiên y như là hàm cho biến x∈Rn+1 và dựa trên 3 giả thuyết sau:
+
+
+
+**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+⟶ Chú ý: Bình phương nhỏ nhất thông thường và hồi quy logistic đều là các trường hợp đặc biệt của các mô hình tuyến tính tổng quát.
+
+
+
+**40. Support Vector Machines**
+
+⟶ Máy vector hỗ trợ
+
+
+
+**41. The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+
+⟶ Mục tiêu của máy vector hỗ trợ là tìm ra đường thẳng tối đa hoá khoảng cách nhỏ nhất tới đường thẳng đó.
+
+
+
+**42. Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+⟶ Optimal margin classifier - Optimal margin classifier h là như sau:
+
+
+
+**43. where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+⟶ với (w,b)∈Rn×R là nghiệm của bài toán tối ưu hoá sau đây:
+
+
+
+**44. such that**
+
+⟶ sao cho
+
+
+
+**45. support vectors**
+
+⟶ vector hỗ trợ
+
+
+
+**46. Remark: the line is defined as wTx−b=0.**
+
+⟶ Chú ý: đường thẳng có phương trình là wTx−b=0.
+
+
+
+**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+⟶ Mất mát Hinge - Mất mát Hinge được sử dụng trong thiết lập của SVMs và nó được định nghĩa như sau:
+
+
+
+**48. Kernel ― Given a feature mapping ϕ, we define the kernel K as:**
+
+⟶ Kernel (nhân) - Cho trước feature mapping ϕ, chúng ta định nghĩa kernel K như sau:
+
+
+
+**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||^2/(2σ^2)) is called the Gaussian kernel and is commonly used.**
+
+⟶ Trong thực tế, kernel K được định nghĩa bởi K(x,z)=exp(−||x−z||^2/(2σ^2)) được gọi là Gaussian kernel và thường được sử dụng.
+
+
+
+**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+⟶ [Phân tách phi tuyến, Việc sử dụng một kernel mapping, Biên quyết định trong không gian gốc]
+
+
+
+**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+⟶ Chú ý: chúng ta nói rằng chúng ta sử dụng "kernel trick" để tính toán cost function sử dụng kernel bởi vì chúng ta thực sự không cần biết đến ánh xạ tường minh ϕ, nó thường khá phức tạp. Thay vào đó, chỉ cần biết giá trị K(x,z).
+
+
+
+**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+⟶ Lagrangian - Chúng ta định nghĩa Lagrangian L(w,b) như sau:
+
+
+
+**53. Remark: the coefficients βi are called the Lagrange multipliers.**
+
+⟶ Chú ý: các hệ số βi được gọi là các nhân tử Lagrange.
+
+
+
+**54. Generative Learning**
+
+⟶ Generative Learning
+
+
+
+**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+⟶ Một mô hình sinh đầu tiên cố gắng học cách dữ liệu được sinh ra thông qua việc ước lượng P(x|y), sau đó chúng ta có thể sử dụng P(x|y) để ước lượng P(y|x) bằng cách sử dụng luật Bayes.
+
+
+
+**56. Gaussian Discriminant Analysis**
+
+⟶ Gaussian Discriminant Analysis
+
+
+
+**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+⟶ Thiết lập - Gaussian Discriminant Analysis giả sử rằng y và x|y=0 và x|y=1 là như sau:
+
+
+
+**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+⟶ Sự ước lượng - Bảng sau đây tổng kết các ước lượng mà chúng ta tìm thấy khi tối đa hoá likelihood:
+
+
+
+**59. Naive Bayes**
+
+⟶ Naive Bayes
+
+
+
+**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+⟶ Giả thiết - Mô hình Naive Bayes giả sử rằng các features của các điểm dữ liệu đều độc lập với nhau:
+
+
+
+**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+⟶ Giải pháp - Tối đa hoá log-likelihood đưa ra những lời giải sau đây, với k∈{0,1},l∈[[1,L]]
+
+
+
+**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+⟶ Chú ý: Naive Bayes được sử dụng rộng rãi cho bài toán phân loại văn bản và phát hiện spam.
+
+
+
+**63. Tree-based and ensemble methods**
+
+⟶ Các phương thức Tree-based và ensemble
+
+
+
+**64. These methods can be used for both regression and classification problems.**
+
+⟶ Các phương thức này có thể được sử dụng cho cả bài toán hồi quy lẫn bài toán phân loại.
+
+
+
+**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.**
+
+⟶ CART - Cây phân loại và hồi quy (CART), thường được biết đến là cây quyết định, có thể được biểu diễn dưới dạng cây nhị phân. Chúng có ưu điểm là rất dễ diễn giải.
+
+
+
+**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+⟶ Rừng ngẫu nhiên - Là một kĩ thuật dựa trên cây (tree-based), sử dụng số lượng lớn các cây quyết định được xây dựng từ các tập thuộc tính được chọn ngẫu nhiên. Ngược lại với một cây quyết định đơn, kĩ thuật này rất khó diễn giải nhưng do thường có hiệu năng tốt nên đã trở thành một giải thuật phổ biến.
+
+
+
+**67. Remark: random forests are a type of ensemble method.**
+
+⟶ Chú ý: rừng ngẫu nhiên là một loại giải thuật ensemble.
+
+
+
+**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+⟶ Boosting - Ý tưởng của các phương thức boosting là kết hợp các phương pháp học yếu hơn để tạo nên phương pháp học mạnh hơn. Những phương thức chính được tổng kết ở bảng dưới đây:
+
+
+
+**69. [Adaptive boosting, Gradient boosting]**
+
+⟶ [Adaptive boosting, Gradient boosting]
+
+
+
+**70. High weights are put on errors to improve at the next boosting step**
+
+⟶ Các trọng số có giá trị lớn được đặt vào các phần lỗi để cải thiện ở bước boosting tiếp theo
+
+
+
+**71. Weak learners trained on remaining errors**
+
+⟶ Các phương pháp học yếu huấn luyện trên các phần lỗi còn lại
+
+
+
+**72. Other non-parametric approaches**
+
+⟶ Các cách tiếp cận phi-tham số khác
+
+
+
+**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶ k-nearest neighbors - Giải thuật k-nearest neighbors, thường được biết đến là k-NN, là cách tiếp cận phi-tham số, ở phương pháp này phân lớp của một điểm dữ liệu được định nghĩa bởi k điểm dữ liệu gần nó nhất trong tập huấn luyện. Phương pháp này có thể được sử dụng trong quá trình thiết lập cho bài toán phân loại cũng như bài toán hồi quy.
+
+
+
+**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶ Chú ý: Tham số k cao hơn, độ chệch (bias) cao hơn, tham số k thấp hơn, phương sai cao hơn
+
+
+
+**75. Learning Theory**
+
+⟶ Lý thuyết học
+
+
+
+**76. Union bound ― Let A1,...,Ak be k events. We have:**
+
+⟶ Union bound - Cho k sự kiện là A1,...,Ak. Ta có:
+
+
+
+**77. Hoeffding inequality ― Let Z1,...,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+⟶ Bất đẳng thức Hoeffding - Cho Z1,...,Zm là m biến iid được lấy từ phân phối Bernoulli với tham số ϕ. Cho ˆϕ là trung bình mẫu của chúng và γ>0 cố định. Ta có:
+
+
+
+**78. Remark: this inequality is also known as the Chernoff bound.**
+
+⟶ Chú ý: bất đẳng thức này còn được biết đến như là chặn Chernoff (Chernoff bound).
+
+
+
+**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+⟶ Lỗi huấn luyện (Training error) - Cho trước classifier h, ta định nghĩa training error ˆϵ(h), còn được biết đến là empirical risk hoặc empirical error, như sau:
+
+
+
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
+
+⟶ Probably Approximately Correct (PAC) - PAC là một framework với nhiều kết quả về lí thuyết học đã được chứng minh, và có tập hợp các giả thiết như sau:
+
+
+
+**81. the training and testing sets follow the same distribution**
+
+⟶ tập huấn luyện và test có cùng phân phối
+
+
+
+**82. the training examples are drawn independently**
+
+⟶ các ví dụ huấn luyện được tạo ra độc lập
+
+
+
+**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+⟶ Shattering (Chia nhỏ) - Cho một tập hợp S={x(1),...,x(d)}, và một tập hợp các classifiers H, ta nói rằng H chia nhỏ S nếu với bất kì tập các nhãn {y(1),...,y(d)} nào, ta có:
+
+
+
+**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+⟶ Định lí giới hạn trên - Cho H là một finite hypothesis class mà |H|=k, với δ và kích cỡ mẫu m là cố định. Khi đó, với xác suất ít nhất là 1−δ, ta có:
+
+
+
+**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+⟶ VC dimension - Vapnik-Chervonenkis (VC) dimension của class infinite hypothesis H cho trước, kí hiệu là VC(H) là kích thước của tập lớn nhất được chia nhỏ bởi H.
+
+
+
+**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+⟶ Chú ý: VC dimension của H={tập hợp các linear classifiers trong 2 chiều} là 3.
+
+
+
+**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+⟶ Định lí (Vapnik) - Cho H với VC(H)=d và m là số lượng các ví dụ huấn luyện. Với xác suất ít nhất là 1−δ, ta có:
+
+
+
+**88. [Introduction, Type of prediction, Type of model]**
+
+⟶ [Giới thiệu, Loại dự đoán, Loại mô hình]
+
+
+
+**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
+
+⟶ [Các kí hiệu và các khái niệm tổng quát, hàm mất mát, gradient descent, likelihood]
+
+
+
+**90. [Linear models, linear regression, logistic regression, generalized linear models]**
+
+⟶ [Các mô hình tuyến tính, hồi quy tuyến tính, hồi quy logistic, các mô hình tuyến tính tổng quát]
+
+
+
+**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
+
+⟶ [Máy vector hỗ trợ, Optimal margin classifier, Mất mát Hinge, Kernel]
+
+
+
+**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
+
+⟶ [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]
+
+
+
+**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
+
+⟶ [Cây và các phương pháp ensemble, CART, Rừng ngẫu nhiên, Boosting]
+
+
+
+**94. [Other methods, k-NN]**
+
+⟶ [Các phương thức khác, k-NN]
+
+
+
+**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
+
+⟶ [Lí thuyết học, Bất đẳng thức Hoeffding, PAC, VC dimension]
diff --git a/vi/cs-229-unsupervised-learning.md b/vi/cs-229-unsupervised-learning.md
new file mode 100644
index 000000000..f0ded11f1
--- /dev/null
+++ b/vi/cs-229-unsupervised-learning.md
@@ -0,0 +1,344 @@
+**Unsupervised Learning translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning)
+
+
+
+**1. Unsupervised Learning cheatsheet**
+
+⟶ Cheatsheet học không giám sát
+
+
+
+**2. Introduction to Unsupervised Learning**
+
+⟶ Giới thiệu về học không giám sát
+
+
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+⟶ Động lực ― Mục tiêu của học không giám sát là tìm được quy luật ẩn (hidden pattern) trong tập dữ liệu không được gán nhãn {x(1),...,x(m)}.
+
+
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+⟶ Bất đẳng thức Jensen - Cho f là một hàm lồi và X là một biến ngẫu nhiên. Chúng ta có bất đẳng thức sau:
+
+
+
+**5. Clustering**
+
+⟶ Phân cụm
+
+
+
+**6. Expectation-Maximization**
+
+⟶ Tối đa hoá kì vọng
+
+
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+⟶ Các biến Latent - Các biến Latent là các biến ẩn/ không quan sát được khiến cho việc ước lượng trở nên khó khăn, và thường được kí hiệu là z. Đây là các thiết lập phổ biến nhất có xuất hiện biến latent:
+
+
+
+**8. [Setting, Latent variable z, Comments]**
+
+⟶ [Thiết lập, Biến Latent z, Các bình luận]
+
+
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+⟶ [Hỗn hợp của k Gaussians, Phân tích nhân tố]
+
+
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶ Thuật toán - Thuật toán tối đa hoá kì vọng (EM) mang lại một phương thức có hiệu quả trong việc ước lượng tham số θ thông qua tối đa hoá giá trị ước lượng likelihood bằng cách lặp lại việc tạo nên một cận dưới cho likelihood (E-step) và tối ưu hoá cận dưới (M-step) như sau:
+
+
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+⟶ E-step: Đánh giá xác suất hậu nghiệm Qi(z(i)) cho mỗi điểm dữ liệu x(i) đến từ một cụm z(i) cụ thể như sau:
+
+
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+⟶ M-step: Sử dụng xác suất hậu nghiệm Qi(z(i)) như các trọng số cụ thể của cụm trên các điểm dữ liệu x(i) để ước lượng lại một cách riêng biệt cho mỗi mô hình cụm như sau:
+
+
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+⟶ [Khởi tạo Gaussians, Bước kì vọng, Bước tối đa hoá, Hội tụ]
+
+
+
+**14. k-means clustering**
+
+⟶ Phân cụm k-means
+
+
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+⟶ Chúng ta kí hiệu c(i) là cụm của điểm dữ liệu i và μj là điểm trung tâm của cụm j.
+
+
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+⟶ Thuật toán - Sau khi khởi tạo ngẫu nhiên các tâm cụm (centroids) μ1,μ2,...,μk∈Rn, thuật toán k-means lặp lại bước sau cho đến khi hội tụ:
+
+
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+⟶ [Khởi tạo giá trị trung bình, Gán cụm, Cập nhật giá trị trung bình, Hội tụ]
+
+
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+⟶ Hàm Distortion - Để nhận biết khi nào thuật toán hội tụ, chúng ta sẽ xem xét hàm distortion được định nghĩa như sau:
+
+
+
+**19. Hierarchical clustering**
+
+⟶ Phân cụm phân cấp
+
+
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
+
+⟶ Thuật toán - Là một thuật toán phân cụm với cách tiếp cận phân cấp kết tập, cách tiếp cận này sẽ xây dựng các cụm lồng nhau theo một quy tắc nối tiếp.
+
+
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
+
+⟶ Các loại - Các loại thuật toán hierarchical clustering khác nhau với mục tiêu là tối ưu hoá các hàm mục tiêu khác nhau sẽ được tổng kết trong bảng dưới đây:
+
+
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+⟶ [Liên kết Ward, Liên kết trung bình, Liên kết hoàn chỉnh]
+
+
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
+
+⟶ [Tối thiểu hoá trong phạm vi khoảng cách của một cụm, Tối thiểu hoá khoảng cách trung bình giữa các cặp cụm, Tối thiểu hoá khoảng cách tối đa giữa các cặp cụm]
+
+
+
+**24. Clustering assessment metrics**
+
+⟶ Các số liệu đánh giá phân cụm
+
+
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+⟶ Trong quá trình thiết lập học không giám sát, sẽ khá khó khăn để đánh giá hiệu năng của một mô hình vì chúng ta không có các nhãn đủ tin cậy như trong trường hợp của học có giám sát.
+
+
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+⟶ Hệ số Silhouette - Bằng việc kí hiệu a và b là khoảng cách trung bình giữa một điểm mẫu với các điểm khác trong cùng một lớp, và giữa một điểm mẫu với các điểm khác thuộc cụm kế cận gần nhất, hệ số silhouette s đối với một điểm mẫu đơn được định nghĩa như sau:
+
+
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+⟶ Chỉ số Calinski-Harabaz - Bằng việc kí hiệu k là số cụm, các chỉ số Bk và Wk về độ phân tán giữa và trong một cụm lần lượt được định nghĩa như là
+
+
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+⟶ Chỉ số Calinski-Harabaz s(k) cho biết một mô hình phân cụm xác định các cụm của nó tốt đến đâu: score càng cao thì các cụm càng dày đặc và càng tách biệt rõ ràng. Nó được định nghĩa như sau:
+
+
+
+**29. Dimension reduction**
+
+⟶ Giảm số chiều dữ liệu
+
+
+
+**30. Principal component analysis**
+
+⟶ Principal component analysis
+
+
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+⟶ Là một kĩ thuật giảm số chiều dữ liệu, kĩ thuật này sẽ tìm các hướng tối đa hoá phương sai để chiếu dữ liệu lên trên đó.
+
+
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+⟶ Giá trị riêng, vector riêng - Cho ma trận A∈Rn×n, λ là giá trị riêng của A nếu tồn tại một vector z∈Rn∖{0}, gọi là vector riêng, như vậy ta có:
+
+
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶ Định lý Spectral - Với A∈Rn×n. Nếu A đối xứng thì A có thể chéo hoá bởi một ma trận trực giao U∈Rn×n. Bằng việc kí hiệu Λ=diag(λ1,...,λn), ta có:
+
+
+
+**34. diagonal**
+
+⟶ đường chéo
+
+
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.**
+
+⟶ Chú thích: vector riêng tương ứng với giá trị riêng lớn nhất được gọi là vector riêng chính của ma trận A.
+
+
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
+dimensions by maximizing the variance of the data as follows:**
+
+⟶ Thuật toán - Principal Component Analysis (PCA) là một kĩ thuật giảm số chiều dữ liệu, nó sẽ chiếu dữ liệu lên k chiều bằng cách tối đa hoá phương sai của dữ liệu như sau:
+
+
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+⟶ Bước 1: Chuẩn hoá dữ liệu để có giá trị trung bình bằng 0 và độ lệch chuẩn bằng 1.
+
+
+
+**38. Step 2: Compute $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^{T}\in\mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.**
+
+⟶ Bước 2: Tính $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^{T}\in\mathbb{R}^{n\times n}$, là ma trận đối xứng với các giá trị riêng thực.
+
+
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+⟶ Bước 3: Tính u1,...,uk∈Rn là k vector riêng chính trực giao của Σ, tức các vector riêng trực giao ứng với k giá trị riêng lớn nhất.
+
+
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+⟶ Bước 4: Chiếu dữ liệu lên spanR(u1,...,uk).
+
+
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+⟶ Thủ tục này tối đa hoá phương sai trong số tất cả các không gian k-chiều.
+
+
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+⟶ [Dữ liệu trong không gian đặc trưng, Tìm các thành phần chính, Dữ liệu trong không gian các thành phần chính]
+
+
+
+**43. Independent component analysis**
+
+⟶ Independent component analysis
+
+
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+⟶ Là một kĩ thuật tìm các nguồn tạo cơ bản.
+
+
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+⟶ Giả định - Chúng ta giả sử rằng dữ liệu x được tạo ra bởi vector nguồn n-chiều s=(s1,...,sn), với si là các biến ngẫu nhiên độc lập, thông qua một ma trận mixing và non-singular A như sau:
+
+
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+⟶ Mục tiêu là tìm ma trận unmixing W=A−1.
+
+
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+⟶ Giải thuật Bell và Sejnowski ICA - Giải thuật này tìm ma trận unmixing W bằng các bước dưới đây:
+
+
+
+**48. Write the probability of x=As=W−1s as:**
+
+⟶ Ghi xác suất của x=As=W−1s như là:
+
+
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+⟶ Ghi log likelihood cho dữ liệu huấn luyện {x(i),i∈[[1,m]]} và kí hiệu g là hàm sigmoid như sau:
+
+
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+⟶ Vì thế, quy tắc học của stochastic gradient ascent là với mỗi ví dụ huấn luyện x(i), chúng ta sẽ cập nhật W như sau:
+
+
+
+**51. The Machine Learning cheatsheets are now available in [target language].**
+
+⟶ Machine Learning cheatsheets hiện đã có bản [tiếng Việt].
+
+
+
+**52. Original authors**
+
+⟶ Các tác giả
+
+
+
+**53. Translated by X, Y and Z**
+
+⟶ Được dịch bởi X, Y và Z
+
+
+
+**54. Reviewed by X, Y and Z**
+
+⟶ Được review bởi X, Y và Z
+
+
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+⟶ [Giới thiệu, Động lực, Bất đẳng thức Jensen]
+
+
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+⟶ [Phân cụm, Tối đa hoá kì vọng, k-means, Hierarchical clustering, Các chỉ số]
+
+
+
+**57. [Dimension reduction, PCA, ICA]**
+
+⟶ [Giảm số chiều dữ liệu, PCA, ICA]
diff --git a/vi/cs-230-convolutional-neural-networks.md b/vi/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..fae937a63
--- /dev/null
+++ b/vi/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,717 @@
+**Convolutional Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶ Convolutional Neural Networks cheatsheet
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Deep Learning
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶ [Tổng quan, Kết cấu kiến trúc]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶ [Các kiểu tầng (layer), Tích chập, Pooling, Kết nối đầy đủ]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶ [Các siêu tham số của bộ lọc, Các chiều, Stride, Padding]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶ [Điều chỉnh các siêu tham số, Độ tương thích tham số, Độ phức tạp mô hình, Receptive field]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶ [Các hàm kích hoạt, Rectified Linear Unit, Softmax]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶ [Phát hiện vật thể, Các kiểu mô hình, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶ [Xác nhận/nhận diện khuôn mặt, One shot learning, Siamese network, Triplet loss]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶ [Neural style transfer, Activation, Style matrix, Style/content cost function]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶ [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]
+
+
+
+
+**12. Overview**
+
+⟶ Tổng quan
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶ Kiến trúc truyền thống của một mạng CNN ― Mạng neural tích chập (Convolutional neural networks), còn được biết đến với tên CNNs, là một dạng mạng neural được cấu thành bởi các tầng sau:
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶ Tầng tích chập và tầng pooling có thể được hiệu chỉnh theo các siêu tham số (hyperparameters) được mô tả ở những phần tiếp theo.
+
+
+
+
+**15. Types of layer**
+
+⟶ Các kiểu tầng
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶ Tầng tích chập (CONV) ― Tầng tích chập (CONV) sử dụng các bộ lọc để thực hiện phép tích chập khi đưa chúng đi qua đầu vào I theo các chiều của nó. Các siêu tham số của các bộ lọc này bao gồm kích thước bộ lọc F và độ trượt (stride) S. Kết quả đầu ra O được gọi là feature map hay activation map.
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶ Lưu ý: Bước tích chập cũng có thể được khái quát hóa cả với trường hợp một chiều (1D) và ba chiều (3D).
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶ Pooling (POOL) ― Tầng pooling (POOL) là một phép downsampling, thường được sử dụng sau tầng tích chập, giúp tăng tính bất biến không gian. Cụ thể, max pooling và average pooling là những dạng pooling đặc biệt, mà tương ứng là trong đó giá trị lớn nhất và giá trị trung bình được lấy ra.
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶ [Kiểu, Chức năng, Minh họa, Nhận xét]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶ [Max pooling, Average pooling, Từng phép pooling chọn giá trị lớn nhất trong khu vực mà nó đang được áp dụng, Từng phép pooling tính trung bình các giá trị trong khu vực mà nó đang được áp dụng]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶ [Bảo toàn các đặc trưng đã phát hiện, Được sử dụng thường xuyên, Giảm kích thước feature map, Được sử dụng trong mạng LeNet]
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶ Fully Connected (FC) ― Tầng kết nối đầy đủ (FC) nhận đầu vào là các dữ liệu đã được làm phẳng, trong đó mỗi đầu vào được kết nối đến tất cả neuron. Nếu có, các tầng FC thường nằm ở cuối kiến trúc CNN và có thể được dùng để tối ưu hóa các mục tiêu như điểm số của các lớp (class scores).
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶ Các siêu tham số của bộ lọc
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind their hyperparameters.**
+
+⟶ Tầng tích chập chứa các bộ lọc, và điều quan trọng là ta cần hiểu ý nghĩa đằng sau các siêu tham số của chúng.
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶ Các chiều của một bộ lọc ― Một bộ lọc kích thước F×F áp dụng lên đầu vào chứa C kênh (channels) là một khối kích thước F×F×C, thực hiện phép tích chập trên đầu vào kích thước I×I×C và cho ra một feature map (hay còn gọi là activation map) có kích thước O×O×1.
+
+
+
+
+**26. Filter**
+
+⟶ Bộ lọc
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶ Lưu ý: Việc áp dụng K bộ lọc có kích thước F×F cho ra một feature map có kích thước O×O×K.
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶ Stride ― Đối với phép tích chập hoặc phép pooling, độ trượt S ký hiệu số pixel mà cửa sổ sẽ di chuyển sau mỗi lần thực hiện phép tính.
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶ Zero-padding ― Zero-padding là tên gọi của quá trình thêm P số không vào mỗi phía của biên đầu vào. Giá trị này có thể được lựa chọn thủ công hoặc được thiết lập tự động bằng một trong ba phương pháp mô tả bên dưới:
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶ [Phương pháp, Giá trị, Minh họa, Mục đích, Valid, Same, Full]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶ [Không sử dụng padding, Bỏ phép tích chập cuối nếu số chiều không khớp, Sử dụng padding để làm cho feature map có kích thước ⌈I/S⌉, Kích thước đầu ra thuận lợi về mặt toán học, Còn được gọi là 'half' padding, Padding tối đa sao cho các phép tích chập có thể được sử dụng tại các rìa của đầu vào, Bộ lọc 'thấy' được đầu vào từ đầu đến cuối]
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶ Điều chỉnh siêu tham số
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶ Tính tương thích của tham số trong tầng tích chập ― Bằng cách ký hiệu I là độ dài kích thước đầu vào, F là độ dài của bộ lọc, P là số lượng zero padding, S là độ trượt, ta có thể tính được độ dài O của feature map theo một chiều bằng công thức:
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶ [Đầu vào, Bộ lọc, Đầu ra]
+
+
+
+
+**35. Remark: oftentimes, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶ Lưu ý: Thông thường, Pstart=Pend≜P, khi đó ta có thể thay thế Pstart+Pend bằng 2P trong công thức trên.
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶ Hiểu về độ phức tạp của mô hình ― Để đánh giá độ phức tạp của một mô hình, cách hữu hiệu là xác định số tham số mà mô hình đó sẽ có. Trong một tầng của mạng neural tích chập, nó sẽ được tính toán như sau:
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶ [Minh họa, Kích thước đầu vào, Kích thước đầu ra, Số lượng tham số, Lưu ý]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶ [Phép pooling được áp dụng lên từng kênh (channel-wise), Trong đa số trường hợp, S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶ [Đầu vào được làm phẳng, Mỗi neuron có một tham số bias, Số neuron trong một tầng FC không phụ thuộc vào ràng buộc kết cấu]
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶ Trường thụ cảm (Receptive field) ― Trường thụ cảm tại tầng k là vùng được ký hiệu Rk×Rk của đầu vào mà những pixel của activation map thứ k có thể "nhìn thấy". Bằng cách gọi Fj là kích thước bộ lọc của tầng j và Si là giá trị độ trượt của tầng i và để thuận tiện, ta mặc định S0=1, trường thụ cảm của tầng k được tính toán bằng công thức:
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶ Trong ví dụ bên dưới, ta có F1=F2=3 và S1=S2=1, nên cho ra được R2=1+2⋅1+2⋅1=5.
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶ Các hàm kích hoạt thường gặp
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶ Rectified Linear Unit ― Tầng rectified linear unit (ReLU) là một hàm kích hoạt g được sử dụng trên tất cả các thành phần. Mục đích của nó là tăng tính phi tuyến tính cho mạng. Những biến thể khác của ReLU được tổng hợp ở bảng dưới:
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶ [ReLU, Leaky ReLU, ELU, với]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶ [Độ phức tạp phi tuyến tính có thể diễn giải được về mặt sinh học, Giải quyết vấn đề ReLU chết (dying ReLU) đối với những giá trị âm, Khả vi tại mọi nơi]
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶ Softmax ― Bước softmax có thể được coi là một hàm logistic tổng quát lấy đầu vào là một vector chứa các giá trị x∈Rn và cho ra là một vector gồm các xác suất p∈Rn thông qua một hàm softmax ở cuối kiến trúc. Nó được định nghĩa như sau:
+
+
+
+
+**48. where**
+
+⟶ với
+
+
+
+
+**49. Object detection**
+
+⟶ Phát hiện vật thể (Object detection)
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶ Các kiểu mô hình ― Có 3 kiểu thuật toán nhận diện vật thể chính, trong đó bản chất của thứ được dự đoán là khác nhau. Chúng được miêu tả ở bảng dưới:
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶ [Phân loại hình ảnh, Phân loại cùng với khoanh vùng, Phát hiện]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶ [Gấu bông, Sách]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶ [Phân loại một tấm ảnh, Dự đoán xác suất của một vật thể, Phát hiện một vật thể trong ảnh, Dự đoán xác suất của vật thể và định vị nó, Phát hiện nhiều vật thể trong cùng một tấm ảnh, Dự đoán xác suất của các vật thể và định vị chúng]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶ [CNN cổ điển, YOLO đơn giản hóa, R-CNN, YOLO, R-CNN]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶ Detection ― Trong bối cảnh phát hiện vật thể, những phương pháp khác nhau được áp dụng tùy thuộc vào liệu chúng ta chỉ muốn định vị vật thể hay phát hiện được những hình dạng phức tạp hơn trong tấm ảnh. Hai phương pháp chính được tổng hợp ở bảng dưới:
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶ [Phát hiện hộp giới hạn (bounding box), Phát hiện landmark]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶ [Phát hiện phần trong ảnh mà có sự xuất hiện của vật thể, Phát hiện hình dạng hoặc đặc điểm của một đối tượng (vd: mắt), Chi tiết hơn]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶ [Hộp có tọa độ trung tâm (bx,by), chiều cao bh và chiều rộng bw, Các điểm tham chiếu (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶ Intersection over Union ― Tỉ lệ vùng giao trên vùng hợp, còn được biết đến là IoU, là một hàm định lượng mức độ chính xác về vị trí của hộp giới hạn dự đoán Bp so với hộp giới hạn thực tế Ba. Nó được định nghĩa như sau:
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶ Lưu ý: ta luôn có IoU∈[0,1]. Theo quy ước, một hộp giới hạn dự đoán Bp được cho là khá tốt nếu IoU(Bp,Ba)⩾0.5.
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶ Anchor boxes ― Hộp mỏ neo là một kỹ thuật được dùng để dự đoán những hộp giới hạn nằm chồng lên nhau. Trong thực nghiệm, mạng được phép dự đoán nhiều hơn một hộp cùng một lúc, trong đó mỗi dự đoán được giới hạn theo một tập những tính chất hình học cho trước. Ví dụ, dự đoán đầu tiên có khả năng là một hộp hình chữ nhật có hình dạng cho trước, trong khi dự đoán thứ hai sẽ là một hộp hình chữ nhật nữa với hình dạng hình học khác.
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of the same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶ Non-max suppression ― Kỹ thuật non-max suppression hướng tới việc loại bỏ những hộp giới hạn bị trùng chồng lên nhau của cùng một đối tượng bằng cách chọn những hộp có tính đặc trưng nhất. Sau khi loại bỏ tất cả các hộp có xác suất dự đoán nhỏ hơn 0.6, những bước sau được lặp lại chừng nào vẫn còn hộp:
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶ [Với một lớp cho trước, Bước 1: Chọn chiếc hộp có xác suất dự đoán lớn nhất., Bước 2: Loại bỏ những hộp có IoU⩾0.5 với hộp đã chọn.]
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶ [Các dự đoán hộp, Chọn hộp với xác suất cao nhất, Loại bỏ trùng lặp trong cùng một lớp, Các hộp giới hạn cuối cùng]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶ YOLO ― You Only Look Once (YOLO) là một thuật toán phát hiện vật thể thực hiện những bước sau:
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶ [Bước 1: Phân chia tấm ảnh đầu vào thành một lưới G×G., Bước 2: Với mỗi ô lưới, chạy một mạng CNN dự đoán y có dạng sau:, lặp lại k lần]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶ với pc là xác suất dự đoán được một vật thể, bx,by,bh,bw là những thuộc tính của hộp giới hạn được dự đoán, c1,...,cp là biểu diễn one-hot của việc lớp nào trong p các lớp được dự đoán, và k là số lượng các hộp mỏ neo.
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶ Bước 3: Chạy thuật toán non-max suppression để loại bỏ bất kỳ hộp giới hạn có khả năng bị trùng lặp.
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶ [Ảnh gốc, Phân chia thành lưới GxG, Dự đoán hộp giới hạn, Non-max suppression]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶ Lưu ý: khi pc=0, thì mạng không phát hiện bất kỳ vật thể nào. Trong trường hợp đó, các dự đoán liên quan bx,...,cp phải được bỏ qua.
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potentially relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+⟶ R-CNN ― Region with Convolutional Neural Networks (R-CNN) là một thuật toán phát hiện vật thể mà đầu tiên phân chia ảnh thành các vùng để tìm các hộp giới hạn có khả năng liên quan cao rồi chạy một thuật toán phát hiện để tìm những thứ có khả năng cao là vật thể trong những hộp giới hạn đó.
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶ [Ảnh gốc, Phân vùng, Dự đoán hộp giới hạn, Non-max suppression]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶ Lưu ý: mặc dù thuật toán gốc có chi phí tính toán cao và chậm, những kiến trúc mới đã có thể cho phép thuật toán này chạy nhanh hơn, như là Fast R-CNN và Faster R-CNN.
+
+
+
+
+**74. Face verification and recognition**
+
+⟶ Xác nhận khuôn mặt và nhận diện khuôn mặt
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in the table below:**
+
+⟶ Các kiểu mô hình ― Hai kiểu mô hình chính được tổng hợp trong bảng dưới:
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶ [Xác nhận khuôn mặt, Nhận diện khuôn mặt, Truy vấn, Tham chiếu, Cơ sở dữ liệu]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶ [Có đúng người không?, Tra cứu một-một, Đây có phải là 1 trong K người trong cơ sở dữ liệu không?, Tra cứu một-nhiều]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶ One Shot Learning ― One Shot Learning là một thuật toán xác minh khuôn mặt sử dụng một tập huấn luyện hạn chế để học một hàm similarity nhằm ước lượng sự khác nhau giữa hai tấm hình. Hàm này được áp dụng cho hai tấm ảnh thường được ký hiệu d(image 1,image 2).
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶ Siamese Network ― Siamese Networks hướng tới việc học cách mã hóa tấm ảnh để rồi định lượng sự khác nhau giữa hai tấm ảnh. Với một tấm ảnh đầu vào x(i), đầu ra được mã hóa thường được ký hiệu là f(x(i)).
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to the same class, while the negative example belongs to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶ Triplet loss ― Triplet loss ℓ là một hàm mất mát được tính toán dựa trên biểu diễn nhúng của bộ ba hình ảnh A (mỏ neo), P (dương tính) và N (âm tính). Ảnh mỏ neo và ảnh dương tính đều thuộc một lớp, trong khi đó ảnh âm tính thuộc về một lớp khác. Bằng cách gọi α∈R+ là tham số margin, hàm mất mát này được định nghĩa như sau:
+
+
+
+
+**81. Neural style transfer**
+
+⟶ Neural style transfer
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶ Ý tưởng ― Mục tiêu của neural style transfer là tạo ra một ảnh G dựa trên một nội dung C và một phong cách S.
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶ [Nội dung C, Phong cách S, Ảnh tạo được G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶ Tầng kích hoạt ― Trong một tầng l cho trước, tầng kích hoạt được ký hiệu a[l] và có các chiều là nH×nw×nc
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶ Hàm mất mát nội dung ― Hàm mất mát nội dung Jcontent(C,G) được sử dụng để xác định mức độ khác biệt giữa ảnh được tạo G và ảnh nội dung gốc C. Nó được định nghĩa như dưới đây:
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶ Ma trận phong cách ― Ma trận phong cách G[l] của một tầng cho trước l là một ma trận Gram mà mỗi thành phần G[l]kk′ của ma trận xác định sự tương quan giữa kênh k và kênh k'. Nó được định nghĩa theo tầng kích hoạt a[l] như sau:
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶ Lưu ý: ma trận phong cách cho ảnh phong cách và ảnh được tạo được ký hiệu tương ứng là G[l] (S) và G[l] (G).
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶ Hàm mất mát phong cách ― Hàm mất mát phong cách Jstyle(S,G) được sử dụng để xác định sự khác biệt về phong cách giữa ảnh được tạo G và ảnh phong cách S. Nó được định nghĩa như sau:
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶ Hàm mất mát tổng quát ― Hàm mất mát tổng quát được định nghĩa là sự kết hợp của hàm mất mát nội dung và hàm mất mát phong cách, độ quan trọng của chúng được xác định bởi hai tham số α,β, như dưới đây:
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶ Lưu ý: giá trị của α càng lớn dẫn tới việc mô hình sẽ quan tâm hơn cho nội dung, trong khi đó, giá trị của β càng lớn sẽ khiến nó quan tâm hơn đến phong cách.
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶ Những kiến trúc sử dụng computational tricks
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative model, which aims at differentiating the generated image from the true one.**
+
+
+⟶ Generative Adversarial Network ― Generative adversarial networks, hay còn được gọi là GAN, là sự kết hợp giữa mô hình khởi tạo và mô hình phân biệt, khi mà mô hình khởi tạo cố gắng tạo ra hình ảnh đầu ra chân thực nhất, sau đó được đưa vô mô hình phân biệt, mà mục tiêu của nó là phân biệt giữa ảnh được tạo và ảnh thật.
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶ [Huấn luyện, Nhiễu, Ảnh thật, Mô hình khởi tạo, Mô hình phân biệt, Thật Giả]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶ Lưu ý: các ứng dụng sử dụng những biến thể của GAN bao gồm chuyển văn bản thành ảnh, sinh nhạc và tổng hợp.
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ ResNet ― Kiến trúc Residual Network (hay còn gọi là ResNet) sử dụng những khối residual (residual blocks) cùng với một lượng lớn các tầng để giảm lỗi huấn luyện. Khối residual có phương trình đặc trưng như sau:
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶ Inception Network ― Kiến trúc này sử dụng những inception module và hướng tới việc thử các tầng tích chập khác nhau để tăng hiệu suất thông qua sự đa dạng của các feature. Cụ thể, kiến trúc này sử dụng thủ thuật tầng tích chập 1×1 để hạn chế gánh nặng tính toán.
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Những cheatsheet về Deep Learning nay đã được dịch sang [target language].
+
+
+
+
+**98. Original authors**
+
+⟶ Các tác giả
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶ Được dịch bởi X, Y và Z
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶ Xem qua bởi X, Y và Z
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶ Xem bản PDF trên Github
+
+
+
+
+**102. By X and Y**
+
+⟶ Bởi X và Y
+
+
diff --git a/vi/cs-230-deep-learning-tips-and-tricks.md b/vi/cs-230-deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..88a821daa
--- /dev/null
+++ b/vi/cs-230-deep-learning-tips-and-tricks.md
@@ -0,0 +1,456 @@
+**Deep Learning Tips and Tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks)
+
+
+
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+⟶ Cheatsheet về một số thủ thuật trong Deep Learning
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Deep Learning
+
+
+
+
+**3. Tips and tricks**
+
+⟶ Mẹo và thủ thuật
+
+
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+⟶ [Xử lí dữ liệu, Data augmentation, Batch normalization]
+
+
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+⟶ [Huấn luyện mạng neural, Epoch, Mini-batch, Cross-entropy loss, Lan truyền ngược, Gradient descent, Cập nhật trọng số, Gradient checking]
+
+
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+⟶ [Parameter tuning, Khởi tạo Xavier, Transfer learning, Tốc độ học, Tốc độ học đáp ứng]
+
+
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+⟶ [Regularization, Dropout, Weight regularization, Kỹ thuật Dừng sớm]
+
+
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+⟶ [Good practices, Overfitting small batch, Gradient checking]
+
+
+
+
+**9. View PDF version on GitHub**
+
+⟶ Xem bản PDF trên GitHub
+
+
+
+
+**10. Data processing**
+
+⟶ Xử lí dữ liệu
+
+
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+⟶ Data augmentation - Các mô hình Deep Learning thường cần rất nhiều dữ liệu để có thể được huấn luyện đúng cách. Việc sử dụng các kỹ thuật Data augmentation là khá hữu ích để có thêm nhiều dữ liệu hơn từ tập dữ liệu hiện thời. Những kĩ thuật chính được tóm tắt trong bảng dưới đây. Chính xác hơn, với hình ảnh đầu vào sau đây, đây là những kỹ thuật mà chúng ta có thể áp dụng:
+
+
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+⟶ [Hình gốc, Lật, Xoay, Cắt ngẫu nhiên]
+
+
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+⟶ [Hình ảnh không có bất kỳ sửa đổi nào, Lật theo một trục mà ý nghĩa của hình ảnh vẫn được giữ nguyên, Xoay với một góc nhỏ, Mô phỏng việc hiệu chỉnh đường chân trời không chính xác, Tập trung ngẫu nhiên vào một phần của hình ảnh, Có thể thực hiện liên tiếp nhiều lần cắt ngẫu nhiên]
+
+
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+⟶ [Dịch chuyển màu, Thêm nhiễu, Mất mát thông tin, Thay đổi độ tương phản]
+
+
+
+
+**15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]**
+
+⟶ [Các sắc thái của RGB bị thay đổi một chút, Nắm bắt nhiễu có thể xảy ra do phơi sáng, Bổ sung nhiễu, Tăng khả năng chịu đựng sự thay đổi chất lượng của đầu vào, Các phần của hình ảnh bị bỏ qua, Mô phỏng khả năng mất một số phần của hình ảnh, Thay đổi độ sáng, Kiểm soát sự khác biệt về phơi sáng theo thời gian trong ngày]
+
+
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+⟶ Ghi chú: dữ liệu thường được tăng cường ngay trong quá trình huấn luyện (on the fly).
+
+
+
+
+**17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
+
+⟶ Chuẩn hóa batch ― Đây là một bước với các siêu tham số (hyperparameter) γ,β nhằm chuẩn hóa batch {xi}. Bằng việc kí hiệu μB,σ2B là trung bình và phương sai của batch mà ta muốn hiệu chỉnh, nó được thực hiện như sau:
+
+
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+⟶ Nó thường được thực hiện sau một tầng fully connected/tích chập và trước một tầng phi tuyến, nhằm cho phép tốc độ học cao hơn và giảm sự phụ thuộc mạnh vào khởi tạo.
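A minimal sketch of the batch normalization step described in items 17 and 18, assuming the usual form x_i ← γ (x_i − μ_B) / sqrt(σ_B² + ε) + β. The function name `batch_norm`, the 1-D list input, and the small constant `eps` are illustrative assumptions, not part of the cheatsheet text.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a 1-D batch of activations, then scale by gamma and shift by beta."""
    m = len(batch)
    mu = sum(batch) / m                          # batch mean (mu_B)
    var = sum((x - mu) ** 2 for x in batch) / m  # batch variance (sigma_B^2)
    # eps is an assumed small constant that avoids division by zero
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With the default `gamma=1.0` and `beta=0.0`, the output has mean close to 0 and variance close to 1; in a real network γ and β would be learned.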
+
+
+
+
+**19. Training a neural network**
+
+⟶ Huấn luyện mạng neural
+
+
+
+
+**20. Definitions**
+
+⟶ Định nghĩa
+
+
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+⟶ Epoch ― Trong ngữ cảnh huấn luyện mô hình, epoch là một thuật ngữ chỉ một vòng lặp mà mô hình sẽ duyệt toàn bộ tập dữ liệu huấn luyện để cập nhật trọng số của nó.
+
+
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+⟶ Mini-batch gradient descent ― Trong quá trình huấn luyện, việc cập nhật trọng số thường không dựa trên toàn bộ tập huấn luyện cùng một lúc do độ phức tạp tính toán, cũng không dựa trên một điểm dữ liệu duy nhất do vấn đề nhiễu. Thay vào đó, bước cập nhật được thực hiện trên các lô nhỏ (mini-batch), trong đó số lượng điểm dữ liệu trong một lô (batch) là một siêu tham số (hyperparameter) mà chúng ta có thể điều chỉnh.
+
+
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+⟶ Hàm mất mát ― Để định lượng hiệu quả hoạt động của một mô hình, hàm mất mát L thường được sử dụng để đánh giá mức độ mà đầu ra thực tế y được dự đoán chính xác bởi đầu ra z của mô hình.
+
+
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+⟶ Cross-entropy loss - Khi áp dụng phân loại nhị phân (binary classification) trong các mạng neural, cross-entropy loss L(z,y) thường được sử dụng và được định nghĩa như sau:
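A short sketch of the binary cross-entropy loss mentioned in item 24, assuming the standard definition L(z, y) = −[y log(z) + (1 − y) log(1 − z)]. The function name and the clamping constant `eps` are assumptions added for numerical safety; they are not part of the original text.

```python
import math

def binary_cross_entropy(z, y, eps=1e-12):
    """Cross-entropy between a predicted probability z and a binary label y."""
    # Clamp z away from 0 and 1 so that log() stays finite (assumed safeguard)
    z = min(max(z, eps), 1.0 - eps)
    return -(y * math.log(z) + (1 - y) * math.log(1 - z))

confident_right = binary_cross_entropy(0.9, 1)  # small loss
confident_wrong = binary_cross_entropy(0.1, 1)  # large loss
```

A confident correct prediction is penalized far less than a confident wrong one, which is the behavior the loss is meant to encode.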
+
+
+
+
+**25. Finding optimal weights**
+
+⟶ Tìm trọng số tối ưu
+
+
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+⟶ Lan truyền ngược (Backpropagation) ― Lan truyền ngược là một phương thức để cập nhật các trọng số trong mạng neural bằng cách xét đến đầu ra thực tế và đầu ra mong muốn. Đạo hàm theo từng trọng số w được tính bằng quy tắc chuỗi (chain rule).
+
+
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+⟶ Sử dụng phương thức này, mỗi trọng số được cập nhật theo quy luật:
+
+
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+⟶ Cập nhật trọng số ― Trong một mạng neural, các trọng số được cập nhật như sau:
+
+
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+⟶ [Bước 1: Lấy một batch dữ liệu huấn luyện và thực hiện lan truyền xuôi (forward propagation) để tính toán mất mát, Bước 2: Lan truyền ngược mất mát để có được độ dốc (gradient) của mất mát theo từng trọng số, Bước 3: Sử dụng độ dốc để cập nhật trọng số của mạng.]
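The three steps in item 29 can be sketched on a deliberately tiny model. This assumes a one-weight linear model y = w·x with a mean squared error loss and the plain update w ← w − α ∂L/∂w; the model, the data, and the name `train_step` are hypothetical and chosen only to keep the example self-contained.

```python
def train_step(w, batch, learning_rate=0.1):
    """One update of a single weight w on a batch of (x, y) pairs."""
    # Step 1: forward propagation, then compute the loss (mean squared error)
    preds = [w * x for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    # Step 2: backpropagate, i.e. apply the chain rule to get dL/dw
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
    # Step 3: use the gradient to update the weight
    return w - learning_rate * grad, loss

w = 0.0
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
for _ in range(50):
    w, loss = train_step(w, batch)
```

After these updates `w` approaches 2, the slope that generated the data. The returned `loss` is the value computed before each update.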
+
+
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+⟶ [Lan truyền xuôi, Lan truyền ngược, Cập nhật trọng số]
+
+
+
+
+**31. Parameter tuning**
+
+⟶ Tinh chỉnh tham số
+
+
+
+
+**32. Weights initialization**
+
+⟶ Khởi tạo trọng số
+
+
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.**
+
+⟶ Khởi tạo Xavier - Thay vì khởi tạo trọng số một cách ngẫu nhiên, khởi tạo Xavier cho chúng ta một cách khởi tạo trọng số dựa trên một đặc tính độc nhất của kiến trúc mô hình.
+
+
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+⟶ Transfer learning ― Huấn luyện một mô hình deep learning đòi hỏi nhiều dữ liệu và quan trọng hơn là rất nhiều thời gian. Sẽ rất hữu ích để tận dụng các trọng số đã được huấn luyện trước trên các bộ dữ liệu rất lớn mất vài ngày / tuần để huấn luyện và tận dụng nó cho trường hợp (use case) của chúng ta. Tùy thuộc vào lượng dữ liệu chúng ta có trong tay, đây là các cách khác nhau để tận dụng điều này:
+
+
+
+
+**35. [Training size, Illustration, Explanation]**
+
+⟶ [Kích thước tập huấn luyện, Mô phỏng, Giải thích]
+
+
+
+
+**36. [Small, Medium, Large]**
+
+⟶ [Nhỏ, Trung bình, Lớn]
+
+
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+⟶ [Cố định các tầng, huấn luyện trọng số trên hàm softmax, Cố định hầu hết các tầng, huấn luyện trọng số trên tầng cuối và hàm softmax, Huấn luyện trọng số trên tầng và softmax bằng việc khởi tạo trọng số trên mô hình đã huấn luyện sẵn]
+
+
+
+
+**38. Optimizing convergence**
+
+⟶ Tối ưu hội tụ
+
+
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+⟶ Tốc độ học ― Tốc độ học, thường được kí hiệu là α hoặc đôi khi là η, cho biết mức độ thay đổi của các trọng số sau mỗi lần được cập nhật. Nó có thể được cố định hoặc thay đổi thích ứng. Phương thức phổ biến nhất hiện nay là Adam, một phương thức điều chỉnh tốc độ học một cách thích nghi.
+
+
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+⟶ Tốc độ học thích nghi - Để cho tốc độ học thay đổi khi huấn luyện một mô hình có thể giảm thời gian huấn luyện và cải thiện giải pháp tối ưu số. Trong khi tối ưu hóa Adam (Adam optimizer) là kỹ thuật được sử dụng phổ biến nhất, nhưng những phương pháp khác cũng có thể hữu ích. Chúng được tổng kết trong bảng dưới đây:
+
+
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+⟶ [Phương thức, Giải thích, Cập nhật của w, Cập nhật của b]
+
+
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+⟶ [Momentum, Làm giảm dao động, Cải thiện SGD, 2 tham số để tinh chỉnh]
+
+
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+⟶ [RMSprop, Root Mean Square propagation, Tăng tốc thuật toán học bằng cách kiểm soát dao động]
+
+
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+⟶ [Adam, Adaptive Moment estimation, Phương pháp phổ biến nhất, 4 tham số để tinh chỉnh]
+
+
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+⟶ Chú ý: những phương pháp khác bao gồm Adadelta, Adagrad và SGD.
+
+
+
+
+**46. Regularization**
+
+⟶ Regularization
+
+
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+⟶ Dropout ― Dropout là một kỹ thuật được sử dụng trong các mạng neural để tránh overfitting trên tập huấn luyện bằng cách loại bỏ các nơ-ron (neuron) với xác suất p>0. Nó buộc mô hình không phụ thuộc quá nhiều vào một tập thuộc tính cụ thể nào đó.
+
+
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+⟶ Ghi chú: hầu hết các frameworks deep learning đều có thiết lập dropout thông qua biến tham số 'keep' 1-p.
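A minimal sketch of dropout as described in items 47 and 48, where each neuron is dropped with probability p and kept with probability 1 − p. Rescaling the kept activations by 1/(1 − p) ("inverted dropout") is a common convention assumed here; it is not stated in the cheatsheet text. The function name and the list-based input are illustrative.

```python
import random

def dropout(activations, p=0.5, rng=random):
    """Zero each activation with probability p; rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p  # the 'keep' parameter mentioned in the remark; requires 0 <= p < 1
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
dropped = dropout([1.0] * 1000, p=0.5)
```

Dropout is applied only during training; at evaluation time the activations are used unchanged.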
+
+
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+⟶ Weight regularization - Để đảm bảo rằng các trọng số không quá lớn và mô hình không bị overfitting trên tập huấn luyện, các kỹ thuật chính quy (regularization) thường được thực hiện trên các trọng số của mô hình. Những kĩ thuật chính được tổng kết trong bảng dưới đây:
+
+
+
+
+**50. [LASSO, Ridge, Elastic Net]**
+
+⟶ [LASSO, Ridge, Elastic Net]
+
+
+
+**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+⟶ [Giảm hệ số về 0, Tốt cho việc lựa chọn biến, Làm cho hệ số nhỏ hơn, Đánh đổi giữa việc lựa chọn biến và hệ số nhỏ]
+
+
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+⟶ Dừng sớm ― Kĩ thuật regularization này sẽ dừng quá trình huấn luyện ngay khi mất mát trên tập thẩm định (validation) chững lại (plateau) hoặc bắt đầu tăng.
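One way to turn the early stopping rule in item 51 into code is a patience counter: stop once the validation loss has failed to improve for a fixed number of consecutive epochs. The `patience` parameter and the function name are assumptions for illustration; the cheatsheet only states the stopping condition in words.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0   # validation loss improved: reset the counter
        else:
            waited += 1              # plateau or increase
            if waited >= patience:
                return epoch
    return len(val_losses) - 1       # never triggered: train to the last epoch

stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9], patience=2)
```

In practice the weights from the best epoch (here epoch 2) are usually the ones kept.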
+
+
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+⟶ [Lỗi, Thẩm định, Huấn luyện, dừng sớm, Vòng lặp]
+
+
+
+
+**53. Good practices**
+
+⟶ Thói quen tốt
+
+
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+⟶ Overfitting small batch - Khi gỡ lỗi một mô hình, khá hữu ích khi thực hiện các kiểm tra (tests) nhanh để xem liệu có bất kỳ vấn đề lớn nào với kiến trúc của mô hình đó không. Đặc biệt, để đảm bảo rằng mô hình có thể được huấn luyện đúng cách, một batch nhỏ (mini-batch) được truyền vào bên trong mạng để xem liệu nó có thể overfit không. Nếu không, điều đó có nghĩa là mô hình quá phức tạp hoặc không đủ phức tạp để thậm chí overfit trên batch nhỏ (mini-batch), chứ đừng nói đến một tập huấn luyện có kích thước bình thường.
+
+
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+⟶ Kiểm tra gradient - Kiểm tra gradient là một phương thức được sử dụng trong quá trình thực hiện lan truyền ngược của mạng neural. Nó so sánh giá trị của gradient phân tích (analytical gradient) với gradient số (numerical gradient) tại các điểm đã cho và đóng vai trò kiểm tra độ chính xác.
+
+
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+⟶ [Loại, Gradient số, Gradient phân tích]
+
+
+
+
+**57. [Formula, Comments]**
+
+⟶ [Công thức, Bình luận]
+
+
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+⟶ [Đắt; Mất mát phải được tính hai lần cho mỗi chiều, Được sử dụng để xác minh tính chính xác của việc triển khai phân tích, Đánh đổi trong việc chọn h không quá nhỏ (mất ổn định số) cũng không quá lớn (xấp xỉ độ dốc kém)]
+
+
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+⟶ [Kết quả 'Chính xác', Tính toán trực tiếp, Được sử dụng trong quá trình triển khai cuối cùng]
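A small sketch of the gradient check described in items 55 to 59, assuming the centered difference (f(x + h) − f(x − h)) / 2h as the numerical gradient. The quadratic loss, its hand-derived gradient, and all function names are hypothetical stand-ins for a real network's loss and backward pass.

```python
def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient; two loss evaluations per dimension."""
    grad = []
    for i in range(len(x)):
        plus = x[:i] + [x[i] + h] + x[i + 1:]
        minus = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

def loss(w):
    return sum(wi ** 2 for wi in w)   # toy loss: sum of squares

def analytical_gradient(w):
    return [2 * wi for wi in w]       # derivative of the toy loss, derived by hand

w = [1.0, -2.0, 0.5]
max_diff = max(abs(n - a) for n, a in
               zip(numerical_gradient(loss, w), analytical_gradient(w)))
```

A small `max_diff` suggests the analytical gradient is implemented correctly; a large one points to a bug in the backward pass.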
+
+
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Các cheatsheet về Deep Learning nay đã có bản [Tiếng Việt].
+
+
+**61. Original authors**
+
+⟶ Những tác giả
+
+
+
+**62. Translated by X, Y and Z**
+
+⟶ Dịch bởi X, Y và Z
+
+
+
+**63. Reviewed by X, Y and Z**
+
+⟶ Đánh giá bởi X, Y và Z
+
+
+
+**64. View PDF version on GitHub**
+
+⟶ Xem bản PDF trên GitHub
+
+
+
+**65. By X and Y**
+
+⟶ Bởi X và Y
+
+
diff --git a/vi/cs-230-recurrent-neural-networks.md b/vi/cs-230-recurrent-neural-networks.md
new file mode 100644
index 000000000..91316df26
--- /dev/null
+++ b/vi/cs-230-recurrent-neural-networks.md
@@ -0,0 +1,677 @@
+**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
+
+
+
+**1. Recurrent Neural Networks cheatsheet**
+
+⟶ Cheatsheet về mạng neural hồi quy
+
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS 230 - Deep Learning
+
+
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+⟶ [Tổng quan, Kết cấu kiến trúc, Ứng dụng của RNNs, Hàm mất mát, Lan truyền ngược]
+
+
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+⟶ [Xử lí các phụ thuộc dài hạn, Các hàm kích hoạt phổ biến, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Các loại cổng, RNN hai chiều, RNN sâu]
+
+
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+⟶ [Học biểu diễn từ, Ký hiệu, Ma trận nhúng, Word2vec, Skip-gram, Lấy mẫu âm, GloVe]
+
+
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+⟶ [So sánh các từ, Độ tương đồng Cosine, t-SNE]
+
+
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+⟶ [Mô hình ngôn ngữ, n-gram, Perplexity]
+
+
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+⟶ [Dịch máy, Tìm kiếm Beam, Chuẩn hoá độ dài, Phân tích lỗi, Bleu score]
+
+
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+⟶ [Attention, Mô hình Attention, Trọng số Attention]
+
+
+
+
+**10. Overview**
+
+⟶ Tổng quan
+
+
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+⟶ Kiến trúc của một mạng RNN truyền thống ― Các mạng neural hồi quy, còn được biết đến là RNNs, là một lớp các mạng neural cho phép các đầu ra trước đó được sử dụng như đầu vào trong khi vẫn có các trạng thái ẩn. Chúng thường có dạng như sau:
+
+
+
+
+**12. For each timestep t, the activation a and the output y are expressed as follows:**
+
+⟶ Tại mỗi bước t, giá trị kích hoạt a và đầu ra y được biểu diễn như sau:
+
+
+
+
+**13. and**
+
+⟶ và
+
+
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+⟶ với Wax,Waa,Wya,ba,by là các hệ số được chia sẻ theo thời gian và g1,g2 là các hàm kích hoạt.
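A rough sketch of one RNN timestep as described in items 12 to 14, assuming the usual form a<t> = g1(Waa a<t−1> + Wax x<t> + ba) and y<t> = g2(Wya a<t> + by). Choosing tanh for g1 and the identity for g2, as well as the toy weights below, are assumptions made only to keep the example concrete.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """Compute the activation a<t> and the output y<t> for one timestep."""
    pre = [p + q + b for p, q, b in zip(matvec(Waa, a_prev), matvec(Wax, x_t), ba)]
    a_t = [math.tanh(v) for v in pre]                     # g1 = tanh (assumption)
    y_t = [v + b for v, b in zip(matvec(Wya, a_t), by)]   # g2 = identity (assumption)
    return a_t, y_t

# Toy parameters: 1 input feature, 2 hidden units, 1 output
Wax, Waa = [[0.5], [-0.5]], [[0.1, 0.0], [0.0, 0.1]]
Wya, ba, by = [[1.0, 1.0]], [0.0, 0.0], [0.0]

a, outputs = [0.0, 0.0], []
for x_t in [[1.0], [0.5], [-1.0]]:
    a, y = rnn_step(x_t, a, Wax, Waa, Wya, ba, by)  # the same weights are reused at every timestep
    outputs.append(y)
```

The loop reuses one set of coefficients for every timestep, which is what "shared temporally" refers to.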
+
+
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+⟶ Ưu và nhược điểm của một kiến trúc RNN thông thường được tổng kết ở bảng dưới đây:
+
+
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+⟶ [Ưu điểm, Khả năng xử lí đầu vào với bất kì độ dài nào, Kích cỡ mô hình không tăng theo kích cỡ đầu vào, Quá trình tính toán sử dụng các thông tin cũ, Trọng số được chia sẻ trong suốt thời gian]
+
+
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+⟶ [Hạn chế, Tính toán chậm, Khó để truy cập các thông tin từ một khoảng thời gian dài trước đây, Không thể xem xét bất kì đầu vào sau này nào cho trạng thái hiện tại]
+
+
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+⟶ Ứng dụng của RNNs ― Các mô hình RNN hầu như được sử dụng trong lĩnh vực xử lí ngôn ngữ tự nhiên và nhận dạng tiếng nói. Các ứng dụng khác nhau được tổng kết trong bảng dưới đây:
+
+
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+⟶ [Các loại RNN, Hình minh hoạ, Ví dụ]
+
+
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+⟶ [Một-Một, Một-nhiều, Nhiều-một, Nhiều-nhiều]
+
+
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]**
+
+⟶ [Mạng neural truyền thống, Sinh nhạc, Phân loại ý kiến, Nhận dạng thực thể có tên, Dịch máy]
+
+
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+⟶ Hàm mất mát - Trong trường hợp của mạng neural hồi quy, hàm mất mát L của tất cả các bước thời gian được định nghĩa dựa theo mất mát ở mọi thời điểm như sau:
+
+
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+⟶ Lan truyền ngược theo thời gian ― Lan truyền ngược được thực hiện tại mỗi thời điểm. Ở bước T, đạo hàm của hàm mất mát L theo ma trận trọng số W được biểu diễn như sau:
+
+
+
+
+**24. Handling long term dependencies**
+
+⟶ Xử lí phụ thuộc dài hạn
+
+
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+⟶ Các hàm kích hoạt thường dùng - Các hàm kích hoạt thường dùng trong các modules RNN được miêu tả như sau:
+
+
+
+
+**26. [Sigmoid, Tanh, RELU]**
+
+⟶ [Sigmoid, Tanh, RELU]
+
+
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
+
+⟶ Vanishing/exploding gradient ― Hiện tượng vanishing và exploding gradient thường gặp trong ngữ cảnh của RNNs. Lí do chúng xảy ra là vì khó nắm bắt được các phụ thuộc dài hạn do gradient có dạng tích (multiplicative gradient), có thể giảm/tăng theo hàm mũ theo số lượng các tầng.
+
+
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+⟶ Gradient clipping ― Là một kĩ thuật được sử dụng để giải quyết vấn đề exploding gradient đôi khi gặp phải khi thực hiện lan truyền ngược. Bằng việc giới hạn giá trị lớn nhất cho gradient, hiện tượng này sẽ được kiểm soát trong thực tế.
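A minimal sketch of gradient clipping as described in item 28. It assumes clipping by the L2 norm, which is one common variant; clipping each component to a fixed range is another. The function name and the threshold `max_norm` are illustrative.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)            # already within the cap: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]  # same direction, capped magnitude

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5
```

Clipping by norm keeps the direction of the gradient and only limits its magnitude.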
+
+
+
+
+**29. clipped**
+
+⟶ clipped
+
+
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+⟶ Các loại cổng - Để giải quyết vấn đề vanishing gradient, các cổng cụ thể được sử dụng trong một vài loại RNNs và thường có mục đích rõ ràng. Chúng thường được kí hiệu là Γ và bằng với:
+
+
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+⟶ Với W, U, b là các hệ số của một cổng và σ là hàm sigmoid. Các loại chính được tổng kết ở bảng dưới đây:
+
+
+
+
+**32. [Type of gate, Role, Used in]**
+
+⟶ [Loại cổng, Vai trò, Được sử dụng trong]
+
+
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+⟶ [Cổng cập nhật, Cổng relevance, Cổng quên, Cổng ra]
+
+
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+⟶ [Dữ liệu cũ nên có tầm quan trọng như thế nào ở hiện tại?, Bỏ qua thông tin phía trước?, Xoá ô hay không xoá?, Biểu thị một ô ở mức độ bao nhiêu?]
+
+
+
+
+**35. [LSTM, GRU]**
+
+⟶ [LSTM, GRU]
+
+
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+⟶ GRU/LSTM ― Gated Recurrent Unit (GRU) và các đơn vị bộ nhớ dài-ngắn hạn (LSTM) giải quyết vấn đề vanishing gradient mà các mạng RNN truyền thống gặp phải, với LSTM là sự tổng quát của GRU. Phía dưới là bảng tổng kết các phương trình đặc trưng của mỗi kiến trúc:
+
+
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+⟶ [Đặc tính, Gated Recurrent Unit (GRU), Bộ nhớ dài-ngắn hạn (LSTM), Các phụ thuộc]
+
+
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+⟶ Chú ý: kí hiệu ⋆ chỉ phép nhân từng phần tử với nhau giữa hai vectors.
+
+
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+⟶ Các biến thể của RNNs - Bảng dưới đây tổng kết các kiến trúc thường được sử dụng khác của RNN:
+
+
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+⟶ [RNN hai chiều (Bidirectional - BRNN), RNN sâu (Deep - DRNN)]
+
+
+
+
+**41. Learning word representation**
+
+⟶ Học biểu diễn từ
+
+
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+⟶ Trong phần này, chúng ta kí hiệu V là tập từ vựng và |V| là kích cỡ của nó.
+
+
+
+
+**43. Motivation and notations**
+
+⟶ Giải thích và các kí hiệu
+
+
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+⟶ Các kĩ thuật biểu diễn - Có hai cách chính để biểu diễn từ được tổng kết ở bảng bên dưới:
+
+
+
+
+**45. [1-hot representation, Word embedding]**
+
+⟶ [Biểu diễn 1-hot, Word embedding]
+
+
+
+
+**46. [teddy bear, book, soft]**
+
+⟶ [gấu bông, sách, mềm]
+
+
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]**
+
+⟶ [Kí hiệu là ow, Cách tiếp cận đơn giản (naive), không có thông tin về độ tương đồng, Kí hiệu là ew, Xem xét độ tương đồng của các từ]
+
+
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+⟶ Embedding matrix ― Cho một từ w, embedding matrix E là một ma trận ánh xạ biểu diễn 1-hot ow của nó sang embedding ew của nó như sau:
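The mapping in item 48 can be sketched as a matrix-vector product, assuming the relation e_w = E o_w with one column of E per vocabulary word. The tiny matrix, the vocabulary size of 3, and the function names are made up for illustration.

```python
def one_hot(index, size):
    """1-hot representation o_w of the word at position `index` in the vocabulary."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(E, o_w):
    """Compute e_w = E o_w, which selects the column of E for that word."""
    return [sum(e * o for e, o in zip(row, o_w)) for row in E]

# Toy embedding matrix: 2 embedding dimensions, vocabulary size |V| = 3
E = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
e_w = embed(E, one_hot(1, 3))  # picks the second column
```

Because o_w has a single non-zero entry, implementations typically replace the product with a direct column lookup.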
+
+
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+⟶ Chú ý: học embedding matrix có thể hoàn thành bằng cách sử dụng các mô hình target/context likelihood.
+
+
+
+
+**50. Word embeddings**
+
+⟶ Word embeddings
+
+
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+⟶ Word2vec - Word2vec là một framework tập trung vào việc học word embeddings bằng cách ước lượng khả năng mà một từ cho trước được bao quanh bởi các từ khác. Các mô hình phổ biến bao gồm skip-gram, negative sampling và CBOW.
+
+
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+⟶ [Một chú gấu bông dễ thương đang đọc, gấu bông, mềm, thơ Ba Tư, nghệ thuật]
+
+
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+⟶ [Huấn luyện mạng trên proxy task, Bóc tách các thể hiện cấp cao, Tính toán word embeddings]
+
+
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+⟶ Skip-gram - Mô hình skip-gram word2vec là một task supervised learning, nó học các word embeddings bằng cách đánh giá khả năng của bất kì target word t cho trước nào xảy ra với context word c. Bằng việc kí hiệu θt là tham số đi kèm với t, xác suất P(t|c) được tính như sau:
+
+
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+⟶ Chú ý: Cộng tổng tất cả các từ vựng trong mẫu số của phần softmax khiến mô hình này tốn nhiều chi phí tính toán. CBOW là một mô hình word2vec khác sử dụng các từ xung quanh để dự đoán một từ cho trước.
+
+
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+⟶ Negative sampling ― Nó là một tập của các bộ phân loại nhị phân sử dụng logistic regressions với mục tiêu là đánh giá khả năng mà một ngữ cảnh cho trước và các target words cho trước có thể xuất hiện đồng thời, với các mô hình đang được huấn luyện trên các tập của k negative examples và 1 positive example. Cho trước context word c và target word t, dự đoán được thể hiện bởi:
+
+
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+⟶ Chú ý: phương thức này tốn ít chi phí tính toán hơn mô hình skip-gram.
+
+
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+⟶ GloVe - Mô hình GloVe, viết tắt của global vectors for word representation, nó là một kĩ thuật word embedding sử dụng ma trận đồng xuất hiện X với mỗi Xi,j là số lần mà từ đích (target) i xuất hiện tại ngữ cảnh j. Cost function J của nó như sau:
+
+
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
+Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+⟶ f là hàm trọng số với Xi,j=0⟹f(Xi,j)=0. Với tính đối xứng mà e và θ có được trong mô hình này, word embedding cuối cùng e(final)w được định nghĩa như sau:
+
+
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+⟶ Chú ý: Các phần tử riêng của các word embedding học được không nhất thiết là phải thông dịch được.
+
+
+
+
+**60. Comparing words**
+
+⟶ So sánh các từ
+
+
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+⟶ Độ tương đồng Cosine - Độ tương đồng cosine giữa các từ w1 và w2 được trình bày như sau:
+
+
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+⟶ Chú ý: θ là góc giữa các từ w1 và w2.
+
+
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+⟶ t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) là một kĩ thuật nhằm giảm đi số chiều của không gian embedding. Trong thực tế, nó thường được sử dụng để trực quan hoá các word vectors trong không gian 2 chiều (2D).
+
+
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+⟶ [văn học, nghệ thuật, sách, văn hoá, thơ, đọc, hiểu biết, giải trí, đáng yêu, tuổi thơ, tốt bụng, gấu teddy, mềm, ôm, dễ thương, đáng mến]
+
+
+
+
+**65. Language model**
+
+⟶ Mô hình ngôn ngữ
+
+
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+⟶ Tổng quan - Một mô hình ngôn ngữ sẽ dự đoán xác suất của một câu P(y).
+
+
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.**
+
+⟶ Mô hình n-gram - Mô hình này là cách tiếp cận naive với mục đích định lượng xác suất mà một biểu hiện xuất hiện trong văn bản bằng cách đếm số lần xuất hiện của nó trong tập dữ liệu huấn luyện.
+
+
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+⟶ Độ hỗn tạp - Các mô hình ngôn ngữ thường được đánh giá dựa theo độ đo hỗn tạp, cũng được biết đến là PP, có thể được hiểu như là nghịch đảo xác suất của tập dữ liệu được chuẩn hoá bởi số lượng các từ T. Độ hỗn tạp càng thấp thì càng tốt và được định nghĩa như sau:
+
+
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+⟶ Chú ý: PP thường được sử dụng trong t-SNE.
+
+
+
+
+**70. Machine translation**
+
+⟶ Dịch máy
+
+
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+⟶ Tổng quan - Một mô hình dịch máy tương tự với mô hình ngôn ngữ ngoại trừ nó có một mạng encoder được đặt phía trước. Vì lí do này, đôi khi nó còn được biết đến là mô hình ngôn ngữ có điều kiện. Mục tiêu là tìm một câu văn y như sau:
+
+
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+⟶ Tìm kiếm Beam - Nó là một giải thuật tìm kiếm heuristic được sử dụng trong dịch máy và ghi nhận tiếng nói để tìm câu văn y đúng nhất tương ứng với đầu vào x.
+
+
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+⟶ [Bước 1: Tìm top B các từ y<1>, Bước 2: Tính xác suất có điều kiện y<k>|x,y<1>,...,y<k−1>, Bước 3: Giữ top B các tổ hợp x,y<1>,...,y<k>, Kết thúc quá trình xử lí bằng một từ dừng]
+
+
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+⟶ Chú ý: nếu độ rộng của beam được thiết lập là 1, thì nó tương đương với tìm kiếm tham lam naive.
+
+
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+⟶ Độ rộng Beam - Độ rộng beam B là một tham số của giải thuật tìm kiếm beam. Các giá trị lớn của B tạo ra kết quả tốt hơn nhưng với hiệu năng thấp hơn và lượng bộ nhớ sử dụng sẽ tăng. Các giá trị nhỏ của B dẫn đến kết quả kém hơn nhưng ít tốn chi phí tính toán hơn. Giá trị tiêu chuẩn của B là khoảng 10.
+
+
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+⟶ Chuẩn hoá độ dài - Để cải thiện tính ổn định số học, beam search thường được áp dụng trên mục tiêu chuẩn hoá sau, thường được gọi là mục tiêu chuẩn hoá log-likelihood, được định nghĩa như sau:
+
+
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+⟶ Chú ý: tham số α có thể được xem như là softener, và giá trị của nó thường nằm trong khoảng từ 0.5 đến 1.
+
+
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+⟶ Phân tích lỗi - Khi có được một bản dịch tồi ˆy, chúng ta có thể tự hỏi rằng tại sao chúng ta không có được một kết quả dịch tốt y∗ bằng việc thực hiện việc phân tích lỗi như sau:
+
+
+
+
+**79. [Case, Root cause, Remedies]**
+
+⟶ [Trường hợp, Nguyên nhân sâu xa, Biện pháp khắc phục]
+
+
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+⟶ [Lỗi Beam search, lỗi RNN, Tăng beam width, Thử kiến trúc khác, Chính quy, Lấy nhiều dữ liệu hơn]
+
+
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+⟶ Điểm Bleu - Bilingual evaluation understudy (bleu) score định lượng mức độ tốt của dịch máy bằng cách tính một độ tương đồng dựa trên độ chính xác (precision) n-gram. Nó được định nghĩa như sau:
+
+
+
+
+**82. where pn is the bleu score on n-gram only defined as follows:**
+
+⟶ với pn là bleu score chỉ trên n-gram được định nghĩa như sau:
+
+
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+⟶ Chú ý: một mức phạt độ ngắn (brevity penalty) có thể được áp dụng với các bản dịch dự đoán ngắn để tránh việc bleu score bị thổi phồng một cách giả tạo.
+
+
+
+
+**84. Attention**
+
+⟶ Chú ý
+
+
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that are considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**
+
+⟶ Attention model - Mô hình này cho phép một RNN tập trung vào các phần cụ thể của đầu vào được xem xét là quan trọng, nó giúp cải thiện hiệu năng của mô hình kết quả trong thực tế. Bằng việc kí hiệu α là mức độ chú ý mà đầu ra y nên có đối với hàm kích hoạt a và c là ngữ cảnh ở thời điểm t, chúng ta có:
+
+
+
+
+**86. with**
+
+⟶ với
+
+
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+⟶ Chú ý: Các attention scores thường được sử dụng trong chú thích ảnh và dịch máy.
+
+
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+⟶ Một chú gấu bông dễ thương đang đọc bài văn Persian.
+
+
+
+
+**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:**
+
+⟶ Attention weight - Sự chú ý mà đầu ra y nên có với hàm kích hoạt a với α được tính như sau:
+
+
+
+
+**90. Remark: computation complexity is quadratic with respect to Tx.**
+
+⟶ Chú ý: độ phức tạp tính toán là bậc hai theo Tx.
+
+
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ Deep Learning cheatsheets hiện đã có bản dịch [tiếng việt].
+
+
+
+**92. Original authors**
+
+⟶ Tác giả
+
+
+
+**93. Translated by X, Y and Z**
+
+⟶ Dịch bởi X, Y và Z
+
+
+
+**94. Reviewed by X, Y and Z**
+
+⟶ Reviewed bởi X, Y và Z
+
+
+
+**95. View PDF version on GitHub**
+
+⟶ Xem bản PDF trên GitHub
+
+
+
+**96. By X and Y**
+
+⟶ Bởi X và Y
+
+
diff --git a/zh-tw/cheatsheet-deep-learning.md b/zh-tw/cs-229-deep-learning.md
similarity index 100%
rename from zh-tw/cheatsheet-deep-learning.md
rename to zh-tw/cs-229-deep-learning.md
diff --git a/zh/refresher-linear-algebra.md b/zh-tw/cs-229-linear-algebra.md
similarity index 58%
rename from zh/refresher-linear-algebra.md
rename to zh-tw/cs-229-linear-algebra.md
index 6cef234fe..36d4cef5d 100644
--- a/zh/refresher-linear-algebra.md
+++ b/zh-tw/cs-229-linear-algebra.md
@@ -1,339 +1,338 @@
1. **Linear Algebra and Calculus refresher**
⟶
-
+線性代數與微積分回顧
2. **General notations**
⟶
-
+通用符號
3. **Definitions**
⟶
-
+定義
4. **Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
⟶
-
+向量 - 我們定義 x∈Rn 是一個向量,包含 n 維元素,xi∈R 是第 i 維元素:
5. **Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
⟶
-
+矩陣 - 我們定義 A∈Rm×n 是一個 m 列 n 行的矩陣,Ai,j∈R 代表位在第 i 列第 j 行的元素:
6. **Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
⟶
-
+注意:上述定義的向量 x 可以視為 nx1 的矩陣,或是更常被稱為行向量
7. **Main matrices**
⟶
-
+主要的矩陣
8. **Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
⟶
-
+單位矩陣 - 單位矩陣 I∈Rn×n 是一個方陣,其主對角線皆為 1,其餘皆為 0
9. **Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
⟶
-
+注意:對於所有矩陣 A∈Rn×n,我們有 A×I=I×A=A
10. **Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
⟶
-
+對角矩陣 - 對角矩陣 D∈Rn×n 是一個方陣,其主對角線為非 0,其餘皆為 0
11. **Remark: we also note D as diag(d1,...,dn).**
⟶
-
+注意:我們令 D 為 diag(d1,...,dn)
12. **Matrix operations**
⟶
-
+矩陣運算
13. **Multiplication**
⟶
-
+乘法
14. **Vector-vector ― There are two types of vector-vector products:**
⟶
-
+向量-向量 - 有兩種類型的向量-向量相乘:
15. **inner product: for x,y∈Rn, we have:**
⟶
-
+內積:對於 x,y∈Rn,我們可以得到:
16. **outer product: for x∈Rm,y∈Rn, we have:**
⟶
-
+外積:對於 x∈Rm,y∈Rn,我們可以得到:
17. **Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:**
⟶
-
+矩陣-向量 - 矩陣 A∈Rm×n 和向量 x∈Rn 的乘積是一個大小為 Rm 的向量,使得:
18. **where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
⟶
-
+其中 aTr,i 是 A 的列向量、ac,j 是 A 的行向量、xi 是 x 的元素
19. **Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:**
⟶
-
+矩陣-矩陣:矩陣 A∈Rm×n 和 B∈Rn×p 的乘積為一個大小 Rm×p 的矩陣,使得:
20. **where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
⟶
-
+其中,aTr,i,bTr,i 和 ac,j,bc,j 分別是 A 和 B 的列向量與行向量
21. **Other operations**
⟶
-
+其他操作
22. **Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
⟶
-
+轉置 - 一個矩陣的轉置矩陣 A∈Rm×n,記作 AT,指的是其中元素的翻轉:
23. **Remark: for matrices A,B, we have (AB)T=BTAT**
⟶
-
+注意:對於矩陣 A、B,我們有 (AB)T=BTAT
24. **Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
⟶
-
+反矩陣 - 一個可逆方陣 A 的反矩陣記作 A−1,它是唯一滿足下列條件的矩陣:
25. **Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
⟶
-
+注意:並非所有的方陣都是可逆的。同樣的,對於矩陣 A、B 來說,我們有 (AB)−1=B−1A−1
26. **Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
⟶
-
+跡 - 一個方陣 A 的跡,記作 tr(A),指的是主對角線元素之和:
27. **Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
⟶
-
+注意:對於矩陣 A、B 來說,我們有 tr(AT)=tr(A) 及 tr(AB)=tr(BA)
28. **Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
⟶
-
+行列式 - 一個方陣 A∈Rn×n 的行列式,記作|A| 或 det(A),可以透過 A∖i,∖j 來遞迴表示,它是一個沒有第 i 列和第 j 行的矩陣 A:
29. **Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
⟶
-
+注意:A 是一個可逆矩陣,若且唯若 |A|≠0。同樣的,|AB|=|A||B| 且 |AT|=|A|
30. **Matrix properties**
⟶
-
+矩陣的性質
31. **Definitions**
⟶
-
+定義
32. **Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
⟶
-
+對稱分解 - 給定一個矩陣 A,它可以透過其對稱和反對稱的部分表示如下:
33. **[Symmetric, Antisymmetric]**
⟶
-
+[對稱, 反對稱]
34. **Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
⟶
-
+範數 - 範數指的是一個函式 N:V⟶[0,+∞[,其中 V 是一個向量空間,且對於所有 x,y∈V,我們有:
35. **N(ax)=|a|N(x) for a scalar**
⟶
-
+對一個純量來說,我們有 N(ax)=|a|N(x)
36. **if N(x)=0, then x=0**
⟶
-
+若 N(x)=0 時,則 x=0
37. **For x∈V, the most commonly used norms are summed up in the table below:**
⟶
-
+對於 x∈V,最常用的範數總結如下表:
38. **[Norm, Notation, Definition, Use case]**
⟶
-
+[範數, 表示法, 定義, 使用情境]
39. **Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
⟶
-
+線性相關 - 當集合中的一個向量可以被定義為集合中其他向量的線性組合時,則稱此集合的向量為線性相關
40. **Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
⟶
-
+注意:如果沒有向量可以如上表示時,則稱此集合的向量彼此為線性獨立
41. **Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
⟶
-
+矩陣的秩 - 一個矩陣 A 的秩記作 rank(A),指的是其行向量所生成之向量空間的維度,等價於 A 的線性獨立行向量的最大數目
42. **Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
⟶
-
+半正定矩陣 - 當以下成立時,一個矩陣 A∈Rn×n 是半正定矩陣 (PSD),且記作A⪰0:
43. **Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
⟶
-
+注意:同樣的,一個矩陣 A 是一個半正定矩陣 (PSD),且滿足所有非零向量 x,xTAx>0 時,稱之為正定矩陣,記作 A≻0
44. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
⟶
-
+特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,當存在一個向量 z∈Rn∖{0} 時,此向量被稱為特徵向量,λ 稱之為 A 的特徵值,且滿足:
45. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
⟶
-
+譜分解 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以被一個實數正交矩陣 U∈Rn×n 給對角化。令 Λ=diag(λ1,...,λn),我們得到:
46. **diagonal**
⟶
-
+對角線
47. **Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
⟶
-
+奇異值分解 - 對於給定維度為 mxn 的矩陣 A,其奇異值分解指的是一種因子分解技巧,保證存在 mxm 的單式矩陣 U、對角線矩陣 Σ m×n 和 nxn 的單式矩陣 V,滿足:
48. **Matrix calculus**
⟶
-
+矩陣導數
49. **Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
⟶
-
+梯度 - 令 f:Rm×n→R 是一個函式,且 A∈Rm×n 是一個矩陣。f 相對於 A 的梯度是一個 mxn 的矩陣,記作 ∇Af(A),滿足:
50. **Remark: the gradient of f is only defined when f is a function that returns a scalar.**
⟶
-
+注意:f 的梯度僅在 f 為一個函數且該函數回傳一個純量時有效
51. **Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
⟶
-
+海森 - 令 f:Rn→R 是一個函式,且 x∈Rn 是一個向量,則一個 f 的海森對於向量 x 是一個 nxn 的對稱矩陣,記作 ∇2xf(x),滿足:
52. **Remark: the hessian of f is only defined when f is a function that returns a scalar**
⟶
-
+注意:f 的海森僅在 f 為一個函數且該函數回傳一個純量時有效
53. **Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
-
+梯度運算 - 對於矩陣 A、B、C,下列的梯度性質值得牢牢記住:
⟶
-
-
54. **[General notations, Definitions, Main matrices]**
⟶
-
+[通用符號, 定義, 主要矩陣]
55. **[Matrix operations, Multiplication, Other operations]**
⟶
-
+[矩陣運算, 矩陣乘法, 其他運算]
56. **[Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
⟶
-
+[矩陣性質, 範數, 特徵值/特徵向量, 奇異值分解]
57. **[Matrix calculus, Gradient, Hessian, Operations]**
⟶
+[矩陣導數, 梯度, 海森, 運算]
\ No newline at end of file
diff --git a/zh/cheatsheet-machine-learning-tips-and-tricks.md b/zh-tw/cs-229-machine-learning-tips-and-tricks.md
similarity index 59%
rename from zh/cheatsheet-machine-learning-tips-and-tricks.md
rename to zh-tw/cs-229-machine-learning-tips-and-tricks.md
index 61fab788c..b7a5db1c0 100644
--- a/zh/cheatsheet-machine-learning-tips-and-tricks.md
+++ b/zh-tw/cs-229-machine-learning-tips-and-tricks.md
@@ -1,285 +1,257 @@
1. **Machine Learning tips and tricks cheatsheet**
⟶
-
+機器學習秘訣和技巧參考手冊
2. **Classification metrics**
⟶
-
+分類器的評估指標
3. **In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
⟶
-
+在二元分類的問題上,底下是主要用來衡量模型表現的指標
4. **Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
⟶
-
+混淆矩陣 - 混淆矩陣可在評估模型表現時提供更完整的資訊,其定義如下:
5. **[Predicted class, Actual class]**
⟶
-
+[預測類別, 真實類別]
6. **Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
⟶
-
+主要的衡量指標 - 底下的指標經常用在評估分類模型的表現
7. **[Metric, Formula, Interpretation]**
⟶
-
+[指標, 公式, 解釋]
8. **Overall performance of model**
⟶
-
+模型的整體表現
9. **How accurate the positive predictions are**
⟶
-
+預測的類別有多精準的比例
10. **Coverage of actual positive sample**
⟶
-
+實際正的樣本的覆蓋率有多少
11. **Coverage of actual negative sample**
⟶
-
+實際負的樣本的覆蓋率
12. **Hybrid metric useful for unbalanced classes**
⟶
-
+對於非平衡類別相當有用的混合指標
13. **ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are are summed up in the table below:**
⟶
-
+ROC - 接收者操作特徵曲線 (ROC Curve),又被稱為 ROC,是透過改變閥值來表示 TPR 和 FPR 之間關係的圖形。這些指標總結如下:
14. **[Metric, Formula, Equivalent]**
⟶
-
+[衡量指標, 公式, 等同於]
15. **AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
⟶
-
+AUC - 在接收者操作特徵曲線 (ROC) 底下的面積,也稱為 AUC 或 AUROC:
16. **[Actual, Predicted]**
⟶
-
+[實際值, 預測值]
17. **Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
⟶
-
+基本的指標 - 給定一個迴歸模型 f,底下是經常用來評估此模型的指標:
18. **[Total sum of squares, Explained sum of squares, Residual sum of squares]**
⟶
-
+[總平方和, 被解釋平方和, 殘差平方和]
19. **Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
⟶
-
+決定係數 - 決定係數通常記作 R2 或 r2,用來衡量模型復現觀測結果的程度。定義如下:
20. **Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
⟶
-
+主要的衡量指標 - 藉由考量變數 n 的數量,我們經常用使用底下的指標來衡量迴歸模型的表現:
21. **where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
⟶
-
+當中,L 代表的是概似估計,ˆσ2 則是變異數的估計
22. **Model selection**
⟶
-
+模型選擇
23. **Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
⟶
-
+詞彙 - 當進行模型選擇時,我們會針對資料進行以下區分:
24. **[Training set, Validation set, Testing set]**
⟶
-
+[訓練資料集, 驗證資料集, 測試資料集]
25. **[Model is trained, Model is assessed, Model gives predictions]**
⟶
-
+[用來訓練模型, 用來評估模型, 模型用來預測用的資料集]
26. **[Usually 80% of the dataset, Usually 20% of the dataset]**
⟶
-
+[通常是 80% 的資料集, 通常是 20% 的資料集]
27. **[Also called hold-out or development set, Unseen data]**
⟶
-
+[又被稱為 hold-out 資料集或開發資料集, 模型沒看過的資料集]
28. **Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
⟶
-
+當模型被選擇後,就會使用整個資料集來做訓練,並且在沒看過的資料集上做測試。你可以參考以下的圖表:
29. **Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
⟶
-
+交叉驗證 - 交叉驗證,又稱之為 CV,它是一種不特別依賴初始訓練集來挑選模型的方法。幾種不同的方法如下:
-30. [**Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+30. **[Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
⟶
-
+[把資料分成 k 份,利用 k-1 份資料來訓練,剩下的一份用來評估模型效能, 在 n-p 份資料上進行訓練,剩下的 p 份資料用來評估模型效能]
31. **[Generally k=5 or 10, Case p=1 is called leave-one-out]**
⟶
-
+[一般來說 k=5 或 10, 當 p=1 時,又稱為 leave-one-out]
32. **The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
⟶
-
+最常用到的方法叫做 k-fold 交叉驗證。它將訓練資料切成 k 份,在 k-1 份資料上進行訓練,而剩下的一份用來評估模型的效能,這樣的流程會重複 k 次。最後計算出來的模型損失是 k 次結果的平均,又稱為交叉驗證損失值。
33. **Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
⟶
-
+正規化 - 正規化的目的是為了避免模型對於訓練資料過擬合,進而處理高變異的問題。底下的表格整理了常見的正規化技巧:
34. **[Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
⟶
-
+[將係數縮減為 0, 有利變數的選擇, 將係數變得更小, 在變數的選擇和小係數之間作權衡]
35. **Diagnostics**
⟶
-
+診斷
36. **Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
⟶
-
+偏差 - 模型的偏差指的是模型預測值與實際值之間的差異
37. **Variance ― The variance of a model is the variability of the model prediction for given data points.**
⟶
-
+變異 - 變異指的是模型在預測資料時的變異程度
38. **Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
⟶
-
+偏差/變異的權衡 - 越簡單的模型,偏差就越大。而越複雜的模型,變異就越大
39. **[Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
⟶
-
+[現象, 迴歸圖示, 分類圖示, 深度學習圖示, 可能的解法]
40. **[High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
⟶
-
+[訓練錯誤較高, 訓練錯誤和測試錯誤接近, 高偏差, 訓練誤差會稍微比測試誤差低, 訓練誤差很低, 訓練誤差比測試誤差低很多, 高變異]
41. **[Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
⟶
-
+[使用較複雜的模型, 增加更多特徵, 訓練更久, 採用正規化的方法, 取得更多資料]
42. **Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
⟶
-
+誤差分析 - 誤差分析指的是分析目前使用的模型和最佳模型之間差距的根本原因
43. **Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
⟶
-
-
-
-44. **Regression metrics**
-
-⟶
-
+銷蝕分析 (Ablative analysis) - 銷蝕分析指的是分析目前模型和基準模型之間差異的根本原因
-
-45. **[Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-⟶
-
-
-
-46. **[Regression metrics, R squared, Mallow's CP, AIC, BIC]**
-
-⟶
-
-
-
-47. **[Model selection, cross-validation, regularization]**
-
-⟶
-
-
-
-48. **[Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-⟶
diff --git a/zh/refresher-probability.md b/zh-tw/cs-229-probability.md
similarity index 56%
rename from zh/refresher-probability.md
rename to zh-tw/cs-229-probability.md
index 52e0056e0..0db481cf5 100644
--- a/zh/refresher-probability.md
+++ b/zh-tw/cs-229-probability.md
@@ -1,381 +1,382 @@
1. **Probabilities and Statistics refresher**
⟶
-
+機率和統計回顧
2. **Introduction to Probability and Combinatorics**
⟶
-
+機率與組合數學介紹
3. **Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
⟶
-
+樣本空間 - 一個實驗的所有可能結果的集合稱之為這個實驗的樣本空間,記做 S
4. **Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
⟶
-
+事件 - 樣本空間的任何子集合 E 被稱之為一個事件。也就是說,一個事件是實驗的可能結果的集合。如果該實驗的結果包含於 E,我們稱 E 發生
5. **Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.**
⟶
-
+機率公理。對於每個事件 E,我們用 P(E) 表示事件 E 發生的機率
6. **Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
⟶
-
+公理 1 - 每一個機率值介於 0 到 1 之間,包含兩端點。即:
7. **Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
⟶
-
+公理 2 - 至少一個基本事件出現在整個樣本空間中的機率是 1。即:
8. **Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
⟶
-
+公理 3 - 對於任何互斥的事件 E1,...,En,我們定義如下:
9. **Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
⟶
-
+排列 - 排列指的是從 n 個相異的物件中,取出 r 個物件按照固定順序重新安排,這樣安排的數量用 P(n,r) 來表示,定義為:
10. **Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
⟶
-
+組合 - 組合指的是從 n 個物件中,取出 r 個物件,但不考慮他的順序。這樣組合要考慮的數量用 C(n,r) 來表示,定義為:
11. **Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
⟶
-
+注意:對於 0⩽r⩽n,我們會有 P(n,r)⩾C(n,r)
12. **Conditional Probability**
⟶
-
+條件機率
13. **Bayes' rule ― For events A and B such that P(B)>0, we have:**
⟶
-
+貝氏定理 - 對於事件 A 和 B 滿足 P(B)>0 時,我們定義如下:
14. **Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
⟶
-
+注意:P(A∩B)=P(A)P(B|A)=P(A|B)P(B)
15. **Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
⟶
-
+分割 - 令 {Ai,i∈[[1,n]]} 對所有的 i,Ai≠∅,我們說 {Ai} 是一個分割,當底下成立時:
16. **Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
⟶
-
+注意:對於任何在樣本空間的事件 B 來說,P(B)=n∑i=1P(B|Ai)P(Ai)
17. **Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
⟶
-
+貝氏定理的擴展 - 令 {Ai,i∈[[1,n]]} 為樣本空間的一個分割,我們定義:
18. **Independence ― Two events A and B are independent if and only if we have:**
⟶
-
+獨立 - 當以下條件滿足時,兩個事件 A 和 B 為獨立事件:
19. **Random Variables**
⟶
-
+隨機變數
20. **Definitions**
⟶
-
+定義
21. **Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
⟶
-
+隨機變數 - 一個隨機變數 X,它是一個將樣本空間中的每個元素映射到實數域的函數
22. **Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
⟶
-
+累積分佈函數 (CDF) - 累積分佈函數 F 是單調非遞減的函數,其 limx→−∞F(x)=0 且 limx→+∞F(x)=1,定義如下:
23. **Remark: we have P(a
24. **Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
⟶
-
+機率密度函數 - 機率密度函數 f 是隨機變數 X 在兩個相鄰的實數值附近取值的機率
25. **Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
⟶
-
+機率密度函數和累積分佈函數的關係 - 底下是一些關於離散 (D) 和連續 (C) 的情況下的重要屬性
26. **[Case, CDF F, PDF f, Properties of PDF]**
⟶
-
+[情況, 累積分佈函數 F, 機率密度函數 f, 機率密度函數的屬性]
27. **Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
⟶
-
+分佈的期望值和動差 - 底下是期望值 E[X]、一般期望值 E[g(X)]、第 k 個動差 E[Xk] 和特徵函數 ψ(ω) 在離散和連續的情況下的表示式:
28. **Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
⟶
-
+變異數 - 隨機變數的變異數通常表示為 Var(X) 或 σ2,用來衡量一個分佈離散程度的指標。其表示如下:
29. **Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
⟶
-
+標準差 - 一個隨機變數的標準差通常表示為 σ,用來衡量一個分佈離散程度的指標,其單位和實際的隨機變數相容,表示如下:
30. **Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
⟶
-
+隨機變數的轉換 - 令變數 X 和 Y 由某個函式連結在一起。我們定義 fX 和 fY 是 X 和 Y 的分佈函式,可以得到:
31. **Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
⟶
-
+萊布尼茲積分法則 - 令 g 為 x 的函數(也可能與 c 有關),a 和 b 是可能依賴於 c 的邊界,我們得到:
32. **Probability Distributions**
⟶
-
+機率分佈
33. **Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
⟶
-
+柴比雪夫不等式 - 令 X 是一隨機變數,期望值為 μ。對於 k, σ>0,我們有以下不等式:
34. **Main distributions ― Here are the main distributions to have in mind:**
⟶
-
+主要的分佈 - 底下是我們需要熟悉的幾個主要的分佈:
35. **[Type, Distribution]**
⟶
-
+[種類, 分佈]
36. **Jointly Distributed Random Variables**
⟶
-
+聯合分佈隨機變數
37. **Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
⟶
-
+邊緣密度和累積分佈 - 從聯合密度機率函數 fXY 中我們可以得到:
38. **[Case, Marginal density, Cumulative function]**
⟶
-
+[種類, 邊緣密度函數, 累積函數]
39. **Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
⟶
-
+條件密度 - X 對於 Y 的條件密度,通常用 fX|Y 表示如下:
40. **Independence ― Two random variables X and Y are said to be independent if we have:**
⟶
-
+獨立 - 當滿足以下條件時,我們稱隨機變數 X 和 Y 互相獨立:
41. **Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
⟶
-
+共變異數 - 我們定義隨機變數 X 和 Y 的共變異數為 σ2XY 或 Cov(X,Y) 如下:
42. **Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
⟶
-
+相關性 - 我們定義 σX、σY 為 X 和 Y 的標準差,而 X 和 Y 的相關係數 ρXY 定義如下:
43. **Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
⟶
-
+注意一:對於任何隨機變數 X 和 Y 來說,ρXY∈[−1,1] 成立
44. **Remark 2: If X and Y are independent, then ρXY=0.**
⟶
-
+注意二:當 X 和 Y 獨立時,ρXY=0
45. **Parameter estimation**
⟶
-
+參數估計
46. **Definitions**
⟶
-
+定義
47. **Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
⟶
-
+隨機抽樣 - 隨機抽樣指的是 n 個隨機變數 X1,...,Xn 和 X 獨立且同分佈的集合
48. **Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
⟶
-
+估計量 - 估計量是一個資料的函數,用來推斷在統計模型中未知參數的值
49. **Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
⟶
-
+偏差 - 一個估計量 ^θ 的偏差定義為 ^θ 分佈的期望值和真實值之間的差距:
50. **Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
⟶
-
+注意:當 E[^θ]=θ 時,我們稱為不偏估計量
51. **Estimating the mean**
⟶
-
+預估平均數
52. **Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯X and is defined as follows:**
⟶
-
+樣本平均 - 一個隨機樣本的樣本平均是用來預估一個分佈的真實平均 μ,通常我們用 ¯X 來表示,定義如下:
53. **Remark: the sample mean is unbiased, i.e E[¯X]=μ.**
⟶
-
+注意:樣本平均是不偏的,即 E[¯X]=μ
54. **Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
⟶
-
+中央極限定理 - 當我們有一個隨機樣本 X1,...,Xn 滿足一個給定的分佈,其平均數為 μ,變異數為 σ2,我們有:
55. **Estimating the variance**
⟶
-
+估計變異數
56. **Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
⟶
-
+樣本變異數 - 一個隨機樣本的樣本變異數是用來估計一個分佈的真實變異數 σ2,通常使用 s2 或 ^σ2 來表示,定義如下:
57. **Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
⟶
-
+注意:樣本變異數是不偏的,即 E[s2]=σ2
58. **Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
⟶
-
+與樣本變異數的卡方關聯 - 令 s2 是一個隨機樣本的樣本變異數,我們可以得到:
-59. **[Introduction, Sample space, Event, Permutation]**
+**59. [Introduction, Sample space, Event, Permutation]**
⟶
-
+[介紹, 樣本空間, 事件, 排列]
-60. **[Conditional probability, Bayes' rule, Independence]**
+**60. [Conditional probability, Bayes' rule, Independence]**
⟶
-
+[條件機率, 貝氏定理, 獨立性]
-61. **[Random variables, Definitions, Expectation, Variance]**
+**61. [Random variables, Definitions, Expectation, Variance]**
⟶
-
+[隨機變數, 定義, 期望值, 變異數]
-62. **[Probability distributions, Chebyshev's inequality, Main distributions]**
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
⟶
-
+[機率分佈, 柴比雪夫不等式, 主要分佈]
-63. **[Jointly distributed random variables, Density, Covariance, Correlation]**
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
⟶
-
+[聯合分佈隨機變數, 密度, 共變異數, 相關]
-64. **[Parameter estimation, Mean, Variance]**
+**64. [Parameter estimation, Mean, Variance]**
⟶
+[參數估計, 平均數, 變異數]
\ No newline at end of file
diff --git a/zh-tw/cs-229-supervised-learning.md b/zh-tw/cs-229-supervised-learning.md
new file mode 100644
index 000000000..0b329e8db
--- /dev/null
+++ b/zh-tw/cs-229-supervised-learning.md
@@ -0,0 +1,352 @@
+1. **Supervised Learning cheatsheet**
+
+⟶ 監督式學習參考手冊
+
+2. **Introduction to Supervised Learning**
+
+⟶ 監督式學習介紹
+
+3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+⟶ 給定一組資料點 {x(1),...,x(m)},以及對應的一組輸出 {y(1),...,y(m)},我們希望建立一個分類器,用來學習如何從 x 來預測 y
+
+4. **Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+⟶ 預測的種類 - 根據預測的種類不同,我們將預測模型分為底下幾種:
+
+5. **[Regression, Classifier, Outcome, Examples]**
+
+⟶ [迴歸, 分類器, 結果, 範例]
+
+6. **[Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+⟶ [連續, 類別, 線性迴歸, 邏輯迴歸, 支援向量機 (SVM), 單純貝氏分類器]
+
+7. **Type of model ― The different models are summed up in the table below:**
+
+⟶ 模型種類 - 不同種類的模型歸納如下表:
+
+8. **[Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+⟶ [判別模型, 生成模型, 目標, 學到什麼, 示意圖, 範例]
+
+9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+⟶ [直接估計 P(y|x), 先估計 P(x|y),然後推論出 P(y|x), 決策分界線, 資料的機率分佈, 迴歸, 支援向量機 (SVM), 高斯判別分析 (GDA), 單純貝氏 (Naive Bayes)]
+
+10. **Notations and general concepts**
+
+⟶ 符號及一般概念
+
+11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+⟶ 假設 - 我們使用 hθ 來代表所選擇的模型,對於給定的輸入資料 x(i),模型預測的輸出是 hθ(x(i))
+
+12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+⟶ 損失函數 - 損失函數是一個函數 L:(z,y)∈R×Y⟼L(z,y)∈R,
+目的在於計算預測值 z 和實際值 y 之間的差距。底下是一些常見的損失函數:
+
+13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+⟶ [最小平方誤差, Logistic 損失函數, Hinge 損失函數, 交叉熵]
+
+14. **[Linear regression, Logistic regression, SVM, Neural Network]**
+
+⟶ [線性迴歸, 邏輯迴歸, 支援向量機 (SVM), 神經網路]
+
+15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+⟶ 代價函數 - 代價函數 J 通常用來評估一個模型的表現,它可以透過損失函數 L 來定義:
+
+16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+⟶ 梯度下降 - 使用 α∈R 表示學習速率,梯度下降的更新規則可以透過學習速率和代價函數 J 表示如下:
+
+17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
+
+⟶ 注意:隨機梯度下降法 (SGD) 使用每一個訓練資料來更新參數。而批次梯度下降法則是透過一個批次的訓練資料來更新參數。
+
+18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+⟶ 概似函數 - 在給定參數 θ 的條件下,模型的概似函數 L(θ) 用來透過最大化概似函數找到最佳的參數 θ。實務上,我們會使用比較容易最佳化的對數概似函數 (log-likelihood) ℓ(θ)=log(L(θ))。如下:
+
+19. **Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+⟶ 牛頓演算法 - 牛頓演算法是一個數值方法,目的在於找到一個 θ,讓 ℓ′(θ)=0。其更新的規則為:
+
+20. **Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+⟶ 注意:此方法在多維度下的推廣,又被稱之為牛頓-拉弗森 (Newton-Raphson) 演算法,是透過以下的規則更新:
+
+21. **Linear models**
+
+⟶ 線性模型
+
+22. **Linear regression**
+
+⟶ 線性迴歸
+
+23. **We assume here that y|x;θ∼N(μ,σ2)**
+
+⟶ 我們假設 y|x;θ∼N(μ,σ2)
+
+24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+⟶ 正規方程法 - 我們使用 X 代表設計矩陣,讓代價函數最小的 θ 值有一個封閉解,如下:
+
+25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+⟶ 最小均方演算法 (LMS) - 我們使用 α 表示學習速率,針對 m 個訓練資料,最小均方演算法 (又稱為 Widrow-Hoff 學習法) 的更新規則如下:
+
+26. **Remark: the update rule is a particular case of the gradient ascent.**
+
+⟶ 注意:這個更新的規則是梯度上升的一種特例
+
+27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+⟶ 局部加權迴歸 (LWR) - 局部加權迴歸,又稱為 LWR,是線性迴歸的變形,透過 w(i)(x) 對其代價函數中的每個訓練樣本進行加權,其中 w(i)(x) 以參數 τ∈R 定義為:
+
+28. **Classification and logistic regression**
+
+⟶ 分類與邏輯迴歸
+
+29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+⟶ Sigmoid 函數 - Sigmoid 函數 g,也可以稱為邏輯函數定義如下:
+
+30. **Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+⟶ 邏輯迴歸 - 我們假設 y|x;θ∼Bernoulli(ϕ),請參考以下:
+
+31. **Remark: there is no closed form solution for the case of logistic regressions.**
+
+⟶ 注意:邏輯迴歸沒有封閉解
+
+32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+⟶ Softmax 迴歸 - Softmax 迴歸又稱做多分類邏輯迴歸,是邏輯迴歸在輸出類別超過兩個時的推廣。按照慣例,我們設定 θK=0,使得每一個類別 i 的 Bernoulli 參數 ϕi 等於:
+
+33. **Generalized Linear Models**
+
+⟶ 廣義線性模型
+
+34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+⟶ 指數族分佈 - 一個分佈如果可以透過自然參數 (或稱之為正準參數或連結函數) η、充分統計量 T(y) 和對數區分函數 (log-partition function) a(η) 來表示時,我們就稱這個分佈是屬於指數族分佈。該分佈可以表示如下:
+
+35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+⟶ 注意:我們經常讓 T(y)=y,同時,exp(−a(η)) 可以看成是一個正規化的參數,目的在於讓機率總和為一。
+
+36. **Here are the most common exponential distributions summed up in the following table:**
+
+⟶ 底下的表格總結了最常見的指數族分佈:
+
+37. **[Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+⟶ [分佈, 白努利 (Bernoulli), 高斯 (Gaussian), 卜瓦松 (Poisson), 幾何 (Geometric)]
+
+38. **Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+⟶ 廣義線性模型的假設 - 廣義線性模型 (GLM) 的目的在於,給定 x∈Rn+1,要預測隨機變數 y,同時它依賴底下三個假設:
+
+39. **Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+⟶ 注意:最小平方法和邏輯迴歸是廣義線性模型的一種特例
+
+40. **Support Vector Machines**
+
+⟶ 支援向量機
+
+41. **The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**
+
+⟶ 支援向量機的目的在於找到一條線,使資料樣本到這條線的最小距離最大化
+
+42. **Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+⟶ 最佳的邊界分類器 - 最佳的邊界分類器可以表示為:
+
+43. **where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+⟶ 其中,(w,b)∈Rn×R 是底下最佳化問題的答案:
+
+44. **such that**
+
+⟶ 使得
+
+45. **support vectors**
+
+⟶ 支援向量
+
+46. **Remark: the line is defined as wTx−b=0.**
+
+⟶ 注意:該條直線定義為 wTx−b=0
+
+47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+⟶ Hinge 損失函數 - Hinge 損失函數用在支援向量機上,定義如下:
+
+48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:**
+
+⟶ 核(函數) - 給定特徵轉換 ϕ,我們定義核(函數) K 為:
+
+49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**
+
+⟶ 實務上,K(x,z)=exp(−||x−z||22σ2) 定義的核(函數) K,一般稱作高斯核(函數)。這種核(函數)經常被使用
+
+50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+⟶ [非線性可分, 使用核(函數)進行映射, 原始空間中的決策邊界]
+
+51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+⟶ 注意:我們使用 "核(函數)技巧" 來計算代價函數時,不需要真正的知道映射函數 ϕ,這個函數通常非常複雜。相反的,我們只需要知道 K(x,z) 的值即可。
+
+52. **Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+⟶ Lagrangian - 我們將 Lagrangian L(w,b) 定義如下:
+
+53. **Remark: the coefficients βi are called the Lagrange multipliers.**
+
+⟶ 注意:係數 βi 稱為 Lagrange 乘數
+
+54. **Generative Learning**
+
+⟶ 生成學習
+
+55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+⟶ 生成模型嘗試透過預估 P(x|y) 來學習資料如何生成,而我們可以透過貝氏定理來預估 P(y|x)
+
+56. **Gaussian Discriminant Analysis**
+
+⟶ 高斯判別分析
+
+57. **Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+⟶ 設定 - 高斯判別分析針對 y、x|y=0 和 x|y=1 進行以下假設:
+
+58. **Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+⟶ 估計 - 底下的表格總結了我們在最大概似估計時的估計值:
+
+59. **Naive Bayes**
+
+⟶ 單純貝氏
+
+60. **Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+⟶ 假設 - 單純貝氏模型會假設每個資料點的特徵都是獨立的。
+
+61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+⟶ 解 - 最大化對數概似函數可以得到以下的解,其中 k∈{0,1},l∈[[1,L]]
+
+62. **Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+⟶ 注意:單純貝氏廣泛應用在文字分類和垃圾信件偵測上
+
+63. **Tree-based and ensemble methods**
+
+⟶ 基於樹狀結構的學習和整體學習
+
+64. **These methods can be used for both regression and classification problems.**
+
+⟶ 這些方法可以應用在迴歸或分類問題上
+
+65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**
+
+⟶ CART - 分類與迴歸樹 (CART),通常稱之為決策樹,可以被表示為二元樹。它的優點是具有高度的可解釋性。
+
+66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+⟶ 隨機森林 - 這是一個基於樹狀結構的方法,它使用大量經由隨機挑選的特徵所建構的決策樹。與單純的決策樹不同,它通常具有高度不可解釋性,但它的效能通常很好,所以是一個相當流行的演算法。
+
+67. **Remark: random forests are a type of ensemble methods.**
+
+⟶ 注意:隨機森林是一種整體學習方法
+
+68. **Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+⟶ 增強學習 (Boosting) - 增強學習方法的概念是結合數個弱學習模型來變成強學習模型。主要的分類如下:
+
+69. **[Adaptive boosting, Gradient boosting]**
+
+⟶ [自適應增強, 梯度增強]
+
+70. **High weights are put on errors to improve at the next boosting step**
+
+⟶ 在下一輪的提升步驟中,錯誤的部分會被賦予較高的權重
+
+71. **Weak learners trained on remaining errors**
+
+⟶ 弱學習器會針對剩下的誤差進行訓練
+
+72. **Other non-parametric approaches**
+
+⟶ 其他非參數方法
+
+73. **k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+⟶ k-最近鄰 - k-最近鄰演算法,又稱之為 k-NN,是一個非參數的方法,其中資料點的輸出是由訓練集中最近的 k 個鄰居決定。它可以用在分類和迴歸問題上。
+
+74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+⟶ 注意:參數 k 的值越大,偏差越大。k 的值越小,變異越大。
+
+75. **Learning Theory**
+
+⟶ 學習理論
+
+76. **Union bound ― Let A1,...,Ak be k events. We have:**
+
+⟶ 聯集上界 - 令 A1,...,Ak 為 k 個事件,我們有:
+
+77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+⟶ 霍夫丁不等式 - 令 Z1,..,Zm 為 m 個從參數 ϕ 的白努利分佈中抽出的獨立同分佈 (iid) 的變數。令 ˆϕ 為其樣本平均、固定 γ>0,我們可以得到:
+
+78. **Remark: this inequality is also known as the Chernoff bound.**
+
+⟶ 注意:這個不等式也被稱之為 Chernoff 界線
+
+79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+⟶ 訓練誤差 - 對於一個分類器 h,我們定義訓練誤差為 ˆϵ(h),也可以稱為經驗風險或經驗誤差。定義如下:
+
+80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
+
+⟶ 可能近似正確 (PAC) - PAC 是一個框架,許多學習理論的結果都是在這個框架下被證明的。它包含以下假設:
+
+81. **the training and testing sets follow the same distribution**
+
+⟶ 訓練和測試資料集具有相同的分佈
+
+82. **the training examples are drawn independently**
+
+⟶ 訓練樣本是獨立抽取的
+
+83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+⟶ 打散 (Shattering) - 給定一個集合 S={x(1),...,x(d)} 以及一組分類器的集合 H,如果對於任何一組標籤 {y(1),...,y(d)} 都滿足以下條件,我們就說 H 打散 (shatter) 了 S:
+
+84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+⟶ 上限定理 - 令 H 是一個有限假設類別,使 |H|=k 且令 δ 和樣本大小 m 固定,接著,在機率至少為 1−δ 的情況下,我們得到:
+
+85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+⟶ VC 維度 - 一個給定的無限假設類別 H 的 Vapnik-Chervonenkis (VC) 維度,記為 VC(H),指的是能被 H 打散的最大集合的大小
+
+86. **Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+⟶ 注意:H={2 維的線性分類器} 的 VC 維度為 3
+
+87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+⟶ 定理 (Vapnik) - 令 H 已給定,VC(H)=d 且 m 是訓練樣本的數量,在機率至少為 1−δ 的情況下,我們得到:
+
+88. **Known as Adaboost**
+
+⟶ 被稱為 Adaboost
diff --git a/zh/cheatsheet-unsupervised-learning.md b/zh-tw/cs-229-unsupervised-learning.md
similarity index 59%
rename from zh/cheatsheet-unsupervised-learning.md
rename to zh-tw/cs-229-unsupervised-learning.md
index 93708b826..0f6d5ee34 100644
--- a/zh/cheatsheet-unsupervised-learning.md
+++ b/zh-tw/cs-229-unsupervised-learning.md
@@ -1,339 +1,298 @@
1. **Unsupervised Learning cheatsheet**
⟶
-
+非監督式學習參考手冊
2. **Introduction to Unsupervised Learning**
⟶
-
+非監督式學習介紹
3. **Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
⟶
-
+動機 - 非監督式學習的目的是要找出未標籤資料 {x(1),...,x(m)} 之間的隱藏模式
4. **Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
⟶
-
+Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們可以得到底下這個不等式:
5. **Clustering**
⟶
-
+分群
6. **Expectation-Maximization**
⟶
-
+最大期望演算法
7. **Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
⟶
-
+潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數,這會讓問題的估計變得困難,我們通常使用 z 來代表它。底下是潛在變數的常見設定:
8. **[Setting, Latent variable z, Comments]**
⟶
-
+[設定, 潛在變數 z, 評論]
9. **[Mixture of k Gaussians, Factor analysis]**
⟶
-
+[k 元高斯模型, 因素分析]
10. **Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
⟶
-
+演算法 - 最大期望演算法 (EM Algorithm) 是一個透過最大概似估計來估計參數 θ 的高效率方法,它重複地建構概似函數的下界 (E-step) 並最佳化該下界 (M-step),如下:
11. **E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
⟶
-
+E-step: 評估每個資料點 x(i) 來自特定群集 z(i) 的後驗機率 Qi(z(i)),如下:
12. **M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
⟶
-
+M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在群集中特定的權重,用來分別重新估計每個群集,如下:
13. **[Gaussians initialization, Expectation step, Maximization step, Convergence]**
⟶
-
+[高斯分佈初始化, E-Step, M-Step, 收斂]
14. **k-means clustering**
⟶
-
+k-means 分群法
15. **We note c(i) the cluster of data point i and μj the center of cluster j.**
⟶
-
+我們使用 c(i) 表示資料 i 屬於某群,而 μj 則是群 j 的中心
16. **Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
⟶
-
+演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後,k-means 演算法重複以下步驟直到收斂:
17. **[Means initialization, Cluster assignment, Means update, Convergence]**
⟶
-
+[中心點初始化, 指定群集, 更新中心點, 收斂]
18. **Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
⟶
-
+畸變函數 - 為了確認演算法是否收斂,我們定義以下的畸變函數:
19. **Hierarchical clustering**
⟶
-
+階層式分群法
20. **Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**
⟶
-
+演算法 - 階層式分群法是一種採用聚合式階層架構的分群演算法,以逐步的方式建立巢狀的群集。
21. **Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:**
⟶
-
+類型 - 底下是幾種不同類型的階層式分群法,差別在於要最佳化的目標函式的不同,請參考底下:
22. **[Ward linkage, Average linkage, Complete linkage]**
⟶
-
+[Ward 鏈結距離, 平均鏈結距離, 完整鏈結距離]
23. **[Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**
⟶
-
+[最小化群內距離, 最小化各群彼此的平均距離, 最小化各群彼此的最大距離]
24. **Clustering assessment metrics**
⟶
-
+分群衡量指標
25. **In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
⟶
-
+在非監督式學習中,通常很難去評估一個模型的好壞,因為我們沒有擁有像在監督式學習任務中正確答案的標籤
26. **Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
⟶
-
+輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離,輪廓係數 s 對於此一樣本點的定義為:
27. **Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
⟶
-
+Calinski-Harabaz 指標 - 定義 k 是群集的數量,Bk 和 Wk 分別是群集之間和群內的離差矩陣 (dispersion matrices):
28. **the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
⟶
-
+Calinski-Harabaz 指標 s(k) 指出分群模型定義群集的好壞,此指標的值越高,代表群集越密集且分離得越好。定義如下:
29. **Dimension reduction**
⟶
-
+維度縮減
30. **Principal component analysis**
⟶
-
+主成份分析
31. **It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
⟶
-
+這是一個維度縮減的技巧,目的在於找到能使投影後資料變異數最大化的方向
32. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
⟶
-
+特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,我們說 λ 是 A 的特徵值,當存在一個特徵向量 z∈Rn∖{0},使得:
33. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
⟶
-
+譜定理 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以透過實正交矩陣 U∈Rn×n 對角化。令 Λ=diag(λ1,...,λn),我們得到:
34. **diagonal**
⟶
-
+對角線
35. **Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
⟶
-
+注意:與最大特徵值所關聯的特徵向量稱為 A 矩陣的主特徵向量
36. **Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**
⟶
-
+演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧,它會透過尋找資料最大變異的方式,將資料投影在 k 維空間上:
37. **Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
⟶
-
+第一步:正規化資料,讓資料平均為 0,標準差為 1
38. **Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
⟶
-
+第二步:計算 Σ=1mm∑i=1x(i)x(i)T∈Rn×n,它是具有實數特徵值的對稱矩陣
39. **Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
⟶
-
+第三步:計算 Σ 的 k 個正交主特徵向量 u1,...,uk∈Rn,即對應 k 個最大特徵值的正交特徵向量
40. **Step 4: Project the data on spanR(u1,...,uk).**
⟶
-
+第四步:將資料投影到 spanR(u1,...,uk)
41. **This procedure maximizes the variance among all k-dimensional spaces.**
⟶
-
+在所有 k 維空間中,這個過程能使變異數最大化
42. **[Data in feature space, Find principal components, Data in principal components space]**
⟶
-
+[資料在特徵空間, 尋找主成分, 資料在主成分空間]
43. **Independent component analysis**
⟶
-
+獨立成分分析
44. **It is a technique meant to find the underlying generating sources.**
⟶
-
+這是用來尋找潛在生成來源的技巧
45. **Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
⟶
-
+假設 - 我們假設資料 x 是從 n 維的來源向量 s=(s1,...,sn) 產生,其中 si 為獨立的隨機變數,透過一個非奇異的混合矩陣 A 產生如下:
46. **The goal is to find the unmixing matrix W=A−1.**
⟶
-
+目的在於找到一個 unmixing 矩陣 W=A−1
47. **Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
⟶
-
+Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟來找到 unmixing 矩陣:
48. **Write the probability of x=As=W−1s as:**
⟶
-
+將 x=As=W−1s 的機率表示如下:
49. **Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
⟶
-
+在給定訓練資料 {x(i),i∈[[1,m]]} 的情況下,令 g 為 sigmoid 函數,對數概似函數可以表示如下:
50. **Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
⟶
-
-
-
-51. **The Machine Learning cheatsheets are now available in Mandarin.**
-
-⟶
-
-
-
-52. **Original authors**
-
-⟶
-
-
-
-53. **Translated by X, Y and Z**
-
-⟶
-
-
-
-54. **Reviewed by X, Y and Z**
-
-⟶
-
-
-
-55. **[Introduction, Motivation, Jensen's inequality]**
-
-⟶
-
-
-
-56. **[Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-⟶
-
-
-
-57. **[Dimension reduction, PCA, ICA]**
-
-⟶
+因此,隨機梯度上升的學習規則是:對每個訓練樣本 x(i),我們透過以下方法來更新 W:
diff --git a/zh-tw/cs-230-convolutional-neural-networks.md b/zh-tw/cs-230-convolutional-neural-networks.md
new file mode 100644
index 000000000..87e24704a
--- /dev/null
+++ b/zh-tw/cs-230-convolutional-neural-networks.md
@@ -0,0 +1,715 @@
+**Convolutional Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)
+
+
+
+**1. Convolutional Neural Networks cheatsheet**
+
+⟶ 卷積神經網路參考手冊
+
+
+
+**2. CS 230 - Deep Learning**
+
+⟶ CS230 - 深度學習
+
+
+
+
+**3. [Overview, Architecture structure]**
+
+⟶ [概論, 架構結構]
+
+
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+⟶ [層的種類, 卷積, 池化, 全連接]
+
+
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+⟶ [卷積核超參數, 維度, 滑動間隔, 填充]
+
+
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+⟶ [調整超參數, 參數相容性, 模型複雜度, 感知區域]
+
+
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+⟶ [激發函數, 線性整流函數, 歸一化指數函數]
+
+
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+⟶ [物體偵測, 模型種類, 偵測, 交併比, 非最大值抑制, YOLO, 區域卷積神經網路]
+
+
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+⟶ [人臉驗證/辨別, 單樣本學習, 孿生網路, 三重損失函數]
+
+
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+⟶ [神經風格轉換, 激發, 風格矩陣, 風格/內容成本函數]
+
+
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+⟶ [計算架構手法, 生成對抗網路, 殘差網路, inception 網路]
+
+
+
+
+**12. Overview**
+
+⟶ 概論
+
+
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+⟶ 傳統卷積神經網路架構 - 卷積神經網路,簡稱為 CNNs,是一種特定類型的神經網路,通常由下列的層組成:
+
+
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+⟶ 卷積層和池化層可利用超參數來優化,詳細內容由下個部分敘述。
+
+
+
+
+**15. Types of layer**
+
+⟶ 層的種類
+
+
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+⟶ 卷積層 (CONV) - 卷積層利用卷積核沿著輸入 I 的維度進行掃描,並執行卷積運算。其超參數包含卷積核的尺寸 F 和滑動間隔 S。輸出 O 稱為特徵圖或激發圖。
+
+
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+⟶ 備註:卷積之運算亦可推廣為一維或三維。
+
+
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+⟶ 池化層 (POOL) - 池化層是一種降採樣運算,通常用於卷積層之後,可帶來一定程度的空間不變性。其中,最大池化與平均池化是特別的池化種類,分別選取最大值與平均值。
+
+
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+⟶ [種類, 目的, 圖示, 註解]
+
+
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+⟶ [最大池化層, 平均池化層, 每個池化計算該池中之最大值, 每個池化計算該池中平均值]
+
+
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+⟶ [保留偵測到之特徵, 最常使用, 降低特徵圖之採樣頻率, 於 LeNet 中使用]
+
+
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+⟶ 全連接層 (FC) - 全連接層之運作需要扁平的輸入,其中,所有的輸入數值與所有的神經元是全連接的。若有全連接層,通常位於卷積神經網路架構的末端,可用來最佳化類別分數等目標。
+
+
+
+
+**23. Filter hyperparameters**
+
+⟶ 卷積核的超參數
+
+
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.**
+
+⟶ 卷積層有卷積核,而了解其中超參數的意義是重要的。
+
+
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+⟶ 卷積核的維度 - 一個尺寸為 F×F 的卷積核,套用在有 C 個頻道的輸入,是一個維度為 F×F×C 的體,計算卷積於輸入維度為 I×I×C,輸出一個維度為 O×O×1 的特徵圖。
+
+
+
+
+**26. Filter**
+
+⟶ 卷積核
+
+
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+⟶ 備註:應用 K 個維度為 F×F 的卷積核會得到維度為 O×O×K 的特徵圖。
+
+
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+⟶ 滑動間隔 - 對卷積或池化的運算,滑動間隔 S 表示每次運算結束後,視窗移動的像素數量。
+
+
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+⟶ 零填充 - 零填充表示將 P 個 0 填充於輸入資料的邊緣。此數值可手動指定,或是透過以下三種模式自動設定。
+
+
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+⟶ [模式, 數值, 圖示, 用途, Valid, Same, Full]
+
+
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈IS⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+⟶ [無填充, 維度不相符則捨棄最後一個卷積, 填充使得特徵圖的維度為 ⌈IS⌉, 輸出維度是數學上方便的, 又稱為半填充, 最大的填充使終端的卷積運作於輸入之限度, 卷積核可端到端的「看到」整個輸入]
+
+
+
+
+**32. Tuning hyperparameters**
+
+⟶ 調整超參數
+
+
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+⟶ 卷積層中的參數相容性 - 令 I 為輸入的長度,F 為卷積核的長度,P 為零填充的數量,S 為滑動間隔,則特徵圖沿著該維度的輸出大小 O 可由下列公式得到:
+
+
+
+
+**34. [Input, Filter, Output]**
+
+⟶ [輸入, 卷積核, 輸出]
+
+
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+⟶ 備註:時常 Pstart=Pend≜P,此時我們可以將上式中的 Pstart+Pend 以 2P 取代。
+
+
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+⟶ 了解模型複雜度 - 為了評估模型的複雜度,我們時常計算其架構中含有的參數量。在卷積神經網路的某一層中,計算方式如下:
+
+
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+⟶ [圖示, 輸入維度, 輸出維度, 參數數量, 備註]
+
+
+
+
+**38. [One bias parameter per filter, In most cases, S
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+⟶ [池化運算以頻道為單位, 大部分來說 S=F]
+
+
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+⟶ [輸入需扁平化, 一個神經元一個偏差值, 全連接層中的神經元數量沒有結構限制]
+
+
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+⟶ 感知區域 - 第 k 層的感知區域是輸入中第 k 個激發圖的每個像素所能「看見」的範圍,表示為 Rk×Rk。設 Fj 為第 j 層卷積核的尺寸,Si 為第 i 層的滑動間隔,並約定 S0=1,則第 k 層的感知區域可由以下公式計算:
+
+
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+⟶ 以下範例中,F1=F2=3, S1=S2=1,因此 R2=1+2⋅1+2⋅1=5。
+
+
+
+
+**43. Commonly used activation functions**
+
+⟶ 常用的激發函數。
+
+
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+⟶ 線性整流函數 - 線性整流函數(ReLU)是一激發函數,可應用於所有體中的元素。用於增加非線性的性質到網路中。線性整流函數的變形如下:
+
+
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+⟶ [線性整流函數, 洩漏線性整流器, 指數性線性函數, 其中]
+
+
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+⟶ [非線性複雜度生物可解釋性, 處理線性整流函數抑制負數問題, 全區間可微分]
+
+
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+⟶ 歸一化指數函數 - 歸一化指數函數可被視為一廣義的邏輯函數,將一個分數的陣列 x∈Rn 輸出為一個機率的陣列 p∈Rn,用於網路架構的終端。定義為:
+
+
+
+
+**48. where**
+
+⟶ 其中
+
+
+
+
+**49. Object detection**
+
+⟶ 物體偵測
+
+
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+⟶ 模型種類 - 有三種主要的物體辨別演算法,差別在於預測的目的不同。敘述於以下表格:
+
+
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+⟶ [影像分類, 影像分類定位, 偵測]
+
+
+
+
+**52. [Teddy bear, Book]**
+
+⟶ [泰迪熊, 書]
+
+
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+⟶ [分類一張圖, 預測可能為一物件的機率, 偵測一張圖中的物件, 預測可能為一物件的機率與物件的位置, 偵測一張圖中的數個物件, 預測可能為一物件的機率與物件的位置]
+
+
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+⟶ [傳統的卷積神經網路, 簡化版 YOLO, 區域卷積神經網路, YOLO, 區域卷積神經網路]
+
+
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+⟶ 偵測 - 在物件偵測的情境中,依據我們只是想定位物件,或是想偵測影像中更複雜的形狀,會採用不同的方法。兩個主要的方法整理如下表:
+
+
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+⟶ [定界框偵測, 特徵點偵測]
+
+
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+⟶ [偵測影像中有包含物件的部分, 偵測一物件之形狀或特性(如:眼睛), 更精準]
+
+
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+⟶ [框的中心 (bx,by), 高 bh 與寬 bw, 參考點 (l1x,l1y), ..., (lnx,lny)]
+
+
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+⟶ 交併比 - 交併比,簡稱為 IoU,是一個用於評估定界框 Bp 預測位置與實際位置 Ba 比較正確性之函數。定義如下:
+
+
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+⟶ 備註:交併比介於 0 到 1 之間。一般來說,一個好的定界框該有 IoU(Bp,Ba)⩾0.5。
+
+
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+⟶ 錨框 - 錨框是一個用於預測重疊定界框的技術。實務上,網路可以同時預測多個定界框,而每個定界框的預測受限於一組給定的幾何性質。例如:第一個預測可能是某一給定形狀的矩形框,而第二個則是另一個幾何形狀不同的矩形框。
+
+
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+⟶ 非最大值抑制 - 非最大值抑制是一個用於移除同一物體之重複、重疊定界框的方法,並選取最具代表性的定界框。在去除預測機率小於 0.6 的定界框後,只要仍有定界框剩餘,就重複以下的步驟:
+
+
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+⟶ [給定一類別, 步驟一:選擇有最大機率的定界框, 步驟二:拋棄與前一步驟選取的定界框有 IoU⩾0.5 的定界框]
+
+
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+⟶ [定界框預測, 選擇有最大機率的定界框, 移除同類別且重疊的定界框, 最終的定界框]
+
+
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+⟶ YOLO - YOLO 是一個物體偵測演算法, 流程如下:
+
+
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+⟶ [步驟一:把輸入影像切成 G×G 個格子, 步驟二:對於每一個格子, 分別進行 CNN 的運算來預測以下所表示的 y:, 重複 k 次]
+
+
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+⟶ 其中, pc 為預測物體之機率, bx,by,bh,bw 為定界框的屬性, c1,...,cp 為 p 個偵測類別的一位有效編碼, k 為錨框的數量。
+
+
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+⟶ 步驟三: 計算非最大值抑制演算法來移除可能是重複、重疊的定界框。
+
+
+
+
+**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
+
+⟶ [原始影像, GxG 的格子, 定界框的預測, 非最大值抑制]
+
+
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+⟶ 備註:當 pc=0,代表網路沒有預測到任何物件。在這種情況下,相關的預測 bx,...,cp 可忽略。
+
+
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes.**
+
+⟶ 區域卷積神經網路 ― 區域卷積神經網路是一個物件偵測演算法, 先將一個影像分割以找尋可能的定界框, 再執行偵測的演算法來預測最可能出現在該定界框的物件。
+
+
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+⟶ [原始圖片, 分割, 定界框預測, 非最大值抑制]
+
+
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+⟶ 備註:即使原始的演算法耗費很多計算資源且速度慢,新提出的架構提供更快的演算法,例如快速型區域卷積神經網路與更快速型區域卷積神經網路。
+
+
+
+
+**74. Face verification and recognition**
+
+⟶ 人臉驗證與辨別
+
+
+
+
+**75. Types of models ― Two main types of model are summed up in table below:**
+
+⟶ 模型的種類 - 有兩種主要的模型種類,如下表:
+
+
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+⟶ [人臉驗證, 人臉辨別, 查詢, 對照, 資料庫]
+
+
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+⟶ [是否是正確的人?, 一對一查詢, 是否是K個存在資料庫中的其中一人?, 一對多查詢]
+
+
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+⟶ 單樣本學習 - 單樣本學習是一種人臉驗證演算法,使用有限的訓練資料集來學習一個相似度函數,用來量化兩影像之間的差異。應用於兩影像之間的相似度函數時常標示為 d (影像1、 影像2)。
+
+
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+⟶ 孿生網路 - 孿生網路之目的為學習如何將影像編碼,並用於後續量化兩影像之間的差異。給定一輸入影像 x(i), 編碼後的輸出標示為 f(x(i))。
+
+
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+⟶ 三重損失函數 - 三重損失函數 ℓ 是一個根據三張影像 A(錨點)、P(正向樣本)和 N(負向樣本)之嵌入表徵所計算的損失函數。錨點與正向樣本屬於同個類別,而負向樣本屬於另一個類別。指定 α∈R+ 為邊界參數,此損失函數定義為:
+
+
+
+
+**81. Neural style transfer**
+
+⟶ 神經風格轉換
+
+
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+⟶ 動機 - 神經風格轉換之目的為根據給定的內容 C 與風格 S,產生一張圖片 G。
+
+
+
+
+**83. [Content C, Style S, Generated image G]**
+
+⟶ [內容 C, 風格 S, 生成影像 G]
+
+
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+⟶ 激發 - 給定一層 l, 它的激發可表示為 a[l], 其維度為 nH×nw×nc。
+
+
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+⟶ 內容成本函數 - 內容成本函數 Jcontent(C,G) 用於計算生成影像 G 與內容影像 C 之間的差異。定義如下:
+
+
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+⟶ 風格矩陣 - 於第 l 層的風格矩陣 G[l] 是一個格拉姆矩陣,矩陣中的每個元素 G[l]kk′ 量化 k 與 k′ 頻道之間的相關程度。此矩陣透過激發值 a[l] 定義如下:
+
+
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+⟶ 備註:風格影像 S 與生成影像 G 的風格矩陣分別表示為 G[l] (S) 與 G[l] (G)。
+
+
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+⟶ 風格成本函數 - 風格成本函數 Jstyle(S,G) 用於評估生成影像 G 與風格 S 之差別。定義如下:
+
+
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+⟶ 總體成本函數 - 總體成本函數定義為內容成本函數與風格成本函數之組合,權重為 α, β 。
+
+
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+⟶ 備註:越高的 α 值會使模型會較注重於內容,而較高的 β 值會使模型較注重風格。
+
+
+
+
+**91. Architectures using computational tricks**
+
+⟶ 計算架構手法
+
+
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.**
+
+⟶ 生成對抗網路 - 生成對抗網路,簡稱為 GANs,是一個由生成網路與對抗網路所組成的模型,其中生成網路的目的為生成最貼近真實的輸出,並當作對抗網路之輸入,而對抗網路之目的為分辨輸入資料為真實或偽造。
+
+
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+⟶ [訓練資料, 雜訊, 真實影像, 生成網路, 對抗網路, 真實 偽造]
+
+
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+⟶ 備註:生成對抗網路不同種類的用途包括:由文字生成影像、生成或合成音樂等。
+
+
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+⟶ 殘差網路 - 殘差網路(ResNet) 利用殘差架構連接更高層以減少訓練誤差。殘差架構可表示為下式:
+
+
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+⟶ Inception 網路 - 此架構利用 inception 模組, 目的為嘗試不同的卷積運算, 透過特徵多樣化來提高模型的效能。特別的是, 此架構利用 1×1 卷積技術來限制計算負擔。
+
+
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+⟶ 深度學習參考手冊目前已有[目標語言]版。
+
+
+
+
+**98. Original authors**
+
+⟶ 原始作者
+
+
+
+
+**99. Translated by X, Y and Z**
+
+⟶ 由 X, Y 與 Z 翻譯
+
+
+
+
+**100. Reviewed by X, Y and Z**
+
+⟶ 由 X, Y 與 Z 檢閱
+
+
+
+
+**101. View PDF version on GitHub**
+
+⟶ 在 GitHub 上閱讀 PDF 版
+
+
+
+
+**102. By X and Y**
+
+⟶ X, Y
+
+
diff --git a/zh/cheatsheet-deep-learning.md b/zh/cheatsheet-deep-learning.md
deleted file mode 100644
index a7604ccc6..000000000
--- a/zh/cheatsheet-deep-learning.md
+++ /dev/null
@@ -1,321 +0,0 @@
-1. **Deep Learning cheatsheet**
-
-⟶
-
-
-
-2. **Neural Networks**
-
-⟶
-
-
-
-3. **Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
-
-⟶
-
-
-
-4. **Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
-
-⟶
-
-
-
-5. **[Input layer, hidden layer, output layer]**
-
-⟶
-
-
-
-6. **By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
-
-⟶
-
-
-
-7. **where we note w, b, z the weight, bias and output respectively.**
-
-⟶
-
-
-
-8. **Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
-
-⟶
-
-
-
-9. **[Sigmoid, Tanh, ReLU, Leaky ReLU]**
-
-⟶
-
-
-
-10. **Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-⟶
-
-
-
-11. **Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
-
-⟶
-
-
-
-12. **Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
-
-⟶
-
-
-
-13. **As a result, the weight is updated as follows:**
-
-⟶
-
-
-
-14. **Updating weights ― In a neural network, weights are updated as follows:**
-
-⟶
-
-
-
-15. **Step 1: Take a batch of training data.**
-
-⟶
-
-
-
-16. **Step 2: Perform forward propagation to obtain the corresponding loss.**
-
-⟶
-
-
-
-17. **Step 3: Backpropagate the loss to get the gradients.**
-
-⟶
-
-
-
-18. **Step 4: Use the gradients to update the weights of the network.**
-
-⟶
-
-
-
-19. **Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
-
-⟶
-
-
-
-20. **Convolutional Neural Networks**
-
-⟶
-
-
-
-21. **Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
-
-⟶
-
-
-
-22. **Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
-
-⟶
-
-
-
-23. **It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-⟶
-
-
-
-24. **Recurrent Neural Networks**
-
-⟶
-
-
-
-25. **Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
-
-⟶
-
-
-
-26. **[Input gate, forget gate, gate, output gate]**
-
-⟶
-
-
-
-27. **[Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
-
-⟶
-
-
-
-28. **LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
-
-⟶
-
-
-
-29. **Reinforcement Learning and Control**
-
-⟶
-
-
-
-30. **The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
-
-⟶
-
-
-
-31. **Definitions**
-
-⟶
-
-
-
-32. **Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
-
-⟶
-
-
-
-33. **S is the set of states**
-
-⟶
-
-
-
-34. **A is the set of actions**
-
-⟶
-
-
-
-35. **{Psa} are the state transition probabilities for s∈S and a∈A**
-
-⟶
-
-
-
-36. **γ∈[0,1[ is the discount factor**
-
-⟶
-
-
-
-37. **R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
-
-⟶
-
-
-
-38. **Policy ― A policy π is a function π:S⟶A that maps states to actions.**
-
-⟶
-
-
-
-39. **Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
-
-⟶
-
-
-
-40. **Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
-
-⟶
-
-
-
-41. **Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:**
-
-⟶
-
-
-
-42. **Remark: we note that the optimal policy π∗ for a given state s is such that:**
-
-⟶
-
-
-
-43. **Value iteration algorithm ― The value iteration algorithm is in two steps:**
-
-⟶
-
-
-
-44. **1) We initialize the value:**
-
-⟶
-
-
-
-45. **2) We iterate the value based on the values before:**
-
-⟶
-
-
-
-46. **Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
-
-⟶
-
-
-
-47. **times took action a in state s and got to s′**
-
-⟶
-
-
-
-48. **times took action a in state s**
-
-⟶
-
-
-
-49. **Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
-
-⟶
-
-
-
-50. **View PDF version on GitHub**
-
-⟶
-
-
-
-51. **[Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
-
-⟶
-
-
-
-52. **[Convolutional Neural Networks, Convolutional layer, Batch normalization]**
-
-⟶
-
-
-
-53. **[Recurrent Neural Networks, Gates, LSTM]**
-
-⟶
-
-
-
-54. **[Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
-
-⟶
diff --git a/zh/cheatsheet-supervised-learning.md b/zh/cs-229-supervised-learning.md
similarity index 100%
rename from zh/cheatsheet-supervised-learning.md
rename to zh/cs-229-supervised-learning.md