diff --git a/docs/_config.yml b/docs/_config.yml
index 2118ffcf3..8ea984046 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -719,6 +719,7 @@ bn:
     - path: bn/week02/02.md
       sections:
         - path: bn/week02/02-1.md
+        - path: bn/week02/02-2.md
 
 ################################## Portuguese ##################################
 pt:
diff --git a/docs/bn/week02/02-2.md b/docs/bn/week02/02-2.md
new file mode 100644
index 000000000..099fd08fc
--- /dev/null
+++ b/docs/bn/week02/02-2.md
@@ -0,0 +1,258 @@
+---
+lang-ref: ch.02-2
+lecturer: Yann LeCun
+title: NN মডিউলগুলোর জন্য গ্রেডিয়েন্ট ক্যালকুলেশন এবং ব্যাকপ্রোপাগেশন ব‍্যবহারের কৌশল
+authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan
+date: 3 Feb 2020
+lang: bn
+translation-date: 14 Dec 2020
+translator: Khalid Saifullah
+---
+
+
+<!-- ## [A concrete example of backpropagation and intro to basic neural network modules](https://www.youtube.com/watch?v=d9vdh3b787Y&t=2989s) -->
+## [ব্যাকপ্রোপাগেশন এর উদাহরণ এবং নিউরাল নেটওয়ার্ক এর সাধারণ কিছু মডিউল এর পরিচিতি](https://www.youtube.com/watch?v=d9vdh3b787Y&t=2989s)
+
+
+<!-- ### Example -->
+### উদাহরণ 
+
+<!-- We next consider a concrete example of backpropagation assisted by a visual graph. The arbitrary function $G(w)$ is input into the cost function $C$, which can be represented as a graph. Through the manipulation of multiplying the Jacobian matrices, we can transform this graph into the graph that will compute the gradients going backwards. (Note that PyTorch and TensorFlow do this automatically for the user, i.e. the forward graph is automatically "reversed" to create the derivative graph that backpropagates the gradient.) -->
+এরপর আমরা একটি ভিস্যুয়াল গ্রাফ এর সাহায্যে ব্যাকপ্রোপাগেশনের পূর্ণ উদাহরণ দেখবো।  আরবিট্রারি $G(w)$ ফাংশনটি কস্ট ফাংশন $C$ এর একটি ইনপুট যা একটি গ্রাফ হিসেবে রিপ্রেজেন্ট করা যায়। জ্যাকোবিয়ান ম্যাট্রিক্সের দ্বারা গুণের মাধ্যমে আমরা এই গ্রাফটিকে গ্রেডিয়েন্ট বের করার গ্রাফে পরিণত করতে পারি যা পিছনে (backword) ফ্লো করে। (জেনে রেখো, পাইটর্চ এবং টেনসরফ্লো ফ্রেমওয়ার্কগুলো ব্যবহারকারীদের জন্য এই কাজটি অটোমেটিকালি করে দেয়, অর্থাৎ, ফরওয়ার্ড গ্রাফটিকে অটোমেটিকালি উল্টিয়ে ডেরিভেটিভ গ্রাফ তৈরী করা হয় যেটি গ্রেডিয়েন্টগুলোকে ব্যাকপ্রোপাগেট করে।)
+
+<center><img src="{{site.baseurl}}/images/week02/02-2/02-2-1.png" alt="Gradient diagram" style="zoom:40%;" /></center>
+
+<!-- In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that -->
+এই উদাহরণটিতে, ডান দিকের সবুজ গ্রাফটি হচ্ছে গ্রেডিয়েন্ট গ্রাফ। গ্রাফটির সর্বোচ্চ নোড থেকে যদি আমরা গ্রেডিয়েন্ট ফ্লো টি লক্ষ্য করি
+
+$$
+\frac{\partial C(y,\bar{y})}{\partial w}=1 \cdot \frac{\partial C(y,\bar{y})}{\partial\bar{y}}\cdot\frac{\partial G(x,w)}{\partial w}
+$$
+
+<!-- In terms of dimensions, $\frac{\partial C(y,\bar{y})}{\partial w}$ is a row vector of size $1\times N$ where $N$ is the number of components of $w$; $\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$  is a row vector of size $1\times M$, where $M$ is the dimension of the output; $\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ is a matrix of size $M\times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$. -->
+ডাইমেনশন এর দিক থেকে $\frac{\partial C(y,\bar{y})}{\partial w}$ হচ্ছে একটি $1\times N$ সাইজের রো (row) ভেক্টর যেখানে $N$ হচ্ছে $w$ এর উপাদান (elememt) সংখ্যা। $\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$ হচ্ছে $1\times M$ সাইজের রো (row) ভেক্টর যেখানে $M$ আউটপুট এর ডাইমেনশন বুঝাচ্ছে। আর এখানে $\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ একটি ম্যাট্রিক্স যার সাইজ $M\times N$, যেখানে $M$ হচ্ছে $G$ এর আউটপুট সংখ্যা এবং $N$ হচ্ছে $w$ এর ডাইমেনশন।
+
+লক্ষ্য করো, গ্রাফটি যদি ডাটা নির্ভরশীল হয় এবং এর গঠন অপরিবর্তনীয় না হয় সেক্ষেত্রে কিছু জটিলতা সৃষ্টি হতে পারে। যেমন আমরা ইনপুট ভেক্টর এর সাইজের উপর ভিত্তি করে নিউরাল নেটওয়ার্ক এর মডিউল বাছাই করতে পারি। যদিও এটি করা সম্ভব, কিন্তু লুপ এর সংখ্যা অনেক বেশি বাড়ার সাথে সাথে এই পরিবর্তনশীল গ্রাফকে ম্যানেজ করা খুবই কঠিন হয়ে ওঠে।
+
+
+
+<!-- ### Basic neural net modules -->
+### নিউরাল নেটের সাধারণ কিছু মডিউল
+
+<!-- There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built by a combination of other, elementary modules). -->
+আমাদের পরিচিত লিনিয়ার এবং রেলু (ReLU) মডিউল ছাড়াও আরো অনেক ধরনের প্রিবিল্ট মডিউল রয়েছে। এই মডিউলগুলো ইউনিকলি অপ্টিমাইজড (অন্যান্য মৌলিক মডিউলের কম্বিনেশন দ্বারা তৈরী নয়), তাই এগুলো অনেক বেশি কার্যকরী।
+
+<!-- - Linear: $Y=W\cdot X$ -->
+- লিনিয়ার: $Y=W\cdot X$
+
+  $$
+  \begin{aligned}
+  \frac{dC}{dX} &= W^\top \cdot \frac{dC}{dY} \\
+  \frac{dC}{dW} &= \frac{dC}{dY} \cdot X^\top
+  \end{aligned}
+  $$
+
+<!-- - ReLU: $y=(x)^+$ -->
+- রেলু (ReLU): $y=(x)^+$
+
+  $$
+  \frac{dC}{dX} =
+      \begin{cases}
+        0 & x<0\\
+        \frac{dC}{dY} & \text{otherwise}
+      \end{cases}
+  $$
+
+<!-- - Duplicate: $Y_1=X$, $Y_2=X$ -->
+- ডুপ্লিকেট: $Y_1=X$, $Y_2=X$
+
+  <!-- - Akin to a "Y - splitter" where both outputs are equal to the input.
+
+  - When backpropagating, the gradients get summed
+
+  - Can be split into $n$ branches similarly -->
+  - একটি "Y - splitter" এর মতো, যেখানে উভয় আউটপুট ই ইনপুটের সমান।
+
+  - ব্যাকপ্রোপাগেশনের সময়, গ্রেডিয়েন্টগুলো যোগ হয়ে যায়।
+
+  - একইভাবে $n$ শাখায় ভাগ করা যায়।
+
+    $$
+    \frac{dC}{dX}=\frac{dC}{dY_1}+\frac{dC}{dY_2}
+    $$
+
+
+- যোগ (Add): $Y=X_1+X_2$
+
+  <!-- - With two variables being summed, when one is perturbed, the output will be perturbed by the same quantity, i.e. -->
+  - দুটি ভ্যারিয়েবল যোগ হওয়ায়, যখন একটিকে পার্টার্ব (পরিবর্তন) করা হয়, আউটপুটেও একই পরিমান পার্টার্ব (পরিবর্তন) লক্ষ্য করা যায়, অর্থাৎ
+
+    $$
+    \frac{dC}{dX_1}=\frac{dC}{dY}\cdot1 \quad \text{and}\quad \frac{dC}{dX_2}=\frac{dC}{dY}\cdot1
+    $$
+
+
+<!-- - Max: $Y=\max(X_1,X_2)$ -->
+- ম্যাক্স: $Y=\max(X_1,X_2)$
+
+  <!-- -  Since this function can also be represented as -->
+  - যেহেতু এই ফাংশনটিকে এভাবেও প্রকাশ করা যায়
+
+    $$
+    Y=\max(X_1,X_2)=\begin{cases}
+          X_1 & X_1 > X_2 \\
+          X_2 & \text{else}
+       \end{cases}
+    \Rightarrow
+    \frac{dY}{dX_1}=\begin{cases}
+          1 & X_1 > X_2 \\
+          0 & \text{else}
+       \end{cases}
+    $$
+
+  <!-- - Therefore, by the chain rule, -->
+  - সেহেতু, চেইন রুল দ্বারা আমরা পাই,
+
+    $$
+    \frac{dC}{dX_1}=\begin{cases}
+          \frac{dC}{dY}\cdot1 & X_1 > X_2 \\
+          0 & \text{else}
+      \end{cases}
+    $$
+
+
+<!-- ## [LogSoftMax *vs.* SoftMax](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3953s) -->
+## [লগ সফটম্যাক্স *বনাম* সফটম্যাক্স](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3953s)
+
+<!-- *SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between $0$ and $1$ that sum to one. These numbers can be interpreted as a probability distribution. As a result, it is commonly used in classification problems. $y_i$ in the equation below is a vector of probabilities for all the categories. -->
+*সফটম্যাক্স* একটি সহজ উপায় যার মাধ্যমে আমরা কিছু সংখ্যাকে $0$ থেকে $1$ এর মধ্যে  কিছু ধনাত্মক (পজিটিভ) সংখ্যায় রূপান্তর করতে পারি এবং এটি পাইটর্চ এর মডিউল হিসেবেও রয়েছে। এই রূপান্তরিত সংখ্যাগুলোকে প্রোবাবিলিটি ডিস্ট্রিবিউশন হিসেবে ধারণা করা যেতে পারে। যার ফলে, ক্লাসিফিকেশন প্রব্লেমগুলোতে অধিকাংশ সময় এটি ব্যবহৃত হয়। নিচের সমীকরণে $y_i$ ভেক্টরটি হচ্ছে সকল বিভাগের (ক্যাটাগরি) প্রোবাবিলিটি।
+
+$$
+y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
+$$
+
+<!-- However, the use of softmax leaves the network susceptible to vanishing gradients. Vanishing gradient is a problem, as it prevents weights downstream from being modified by the neural network, which may completely stop the neural network from further training. The logistic sigmoid function, which is the softmax function for one value, shows that when $s$ is large, $h(s)$ is $1$, and when s is small, $h(s)$ is $0$. Because the sigmoid function is flat at $h(s) = 0$ and $h(s) = 1$, the gradient is $0$, which results in a vanishing gradient. -->
+তবে, এই সফটম্যাক্স এর ব্যবহারের ফলে নেটওয়ার্কটি ভ্যানিশিং গ্রেডিয়েন্ট সমস্যার মুখোমুখি হতে পারে। ভ্যানিশিং গ্রেডিয়েন্ট সমস্যাটিতে দূরবর্তী ওয়েটগুলো নিউরাল নেটওয়ার্ক দ্বারা পরিবর্তীত হতে বাধাপ্রাপ্ত হয়, যার ফলে নিউরাল নেটওয়ার্ক এর ট্রেইনিং সম্পূর্ণরূপে বন্ধ হয়ে যেতে পারে। লজিস্টিক সিগমইড (logistic sigmoid) ফাংশন, যেটিকে একইভাবে একটি ভ্যালুর সফটম্যাক্স বলা যেতে পারে, সেটি যদি আমরা দেখি, তাহলে আমরা দেখতে পাই যখন $s$ বড়, তখন $h(s)$ এর মান $১$ এবং যখন $s$ ছোট, তখন $h(s)$ এর মান $০$। যেহেতু সিগমইড (sigmoid) ফাংশনটি $h(s) = ০$ এবং $h(s) = ১$ এ ফ্ল্যাট (সমতল), সেহেতু ঐ অংশে গ্রেডিয়েন্ট $০$ আর এই কারণেই ভ্যানিশিং গ্রেডিয়েন্ট সমস্যাটি দেখা দেয়।
+
+<center><img src="{{site.baseurl}}/images/week02/02-2/02-2-2.png" alt="Sigmoid function to illustrate vanishing gradient" style="background-color:#DCDCDC;" /></center>
+
+$$
+h(s) = \frac{1}{1 + \exp(-s)}
+$$
+
+সফটম্যাক্স এর এই ভ্যানিশিং গ্রেডিয়েন্ট সমস্যাটি দূর করতে গণিতবিদেরা লগ সফটম্যাক্স (logsoftmax) এর ধারণা বের করলেন। *লগ সফটম্যাক্স (LogSoftMax)* হচ্ছে পাইটর্চের আরেকটি সাধারণ মডিউল। যেমনটি নিচের সমীকরণটিতে আমরা দেখতে পাচ্ছি, *লগ সফটম্যাক্স (LogSoftMax)* লগ (log) এবং সফটম্যাক্স (softmax) এর একটি মিশ্রণ।
+
+$$
+\log(y_i )= \log\left(\frac{\exp(x_i)}{\Sigma_j \exp(x_j)}\right) = x_i - \log(\Sigma_j \exp(x_j))
+$$
+
+<!-- The equation below demonstrates another way to look at the same equation. The figure below shows the $\log(1 + \exp(s))$ part of the function. When $s$ is very small, the value is $0$, and when $s$ is very large, the value is $s$. As a result it doesn’t saturate, and the vanishing gradient problem is avoided. -->
+একই সমীকরণকে আমরা অন্যভাবে দেখতে পারি নিচের সমীকরণটি দ্বারা। নিচের চিত্রটি ফাংশনের $\log(1 + \exp(s))$ অংশটি দেখায়। যখন $s$ অনেক ছোট, তখন সবুজ রেখাটির মান $০$ আর যখন $s$ অনেক বড় তখন মান $s$। যার ফলে এটি স্যাচুরেট হয় না এবং ভ্যানিশিং গ্রেডিয়েন্ট এর সমস্যাটি এড়ানো সম্ভব হয়।
+
+$$
+\log\left(\frac{\exp(s)}{\exp(s) + 1}\right)= s - \log(1 + \exp(s))
+$$
+
+<center><img src="{{site.baseurl}}/images/week02/02-2/02-2-3.png" width='400px' alt="Plot of logarithmic part of the functions" /></center>
+
+
+<!-- ## [Practical tricks for backpropagation](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4891s) -->
+## [ব্যাকপ্রোপাগেশনের কিছু কৌশল](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4891s)
+
+
+<!-- ### Use ReLU as the non-linear activation function -->
+### নন-লিনিয়ার এক্টিভেশন ফাংশন হিসেবে রেলু (ReLU) ব্যবহার করবে
+
+<!-- ReLU works best for networks with many layers, which has caused alternatives like the sigmoid function and hyperbolic tangent $\tanh(\cdot)$ function to fall out of favour. The reason ReLU works best is likely due to its single kink which makes it scale equivariant. -->
+রেলু (ReLU) বহু লেয়ার বিশিষ্ট নেটওয়ার্ক এর জন্য সবচেয়ে ভালো কাজ করে, যার কারনে বিকল্প ফাংশনগুলো, যেমন, সিগময়েড ফাংশন এবং হাইপারবোলিক ট্যানজেন্ট $\tanh(\cdot)$ ফাংশন এর জনপ্রিয়তা হ্রাস পেয়েছে। রেলু (ReLU) এর সবচেয়ে ভালো কাজ করার কারনটি সম্ভবত এর সিঙ্গেল কিঙ্ক (একটি মাত্র ভাঁজ), যা এটাকে স্কেল ইকুইভ্যারিয়েন্ট করে তোলে।
+
+
+<!-- ### Use cross-entropy loss as the objective function for classification problems -->
+### ক্লাসিফিকেশন প্রব্লেমগুলোর অবজেক্টিভ ফাংশন হিসেবে ক্রস-এন্ট্রপি (cross-entropy) লস ব্যবহার করবে
+
+<!-- Log softmax, which we discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log* softmax as input (as opposed to normal softmax). -->
+লগ সফটম্যাক্স হচ্ছে ক্রস-এন্ট্রপি (cross-entropy) এরই একটি বিশেষ রূপ, যেটি আমরা লেকচারে আগেই আলোচনা করেছি। পাইটর্চে, ক্রস-এন্ট্রপি (cross-entropy) লস ফাংশনের ইনপুট হিসেবে *লগ * সফটম্যাক্স (সাধারণ সফটম্যাক্স এর বদলে) ইনপুট দেয়ার বিষয়টি অবশ্যই খেয়াল রাখবে।
+
+<!-- ### Use stochastic gradient descent on minibatches during training -->
+### ট্রেইনিং এর সময় মিনি ব্যাচগুলোর উপর স্টকাস্টিক গ্রেডিয়েন্ট ডিসেন্ট (SGD) ব্যবহার করবে
+
+<!-- As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient. -->
+পূর্বের আলোচনা অনুযায়ী, মিনি ব্যাচগুলো তোমাকে বেশী ইফিশিয়েন্টলি ট্রেইনিং এ সহায়তা করে যেহেতু ডাটা এর মধ্যে রিডান্ডেন্সি (অপ্রয়োজনীয়তা) বিদ্যমান; গ্রেডিয়েন্ট এস্টিমেট (অনুমান) এর জন্য তোমার প্রত্যেক ধাপে একেকটি স্যাম্পলের উপর প্রেডিকশন এবং লস ক্যালকুলেশনের দরকার নেই।
+
+
+<!-- ### Shuffle the order of the training examples when using stochastic gradient descent -->
+### স্টকাস্টিক গ্রেডিয়েন্ট ডিসেন্ট (SGD) ব্যবহারের সময় ট্রেইনিং এক্সাম্পলগুলোর বিন্যাসকে শাফল (ওলোটপালোট) করে নিবে
+
+<!-- Order matters. If the model sees only examples from a single class during each training step, then it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data was unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, *etc*. Ideally, you should have samples from every class in every minibatch. -->
+ট্রেইনিং এক্সাম্পলগুলোর অর্ডারটি (বিন্যাস) গুরুত্বপূর্ণ বিষয়। ট্রেইনিং এর প্রতিটি ধাপে যদি মডেল শুধুমাত্র একটি ক্লাসেরই এক্সাম্পল দেখে, তাহলে কোনো প্রকৃত কারণ না জেনে/শিখেই মডেলটি ঐ ক্লাসটি প্রেডিক্ট করবে। উদাহরণস্বরূপ, যদি তুমি MNIST ডেটাসেটের ডিজিটগুলোকে ক্লাসিফাই করতে চাও এবং ডেটাগুলো আনশাফল্ড (ওলোটপালোট করা হয়নি এমন), শেষের লেয়ারের বায়াস প্যারামিটারগুলো সবসময় শুধু 0 প্রেডিক্ট করবে, এরপর সবসময় শুধু ওয়ান 1 প্রেডিক্ট করায় মানিয়ে নিবে, এবং একইভাবে এরপর 2 *ইত্যাদি*। সবচেয়ে ভালো হচ্ছে, প্রতিটি মিনি ব্যাচে সবগুলো ক্লাসেরই কিছু স্যাম্পল রাখা।
+
+<!-- However, there's ongoing debate over whether you need to change the order of the samples in every pass (epoch). -->
+তবে প্রতি ইপকে (epoch) তোমার স্যাম্পলগুলোর অর্ডার (বিন্যাস) পরিবর্তন করতে হবে কিনা তা নিয়ে একটি বিতর্ক চলমান রয়েছে।
+
+
+<!-- ### Normalize the inputs to have zero mean and unit variance -->
+### ইনপুটগুলোকে নর্মালাইজ করে নাও যাতে তা জিরো মিন এবং ইউনিট ভ্যারিয়েন্স হয়
+
+<!-- Before training, it's useful to normalize each input feature so that it has a mean of zero and a standard deviation of one. When using RGB image data, it is common to take mean and standard deviation of each channel individually and normalize the image channel-wise. For example, take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values for each individual image as -->
+ট্রেইনিং এর পূর্বে, প্রতিটি ইনপুট ফিচারকে নর্মালাইজ করা ভালো যাতে সেগুলোর মিন (mean) জিরো (0) এবং স্ট্যান্ডার্ড ডেভিয়েশন (STD) ওয়ান (1) হয়। আরজিবি (RGB) ছবির ডেটা ব্যবহারের সময়, প্রতিটি চ্যানেলের মিন (mean) এবং স্ট্যান্ডার্ড ডেভিয়েশন (STD) আলাদাভাবে নিয়ে, চ্যানেল এর সাপেক্ষে ছবিটিকে নর্মালাইজ করা খুবই প্রচলিত একটি ব্যাপার। যেমন, ডেটাসেটের সকল নীল মানের (blue values) জন্য মিন $m_b$ এবং স্ট্যান্ডার্ড ডেভিয়েশন $\sigma_b$ বের কর, এরপর প্রতিটি ছবির নীল মানগুলোকে এভাবে নর্মালাইজ কর
+
+$$
+b_{[i,j]}^{'} = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)}
+$$
+
+<!-- where $\epsilon$ is an arbitrarily small number that we use to avoid division by zero. Repeat the same for green and red channels. This is necessary to get a meaningful signal out of images taken in different lighting; for example, day lit pictures have a lot of red while underwater pictures have almost none. -->
+যেখানে $\epsilon$ হচ্ছে একটি ছোট সংখ্যা যেটি আমাদেরকে শূন্য দ্বারা বিভাজনে বাঁধা দেয়। একই ধাপগুলোর পূনরাবৃত্তি কর সবুজ এবং নীল চ্যানেলের জন্য। ভিন্ন ভিন্ন আলোয় তোলা ছবিগুলো থেকে অর্থপূর্ণ সিগন্যাল পাওয়ার ক্ষেত্রে এই কাজ টি করা আবশ্যক। যেমন, দিনের বেলায় তোলা ছবিগুলো অনেক বেশি লাল হয় অথচ আন্ডারওয়াটার (পানির নিচে) তোলা ছবিগুলোতে লাল নেই বললেই চলে।
+
+
+<!-- ### Use a schedule to decrease the learning rate -->
+### লার্নিং রেট (lr) কমানোর জন্য শিডিউল ব্যবহার করবে
+
+<!-- The learning rate should fall as training goes on. In practice, most advanced models are trained by using algorithms like Adam which adapt the learning rate instead of simple SGD with a constant learning rate. -->
+ট্রেইনিং অগ্রসর হওয়ার সময় লার্নিং রেট (lr) কমা উচিৎ। প্রয়োগের ক্ষেত্রে, বেশীর ভাগ উন্নত মডেলগুলোই এডাম (Adam) এর মতো এলগরিদমের মাধ্যমে ট্রেইন করা হয়, যা SGD এর মতো ধ্রুবক (কনস্ট্যান্ট) লার্নিং রেট (lr) ব্যবহার না করে এটাকে বরং প্রয়োজন মতো মানিয়ে নেয়।
+
+
+<!-- ### Use L1 and/or L2 regularization for weight decay -->
+### ওয়েট ডিকে (weight decay) এর জন্য L1 এবং/অথবা L2 রেগুলারাইজেশন ব্যবহার করবে
+
+<!-- You can add a cost for large weights to the cost function. For example, using L2 regularization, we would define the loss $L$ and update the weights $w$ as follows: -->
+বড় ওয়েটগুলোর জন্য তুমি কস্ট ফাংশনের মান বাড়াতে পারো। উদাহনস্বরূপ, L2 রেগুলারাইজেশন ব্যবহার করলে, আমরা আমাদের লস কে $L$ দিয়ে প্রকাশ করবো এবং $w$ এর মান নিম্নোক্ত উপায়ে পরিবর্তন করবো:
+
+$$
+L(S, w) = C(S, w) + \alpha \Vert w \Vert^2\\
+\frac{\partial R}{\partial w_i} = 2w_i\\
+w_i = w_i - \eta\frac{\partial L}{\partial w_i} = w_i - \eta \left( \frac{\partial C}{\partial w_i} + 2 \alpha w_i \right)
+$$
+
+<!-- To understand why this is called weight decay, note that we can rewrite the above formula to show that we multiply $w_i$ by a constant less than one during the update. -->
+এটাকে কেন ওয়েট ডিকে (weight decay) বলা হয়, সেটি যদি বুঝতে চাও তবে লক্ষ্য করো, আমরা উপরের সমীকরণটিকে অন্যভাবে লিখতে পারি এবং আমরা দেখতে পাই প্রতিবার $w$ এর মান পরিবর্তনের সময়, $w_i$ কে একের থেকে ছোট মানের সাথে গুণ করা হয়।
+
+$$
+w_i = (1 - 2 \eta \alpha) w_i - \eta\frac{\partial C}{\partial w_i}
+$$
+
+<!-- L1 regularization (Lasso) is similar, except that we use $\sum_i \vert w_i\vert$ instead of $\Vert w \Vert^2$. -->
+L1 রেগুলারাইজেশন (lasso) অনেকটা একই, শুধুমাত্র পার্থক্য হচ্ছে $\Vert w \Vert^2$ এর বদলে আমরা $\sum_i \vert w_i\vert$ ব্যবহার করি।
+
+<!-- Essentially, regularization tries to tell the system to minimize the cost function with the shortest weight vector possible. With L1 regularization, weights that are not useful are shrunk to $0$. -->
+মূলত, রেগুলারাইজেশন চেষ্টা করে সবচেয়ে ছোট ওয়েট ভেক্টরটিকে খুঁজে বের করার যেটি কস্টকে হ্রাস করে। L1 রেগুলারাইজেশন (lasso) এ অপ্রয়োজনীয় ওয়েটগুলো $0$ তে হ্রস্বীভূত হয়ে যায়।
+
+
+<!-- ### Weight initialisation -->
+### ওয়েট ইনিশিয়ালাইজেশন 
+
+<!-- The weights need to be initialised at random, however, they shouldn't be too large or too small such that output is roughly of the same variance as that of input. There are various weight initialisation tricks built into PyTorch. One of the tricks that works well for deep models is Kaiming initialisation where the variance of the weights is inversely proportional to square root of number of inputs. -->
+ওয়েটগুলোকে আমাদের রেন্ডমভাবে ইনিশিয়ালাইজ করতে হবে, তবে সেগুলোর মান যাতে অনেক বেশি বড় অথবা অনেক বেশি ছোট না হয়ে যায় সেটি খেয়াল রাখতে হবে যাতে করে ইনপুট এবং আউটপুটের মধ্যকার পার্থক্য (ভ্যারিয়েন্স) মোটামোটি সমান থাকে। ওয়েট ইনিশিয়ালাইজেশনের অনেক ধরণের ট্রিক্স পাইটর্চে আগে থেকেই তৈরী করা আছে। তার মধ্যে একটি যেটি ডীপ মডেলগুলোর জন্য ভালো কাজ করে সেটি হচ্ছে Kaiming initialisation, যেখানে ওয়েটগুলোর ভ্যারিয়েন্স ইনপুট সংখ্যার বর্গমূলের (স্কয়ার রুট) ব্যস্তানুপাতিক (ইনভার্সলি প্রোপোর্শনাল)।
+
+
+<!-- ### Use dropout -->
+### ড্রপআউট ব্যবহার করবে
+
+<!-- Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all input units rather than becoming overly reliant on a small number of input units thus distributing the information across all of the units in a layer. This method was initially proposed by <a href="https://arxiv.org/abs/1207.0580">Hinton et al (2012)</a>. -->
+ড্রপআউট হচ্ছে এক ধরণের রেগুলারাইজেশন। এটাকে নিউরাল নেটেরই লেয়ার হিসেবে ভাবা যেতে পারে: এটি ইনপুট নেয়, এরপর রেন্ডমলি $n/2$ টি ইনপুটকে শূন্য বানিয়ে দেয় এবং বাকি ইনপুটগুলোকে আউটপুট হিসেবে পাঠিয়ে দেয়। এটি সিস্টেমকে কিছু সংখ্যক ইনপুটের উপর অনেক বেশি নির্ভরশীল হওয়াতে বাধা দেয় বরং বাধ্য করে যেন সকল ইনপুট ইউনিটগুলো থেকেই তথ্য সংগ্রহ করে, ফলে প্রতি লেয়ারের সবগুলো ইউনিটেই তথ্য সঠিকভাবে বিন্যস্ত (ডিস্ট্রিবিউট) হয়। এই পদ্ধতিটি প্রথম <a href="https://arxiv.org/abs/1207.0580">Hinton et al (2012)</a> প্রস্তাবিত করে।
+
+<!-- For more tricks, see  <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">LeCun et al 1998</a>. -->
+আরও ট্রিক্স এর জন্য এটি দেখতে পারো <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">LeCun et al 1998</a>।
+
+<!-- Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules. -->
+অবশেষে, মনে রেখো ব্যাকপ্রোপাগেশন শুধু স্ট্যাক মডেলগুলোর জন্যই কাজ করে এমনটি নয়, এটি যেকোনো ডিরেক্টেড এসাইক্লিক গ্রাফের (DAG) জন্য প্রযোজ্য যদি মডিউলগুলোর পার্শিয়াল অর্ডার থাকে।
+