The aim of this exercise is to implement the back-propagation algorithm. There are many ready-made tools that do this already but here we aim to implement the algorithm using only the linear algebra and other mathematics tools available in numpy and scipy. This way you get an insight into how the algorithm works before you move on to use it in larger toolboxes.
We will restrict ourselves to fully-connected feed forward neural networks with one hidden layer (plus an input and an output layer).
We will use the following nonlinear activation function:
We will also need the derivative of this function:
Create these two functions:
- The sigmoid function:
sigmoid(x)
- The derivative of the sigmoid function:
d_sigmoid(x)
Note: To avoid overflows, make sure that inside sigmoid(x)
you check if x<-100
and return 0.0
in that case.
Example inputs and outputs:
sigmoid(0.5)
->0.6224593312018546
d_sigmoid(0.2)
->0.24751657271185995
A perceptron takes in an array of inputs
Implement the function perceptron(x, w)
that returns the weighted sum and the output of the activation function.
Example inputs and outputs:
perceptron(np.array([1.0, 2.3, 1.9]),np.array([0.2,0.3,0.1]))
->(1.0799999999999998, 0.7464939833376621)
perceptron(np.array([0.2,0.4]),np.array([0.1,0.4]))
->(0.18000000000000005, 0.5448788923735801)
When we have the sigmoid and the perceptron function, we can start to implement the neural network.
Implement a function ffnn
which computes the output and hidden layer variables for a single hidden layer feed-forward neural network. If the number of inputs is
Write a function y, z0, z1, a1, a2 = ffnn(x, M, K, W1, W2)
where:
-
x
is the input pattern of$(1\times D)$ dimensions (a line vector) -
W1
is a$((D+1)\times M)$ matrix andW2
is a$(M+1)\times K$ matrix. (the+1
are for the bias weights) -
a1
is the input vector of the hidden layer of size$(1\times M)$ (needed for backprop). -
a2
is the input vector of the output layer of size$(1\times K)$ (needed for backprop). -
z0
is the input pattern of size$(1\times (D+1))$ , (this is justx
with1.0
inserted at the beginning to match the bias weight). -
z1
is the output vector of the hidden layer of size$(1\times (M+1))$ (needed for backprop). -
y
is the output of the neural network of size$(1\times K)$ .
Example inputs and outputs:
First load the iris data:
features, targets, classes = load_iris()
(train_features, train_targets), (test_features, test_targets) = \
split_train_test(features, targets)
Then call the function
# Take one point:
x = train_features[0, :]
K = 3 # number of classes
M = 10
# Initialize two random weight matrices
W1 = 2 * np.random.rand(D + 1, M) - 1
W2 = 2 * np.random.rand(M + 1, K) - 1
y, z0, z1, a1, a2 = ffnn(x, M, K, W1, W2)
Outputs:
y
:[0.47439547 0.73158513 0.95244346]
z0
:[1. 5. 3.4 1.6 0.4]
z1
:[1. 0.94226108 0.97838733 0.21444943 0.91094513 0.14620877 0.94492407 0.57909676 0.88187859 0.99826648 0.04362534]
a1
:[ 2.79235088 3.81262569 -1.29831091 2.32522998 -1.76465113 2.84239174 0.31906659 2.01034143 6.3558639 -3.08751164]
a2
:[-0.1025078 1.00267978 2.99711137]
We will now implement the back-propagation algorithm to evaluate the gradient of the error function
Create a function y, dE1, dE2 = backprop(x, target_y, M, K, W1, W2)
where:
-
x, M, K, W1
andW2
are the same as for theffnn
function -
target_y
is the target vector. In our case (i.e. for the classification of Iris) this will be a vector with 3 elements, with one element equal to 1.0 and the others equal to 0.0. (*). -
y
is the output of the output layer (vector with 3 elements) -
dE1
anddE2
are the gradient error matrices that contain$\frac{\partial E_n}{\partial w_{ji}}$ for the first and second layers.
Assume sigmoid hidden and output activation functions and assume cross-entropy error function (for classification). Notice that
The inner working of your backprop
function should follow this order of actions:
- run
ffnn
on the input. - calculate
$\delta_k = y_k - target_y_k$ - calculate
$\delta_j = (\frac{d}{da}\sigma(a1_j)) \sum_{k} w_{k,j+1}\delta_k$ (the+1
is because of the bias weights) - initialize
dE1
anddE1
as zero-matrices with the same shape asW1
andW2
- calculate
dE1_{i,j} = \delta_j z0_i
anddE2_{j,k} = \delta_k z1_j
Example inputs and outputs:
Call the function
K = 3 # number of classes
M = 6
D = train_features.shape[1]
x = features[0, :]
# create one-hot target for the feature
target_y = np.zeros(K)
target_y[targets[0]] = 1.0
np.random.seed(42)
# Initialize two random weight matrices
W1 = 2 * np.random.rand(D + 1, M) - 1
W2 = 2 * np.random.rand(M + 1, K) - 1
y, dE1, dE2 = backprop(x, target_y, M, K, W1, W2)
Output
y
:[0.42629045 0.1218163 0.56840796]
dE1
:
[[-3.17372897e-03 3.13040504e-02 -6.72419861e-03 7.39219402e-02
-1.16539047e-04 9.29566482e-03]
[-1.61860177e-02 1.59650657e-01 -3.42934129e-02 3.77001895e-01
-5.94349138e-04 4.74078906e-02]
[-1.11080514e-02 1.09564176e-01 -2.35346951e-02 2.58726791e-01
-4.07886663e-04 3.25348269e-02]
[-4.44322055e-03 4.38256706e-02 -9.41387805e-03 1.03490716e-01
-1.63154665e-04 1.30139307e-02]
[-6.34745793e-04 6.26081008e-03 -1.34483972e-03 1.47843880e-02
-2.33078093e-05 1.85913296e-03]]
dE2
:
[[-5.73709549e-01 1.21816299e-01 5.68407958e-01]
[-3.82317044e-02 8.11777445e-03 3.78784091e-02]
[-5.13977514e-01 1.09133338e-01 5.09227901e-01]
[-2.11392026e-01 4.48850716e-02 2.09438574e-01]
[-1.65803375e-01 3.52051896e-02 1.64271203e-01]
[-3.19254175e-04 6.77875452e-05 3.16303980e-04]
[-5.60171752e-01 1.18941805e-01 5.54995262e-01]]
(*): This is referred to as one-hot encoding. If we have e.g. 3 possible classes:
yellow
,green
andblue
, we could assign the following label mapping:yellow: 0
,green: 1
andblue: 2
. Then we could encode these labels as: $$ \text{yellow} = \begin{bmatrix}1\0\0\end{bmatrix}, \text{green} = \begin{bmatrix}0\1\0\end{bmatrix}, \text{blue} = \begin{bmatrix}0\0\1\end{bmatrix}, $$ But why would we choose to do this instead of just using $0, 1, 2$ ? The reason is simple, using ordinal categorical label injects assumptions into the network that we want to avoid. The network might assume thatyellow: 0
is more different fromblue: 2
thangreen: 1
because the difference in the labels is greater. We want our neural networks to output probability distributions over classes meaning that the output of the network might look something like: $$ \text{NN}(x) = \begin{bmatrix}0.32\0.03\0.65\end{bmatrix} $$ From this we can directly make a prediction,0.65
is highest so the model is most confident in that the input feature corresponds to theblue
label
We are now ready to train the network. Training consists of:
- forward propagating an input feature through the network
- Calculate the error between the prediction the network made and the actual target
- Back-propagating the error through the network to adjust the weights.
Write a function called W1tr, W2tr, E_total, misclassification_rate, guesses = train_nn(X_train, t_train, M, K, W1, W2, iterations, eta)
where
Inputs:
X_train
andt_train
are the training data and the target valuesM, K, W1, W2
are defined as aboveiterations
is the number of iterations the training should take, i.e. how often we should update the weightseta
is the learning rate.
Outputs:
W1tr
,W2tr
are the updated weight matricesE_total
is an array that contains the error after each iteration.misclassification_rate
is an array that contains the misclassification rate after each iterationguesses
is the result from the last iteration, i.e. what the network is guessing for the input datasetX_train
.
The inner working of your train_nn
function should follow this order of actions:
- Initialize necessary variables
- Run a loop for
iterations
iterations. - In each iteration we will collect the gradient error matrices for each data point. Start by initializing
dE1_total
anddE2_total
as zero matrices with the same shape asW1
andW2
respectively. - Run a loop over all the data points in
X_train
. In each iteration we call backprop to get the gradient error matrices and the output values. - Once we have collected the error gradient matrices for all the data points, we adjust the weights in
W1
andW2
, usingW1 = W1 + eta* dE1_total / N
whereN
is the number of data points inX_train
(and similarly forW2
). - For the error estimation we'll use the cross-entropy error function, (Eq. 4.90 in Bishop).
- When the outer loop finishes, we return from the function
Example inputs and outputs:
Call the function:
K = 3 # number of classes
M = 6
D = train_features.shape[1]
np.random.seed(42)
# Initialize two random weight matrices
W1 = 2 * np.random.rand(D + 1, M) - 1
W2 = 2 * np.random.rand(M + 1, K) - 1
W1tr, W2tr, Etotal, misclassification_rate, last_guesses = train_nn(
train_features[:20, :], train_targets[:20], M, K, W1, W2, 500, 0.3)
Output
W1tr
:
[[ 0.38483561 -0.76090605 0.65776134 -0.11972904 -0.68585273 -0.6963981 ]
...
[-1.93490158 5.40360714 -1.49729944 1.59882456 0.18374103 -0.70066333]]
W2tr
:
[[-0.27112102 -1.2313577 -1.94162633]
...
[-0.25543656 -1.25570954 -0.19962407]]
Etotal
:
[2.54201424 2.23668767 2.00342554 1.90063241 1.88144908 1.87174812
...
0.34689913 0.44249507 0.34517052 0.44083018 0.34346007 0.43918565
0.34176746 0.43756094]
misclassification_rate
:
[0.7 0.7 0.7 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55
...
0. 0.1 0. 0.1 0. 0.1 0. 0.1 0. 0.1 ]
last_guesses
:
[0. 2. 1. 1. 1. 2. 0. 2. 2. 1. 2. 2. 0. 1. 1. 2. 0. 2. 1. 0.]
Write a function guesses = test_nn(X_test, M, K, W1, W2)
where
-
X_test
: the dataset that we want to test. -
M
: size of the hidden layer. -
K
: size of the output layer. -
W1, W2
: fully trained weight matrices, i.e. the results from using thetrain_nn
function. -
guesses
: the classification for all the data points in the test setXtest
. This should be a$(1\times N)$ vector where$N$ is the number of data points inX_test
.
The function should run through all the data points in X_test
and use the ffnn
function to guess the classification of each point.
You should answer this question via Mimir Now train your network and test it on the Iris dataset. Use 80% (as usual) for training.
- Calculate the accuracy
- produce a confusion matrix for your test features and test predictions
- Plot the
E_total
as a function of iterations from yourtrain_nn
function. - Plot the
misclassification_rate
as a function of iterations from yourtrain_nn
function.
Present this and relevant discussion in a single PDF document.
In this assignment we have created a naive neural network. This network could be changed in small ways that might have meaningful impact on performance. To name a few:
- The size of the hidden layer
- The learning rate
- The initial values for the weights
Change the network in some way and compare the accuracy of your network for different configurations (i.e. if you decide to change the size of the hidden layer, a graph of test accuracy as a function of layer size would be ideal).