eng/Lecture 6 _ Training Neural Networks I.srt

1
00:00:10,729 --> 00:00:12,896
- Okay, let's get started.

2
00:00:16,381 --> 00:00:21,529
Okay, so today we're going to get into some of
the details about how we train neural networks.

3
00:00:23,166 --> 00:00:28,785
So, some administrative details first.
Assignment 1 is due today, Thursday,

4
00:00:28,785 --> 00:00:36,521
so 11:59 p.m. tonight on Canvas. We're also
going to be releasing Assignment 2 today,

5
00:00:36,521 --> 00:00:40,082
and then your project proposals
are due Tuesday, April 25th.

6
00:00:40,082 --> 00:00:46,591
So you should be really starting to think about
your projects now if you haven't already.

7
00:00:46,591 --> 00:00:54,804
How many people have decided what they want to do for
their project so far? Okay, so some, some people,

8
00:00:54,804 --> 00:01:03,937
so yeah, everyone else, you can go to TA office hours
if you want suggestions and bounce ideas off of TAs.

9
00:01:05,657 --> 00:01:18,121
We also have a list of projects that other people have proposed, usually people
affiliated with Stanford, on Piazza, so you can take a look at those for additional ideas.

10
00:01:19,604 --> 00:01:28,004
And we also have some notes on backprop for a linear layer and
on vector and tensor derivatives that Justin's written up,

11
00:01:28,004 --> 00:01:33,964
so that should help with understanding how exactly
backprop works for vectors and matrices.

12
00:01:33,964 --> 00:01:40,484
So these are linked to lecture four on the
syllabus and you can go and take a look at those.

13
00:01:45,110 --> 00:01:57,124
Okay, so where we are now. We've talked about how to express a function in terms of a
computational graph, that we can represent any function in terms of a computational graph.

14
00:01:57,124 --> 00:02:03,751
And we've talked more explicitly about neural networks,
which is a type of graph where we have these linear layers

15
00:02:03,751 --> 00:02:08,360
that we stack on top of each other
with nonlinearities in between.

16
00:02:09,456 --> 00:02:13,360
And we've also talked last lecture
about convolutional neural networks,

17
00:02:13,360 --> 00:02:24,936
which are a particular type of network that uses convolutional layers to
preserve the spatial structure throughout all of the hierarchy of the network.

18
00:02:24,936 --> 00:02:38,056
And so we saw exactly how a convolution layer looked, where each activation map in the convolutional
layer output is produced by sliding a filter of weights over all of the spatial locations in the input.

19
00:02:38,056 --> 00:02:45,456
And we also saw that usually we can have many filters per
layer, each of which produces a separate activation map.

20
00:02:45,456 --> 00:02:50,655
And so what we get is that, from an input with a
certain depth, we'll get an activation map output,

21
00:02:50,655 --> 00:02:58,771
which has some spatial dimension that's preserved, and whose
depth is the total number of filters that we have in that layer.

22
00:02:59,695 --> 00:03:05,895
And so what we want to do is we want to learn the
values of all of these weights or parameters,

23
00:03:05,895 --> 00:03:12,507
and we saw that we can learn our network parameters through optimization,
which we talked about a little bit earlier in the course, right?

24
00:03:12,507 --> 00:03:17,254
And so we want to get to a point in the
loss landscape that produces a low loss,

25
00:03:17,254 --> 00:03:23,053
and we can do this by taking steps
in the direction of the negative gradient.

26
00:03:23,053 --> 00:03:27,614
And so the whole process we actually call
Mini-batch Stochastic Gradient Descent,

27
00:03:27,614 --> 00:03:38,585
where the steps are that, repeatedly, we sample a batch of data. We forward prop
it through our computational graph or our neural network. We get the loss at the end.

28
00:03:38,585 --> 00:03:41,960
We backprop through our network
to calculate the gradients.

29
00:03:41,960 --> 00:03:47,986
And then we update the parameters or the
weights in our network using this gradient.
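A minimal sketch of the loop just described, on a hypothetical toy problem (fitting y = 3x + 1 with a one-weight linear model). The data, learning rate, and batch size are assumptions for illustration, not values from the lecture:
```python
import random
random.seed(0)
# Hypothetical toy dataset: points on the line y = 3x + 1.
data = [(i / 100.0, 3.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1      # assumed value
batch_size = 10          # assumed value
def mean_loss(w, b):
    """Mean squared error over the whole toy dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)
initial_loss = mean_loss(w, b)
for step in range(2000):
    batch = random.sample(data, batch_size)     # 1. sample a batch of data
    grad_w, grad_b = 0.0, 0.0
    for x, y in batch:
        err = (w * x + b) - y                   # 2. forward pass; loss is err**2
        grad_w += 2.0 * err * x / batch_size    # 3. backprop: dL/dw
        grad_b += 2.0 * err / batch_size        #    backprop: dL/db
    w -= learning_rate * grad_w                 # 4. step along the negative gradient
    b -= learning_rate * grad_b
final_loss = mean_loss(w, b)
print(f"loss {initial_loss:.4f} -> {final_loss:.6f}, w = {w:.3f}, b = {b:.3f}")
```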

30
00:03:49,980 --> 00:03:58,321
Okay, so now for the next couple of lectures we're going to talk
about some of the details involved in training neural networks.

31
00:03:58,321 --> 00:04:02,441
And so this involves things like how do we
set up our neural network at the beginning,

32
00:04:02,441 --> 00:04:11,015
which activation functions we choose, how do we preprocess the
data, weight initialization, regularization, gradient checking.

33
00:04:11,015 --> 00:04:16,118
We'll also talk about training dynamics. So,
how do we babysit the learning process?

34
00:04:16,118 --> 00:04:21,294
How do we choose how we do parameter
updates, specific parameter update rules,

35
00:04:21,294 --> 00:04:26,241
and how do we do hyperparameter optimization
to choose the best hyperparameters?

36
00:04:26,241 --> 00:04:28,281
And then we'll also talk about evaluation

37
00:04:28,281 --> 00:04:29,948
and model ensembles.

38
00:04:33,000 --> 00:04:41,015
So today in the first part, I will talk about activation functions,
data preprocessing, weight initialization, batch normalization,

39
00:04:41,015 --> 00:04:45,412
babysitting the learning process,
and hyperparameter optimization.

40
00:04:47,348 --> 00:04:50,348
Okay, so first activation functions.

41
00:04:51,708 --> 00:04:55,095
So, we saw earlier how, at
any particular layer,

42
00:04:55,095 --> 00:05:01,481
we have the data coming in. We multiply by our weights
in, you know, a fully connected or a convolutional layer.

43
00:05:01,481 --> 00:05:06,388
And then we'll pass this through
an activation function or nonlinearity.

44
00:05:06,388 --> 00:05:08,027
And we saw some examples of this.

45
00:05:08,027 --> 00:05:13,295
We used sigmoid previously in some of our
examples. We also saw the ReLU nonlinearity.

46
00:05:13,295 --> 00:05:20,479
And so today we'll talk more about different choices for
these different nonlinearities and trade-offs between them.

47
00:05:22,228 --> 00:05:27,241
So first, the sigmoid, which we've seen before, and
probably the one we're most comfortable with, right?

48
00:05:27,241 --> 00:05:32,572
So the sigmoid function is as we have up
here, one over one plus e to the negative x.

49
00:05:32,572 --> 00:05:45,201
And what this does is it takes each number that's input into the sigmoid nonlinearity, so each
element, and elementwise squashes these into the range [0,1], right, using this function here.

50
00:05:45,201 --> 00:05:50,427
And so, if you get very high values as input,
then output is going to be something near one.

51
00:05:50,427 --> 00:05:55,321
If you get very low values, or, I'm sorry, very
negative values, it's going to be near zero.

52
00:05:55,321 --> 00:06:02,481
And then we have this regime near zero where it's in a
linear regime. It looks a bit like a linear function.

53
00:06:02,481 --> 00:06:05,374
And so this has been historically popular,

54
00:06:05,374 --> 00:06:11,530
because sigmoids, in a sense, you can interpret them as
a kind of a saturating firing rate of a neuron, right?

55
00:06:11,530 --> 00:06:15,455
So if it's something between zero and one,
you could think of it as a firing rate.

56
00:06:15,455 --> 00:06:23,588
And we'll talk later about other nonlinearities, like ReLUs that,
in practice, actually turned out to be more biologically plausible,

57
00:06:23,588 --> 00:06:27,402
but this does have a kind of
interpretation that you could make.

58
00:06:30,015 --> 00:06:36,492
So if we look at this nonlinearity more carefully, there
are actually several problems with this.

59
00:06:36,492 --> 00:06:44,065
So the first is that saturated neurons can kill off
the gradient. And so what exactly does this mean?

60
00:06:44,988 --> 00:06:48,801
So if we look at a sigmoid gate right,
a node in our computational graph,

61
00:06:48,801 --> 00:06:54,566
and we have our data X as input into it, and then we
have the output of the sigmoid gate coming out of it,

62
00:06:54,566 --> 00:06:59,236
what does the gradient flow look like
as we're coming back?

63
00:06:59,236 --> 00:07:08,441
We have dL over d sigma right? The upstream gradient coming
down, and then we're going to multiply this by dSigma over dX.

64
00:07:08,441 --> 00:07:11,081
This will be the gradient
of a local sigmoid function.

65
00:07:11,081 --> 00:07:16,495
And we're going to chain these together for
our downstream gradient that we pass back.

66
00:07:16,495 --> 00:07:24,708
So who can tell me what happens when X is equal to -10?
It's very negative. What does the gradient look like?

67
00:07:24,708 --> 00:07:28,868
Zero, yeah, so that's right.
So the gradient becomes zero

68
00:07:28,868 --> 00:07:37,348
and that's because in this negative, very negative region of
the sigmoid, it's essentially flat, so the gradient is zero,

69
00:07:37,348 --> 00:07:40,001
and we chain any upstream
gradient coming down.

70
00:07:40,001 --> 00:07:46,501
We multiply by basically something near zero, and we're going to
get a very small gradient that's flowing back downwards, right?

71
00:07:46,501 --> 00:07:55,381
So, in a sense, after the chain rule, this kills the gradient flow and
you're going to have a zero gradient passed down to downstream nodes.

72
00:07:58,869 --> 00:08:10,015
And so what happens when X is equal to zero? So there it's,
yeah, it's fine in this regime. So, in this regime near zero,

73
00:08:10,015 --> 00:08:15,135
you're going to get a reasonable gradient
here, and then it'll be fine for backprop.

74
00:08:15,135 --> 00:08:20,055
And then what about X equals 10?
Zero, right.

75
00:08:20,055 --> 00:08:31,108
So again, so when X is very negative or X is a large positive number, then
these are all regions where the sigmoid function is flat, and it's going to kill off the gradient

76
00:08:31,108 --> 00:08:35,275
and you're not going to get
a gradient flow coming back.
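A small sketch of the three cases just discussed, using the identity dSigma/dX = sigma(x) * (1 - sigma(x)). The upstream gradient of 1.0 is an assumed value for illustration. The local gradient is 0.25 at x = 0 and on the order of 1e-5 at x = -10 and x = 10:
```python
import math
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def sigmoid_grad(x):
    """Local gradient dSigma/dX = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
upstream = 1.0  # assumed upstream gradient dL/dSigma
for x in (-10.0, 0.0, 10.0):
    local = sigmoid_grad(x)
    downstream = upstream * local   # chain rule
    print(f"x = {x:+6.1f}  local = {local:.6f}  downstream = {downstream:.6f}")
```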

77
00:08:37,055 --> 00:08:42,454
Okay, so a second problem is that
the sigmoid outputs are not zero centered.

78
00:08:42,454 --> 00:08:46,415
And so let's take a look
at why this is a problem.

79
00:08:46,415 --> 00:08:51,892
So, consider what happens when
the input to a neuron is always positive.

80
00:08:51,892 --> 00:08:54,948
So in this case, we're going to say
all of our Xs are positive.

81
00:08:54,948 --> 00:09:04,348
It's going to be multiplied by some weight, W, and then
we're going to run it through our activation function.

82
00:09:04,348 --> 00:09:08,015
So what can we say about
the gradients on W?

83
00:09:12,375 --> 00:09:18,135
So think about what the local gradient is
going to be, right, for this linear layer.

84
00:09:18,135 --> 00:09:24,214
We have dL over d of whatever the input to the activation
function is, the upstream gradient coming down,

85
00:09:24,214 --> 00:09:29,834
and then we have our local gradient,
which is going to be basically X, right?

86
00:09:29,834 --> 00:09:34,001
And so what does this mean,
if all of X is positive?

87
00:09:36,253 --> 00:09:44,401
Okay, so I heard it's always going to be positive. So that's almost right. It's
always going to be either all positive or all negative, right?

88
00:09:44,401 --> 00:09:53,588
So, our upstream gradient coming down, for our loss L, is going
to be dL over df, and this is going to be either positive or negative.

89
00:09:53,588 --> 00:09:55,815
It's some arbitrary gradient coming down.

90
00:09:55,815 --> 00:10:06,619
And then our local gradient that we multiply this by is, if we're going to
find the gradients on W, is going to be DF over DW, which is going to be X.

91
00:10:07,880 --> 00:10:20,800
And if X is always positive then the gradients on W, which come from multiplying these two
together, are always going to have the sign of the upstream gradient coming down.

92
00:10:20,800 --> 00:10:28,520
And so what this means is that all the gradients on W, since they're either all
positive or all negative, are always going to move in the same direction.

93
00:10:28,520 --> 00:10:42,467
When you do a parameter update, you're going to either increase
all of the values of W by differing positive amounts, or you will decrease them all.

94
00:10:42,467 --> 00:10:48,867
And so the problem with this is that, this
gives very inefficient gradient updates.
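A small sketch of this sign argument, assuming a single linear neuron f = sum_i w_i * x_i so that dL/dw_i = dL/df * x_i. The input values and upstream gradients are made up for illustration:
```python
import random
random.seed(0)
def weight_grads(upstream, x):
    """dL/dw_i = dL/df * x_i for f = sum_i w_i * x_i."""
    return [upstream * xi for xi in x]
x_positive = [random.uniform(0.1, 1.0) for _ in range(5)]  # all-positive inputs
for upstream in (2.0, -0.5):                               # assumed values of dL/df
    grads = weight_grads(upstream, x_positive)
    same_sign = all(g > 0 for g in grads) or all(g < 0 for g in grads)
    print(f"upstream = {upstream:+.1f}  all gradients share one sign: {same_sign}")
# With inputs of mixed sign, the gradients on W can also have mixed signs.
mixed = weight_grads(1.0, [-1.0, 2.0])
print("mixed-sign inputs give gradients:", mixed)
```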

95
00:10:48,867 --> 00:10:59,507
So, if you look at on the right here, we have an example of a case
where, let's say W is two-dimensional, so we have our two axes for W,

96
00:10:59,507 --> 00:11:04,796
and if we say that we can only have
all positive or all negative updates,

97
00:11:04,796 --> 00:11:12,400
then we have these two quadrants, which are the two places
where the axes are either all positive or all negative,

98
00:11:12,400 --> 00:11:17,213
and these are the only directions in which
we're allowed to make a gradient update.

99
00:11:17,213 --> 00:11:25,399
And so in the case where, let's say our hypothetical
optimal W is actually this blue vector here, right,

100
00:11:25,399 --> 00:11:30,773
and we're starting off at, you know, some point
at the beginning of the red arrows,

101
00:11:30,773 --> 00:11:38,946
we can't just directly take a gradient update in this direction,
because this is not in one of those two allowed gradient directions.

102
00:11:38,946 --> 00:11:43,479
And so what we're going to have to do, is we'll
have to take a sequence of gradient updates.

103
00:11:43,479 --> 00:11:51,953
For example, in these red arrow directions that are each in
allowed directions, in order to finally get to this optimal W.

104
00:11:53,039 --> 00:11:58,479
And so this is why also, in general,
we want zero-mean data.

105
00:11:58,479 --> 00:12:11,893
So, we want our input X to be zero-meaned, so that we actually have positive and negative values and
we don't get into this problem of the gradient updates all moving in the same direction.

106
00:12:11,893 --> 00:12:17,819
So is this clear? Any questions
on this point? Okay.

107
00:12:21,453 --> 00:12:24,930
Okay, so we've talked about these two
main problems of the sigmoid.

108
00:12:24,930 --> 00:12:30,586
The saturated neurons can kill the gradients if
the input is too positive or too negative.

109
00:12:30,586 --> 00:12:36,586
They're also not zero-centered and so we get
this inefficient kind of gradient update.

110
00:12:36,586 --> 00:12:43,146
And then a third problem, we have an exponential function
in here, so this is a little bit computationally expensive.

111
00:12:43,146 --> 00:12:46,837
In the grand scheme of your network,
this is usually not the main problem,

112
00:12:46,837 --> 00:12:51,186
because we have all these convolutions and
dot products that are a lot more expensive,

113
00:12:51,186 --> 00:12:55,103
but this is just a minor
point also to observe.

114
00:12:58,986 --> 00:13:03,166
So now we can look at a second
activation function here, tanh.

115
00:13:03,166 --> 00:13:10,999
And so this looks very similar to the sigmoid, but the
difference is that now it's squashing to the range [-1, 1].

116
00:13:10,999 --> 00:13:15,573
So here, the main difference
is that it's now zero-centered,

117
00:13:15,573 --> 00:13:21,306
so we've gotten rid of the second problem that we had. It
still kills the gradients, however, when it's saturated.

118
00:13:21,306 --> 00:13:29,264
So, you still have these regimes where the function is
essentially flat and you're going to kill the gradient flow.

119
00:13:29,264 --> 00:13:34,009
So this is a bit better than the sigmoid,
but it still has some problems.

120
00:13:36,586 --> 00:13:40,104
Okay, so now let's look at
the ReLU activation function.

121
00:13:40,104 --> 00:13:47,573
And this is one that we saw in our examples last lecture
when we were talking about the convolutional neural network.

122
00:13:47,573 --> 00:13:53,279
And we saw that we interspersed ReLU nonlinearities
between many of the convolutional layers.

123
00:13:53,279 --> 00:13:58,253
And so, this function is f of
x equals max of zero and x.

124
00:13:58,253 --> 00:14:06,573
So it's an elementwise operation on your input and basically
if your input is negative, it's going to put it to zero.

125
00:14:06,573 --> 00:14:13,264
And then if it's positive, it's going to
be just passed through. It's the identity.

126
00:14:13,264 --> 00:14:22,892
And so this is one that's pretty commonly used, and if we look at this one
and think about the problems that we saw earlier with the sigmoid and the tanh,

127
00:14:22,892 --> 00:14:26,746
we can see that it doesn't saturate
in the positive region.

128
00:14:26,746 --> 00:14:34,465
So there's a whole half of our input space where it's
not going to saturate, so this is a big advantage.

129
00:14:34,465 --> 00:14:36,959
So this is also
computationally very efficient.

130
00:14:36,959 --> 00:14:42,466
We saw earlier that the sigmoid
has this exponential in it.

131
00:14:42,466 --> 00:14:48,968
And so the ReLU is just this simple max
and it's extremely fast.

132
00:14:48,968 --> 00:14:57,063
And in practice, using this ReLU, it converges much faster
than the sigmoid and the tanh, so about six times faster.

133
00:14:57,063 --> 00:15:01,090
And it's also turned out to be more
biologically plausible than the sigmoid.

134
00:15:01,090 --> 00:15:11,450
So if you look at a neuron and you look at what the inputs look like, and you look at
what the outputs look like, and you try to measure this in neuroscience experiments,

135
00:15:11,450 --> 00:15:18,303
you'll see that this one is actually a closer
approximation to what's happening than sigmoids.

136
00:15:18,303 --> 00:15:33,798
And so ReLUs were starting to be used a lot around 2012 when we had AlexNet, the first major convolutional neural
network that was able to do well on ImageNet and large-scale data. They used the ReLU in their experiments.

137
00:15:36,775 --> 00:15:42,082
So a problem however, with the ReLU, is that
it's not zero-centered anymore.

138
00:15:42,082 --> 00:15:49,228
So we saw that the sigmoid was not zero-centered.
Tanh fixed this and now ReLU has this problem again.

139
00:15:49,228 --> 00:15:52,122
And so that's one of
the issues of the ReLU.

140
00:15:52,122 --> 00:15:55,357
And then we also have
this further annoyance of,

141
00:15:55,357 --> 00:16:04,222
again we saw that in the positive half of the inputs, we don't
have saturation, but this is not the case for the negative half.

142
00:16:04,222 --> 00:16:06,882
Right, so just thinking about this
a little bit more precisely.

143
00:16:06,882 --> 00:16:11,255
So what's happening here
when X equals negative 10?

144
00:16:11,255 --> 00:16:12,855
So zero gradient, that's right.

145
00:16:12,855 --> 00:16:16,522
What happens when X is
equal to positive 10?

146
00:16:17,455 --> 00:16:20,175
It's good, right.
So, we're in the linear regime.

147
00:16:20,175 --> 00:16:30,442
And then what happens when X is equal to zero? Yes, it's undefined
here, but in practice, we'll say, you know, zero, right.

148
00:16:30,442 --> 00:16:35,074
And so basically, it's killing the
gradient in half of the regime.
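A minimal sketch of the ReLU and the local gradient at the three inputs just discussed. Using 0 for the gradient at exactly x = 0 follows the convention mentioned above:
```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)
def relu_grad(x):
    """Local gradient: 1 in the positive region, 0 otherwise.
    The derivative is undefined at x == 0; by convention we use 0 there."""
    return 1.0 if x > 0 else 0.0
for x in (-10.0, 0.0, 10.0):
    print(f"x = {x:+6.1f}  relu = {relu(x):5.1f}  local gradient = {relu_grad(x):.1f}")
```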

149
00:16:37,948 --> 00:16:45,708
And so we can get this phenomenon of basically dead
ReLUs, when we're in this bad part of the regime.

150
00:16:45,708 --> 00:16:51,212
And so you can look at this
as coming from several potential reasons.

151
00:16:51,212 --> 00:16:57,192
And so if we look at our data cloud here,
this is all of our training data,

152
00:16:59,033 --> 00:17:09,092
then if we look at where the ReLUs can fall, each of
these is basically the half of the plane where it's going to activate.

153
00:17:11,948 --> 00:17:15,640
And so each of these is the plane
that defines each of these ReLUs,

154
00:17:15,640 --> 00:17:21,201
and we can see that you can have these dead
ReLUs that are basically off of the data cloud.

155
00:17:21,201 --> 00:17:26,588
And in this case, it will never activate and
never update, as compared to an active ReLU

156
00:17:26,588 --> 00:17:31,732
where some of the data is going to be positive
and passed through and some won't be.

157
00:17:31,732 --> 00:17:33,480
And so there's several reasons for this.

158
00:17:33,480 --> 00:17:37,201
The first is that it can happen
when you have bad initialization.

159
00:17:37,201 --> 00:17:45,015
So if you have weights that happen to be unlucky and they happen to be
off the data cloud, so they happen to specify this bad ReLU over here.

160
00:17:45,015 --> 00:17:55,069
Then they're never going to get a data input that causes it to activate,
and so they're never going to get good gradient flow coming back.

161
00:17:56,108 --> 00:17:59,321
And so it'll just never
update and never activate.

162
00:17:59,321 --> 00:18:03,880
What's the more common case is
when your learning rate is too high.

163
00:18:03,880 --> 00:18:11,561
And so in this case you started off with an okay ReLU, but because
you're making these huge updates, the weights jump around

164
00:18:11,561 --> 00:18:18,028
and then your ReLU unit in a sense, gets knocked off of
the data manifold. And so this happens through training.

165
00:18:18,028 --> 00:18:22,975
So it was fine at the beginning and then
at some point, it became bad and it died.

166
00:18:22,975 --> 00:18:24,108
And so in practice,

167
00:18:24,108 --> 00:18:33,361
if you freeze a network that you've trained and you pass the data through, you
can see that actually as much as 10 to 20% of the network is these dead ReLUs.
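One way to sketch this check: pass the data through a frozen layer and count the ReLU units that never activate. The helper name, the 2-D data cloud, and the three hand-picked units are all hypothetical, chosen only to illustrate the idea:
```python
import random
random.seed(0)
def dead_fraction(weights, biases, data):
    """Fraction of ReLU units whose pre-activation is <= 0 on every example."""
    dead = 0
    for w, b in zip(weights, biases):
        never_active = all(
            sum(wi * xi for wi, xi in zip(w, x)) + b <= 0 for x in data
        )
        if never_active:
            dead += 1
    return dead / len(weights)
# Hypothetical data cloud in the positive quadrant of a 2-D input space.
data = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(200)]
# Three hypothetical units; the second one's active half-plane misses the data cloud.
weights = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
biases = [0.0, 0.0, 0.0]
frac = dead_fraction(weights, biases, data)
print(f"fraction of dead units: {frac:.2f}")
```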

168
00:18:33,361 --> 00:18:40,001
And so you know that's a problem, but also most networks
do have this type of problem when you use ReLUs.

169
00:18:40,001 --> 00:18:49,467
Some of them will be dead, and in practice, people look into this, and
it's a research problem, but it's still doing okay for training networks.

170
00:18:49,467 --> 00:18:51,268
Yeah, is there a question?

171
00:18:51,268 --> 00:18:54,851
[student speaking off mic]

172
00:19:01,908 --> 00:19:05,335
Right. So the question is, yeah, so the
data cloud is just your training data.

173
00:19:05,335 --> 00:19:08,918
[student speaking off mic]

174
00:19:17,641 --> 00:19:25,708
Okay, so the question is when, how do you tell when the ReLU
is going to be dead or not, with respect to the data cloud?

175
00:19:25,708 --> 00:19:30,988
And so if you look at, this is an example
of like a simple two-dimensional case.

176
00:19:30,988 --> 00:19:42,278
And so our ReLU, we're going to get our input to the ReLU, which is going
to be basically, you know, W1 X1 plus W2 X2, and we apply this,

177
00:19:42,278 --> 00:19:46,080
so that defines this
separating hyperplane here,

178
00:19:46,080 --> 00:19:51,453
and then we're going to take half of it that's going to
be positive, and half of it's going to be killed off,

179
00:19:51,453 --> 00:20:03,789
and so yes, it's just whatever the weights happen to be, and
where the data happens to be, that determines where these hyperplanes fall, and so,

180
00:20:05,560 --> 00:20:14,329
so yeah so just throughout the course of training, some of your
ReLUs will be in different places, with respect to the data cloud.

181
00:20:16,480 --> 00:20:18,050
Oh, question.

182
00:20:18,050 --> 00:20:21,633
[student speaking off mic]

183
00:20:23,380 --> 00:20:33,478
Yeah. So okay, so the question is for the sigmoid we talked about two
drawbacks, and one of them was that the neurons can get saturated,

184
00:20:37,045 --> 00:20:40,500
so let's go back to the sigmoid here,

185
00:20:40,500 --> 00:20:45,820
and the question was whether this is not the case
when all of your inputs are positive.

186
00:20:45,820 --> 00:20:51,971
So when all of your inputs are positive, they're all
going to be coming in in this zero plus region here,

187
00:20:51,971 --> 00:20:54,464
and so you can still
get a saturating neuron,

188
00:20:54,464 --> 00:21:00,544
because you see up in this positive
region, it also plateaus at one,

189
00:21:00,544 --> 00:21:08,846
and so when you have large positive values as input you're also
going to get the zero gradient, because you have a flat slope here.

190
00:21:10,715 --> 00:21:11,548
Okay.

191
00:21:16,355 --> 00:21:24,528
Okay, so in practice people also like to
initialize ReLUs with slightly positive biases,

192
00:21:24,528 --> 00:21:30,721
in order to increase the likelihood of it being
active at initialization and to get some updates.

193
00:21:30,721 --> 00:21:40,430
Right and so this basically just biases towards more ReLUs firing at the
beginning, and in practice some say that it helps. Some say that it doesn't.

194
00:21:40,430 --> 00:21:48,072
Generally people don't always use this. It's yeah, a lot
of times people just initialize it with zero biases still.

195
00:21:49,483 --> 00:21:54,777
Okay, so now we can look at some modifications
on the ReLU that have come out since then,

196
00:21:54,777 --> 00:21:57,768
and so one example is this leaky ReLU.

197
00:21:57,768 --> 00:22:04,429
And so this looks very similar to the original ReLU, and the only
difference is that now instead of being flat in the negative regime,

198
00:22:04,429 --> 00:22:11,955
we're going to give it a slight slope here. And so this
solves a lot of the problems that we mentioned earlier.

199
00:22:11,955 --> 00:22:17,142
Right here we don't have any saturating
regime, even in the negative space.

200
00:22:17,142 --> 00:22:23,968
It's still very computationally efficient. It still converges
faster than sigmoid and tanh, very similar to a ReLU.

201
00:22:23,968 --> 00:22:27,218
And it doesn't have this dying problem.

202
00:22:28,923 --> 00:22:35,380
And another example
is the parametric rectifier, so PReLU.

203
00:22:35,380 --> 00:22:42,195
And so in this case it's just like a leaky ReLU where
we again have this sloped region in the negative space,

204
00:22:42,195 --> 00:22:47,088
but now this slope in the negative regime
is determined through this alpha parameter,

205
00:22:47,088 --> 00:22:52,982
so we don't specify, we don't hard-code it, but we treat
it as now a parameter that we can backprop into and learn.

206
00:22:52,982 --> 00:22:57,555
And so this gives it a
little bit more flexibility.

207
00:22:57,555 --> 00:23:02,342
And we also have something called
an Exponential Linear Unit, an ELU,

208
00:23:02,342 --> 00:23:08,295
so we have all these different LUs,
basically, and this one again, you know,

209
00:23:08,295 --> 00:23:10,341
it has all the benefits of the ReLU,

210
00:23:10,341 --> 00:23:14,508
but now it also has
closer to zero mean outputs.

211
00:23:16,181 --> 00:23:24,901
So, that's actually an advantage of the leaky ReLU, parametric ReLU,
a lot of these: they allow you to have your mean closer to zero,

212
00:23:26,699 --> 00:23:36,538
but compared with the leaky ReLU, instead of it being sloped in the negative
regime, here you actually are building back in a negative saturation regime,

213
00:23:36,538 --> 00:23:43,029
and there's arguments that basically this allows
you to have some more robustness to noise,

214
00:23:43,029 --> 00:23:48,566
and you basically get these deactivation
states that can be more robust.

215
00:23:48,566 --> 00:23:55,885
And you can look at this paper, where there's a lot
more justification for why this is the case.

216
00:23:55,885 --> 00:24:01,111
And in a sense this is kind of something
in between the ReLUs and the leaky ReLUs,

217
00:24:01,111 --> 00:24:13,267
where it has some of this shape, which the Leaky ReLU does, which gives it closer to zero mean
output, but then it also still has some of this more saturating behavior that ReLUs have.
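A rough sketch of the three variants side by side. The 0.01 slope for the leaky ReLU and alpha = 1.0 for the ELU are common default choices assumed here, not values stated in the lecture:
```python
import math
def leaky_relu(x, slope=0.01):
    """Fixed small slope in the negative regime (0.01 is an assumed default)."""
    return x if x > 0 else slope * x
def prelu(x, alpha):
    """Same shape as leaky ReLU, but alpha is a learned parameter."""
    return x if x > 0 else alpha * x
def prelu_grad_alpha(x):
    """d prelu / d alpha, which is what lets us backprop into alpha."""
    return 0.0 if x > 0 else x
def elu(x, alpha=1.0):
    """Identity for x > 0; saturates toward -alpha for very negative x."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
for x in (-10.0, -1.0, 0.5):
    print(f"x = {x:+5.1f}  leaky = {leaky_relu(x):+.4f}  "
          f"prelu(alpha=0.25) = {prelu(x, 0.25):+.4f}  elu = {elu(x):+.4f}")
```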

218
00:24:13,267 --> 00:24:14,350
A question?

219
00:24:14,350 --> 00:24:17,933
[student speaking off mic]

220
00:24:19,952 --> 00:24:24,365
So, whether this parameter alpha
is going to be specific for each neuron.

221
00:24:24,365 --> 00:24:34,090
So, I believe it is often specified, but I actually can't remember exactly,
so you can look in the paper for exactly, yeah, how this is defined,

222
00:24:35,578 --> 00:24:45,050
but yeah, so I believe this function is basically very
carefully designed in order to have nice desirable properties.

223
00:24:45,050 --> 00:24:49,992
Okay, so there's basically all of these
kinds of variants on the ReLU.

224
00:24:49,992 --> 00:24:58,192
And so you can see that, for all of these, you can argue that
each one may have certain benefits, certain drawbacks in practice.

225
00:24:58,192 --> 00:25:04,950
People just want to run experiments on all of them, and see empirically
what works better, try and justify it, and come up with new ones,

226
00:25:04,950 --> 00:25:08,612
but they're all different things
that are being experimented with.

227
00:25:10,135 --> 00:25:14,744
And so let's just mention one more.
This is the Maxout Neuron.

228
00:25:14,744 --> 00:25:25,969
So, this one looks a little bit different in that it doesn't have the same form as the others did
of taking your basic dot product, and then putting this element-wise nonlinearity in front of it.

229
00:25:25,969 --> 00:25:34,670
Instead, it looks like this, this max of W1 dot product with X plus
B1, and a second set of weights, W2 dot product with X plus B2.

230
00:25:38,230 --> 00:25:43,185
And so what this does is take
the max of these two functions, in a sense.

231
00:25:44,870 --> 00:25:48,949
And so what it does is it generalizes
the ReLU and the leaky ReLU,

232
00:25:48,949 --> 00:25:54,112
because you're just taking the max
over these two linear functions.

233
00:25:55,023 --> 00:26:02,927
And so what this gives us is, again, you're operating in
a linear regime. It doesn't saturate and it doesn't die.

234
00:26:02,927 --> 00:26:15,984
The problem is that here, you are doubling the number of parameters per neuron. So, each neuron,
instead of the original set of weights, W, now has W1 and W2, so you have twice as many.
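A small sketch of how Maxout generalizes the ReLU and the leaky ReLU: setting the second linear function to zero recovers a ReLU unit, and setting it to 0.01 times the first recovers a leaky ReLU with slope 0.01. The specific weights and the input below are made up for illustration:
```python
def linear(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b
def maxout(x, w1, b1, w2, b2):
    """max(w1 . x + b1, w2 . x + b2): two full sets of parameters per neuron."""
    return max(linear(x, w1, b1), linear(x, w2, b2))
def relu_unit(x, w, b):
    return max(0.0, linear(x, w, b))
def leaky_relu_unit(x, w, b, slope=0.01):
    z = linear(x, w, b)
    return z if z > 0 else slope * z
x = [1.0, -2.0]            # hypothetical input
w, b = [0.5, 1.0], 0.1     # hypothetical weights and bias
zeros = [0.0, 0.0]
scaled_w = [0.01 * wi for wi in w]
as_relu = maxout(x, w, b, zeros, 0.0)             # second function is identically 0
as_leaky = maxout(x, w, b, scaled_w, 0.01 * b)    # second function is 0.01 * first
print(as_relu, relu_unit(x, w, b))
print(as_leaky, leaky_relu_unit(x, w, b))
```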

235
00:26:17,765 --> 00:26:24,560
So in practice, when we look at all of these activation
functions, kind of a good general rule of thumb is use ReLU.

236
00:26:24,560 --> 00:26:29,389
This is the most standard one
that generally just works well.

237
00:26:30,231 --> 00:26:36,497
And you know you do want to be careful in general with your
learning rates and adjust them based on how things do.

238
00:26:36,497 --> 00:26:40,091
We'll talk more about adjusting
learning rates later in this lecture,

239
00:26:40,091 --> 00:26:52,318
but you can also try out some of these fancier activation functions, the leaky ReLU,
Maxout, ELU, but these are generally, they're still kind of more experimental.

240
00:26:53,828 --> 00:26:56,643
So, you can see how they
work for your problem.

241
00:26:56,643 --> 00:27:04,035
You can also try out tanh, but probably some of these
ReLU and ReLU variants are going to be better.

242
00:27:04,035 --> 00:27:15,243
And in general don't use sigmoid. This is one of the earliest original activation
functions, and ReLU and these other variants have generally worked better since then.

243
00:27:17,361 --> 00:27:21,517
Okay, so now let's talk a little bit
about data preprocessing.

244
00:27:21,517 --> 00:27:24,602
Right, so the activation function,
we design this as part of our network.

245
00:27:24,602 --> 00:27:30,361
Now we want to train the network, and we have our
input data that we want to start training from.

246
00:27:31,424 --> 00:27:39,495
So, generally we want to always preprocess the data, and this is something that
you've probably seen before in machine learning classes if you've taken those.

247
00:27:39,495 --> 00:27:49,366
And some standard types of preprocessing are, you take your original data and
you want to zero mean them, and then you probably want to also normalize that,

248
00:27:49,366 --> 00:27:57,367
so normalize by the standard deviation.
And so why do we want to do this?

249
00:27:57,367 --> 00:28:04,979
For zero centering, you can remember earlier that we talked
about when all the inputs are positive, for example,

250
00:28:04,979 --> 00:28:12,772
then we get all of our gradients on the weights to be
positive, and we get this basically suboptimal optimization.

251
00:28:12,772 --> 00:28:21,710
And in general even if it's not all positive or all negative,
any sort of bias will still cause this type of problem.

252
00:28:23,770 --> 00:28:36,440
And so then in terms of normalizing the data, this is basically you want to normalize data typically in the
machine learning problems, so that all features are in the same range, and so that they contribute equally.

253
00:28:36,440 --> 00:28:45,866
In practice, since for images, which is what we're dealing with in
this course here for the most part, we do do the zero centering,

254
00:28:45,866 --> 00:28:56,616
but in practice we don't actually normalize the pixel value so much, because generally for
images right at each location you already have relatively comparable scale and distribution,

255
00:28:56,616 --> 00:29:09,339
and so we don't really need to normalize so much, compared to more general machine learning problems,
where you might have different features that are very different and of very different scales.

256
00:29:11,037 --> 00:29:19,983
And in machine learning, you might also see more complicated
things, like PCA or whitening, but again with images,

257
00:29:19,983 --> 00:29:28,678
we typically just stick with the zero mean, and we don't do the normalization,
and we also don't do some of these more complicated pre-processing.

258
00:29:29,519 --> 00:29:40,876
And one reason for this is generally with images we don't really want to take all of our input, let's say pixel
values and project this onto a lower dimensional space of new kinds of features that we're dealing with.

259
00:29:40,876 --> 00:29:48,184
We typically just want to apply convolutional networks spatially
and have our spatial structure over the original image.

260
00:29:48,184 --> 00:29:49,595
Yeah, question.

261
00:29:49,595 --> 00:29:53,178
[student speaking off mic]

262
00:29:58,858 --> 00:30:06,968
So the question is we do this pre-processing in a training phase, do we
also do the same kind of thing in the test phase, and the answer is yes.

263
00:30:06,968 --> 00:30:24,839
So, let me just move to the next slide here. So, in general the training phase is where we determine our, let's say, mean, and
then we apply this exact same mean to the test data. So, we'll normalize by the same empirical mean from the training data.
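
The point about reusing training statistics at test time can be sketched in a few lines. This is a minimal illustration on made-up feature rows, not code from the lecture; the helper names `fit_stats` and `preprocess` are hypothetical. It zero-centers each feature and, optionally, divides by the standard deviation, always using statistics computed on the training data only.

```python
import statistics

def fit_stats(train_rows):
    """Per-feature mean and standard deviation, computed on training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def preprocess(rows, means, stds, normalize=True):
    """Zero-center each feature; optionally divide by the training std."""
    out = []
    for row in rows:
        centered = [x - m for x, m in zip(row, means)]
        if normalize:
            centered = [c / s if s > 0 else 0.0 for c, s in zip(centered, stds)]
        out.append(centered)
    return out

# Toy data (made up): two features on very different scales.
train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
test = [[2.0, 500.0]]

means, stds = fit_stats(train)            # determined once, from training data
train_z = preprocess(train, means, stds)
test_z = preprocess(test, means, stds)    # the same training statistics are reused
```

For images, the lecture's recommendation corresponds to calling this with `normalize=False`, that is, zero-centering only.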

264
00:30:24,839 --> 00:30:35,822
Okay, so to summarize basically for images, we typically just do the zero
mean pre-processing and we can subtract either the entire mean image.

265
00:30:38,151 --> 00:30:41,354
So, from the training data,
you compute the mean image,

266
00:30:41,354 --> 00:30:54,777
which will be the same size as your, as each image. So, for example 32 by 32 by three, you'll get this array
of numbers, and then you subtract that from each image that you're about to pass through the network,

267
00:30:54,777 --> 00:31:00,532
and you'll do the same thing at test time for
this array that you determined at training time.

268
00:31:00,532 --> 00:31:14,916
In practice, for some networks, we also do this by just subtracting a per-channel mean, and so
instead of having an entire mean image that we're going to zero-center by, we just take the mean by channel,

269
00:31:14,916 --> 00:31:25,718
and this is just because it turns out that it was similar enough across the whole image, it
didn't make such a big difference to subtract the mean image versus just a per-channel value.

270
00:31:25,718 --> 00:31:36,936
And this is easier to just pass around and deal with. So, you'll see this as well for example,
in a VGG Network, which is a network that came after AlexNet, and we'll talk about that later.
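
To make the two options concrete, here is a small sketch that computes both a mean image and a per-channel mean from a training set. It assumes images stored as nested lists in height x width x channel order, and the function names are illustrative rather than taken from any library or from the lecture.

```python
def mean_image(images):
    """Element-wise mean over a list of H x W x C images (nested lists)."""
    n = len(images)
    h, w, c = len(images[0]), len(images[0][0]), len(images[0][0][0])
    return [[[sum(img[i][j][k] for img in images) / n for k in range(c)]
             for j in range(w)] for i in range(h)]

def channel_means(images):
    """One mean per channel, averaged over all images and all pixel locations."""
    c = len(images[0][0][0])
    totals, count = [0.0] * c, 0
    for img in images:
        for row in img:
            for pixel in row:
                count += 1
                for k in range(c):
                    totals[k] += pixel[k]
    return [t / count for t in totals]

def subtract_channel_means(img, means):
    """Zero-center one image with the per-channel training means."""
    return [[[v - m for v, m in zip(pixel, means)] for pixel in row] for row in img]

# Two tiny 1 x 2 x 3 "training images" (made-up values).
train_images = [
    [[[10.0, 20.0, 30.0], [30.0, 40.0, 50.0]]],
    [[[20.0, 30.0, 40.0], [40.0, 50.0, 60.0]]],
]
mean_img = mean_image(train_images)      # same shape as one image
per_channel = channel_means(train_images)  # just three numbers
centered = subtract_channel_means(train_images[0], per_channel)
```

The per-channel version only needs three numbers to be stored and passed around, which is the convenience the lecture mentions.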

271
00:31:36,936 --> 00:31:38,545
Question.

272
00:31:38,545 --> 00:31:42,128
[student speaking off mic]

273
00:31:45,215 --> 00:31:52,049
Okay, so there are two questions. The first is what's a channel,
in this case, when we are subtracting a per-channel mean?

274
00:31:52,049 --> 00:32:04,198
And this is RGB, so our array, our images are typically for example, 32 by 32 by
three. So, width, height, each are 32, and our depth, we have three channels RGB,

275
00:32:04,198 --> 00:32:09,786
and so we'll have one mean for the red
channel, one mean for a green, one for blue.

276
00:32:09,786 --> 00:32:14,529
And then the second, what
was your second question?

277
00:32:14,529 --> 00:32:18,112
[student speaking off mic]

278
00:32:21,349 --> 00:32:26,827
Oh. Okay, so the question is when we're subtracting
the mean image, what is the mean taken over?

279
00:32:27,882 --> 00:32:39,114
And the mean is taken over all of your training images. So, you'll take all of your
training images and just compute the mean of all of those. Does that make sense?

280
00:32:39,114 --> 00:32:42,697
[student speaking off mic]

281
00:32:48,432 --> 00:32:55,255
Yeah the question is, we do this for the entire training set,
once before we start training. We don't do this per batch,

282
00:32:55,255 --> 00:32:57,904
and yeah, that's exactly correct.

283
00:32:57,904 --> 00:33:03,984
So we just want to have a good sample,
an empirical mean that we have.

284
00:33:03,984 --> 00:33:13,983
And so if you take it per batch, if you're sampling reasonable batches, it
should be basically, you should be getting the same values anyways for the mean,

285
00:33:13,983 --> 00:33:19,126
and so it's more efficient and easier
to just do this once at the beginning.

286
00:33:19,126 --> 00:33:28,296
You might not even have to really take it over the entire training data. You could
also just sample enough training images to get a good estimate of your mean.

287
00:33:30,734 --> 00:33:35,560
Okay, so any other questions
about data preprocessing? Yes.

288
00:33:35,560 --> 00:33:38,654
[student speaking off mic]

289
00:33:38,654 --> 00:33:42,187
So, the question is does the data
preprocessing solve the sigmoid problem?

290
00:33:42,187 --> 00:33:46,354
So the data preprocessing
is doing zero mean right?

291
00:33:47,540 --> 00:33:50,535
And we talked about how sigmoid,
we want to have zero mean.

292
00:33:50,535 --> 00:33:56,262
And so it does solve this for the
first layer that we pass it through.

293
00:33:56,262 --> 00:34:00,263
So, now our inputs to the first layer
of our network are going to be zero mean,

294
00:34:00,263 --> 00:34:08,472
but we'll see later on that we're actually going to have this problem
come up in much worse and greater form, as we have deep networks.

295
00:34:08,472 --> 00:34:12,437
You're going to get a lot
of nonzero mean problems later on.

296
00:34:12,438 --> 00:34:19,350
And so in this case, this is not going to be sufficient.
So this only helps at the first layer of your network.

297
00:34:21,784 --> 00:34:28,203
Okay, so now let's talk about how do we want
to initialize the weights of our network?

298
00:34:28,204 --> 00:34:34,471
So, we have let's say our standard two layer neural network
and we have all of these weights that we want to learn,

299
00:34:34,472 --> 00:34:43,509
but we have to start them with some value, right? And then we're
going to update them using our gradient updates from there.

300
00:34:43,510 --> 00:34:56,157
So first question. What happens when we use an initialization of W equals zero?
We just set all of the parameters to be zero. What's the problem with this?

301
00:34:56,157 --> 00:34:58,683
[student speaking off mic]

302
00:34:58,683 --> 00:35:00,766
So sorry, say that again.

303
00:35:02,039 --> 00:35:08,320
So I heard all the neurons are going to
be dead. No updates ever. So not exactly.

304
00:35:11,035 --> 00:35:16,995
So, part of that is correct in that all the neurons
will do the same thing. So, they might not all be dead.

305
00:35:16,995 --> 00:35:23,321
Depending on your input value, I mean, you could be in
any regime of your neurons, so they might not be dead,

306
00:35:23,321 --> 00:35:27,869
but the key thing is that they
will all do the same thing.

307
00:35:27,869 --> 00:35:36,577
So, since your weights are zero, given an input, every neuron is going
to be, have the same operation basically on top of your inputs.

308
00:35:36,577 --> 00:35:43,621
And so, since they're all going to output the same
thing, they're also all going to get the same gradient.

309
00:35:43,621 --> 00:35:47,571
And so, because of that, they're all
going to update in the same way.

310
00:35:47,571 --> 00:35:51,983
And now you're just going to get all neurons that
are exactly the same, which is not what you want.

311
00:35:51,983 --> 00:35:54,075
You want the neurons to
learn different things.

312
00:35:54,075 --> 00:35:58,514
And so, that's the problem
when you initialize everything equally

313
00:35:58,514 --> 00:36:02,730
and there's basically no
symmetry breaking here.
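
The lack of symmetry breaking can be checked on a toy network. The sketch below is an assumption-laden illustration, not the lecture's setup: a tiny two-layer network with sigmoid hidden units, a single linear output, squared-error loss, and one made-up training example. All weights start at zero, and after several gradient steps every hidden unit still has exactly the same weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(W1, w2, x, y, lr=0.5):
    """One gradient step for h = sigmoid(W1 x), out = w2 . h, loss = 0.5 * (out - y)^2."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in W1]
    out = sum(w * hj for w, hj in zip(w2, h))
    d_out = out - y                                    # dLoss/dout
    grad_w2 = [d_out * hj for hj in h]                 # upstream gradient times local input
    d_pre = [d_out * w * hj * (1.0 - hj) for w, hj in zip(w2, h)]
    grad_W1 = [[dj * xi for xi in x] for dj in d_pre]
    new_W1 = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W1, grad_W1)]
    new_w2 = [w - lr * g for w, g in zip(w2, grad_w2)]
    return new_W1, new_w2

x, y = [1.0, -2.0, 0.5], 1.0
W1 = [[0.0] * 3 for _ in range(4)]   # 4 hidden units, all weights zero
w2 = [0.0] * 4
for _ in range(20):
    W1, w2 = train_step(W1, w2, x, y)

# The weights have moved away from zero, but every hidden unit is still a copy of the others.
identical = all(row == W1[0] for row in W1)
```

The neurons are not dead here, since the weights do change, but they all change in the same way, which is the problem described above.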

314
00:36:02,730 --> 00:36:05,961
So, what's the first, yeah question?

315
00:36:05,961 --> 00:36:09,544
[student speaking off mic]

316
00:36:19,699 --> 00:36:29,961
So the question is, because that, because the gradient also depends
on our loss, won't one backprop differently compared to the other?

317
00:36:29,961 --> 00:36:46,072
So in the last layer, like yes, you do have basically some of this, the gradients will get the same,
sorry, will get different loss for each specific neuron based on which class it was connected to,

318
00:36:46,072 --> 00:36:54,352
but if you look at all the neurons generally throughout your network, like you're going
to, you basically have a lot of these neurons that are connected in exactly the same way.

319
00:36:54,352 --> 00:36:59,885
They have the same updates, and that's
basically going to be the problem.

320
00:36:59,885 --> 00:37:10,885
Okay, so the first idea that we can have to try and improve upon this is to set all
of the weights to be small random numbers that we can sample from a distribution.

321
00:37:10,885 --> 00:37:16,002
So, in this case, we're going to sample
from basically a standard gaussian,

322
00:37:16,002 --> 00:37:22,450
but we're going to scale it so that the standard
deviation is actually 1e-2, 0.01.

323
00:37:22,450 --> 00:37:25,640
And so, this just gives us
many small random weights.

324
00:37:25,640 --> 00:37:30,729
And so, this does work okay for small
networks, now we've broken the symmetry,

325
00:37:30,729 --> 00:37:34,896
but there's going to be
problems with deeper networks.

326
00:37:35,970 --> 00:37:43,070
And so, let's take a look at why this is the case. So,
here this is basically an experiment that we can do

327
00:37:43,070 --> 00:37:45,341
where let's take a deeper network.

328
00:37:45,341 --> 00:37:53,622
So in this case, let's initialize a 10 layer neural
network to have 500 neurons in each of these 10 layers.

329
00:37:53,622 --> 00:37:56,437
Okay, we'll use tanh
nonlinearities in this case

330
00:37:56,437 --> 00:38:06,116
and we'll initialize it with small random numbers as we described in the
last slide. So here, we're going to basically just initialize this network.

331
00:38:06,116 --> 00:38:12,356
We have random data that we're going to take, and
now let's just pass it through the entire network,

332
00:38:12,356 --> 00:38:18,725
and at each layer, look at the statistics of
the activations that come out of that layer.

333
00:38:22,476 --> 00:38:25,485
And so, what we'll see this is probably
a little bit hard to read up top,

334
00:38:25,485 --> 00:38:31,156
but if we compute the mean
and the standard deviations at each layer,

335
00:38:31,156 --> 00:38:39,410
we'll see that at the first layer this is,
the means are always around zero.

336
00:38:40,267 --> 00:38:48,219
There's a funny sound in here.
Interesting, okay well that was fixed.

337
00:38:49,613 --> 00:38:58,153
So, if we look at, if we look at the outputs from here, the
mean is always going to be around zero, which makes sense.

338
00:38:58,153 --> 00:39:01,175
So, if we look here, let's see,

339
00:39:01,175 --> 00:39:11,420
if we take this, we looked at the dot product of X with W, and then
we took the tanh nonlinearity, and then we store these values and so,

340
00:39:12,315 --> 00:39:16,780
because tanh is centered around zero,
this will make sense,

341
00:39:16,780 --> 00:39:22,450
and then the standard deviation however
shrinks, and it quickly collapses to zero.

342
00:39:22,450 --> 00:39:32,019
So, if we're plotting this, here this second row of plots here is showing the
mean and standard deviations over time per layer and then in the bottom,

343
00:39:32,019 --> 00:39:38,592
the sequence of plots is showing for each of our layers.
What's the distribution of the activations that we have?

344
00:39:38,592 --> 00:39:45,206
And so, we can see that at the first layer, we still have a
reasonable gaussian looking thing. It's a nice distribution.

345
00:39:45,206 --> 00:39:58,591
But the problem is that as we multiply by this W, these small numbers at each layer, this
quickly shrinks and collapses all of these values, as we multiply this over and over again.

346
00:39:58,591 --> 00:40:02,191
And so, by the end, we
get all of these zeros,

347
00:40:02,191 --> 00:40:04,262
which is not what we want.

348
00:40:04,262 --> 00:40:07,457
So we get all the activations become zero.
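
The experiment described here can be reproduced roughly as follows. This is a sketch under stated assumptions: it uses only the Python standard library, 100 units per layer instead of the lecture's 500 so that it runs quickly, and a hypothetical helper name `layer_stats`. It pushes unit gaussian data through 10 tanh layers whose weights have standard deviation 0.01 and records the mean and standard deviation of the activations at each layer.

```python
import math
import random
import statistics

def layer_stats(weight_std, depth=10, width=100, batch=20, seed=0):
    """Return (mean, std) of the tanh activations at each layer."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    stats = []
    for _ in range(depth):
        # W[j] holds the incoming weights of output unit j.
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        acts = [[math.tanh(sum(w * a for w, a in zip(row, x))) for row in W] for x in acts]
        flat = [v for x in acts for v in x]
        stats.append((statistics.fmean(flat), statistics.pstdev(flat)))
    return stats

stats = layer_stats(weight_std=0.01)
for layer, (mean, std) in enumerate(stats, start=1):
    print(f"layer {layer:2d}: mean {mean:+.2e}  std {std:.2e}")
```

With these settings the means stay near zero while the standard deviation shrinks by roughly an order of magnitude per layer, so the deepest layers output values that are effectively zero.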

349
00:40:07,457 --> 00:40:10,420
And so now let's think
about the backwards pass.

350
00:40:10,420 --> 00:40:16,144
So, if we do a backward pass, now assuming this was our
forward pass and now we want to compute our gradients.

351
00:40:16,144 --> 00:40:20,024
So first, what do the gradients
look like on the weights?

352
00:40:24,155 --> 00:40:26,238
Does anyone have a guess?

353
00:40:28,571 --> 00:40:36,531
So, if we think about this, we have our input
values are very small at each layer right,

354
00:40:36,531 --> 00:40:43,273
because they've all collapsed at this near zero, and then
now each layer, we have our upstream gradient flowing down,

355
00:40:43,273 --> 00:40:53,483
and then in order to get the gradient on the weights remember it's our upstream gradient
times our local gradient, which for this dot product that we're doing, W times X.

356
00:40:53,483 --> 00:40:56,985
It's just basically going to
be X, which is our inputs.

357
00:40:56,985 --> 00:41:00,571
So, it's again a similar kind of problem
that we saw earlier,

358
00:41:00,571 --> 00:41:07,058
where now since, so here because X is small, our weights are
getting a very small gradient, and they're basically not updating.

359
00:41:07,058 --> 00:41:13,488
So, this is a way that you can basically try and think
about the effect of gradient flows through your networks.

360
00:41:13,488 --> 00:41:20,329
You can always think about what the forward pass is doing, and then
think about what's happening as you have gradient flows coming down,

361
00:41:20,329 --> 00:41:28,562
and different types of inputs, what the effect of this
actually is on our weights and the gradients on them.

362
00:41:28,562 --> 00:41:38,025
And so also, if now if we think about what's the gradient that's going to
be flowing back from each layer as we're chaining all these gradients.

363
00:41:40,004 --> 00:41:50,291
Alright, so this is going to be the flip thing where we have now the gradient flowing
back is our upstream gradient times in this case the local gradient is W on our input X.

364
00:41:50,291 --> 00:41:53,085
And so again, because
this is the dot product,

365
00:41:53,085 --> 00:42:06,208
and so now, actually going backwards at each layer, we're basically doing a multiplication
of the upstream gradient by our weights in order to get the next gradient flowing downwards.

366
00:42:07,283 --> 00:42:18,198
And so because here, we're multiplying by W over and over again. You're getting basically the
same phenomenon as we had in the forward pass where everything is getting smaller and smaller.

367
00:42:18,198 --> 00:42:23,541
And now the gradient, upstream gradients
are collapsing to zero as well.

368
00:42:23,541 --> 00:42:24,869
Question?

369
00:42:24,869 --> 00:42:28,452
[student speaking off mic]

370
00:42:30,731 --> 00:42:37,945
Yes, I guess upstream and downstream is, can be interpreted
differently, depending on if you're going forward and backward,

371
00:42:37,945 --> 00:42:43,907
but in this case we're going, we're doing, we're going
backwards, right? We're doing back propagation.

372
00:42:43,907 --> 00:42:51,409
And so upstream is the gradient flowing, you can think of
a flow from your loss, all the way back to your input.

373
00:42:51,409 --> 00:42:58,684
And so upstream is what came from what you've already
done, flowing you know, down into your current node.

374
00:43:00,270 --> 00:43:07,521
Right, so we're flowing downwards, and what we get coming
into the node through backprop is coming from upstream.

375
00:43:13,888 --> 00:43:21,102
Okay, so now let's think about what happens when, you know we saw
that this was a problem when our weights were pretty small, right?

376
00:43:21,102 --> 00:43:26,133
So, we can think about well, what if we just
try and solve this by making our weights big?

377
00:43:26,133 --> 00:43:38,273
So, let's sample from this standard gaussian, now with standard deviation
one instead of 0.01. So what's the problem here? Does anyone have a guess?

378
00:43:44,558 --> 00:43:54,750
If our weights are now all big, and we're passing them, and we're taking
these outputs of W times X, and passing them through tanh nonlinearities,

379
00:43:54,750 --> 00:44:01,883
remember we were talking about what happens at different
values of inputs to tanh, so what's the problem?

380
00:44:01,883 --> 00:44:06,289
Okay, so yeah I heard that it's going
to be saturated, so that's right.

381
00:44:06,289 --> 00:44:15,966
Basically now, because our weights are going to be big, we're going to always
be at saturated regimes of either very negative or very positive of the tanh.

382
00:44:15,966 --> 00:44:29,695
And so in practice, what you're going to get here is now if we look at the distribution of the activations
at each of the layers here on the bottom, they're going to be all basically negative one or plus one.

383
00:44:30,855 --> 00:44:40,447
Right, and so this will have the problem that we talked about with the tanh earlier, when
they're saturated, that all the gradients will be zero, and our weights are not updating.
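
A quick way to see the saturation is to look at a single tanh layer with weights of standard deviation one. The sketch below is illustrative only (100 inputs, 100 units, and 50 random examples are assumed values). It measures what fraction of activations sit beyond 0.99 in absolute value, and the average local gradient of tanh, which is 1 - tanh(z)^2.

```python
import math
import random

rng = random.Random(0)
width, batch = 100, 50

X = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
W = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]  # std 1 instead of 0.01

# Activations of one tanh layer, flattened over examples and units.
acts = [math.tanh(sum(w * v for w, v in zip(row, x))) for x in X for row in W]

saturated_fraction = sum(abs(a) > 0.99 for a in acts) / len(acts)
mean_local_grad = sum(1.0 - a * a for a in acts) / len(acts)   # d tanh(z) / dz = 1 - tanh(z)^2

print(f"fraction with |activation| > 0.99: {saturated_fraction:.2f}")
print(f"average local gradient of tanh:    {mean_local_grad:.3f}")
```

Most units end up close to -1 or +1, where the local gradient is close to zero, so very little gradient reaches the weights.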

384
00:44:41,397 --> 00:44:46,363
So basically, it's really hard to get
your weight initialization right.

385
00:44:46,363 --> 00:44:50,296
When it's too small they all collapse.
When it's too large they saturate.

386
00:44:50,296 --> 00:44:55,553
So, there's been some work in trying to figure out well,
what's the proper way to initialize these weights.

387
00:44:55,553 --> 00:45:02,507
And so, one kind of good rule of thumb that
you can use is the Xavier initialization.

388
00:45:02,507 --> 00:45:07,388
And so this is from this
paper by Glorot in 2010.

389
00:45:07,388 --> 00:45:15,962
And so what this formula is, is if we look at W up here,
we can see that we want to initialize them to these,

390
00:45:17,403 --> 00:45:22,653
we sample from our standard gaussian, and then we're
going to scale by the number of inputs that we have.

391
00:45:22,653 --> 00:45:28,599
And you can go through the math, and you can see in the lecture
notes as well as in this paper of exactly how this works out,

392
00:45:28,599 --> 00:45:35,789
but basically the way we do it is we specify that we want the
variance of the input to be the same as a variance of the output,

393
00:45:35,789 --> 00:45:42,789
and then if you derive what the weight should be you'll get
this formula, and intuitively what this kind of means is that

394
00:45:42,789 --> 00:45:52,654
if you have a small number of inputs right, then we're going to divide by the smaller
number and get larger weights, and we need larger weights, because with small inputs,

395
00:45:52,654 --> 00:45:58,993
and you're multiplying each of these by a weight, you need
larger weights to get the same larger variance at output,

396
00:45:58,993 --> 00:46:08,505
and kind of vice versa for if we have many inputs, then we want
smaller weights in order to get the same spread at the output.

397
00:46:08,505 --> 00:46:10,795
So, you can look at the notes
for more details about this.

398
00:46:10,795 --> 00:46:23,150
And so basically now, if we want to have a unit gaussian, right as input to each layer, we
can use this kind of initialization at training time, to be able to initialize this,

399
00:46:23,150 --> 00:46:27,669
so that there is approximately
a unit gaussian at each layer.
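
As a rough numerical check of the variance-matching idea, the sketch below draws weights from a standard gaussian divided by the square root of the number of inputs (the usual form of Xavier initialization, stated here as an assumption about the slide) and measures the variance of a purely linear layer's output for unit gaussian inputs. The helper name and the sizes are illustrative.

```python
import math
import random
import statistics

def output_variance(fan_in, fan_out=100, batch=50, seed=0):
    """Variance of W x for unit gaussian x and Xavier-scaled W (no nonlinearity)."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(fan_in)   # same as sampling randn and dividing by sqrt(fan_in)
    W = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(batch)]
    outs = [sum(w * v for w, v in zip(row, x)) for x in X for row in W]
    return statistics.pvariance(outs)

variances = {fan_in: output_variance(fan_in) for fan_in in (10, 500)}
for fan_in, var in variances.items():
    print(f"fan_in = {fan_in:3d}: output variance ~ {var:.2f}")
```

With few inputs the individual weights are larger and with many inputs they are smaller, yet in both cases the output variance stays close to the input variance of one.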

400
00:46:29,057 --> 00:46:35,032
Okay, and so one thing this does assume, though, is
that there are linear activations,

401
00:46:35,032 --> 00:46:40,837
and so it assumes that we are in the activation,
in the active region of the tanh, for example.

402
00:46:40,837 --> 00:46:46,051
And so again, you can look at the notes to
really try and understand its derivation,

403
00:46:46,051 --> 00:46:51,255
but the problem is that this breaks
when now you use something like a ReLU.

404
00:46:51,255 --> 00:46:54,849
Right, and so with the
ReLU what happens is that,

405
00:46:54,849 --> 00:47:04,685
because it's killing half of your units, it's setting approximately half of them
to zero at each time, it's actually halving the variance that you get out of this.

406
00:47:04,685 --> 00:47:16,193
And so now, if you just make the same assumptions as your derivation earlier you
won't actually get the right variance coming out, it's going to be too small.

407
00:47:16,193 --> 00:47:23,323
And so what you see is again this kind of
phenomenon, where the distributions start collapsing.

408
00:47:23,323 --> 00:47:28,019
In this case you get more and more peaked
toward zero, and more units deactivated.

409
00:47:29,541 --> 00:47:41,580
And the way to address this is something that has been pointed out in some papers,
which is that you can try to account for this with an extra divided by two.

410
00:47:41,580 --> 00:47:47,023
So, now you're basically adjusting for the
fact that half the neurons get killed.

411
00:47:48,636 --> 00:47:58,122
And so your kind of equivalent input has actually half this number of inputs,
and so you just add this divided by two factor in, this works much better,

412
00:47:59,332 --> 00:48:05,348
and you can see that the distributions are pretty
good throughout all layers of the network.
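
The effect of the extra factor of two can be sketched by running the same ReLU network with both scalings. This is an illustration under assumptions: 10 layers of 100 units (smaller than the lecture's network so that it runs quickly in plain Python), unit gaussian input data, and a hypothetical helper `relu_layer_stds`. The second scaling, a standard deviation of sqrt(2 / fan_in), is commonly known as He initialization.

```python
import math
import random
import statistics

def relu_layer_stds(std_for_fan_in, depth=10, width=100, batch=20, seed=0):
    """Standard deviation of the ReLU activations at each layer."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    stds = []
    for _ in range(depth):
        s = std_for_fan_in(width)
        W = [[rng.gauss(0.0, s) for _ in range(width)] for _ in range(width)]
        acts = [[max(0.0, sum(w * a for w, a in zip(row, x))) for row in W] for x in acts]
        stds.append(statistics.pstdev([v for x in acts for v in x]))
    return stds

xavier = relu_layer_stds(lambda n: math.sqrt(1.0 / n))   # randn / sqrt(fan_in)
he = relu_layer_stds(lambda n: math.sqrt(2.0 / n))       # randn / sqrt(fan_in / 2)

for layer, (a, b) in enumerate(zip(xavier, he), start=1):
    print(f"layer {layer:2d}: Xavier std {a:.4f}   with /2 factor std {b:.4f}")
```

With the plain Xavier scaling the spread of the activations shrinks layer after layer, while with the divided-by-two correction it stays at roughly the same scale through the network.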

413
00:48:06,959 --> 00:48:16,161
And so in practice this has been really important, actually, for training. These types of
little things, really paying attention to how your weights are initialized, make a big difference.

414
00:48:16,161 --> 00:48:28,309
And so for example, you'll see in some papers that this actually is the difference
between the network even training at all and performing well versus nothing happening.

415
00:48:32,548 --> 00:48:36,321
So, proper initialization is still
an active area of research.

416
00:48:36,321 --> 00:48:40,281
And so if you're interested in this, you can
look at a lot of these papers and resources.

417
00:48:40,281 --> 00:48:51,701
A good general rule of thumb is basically use the Xavier Initialization to start
with, and then you can also think about some of these other kinds of methods.

418
00:48:53,871 --> 00:49:01,405
And so now we're going to talk about a related idea to this, so this
idea of wanting to keep activations in a gaussian range that we want.

419
00:49:03,330 --> 00:49:09,672
Right, and so this idea behind what we're going to call batch
normalization is, okay we want unit gaussian activations.

420
00:49:09,672 --> 00:49:14,240
Let's just make them that way.
Let's just force them to be that way.

421
00:49:14,240 --> 00:49:15,834
And so how does this work?

422
00:49:15,834 --> 00:49:25,640
So, let's consider a batch of activations at some layer. And so now we have
all of our activations coming out. If we want to make this unit gaussian,

423
00:49:25,640 --> 00:49:29,368
we actually can just do
this empirically, right.

424
00:49:29,368 --> 00:49:39,392
We can take the mean of the current batch,
and the variance, and we can just normalize by this.

425
00:49:39,392 --> 00:49:50,867
Right, and so basically, instead of weight initialization, where we're setting this at the start of
training so that we try and get it into a good spot where we can have unit gaussians at every layer,

426
00:49:50,867 --> 00:49:53,096
and hopefully during training
this will be preserved.

427
00:49:53,096 --> 00:49:58,336
Now we're going to explicitly make that happen
on every forward pass through the network.

428
00:49:58,336 --> 00:50:06,787
We're going to make this happen functionally, and basically
by normalizing by the mean and the variance of each neuron,

429
00:50:08,139 --> 00:50:15,754
we look at all of the inputs coming into it and calculate the
mean and variance for that batch and normalize it by it.

430
00:50:15,754 --> 00:50:19,928
And the thing is that this is a, this is
just a differentiable function right?

431
00:50:19,928 --> 00:50:31,098
If we have our mean and our variance as constants, this is just a sequence of
computational operations that we can differentiate and do back prop through this.

432
00:50:33,115 --> 00:50:47,065
Okay, so just as I was saying earlier right, if we look at our input data, and we think of
this as we have N training examples in our current batch, and then each example has dimension D,

433
00:50:47,065 --> 00:50:56,063
we're going to compute the empirical mean and variance
independently for each dimension, so each basically feature element,

434
00:50:56,063 --> 00:51:02,406
and we compute this across our batch, our current
mini-batch that we have and we normalize by this.

435
00:51:05,786 --> 00:51:09,988
And so this is usually inserted after
fully connected or convolutional layers.

436
00:51:09,988 --> 00:51:18,932
We saw that when we were multiplying by W in these layers, which we do over
and over again, then we can have this bad scaling effect with each one.

437
00:51:18,932 --> 00:51:22,731
And so this basically is
able to undo this effect.

438
00:51:22,731 --> 00:51:37,132
Right, and since we're basically just scaling by the inputs connected to each neuron, each activation,
we can apply this the same way to fully connected and convolutional layers, and the only difference is that,

439
00:51:37,132 --> 00:51:45,895
with convolutional layers, we want to normalize not just across all the
training examples, and independently for each feature dimension,

440
00:51:45,895 --> 00:51:58,895
but we actually want to normalize jointly across both all the feature dimensions, all the
spatial locations that we have in our activation map, as well as all of the training examples.

441
00:51:58,895 --> 00:52:05,903
And we do this, because we want to obey the convolutional property,
and we want nearby locations to be normalized the same way, right?

442
00:52:05,903 --> 00:52:13,489
And so with a convolutional layer, we're basically going to have
one mean and one standard deviation per activation map that we have,

443
00:52:13,489 --> 00:52:18,094
and we're going to normalize by this
across all of the examples in the batch.
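
For the convolutional case, the normalization step can be sketched as below. This is an illustrative stand-alone version, not the assignment's reference code: activations are assumed to be nested lists shaped [N][C][H][W], the function name is hypothetical, and the learned scale and shift are left out so that only the shared per-channel statistics are shown.

```python
import math

def spatial_batchnorm(x, eps=1e-5):
    """Normalize activations shaped [N][C][H][W] with one mean and variance per channel,
    computed jointly over all examples and all spatial locations."""
    n_channels = len(x[0])
    out = [[None] * n_channels for _ in x]
    stats = []
    for c in range(n_channels):
        values = [v for example in x for row in example[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, var))
        for n, example in enumerate(x):
            out[n][c] = [[(v - mean) / math.sqrt(var + eps) for v in row]
                         for row in example[c]]
    return out, stats

# Two examples, two channels, 1 x 2 activation maps (made-up values).
x = [
    [[[1.0, 3.0]], [[10.0, 10.0]]],
    [[[5.0, 7.0]], [[30.0, 30.0]]],
]
normalized, stats = spatial_batchnorm(x)
```

Every location in a given activation map is shifted and scaled by the same two numbers, which is what keeps nearby locations normalized in the same way.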

444
00:52:18,094 --> 00:52:23,098
And so this is something that you guys are
going to implement in your next homework.

445
00:52:23,098 --> 00:52:29,367
And so, all of these details are explained
very clearly in this paper from 2015.

446
00:52:29,367 --> 00:52:35,621
And so this is a very useful technique
that you want to use a lot in practice.

447
00:52:35,621 --> 00:52:46,129
You want to have these batch normalization layers. And so you should read this
paper. Go through all of the derivations, and then also go through the derivations

448
00:52:46,129 --> 00:52:53,718
of how to compute the gradients
given this normalization operation.

449
00:52:56,626 --> 00:52:59,993
Okay, so one thing that I just
want to point out is that,

450
00:52:59,993 --> 00:53:05,930
it's not clear that, you know, we're doing this batch
normalization after every fully connected layer,

451
00:53:05,930 --> 00:53:12,031
and it's not clear that we necessarily want a
unit gaussian input to these tanh nonlinearities,

452
00:53:12,031 --> 00:53:17,107
because what this is doing is this is constraining
you to the linear regime of this nonlinearity,

453
00:53:17,107 --> 00:53:21,974
and we're not actually, you're trying to basically
say, let's not have any of this saturation,

454
00:53:21,974 --> 00:53:30,821
but maybe a little bit of this is good, right? You want to be
able to control how much saturation you want to have.

455
00:53:31,845 --> 00:53:39,512
And so the way that we address this when we're doing batch
normalization is that we have our normalization operation,

456
00:53:39,512 --> 00:53:44,453
but then after that we have this additional
squashing and scaling operation.

457
00:53:44,453 --> 00:53:52,515
So, we do our normalization. Then we're going to scale by some
constant gamma, and then shift by another factor of beta.

458
00:53:53,349 --> 00:54:02,071
Right, and so what this actually does is that this allows you
to be able to recover the identity function if you wanted to.

459
00:54:02,071 --> 00:54:10,613
So, if the network wanted to, it could learn your scaling factor gamma
to be just your standard deviation. It could learn your beta to be your mean,

460
00:54:10,613 --> 00:54:16,659
and in this case you can recover the identity
mapping, as if you didn't have batch normalization.

461
00:54:16,659 --> 00:54:32,225
And so now you have the flexibility of doing kind of everything in between, and having the network learn
how to make your tanh more or less saturated, and how much to do so, in order to have good training.

462
00:54:38,166 --> 00:54:42,285
Okay, so just to sort of summarize
the batch normalization idea.

463
00:54:42,285 --> 00:54:52,906
Right, so given our inputs, we're going to compute our mini-batch mean. So,
we do this for every mini-batch that's coming in. We compute our variance.

464
00:54:52,906 --> 00:54:58,342
We normalize by the mean and variance, and we
have this additional scaling and shifting factor.
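
The summarized steps can be written out as a short forward pass. This is a minimal sketch for the fully connected case at training time, with a hypothetical function name and a small made-up batch; the small constant `eps` is an assumed numerical safeguard. It also checks the identity-recovery point: if gamma equals the batch standard deviation and beta equals the batch mean, the original inputs come back.

```python
import math
import statistics

def batchnorm_forward(batch, gamma, beta, eps=1e-5):
    """batch: N examples, each a list of D features.
    Normalize each feature over the batch, then scale by gamma and shift by beta."""
    columns = list(zip(*batch))
    means = [statistics.fmean(col) for col in columns]
    variances = [statistics.pvariance(col, mu) for col, mu in zip(columns, means)]
    out = []
    for row in batch:
        x_hat = [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, means, variances)]
        out.append([g * xh + b for g, xh, b in zip(gamma, x_hat, beta)])
    return out, means, variances

batch = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]

# gamma = 1, beta = 0: plain normalization to zero mean and (about) unit variance.
normalized, means, variances = batchnorm_forward(batch, [1.0, 1.0], [0.0, 0.0])

# gamma = batch standard deviation, beta = batch mean: the identity mapping is recovered.
gamma_id = [math.sqrt(v + 1e-5) for v in variances]
recovered, _, _ = batchnorm_forward(batch, gamma_id, means)
```

In a real network gamma and beta would be learned by gradient descent along with the weights; here they are set by hand only to show the two extremes.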

465
00:54:58,342 --> 00:55:05,484
And so this improves gradient flow through the
network. It's also more robust as a result.

466
00:55:05,484 --> 00:55:10,562
It works for a wider range of learning rates,
and different kinds of initialization,

467
00:55:10,562 --> 00:55:16,955
so people have seen that once you put batch normalization in,
it's just easier to train, and so that's why you should do this.

468
00:55:16,955 --> 00:55:27,162
And then also one thing that I just want to point out is that you
can also think of this as in a way also doing some regularization.

469
00:55:27,162 --> 00:55:42,733
Right and so, because now at the output of each layer, each of these activations, each of these outputs, is a
function of both your input X, as well as the other examples in the batch that it happens to be sampled with, right,

470
00:55:42,733 --> 00:55:48,266
because you're going to normalize each input
data by the empirical mean over that batch.

471
00:55:48,266 --> 00:55:54,021
So because of that, it's no longer producing
deterministic values for a given training example,

472
00:55:54,021 --> 00:55:57,543
and it's tying all of these
inputs in a batch together.

473
00:55:57,543 --> 00:56:07,215
And so this basically, because it's no longer deterministic, kind of jitters your
representation of X a little bit, and in a sense, gives some sort of regularization effect.

474
00:56:08,941 --> 00:56:10,490
Yeah, question?

475
00:56:10,490 --> 00:56:13,401
[student speaking off camera]

476
00:56:13,401 --> 00:56:17,354
The question is gamma and beta are learned
parameters, and yes that's the case.

477
00:56:17,354 --> 00:56:20,937
[student speaking off mic]

478
00:56:27,754 --> 00:56:34,618
Yeah, so the question is why do we want to learn this gamma
and beta to be able to learn the identity function back,

479
00:56:34,618 --> 00:56:38,481
and the reason is because
you want to give it the flexibility.

480
00:56:38,481 --> 00:56:48,381
Right, so what batch normalization is doing, is it's forcing our
data to become this unit gaussian, our inputs to be unit gaussian,

481
00:56:48,381 --> 00:56:54,232
but even though in general this is a good idea, it's
not always that this is exactly the best thing to do.

482
00:56:54,232 --> 00:57:00,279
And we saw in particular for something like a tanh, you might
want to control some degree of saturation that you have.

483
00:57:00,279 --> 00:57:14,195
And so what this does is it gives you the flexibility of doing this exact like unit gaussian normalization, if it
wants to, but also learning that maybe in this particular part of the network, maybe that's not the best thing to do.

484
00:57:14,195 --> 00:57:19,838
Maybe we want something still in this general idea, but
slightly different right, slightly scaled or shifted.

485
00:57:19,838 --> 00:57:25,968
And so these parameters just give it that extra
flexibility to learn that if it wants to.

486
00:57:25,968 --> 00:57:35,665
And then yeah, if the best thing to do is just batch
normalization then it'll learn the right parameters for that. Yeah?

487
00:57:35,665 --> 00:57:39,710
[student speaking off mic]

488
00:57:39,710 --> 00:57:47,079
Yeah, so basically each neuron output. So, we have
output of a fully connected layer. We have W times X.

489
00:57:48,366 --> 00:57:57,365
and so we have the values of each of these outputs, and then we're going
to apply batch normalization separately to each of these neurons.

490
00:57:57,365 --> 00:57:58,835
Question?

491
00:57:58,835 --> 00:58:02,418
[student speaking off mic]

492
00:58:10,031 --> 00:58:17,517
Yeah, so the question is that for things like reinforcement learning,
you might have a really small batch size. How do you deal with this?

493
00:58:17,517 --> 00:58:24,324
So in practice, I guess batch normalization has been used a
lot for standard convolutional neural networks,

494
00:58:24,324 --> 00:58:34,520
and there's actually papers on how do we want to do normalization for different kinds of recurrent
networks, or you know some of these networks that might also be in reinforcement learning.

495
00:58:34,520 --> 00:58:40,532
And so there's different considerations that you might want to
think of there. And this is still an active area of research.

496
00:58:40,532 --> 00:58:49,490
There's papers on this and we might also talk about some of this more later,
but for a typical convolutional neural network this generally works fine.

497
00:58:49,490 --> 00:58:57,741
And then if you have a smaller batch size, maybe this becomes a
little bit less accurate, but you still get kind of the same effect.

498
00:58:57,741 --> 00:59:06,088
And you know it's possible also that you could design your mean
and variance to be computed maybe over more examples, right,

499
00:59:06,088 --> 00:59:14,755
and I think in practice usually it's just okay, so you don't see this too
much, but this is something that maybe could help if that was a problem.

500
00:59:14,755 --> 00:59:16,128
Yeah, question?

501
00:59:16,128 --> 00:59:19,711
[student speaking off mic]

502
00:59:24,947 --> 00:59:32,979
So the question is, if we force the
inputs to be gaussian, do we lose the structure?

503
00:59:35,211 --> 00:59:45,221
So, no in a sense that you can think of like, if you had all your features distributed
as a gaussian for example, even if you were just doing data pre-processing,

504
00:59:45,221 --> 00:59:47,925
this gaussian is not
losing you any structure.

505
00:59:47,925 --> 00:59:57,913
It's just shifting and scaling your data into a regime that
works well for the operations that you're going to perform on it.

506
00:59:57,913 --> 01:00:03,169
In convolutional layers, you do have some structure,
that you want to preserve spatially, right.

507
01:00:03,169 --> 01:00:09,156
You want, like if you look at your activation maps, you
want them to relatively all make sense to each other.

508
01:00:09,156 --> 01:00:17,823
So, in this case you do want to take that into consideration. And so now,
we're going to normalize, find one mean for the entire activation map,

509
01:00:17,823 --> 01:00:22,815
so we only find the empirical mean
and variance over training examples.

510
01:00:22,815 --> 01:00:32,455
And so that's something that you'll be doing in your homework, and
also explained in the paper as well. So, you should refer to that.

511
01:00:32,455 --> 01:00:33,288
Yes.

512
01:00:34,287 --> 01:00:37,870
[student speaking off mic]

513
01:00:43,065 --> 01:00:47,849
So the question is, are we normalizing
the weights so that they become gaussian.

514
01:00:47,849 --> 01:00:49,665
So, if I understand
your question correctly,

515
01:00:49,665 --> 01:00:58,727
then the answer is, we're normalizing the inputs to each
layer, so we're not changing the weights in this process.

516
01:01:00,895 --> 01:01:04,562
[student speaking off mic]

517
01:01:15,208 --> 01:01:24,512
Yeah, so the question is, once we subtract by the mean and divide by the
standard deviation, does this become gaussian, and the answer is yes.

518
01:01:24,512 --> 01:01:33,843
So, if you think about the operations that are happening, basically you're
shifting by the mean, right. And so this shifts it to be zero-centered,

519
01:01:33,843 --> 01:01:40,243
and then you're scaling by the standard deviation.
This now transforms this into a unit gaussian.

520
01:01:41,249 --> 01:01:48,630
And so if you want to look more into that, I think you can
look at, there's a lot of machine learning explanations

521
01:01:48,630 --> 01:01:52,942
that go into visualizing exactly
what this operation is doing,

522
01:01:52,942 --> 01:01:58,563
but yeah this basically takes your data
and turns it into a gaussian distribution.

523
01:02:00,458 --> 01:02:02,375
Okay, so yeah question?

524
01:02:03,436 --> 01:02:07,019
[student speaking off mic]

525
01:02:08,262 --> 01:02:09,095
Uh-huh.

526
01:02:26,194 --> 01:02:35,634
So the question is, if we're going to be doing the shift and scale, and learning these,
is the batch normalization redundant, because you could recover the identity mapping?

527
01:02:35,634 --> 01:02:44,523
So in the case that the network learns that identity mapping is always the best, and
it learns these parameters, then yeah, there would be no point for batch normalization,

528
01:02:44,523 --> 01:02:52,579
but in practice this doesn't happen. So in practice, we will learn
this gamma and beta. That's not the same as an identity mapping.

529
01:02:52,579 --> 01:02:58,858
So, it will shift and scale by some amount, but not the
amount that's going to give you an identity mapping.

530
01:02:58,858 --> 01:03:03,201
And so what you get is you still get
this batch normalization effect.

531
01:03:03,201 --> 01:03:14,266
Right, so having this identity mapping there, I'm only putting this here to say that
in the extreme, it could learn the identity mapping, but in practice it doesn't.

532
01:03:14,266 --> 01:03:15,970
Yeah, question.

533
01:03:15,970 --> 01:03:19,553
[student speaking off mic]

534
01:03:21,368 --> 01:03:22,561
Yeah.

535
01:03:22,561 --> 01:03:26,144
[student speaking off mic]

536
01:03:30,825 --> 01:03:37,505
Oh, right, right. Yeah, yeah sorry, I was not clear about this,
but yeah I think this is related to the other question earlier,

537
01:03:38,972 --> 01:03:49,814
that yeah when we're doing this we're actually getting zero mean and unit variance,
which puts this into a nice shape, but it doesn't have to actually be a gaussian.

538
01:03:49,814 --> 01:03:57,830
So yeah, I mean ideally, if we're looking at like inputs
coming in, as you know, sort of approximately gaussian,

539
01:03:57,830 --> 01:04:03,592
we would like it to have this kind of effect,
but yeah, in practice it doesn't have to be.

540
01:04:06,658 --> 01:04:14,017
Okay, so ... Okay, so the last thing I just want to mention about
this is that, so at test time, the batch normalization layer,

541
01:04:17,064 --> 01:04:26,932
we now take the empirical mean and variance from the
training data. So, we don't re-compute this at test time.

542
01:04:26,932 --> 01:04:38,295
We just estimate this at training time, for example using running averages, and then
we're going to use this at test time. So, we're just going to scale by that.
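One possible way to keep those running averages is sketched below. This is a minimal illustration, not the paper's or any library's exact procedure: the class name `RunningBatchNorm` and the `momentum` value of 0.9 are assumptions, and gamma and beta are held fixed rather than learned.

```python
import math

class RunningBatchNorm:
    """One-feature batch norm that tracks running statistics for test time."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum      # assumed decay rate for the running averages
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.gamma = 1.0              # learned scale (fixed here for illustration)
        self.beta = 0.0               # learned shift (fixed here for illustration)

    def forward(self, batch, train):
        if train:
            n = len(batch)
            mean = sum(batch) / n
            var = sum((x - mean) ** 2 for x in batch) / n
            # Update the running estimates; they are only read at test time.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Test time: no batch statistics, just the stored estimates.
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta
                for x in batch]

bn = RunningBatchNorm()
for _ in range(200):
    bn.forward([2.0, 4.0, 6.0, 8.0], train=True)
test_out = bn.forward([5.0], train=False)   # a single example works at test time
```

Because the test-time path never looks at the other examples in the batch, the output for a given input is deterministic once training is done.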

543
01:04:40,078 --> 01:04:43,725
Okay, so now I'm going to move on
to babysitting the learning process.

544
01:04:43,725 --> 01:04:54,264
Right, so now we've defined our network architecture, and we'll talk about
how do we monitor training, and how do we adjust hyperparameters as we go,

545
01:04:54,264 --> 01:04:56,681
to get good learning results?

546
01:04:58,091 --> 01:05:02,251
So as always, so the first step we want to
do, is we want to pre-process the data.

547
01:05:02,251 --> 01:05:05,773
Right, so we want to zero mean the data
as we talked about earlier.

548
01:05:05,773 --> 01:05:13,455
Then we want to choose the architecture, and so here we are
starting with one hidden layer of 50 neurons, for example,

549
01:05:13,455 --> 01:05:18,950
but basically we can pick any
architecture that we want to start with.

550
01:05:20,223 --> 01:05:23,934
And then the first thing that we want
to do is we initialize our network.

551
01:05:23,934 --> 01:05:28,600
We do a forward pass through it, and we want
to make sure that our loss is reasonable.

552
01:05:28,600 --> 01:05:35,697
So, we talked about this several lectures ago, where,
let's say, we have a Softmax classifier that we have here.

553
01:05:37,493 --> 01:05:44,012
We know what our loss should be, when our weights
are small, and we have generally a diffuse distribution.

554
01:05:44,012 --> 01:05:50,293
Then the Softmax classifier
loss is going to be your negative log likelihood,

555
01:05:50,293 --> 01:05:54,826
which if we have 10 classes, it'll be
something like negative log of one over 10,

556
01:05:54,826 --> 01:06:03,213
which here is around 2.3, and so we want to make
sure that our loss is what we expect it to be.
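This sanity check can be sketched in a few lines. The helper `softmax_loss` and the all-zero scores are assumptions standing in for a freshly initialized network with small weights.

```python
import math

def softmax_loss(scores, correct_class):
    """Negative log likelihood of the correct class under a softmax."""
    shift = max(scores)                              # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    prob_correct = exps[correct_class] / sum(exps)
    return -math.log(prob_correct)

num_classes = 10
scores = [0.0] * num_classes          # tiny weights give near-equal scores
initial_loss = softmax_loss(scores, correct_class=3)
expected = -math.log(1.0 / num_classes)
print(round(initial_loss, 4), round(expected, 4))   # both about 2.3026
```

If the first loss your real network reports is far from this value, with regularization turned off, that is a hint something is wrong in the initialization or the loss code.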

557
01:06:03,213 --> 01:06:09,453
So, this is a good sanity check
that we want to always, always do.

558
01:06:09,453 --> 01:06:13,503
So, now once we've seen that our
original loss is good, now we want to,

559
01:06:14,853 --> 01:06:25,463
so first we want to do this having zero regularization, right. So, when we disable the
regularization, now our only loss term is this data loss, which is going to give 2.3 here.

560
01:06:25,463 --> 01:06:36,226
And so here, now we want to crank up the regularization, and when we do that, we want
to see that our loss goes up, because we've added this additional regularization term.

561
01:06:36,226 --> 01:06:40,879
So, this is a good next step
that you can do for your sanity check.

562
01:06:40,879 --> 01:06:46,309
And then, now we can start training.
So, now we start trying to train.

563
01:06:47,331 --> 01:06:53,026
What we do is, a good way to do this is to
start with a very small amount of data,

564
01:06:53,026 --> 01:07:00,944
because if you have just a very small training set, you should be able
to overfit this very well and get very good training loss on here.

565
01:07:00,944 --> 01:07:10,697
And so in this case we want to turn off our regularization
again, and just see if we can make the loss go down to zero.

566
01:07:12,199 --> 01:07:21,961
And so we can see how our loss is changing, as we have all these epochs. We compute
our loss at each epoch, and we want to see this go all the way down to zero.

567
01:07:21,961 --> 01:07:27,124
Right, and here we can see that also our training accuracy
is going all the way up to one, and this makes sense right.

568
01:07:27,124 --> 01:07:32,813
If you have a very small amount of data, you
should be able to overfit this perfectly.

569
01:07:34,726 --> 01:07:40,366
Okay, so now once you've done that, these are all sanity
checks. Now you can start really trying to train.

570
01:07:40,366 --> 01:07:49,480
So, now you can take your full training data, and now start with a small amount
of regularization, and let's first figure out what's a good learning rate.

571
01:07:49,480 --> 01:07:54,942
So, learning rate is one of the most important hyperparameters,
and it's something that you want to adjust first.

572
01:07:54,942 --> 01:08:00,954
So, you want to try some value of learning
rate, and here I've tried one E negative six,

573
01:08:00,954 --> 01:08:04,096
and you can see that the
loss is barely changing.

574
01:08:04,096 --> 01:08:10,244
Right, and so the reason this is barely changing is
usually because your learning rate is too small.

575
01:08:10,244 --> 01:08:16,362
So when it's too small, your gradient updates are not
big enough, and your cost is basically about the same.

576
01:08:17,423 --> 01:08:29,806
Okay, so, one thing that I want to point out here, is that we can notice that even though our loss
was barely changing, the training and the validation accuracy jumped up to 20% very quickly.

577
01:08:32,701 --> 01:08:38,152
And so does anyone have any idea
for why this might be the case?

578
01:08:40,089 --> 01:08:46,403
Why, so remember we have a Softmax function, and our loss
didn't really change, but our accuracy improved a lot.

579
01:08:50,263 --> 01:08:59,727
Okay, so the reason for this is that here the probabilities are
still pretty diffuse, so our loss term is still pretty similar,

580
01:08:59,727 --> 01:09:06,183
but when we shift all of these probabilities slightly
in the right direction, because we're learning right?

581
01:09:06,183 --> 01:09:11,954
Our weights are changing in the right direction.
Now the accuracy all of a sudden can jump,

582
01:09:11,954 --> 01:09:21,985
because we're taking the maximum correct value, and so we're going to get
a big jump in accuracy, even though our loss is still relatively diffuse.
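To see this effect with numbers, here is a small sketch with made-up scores: the correct class gets a nudge of only 0.01 over the other nine classes, so the probabilities stay diffuse, yet the argmax is already right. The helper names and the nudge size are assumptions for illustration.

```python
import math

def softmax(scores):
    shift = max(scores)                              # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss_and_accuracy(all_scores, labels):
    losses, correct = [], 0
    for scores, y in zip(all_scores, labels):
        probs = softmax(scores)
        losses.append(-math.log(probs[y]))
        correct += int(probs.index(max(probs)) == y)   # argmax prediction
    return sum(losses) / len(labels), correct / len(labels)

labels = [0, 1, 2, 3]
# Nudge the correct class score up by a tiny, made-up amount of 0.01.
nudged = [[0.01 if c == y else 0.0 for c in range(10)] for y in labels]
loss, acc = mean_loss_and_accuracy(nudged, labels)
```

Here `acc` is 1.0 while `loss` is still just under the 2.3 you would see for a uniform distribution over 10 classes.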

583
01:09:23,588 --> 01:09:31,325
Okay, so now if we try another learning rate, now here I'm jumping in the
other extreme, picking a very big learning rate, one E six.

584
01:09:31,326 --> 01:09:41,413
What's happening is that our cost is now giving us NaNs. And, when you
have NaNs, what this usually means is that basically your cost exploded.

585
01:09:41,413 --> 01:09:47,862
And so, the reason for that is typically
that your learning rate was too high.

586
01:09:49,350 --> 01:09:57,006
So, then you can adjust your learning rate down again. Here I can see that
we're trying three E to the negative three. The cost is still exploding.

587
01:09:57,006 --> 01:10:04,901
So, usually this, the rough range for learning rates that we want to
look at is between one E negative three, and one E negative five.

588
01:10:04,901 --> 01:10:09,628
And, this is the rough range that we
want to be cross-validating in between.

589
01:10:09,628 --> 01:10:19,011
So, you want to try out values in this range, and depending on whether your loss
is changing too slowly, or whether it's blowing up, adjust it based on this.

590
01:10:21,228 --> 01:10:24,399
And so how do we exactly
pick these hyperparameters?

591
01:10:24,399 --> 01:10:31,139
How do we do hyperparameter optimization, and pick the
best values of all of these hyperparameters?

592
01:10:31,139 --> 01:10:37,575
So, the strategy that we're going to use for any hyperparameter,
for example learning rate, is to do cross-validation.

593
01:10:37,575 --> 01:10:43,472
So, cross-validation is training on your training
set, and then evaluating on a validation set.

594
01:10:43,472 --> 01:10:48,960
How well does this hyperparameter do? Something that
you guys have already done in your assignment.

595
01:10:48,960 --> 01:10:51,334
And so typically we want
to do this in stages.

596
01:10:51,334 --> 01:11:03,473
And so, we can do first a coarse stage, where we pick values pretty spread out apart, and then we
learn for only a few epochs. And with only a few epochs, you can already get a pretty good sense

597
01:11:03,473 --> 01:11:07,993
of which hyperparameters,
which values are good or not, right.

598
01:11:07,993 --> 01:11:13,712
You can quickly see that it's a NaN, or you can see that
nothing is happening, and you can adjust accordingly.

599
01:11:13,712 --> 01:11:22,540
So, typically once you do that, then you can see what's sort of a pretty good
range, and the range that you want to now do finer sampling of values in.

600
01:11:22,540 --> 01:11:30,779
And so, this is the second stage, where now you might want to run
this for a longer time, and do a finer search over that region.

601
01:11:30,779 --> 01:11:47,296
And one tip for detecting explosions like NaNs, you can have in your training loop, right sample
some hyperparameter, start training, and then look at your cost at every iteration or every epoch.

602
01:11:47,296 --> 01:11:57,902
And if you ever get a cost that's much larger than your original cost, so for example, something
like three times original cost, then you know that this is not heading in the right direction.

603
01:11:57,902 --> 01:12:06,335
Right, it's getting very big, very quickly, and you can just break out of
your loop, stop this hyperparameter choice and pick something else.
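A minimal sketch of that tip, assuming a toy problem in place of a real network: gradient descent on cost(w) = w**2, with the three-times-original-cost threshold mentioned above. The function name and the toy cost are assumptions for illustration.

```python
import math

def train_with_explosion_check(learning_rate, num_steps=50, blowup_factor=3.0):
    """Toy gradient descent on cost(w) = w**2 that bails out if the cost explodes."""
    w = 1.0
    initial_cost = w ** 2
    for step in range(num_steps):
        grad = 2.0 * w
        w -= learning_rate * grad
        cost = w ** 2
        # NaN or a cost far above where we started: stop this run early.
        if math.isnan(cost) or cost > blowup_factor * initial_cost:
            return "exploded", step
    return "ok", num_steps

print(train_with_explosion_check(0.1))   # small step size, trains normally
print(train_with_explosion_check(1.5))   # step size too large, stops early
```

In a real hyperparameter search the same check would sit inside the training loop, so that a bad sample costs a few iterations rather than a full run.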

604
01:12:06,335 --> 01:12:12,496
Alright, so an example of this, let's say here we
want to run now a coarse search for five epochs.

605
01:12:13,866 --> 01:12:24,611
This is a similar network that we were talking about earlier, and what we
can do is we can see all of these validation accuracies that we're getting.

606
01:12:24,611 --> 01:12:29,291
And I've put in, highlighted in red
the ones that give better values.

607
01:12:29,291 --> 01:12:33,092
And so these are going to be regions that
we're going to look into in more detail.

608
01:12:33,092 --> 01:12:37,067
And one thing to note is that it's
usually better to optimize in log space.

609
01:12:37,067 --> 01:12:49,040
And so here instead of sampling, say, uniformly between, you know,
0.01 and 100, you're going to actually do 10 to the power of some range.

610
01:12:49,956 --> 01:12:55,427
Right, and this is because the learning
rate is multiplying your gradient update.

611
01:12:55,427 --> 01:13:07,524
And so it has these multiplicative effects, and so it makes more sense to consider a range of
learning rates that are multiplied or divided by some value, rather than uniformly sampled.

612
01:13:07,524 --> 01:13:10,894
So, you want to be dealing
with orders of magnitude here.
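Sampling in log space can be sketched as follows. The exponent range of -5 to -3 matches the rough learning-rate range mentioned earlier; the function name and the fixed seed are assumptions for illustration.

```python
import random

def sample_learning_rate(rng, low_exp=-5.0, high_exp=-3.0):
    """Sample a learning rate uniformly in log space: 10 ** uniform(low, high)."""
    return 10 ** rng.uniform(low_exp, high_exp)

rng = random.Random(0)                 # fixed seed so runs are repeatable
samples = [sample_learning_rate(rng) for _ in range(1000)]
# Roughly half the samples land in each decade, [1e-5, 1e-4) and [1e-4, 1e-3].
lower_decade = sum(1 for lr in samples if lr < 1e-4)
```

Sampling the learning rate itself uniformly between 1e-5 and 1e-3 would instead put about 90% of the samples above 1e-4, leaving the lower decade barely explored.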

613
01:13:10,894 --> 01:13:14,379
Okay, so once you find that,
you can then adjust your range.

614
01:13:14,379 --> 01:13:26,176
Right, in this case, we have a range of you know, maybe of 10 to the negative four,
right, to 10 to the zero power. This is a good range that we want to narrow down into.

615
01:13:26,176 --> 01:13:37,962
And so we can do this again, and here we can see that we're getting a relatively
good accuracy of 53%. And so this means we're headed in the right direction.

616
01:13:37,962 --> 01:13:42,377
The one thing that I want to point out
is that here we actually have a problem.

617
01:13:42,377 --> 01:13:50,396
And so the problem is that we can see that our best
accuracy here has a learning rate that's about,

618
01:13:52,373 --> 01:13:57,816
you know, all of our good learning rates
are in this one E negative four range.

619
01:13:57,816 --> 01:14:10,273
Right, and since the learning rate that we specified was going from 10 to the negative four to 10 to the
zero, that means that all the good learning rates, were at the edge of the range that we were sampling.

620
01:14:10,273 --> 01:14:11,856
And so this is bad,

621
01:14:12,693 --> 01:14:17,113
because this means that we might not have
explored our space sufficiently, right.

622
01:14:17,113 --> 01:14:20,485
We might actually want to go to 10 to the
negative five, or 10 to the negative six.

623
01:14:20,485 --> 01:14:23,494
There might be still better ranges
if we continue shifting down.

624
01:14:23,494 --> 01:14:32,839
So, you want to make sure that your range kind of has the good values somewhere in the middle,
or somewhere where you get a sense that you've hit, you've explored your range fully.

625
01:14:36,224 --> 01:14:43,741
Okay, and so another thing is that we can sample all of our
different hyperparameters, using a kind of grid search, right.

626
01:14:43,741 --> 01:14:49,731
We can sample for a fixed set of combinations,
a fixed set of values for each hyperparameter.

627
01:14:49,731 --> 01:15:02,334
Sample in a grid manner over all of these values, but in practice it's actually better to
sample from a random layout, so sampling random value of each hyperparameter in a range.

628
01:15:02,334 --> 01:15:10,876
And so what you'll get instead is we'll have these two hyperparameters here that
we want to sample from. You'll get samples that look like this right side instead.

629
01:15:10,876 --> 01:15:19,816
And the reason for this is that if a function is really sort of more
of a function of one variable than another, which is usually true.

630
01:15:19,816 --> 01:15:24,669
Usually we have a lower
effective dimensionality than we actually have.

631
01:15:24,669 --> 01:15:30,342
Then you're going to get many more samples
of the important variable that you have.

632
01:15:30,342 --> 01:15:38,326
You're going to be able to see this shape in this green function
that I've drawn on top, showing where the good values are,

633
01:15:38,326 --> 01:15:46,459
compared to if you just did a grid layout where we were only able to
sample three values here, and you've missed where the good regions were.

634
01:15:46,459 --> 01:15:55,685
Right, and so basically we'll get much more useful signal overall since
we have more samples of different values of the important variable.
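The counting argument can be sketched with nine runs under each strategy. The grid values and the unit range are made up for illustration.

```python
import random

# Grid search: 3 values per hyperparameter, 9 runs in total.
grid_values = [0.0, 0.5, 1.0]
grid = [(a, b) for a in grid_values for b in grid_values]

# Random search: also 9 runs, each hyperparameter drawn independently.
rng = random.Random(0)
rand = [(rng.random(), rng.random()) for _ in range(9)]

# If only the first hyperparameter really matters, count how many
# distinct values of it each strategy actually tried.
distinct_grid = len({a for a, _ in grid})
distinct_rand = len({a for a, _ in rand})
```

With the same budget of nine runs, the grid tries only three distinct values of the important hyperparameter, while random sampling tries nine.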

635
01:15:55,685 --> 01:16:00,427
And so, hyperparameters to play with,
we've talked about learning rate,

636
01:16:00,427 --> 01:16:07,697
things like different types of decay schedules, update
types, regularization, also your network architecture,

637
01:16:07,697 --> 01:16:12,405
so the number of hidden units, the depth, all of these
are hyperparameters that you can optimize over.

638
01:16:12,405 --> 01:16:16,928
And we've talked about some of these, but we'll keep
talking about more of these in the next lecture.

639
01:16:16,928 --> 01:16:24,781
And so you can think of this as kind of, you know, if you're basically
tuning all the knobs right, of some turntable where you're,

640
01:16:26,667 --> 01:16:32,260
you're a neural networks practitioner. You can think of the
music that's output as the loss function that you want,
music that's output is the loss function that you want,

641
01:16:32,260 --> 01:16:36,313
and you want to adjust everything appropriately
to get the kind of output that you want.

642
01:16:36,313 --> 01:16:40,480
Alright, so it's really kind
of an art that you're doing.

643
01:16:42,194 --> 01:16:50,277
And in practice, you're going to do a lot of
hyperparameter optimization, a lot of cross validation.

644
01:16:50,277 --> 01:17:00,368
And so you know, in order to get numbers, people will run cross validation over tons of
hyperparameters, monitor all of them, see which ones are doing better, which ones are doing worse.

645
01:17:00,368 --> 01:17:07,895
Here we have all these loss curves. Pick the right
ones, readjust, and keep going through this process.

646
01:17:07,895 --> 01:17:14,380
And so as I mentioned earlier, as you're monitoring each
of these loss curves, learning rate is an important one,

647
01:17:15,311 --> 01:17:20,654
but you'll get a sense for which
learning rates are good and bad.

648
01:17:20,654 --> 01:17:34,060
So you'll see that if you have a very high, exploding one, right, where your loss explodes, then your learning
rate is too high. If it's too kind of linear and too flat, you'll see that it's too low, it's not changing enough.

649
01:17:34,060 --> 01:17:41,660
And if you get something that looks like there's a steep change, but
then a plateau, this is also an indicator of it being maybe too high,

650
01:17:41,660 --> 01:17:48,460
because in this case, you're taking too large jumps, and
you're not able to settle well into your local optimum.

651
01:17:48,460 --> 01:17:53,572
And so a good learning rate usually ends up looking something
like this, where you have a relatively steep curve,

652
01:17:53,572 --> 01:17:57,993
but then it's continuing to go down, and then you
might keep adjusting your learning rate from there.

653
01:17:57,993 --> 01:18:02,160
And so this is something that
you'll see through practice.

654
01:18:03,522 --> 01:18:12,637
Okay and just, I think we're very close to the end, so just one last thing that
I want to point out is that in case you ever see loss curves,

655
01:18:12,637 --> 01:18:23,567
where it's ... So if you ever see loss curves where it's flat for a while, and then
starts training all of a sudden, a potential reason could be bad initialization.

656
01:18:23,567 --> 01:18:36,383
So in this case, your gradients are not really flowing too well at the beginning, so nothing's really learning, and then
at some point, it just happens to adjust in the right way, such that it tips over and things just start training right?

657
01:18:36,383 --> 01:18:47,901
And so there's a lot of experience at looking at these and seeing what's wrong that you'll
get over time. And so you'll usually want to monitor and visualize your accuracy.

658
01:18:48,826 --> 01:18:54,860
If you have a big gap between your training
accuracy and your validation accuracy,

659
01:18:54,860 --> 01:18:59,652
it usually means that you might have overfitting and you
might want to increase your regularization strength.

660
01:18:59,652 --> 01:19:08,137
If you have no gap, you might want to increase your model capacity,
because you haven't overfit yet. You could potentially increase it more.

661
01:19:08,137 --> 01:19:13,998
And in general, we also want to track the updates, the
ratio of our weight updates to our weight magnitudes.

662
01:19:13,998 --> 01:19:21,428
We can just take the norm of our parameters that
we have to get a sense for how large they are,

663
01:19:21,428 --> 01:19:26,353
and when we have our update size, we can also take
the norm of that, get a sense for how large that is,

664
01:19:26,353 --> 01:19:30,025
and we want this ratio to
be somewhere around 0.001.
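Tracking this ratio could look roughly like the sketch below, assuming plain SGD where the update is the learning rate times the gradient. The weights and gradients are made-up numbers and the helper names are hypothetical.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def update_ratio(weights, grads, learning_rate):
    """Ratio of the update's norm to the weights' norm for plain SGD."""
    update = [-learning_rate * g for g in grads]
    return l2_norm(update) / l2_norm(weights)

weights = [0.5, -0.3, 0.8, 0.1]     # made-up parameter values
grads = [0.4, -0.2, 0.1, 0.3]       # made-up gradients
ratio = update_ratio(weights, grads, learning_rate=1e-3)
```

For these numbers the ratio comes out within an order of magnitude of 0.001. A value several orders of magnitude larger or smaller would be a hint to revisit the learning rate.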

665
01:19:30,025 --> 01:19:35,598
There's a lot of variance in this range,
so you don't have to be exactly on this,

666
01:19:35,598 --> 01:19:41,477
but it's just this sense of you don't want your updates to
be too large compared to your value or too small, right?

667
01:19:41,477 --> 01:19:43,637
You don't want to dominate
or to have no effect.

668
01:19:43,637 --> 01:19:47,811
And so this is just something that can
help debug what might be a problem.

669
01:19:49,843 --> 01:19:59,016
Okay, so in summary, today we've looked at activation functions, data
preprocessing, weight initialization, batch norm, babysitting the learning process,

670
01:19:59,016 --> 01:20:01,694
and hyperparameter optimization.

671
01:20:01,694 --> 01:20:05,338
These are kind of the takeaways for
each that you guys should keep in mind.

672
01:20:05,338 --> 01:20:08,491
Use ReLUs, subtract the mean,
use Xavier Initialization,

673
01:20:08,491 --> 01:20:12,499
use batch norm, and sample
hyperparameters randomly.

674
01:20:12,499 --> 01:20:19,355
And next time we'll continue to talk about training
neural networks with all these different topics.