Lecture 9 _ CNN Architectures.srt
1
00:00:14,752 --> 00:00:21,696
- All right welcome to lecture nine. So today
we will be talking about CNN Architectures.
2
00:00:21,696 --> 00:00:27,706
And just a few administrative points before we
get started, assignment two is due Thursday.
3
00:00:27,706 --> 00:00:36,855
The midterm will be in class on Tuesday May ninth, so next week, and it will
cover material through this coming Thursday, May fourth.
4
00:00:36,855 --> 00:00:41,350
So everything up to recurrent neural
networks is going to be fair game.
5
00:00:41,350 --> 00:00:49,121
The poster session we've decided on a time, it's going to be Tuesday June
sixth from twelve to three p.m. So this is the last week of classes.
6
00:00:49,121 --> 00:00:53,828
So we have our poster session a little bit
early during the last week so that after that,
7
00:00:53,828 --> 00:01:00,132
once you guys get feedback you still have some time to
work on your final report, which will be due finals week.
8
00:01:03,325 --> 00:01:05,812
Okay, so just a quick review of last time.
9
00:01:05,812 --> 00:01:09,324
Last time we talked about different
kinds of deep learning frameworks.
10
00:01:09,324 --> 00:01:12,690
We talked about you know
PyTorch, TensorFlow, Caffe2
11
00:01:14,514 --> 00:01:18,762
and we saw that using these kinds of frameworks we
were able to easily build big computational graphs,
12
00:01:18,762 --> 00:01:25,784
for example very large neural networks and convnets, and
be able to really easily compute gradients in these graphs.
13
00:01:25,784 --> 00:01:32,415
So to compute all of the gradients for all the intermediate
variables, weights, inputs, and use that to train our models
14
00:01:32,415 --> 00:01:35,665
and to run all this efficiently on GPUs
15
00:01:37,658 --> 00:01:44,978
And we saw that for a lot of these frameworks the way this works is by working
with these modularized layers that you guys have been writing,
16
00:01:44,978 --> 00:01:49,928
in your homeworks as well, where we have
a forward pass, we have a backward pass,
17
00:01:49,928 --> 00:01:58,404
and then in our final model architecture, all we need to do then
is to just define this sequence of layers together.
18
00:01:58,404 --> 00:02:04,937
So using that we're able to very easily
build up very complex network architectures.
19
00:02:06,626 --> 00:02:14,520
So today we're going to talk about some specific kinds of CNN Architectures
that are used today in cutting edge applications and research.
20
00:02:14,520 --> 00:02:19,631
And so we'll go into depth in some of the most
commonly used architectures for these that are winners
21
00:02:19,631 --> 00:02:22,125
of ImageNet classification benchmarks.
22
00:02:22,125 --> 00:02:28,085
So in chronological order AlexNet,
VGG net, GoogLeNet, and ResNet.
23
00:02:28,085 --> 00:02:43,771
And so these we'll go into in a lot of depth. And then after that, I'll also briefly go through some other architectures that are
not as prominently used these days, but are interesting either from a historical perspective, or as recent areas of research.
24
00:02:46,822 --> 00:02:50,839
Okay, so just a quick review.
We talked a long time ago about LeNet,
25
00:02:50,839 --> 00:02:55,603
which was one of the first instantiations of a
convnet that was successfully used in practice.
26
00:02:55,603 --> 00:03:05,778
And so this was the convnet that took an input image, used conv filters, five
by five filters applied at stride one, and had a couple of conv layers,
27
00:03:05,778 --> 00:03:09,335
a few pooling layers and then some
fully connected layers at the end.
28
00:03:09,335 --> 00:03:14,320
And this fairly simple convnet was very
successfully applied to digit recognition.
29
00:03:17,030 --> 00:03:22,875
So AlexNet from 2012 which you guys have also
heard already before in previous classes,
30
00:03:22,875 --> 00:03:31,179
was the first large scale convolutional neural network
that was able to do well on the ImageNet classification
31
00:03:31,179 --> 00:03:40,611
task so in 2012 AlexNet was entered in the competition, and was able to
outperform all previous non deep learning based models by a significant margin,
32
00:03:40,611 --> 00:03:48,012
and so this was the convnet that started the
spree of convnet research and usage afterwards.
33
00:03:48,012 --> 00:03:56,427
And so the basic AlexNet convnet architecture is a conv layer
followed by a pooling layer, normalization, so conv, pool, norm,
34
00:03:58,421 --> 00:04:01,006
and then a few more conv
layers, a pooling layer,
35
00:04:01,006 --> 00:04:03,422
and then several fully
connected layers afterwards.
36
00:04:03,422 --> 00:04:09,766
So this actually looks very similar to the LeNet network
that we just saw. There's just more layers in total.
37
00:04:09,766 --> 00:04:18,387
There are five of these conv layers, and two fully connected layers
before the final fully connected layer going to the output classes.
38
00:04:21,889 --> 00:04:25,930
So let's first get a sense of the
sizes involved in the AlexNet.
39
00:04:25,930 --> 00:04:33,128
So if we look at the input to the AlexNet, this was trained
on ImageNet, with inputs of size 227 by 227 by 3 images.
40
00:04:33,128 --> 00:04:43,193
And if we look at this first layer which is a conv layer for the
AlexNet, it's 11 by 11 filters, 96 of these applied at stride 4.
41
00:04:43,193 --> 00:04:49,323
So let's just think about this for a moment.
What's the output volume size of this first layer?
42
00:04:51,788 --> 00:04:53,371
And there's a hint.
43
00:04:57,769 --> 00:05:11,441
So remember we have our input size, we have our convolutional filters, right. And we have this formula,
which is the hint over here, that gives you the size of the output dimensions after applying conv, right?
44
00:05:11,441 --> 00:05:17,632
So remember it was the full image, minus the
filter size, divided by the stride, plus one.
45
00:05:17,632 --> 00:05:26,919
So given that that's written up here for you, does anyone have
a guess at what's the final output size after this conv layer?
46
00:05:26,919 --> 00:05:29,823
[student speaks off mic]
47
00:05:29,823 --> 00:05:32,966
- So I heard 55 by 55 by 96, yep.
That's correct.
48
00:05:32,966 --> 00:05:38,113
Right so our spatial dimensions at the output are
going to be 55 in each dimension and then we have
49
00:05:38,113 --> 00:05:45,391
96 total filters so the depth after our conv layer
is going to be 96. So that's the output volume.
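(To make that arithmetic concrete, here is a minimal sketch of the output-size formula in Python, using the AlexNet numbers just discussed; the helper name is just for illustration, and padding is left out since this layer uses none.)

    # Conv output size: (input_size - filter_size) / stride + 1 (no padding here).
    def conv_output_size(input_size, filter_size, stride):
        return (input_size - filter_size) // stride + 1

    side = conv_output_size(227, 11, stride=4)   # (227 - 11) / 4 + 1 = 55
    num_filters = 96
    print(side, side, num_filters)               # output volume: 55 x 55 x 96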
50
00:05:45,391 --> 00:05:49,486
And what's the total number
of parameters in this layer?
51
00:05:49,486 --> 00:05:52,819
So remember we have 96 11 by 11 filters.
52
00:05:54,851 --> 00:05:57,753
[student speaks off mic]
53
00:05:57,753 --> 00:06:00,753
- [Lecturer] 96 by 11 by 11, almost.
54
00:06:01,945 --> 00:06:05,297
So yes, so I heard another by three,
yes that's correct.
55
00:06:05,297 --> 00:06:13,632
So each of the filters is going to see through a local region
of 11 by 11 by three, right because the input depth was three.
56
00:06:13,632 --> 00:06:18,983
And so, that's each filter size,
times we have 96 of these total.
57
00:06:18,983 --> 00:06:23,150
And so there's 35K parameters
in this first layer.
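(A quick back-of-the-envelope check on that 35K figure, again with the AlexNet numbers from above; biases are left out, they would add one more parameter per filter.)

    # Each filter spans 11 x 11 x 3, since the input depth is 3, and there are 96 filters.
    params_conv1 = 96 * (11 * 11 * 3)
    print(params_conv1)    # 34848, i.e. roughly 35K (plus 96 biases if you count them)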
58
00:06:26,018 --> 00:06:30,233
Okay, so now if we look at the second layer
this is a pooling layer right and in this case
59
00:06:30,233 --> 00:06:34,004
we have three by three
filters applied at stride two.
60
00:06:34,004 --> 00:06:38,171
So what's the output volume
of this layer after pooling?
61
00:06:40,701 --> 00:06:44,868
And again we have a hint, very
similar to the last question.
62
00:06:51,251 --> 00:06:56,267
Okay, 27 by 27 by 96.
Yes that's correct.
63
00:06:57,716 --> 00:07:01,528
Right so the pooling layer is basically
going to use this formula that we had here.
64
00:07:01,528 --> 00:07:16,655
Again, because this is pooling applied at a stride of two, we're going to use the same formula to determine
the spatial dimensions, and so the spatial dimensions are going to be 27 by 27, and pooling preserves the depth.
65
00:07:16,655 --> 00:07:21,527
So we had 96 as depth as input, and it's
still going to be 96 depth at output.
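(The same formula covers the pooling layer; a minimal sketch with the numbers above.)

    # Pooling output size: (input_size - filter_size) / stride + 1, here 3x3 filters at stride 2.
    pool_side = (55 - 3) // 2 + 1
    print(pool_side)    # 27, so the output volume is 27 x 27 x 96; the depth of 96 is preserved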
66
00:07:22,825 --> 00:07:28,127
And next question. What's the
number of parameters in this layer?
67
00:07:31,446 --> 00:07:34,354
I hear some muttering.
[student answers off mic]
68
00:07:34,354 --> 00:07:36,905
- Nothing.
Okay.
69
00:07:36,905 --> 00:07:40,801
Yes, so pooling layer has no parameters,
so, kind of a trick question.
70
00:07:42,739 --> 00:07:45,272
Okay, so we can basically, yes, question?
71
00:07:45,272 --> 00:07:47,192
[student speaks off mic]
72
00:07:47,192 --> 00:07:52,180
- The question is, why are there no
parameters in the pooling layer?
73
00:07:52,180 --> 00:07:54,551
The parameters are the weights right,
that we're trying to learn.
74
00:07:54,551 --> 00:07:56,511
And so convolutional layers
have weights that we learn
75
00:07:56,511 --> 00:08:02,236
but pooling all we do is have a rule, we look
at the pooling region, and we take the max.
76
00:08:02,236 --> 00:08:05,710
So there's no parameters that are learned.
77
00:08:05,710 --> 00:08:14,250
So we can keep on doing this and you can just repeat the process and it's kind of a good
exercise to go through this and figure out the sizes, the parameters, at every layer.
78
00:08:16,473 --> 00:08:22,688
And so if you do this all the way, you can see that
this is the final architecture that you can work with.
79
00:08:22,688 --> 00:08:31,920
There's 11 by 11 filters at the beginning, then five by five and some three
by three filters. And so these are generally pretty familiar looking sizes
80
00:08:31,920 --> 00:08:39,122
that you've seen before and then at the end we have a couple of
fully connected layers of size 4096 and finally the last layer,
81
00:08:39,123 --> 00:08:41,540
is FC8 going to the softmax,
82
00:08:42,689 --> 00:08:46,356
which is going to the
1000 ImageNet classes.
83
00:08:48,039 --> 00:08:56,352
And just a couple of details about this, it was the first use of the ReLU
non-linearity that we've talked about, that's the most commonly used non-linearity.
84
00:08:56,352 --> 00:09:07,391
They used local response normalization layers basically trying to normalize the response
across neighboring channels but this is something that's not really used anymore.
85
00:09:07,391 --> 00:09:11,937
It turned out not to, other people showed
that it didn't have so much of an effect.
86
00:09:11,937 --> 00:09:21,769
There's a lot of heavy data augmentation, and so you can look in the paper for more details,
but things like flipping, jittering, color normalization, all of these things
87
00:09:21,769 --> 00:09:28,727
which you'll probably find useful for you when you're working on
your projects for example, so a lot of data augmentation here.
88
00:09:28,727 --> 00:09:32,419
They also used dropout, a batch size of 128,
89
00:09:32,419 --> 00:09:37,183
and learned with SGD with
momentum which we talked about
90
00:09:37,183 --> 00:09:42,295
in an earlier lecture, and basically just started
with a base learning rate of 1e negative 2.
91
00:09:42,295 --> 00:09:50,145
Every time it plateaus, reduce by a factor of 10 and
then just keep going until they finish training,
92
00:09:50,145 --> 00:09:59,012
and a little bit of weight decay and in the end, in order to get the best numbers
they also did an ensembling of models and so training multiple of these,
93
00:09:59,012 --> 00:10:03,162
averaging them together and this also
gives an improvement in performance.
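(As a rough illustration of that training recipe, here is a hedged PyTorch-style sketch; the momentum and weight decay values are the commonly cited AlexNet choices rather than numbers quoted in this lecture, and the model here is just a stand-in.)

    import torch

    model = torch.nn.Linear(10, 10)  # placeholder for the actual convnet
    # SGD with momentum, base learning rate 1e-2, plus a little weight decay.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=5e-4)
    # Drop the learning rate by a factor of 10 whenever the validation metric plateaus.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
    # ...inside the training loop, after each validation pass:
    # scheduler.step(val_loss)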
94
00:10:04,405 --> 00:10:08,781
And so one other thing I want to point out is
that if you look at this AlexNet diagram up here,
95
00:10:08,781 --> 00:10:15,235
it looks kind of like the normal convnet diagrams
that we've been seeing, except for one difference,
96
00:10:15,235 --> 00:10:21,937
which is that it's, you can see it's kind of split
in these two different rows or columns going across.
97
00:10:23,177 --> 00:10:32,905
And so the reason for this is mostly a historical note, so AlexNet was
trained on GTX 580 GPUs, older GPUs that only had three gigs of memory.
98
00:10:34,106 --> 00:10:37,255
So it couldn't actually fit
this entire network on here,
99
00:10:37,255 --> 00:10:41,773
and so what they ended up doing, was
they spread the network across two GPUs.
100
00:10:41,773 --> 00:10:46,455
So on each GPU you would have half of the
neurons, or half of the feature maps.
101
00:10:46,455 --> 00:10:51,730
And so for example if you look at this first
conv layer, we have 55 by 55 by 96 output,
102
00:10:54,389 --> 00:11:04,155
but if you look at this diagram carefully, you can zoom in later in the actual
paper, you can see that, it's actually only 48 depth-wise, on each GPU,
103
00:11:05,049 --> 00:11:08,593
and so they just spread it, the
feature maps, directly in half.
104
00:11:10,288 --> 00:11:17,367
And so what happens is that for most of these layers, for example conv
one, two, four and five, the connections are only with feature maps
105
00:11:17,367 --> 00:11:29,683
on the same GPU, so you would take as input half of the feature maps that were on
the same GPU as before, and you don't look at the full 96 feature maps for example.
106
00:11:29,683 --> 00:11:33,850
You just take as input the
48 in that first layer.
107
00:11:34,767 --> 00:11:47,696
And then there's a few layers, so conv three, as well as FC six, seven and eight, where the
GPUs do talk to each other, and so there's connections with all feature maps in the preceding layer.
108
00:11:47,696 --> 00:11:54,191
so there's communication across the GPUs, and each of these neurons
are then connected to the full depth of the previous input layer.
109
00:11:54,191 --> 00:11:55,627
Question.
110
00:11:55,627 --> 00:12:01,442
- [Student] It says the full simplified
AlexNet architecture. [mumbles]
111
00:12:05,583 --> 00:12:10,033
- Oh okay, so the question is why does it say
full simplified AlexNet architecture here?
112
00:12:10,033 --> 00:12:19,036
It just says that because I didn't put all the details on here, so for example
this is the full set of layers in the architecture, and the strides and so on,
113
00:12:19,036 --> 00:12:25,268
but for example the normalization layer, there's
other, these details are not written on here.
114
00:12:30,637 --> 00:12:37,849
And then just one little note, if you look at the paper and
try and write out the math and architectures and so on,
115
00:12:38,858 --> 00:12:52,721
there's a little bit of an issue: on the very first layer, if you look in the figure, they'll say 224 by 224,
but there's actually some kind of funny pattern going on, and so the numbers actually work out if you look at it as 227.
116
00:12:54,982 --> 00:13:04,261
AlexNet was the winner of the ImageNet classification benchmark in
2012, you can see that it cut the error rate by quite a large margin.
117
00:13:05,246 --> 00:13:14,193
It was the first CNN-based winner, and it was widely used as a base
architecture almost ubiquitously from then until a couple years ago.
118
00:13:15,720 --> 00:13:17,980
It's still used quite a bit.
119
00:13:17,980 --> 00:13:24,071
It's used in transfer learning for lots of different
tasks and so it was used for basically a long time,
120
00:13:24,071 --> 00:13:33,202
and it was very famous and now though there's been some more recent architectures
that have generally just had better performance and so we'll talk about these
121
00:13:33,202 --> 00:13:39,282
next and these are going to be the more common
architectures that you'll be wanting to use in practice.
122
00:13:40,853 --> 00:13:47,813
So just quickly first in 2013 the ImageNet
challenge was won by something called a ZFNet.
123
00:13:47,813 --> 00:13:48,718
Yes, question.
124
00:13:48,718 --> 00:13:52,729
[student speaks off mic]
125
00:13:52,729 --> 00:13:56,612
- So the question is, is there intuition for why AlexNet was
so much better than the ones that came before,
126
00:13:56,612 --> 00:14:04,786
deep learning convnets, [mumbles] this is just a
very different kind of approach in architecture.
127
00:14:04,786 --> 00:14:09,004
So this was the first deep learning based
approach, the first convnet that was used.
128
00:14:12,445 --> 00:14:18,298
So in 2013 the challenge was won by something called a
ZFNet [Zeiler-Fergus Net], named after the creators.
129
00:14:18,298 --> 00:14:23,749
And so this mostly was improving
hyperparameters over the AlexNet.
130
00:14:23,749 --> 00:14:35,735
It had the same number of layers, the same general structure, and they made a few changes, things like changing
the stride size, different numbers of filters, and after playing around with these hyperparameters more,
131
00:14:35,735 --> 00:14:41,369
they were able to improve the error rate.
But it's still basically the same idea.
132
00:14:41,369 --> 00:14:49,843
So in 2014 there are a couple of architectures that were now more
significantly different and made another jump in performance,
133
00:14:49,843 --> 00:14:58,178
and the main difference with these networks
first of all was much deeper networks.
134
00:14:58,178 --> 00:15:12,321
So from the eight layer network that was in 2012 and 2013, now in 2014 we had two
very close winners that were around 19 layers and 22 layers. So significantly deeper.
135
00:15:12,321 --> 00:15:16,502
And the winner of this
was GoogLeNet, from Google,
136
00:15:16,502 --> 00:15:20,176
but very close behind was
something called VGGNet
137
00:15:20,176 --> 00:15:27,421
from Oxford, and actually on the localization challenge
VGG got first place, and in some of the other tracks.
138
00:15:27,421 --> 00:15:31,958
So these were both very,
very strong networks.
139
00:15:31,958 --> 00:15:34,663
So let's first look at VGG
in a little bit more detail.
140
00:15:34,663 --> 00:15:40,818
And so the VGG network is this idea of much
deeper networks with much smaller filters.
141
00:15:40,818 --> 00:15:50,374
So they increased the number of layers from eight layers in AlexNet
right to now they had models with 16 to 19 layers in VGGNet.
142
00:15:52,290 --> 00:16:03,916
And one key thing that they did was they kept very small filters, so only three by three conv all the way,
which is basically the smallest conv filter size that is looking at a little bit of the neighboring pixels.
143
00:16:03,916 --> 00:16:11,485
And they just kept this very simple structure of three by three
convs with the periodic pooling all the way through the network.
144
00:16:11,485 --> 00:16:19,948
And this very simple, elegant network architecture was
able to get 7.3% top five error on the ImageNet challenge.
145
00:16:22,651 --> 00:16:27,442
So first the question of
why use smaller filters.
146
00:16:27,442 --> 00:16:33,371
So when we take these small filters now we have
fewer parameters and we try and stack more of them
147
00:16:33,371 --> 00:16:39,344
instead of having larger filters, have smaller filters with
more depth instead, have more of these filters instead,
148
00:16:39,344 --> 00:16:47,202
what happens is that you end up having the same effective receptive
field as if you only have one seven by seven convolutional layer.
149
00:16:47,202 --> 00:16:55,466
So here's a question, what is the effective receptive field
of three of these three by three conv layers with stride one?
150
00:16:55,466 --> 00:17:01,189
So if you were to stack three three by three conv layers
with stride one, what's the effective receptive field,
151
00:17:01,189 --> 00:17:09,754
the total area of the input, the spatial area of the input, that
a neuron at the top layer of the three layers is looking at.
152
00:17:12,313 --> 00:17:15,987
So I heard fifteen pixels,
why fifteen pixels?
153
00:17:15,987 --> 00:17:20,609
- [Student] Okay, so the
reason given was because
154
00:17:20,609 --> 00:17:27,369
they overlap-- - Okay, so the reason given was
because they overlap. So it's on the right track.
155
00:17:27,369 --> 00:17:35,668
What actually is happening though is you have to see, at the first
layer, the receptive field is going to be three by three right?
156
00:17:35,668 --> 00:17:43,193
And then at the second layer, each of these neurons in the second
layer is going to look at three by three other first layer
157
00:17:43,193 --> 00:17:51,676
filters, but the corners of these three by three have an additional
pixel on each side that they're looking at in the original input layer.
158
00:17:51,676 --> 00:17:56,423
So the second layer is actually looking at five by
five receptive field and then if you do this again,
159
00:17:56,423 --> 00:18:04,040
the third layer is looking at three by three
in the second layer but this is going to,
160
00:18:04,040 --> 00:18:06,907
if you just draw out this pyramid, be looking
at seven by seven in the input layer.
161
00:18:06,907 --> 00:18:16,026
So the effective receptive field here is going to be seven by
seven. Which is the same as one seven by seven conv layer.
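(A small sketch of that receptive-field argument, one layer at a time, for stride-one three by three convs.)

    # Each stacked stride-1 3x3 conv layer grows the receptive field by 3 - 1 = 2 pixels.
    rf = 1
    for _ in range(3):
        rf += 3 - 1
        print(rf)    # prints 3, then 5, then 7: three stacked layers see a 7 x 7 region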
162
00:18:16,026 --> 00:18:21,546
So what happens is that this has the same effective receptive
field as a seven by seven conv layer but it's deeper.
163
00:18:21,546 --> 00:18:26,201
It's able to have more non-linearities in
there, and it's also fewer parameters.
164
00:18:26,201 --> 00:18:36,536
So if you look at the total number of parameters, each of these conv filters for
the three by threes is going to have nine parameters in each conv [mumbles]
165
00:18:38,165 --> 00:18:44,648
three times three, and then times the input depth, so
three times three times C, times this total number
166
00:18:44,648 --> 00:18:51,034
of output feature maps, which is again C, as we're
going to preserve the total number of channels.
167
00:18:51,034 --> 00:19:00,165
So you get three times three, times C times C for each of these layers,
and we have three layers so it's going to be three times this number,
168
00:19:00,165 --> 00:19:07,409
compared to if you had a single seven by seven layer then you
get, by the same reasoning, seven squared times C squared.
169
00:19:07,409 --> 00:19:11,032
So you're going to have fewer
parameters total, which is nice.
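(And the parameter comparison as a quick sketch; C is just an example channel count here, and biases are ignored.)

    C = 64
    stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers: 27 * C^2 parameters
    single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer:    49 * C^2 parameters
    print(stacked_3x3 < single_7x7)     # True: the stacked version has fewer parameters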
170
00:19:15,570 --> 00:19:24,161
So now if we look at this full network here there's a lot of numbers up
here that you can go back and look at more carefully but if we look at all
171
00:19:24,161 --> 00:19:30,716
of the sizes and number of parameters the same
way that we calculated the example for AlexNet,
172
00:19:30,716 --> 00:19:32,517
this is a good exercise to go through,
173
00:19:32,517 --> 00:19:45,834
we can see that, you know, going the same way, we have a couple of these conv layers and a pooling layer, a
couple more conv layers, a pooling layer, several more conv layers, and so on. And so this just keeps going up.
174
00:19:45,834 --> 00:19:52,431
And if you counted the total number of convolutional and fully
connected layers, we're going to have 16 in this case for VGG 16,
175
00:19:52,431 --> 00:20:00,478
and then VGG 19, it's just a very similar architecture,
but with a few more conv layers in there.
176
00:20:03,021 --> 00:20:05,605
And so the total memory
usage of this network,
177
00:20:05,605 --> 00:20:17,196
so just making a forward pass through and counting up all of these numbers, so the
memory numbers here are written in terms of the total numbers, like we calculated earlier,
178
00:20:17,196 --> 00:20:23,125
and if you look at four bytes per number,
this is going to be about 100 megs per image,
179
00:20:23,125 --> 00:20:28,727
and so this is the scale of the memory usage that's
happening and this is only for a forward pass right,
180
00:20:28,727 --> 00:20:35,470
when you do a backward pass you're going to have to
store more and so this is pretty heavy memory wise.
181
00:20:35,470 --> 00:20:44,410
100 megs per image, so if you have only five gigs of total memory,
then you're only going to be able to store about 50 of these.
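(A rough sketch of where those numbers come from; the roughly 24 million activation count is the approximate VGG-16 forward-pass total from the slide, so treat it as a ballpark figure.)

    activations = 24_000_000              # approx. numbers stored in one VGG-16 forward pass
    forward_megs = activations * 4 / 1e6  # four bytes per float32 number
    print(forward_megs)                   # about 96, i.e. the ~100 megs per image quoted above
    print(int(5000 // forward_megs))      # with ~5 gigs of memory, only about 50 images fit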
182
00:20:47,300 --> 00:20:56,131
And so also the total number of parameters here we have is 138 million
parameters in this network, and this compares with 60 million for AlexNet.
183
00:20:56,131 --> 00:20:57,481
Question?
184
00:20:57,481 --> 00:21:00,898
[student speaks off mic]
185
00:21:06,204 --> 00:21:09,920
- So the question is what do we mean by deeper,
is it the number of filters, number of layers?
186
00:21:09,920 --> 00:21:14,087
So deeper in this case is
always referring to layers.
187
00:21:15,605 --> 00:21:25,216
So there are two usages of the word depth, which is confusing: one is the depth,
right, of the channels, width by height by depth, you can use the word depth here,
188
00:21:26,942 --> 00:21:34,298
but in general we talk about the depth of a network, this is going to
be the total number of layers in the network, and usually in particular
189
00:21:34,298 --> 00:21:43,368
we're counting the total number of weight layers. So the total number of layers
with trainable weights, so convolutional layers and fully connected layers.
190
00:21:43,368 --> 00:21:46,868
[student mumbles off mic]
191
00:22:00,810 --> 00:22:06,174
- Okay, so the question is, within each
layer what do the different filters mean?
192
00:22:06,174 --> 00:22:13,043
And so we talked about this back in the convnet
lecture, so you can also go back and refer to that,
193
00:22:13,043 --> 00:22:27,616
but each filter is, let's say, a three by three conv, so each filter is a set
of weights looking at a three by three by input depth volume, and this produces one feature map,
194
00:22:27,616 --> 00:22:31,954
one activation map of all the responses
of the different spatial locations.
195
00:22:31,954 --> 00:22:39,646
And then we can have as many filters as we want, right, so for
example 96, and each of these is going to produce a feature map.
196
00:22:39,646 --> 00:22:48,368
And so it's just like each filter corresponds to a different pattern that we're looking for
in the input that we convolve around and we see the responses everywhere in the input,
197
00:22:48,368 --> 00:22:56,181
we create a map of these, and then another filter we will
convolve over the image and create another map.
198
00:22:58,761 --> 00:23:00,226
Question.
199
00:23:00,226 --> 00:23:03,643
[student speaks off mic]
200
00:23:07,465 --> 00:23:16,733
- So the question is, is there intuition behind why, as you go deeper into the network,
we have more channel depth, so a larger number of filters, right, and so you can have
201
00:23:17,676 --> 00:23:21,766
any design that you want so
you don't have to do this.
202
00:23:21,766 --> 00:23:24,341
In practice you will see this
happen a lot of the times
203
00:23:24,341 --> 00:23:30,598
and one of the reasons is people try and maintain
kind of a relatively constant level of compute,
204
00:23:30,598 --> 00:23:37,991
so as you go higher up or deeper into your network,
you're usually also using basically down sampling
205
00:23:39,606 --> 00:23:45,759
and having smaller total spatial area and then so then