Lecture 11 _ Detection and Segmentation.srt
1
00:00:08,691 --> 00:00:15,429
- Hello, hi. So I want to get started.
Welcome to CS 231N Lecture 11.
2
00:00:15,430 --> 00:00:23,258
Today we're going to talk about detection, segmentation, and a whole bunch
of other really exciting topics around core computer vision tasks.
3
00:00:23,259 --> 00:00:25,590
But as usual, a couple
administrative notes.
4
00:00:25,590 --> 00:00:31,358
So last time you obviously took the midterm, we
didn't have lecture, hopefully that went okay
5
00:00:31,358 --> 00:00:42,269
for all of you. We're going to work on grading the midterm this week, but as a reminder,
please don't have any public discussions about the midterm questions or answers
6
00:00:42,270 --> 00:00:48,517
until at least tomorrow because there are still some people
taking makeup midterms today and throughout the rest of the week
7
00:00:48,518 --> 00:00:53,668
so we just ask you that you refrain from
talking publicly about midterm questions.
8
00:00:56,329 --> 00:01:02,920
Why don't you wait until Monday?
[laughing] Okay, great.
9
00:01:02,921 --> 00:01:07,760
So we're also starting to work on midterm grading.
We'll get those back to you as soon as we can.
10
00:01:07,761 --> 00:01:14,078
We're also starting to work on grading assignment two so there's
a lot of grading being done this week. The TAs are pretty busy.
11
00:01:14,079 --> 00:01:18,479
Also a reminder for you guys, hopefully you've been
working hard on your projects now that most of you
12
00:01:18,479 --> 00:01:26,969
are done with the midterm so your project milestones will be due on
Tuesday so any sort of last minute changes that you had in your projects,
13
00:01:26,970 --> 00:01:31,650
I know some people decided to switch projects after
the proposal, some teams reshuffled a little bit,
14
00:01:31,650 --> 00:01:39,676
that's fine but your milestone should reflect the project that you're actually
doing for the rest of the quarter. So hopefully that's going well.
15
00:01:39,677 --> 00:01:43,900
I know there's been a lot of worry and stress
on Piazza, wondering about assignment three.
16
00:01:43,900 --> 00:01:50,188
So we're working on that as hard as we can but that's actually
a bit of a new assignment, it's changing a bit from last year
17
00:01:50,189 --> 00:01:53,951
so it will be out as soon as possible,
hopefully today or tomorrow.
18
00:01:53,951 --> 00:02:01,550
Although we promise that whenever it comes out you'll have two
weeks to finish it so try not to stress out about that too much.
19
00:02:01,551 --> 00:02:05,318
But I'm pretty excited, I think assignment
three will be really cool,
20
00:02:05,318 --> 00:02:09,079
it'll cover a lot of really cool material.
21
00:02:09,079 --> 00:02:13,340
So another thing, last time in lecture we
mentioned this thing called the Train Game
22
00:02:13,340 --> 00:02:17,780
which is this really cool thing we've been working
on sort of as a side project a little bit.
23
00:02:17,780 --> 00:02:24,391
So this is an interactive tool that you guys can
go on and use to explore a little bit the process
24
00:02:24,391 --> 00:02:27,340
of tuning hyperparameters
in practice.
25
00:02:27,340 --> 00:02:33,119
so this is again totally not required for the course.
Totally optional, but if you do we will offer
26
00:02:33,119 --> 00:02:35,072
a small amount of extra
credit for those of you
27
00:02:35,072 --> 00:02:37,963
who want to do well and
participate in this.
28
00:02:37,963 --> 00:02:42,224
And we'll send out some more
details later this afternoon on Piazza.
29
00:02:42,224 --> 00:02:48,362
But just a bit of a demo for what exactly is this thing.
So you'll get to go in and we've changed the name
30
00:02:48,362 --> 00:02:51,752
from Train Game to HyperQuest
because you're questing
31
00:02:51,752 --> 00:02:54,464
to find the best
hyperparameters for your model
32
00:02:54,464 --> 00:02:59,344
so this is really cool, it'll be an interactive tool that
you can use to explore the tuning of hyperparameters
33
00:02:59,344 --> 00:03:01,254
interactively in your browser.
34
00:03:01,254 --> 00:03:04,871
So you'll login with
your student ID and name.
35
00:03:04,871 --> 00:03:08,830
You'll fill out a little survey with some
of your experience on deep learning
36
00:03:08,830 --> 00:03:14,934
then you'll read some instructions. So in this
game you'll be shown some random data set
37
00:03:14,934 --> 00:03:16,152
on every trial.
38
00:03:16,152 --> 00:03:21,494
This data set might be images or it might be vectors
and your goal is to train a model by picking
39
00:03:21,494 --> 00:03:25,632
the right hyperparameters interactively to
perform as well as you can on the validation set
40
00:03:25,632 --> 00:03:28,077
of this random data set.
41
00:03:28,077 --> 00:03:31,382
And it'll sort of keep track of your performance
over time and there'll be a leaderboard,
42
00:03:31,382 --> 00:03:33,423
it'll be really cool.
43
00:03:33,423 --> 00:03:38,723
So every time you play the game, you'll
get some statistics about your data set.
44
00:03:38,723 --> 00:03:42,397
In this case we're doing a
classification problem with 10 classes.
45
00:03:43,424 --> 00:03:47,774
You can see down at the bottom you have these
statistics about the random data set; we have 10 classes.
46
00:03:47,774 --> 00:03:52,987
The input data size is three by 32 by 32 so
this is some image data set and we can see
47
00:03:52,987 --> 00:03:58,832
that in this case we have 8500 examples in the
training set and 1500 examples in the validation set.
48
00:03:58,832 --> 00:04:01,518
These are all random, they'll change
a little bit every time.
49
00:04:01,518 --> 00:04:06,912
Based on these data set statistics you'll make some choices
on your initial learning rate, your initial network size,
50
00:04:06,912 --> 00:04:08,931
and your initial dropout rate.
51
00:04:08,931 --> 00:04:13,811
Then you'll see a screen like this where it'll
run one epoch with those chosen hyperparameters,
52
00:04:13,811 --> 00:04:19,712
show you on the right here you'll see two
plots. One is your training and validation loss
53
00:04:19,712 --> 00:04:21,040
for that first epoch.
54
00:04:21,040 --> 00:04:23,409
Then you'll see your training
and validation accuracy
55
00:04:23,409 --> 00:04:30,759
for that first epoch and based on the gaps that you see in these two graphs you
can make choices interactively to change the learning rates and hyperparameters
56
00:04:30,759 --> 00:04:32,290
for the next epoch.
57
00:04:32,290 --> 00:04:37,803
So then you can either choose to continue training
with the current or changed hyperparameters,
58
00:04:37,803 --> 00:04:41,523
you can also stop training, or you can
revert to go back to the previous checkpoint
59
00:04:41,523 --> 00:04:43,872
in case things got really messed up.
60
00:04:43,872 --> 00:04:48,691
So then you'll get to make some choice,
so here we'll decide to continue training
61
00:04:48,691 --> 00:04:51,347
and in this case you could
go and set new learning rates
62
00:04:51,347 --> 00:04:54,971
and new hyperparameters for
the next epoch of training.
63
00:04:54,971 --> 00:04:59,808
You can also, kind of interesting here, you
can actually grow the network interactively
64
00:04:59,808 --> 00:05:01,899
during training in this demo.
65
00:05:01,899 --> 00:05:07,562
There's this cool trick from a couple recent
papers where you can either take existing layers
66
00:05:07,562 --> 00:05:12,083
and make them wider or add new layers to the network
in the middle of training while still maintaining
67
00:05:12,083 --> 00:05:15,762
the same function in the
network so you can do that
68
00:05:15,762 --> 00:05:20,131
to increase the size of your network in the
middle of training here which is kind of cool.
69
00:05:20,131 --> 00:05:24,430
So then you'll make choices over several epochs
and eventually your final validation accuracy
70
00:05:24,430 --> 00:05:26,811
will be recorded and we'll
have some leaderboard
71
00:05:26,811 --> 00:05:29,912
that compares your score on that data set
72
00:05:29,912 --> 00:05:33,072
to some simple baseline models.
73
00:05:33,072 --> 00:05:37,534
And depending on how well you do on this leaderboard
we'll again offer some small amounts of extra credit
74
00:05:37,534 --> 00:05:39,774
for those of you who
choose to participate.
75
00:05:39,774 --> 00:05:42,322
So this is again, totally
optional, but I think
76
00:05:42,322 --> 00:05:46,936
it can be a really cool learning experience for you guys
to play around with and explore how hyperparameters
77
00:05:46,936 --> 00:05:49,243
affect the learning process.
78
00:05:49,243 --> 00:05:54,872
Also, it's really useful for us. You'll help
science out by participating in this experiment.
79
00:05:54,872 --> 00:06:02,101
We're pretty interested in seeing how people behave when
they train neural networks so you'll be helping us out
80
00:06:02,101 --> 00:06:04,422
as well if you decide to play this.
81
00:06:04,422 --> 00:06:08,462
But again, totally optional, up to you.
82
00:06:08,462 --> 00:06:10,295
Any questions on that?
83
00:06:15,080 --> 00:06:18,680
Hopefully at some point.
So the question was, will this be a paper
84
00:06:18,680 --> 00:06:20,272
or whatever eventually?
85
00:06:20,272 --> 00:06:26,760
Hopefully but it's really early stages of this
project so I can't make any promises but I hope so.
86
00:06:26,760 --> 00:06:29,510
But I think it'll be really cool.
87
00:06:33,240 --> 00:06:35,000
[laughing]
88
00:06:35,000 --> 00:06:37,971
Yeah, so the question is how can
you add layers during training?
89
00:06:37,971 --> 00:06:43,552
I don't really want to get into that right now, but
the paper to read is Net2Net; Ian Goodfellow is
90
00:06:43,552 --> 00:06:45,291
one of the authors. And
there's another paper
91
00:06:45,291 --> 00:06:48,240
from Microsoft called Network Morphism.
92
00:06:48,240 --> 00:06:52,407
So if you read those two papers
you can see how this works.
93
00:06:53,680 --> 00:06:58,152
Okay, so last time, a bit of a reminder
before we had the midterm last time we talked
94
00:06:58,152 --> 00:06:59,792
about recurrent neural networks.
95
00:06:59,792 --> 00:07:03,032
We saw that recurrent neural networks can
be used for different types of problems.
96
00:07:03,032 --> 00:07:07,192
In addition to one to one we can do one
to many, many to one, many to many.
97
00:07:07,192 --> 00:07:10,679
We saw how this can apply
to language modeling
98
00:07:10,679 --> 00:07:15,460
and we saw some cool examples of applying neural networks to
model different sorts of languages at the character level
99
00:07:15,460 --> 00:07:20,571
and we sampled these artificial math
and Shakespeare and C source code.
100
00:07:20,571 --> 00:07:26,560
We also saw how similar things could be applied to
image captioning by connecting a CNN feature extractor
101
00:07:26,560 --> 00:07:28,491
together with an RNN language model.
102
00:07:28,491 --> 00:07:31,011
And we saw some really
cool examples of that.
103
00:07:31,011 --> 00:07:36,040
We also talked about the different types of
RNN's. We talked about this Vanilla RNN.
104
00:07:36,040 --> 00:07:40,158
I also want to mention that this is sometimes
called a Simple RNN or an Elman RNN so you'll see
105
00:07:40,158 --> 00:07:42,331
all of these different
terms in the literature.
106
00:07:42,331 --> 00:07:44,997
We also talked about the Long
Short Term Memory or LSTM.
107
00:07:44,997 --> 00:07:50,102
And we talked about how the LSTM
has this crazy set of equations
108
00:07:50,102 --> 00:07:53,021
but it makes sense because it
helps improve gradient flow
109
00:07:53,021 --> 00:07:56,022
during backpropagation
and helps this thing model
110
00:07:56,022 --> 00:07:59,443
longer-term dependencies
in our sequences.
111
00:07:59,443 --> 00:08:03,982
So today we're going to switch gears and talk
about a whole bunch of different exciting tasks.
112
00:08:03,982 --> 00:08:08,992
So far we've been talking
mostly about the image classification problem.
113
00:08:08,992 --> 00:08:13,262
Today we're going to talk about various types of other
computer vision tasks where you actually want to go in
114
00:08:13,262 --> 00:08:19,542
and say things about the spatial pixels inside your images
so we'll see segmentation, localization, detection,
115
00:08:19,542 --> 00:08:21,942
a couple other different
computer vision tasks
116
00:08:21,942 --> 00:08:25,494
and how you can approach these
with convolutional neural networks.
117
00:08:25,494 --> 00:08:29,552
So as a bit of a refresher, so far the main
thing we've been talking about in this class
118
00:08:29,552 --> 00:08:32,163
is image classification so
here we're going to have
119
00:08:32,163 --> 00:08:34,842
some input image come in.
That input image will go through
120
00:08:34,842 --> 00:08:36,583
some deep convolutional network,
121
00:08:36,583 --> 00:08:42,991
that network will give us some feature vector of
maybe 4096 dimensions in the case of AlexNet or VGG
122
00:08:42,991 --> 00:08:46,222
and then from that final feature vector
we'll have
123
00:08:46,222 --> 00:08:47,750
some final fully-connected layer
124
00:08:47,750 --> 00:08:50,568
that gives us 1000 numbers
for the different class scores
125
00:08:50,568 --> 00:08:55,660
that we care about where 1000 is maybe the
number of classes in ImageNet in this example.
126
00:08:55,660 --> 00:08:59,080
And then at the end of the day
what the network does is we input an image
127
00:08:59,080 --> 00:09:01,437
and then we output a single category label
128
00:09:01,437 --> 00:09:05,083
saying what is the content of
this entire image as a whole.
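[Editor's sketch: the pipeline just summarized (image, conv feature extractor, roughly 4096-dimensional feature vector, final fully-connected layer, 1000 class scores) can be reduced to a shape sketch. This is a minimal NumPy illustration, not the lecture's code; the conv stack is stubbed out with a random feature vector purely to show the shapes involved.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for the conv feature extractor: in AlexNet or VGG this would be the
# ~4096-dimensional activation of the last hidden layer for one image.
features = rng.standard_normal(4096)

# Final fully-connected layer mapping features to 1000 class scores
# (1000 being the number of ImageNet classes in this example).
W_fc = rng.standard_normal((1000, 4096)) * 0.01
b_fc = np.zeros(1000)

scores = W_fc @ features + b_fc   # one score per category
label = int(scores.argmax())      # a single label for the whole image
print(scores.shape)               # (1000,)
```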
129
00:09:05,083 --> 00:09:09,879
But this is maybe the most basic possible task
in computer vision and there's a whole bunch
130
00:09:09,879 --> 00:09:11,686
of other interesting types of tasks
131
00:09:11,686 --> 00:09:14,314
that we might want to
solve using deep learning.
132
00:09:14,314 --> 00:09:18,609
So today we're going to talk about several of these
different tasks and step through each of these
133
00:09:18,609 --> 00:09:21,515
and see how they all
work with deep learning.
134
00:09:21,515 --> 00:09:26,944
So we'll talk about these more in detail
about what each problem is as we get to it
135
00:09:26,944 --> 00:09:28,852
but this is kind of a summary slide
136
00:09:28,852 --> 00:09:31,480
that we'll talk first about
semantic segmentation.
137
00:09:31,480 --> 00:09:35,153
We'll talk about classification and localization,
then we'll talk about object detection,
138
00:09:35,153 --> 00:09:39,086
and finally a couple brief words
about instance segmentation.
139
00:09:39,967 --> 00:09:44,035
So first is the problem
of semantic segmentation.
140
00:09:44,035 --> 00:09:49,847
In the problem of semantic segmentation, we want
to input an image and then output a decision
141
00:09:49,847 --> 00:09:52,567
of a category for every
pixel in that image
142
00:09:52,567 --> 00:09:58,327
So this input image, for example,
is this cat walking through a field; he's very cute.
143
00:09:58,327 --> 00:10:04,517
And in the output we want to say for every pixel
is that pixel a cat or grass or sky or trees
144
00:10:04,517 --> 00:10:07,701
or background or some
other set of categories.
145
00:10:07,701 --> 00:10:11,922
So we're going to have some set of categories
just like we did in the image classification case
146
00:10:11,922 --> 00:10:15,820
but now rather than assigning a single category
label to the entire image, we want to produce
147
00:10:15,820 --> 00:10:19,569
a category label for each
pixel of the input image.
148
00:10:19,569 --> 00:10:22,674
And this is called semantic segmentation.
149
00:10:22,674 --> 00:10:27,340
So one interesting thing about semantic segmentation
is that it does not differentiate instances
150
00:10:27,340 --> 00:10:31,523
so in this example on the right we have this image
with two cows where they're standing right next
151
00:10:31,523 --> 00:10:36,859
to each other and when we're talking about semantic
segmentation we're just labeling all the pixels
152
00:10:36,859 --> 00:10:39,741
independently for what is
the category of that pixel.
153
00:10:39,741 --> 00:10:44,510
So in a case like this where we have two cows
right next to each other, the output does not
154
00:10:44,510 --> 00:10:46,840
distinguish
155
00:10:46,840 --> 00:10:48,309
between these two cows.
156
00:10:48,309 --> 00:10:51,782
Instead we just get a whole mass of pixels
that are all labeled as cow.
157
00:10:51,782 --> 00:10:56,625
So this is a bit of a shortcoming of semantic
segmentation and we'll see how we can fix this later
158
00:10:56,625 --> 00:10:58,910
when we move to instance segmentation.
159
00:10:58,910 --> 00:11:02,882
But at least for now we'll just talk about
semantic segmentation first.
160
00:11:04,437 --> 00:11:09,340
So one potential approach for attacking
161
00:11:09,340 --> 00:11:12,544
semantic segmentation might
be through classification.
162
00:11:12,544 --> 00:11:17,755
So you could use the idea of a
sliding-window approach to semantic segmentation.
163
00:11:17,755 --> 00:11:24,315
So you might imagine that we take our input image and
we break it up into many many small, tiny local crops
164
00:11:24,315 --> 00:11:27,763
of the image so in this
example we've taken
165
00:11:27,763 --> 00:11:31,310
maybe three crops from
around the head of this cow
166
00:11:31,310 --> 00:11:36,564
and then you could imagine taking each of those crops
and now treating this as a classification problem.
167
00:11:36,564 --> 00:11:41,246
Saying for this crop, what is the category
of the central pixel of the crop?
168
00:11:41,246 --> 00:11:46,752
And then we could use all the same machinery that
we've developed for classifying entire images
169
00:11:46,752 --> 00:11:48,760
but now just apply it on crops rather than
170
00:11:48,760 --> 00:11:51,083
on the entire image.
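[Editor's sketch: the sliding-window scheme described above, as a toy NumPy illustration. `classify_crop` is a hypothetical stand-in for a trained CNN classifier; the point is the structure of the loop, with one full classifier evaluation per pixel.]

```python
import numpy as np

K = 5  # crop size (odd); a hypothetical choice for illustration

def classify_crop(crop):
    # Stand-in for a trained CNN classifier run on one crop.
    # Here it just returns the value of the central pixel as a fake class.
    return int(crop[K // 2, K // 2])

def sliding_window_segment(image):
    """Label every pixel by classifying the KxK crop centered on it."""
    H, W = image.shape
    pad = K // 2
    padded = np.pad(image, pad, mode="edge")
    labels = np.zeros((H, W), dtype=int)
    for i in range(H):
        for j in range(W):
            crop = padded[i:i + K, j:j + K]
            labels[i, j] = classify_crop(crop)  # one forward pass per pixel!
    return labels

img = np.arange(16).reshape(4, 4) % 3   # tiny fake image with 3 "classes"
seg = sliding_window_segment(img)
print(seg.shape)  # (4, 4): one label per pixel
```

[Even this toy 4x4 grid needs 16 separate classifier evaluations, which hints at the cost problem for real images.]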
171
00:11:51,083 --> 00:11:56,601
And this would probably work to some extent
but it's probably not a very good idea.
172
00:11:56,601 --> 00:12:02,498
So this would end up being super super
computationally expensive because we want to label
173
00:12:02,498 --> 00:12:07,319
every pixel in the image, we would need a separate
crop for every pixel in that image and this would be
174
00:12:07,319 --> 00:12:09,407
super super expensive to
run forward and backward
175
00:12:09,407 --> 00:12:10,910
passes through.
176
00:12:10,910 --> 00:12:17,085
And moreover, we're actually, if you think about this
we can actually share computation between different
177
00:12:17,085 --> 00:12:20,476
patches so if you're trying
to classify two patches
178
00:12:20,476 --> 00:12:22,950
that are right next to each
other and actually overlap
179
00:12:22,950 --> 00:12:25,509
then the convolutional
features of those patches
180
00:12:25,509 --> 00:12:30,611
will end up going through the same convolutional layers
and we can actually share a lot of the computation
181
00:12:30,611 --> 00:12:32,644
when applying this to separate passes
182
00:12:32,644 --> 00:12:34,742
or when applying this type of approach
183
00:12:34,742 --> 00:12:37,194
to separate patches in the image.
184
00:12:37,194 --> 00:12:41,896
So this is actually a terrible idea and nobody
does this and you should probably not do this
185
00:12:41,896 --> 00:12:48,683
but it's at least the first thing you might think of if
you were trying to think about semantic segmentation.
186
00:12:48,683 --> 00:12:53,372
Then the next idea that works a bit better is
this idea of a fully convolutional network.
187
00:12:53,372 --> 00:12:58,305
So rather than extracting individual patches from the
image and classifying these patches independently,
188
00:12:58,305 --> 00:13:03,604
we can imagine just having our network be a whole giant
stack of convolutional layers with no fully connected
189
00:13:03,604 --> 00:13:06,501
layers or anything so in this
case we just have a bunch
190
00:13:06,501 --> 00:13:12,633
of convolutional layers that are all maybe three
by three with zero padding or something like that
191
00:13:12,633 --> 00:13:15,422
so that each convolutional
layer preserves the spatial size
192
00:13:15,422 --> 00:13:17,843
of the input and now if we pass our image
193
00:13:17,843 --> 00:13:20,605
through a whole stack of
these convolutional layers,
194
00:13:20,605 --> 00:13:27,184
then the final convolutional layer could just
output a tensor of size C by H by W
195
00:13:27,184 --> 00:13:29,622
where C is the number of
categories that we care about
196
00:13:29,622 --> 00:13:34,734
and you could see this tensor as just giving
our classification scores for every pixel
197
00:13:34,734 --> 00:13:38,127
in the input image at every
location in the input image.
198
00:13:38,127 --> 00:13:43,014
And we could compute this all at once with
just some giant stack of convolutional layers.
199
00:13:43,014 --> 00:13:47,216
And then you could imagine training this thing
by putting a classification loss at every pixel
200
00:13:47,216 --> 00:13:50,558
of this output, taking an
average over those pixels
201
00:13:50,558 --> 00:13:55,137
in space, and just training this kind of network
through normal, regular backpropagation.
202
00:13:55,137 --> 00:13:55,970
Question?
203
00:13:58,430 --> 00:14:01,179
Oh, the question is how do you develop
training data for this?
204
00:14:01,179 --> 00:14:04,366
It's very expensive, right.
So the training data for this would be
205
00:14:04,366 --> 00:14:06,899
we need to label every
pixel in those input images
206
00:14:06,899 --> 00:14:11,831
so there's tools that people sometimes have online
where you can go in and sort of draw contours
207
00:14:11,831 --> 00:14:14,613
around the objects and
then fill in regions
208
00:14:14,613 --> 00:14:17,604
but in general getting this kind of
training data is very expensive.