Lecture 11 _ Detection and Segmentation.srt
1
00:00:08,691 --> 00:00:15,429
- Hello, hi. So I want to get started.
Welcome to CS 231N Lecture 11.
2
00:00:15,430 --> 00:00:23,258
Today we're going to talk about detection, segmentation, and a whole bunch
of other really exciting topics around core computer vision tasks.
3
00:00:23,259 --> 00:00:25,590
But as usual, a couple
administrative notes.
4
00:00:25,590 --> 00:00:31,358
So last time you obviously took the midterm, we
didn't have lecture, hopefully that went okay
5
00:00:31,358 --> 00:00:42,269
for all of you. We're going to work on grading the midterm this week, but as a reminder,
please don't have any public discussions about the midterm questions or answers
6
00:00:42,270 --> 00:00:48,517
until at least tomorrow because there are still some people
taking makeup midterms today and throughout the rest of the week
7
00:00:48,518 --> 00:00:53,668
so we just ask you that you refrain from
talking publicly about midterm questions.
8
00:00:56,329 --> 00:01:02,920
Why don't you wait until Monday?
[laughing] Okay, great.
9
00:01:02,921 --> 00:01:07,760
So we're also starting to work on midterm grading.
We'll get those back to you as soon as we can.
10
00:01:07,761 --> 00:01:14,078
We're also starting to work on grading assignment two so there's
a lot of grading being done this week. The TAs are pretty busy.
11
00:01:14,079 --> 00:01:18,479
Also a reminder for you guys, hopefully you've been
working hard on your projects now that most of you
12
00:01:18,479 --> 00:01:26,969
are done with the midterm so your project milestones will be due on
Tuesday so any sort of last minute changes that you had in your projects,
13
00:01:26,970 --> 00:01:31,650
I know some people decided to switch projects after
the proposal, some teams reshuffled a little bit,
14
00:01:31,650 --> 00:01:39,676
that's fine but your milestone should reflect the project that you're actually
doing for the rest of the quarter. So hopefully that's going well.
15
00:01:39,677 --> 00:01:43,900
I know there's been a lot of worry and stress
on Piazza, wondering about assignment three.
16
00:01:43,900 --> 00:01:50,188
So we're working on that as hard as we can but that's actually
a bit of a new assignment, it's changing a bit from last year
17
00:01:50,189 --> 00:01:53,951
so it will be out as soon as possible,
hopefully today or tomorrow.
18
00:01:53,951 --> 00:02:01,550
Although we promise that whenever it comes out you'll have two
weeks to finish it so try not to stress out about that too much.
19
00:02:01,551 --> 00:02:05,318
But I'm pretty excited, I think assignment
three will be really cool,
20
00:02:05,318 --> 00:02:09,079
it'll cover a lot of really cool material.
21
00:02:09,079 --> 00:02:13,340
So another thing, last time in lecture we
mentioned this thing called the Train Game
22
00:02:13,340 --> 00:02:17,780
which is this really cool thing we've been working
on sort of as a side project a little bit.
23
00:02:17,780 --> 00:02:24,391
So this is an interactive tool that you guys can
go on and use to explore a little bit the process
24
00:02:24,391 --> 00:02:27,340
of tuning hyperparameters
in practice.
25
00:02:27,340 --> 00:02:33,119
so this is again totally not required for the course.
Totally optional, but if you do we will offer
26
00:02:33,119 --> 00:02:35,072
a small amount of extra
credit for those of you
27
00:02:35,072 --> 00:02:37,963
who want to do well and
participate in this.
28
00:02:37,963 --> 00:02:42,224
And we'll send out some more
details later this afternoon on Piazza.
29
00:02:42,224 --> 00:02:48,362
But just a bit of a demo for what exactly is this thing.
So you'll get to go in and we've changed the name
30
00:02:48,362 --> 00:02:51,752
from Train Game to HyperQuest
because you're questing
31
00:02:51,752 --> 00:02:54,464
to find the best
hyperparameters for your model
32
00:02:54,464 --> 00:02:59,344
so this is really cool, it'll be an interactive tool that
you can use to explore the tuning of hyperparameters
33
00:02:59,344 --> 00:03:01,254
interactively in your browser.
34
00:03:01,254 --> 00:03:04,871
So you'll login with
your student ID and name.
35
00:03:04,871 --> 00:03:08,830
You'll fill out a little survey with some
of your experience on deep learning
36
00:03:08,830 --> 00:03:14,934
then you'll read some instructions. So in this
game you'll be shown some random data set
37
00:03:14,934 --> 00:03:16,152
on every trial.
38
00:03:16,152 --> 00:03:21,494
This data set might be images or it might be vectors
and your goal is to train a model by picking
39
00:03:21,494 --> 00:03:25,632
the right hyperparameters interactively to
perform as well as you can on the validation set
40
00:03:25,632 --> 00:03:28,077
of this random data set.
41
00:03:28,077 --> 00:03:31,382
And it'll sort of keep track of your performance
over time and there'll be a leaderboard,
42
00:03:31,382 --> 00:03:33,423
it'll be really cool.
43
00:03:33,423 --> 00:03:38,723
So every time you play the game, you'll
get some statistics about your data set.
44
00:03:38,723 --> 00:03:42,397
In this case we're doing a
classification problem with 10 classes.
45
00:03:43,424 --> 00:03:47,774
You can see down at the bottom you have these
statistics about the random data set; we have 10 classes.
46
00:03:47,774 --> 00:03:52,987
The input data size is three by 32 by 32 so
this is some image data set and we can see
47
00:03:52,987 --> 00:03:58,832
that in this case we have 8500 examples in the
training set and 1500 examples in the validation set.
48
00:03:58,832 --> 00:04:01,518
These are all random, they'll change
a little bit every time.
49
00:04:01,518 --> 00:04:06,912
Based on these data set statistics you'll make some choices
on your initial learning rate, your initial network size,
50
00:04:06,912 --> 00:04:08,931
and your initial dropout rate.
51
00:04:08,931 --> 00:04:13,811
Then you'll see a screen like this where it'll
run one epoch with those chosen hyperparameters,
52
00:04:13,811 --> 00:04:19,712
show you on the right here you'll see two
plots. One is your training and validation loss
53
00:04:19,712 --> 00:04:21,040
for that first epoch.
54
00:04:21,040 --> 00:04:23,409
Then you'll see your training
and validation accuracy
55
00:04:23,409 --> 00:04:30,759
for that first epoch and based on the gaps that you see in these two graphs you
can make choices interactively to change the learning rates and hyperparameters
56
00:04:30,759 --> 00:04:32,290
for the next epoch.
57
00:04:32,290 --> 00:04:37,803
So then you can either choose to continue training
with the current or changed hyperparameters,
58
00:04:37,803 --> 00:04:41,523
you can also stop training, or you can
revert to go back to the previous checkpoint
59
00:04:41,523 --> 00:04:43,872
in case things got really messed up.
60
00:04:43,872 --> 00:04:48,691
So then you'll get to make some choice,
so here we'll decide to continue training
61
00:04:48,691 --> 00:04:51,347
and in this case you could
go and set new learning rates
62
00:04:51,347 --> 00:04:54,971
and new hyperparameters for
the next epoch of training.
63
00:04:54,971 --> 00:04:59,808
You can also, kind of interesting here, you
can actually grow the network interactively
64
00:04:59,808 --> 00:05:01,899
during training in this demo.
65
00:05:01,899 --> 00:05:07,562
There's this cool trick from a couple recent
papers where you can either take existing layers
66
00:05:07,562 --> 00:05:12,083
and make them wider or add new layers to the network
in the middle of training while still maintaining
67
00:05:12,083 --> 00:05:15,762
the same function in the
network so you can do that
68
00:05:15,762 --> 00:05:20,131
to increase the size of your network in the
middle of training here which is kind of cool.
69
00:05:20,131 --> 00:05:24,430
So then you'll make choices over several epochs
and eventually your final validation accuracy
70
00:05:24,430 --> 00:05:26,811
will be recorded and we'll
have some leaderboard
71
00:05:26,811 --> 00:05:29,912
that compares your score on that data set
72
00:05:29,912 --> 00:05:33,072
to some simple baseline models.
73
00:05:33,072 --> 00:05:37,534
And depending on how well you do on this leaderboard
we'll again offer some small amounts of extra credit
74
00:05:37,534 --> 00:05:39,774
for those of you who
choose to participate.
75
00:05:39,774 --> 00:05:42,322
So this is again, totally
optional, but I think
76
00:05:42,322 --> 00:05:46,936
it can be a really cool learning experience for you guys
to play around with and explore how hyperparameters
77
00:05:46,936 --> 00:05:49,243
affect the learning process.
78
00:05:49,243 --> 00:05:54,872
Also, it's really useful for us. You'll help
science out by participating in this experiment.
79
00:05:54,872 --> 00:06:02,101
We're pretty interested in seeing how people behave when
they train neural networks so you'll be helping us out
80
00:06:02,101 --> 00:06:04,422
as well if you decide to play this.
81
00:06:04,422 --> 00:06:08,462
But again, totally optional, up to you.
82
00:06:08,462 --> 00:06:10,295
Any questions on that?
83
00:06:15,080 --> 00:06:18,680
Hopefully at some point.
So the question was, will this be a paper
84
00:06:18,680 --> 00:06:20,272
or whatever eventually?
85
00:06:20,272 --> 00:06:26,760
Hopefully but it's really early stages of this
project so I can't make any promises but I hope so.
86
00:06:26,760 --> 00:06:29,510
But I think it'll be really cool.
87
00:06:33,240 --> 00:06:35,000
[laughing]
88
00:06:35,000 --> 00:06:37,971
Yeah, so the question is how can
you add layers during training?
89
00:06:37,971 --> 00:06:43,552
I don't really want to get into that right now, but
the paper to read is Net2Net; Ian Goodfellow is
90
00:06:43,552 --> 00:06:45,291
one of the authors. And
there's another paper
91
00:06:45,291 --> 00:06:48,240
from Microsoft called Network Morphism.
92
00:06:48,240 --> 00:06:52,407
So if you read those two papers
you can see how this works.
93
00:06:53,680 --> 00:06:58,152
Okay, so last time, a bit of a reminder
before we had the midterm last time we talked
94
00:06:58,152 --> 00:06:59,792
about recurrent neural networks.
95
00:06:59,792 --> 00:07:03,032
We saw that recurrent neural networks can
be used for different types of problems.
96
00:07:03,032 --> 00:07:07,192
In addition to one to one we can do one
to many, many to one, many to many.
97
00:07:07,192 --> 00:07:10,679
We saw how this can apply
to language modeling
98
00:07:10,679 --> 00:07:15,460
and we saw some cool examples of applying neural networks to
model different sorts of languages at the character level
99
00:07:15,460 --> 00:07:20,571
and we sampled these artificial math
and Shakespeare and C source code.
100
00:07:20,571 --> 00:07:26,560
We also saw how similar things could be applied to
image captioning by connecting a CNN feature extractor
101
00:07:26,560 --> 00:07:28,491
together with an RNN language model.
102
00:07:28,491 --> 00:07:31,011
And we saw some really
cool examples of that.
103
00:07:31,011 --> 00:07:36,040
We also talked about the different types of
RNN's. We talked about this Vanilla RNN.
104
00:07:36,040 --> 00:07:40,158
I also want to mention that this is sometimes
called a Simple RNN or an Elman RNN so you'll see
105
00:07:40,158 --> 00:07:42,331
all of these different
terms in the literature.
106
00:07:42,331 --> 00:07:44,997
We also talked about the Long
Short Term Memory or LSTM.
107
00:07:44,997 --> 00:07:50,102
And we talked about how the LSTM
has this crazy set of equations
108
00:07:50,102 --> 00:07:53,021
but it makes sense because it
helps improve gradient flow
109
00:07:53,021 --> 00:07:56,022
during backpropagation
and helps this thing model
110
00:07:56,022 --> 00:07:59,443
longer-term dependencies
in our sequences.
111
00:07:59,443 --> 00:08:03,982
So today we're going to switch gears and talk
about a whole bunch of different exciting tasks.
112
00:08:03,982 --> 00:08:08,992
So far we've been talking
mostly about the image classification problem.
113
00:08:08,992 --> 00:08:13,262
Today we're going to talk about various types of other
computer vision tasks where you actually want to go in
114
00:08:13,262 --> 00:08:19,542
and say things about the spatial pixels inside your images
so we'll see segmentation, localization, detection,
115
00:08:19,542 --> 00:08:21,942
a couple other different
computer vision tasks
116
00:08:21,942 --> 00:08:25,494
and how you can approach these
with convolutional neural networks.
117
00:08:25,494 --> 00:08:29,552
So as a bit of a refresher, so far the main
thing we've been talking about in this class
118
00:08:29,552 --> 00:08:32,163
is image classification so
here we're going to have
119
00:08:32,163 --> 00:08:34,842
some input image come in.
That input image will go through
120
00:08:34,842 --> 00:08:36,583
some deep convolutional network,
121
00:08:36,583 --> 00:08:42,991
that network will give us some feature vector of
maybe 4096 dimensions in the case of AlexNet or VGG
122
00:08:42,991 --> 00:08:46,222
and then from that final feature vector
we'll have
123
00:08:46,222 --> 00:08:47,750
some final fully-connected layer
124
00:08:47,750 --> 00:08:50,568
that gives us 1000 numbers
for the different class scores
125
00:08:50,568 --> 00:08:55,660
that we care about where 1000 is maybe the
number of classes in ImageNet in this example.
126
00:08:55,660 --> 00:08:59,080
And then at the end of the day
what the network does is we input an image
127
00:08:59,080 --> 00:09:01,437
and then we output a single category label
128
00:09:01,437 --> 00:09:05,083
saying what is the content of
this entire image as a whole.
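[Editor's sketch: the pipeline just summarized (image, conv feature extractor, roughly 4096-dimensional feature vector, final fully-connected layer, 1000 class scores) can be reduced to a shape sketch. This is a minimal NumPy illustration, not the lecture's code; the conv stack is stubbed out with a random feature vector purely to show the shapes involved.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for the conv feature extractor: in AlexNet or VGG this would be the
# ~4096-dimensional activation of the last hidden layer for one image.
features = rng.standard_normal(4096)

# Final fully-connected layer mapping features to 1000 class scores
# (1000 being the number of ImageNet classes in this example).
W_fc = rng.standard_normal((1000, 4096)) * 0.01
b_fc = np.zeros(1000)

scores = W_fc @ features + b_fc   # one score per category
label = int(scores.argmax())      # a single label for the whole image
print(scores.shape)               # (1000,)
```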
129
00:09:05,083 --> 00:09:09,879
But this is maybe the most basic possible task
in computer vision and there's a whole bunch
130
00:09:09,879 --> 00:09:11,686
of other interesting types of tasks
131
00:09:11,686 --> 00:09:14,314
that we might want to
solve using deep learning.
132
00:09:14,314 --> 00:09:18,609
So today we're going to talk about several of these
different tasks and step through each of these
133
00:09:18,609 --> 00:09:21,515
and see how they all
work with deep learning.
134
00:09:21,515 --> 00:09:26,944
So we'll talk about these more in detail
about what each problem is as we get to it
135
00:09:26,944 --> 00:09:28,852
but this is kind of a summary slide
136
00:09:28,852 --> 00:09:31,480
that we'll talk first about
semantic segmentation.
137
00:09:31,480 --> 00:09:35,153
We'll talk about classification and localization,
then we'll talk about object detection,
138
00:09:35,153 --> 00:09:39,086
and finally a couple brief words
about instance segmentation.
139
00:09:39,967 --> 00:09:44,035
So first is the problem
of semantic segmentation.
140
00:09:44,035 --> 00:09:49,847
In the problem of semantic segmentation, we want
to input an image and then output a decision
141
00:09:49,847 --> 00:09:52,567
of a category for every
pixel in that image
142
00:09:52,567 --> 00:09:58,327
So this input image, for example,
is this cat walking through a field; he's very cute.
143
00:09:58,327 --> 00:10:04,517
And in the output we want to say for every pixel
is that pixel a cat or grass or sky or trees
144
00:10:04,517 --> 00:10:07,701
or background or some
other set of categories.
145
00:10:07,701 --> 00:10:11,922
So we're going to have some set of categories
just like we did in the image classification case
146
00:10:11,922 --> 00:10:15,820
but now rather than assigning a single category
label to the entire image, we want to produce
147
00:10:15,820 --> 00:10:19,569
a category label for each
pixel of the input image.
148
00:10:19,569 --> 00:10:22,674
And this is called semantic segmentation.
149
00:10:22,674 --> 00:10:27,340
So one interesting thing about semantic segmentation
is that it does not differentiate instances
150
00:10:27,340 --> 00:10:31,523
so in this example on the right we have this image
with two cows where they're standing right next
151
00:10:31,523 --> 00:10:36,859
to each other and when we're talking about semantic
segmentation we're just labeling all the pixels
152
00:10:36,859 --> 00:10:39,741
independently for what is
the category of that pixel.
153
00:10:39,741 --> 00:10:44,510
So in a case like this where we have two cows
right next to each other, the output does not
154
00:10:44,510 --> 00:10:46,840
distinguish
155
00:10:46,840 --> 00:10:48,309
between these two cows.
156
00:10:48,309 --> 00:10:51,782
Instead we just get a whole mass of pixels
that are all labeled as cow.
157
00:10:51,782 --> 00:10:56,625
So this is a bit of a shortcoming of semantic
segmentation and we'll see how we can fix this later
158
00:10:56,625 --> 00:10:58,910
when we move to instance segmentation.
159
00:10:58,910 --> 00:11:02,882
But at least for now we'll just talk about
semantic segmentation first.
160
00:11:04,437 --> 00:11:09,340
So one potential approach for attacking
161
00:11:09,340 --> 00:11:12,544
semantic segmentation might
be through classification.
162
00:11:12,544 --> 00:11:17,755
So you could use the idea of a
sliding-window approach to semantic segmentation.
163
00:11:17,755 --> 00:11:24,315
So you might imagine that we take our input image and
we break it up into many many small, tiny local crops
164
00:11:24,315 --> 00:11:27,763
of the image so in this
example we've taken
165
00:11:27,763 --> 00:11:31,310
maybe three crops from
around the head of this cow
166
00:11:31,310 --> 00:11:36,564
and then you could imagine taking each of those crops
and now treating this as a classification problem.
167
00:11:36,564 --> 00:11:41,246
Saying for this crop, what is the category
of the central pixel of the crop?
168
00:11:41,246 --> 00:11:46,752
And then we could use all the same machinery that
we've developed for classifying entire images
169
00:11:46,752 --> 00:11:48,760
but now just apply it on crops rather than
170
00:11:48,760 --> 00:11:51,083
on the entire image.
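[Editor's sketch: the sliding-window scheme described above, as a toy NumPy illustration. `classify_crop` is a hypothetical stand-in for a trained CNN classifier; the point is the structure of the loop, with one full classifier evaluation per pixel.]

```python
import numpy as np

K = 5  # crop size (odd); a hypothetical choice for illustration

def classify_crop(crop):
    # Stand-in for a trained CNN classifier run on one crop.
    # Here it just returns the value of the central pixel as a fake class.
    return int(crop[K // 2, K // 2])

def sliding_window_segment(image):
    """Label every pixel by classifying the KxK crop centered on it."""
    H, W = image.shape
    pad = K // 2
    padded = np.pad(image, pad, mode="edge")
    labels = np.zeros((H, W), dtype=int)
    for i in range(H):
        for j in range(W):
            crop = padded[i:i + K, j:j + K]
            labels[i, j] = classify_crop(crop)  # one forward pass per pixel!
    return labels

img = np.arange(16).reshape(4, 4) % 3   # tiny fake image with 3 "classes"
seg = sliding_window_segment(img)
print(seg.shape)  # (4, 4): one label per pixel
```

[Even this toy 4x4 grid needs 16 separate classifier evaluations, which hints at the cost problem for real images.]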
171
00:11:51,083 --> 00:11:56,601
And this would probably work to some extent
but it's probably not a very good idea.
172
00:11:56,601 --> 00:12:02,498
So this would end up being super super
computationally expensive because we want to label
173
00:12:02,498 --> 00:12:07,319
every pixel in the image, we would need a separate
crop for every pixel in that image and this would be
174
00:12:07,319 --> 00:12:09,407
super super expensive to
run forward and backward
175
00:12:09,407 --> 00:12:10,910
passes through.
176
00:12:10,910 --> 00:12:17,085
And moreover, we're actually, if you think about this
we can actually share computation between different
177
00:12:17,085 --> 00:12:20,476
patches so if you're trying
to classify two patches
178
00:12:20,476 --> 00:12:22,950
that are right next to each
other and actually overlap
179
00:12:22,950 --> 00:12:25,509
then the convolutional
features of those patches
180
00:12:25,509 --> 00:12:30,611
will end up going through the same convolutional layers
and we can actually share a lot of the computation
181
00:12:30,611 --> 00:12:32,644
when applying this to separate passes
182
00:12:32,644 --> 00:12:34,742
or when applying this type of approach
183
00:12:34,742 --> 00:12:37,194
to separate patches in the image.
184
00:12:37,194 --> 00:12:41,896
So this is actually a terrible idea and nobody
does this and you should probably not do this
185
00:12:41,896 --> 00:12:48,683
but it's at least the first thing you might think of if
you were trying to think about semantic segmentation.
186
00:12:48,683 --> 00:12:53,372
Then the next idea that works a bit better is
this idea of a fully convolutional network.
187
00:12:53,372 --> 00:12:58,305
So rather than extracting individual patches from the
image and classifying these patches independently,
188
00:12:58,305 --> 00:13:03,604
we can imagine just having our network be a whole giant
stack of convolutional layers with no fully connected
189
00:13:03,604 --> 00:13:06,501
layers or anything so in this
case we just have a bunch
190
00:13:06,501 --> 00:13:12,633
of convolutional layers that are all maybe three
by three with zero padding or something like that
191
00:13:12,633 --> 00:13:15,422
so that each convolutional
layer preserves the spatial size
192
00:13:15,422 --> 00:13:17,843
of the input and now if we pass our image
193
00:13:17,843 --> 00:13:20,605
through a whole stack of
these convolutional layers,
194
00:13:20,605 --> 00:13:27,184
then the final convolutional layer could just
output a tensor of size C by H by W
195
00:13:27,184 --> 00:13:29,622
where C is the number of
categories that we care about
196
00:13:29,622 --> 00:13:34,734
and you could see this tensor as just giving
our classification scores for every pixel
197
00:13:34,734 --> 00:13:38,127
in the input image at every
location in the input image.
198
00:13:38,127 --> 00:13:43,014
And we could compute this all at once with
just some giant stack of convolutional layers.
199
00:13:43,014 --> 00:13:47,216
And then you could imagine training this thing
by putting a classification loss at every pixel
200
00:13:47,216 --> 00:13:50,558
of this output, taking an
average over those pixels
201
00:13:50,558 --> 00:13:55,137
in space, and just training this kind of network
through normal, regular backpropagation.
202
00:13:55,137 --> 00:13:55,970
Question?
203
00:13:58,430 --> 00:14:01,179
Oh, the question is how do you develop
training data for this?
204
00:14:01,179 --> 00:14:04,366
It's very expensive, right.
So the training data for this would be
205
00:14:04,366 --> 00:14:06,899
we need to label every
pixel in those input images
206
00:14:06,899 --> 00:14:11,831
so there's tools that people sometimes have online
where you can go in and sort of draw contours
207
00:14:11,831 --> 00:14:14,613
around the objects and
then fill in regions
208
00:14:14,613 --> 00:14:17,604
but in general getting this kind of
training data is very expensive.