Lecture 9 _ CNN Architectures.srt
1
00:00:14,752 --> 00:00:21,696
- All right welcome to lecture nine. So today
we will be talking about CNN Architectures.
2
00:00:21,696 --> 00:00:27,706
And just a few administrative points before we
get started, assignment two is due Thursday.
3
00:00:27,706 --> 00:00:36,855
The midterm will be in class on Tuesday May ninth, so next week, and it will
cover material through this coming Thursday, May fourth.
4
00:00:36,855 --> 00:00:41,350
So everything up to recurrent neural
networks is going to be fair game.
5
00:00:41,350 --> 00:00:49,121
The poster session we've decided on a time, it's going to be Tuesday June
sixth from twelve to three p.m. So this is the last week of classes.
6
00:00:49,121 --> 00:00:53,828
So we have our poster session a little bit
early during the last week so that after that,
7
00:00:53,828 --> 00:01:00,132
once you guys get feedback you still have some time to
work on your final report, which will be due finals week.
8
00:01:03,325 --> 00:01:05,812
Okay, so just a quick review of last time.
9
00:01:05,812 --> 00:01:09,324
Last time we talked about different
kinds of deep learning frameworks.
10
00:01:09,324 --> 00:01:12,690
We talked about you know
PyTorch, TensorFlow, Caffe2
11
00:01:14,514 --> 00:01:18,762
and we saw that using these kinds of frameworks we
were able to easily build big computational graphs,
12
00:01:18,762 --> 00:01:25,784
for example very large neural networks and convnets, and
be able to really easily compute gradients in these graphs.
13
00:01:25,784 --> 00:01:32,415
So to compute all of the gradients for all the intermediate
variables, weights, inputs, and use that to train our models
14
00:01:32,415 --> 00:01:35,665
and to run all this efficiently on GPUs
15
00:01:37,658 --> 00:01:44,978
And we saw that for a lot of these frameworks the way this works is by working
with these modularized layers that you guys have been writing,
16
00:01:44,978 --> 00:01:49,928
in your homeworks as well, where we have
a forward pass, we have a backward pass,
17
00:01:49,928 --> 00:01:58,404
and then in our final model architecture, all we need to do then
is to just define this sequence of layers together.
18
00:01:58,404 --> 00:02:04,937
So using that we're able to very easily
build up very complex network architectures.
19
00:02:06,626 --> 00:02:14,520
So today we're going to talk about some specific kinds of CNN Architectures
that are used today in cutting edge applications and research.
20
00:02:14,520 --> 00:02:19,631
And so we'll go into depth in some of the most
commonly used architectures for these that are winners
21
00:02:19,631 --> 00:02:22,125
of ImageNet classification benchmarks.
22
00:02:22,125 --> 00:02:28,085
So in chronological order AlexNet,
VGG net, GoogLeNet, and ResNet.
23
00:02:28,085 --> 00:02:43,771
And so these we'll go into in a lot of depth. And then after that, I'll also briefly go through some other architectures that are
not as prominently used these days, but are interesting either from a historical perspective, or as recent areas of research.
24
00:02:46,822 --> 00:02:50,839
Okay, so just a quick review.
We talked a long time ago about LeNet,
25
00:02:50,839 --> 00:02:55,603
which was one of the first instantiations of a
convnet that was successfully used in practice.
26
00:02:55,603 --> 00:03:05,778
And so this was the convnet that took an input image, used conv filters, five
by five filters applied at stride one, and had a couple of conv layers,
27
00:03:05,778 --> 00:03:09,335
a few pooling layers and then some
fully connected layers at the end.
28
00:03:09,335 --> 00:03:14,320
And this fairly simple convnet was very
successfully applied to digit recognition.
29
00:03:17,030 --> 00:03:22,875
So AlexNet from 2012 which you guys have also
heard already before in previous classes,
30
00:03:22,875 --> 00:03:31,179
was the first large scale convolutional neural network
that was able to do well on the ImageNet classification
31
00:03:31,179 --> 00:03:40,611
task so in 2012 AlexNet was entered in the competition, and was able to
outperform all previous non deep learning based models by a significant margin,
32
00:03:40,611 --> 00:03:48,012
and so this was the convnet that started the
spree of convnet research and usage afterwards.
33
00:03:48,012 --> 00:03:56,427
And so the basic AlexNet convnet architecture is a conv layer
followed by a pooling layer, normalization, so conv, pool, norm,
34
00:03:58,421 --> 00:04:01,006
and then a few more conv
layers, a pooling layer,
35
00:04:01,006 --> 00:04:03,422
and then several fully
connected layers afterwards.
36
00:04:03,422 --> 00:04:09,766
So this actually looks very similar to the LeNet network
that we just saw. There's just more layers in total.
37
00:04:09,766 --> 00:04:18,387
There are five of these conv layers, and two fully connected layers
before the final fully connected layer going to the output classes.
38
00:04:21,889 --> 00:04:25,930
So let's first get a sense of the
sizes involved in the AlexNet.
39
00:04:25,930 --> 00:04:33,128
So if we look at the input to the AlexNet, this was trained
on ImageNet, with inputs of size 227 by 227 by 3 images.
40
00:04:33,128 --> 00:04:43,193
And if we look at this first layer which is a conv layer for the
AlexNet, it's 11 by 11 filters, 96 of these applied at stride 4.
41
00:04:43,193 --> 00:04:49,323
So let's just think about this for a moment.
What's the output volume size of this first layer?
42
00:04:51,788 --> 00:04:53,371
And there's a hint.
43
00:04:57,769 --> 00:05:11,441
So remember we have our input size, we have our convolutional filters, right. And we have this formula,
which is the hint over here, that gives you the size of the output dimensions after applying conv, right?
44
00:05:11,441 --> 00:05:17,632
So remember it was the full image, minus the
filter size, divided by the stride, plus one.
45
00:05:17,632 --> 00:05:26,919
So given that that's written up here for you, does anyone have
a guess at what's the final output size after this conv layer?
46
00:05:26,919 --> 00:05:29,823
[student speaks off mic]
47
00:05:29,823 --> 00:05:32,966
- So I heard 55 by 55 by 96, yep.
That's correct.
48
00:05:32,966 --> 00:05:38,113
Right so our spatial dimensions at the output are
going to be 55 in each dimension and then we have
49
00:05:38,113 --> 00:05:45,391
96 total filters so the depth after our conv layer
is going to be 96. So that's the output volume.
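(To make that arithmetic concrete, here is a minimal sketch of the output-size formula in Python, using the AlexNet numbers just discussed; the helper name is just for illustration, and padding is left out since this layer uses none.)

    # Conv output size: (input_size - filter_size) / stride + 1 (no padding here).
    def conv_output_size(input_size, filter_size, stride):
        return (input_size - filter_size) // stride + 1

    side = conv_output_size(227, 11, stride=4)   # (227 - 11) / 4 + 1 = 55
    num_filters = 96
    print(side, side, num_filters)               # output volume: 55 x 55 x 96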
50
00:05:45,391 --> 00:05:49,486
And what's the total number
of parameters in this layer?
51
00:05:49,486 --> 00:05:52,819
So remember we have 96 11 by 11 filters.
52
00:05:54,851 --> 00:05:57,753
[student speaks off mic]
53
00:05:57,753 --> 00:06:00,753
- [Lecturer] 96 by 11 by 11, almost.
54
00:06:01,945 --> 00:06:05,297
So yes, so I heard another by three,
yes that's correct.
55
00:06:05,297 --> 00:06:13,632
So each of the filters is going to see through a local region
of 11 by 11 by three, right because the input depth was three.
56
00:06:13,632 --> 00:06:18,983
And so, that's each filter size,
times we have 96 of these total.
57
00:06:18,983 --> 00:06:23,150
And so there's 35K parameters
in this first layer.
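(A quick back-of-the-envelope check on that 35K figure, again with the AlexNet numbers from above; biases are left out, they would add one more parameter per filter.)

    # Each filter spans 11 x 11 x 3, since the input depth is 3, and there are 96 filters.
    params_conv1 = 96 * (11 * 11 * 3)
    print(params_conv1)    # 34848, i.e. roughly 35K (plus 96 biases if you count them)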
58
00:06:26,018 --> 00:06:30,233
Okay, so now if we look at the second layer
this is a pooling layer right and in this case
59
00:06:30,233 --> 00:06:34,004
we have three by three
filters applied at stride two.
60
00:06:34,004 --> 00:06:38,171
So what's the output volume
of this layer after pooling?
61
00:06:40,701 --> 00:06:44,868
And again we have a hint, very
similar to the last question.
62
00:06:51,251 --> 00:06:56,267
Okay, 27 by 27 by 96.
Yes that's correct.
63
00:06:57,716 --> 00:07:01,528
Right so the pooling layer is basically
going to use this formula that we had here.
64
00:07:01,528 --> 00:07:16,655
Again, because this is pooling applied at a stride of two, we're going to use the same formula to determine
the spatial dimensions, and so the spatial dimensions are going to be 27 by 27, and pooling preserves the depth.
65
00:07:16,655 --> 00:07:21,527
So we had 96 as depth as input, and it's
still going to be 96 depth at output.
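(The same formula covers the pooling layer; a minimal sketch with the numbers above.)

    # Pooling output size: (input_size - filter_size) / stride + 1, here 3x3 filters at stride 2.
    pool_side = (55 - 3) // 2 + 1
    print(pool_side)    # 27, so the output volume is 27 x 27 x 96; the depth of 96 is preserved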
66
00:07:22,825 --> 00:07:28,127
And next question. What's the
number of parameters in this layer?
67
00:07:31,446 --> 00:07:34,354
I hear some muttering.
[student answers off mic]
68
00:07:34,354 --> 00:07:36,905
- Nothing.
Okay.
69
00:07:36,905 --> 00:07:40,801
Yes, so pooling layer has no parameters,
so, kind of a trick question.
70
00:07:42,739 --> 00:07:45,272
Okay, so we can basically, yes, question?
71
00:07:45,272 --> 00:07:47,192
[student speaks off mic]
72
00:07:47,192 --> 00:07:52,180
- The question is, why are there no
parameters in the pooling layer?
73
00:07:52,180 --> 00:07:54,551
The parameters are the weights right,
that we're trying to learn.
74
00:07:54,551 --> 00:07:56,511
And so convolutional layers
have weights that we learn
75
00:07:56,511 --> 00:08:02,236
but pooling all we do is have a rule, we look
at the pooling region, and we take the max.
76
00:08:02,236 --> 00:08:05,710
So there's no parameters that are learned.
77
00:08:05,710 --> 00:08:14,250
So we can keep on doing this and you can just repeat the process and it's kind of a good
exercise to go through this and figure out the sizes, the parameters, at every layer.
78
00:08:16,473 --> 00:08:22,688
And so if you do this all the way, you can see that
this is the final architecture that you can work with.
79
00:08:22,688 --> 00:08:31,920
There's 11 by 11 filters at the beginning, then five by five and some three
by three filters. And so these are generally pretty familiar looking sizes
80
00:08:31,920 --> 00:08:39,122
that you've seen before and then at the end we have a couple of
fully connected layers of size 4096 and finally the last layer,
81
00:08:39,123 --> 00:08:41,540
is FC8 going to the softmax,
82
00:08:42,689 --> 00:08:46,356
which is going to the
1000 ImageNet classes.
83
00:08:48,039 --> 00:08:56,352
And just a couple of details about this, it was the first use of the ReLU
non-linearity that we've talked about, that's the most commonly used non-linearity.
84
00:08:56,352 --> 00:09:07,391
They used local response normalization layers basically trying to normalize the response
across neighboring channels but this is something that's not really used anymore.
85
00:09:07,391 --> 00:09:11,937
It turned out not to, other people showed
that it didn't have so much of an effect.
86
00:09:11,937 --> 00:09:21,769
There's a lot of heavy data augmentation, and so you can look in the paper for more details,
but things like flipping, jittering, color normalization, all of these things
87
00:09:21,769 --> 00:09:28,727
which you'll probably find useful for you when you're working on
your projects for example, so a lot of data augmentation here.
88
00:09:28,727 --> 00:09:32,419
They also used dropout, a batch size of 128,
89
00:09:32,419 --> 00:09:37,183
and learned with SGD with
momentum which we talked about
90
00:09:37,183 --> 00:09:42,295
in an earlier lecture, and basically just started
with a base learning rate of 1e negative 2.
91
00:09:42,295 --> 00:09:50,145
Every time it plateaus, reduce by a factor of 10 and
then just keep going until they finish training,
92
00:09:50,145 --> 00:09:59,012
and a little bit of weight decay and in the end, in order to get the best numbers
they also did an ensembling of models and so training multiple of these,
93
00:09:59,012 --> 00:10:03,162
averaging them together and this also
gives an improvement in performance.
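(As a rough illustration of that training recipe, here is a hedged PyTorch-style sketch; the momentum and weight decay values are the commonly cited AlexNet choices rather than numbers quoted in this lecture, and the model here is just a stand-in.)

    import torch

    model = torch.nn.Linear(10, 10)  # placeholder for the actual convnet
    # SGD with momentum, base learning rate 1e-2, plus a little weight decay.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=5e-4)
    # Drop the learning rate by a factor of 10 whenever the validation metric plateaus.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
    # ...inside the training loop, after each validation pass:
    # scheduler.step(val_loss)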
94
00:10:04,405 --> 00:10:08,781
And so one other thing I want to point out is
that if you look at this AlexNet diagram up here,
95
00:10:08,781 --> 00:10:15,235
it looks kind of like the normal convnet diagrams
that we've been seeing, except for one difference,
96
00:10:15,235 --> 00:10:21,937
which is that it's, you can see it's kind of split
in these two different rows or columns going across.
97
00:10:23,177 --> 00:10:32,905
And so the reason for this is mostly a historical note, so AlexNet was
trained on GTX 580 GPUs, older GPUs that only had three gigs of memory.
98
00:10:34,106 --> 00:10:37,255
So it couldn't actually fit
this entire network on here,
99
00:10:37,255 --> 00:10:41,773
and so what they ended up doing, was
they spread the network across two GPUs.
100
00:10:41,773 --> 00:10:46,455
So on each GPU you would have half of the
neurons, or half of the feature maps.
101
00:10:46,455 --> 00:10:51,730
And so for example if you look at this first
conv layer, we have 55 by 55 by 96 output,
102
00:10:54,389 --> 00:11:04,155
but if you look at this diagram carefully, you can zoom in later in the actual
paper, you can see that, it's actually only 48 depth-wise, on each GPU,
103
00:11:05,049 --> 00:11:08,593
and so they just spread it, the
feature maps, directly in half.
104
00:11:10,288 --> 00:11:17,367
And so what happens is that for most of these layers, for example conv
one, two, four and five, the connections are only with feature maps
105
00:11:17,367 --> 00:11:29,683
on the same GPU, so you would take as input half of the feature maps that were on
the same GPU as before, and you don't look at the full 96 feature maps for example.
106
00:11:29,683 --> 00:11:33,850
You just take as input the
48 in that first layer.
107
00:11:34,767 --> 00:11:47,696
And then there's a few layers, so conv three, as well as FC six, seven and eight, where the
GPUs do talk to each other, and so there's connections with all feature maps in the preceding layer.
108
00:11:47,696 --> 00:11:54,191
so there's communication across the GPUs, and each of these neurons
are then connected to the full depth of the previous input layer.
109
00:11:54,191 --> 00:11:55,627
Question.
110
00:11:55,627 --> 00:12:01,442
- [Student] It says the full simplified
AlexNet architecture. [mumbles]
111
00:12:05,583 --> 00:12:10,033
- Oh okay, so the question is why does it say
full simplified AlexNet architecture here?
112
00:12:10,033 --> 00:12:19,036
It just says that because I didn't put all the details on here, so for example
this is the full set of layers in the architecture, and the strides and so on,
113
00:12:19,036 --> 00:12:25,268
but for example the normalization layer, there's
other, these details are not written on here.
114
00:12:30,637 --> 00:12:37,849
And then just one little note, if you look at the paper and
try and write out the math and architectures and so on,
115
00:12:38,858 --> 00:12:52,721
there's a little bit of an issue: on the very first layer, if you look in the figure, they'll say 224 by 224,
but there's actually some kind of funny pattern going on, and so the numbers actually work out if you look at it as 227.
116
00:12:54,982 --> 00:13:04,261
AlexNet was the winner of the ImageNet classification benchmark in
2012, you can see that it cut the error rate by quite a large margin.
117
00:13:05,246 --> 00:13:14,193
It was the first CNN-based winner, and it was widely used as a base
architecture almost ubiquitously from then until a couple years ago.
118
00:13:15,720 --> 00:13:17,980
It's still used quite a bit.
119
00:13:17,980 --> 00:13:24,071
It's used in transfer learning for lots of different
tasks and so it was used for basically a long time,
120
00:13:24,071 --> 00:13:33,202
and it was very famous and now though there's been some more recent architectures
that have generally just had better performance and so we'll talk about these
121
00:13:33,202 --> 00:13:39,282
next and these are going to be the more common
architectures that you'll be wanting to use in practice.
122
00:13:40,853 --> 00:13:47,813
So just quickly first in 2013 the ImageNet
challenge was won by something called a ZFNet.
123
00:13:47,813 --> 00:13:48,718
Yes, question.
124
00:13:48,718 --> 00:13:52,729
[student speaks off mic]
125
00:13:52,729 --> 00:13:56,612
- So the question is, is there intuition for why AlexNet was
so much better than the ones that came before,
126
00:13:56,612 --> 00:14:04,786
deep learning convnets, [mumbles] this is just a
very different kind of approach in architecture.
127
00:14:04,786 --> 00:14:09,004
So this was the first deep learning based
approach, the first convnet that was used.
128
00:14:12,445 --> 00:14:18,298
So in 2013 the challenge was won by something called a
ZFNet [Zeiler-Fergus Net], named after the creators.
129
00:14:18,298 --> 00:14:23,749
And so this mostly was improving
hyperparameters over the AlexNet.
130
00:14:23,749 --> 00:14:35,735
It had the same number of layers, the same general structure, and they made a few changes, things like changing
the stride size, different numbers of filters, and after playing around with these hyperparameters more,
131
00:14:35,735 --> 00:14:41,369
they were able to improve the error rate.
But it's still basically the same idea.
132
00:14:41,369 --> 00:14:49,843
So in 2014 there are a couple of architectures that were now more
significantly different and made another jump in performance,
133
00:14:49,843 --> 00:14:58,178
and the main difference with these networks
first of all was much deeper networks.
134
00:14:58,178 --> 00:15:12,321
So from the eight layer network that was in 2012 and 2013, now in 2014 we had two
very close winners that were around 19 layers and 22 layers. So significantly deeper.
135
00:15:12,321 --> 00:15:16,502
And the winner of this
was GoogLeNet, from Google,
136
00:15:16,502 --> 00:15:20,176
but very close behind was
something called VGGNet
137
00:15:20,176 --> 00:15:27,421
from Oxford, and actually on the localization challenge
VGG got first place, and in some of the other tracks.
138
00:15:27,421 --> 00:15:31,958
So these were both very,
very strong networks.
139
00:15:31,958 --> 00:15:34,663
So let's first look at VGG
in a little bit more detail.
140
00:15:34,663 --> 00:15:40,818
And so the VGG network is this idea of much
deeper networks with much smaller filters.
141
00:15:40,818 --> 00:15:50,374
So they increased the number of layers from eight layers in AlexNet
right to now they had models with 16 to 19 layers in VGGNet.
142
00:15:52,290 --> 00:16:03,916
And one key thing that they did was they kept very small filters, so only three by three conv all the way,
which is basically the smallest conv filter size that is looking at a little bit of the neighboring pixels.
143
00:16:03,916 --> 00:16:11,485
And they just kept this very simple structure of three by three
convs with the periodic pooling all the way through the network.
144
00:16:11,485 --> 00:16:19,948
And this very simple, elegant network architecture was
able to get 7.3% top five error on the ImageNet challenge.
145
00:16:22,651 --> 00:16:27,442
So first the question of
why use smaller filters.
146
00:16:27,442 --> 00:16:33,371
So when we take these small filters now we have
fewer parameters and we try and stack more of them
147
00:16:33,371 --> 00:16:39,344
instead of having larger filters, have smaller filters with
more depth instead, have more of these filters instead,
148
00:16:39,344 --> 00:16:47,202
what happens is that you end up having the same effective receptive
field as if you only have one seven by seven convolutional layer.
149
00:16:47,202 --> 00:16:55,466
So here's a question, what is the effective receptive field
of three of these three by three conv layers with stride one?
150
00:16:55,466 --> 00:17:01,189
So if you were to stack three three by three conv layers
with stride one, what's the effective receptive field,
151
00:17:01,189 --> 00:17:09,754
the total area of the input, the spatial area of the input, that
a neuron at the top layer of the three layers is looking at.
152
00:17:12,313 --> 00:17:15,987
So I heard fifteen pixels,
why fifteen pixels?
153
00:17:15,987 --> 00:17:20,609
- [Student] Okay, so the
reason given was because
154
00:17:20,609 --> 00:17:27,369
they overlap-- - Okay, so the reason given was
because they overlap. So it's on the right track.
155
00:17:27,369 --> 00:17:35,668
What actually is happening though is you have to see, at the first
layer, the receptive field is going to be three by three right?
156
00:17:35,668 --> 00:17:43,193
And then at the second layer, each of these neurons in the second
layer is going to look at three by three other first layer
157
00:17:43,193 --> 00:17:51,676
filters, but the corners of these three by three have an additional
pixel on each side that they're looking at in the original input layer.
158
00:17:51,676 --> 00:17:56,423
So the second layer is actually looking at five by
five receptive field and then if you do this again,
159
00:17:56,423 --> 00:18:04,040
the third layer is looking at three by three
in the second layer but this is going to,
160
00:18:04,040 --> 00:18:06,907
if you just draw out this pyramid, be looking
at seven by seven in the input layer.
161
00:18:06,907 --> 00:18:16,026
So the effective receptive field here is going to be seven by
seven. Which is the same as one seven by seven conv layer.
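(A small sketch of that receptive-field argument, one layer at a time, for stride-one three by three convs.)

    # Each stacked stride-1 3x3 conv layer grows the receptive field by 3 - 1 = 2 pixels.
    rf = 1
    for _ in range(3):
        rf += 3 - 1
        print(rf)    # prints 3, then 5, then 7: three stacked layers see a 7 x 7 region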
162
00:18:16,026 --> 00:18:21,546
So what happens is that this has the same effective receptive
field as a seven by seven conv layer but it's deeper.
163
00:18:21,546 --> 00:18:26,201
It's able to have more non-linearities in
there, and it's also fewer parameters.
164
00:18:26,201 --> 00:18:36,536
So if you look at the total number of parameters, each of these conv filters for
the three by threes is going to have nine parameters in each conv [mumbles]
165
00:18:38,165 --> 00:18:44,648
three times three, and then times the input depth, so
three times three times C, times this total number
166
00:18:44,648 --> 00:18:51,034
of output feature maps, which is again C, as we're
going to preserve the total number of channels.
167
00:18:51,034 --> 00:19:00,165
So you get three times three, times C times C for each of these layers,
and we have three layers so it's going to be three times this number,
168
00:19:00,165 --> 00:19:07,409
compared to if you had a single seven by seven layer then you
get, by the same reasoning, seven squared times C squared.
169
00:19:07,409 --> 00:19:11,032
So you're going to have fewer
parameters total, which is nice.
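(And the parameter comparison as a quick sketch; C is just an example channel count here, and biases are ignored.)

    C = 64
    stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers: 27 * C^2 parameters
    single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer:    49 * C^2 parameters
    print(stacked_3x3 < single_7x7)     # True: the stacked version has fewer parameters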
170
00:19:15,570 --> 00:19:24,161
So now if we look at this full network here there's a lot of numbers up
here that you can go back and look at more carefully but if we look at all
171
00:19:24,161 --> 00:19:30,716
of the sizes and number of parameters the same
way that we calculated the example for AlexNet,
172
00:19:30,716 --> 00:19:32,517
this is a good exercise to go through,
173
00:19:32,517 --> 00:19:45,834
we can see that, you know, going the same way, we have a couple of these conv layers and a pooling layer, a
couple more conv layers, a pooling layer, several more conv layers, and so on. And so this just keeps going up.
174
00:19:45,834 --> 00:19:52,431
And if you counted the total number of convolutional and fully
connected layers, we're going to have 16 in this case for VGG 16,
175
00:19:52,431 --> 00:20:00,478
and then VGG 19, it's just a very similar architecture,
but with a few more conv layers in there.
176
00:20:03,021 --> 00:20:05,605
And so the total memory
usage of this network,
177
00:20:05,605 --> 00:20:17,196
so just making a forward pass through and counting up all of these numbers, so the
memory numbers here are written in terms of the total numbers, like we calculated earlier,
178
00:20:17,196 --> 00:20:23,125
and if you look at four bytes per number,
this is going to be about 100 megs per image,
179
00:20:23,125 --> 00:20:28,727
and so this is the scale of the memory usage that's
happening and this is only for a forward pass right,
180
00:20:28,727 --> 00:20:35,470
when you do a backward pass you're going to have to
store more and so this is pretty heavy memory wise.
181
00:20:35,470 --> 00:20:44,410
100 megs per image, so if you have only five gigs of total memory,
then you're only going to be able to store about 50 of these.
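(A rough sketch of where those numbers come from; the roughly 24 million activation count is the approximate VGG-16 forward-pass total from the slide, so treat it as a ballpark figure.)

    activations = 24_000_000              # approx. numbers stored in one VGG-16 forward pass
    forward_megs = activations * 4 / 1e6  # four bytes per float32 number
    print(forward_megs)                   # about 96, i.e. the ~100 megs per image quoted above
    print(int(5000 // forward_megs))      # with ~5 gigs of memory, only about 50 images fit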
182
00:20:47,300 --> 00:20:56,131
And so also the total number of parameters here we have is 138 million
parameters in this network, and this compares with 60 million for AlexNet.
183
00:20:56,131 --> 00:20:57,481
Question?
184
00:20:57,481 --> 00:21:00,898
[student speaks off mic]
185
00:21:06,204 --> 00:21:09,920
- So the question is what do we mean by deeper,
is it the number of filters, number of layers?
186
00:21:09,920 --> 00:21:14,087
So deeper in this case is
always referring to layers.
187
00:21:15,605 --> 00:21:25,216
So there are two usages of the word depth, which is confusing: one is the depth,
right, of the channels, width by height by depth, you can use the word depth here,
188
00:21:26,942 --> 00:21:34,298
but in general we talk about the depth of a network, this is going to
be the total number of layers in the network, and usually in particular
189
00:21:34,298 --> 00:21:43,368
we're counting the total number of weight layers. So the total number of layers
with trainable weights, so convolutional layers and fully connected layers.
190
00:21:43,368 --> 00:21:46,868
[student mumbles off mic]
191
00:22:00,810 --> 00:22:06,174
- Okay, so the question is, within each
layer what do the different filters mean?
192
00:22:06,174 --> 00:22:13,043
And so we talked about this back in the convnet
lecture, so you can also go back and refer to that,
193
00:22:13,043 --> 00:22:27,616
but each filter is, let's say, a three by three conv, so each filter is a set
of weights looking at a three by three by input depth volume, and this produces one feature map,
194
00:22:27,616 --> 00:22:31,954
one activation map of all the responses
of the different spatial locations.
195
00:22:31,954 --> 00:22:39,646
And then we can have as many filters as we want, right, so for
example 96, and each of these is going to produce a feature map.
196
00:22:39,646 --> 00:22:48,368
And so it's just like each filter corresponds to a different pattern that we're looking for
in the input that we convolve around and we see the responses everywhere in the input,
197
00:22:48,368 --> 00:22:56,181
we create a map of these, and then another filter we will
convolve over the image and create another map.
198
00:22:58,761 --> 00:23:00,226
Question.
199
00:23:00,226 --> 00:23:03,643
[student speaks off mic]
200
00:23:07,465 --> 00:23:16,733
- So the question is, is there intuition behind why, as you go deeper into the network,
we have more channel depth, so a larger number of filters, right, and so you can have
201
00:23:17,676 --> 00:23:21,766
any design that you want so
you don't have to do this.
202
00:23:21,766 --> 00:23:24,341
In practice you will see this
happen a lot of the times
203
00:23:24,341 --> 00:23:30,598
and one of the reasons is people try and maintain
kind of a relatively constant level of compute,
204
00:23:30,598 --> 00:23:37,991
so as you go higher up or deeper into your network,
you're usually also using basically down sampling
205
00:23:39,606 --> 00:23:45,759
and having smaller total spatial area and then so then