
Commit
Update 2016-8-24-The-9-Deep-Learning-Papers-You-Need-To-Know-About-(Understanding-CNNs-Part-3).html
adeshpande3 authored Jun 30, 2017
1 parent 0e84f85 commit 7648aa2
Showing 1 changed file with 1 addition and 1 deletion.
@@ -165,7 +165,7 @@ <h2><span style="text-decoration: underline;"><a href="https://arxiv.org/pdf/141
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The interesting idea for me was that of using these seemingly different RNN and CNN models to create a very useful application that in a way combines the fields of Computer Vision and Natural Language Processing. It opens the door for new ideas in terms of how to make computers and models smarter when dealing with tasks that cross different fields.</p>
<h2><span style="text-decoration: underline;"><a href="https://arxiv.org/pdf/1506.02025.pdf" target="_blank"><strong>Spatial Transformer Networks</strong></a></span><strong> (2015)</strong></h2>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Last, but not least, let&rsquo;s get into one of the more recent papers in the field. This paper was written by a group at Google Deepmind a little over a year ago. The main contribution is the introduction of a Spatial Transformer module. The basic idea is that this module transforms the input image in a way so that the subsequent layers have an easier time making a classification. Instead of making changes to the main CNN architecture itself, the authors worry about making changes to the image <em>before </em>it is fed into the specific conv layer. The 2 things that this module hopes to correct are pose normalization (scenarios where the object is tilted or scaled) and spatial attention (bringing attention to the correct object in a crowded image). For traditional CNNs, if you wanted to make your model invariant to images with different scales and rotations, you&rsquo;d need a lot of training examples for the model to learn properly. Let&rsquo;s get into the specifics of how this transformer module helps combat that problem.</p>
- <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The entity in traditional CNN models that dealt with spatial invariance was the maxpooling layer. The intuitive reasoning behind this later was that once we know that a specific feature is in the original input volume (wherever there are high activation values), its exact location is not as important as its relative location to other features. This new spatial transformer is dynamic in that it will produce different behavior (different distortions/transformations) for each input image. It&rsquo;s not just as simple and pre-defined as a traditional maxpool. Let&rsquo;s take a look at how this transformer module works. The module consists of:</p>
+ <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The entity in traditional CNN models that dealt with spatial invariance was the maxpooling layer. The intuitive reasoning behind this layer was that once we know that a specific feature is in the original input volume (wherever there are high activation values), its exact location is not as important as its relative location to other features. This new spatial transformer is dynamic in that it will produce different behavior (different distortions/transformations) for each input image. It&rsquo;s not just as simple and pre-defined as a traditional maxpool. Let&rsquo;s take a look at how this transformer module works. The module consists of:</p>
<ul>
<li>A localization network which takes in the input volume and outputs parameters of the spatial transformation that should be applied. The parameters, or theta, can be 6 dimensional for an affine transformation.</li>
<li>The creation of a sampling grid that is the result of warping the regular grid with the affine transformation (theta) created in the localization network.</li>
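The grid-warping step described in the bullets above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `affine_grid` is a hypothetical helper, and the 6-dimensional `theta` here is assumed to come from a localization network.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Warp a regular H x W sampling grid with a 2x3 affine matrix.

    theta: 6 affine parameters (flattened 2x3 matrix), as a
    localization network might output them.
    Returns an (H, W, 2) array of (x, y) source coordinates in
    normalized [-1, 1] range.
    """
    theta = np.asarray(theta, dtype=np.float64).reshape(2, 3)
    # Regular target grid in normalized coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H),
                         np.linspace(-1, 1, W),
                         indexing="ij")
    ones = np.ones_like(xs)
    coords = np.stack([xs, ys, ones], axis=-1)  # (H, W, 3) homogeneous
    return coords @ theta.T                     # (H, W, 2) warped grid

# The identity transform leaves the sampling grid unchanged.
identity = [1, 0, 0,
            0, 1, 0]
grid = affine_grid(identity, 4, 4)
```

Each warped grid point then tells a sampler (typically bilinear) where in the input image to read the corresponding output pixel from, which is what lets the module undo scaling, rotation, or translation before the next conv layer sees the image.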
