Sik-Ho Tang | Review -- Unsupervised Feature Learning via Non-Parametric Instance Discrimination. #134

NorbertZheng opened this issue Sep 1, 2023

Sik-Ho Tang. Review — Unsupervised Feature Learning via Non-Parametric Instance Discrimination.

Overview

Figure: each image is treated as a class of its own and projected onto the unit hypersphere.

Unsupervised Feature Learning via Non-Parametric Instance Discrimination (Instance Discrimination), by UC Berkeley / ICSI, Chinese University of Hong Kong, and Amazon Rekognition. 2018 CVPR, over 1100 citations.
Unsupervised Learning, Deep Metric Learning, Self-Supervised Learning, Semi-Supervised Learning, Image Classification, Object Detection.

  • Authors start by asking a question: “Can we learn a good feature representation that captures apparent similarity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances?”
  • This intuition is formulated as a non-parametric classification problem at the instance level, and
    • noise-contrastive estimation (NCE) is used to tackle the computational challenge imposed by the large number of instance classes.

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Figure: the pipeline of the unsupervised feature learning approach.

Goal

  • A backbone CNN is used to encode each image as a feature vector, which is projected to a 128-dimensional space and L2 normalized.
  • The optimal feature embedding is learned via instance-level discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.

The goal is to learn an embedding function $v=f_{\theta}(x)$ without supervision. $f_{\theta}$ is a deep neural network with parameters $\theta$, mapping image $x$ to feature $v$.

A metric is induced over the image space for instances $x$ and $y$:
$$d_{\theta}(x, y) = \|f_{\theta}(x) - f_{\theta}(y)\|$$

A good embedding should map visually similar images closer to each other.

Each image instance is treated as a distinct class of its own and a classifier is trained to distinguish between individual instance classes.
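
As a rough sketch (not the authors' code), the embedding head described above can be written as a backbone CNN whose classifier is replaced by a 128-d projection followed by L2 normalization; the ResNet-18 backbone and names below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class InstanceEmbedder(nn.Module):
    """Backbone CNN -> 128-d projection -> L2 normalization (unit hypersphere)."""
    def __init__(self, dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        feat_dim = backbone.fc.in_features      # 512 for ResNet-18
        backbone.fc = nn.Identity()             # drop the supervised classification head
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, dim)    # project to the 128-dimensional space

    def forward(self, x):
        v = self.proj(self.backbone(x))
        return F.normalize(v, dim=1)            # enforce ||v|| = 1

# f = InstanceEmbedder()
# v = f(torch.randn(4, 3, 224, 224))            # shape (4, 128), each row has unit norm
```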

Parametric Classifier: Conventional Softmax

With $n$ images/instances, we have $n$ classes.

Under the conventional parametric softmax formulation, for image $x$ with feature $v=f_{\theta}(x)$, the probability of it being recognized as $i$-th example is:
$$P(i|v) = \frac{\exp(w_{i}^{T}v)}{\sum_{j=1}^{n}\exp(w_{j}^{T}v)}$$

where $w_{j}$ is a weight vector for class $j$, and $w_{j}^{T}v$ measures how well $v$ matches the $j$-th class, i.e. instance.
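
A toy numeric sketch of this parametric softmax (shapes and names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

n, d = 10, 128                           # n instance "classes", feature dimension d
v = F.normalize(torch.randn(d), dim=0)   # feature of the current image
W = torch.randn(n, d)                    # one weight vector w_j per instance class

logits = W @ v                           # w_j^T v for every class j
P = F.softmax(logits, dim=0)             # P(i | v) under the conventional parametric softmax
```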

Proposed Non-Parametric Softmax Classifier

A non-parametric variant of the above softmax equation replaces $w_{j}^{T}v$ with $v_{j}^{T}v$, and $||v||=1$ is enforced via an L2-normalization layer.

Then the probability $P(i|v)$ becomes:
$$P(i|v) = \frac{\exp(v_{i}^{T}v/\tau)}{\sum_{j=1}^{n}\exp(v_{j}^{T}v/\tau)}$$

where $\tau$ is a temperature parameter that controls the concentration level of the distribution (Please feel free to read Distillation for more details about temperature $\tau$). $\tau$ is important for supervised feature learning [43], and also necessary for tuning the concentration of $v$ on the unit sphere.

The learning objective is then to maximize the joint probability:
$$\prod_{i=1}^{n} P(i|f_{\theta}(x_{i}))$$

or equivalently to minimize the negative log-likelihood over the training set:
$$J(\theta) = -\sum_{i=1}^{n}\log P(i|f_{\theta}(x_{i}))$$
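
A minimal sketch of the non-parametric softmax and its negative log-likelihood, assuming unit-normalized features and a toy $n$ (names are illustrative):

```python
import torch
import torch.nn.functional as F

tau = 0.07                                   # temperature
n, d = 10, 128
V = F.normalize(torch.randn(n, d), dim=1)    # one L2-normalized feature v_j per instance
v = F.normalize(torch.randn(d), dim=0)       # feature of the current image, ||v|| = 1

logits = V @ v / tau                         # v_j^T v / tau replaces w_j^T v
P = F.softmax(logits, dim=0)                 # non-parametric P(i | v)

i = 3                                        # index of the instance v was computed from
nll = -torch.log(P[i])                       # this instance's contribution to J(theta)
```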

Getting rid of these weight vectors is important, because the learning objective then focuses entirely on the feature representation and its induced metric, which can be applied everywhere in the space and to any new instance at test time.

Also, it eliminates the need for computing and storing the gradients for $\{w_{j}\}$, making it more scalable for big data applications.

This is suitable for scenarios where the number of classes is large. Is the non-parametric classifier a kind of contrastive learning?

Learning with A Memory Bank and NCE

Memory Bank

To compute the probability $P(i|v)$, $\{v_{j}\}$ for all the images are needed. Instead of exhaustively computing these representations every time, a feature memory bank $V$ is maintained for storing them.

Separate notations are introduced for the memory bank and features forwarded from the network. Let $V=\{v_{j}\}$ be the memory bank and $f_{i}=f_{\theta}(x_{i})$ be the feature of $x_{i}$.

During each learning iteration, the representation $f_{i}$ as well as the network parameters $\theta$ are optimized via stochastic gradient descent.

Then $f_{i}$ is written back to $V$ at the corresponding instance entry: $f_{i}\to v_{i}$.

All the representations in the memory bank $V$ are initialized as unit random vectors.
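
A rough sketch of such a memory bank, assuming $n$ instances and 128-d unit features. The momentum blend in the update is a common implementation detail, not something stated in the text above, which simply writes $f_{i}\to v_{i}$:

```python
import torch
import torch.nn.functional as F

n, d = 50000, 128
V = F.normalize(torch.randn(n, d), dim=1)          # bank initialized with unit random vectors

def update_memory(V, idx, f, momentum=0.5):
    """Write fresh features f back into the bank at their instance entries idx."""
    with torch.no_grad():
        V[idx] = F.normalize(momentum * V[idx] + (1 - momentum) * f, dim=1)
    return V
```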

Noise-Contrastive Estimation (NCE)

Noise-Contrastive Estimation (NCE) is used to approximate full Softmax.

The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where each binary classification task is to discriminate between data samples and noise samples.

(NCE is originally used in NLP. Please feel free to read NCE if interested.)

Specifically, the probability that feature representation $v$ in the memory bank corresponds to the $i$-th example under the model is:
$$P(i|v) = \frac{\exp(v^{T}f_{i}/\tau)}{Z_{i}}, \quad Z_{i} = \sum_{j=1}^{n}\exp(v_{j}^{T}f_{i}/\tau)$$

where $Z_{i}$ is the normalizing constant. The noise distribution is formalized as a uniform distribution: $P_{n}=\frac{1}{n}$.

Noise samples are assumed to be $m$ times more frequent than data samples. The posterior probability of sample $i$ with feature $v$ being from the data distribution (denoted by $D=1$) is:
$$h(i, v) := P(D=1|i, v) = \frac{P(i|v)}{P(i|v) + m P_{n}(i)}$$

The approximated training objective is to minimize the negative log-posterior distribution of data and noise samples:
$$J_{NCE}(\theta) = -E_{P_{d}}\left[\log h(i, v)\right] - m \cdot E_{P_{n}}\left[\log(1 - h(i, v'))\right]$$

Here, $P_{d}$ denotes the actual data distribution. For $P_{d}$, $v$ is the feature corresponding to $x_{i}$; whereas for $P_{n}$, $v'$ is the feature from another image, randomly sampled according to noise distribution $P_{n}$.

Both $v$ and $v'$ are sampled from the non-parametric memory bank $V$.

Computing the normalizing constant $Z_{i}$ is expensive, so a Monte Carlo approximation is used:

$$Z \simeq Z_{i} \simeq n E_{j}\left[\exp(v_{j}^{T}f_{i}/\tau)\right] = \frac{n}{m}\sum_{k=1}^{m}\exp(v_{j_{k}}^{T}f_{i}/\tau)$$

where $\{j_{k}\}$ is a random subset of indices. Empirically, the approximation derived from initial batches is sufficient to work well in practice.

NCE reduces the computational complexity from O(n) to O(1) per sample.
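
A minimal sketch of this NCE objective for a single sample, treating $Z$ as a fixed constant estimated once from initial batches (as described above) and drawing $m$ noise indices uniformly; the function name and the $m$ default are illustrative:

```python
import torch

def nce_loss(f_i, i, V, Z, tau=0.07, m=4096):
    """Binary NCE loss for one sample: one data term plus m uniform noise terms."""
    n = V.size(0)
    Pn = 1.0 / n                                   # uniform noise distribution P_n = 1/n

    # data term: memory entry v_i scored against the fresh feature f_i
    p_data = torch.exp(V[i] @ f_i / tau) / Z       # P(i | v) with precomputed Z
    h_data = p_data / (p_data + m * Pn)            # posterior P(D = 1 | i, v)
    loss = -torch.log(h_data)

    # noise term: m memory entries v' drawn uniformly at random
    j = torch.randint(0, n, (m,))
    p_noise = torch.exp(V[j] @ f_i / tau) / Z
    h_noise = p_noise / (p_noise + m * Pn)
    loss = loss - torch.log(1.0 - h_noise).sum()   # approximates m * E_{P_n}[log(1 - h)]
    return loss
```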

Proximal Regularization

Figure: the effect of proximal regularization.

  • During each training epoch, each class is only visited once. Therefore, the learning process oscillates a lot due to random sampling fluctuation.

An additional term is added to encourage smoothness.

  • At current iteration $t$, the feature representation for data $x_{i}$ is computed from the network: $v_{i}^{(t)}=f_{\theta}(x_{i})$. The memory bank stores the representations from the previous iteration: $V=\{v^{(t-1)}\}$.
  • The loss function for a positive sample from $P_{d}$ is:
    $$-\log h(i, v_{i}^{(t-1)}) + \lambda \|v_{i}^{(t)} - v_{i}^{(t-1)}\|_{2}^{2}$$

As learning converges, the difference between iterations, i.e. $v_{i}^{(t)}-v_{i}^{(t-1)}$, gradually vanishes, and the augmented loss reduces to the original one.

With proximal regularization, the final objective becomes:
$$J_{NCE}(\theta) = -E_{P_{d}}\left[\log h(i, v_{i}^{(t-1)}) - \lambda \|v_{i}^{(t)} - v_{i}^{(t-1)}\|_{2}^{2}\right] - m \cdot E_{P_{n}}\left[\log(1 - h(i, v'^{(t-1)}))\right]$$

The above figure shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.
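
A small sketch of the augmented positive-sample loss, where `h_i` is the NCE posterior computed against the memory entry $v_{i}^{(t-1)}$ and `lam` stands for the hyper-parameter $\lambda$ (names are illustrative):

```python
import torch

def proximal_positive_loss(h_i, v_t, v_prev, lam):
    """-log h(i, v^{(t-1)}_i) + lambda * ||v^{(t)}_i - v^{(t-1)}_i||_2^2."""
    return -torch.log(h_i) + lam * (v_t - v_prev).pow(2).sum()
```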

Change the representation smoothly!!!

Weighted K-Nearest Neighbor Classifier

To classify test image $\hat{x}$, we first compute its feature $\hat{f}=f_{\theta}(\hat{x})$, and then compare it against the embeddings of all the images in the memory bank, using the cosine similarity $s_{i}=\cos(v_{i},\hat{f})$.

The top k nearest neighbors, denoted by $N_{k}$, would then be used to make the prediction via weighted voting.

Class $c$ then gets a total weight

$$w_{c} = \sum_{i\in N_{k}} \alpha_{i}\cdot 1(c_{i}=c)$$

where the contributing weight of neighbor $x_{i}$ depends on its similarity:

$$\alpha_{i} = \exp(s_{i}/\tau)$$

$\tau=0.07$ and $k=200$ are used in the experiments.
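
A sketch of this weighted kNN prediction over the memory bank, assuming unit-normalized features (so the cosine similarity is a dot product) and a label vector for the training images; the helper name is illustrative:

```python
import torch

def weighted_knn_predict(f_hat, V, labels, num_classes, k=200, tau=0.07):
    """Top-k neighbors in the memory bank vote with weights alpha_i = exp(s_i / tau)."""
    sims = V @ f_hat                                # s_i = cos(v_i, f_hat) for unit-norm features
    topk_sims, topk_idx = sims.topk(k)
    weights = torch.exp(topk_sims / tau)            # alpha_i
    votes = torch.zeros(num_classes)
    votes.index_add_(0, labels[topk_idx], weights)  # accumulate w_c per class
    return votes.argmax()                           # predicted class
```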
