Sik-Ho Tang | Review -- Unsupervised Feature Learning via Non-Parametric Instance Discrimination. #134

NorbertZheng opened this issue Sep 1, 2023

Sik-Ho Tang. Review — Unsupervised Feature Learning via Non-Parametric Instance Discrimination.

Overview

Figure: each image is treated as a class of its own and projected onto the unit hypersphere.

Unsupervised Feature Learning via Non-Parametric Instance Discrimination (Instance Discrimination), by UC Berkeley / ICSI, Chinese University of Hong Kong, and Amazon Rekognition. 2018 CVPR, over 1100 citations.
Unsupervised Learning, Deep Metric Learning, Self-Supervised Learning, Semi-Supervised Learning, Image Classification, Object Detection.

  • Authors start by asking a question: “Can we learn a good feature representation that captures apparent similarity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances?”
  • This intuition is formulated as a non-parametric classification problem at the instance level, and
    • noise-contrastive estimation (NCE) is used to tackle the computational challenge imposed by the large number of instance classes.

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Figure: the pipeline of the unsupervised feature learning approach.

Goal

  • A backbone CNN is used to encode each image as a feature vector, which is projected to a 128-dimensional space and L2 normalized.
  • The optimal feature embedding is learned via instance-level discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.

The goal is to learn an embedding function $v=f_{\theta}(x)$ without supervision. $f_{\theta}$ is a deep neural network with parameters $\theta$, mapping image $x$ to feature $v$.

A metric is induced over the image space for instances $x$ and $y$:
$$d_{\theta}(x, y) = \|f_{\theta}(x) - f_{\theta}(y)\|$$

A good embedding should map visually similar images closer to each other.

Each image instance is treated as a distinct class of its own and a classifier is trained to distinguish between individual instance classes.
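
As a rough sketch (not the authors' code), the embedding head described above can be written as a backbone CNN whose classifier is replaced by a 128-d projection followed by L2 normalization; the ResNet-18 backbone and names below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class InstanceEmbedder(nn.Module):
    """Backbone CNN -> 128-d projection -> L2 normalization (unit hypersphere)."""
    def __init__(self, dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        feat_dim = backbone.fc.in_features      # 512 for ResNet-18
        backbone.fc = nn.Identity()             # drop the supervised classification head
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, dim)    # project to the 128-dimensional space

    def forward(self, x):
        v = self.proj(self.backbone(x))
        return F.normalize(v, dim=1)            # enforce ||v|| = 1

# f = InstanceEmbedder()
# v = f(torch.randn(4, 3, 224, 224))            # shape (4, 128), each row has unit norm
```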

Parametric Classifier: Conventional Softmax

With $n$ images/instances, we have $n$ classes.

Under the conventional parametric softmax formulation, for image $x$ with feature $v=f_{\theta}(x)$, the probability of it being recognized as $i$-th example is:
$$P(i|v) = \frac{\exp(w_{i}^{T}v)}{\sum_{j=1}^{n}\exp(w_{j}^{T}v)}$$

where $w_{j}$ is a weight vector for class $j$, and $w_{j}^{T}v$ measures how well $v$ matches the $j$-th class, i.e. instance.
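
A toy numeric sketch of this parametric softmax (shapes and names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

n, d = 10, 128                           # n instance "classes", feature dimension d
v = F.normalize(torch.randn(d), dim=0)   # feature of the current image
W = torch.randn(n, d)                    # one weight vector w_j per instance class

logits = W @ v                           # w_j^T v for every class j
P = F.softmax(logits, dim=0)             # P(i | v) under the conventional parametric softmax
```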

Proposed Non-Parametric Softmax Classifier

A non-parametric variant of the above softmax equation replaces $w_{j}^{T}v$ with $v_{j}^{T}v$, and $||v||=1$ is enforced via an L2-normalization layer.

Then the probability $P(i|v)$ becomes:
$$P(i|v) = \frac{\exp(v_{i}^{T}v/\tau)}{\sum_{j=1}^{n}\exp(v_{j}^{T}v/\tau)}$$

where $\tau$ is a temperature parameter that controls the concentration level of the distribution (Please feel free to read Distillation for more details about temperature $\tau$). $\tau$ is important for supervised feature learning [43], and also necessary for tuning the concentration of $v$ on the unit sphere.

The learning objective is then to maximize the joint probability:
$$\prod_{i=1}^{n} P(i|f_{\theta}(x_{i}))$$

or equivalently to minimize the negative log-likelihood over the training set:
$$J(\theta) = -\sum_{i=1}^{n}\log P(i|f_{\theta}(x_{i}))$$
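
A minimal sketch of the non-parametric softmax and its negative log-likelihood, assuming unit-normalized features and a toy $n$ (names are illustrative):

```python
import torch
import torch.nn.functional as F

tau = 0.07                                   # temperature
n, d = 10, 128
V = F.normalize(torch.randn(n, d), dim=1)    # one L2-normalized feature v_j per instance
v = F.normalize(torch.randn(d), dim=0)       # feature of the current image, ||v|| = 1

logits = V @ v / tau                         # v_j^T v / tau replaces w_j^T v
P = F.softmax(logits, dim=0)                 # non-parametric P(i | v)

i = 3                                        # index of the instance v was computed from
nll = -torch.log(P[i])                       # this instance's contribution to J(theta)
```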

Getting rid of these weight vectors is important, because the learning objective then focuses entirely on the feature representation and its induced metric, which can be applied everywhere in the space and to any new instance at test time.

Also, it eliminates the need for computing and storing the gradients for $\{w_{j}\}$, making it more scalable for big data applications.

This is suitable for scenarios where the number of classes is large. Is the non-parametric classifier a kind of contrastive learning?

Learning with A Memory Bank and NCE

Memory Bank

To compute the probability $P(i|v)$, $\{v_{j}\}$ for all the images are needed. Instead of exhaustively computing these representations every time, a feature memory bank $V$ is maintained for storing them.

Separate notations are introduced for the memory bank and features forwarded from the network. Let $V=\{v_{j}\}$ be the memory bank and $f_{i}=f_{\theta}(x_{i})$ be the feature of $x_{i}$.

During each learning iteration, the representation $f_{i}$ as well as the network parameters $\theta$ are optimized via stochastic gradient descent.

Then $f_{i}$ is written back to $V$ at the corresponding instance entry: $f_{i}\to v_{i}$.

All the representations in the memory bank $V$ are initialized as unit random vectors.
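
A rough sketch of such a memory bank, assuming $n$ instances and 128-d unit features. The momentum blend in the update is a common implementation detail, not something stated in the text above, which simply writes $f_{i}\to v_{i}$:

```python
import torch
import torch.nn.functional as F

n, d = 50000, 128
V = F.normalize(torch.randn(n, d), dim=1)          # bank initialized with unit random vectors

def update_memory(V, idx, f, momentum=0.5):
    """Write fresh features f back into the bank at their instance entries idx."""
    with torch.no_grad():
        V[idx] = F.normalize(momentum * V[idx] + (1 - momentum) * f, dim=1)
    return V
```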

Noise-Contrastive Estimation (NCE)

Noise-Contrastive Estimation (NCE) is used to approximate full Softmax.

The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where each binary classification task is to discriminate between data samples and noise samples.

(NCE is originally used in NLP. Please feel free to read NCE if interested.)

Specifically, the probability that feature representation $v$ in the memory bank corresponds to the $i$-th example under the model is:
$$P(i|v) = \frac{\exp(v^{T}f_{i}/\tau)}{Z_{i}}, \quad Z_{i} = \sum_{j=1}^{n}\exp(v_{j}^{T}f_{i}/\tau)$$

where $Z_{i}$ is the normalizing constant. The noise distribution is formalized as a uniform distribution: $P_{n}=\frac{1}{n}$.

Noise samples are assumed to be $m$ times more frequent than data samples. The posterior probability of sample $i$ with feature $v$ being from the data distribution (denoted by $D=1$) is:
$$h(i, v) := P(D=1|i, v) = \frac{P(i|v)}{P(i|v) + m P_{n}(i)}$$

The approximated training objective is to minimize the negative log-posterior distribution of data and noise samples:
$$J_{NCE}(\theta) = -E_{P_{d}}\left[\log h(i, v)\right] - m \cdot E_{P_{n}}\left[\log(1 - h(i, v'))\right]$$

Here, $P_{d}$ denotes the actual data distribution. For $P_{d}$, $v$ is the feature corresponding to $x_{i}$; whereas for $P_{n}$, $v'$ is the feature from another image, randomly sampled according to noise distribution $P_{n}$.

Both $v$ and $v'$ are sampled from the non-parametric memory bank $V$.

Computing the normalizing constant $Z_{i}$ is expensive, so a Monte Carlo approximation is used:

$$Z \simeq Z_{i} \simeq n E_{j}\left[\exp(v_{j}^{T}f_{i}/\tau)\right] = \frac{n}{m}\sum_{k=1}^{m}\exp(v_{j_{k}}^{T}f_{i}/\tau)$$

where $\{j_{k}\}$ is a random subset of indices. Empirically, the approximation derived from initial batches is sufficient to work well in practice.

NCE reduces the computational complexity from O(n) to O(1) per sample.
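
A minimal sketch of this NCE objective for a single sample, treating $Z$ as a fixed constant estimated once from initial batches (as described above) and drawing $m$ noise indices uniformly; the function name and the $m$ default are illustrative:

```python
import torch

def nce_loss(f_i, i, V, Z, tau=0.07, m=4096):
    """Binary NCE loss for one sample: one data term plus m uniform noise terms."""
    n = V.size(0)
    Pn = 1.0 / n                                   # uniform noise distribution P_n = 1/n

    # data term: memory entry v_i scored against the fresh feature f_i
    p_data = torch.exp(V[i] @ f_i / tau) / Z       # P(i | v) with precomputed Z
    h_data = p_data / (p_data + m * Pn)            # posterior P(D = 1 | i, v)
    loss = -torch.log(h_data)

    # noise term: m memory entries v' drawn uniformly at random
    j = torch.randint(0, n, (m,))
    p_noise = torch.exp(V[j] @ f_i / tau) / Z
    h_noise = p_noise / (p_noise + m * Pn)
    loss = loss - torch.log(1.0 - h_noise).sum()   # approximates m * E_{P_n}[log(1 - h)]
    return loss
```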

Proximal Regularization

Figure: the effect of proximal regularization.

  • During each training epoch, each class is only visited once. Therefore, the learning process oscillates a lot due to random sampling fluctuation.

An additional term is added to encourage smoothness.

  • At current iteration $t$, the feature representation for data $x_{i}$ is computed from the network: $v_{i}^{(t)}=f_{\theta}(x_{i})$. The memory bank stores the representations from the previous iteration: $V=\{v^{(t-1)}\}$.
  • The loss function for a positive sample from $P_{d}$ is:
    $$-\log h(i, v_{i}^{(t-1)}) + \lambda \|v_{i}^{(t)} - v_{i}^{(t-1)}\|_{2}^{2}$$

As learning converges, the difference between iterations, i.e. $v_{i}^{(t)}-v_{i}^{(t-1)}$, gradually vanishes, and the augmented loss reduces to the original one.

With proximal regularization, the final objective becomes:
$$J_{NCE}(\theta) = -E_{P_{d}}\left[\log h(i, v_{i}^{(t-1)}) - \lambda \|v_{i}^{(t)} - v_{i}^{(t-1)}\|_{2}^{2}\right] - m \cdot E_{P_{n}}\left[\log(1 - h(i, v'^{(t-1)}))\right]$$

The above figure shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.
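
A small sketch of the augmented positive-sample loss, where `h_i` is the NCE posterior computed against the memory entry $v_{i}^{(t-1)}$ and `lam` stands for the hyper-parameter $\lambda$ (names are illustrative):

```python
import torch

def proximal_positive_loss(h_i, v_t, v_prev, lam):
    """-log h(i, v^{(t-1)}_i) + lambda * ||v^{(t)}_i - v^{(t-1)}_i||_2^2."""
    return -torch.log(h_i) + lam * (v_t - v_prev).pow(2).sum()
```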

Change the representation smoothly!!!

Weighted K-Nearest Neighbor Classifier

To classify test image $\hat{x}$, we first compute its feature $\hat{f}=f_{\theta}(\hat{x})$, and then compare it against the embeddings of all the images in the memory bank, using the cosine similarity $s_{i}=\cos(v_{i},\hat{f})$.

The top k nearest neighbors, denoted by $N_{k}$, would then be used to make the prediction via weighted voting.

Class $c$ then gets a total weight

$$w_{c} = \sum_{i\in N_{k}} \alpha_{i}\cdot 1(c_{i}=c)$$

where the contributing weight of neighbor $x_{i}$ depends on its similarity:

$$\alpha_{i} = \exp(s_{i}/\tau)$$

$\tau=0.07$ and $k=200$ are used in the experiments.
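
A sketch of this weighted kNN prediction over the memory bank, assuming unit-normalized features (so the cosine similarity is a dot product) and a label vector for the training images; the helper name is illustrative:

```python
import torch

def weighted_knn_predict(f_hat, V, labels, num_classes, k=200, tau=0.07):
    """Top-k neighbors in the memory bank vote with weights alpha_i = exp(s_i / tau)."""
    sims = V @ f_hat                                # s_i = cos(v_i, f_hat) for unit-norm features
    topk_sims, topk_idx = sims.topk(k)
    weights = torch.exp(topk_sims / tau)            # alpha_i
    votes = torch.zeros(num_classes)
    votes.index_add_(0, labels[topk_idx], weights)  # accumulate w_c per class
    return votes.argmax()                           # predicted class
```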
