Kernels (Arthur Gretton)

Notes by YugeTen, Aug 30, 2019.

Kernels

slides

Kernels and feature space

Why

  1. Kernel methods make it possible to separate cases like XOR that cannot be dealt with by any linear classifier;
  2. Kernel methods can control smoothness and avoid overfitting/underfitting.

What

Hilbert space: an inner product space that is complete, i.e. it contains the limits of all its Cauchy sequences.
Kernel: a dot product between features in a Hilbert space (the features can be produced by anything, e.g. a neural network; we don't care how).

New kernels from old

  1. If a and b are kernels and c >= 0 is a constant, then ca and a + b are both kernels (proof via positive definiteness).
  2. The difference of two kernels might not be a kernel: k(x, x) is a squared feature norm and cannot be negative, but the difference of two kernels can be negative there.
  3. Products of kernels are kernels.
    1, 3 --> polynomial kernels! (a polynomial expands into sums and products of kernels; see the sketch after this list)
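As a quick numerical sketch of rules 1 and 3 (my own illustration, not from the lecture): scaling, sums, elementwise products, and hence polynomial combinations of Gram matrices all stay positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 points in R^3

lin = X @ X.T                           # linear kernel k(x, y) = <x, y>
rbf = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # exp. quadratic

def min_eig(K):
    """Smallest eigenvalue of a symmetric Gram matrix (>= ~0 for a valid kernel)."""
    return np.linalg.eigvalsh(K).min()

print(min_eig(2.0 * lin))        # scaling by c >= 0: still a kernel
print(min_eig(lin + rbf))        # sum of kernels: still a kernel
print(min_eig(lin * rbf))        # elementwise product: still a kernel
print(min_eig((lin + 1.0) ** 3)) # polynomial kernel = sums and products of kernels
```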

Exponential quadratic kernel

As long as the algorithm only uses dot products, and the dot product can be written in closed form, the feature space (the space you take inner products in) can be infinite-dimensional. (If you use an NN to construct the features, the feature space is of course finite-dimensional.)
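As a concrete sketch (my own, using the standard closed form): the exponentiated quadratic kernel corresponds to an infinite-dimensional feature space, yet its dot product is a simple closed-form expression, so the features never need to be written down.

```python
import numpy as np

def exp_quadratic_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Any algorithm written purely in terms of dot products can swap in this Gram matrix.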

Positive definite functions

Kernel validity: positive definite functions! Proof via the positivity of the norm in a Hilbert space.

This condition is both sufficient and necessary: every positive definite function is the inner product in a unique (reproducing kernel) Hilbert space.

The Reproducing Kernel Hilbert Space (RKHS)

A function in the RKHS is a linear combination of (possibly infinitely many) features, which would require infinitely many coefficients. That's super inconvenient! But remember how we could avoid the infinity by taking dot products.
Solution: instead of using infinitely many coefficients, express the function through a finite number of data points (say m). The infinite linear combination of coefficients and features then becomes a finite sum of kernel evaluations. The choice of m depends on the problem you're solving -- typically in SVMs, m would be around 10% of the training points.
So in actual fact such a function is very simple:
image
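A minimal sketch of that "very simple function" (the centres and coefficients below are arbitrary illustration values): evaluating an RKHS function at x is just a finite weighted sum of kernel evaluations.

```python
import numpy as np

def rkhs_function(x, centres, alphas, sigma=1.0):
    """f(x) = sum_i alpha_i k(x_i, x) with an exponentiated quadratic kernel."""
    sq_dists = np.sum((centres - x) ** 2, axis=1)
    return alphas @ np.exp(-sq_dists / (2.0 * sigma ** 2))

centres = np.array([[0.0], [1.0], [2.5]])   # m = 3 "support" points
alphas = np.array([0.7, -1.2, 0.5])         # finite set of coefficients
print(rkhs_function(np.array([1.5]), centres, alphas))
```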

2 defining features of RKHS

  1. The reproducing property (aka kernel trick):

drawing

  2. The feature map of every point is itself a function: the kernel k(x, y) can be viewed as a function of x, with the function indexed by y (written k(·, y)). Both properties are written out below.
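Written out in standard RKHS notation (restating what the slide images show), the feature map and the reproducing property are:

```latex
\varphi(x) = k(\cdot, x), \qquad
f(x) = \langle f,\, k(\cdot, x) \rangle_{\mathcal{H}} \ \text{ for all } f \in \mathcal{H}, \qquad
k(x, y) = \langle k(\cdot, x),\, k(\cdot, y) \rangle_{\mathcal{H}}.
```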

Understanding smoothness of RKHS

Consider a Fourier Series:

drawing

We want to use Fourier series to approximate the "top hat" function:

drawing

drawing

(Here all the sine terms in the expansion of f(x) disappear, because the top hat function is symmetric while sine is antisymmetric.)

The more cosine terms you add, the closer the approximation. The approximation is a sum of features -- Fourier features -- and can be written in kernel form as:

drawing

Comparing the ordinary dot product of Fourier coefficients with the dot product in the RKHS:

drawing

We can see that the RKHS dot product divides by the kernel's Fourier coefficients, which enforces smoothness. This is because, writing the squared norm as the dot product of a function with itself, we have:

drawing

Using this principle we can see that the top hat function is not smooth enough to live in this RKHS: its coefficients decay only polynomially while the kernel's decay exponentially, so the norm blows up.
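A rough numerical sketch of this argument (my own illustration; the exact coefficient formulas are assumptions, with top-hat coefficients decaying like 1/ℓ and the kernel's decaying exponentially): the partial sums of the RKHS norm keep growing without bound.

```python
import numpy as np

ell = np.arange(1, 41)                     # Fourier frequencies
f_hat = np.sin(ell * 0.5) / (np.pi * ell)  # top-hat coefficients: ~ 1/ell (polynomial decay)
k_hat = np.exp(-0.5 * (0.3 * ell) ** 2)    # kernel coefficients: exponential decay (assumed form)

partial_norms = np.cumsum(f_hat ** 2 / k_hat)   # partial sums of ||f||_H^2
print(partial_norms[[4, 9, 19, 39]])            # blows up as more frequencies are added
```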

In short, the core property of the RKHS is that a small norm enforces smoothness.

Reproducing property:

drawing

drawing

A unique property of RKHS

drawing

The L2 Hilbert space does not have this property. It is equivalent to the positive-definiteness condition used when defining the RKHS.

MMD and GANs

slides

Maximum mean discrepancy

drawing

drawing
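A minimal sketch of the quadratic-time unbiased estimator of MMD² with an exponentiated quadratic kernel, assuming X ~ P and Y ~ Q are given as (n, d) and (m, d) arrays (variable names are mine):

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    k = lambda A, B: np.exp(-sq_dists(A, B) / (2.0 * sigma ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # drop diagonal terms so each within-sample expectation avoids i == j pairs
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = Kxy.mean()
    return term_xx + term_yy - 2.0 * term_xy
```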

Illustration

drawing

MMD as an integral probability metric

drawing

A "well behaved" (smooth) function to maximise the mean discrepency. But when two distributions are not too different, we'd have:

drawing

Which is still not too bad because P and Q are very similar.

drawing

drawing

How does MMD "maximise the mean discrepancy"?

drawing

The witness function f is optimal (the mean discrepancy is maximised) when f points in the same direction as the difference between the mean embeddings.

Divergences

Question: do we look at the difference or the ratio?

drawing

drawing

Two sample testing with MMD

Statistical testing language

drawing

Example: P and Q are different

Draw n=200 i.i.d. samples from P and Q

drawing

Repeat this process a few hundred times and collect the MMD estimates; this is the histogram you get:

drawing

It looks like a Gaussian! This can be proven: when P and Q are different, the MMD estimate is asymptotically Gaussian. (P and Q can have the same mean and different variances and it will still look Gaussian; the variance of the Gaussian depends on the choice of kernel.)

drawing

Example: P and Q are the same

drawing

Side note: notice from the histogram that the MMD estimate can be negative! Since we are using an unbiased estimator of a quantity that is zero when the two samples come from the same distribution, the estimate has zero mean and must therefore sometimes be negative.

drawing

Under the null, the (rescaled) MMD estimate converges to an infinite sum of chi-squared variables, weighted by the kernel eigenvalues and centred to have zero mean.

Summary

drawing

How to get test threshold c?

drawing

Now we permute the dataset:

drawing

And this is how we get the threshold: c is a quantile of the null distribution (cut off where the null MMD values become "large enough" to be rare).
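A hedged sketch of the permutation procedure, reusing the mmd2_unbiased function from the earlier sketch: pool the two samples, re-split them at random many times, and take a high quantile of the resulting MMD values as the threshold c.

```python
import numpy as np

def permutation_threshold(X, Y, n_perms=500, alpha=0.05, sigma=1.0, seed=0):
    """Estimate the (1 - alpha) quantile of the null distribution of MMD^2."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])                  # pool the samples (under the null: same distribution)
    n = len(X)
    null_stats = []
    for _ in range(n_perms):
        perm = rng.permutation(len(Z))     # randomly re-split the pooled data into two groups
        null_stats.append(mmd2_unbiased(Z[perm[:n]], Z[perm[n:]], sigma=sigma))
    return np.quantile(null_stats, 1.0 - alpha)

# reject H0 (P == Q) if mmd2_unbiased(X, Y, sigma) > permutation_threshold(X, Y)
```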

Choosing kernels

maximising test power == minimising false negatives

drawing

we want the blue area to be as small as possible.

drawing

Computing the second term is extremely hard: it depends on the kernel in a complicated way (eigenvalues, weighted chi-squared, etc.).
But, luckily:

drawing

So, to maximise test power, we just need to maximise the first term. (check out code)
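A hedged sketch of this kernel-selection step (my own simplification: the standard deviation of the MMD estimate is crudely estimated from data blocks here, whereas the lecture's code may use a closed-form variance estimate): pick the bandwidth that maximises MMD-hat² divided by its estimated standard deviation, reusing mmd2_unbiased from the earlier sketch.

```python
import numpy as np

def power_criterion(X, Y, sigma, n_blocks=10):
    """Crude estimate of MMD^2 / std(MMD^2): mean over data blocks divided by std over blocks."""
    xs = np.array_split(X, n_blocks)
    ys = np.array_split(Y, n_blocks)
    stats = np.array([mmd2_unbiased(xb, yb, sigma=sigma) for xb, yb in zip(xs, ys)])
    return stats.mean() / (stats.std() + 1e-8)

def select_bandwidth(X, Y, candidates=(0.1, 0.3, 1.0, 3.0, 10.0)):
    """Choose the kernel bandwidth that maximises the (approximate) test power."""
    return max(candidates, key=lambda s: power_criterion(X, Y, s))
```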

GAN: real vs. fake

Distinguishing real from fake digits:

drawing

Our kernel has a bandwidth for each pixel, so that it ignores the unimportant pixels and focuses on the important ones.

drawing

So while humans are not able to tell the difference, the statistical test is very confident.

Tutorial take-away

See tutorial and solution

In traditional t-testing we have to take multiple samples to say with confidence that two statistics are different, but with MMD we can tell from just the two samples (one from each distribution).

drawing

drawing

GAN: training

Recap: previously we talked about two different families of distances between distributions: integral probability metrics (Wasserstein, MMD) and f-divergences (KL, JS).

F-divergence as critic

Unhelpful: the following two cases both have D_JS = log 2

drawing

drawing

In practice, we:

  1. Use a variational approximation to the critic, alternating generator and critic training
  2. Add instance noise to the reference and generator observations (or a gradient penalty for the variational critic)

Wasserstein as critic

Helpful: the distance decreases as we get closer to the real distribution.

MMD as critic

Wide kernels are helpful, narrow kernels are unhelpful:

drawing

drawing

drawing

drawing

To use MMD as a GAN critic, we need features that are "image specific", because MMD itself doesn't really know what an image looks like -- so we use an NN!

drawing

How to train?
Reminder: the witness function is obtained by subtracting the kernel mean of the red points from that of the blue points

drawing

Now, when we push the inputs through the NN first, look at the witness function:

drawing

There are lots of distortions!

MMD: helpful vs. unhelpful example

If the MMD gives a powerful test (can accurately separate blue and red) it will not be a good critic. To see this, let's look at a 2D example.

drawing

The gradient when using kernels k(x, y):

drawing

kernel:

drawing

This is when MMD is helpful: there are gradients everywhere so we always know which way to optimise. Now we can have a look at the gradient when using CNN features, kernels k(h(x), h(y)):

drawing

Kernels:

drawing

This is not helpful! The red points generally get no gradient telling them where to move; the test power is too strong.

What to do with MMD? Regularisation!

We can add a gradient regulariser (an approximation of it, because the exact form is too computationally expensive at O(n^3)):

drawing

This is now a bit more helpful:

Early stages: diffuse gradients that tell you where to move your distribution

drawing

drawing

Late stages: concentrated gradient

drawing

drawing

The regulariser forces the kernels to adapt to the shape of the target distribution, so that they can move around and capture it.

Observations on GAN

To have a successful GAN, you don't use an off-the-shelf critic function; you need a data-specific gradient estimator. This is true for the MMD critic we just developed, and it's also true for the traditional GAN with KL divergence (it's not the KL or the MMD that does the work, it's the data-dependent regularisation).

DO NOT TRAIN CRITIC TO CONVERGENCE

WGAN-GP

drawing

drawing

Evaluation of GAN: inception score

Based on the classification output p(y|x) of the inception model:

drawing

The score is high when the predictive label distribution p(y|x) has low entropy and the marginal label distribution p(y) has high entropy. But it basically can't be used if your generated data doesn't match the classifier's label set: it relies on a pretrained classifier and can't be applied to new categories.
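For reference, a hedged sketch of the score itself, IS = exp(E_x[KL(p(y|x) || p(y))]), assuming `probs` holds the softmax outputs p(y|x) of a pretrained Inception classifier for the generated images (obtaining those outputs is not shown):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n, C) array of classifier outputs p(y|x) for n generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal label distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                         # high: confident and diverse labels
```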

Evaluation of GAN: Frechet inception distance (FID)

drawing

It does not go to zero even if you compute the FID between two sets of real images. Experiments found that you need around 100,000 samples for the ordering of FID scores to faithfully reflect image quality.

Evaluation of GAN: kernel inception distance (KID)

drawing
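A hedged sketch of how KID is commonly computed: an unbiased MMD² estimate between real and generated Inception feature sets, with the polynomial kernel k(x, y) = (xᵀy/d + 1)³; the feature extraction is assumed to happen elsewhere.

```python
import numpy as np

def kid(real_feats, fake_feats, degree=3):
    """Unbiased MMD^2 with a polynomial kernel between Inception feature arrays of shape (n, d)."""
    d = real_feats.shape[1]
    k = lambda A, B: (A @ B.T / d + 1.0) ** degree
    Kxx = k(real_feats, real_feats)
    Kyy = k(fake_feats, fake_feats)
    Kxy = k(real_feats, fake_feats)
    n, m = len(real_feats), len(fake_feats)
    return (
        (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
        + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
        - 2.0 * Kxy.mean()
    )
```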

Dependence testing

slides

To measure the dependence between text and images, we can take paired image-caption data as P and randomly re-paired image-caption data as Q, compute MMD(P, Q), and use it as a dependence measure.

drawing

Next question -- what kernel should we use? We can use a kernel k on the image feature space and a kernel l on the sentence feature space; the dependence statistic is then obtained by multiplying the two kernel matrices and taking the trace. If the images and captions are dependent, the trace will be large.

drawing

drawing

BUT -- this is not very interpretable, and taking the product of the two kernels seems somewhat arbitrary.
Dependence is not correlation: two variables can be dependent on each other, yet when the dependence is not linear the correlation can be very low.

drawing

drawing

But you can always find smooth transformations of the data under which the correlation between the two becomes high.

drawing

To this end, we can define the Hilbert-Schmidt independence criterion (HSIC):
take a family of smooth functions (all orthogonal to each other), compute the correlation under each, and add the squared correlations together.

drawing
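A minimal sketch of the (biased) empirical HSIC under these definitions, assuming paired feature arrays X and Y and exponentiated quadratic kernels on each space: HSIC-hat = trace(K H L H) / n², where H is the centring matrix.

```python
import numpy as np

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC between paired samples X of shape (n, dx) and Y of shape (n, dy)."""
    def gram(A, sigma):
        sq = (
            np.sum(A ** 2, axis=1)[:, None]
            + np.sum(A ** 2, axis=1)[None, :]
            - 2.0 * A @ A.T
        )
        return np.exp(-sq / (2.0 * sigma ** 2))

    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    K, L = gram(X, sigma_x), gram(Y, sigma_y)
    return float(np.trace(K @ H @ L @ H)) / n ** 2
```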
