Kernels
slides
Kernels and feature space
Why
Kernel methods make it possible to separate XOR-like cases that cannot be dealt with by any linear classifier.
Kernel methods can control smoothness and avoid overfitting/underfitting.
What
Hilbert space: an inner product space that contains the limits of all its Cauchy sequences (i.e. it is complete).
Kernel: a dot product between features in a Hilbert space (the features can come from anywhere, e.g. an NN; we don't care how they are produced).
New kernels from old
If k_a and k_b are kernels and c >= 0 is a constant, then c*k_a and k_a + k_b are both kernels (proof via positive definiteness).
The difference of two kernels might not be a kernel: a kernel must give non-negative squared norms k(x, x) >= 0, but the difference of two kernels can violate that.
Products of kernels are kernels
Rules 1 and 3 (sums/scalings and products) give polynomial kernels, since a polynomial of kernels expands into sums and products of kernels.
Exponential quadratic kernel
As long as the algorithm only uses dot products, and those dot products can be written in closed form, the feature space (the things you take inner products between) can be infinite-dimensional (though if you use an NN to construct the features, the feature space will of course be finite-dimensional).
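As a small illustration (plain Python, hypothetical helper names), here is how new kernels can be composed from old ones, ending with the polynomial and exponentiated quadratic examples:

```python
import numpy as np

def k_linear(x, y):
    # the plain dot product is itself a kernel
    return float(np.dot(x, y))

def k_scale(c, k):
    # c * k is a kernel for any constant c >= 0
    return lambda x, y: c * k(x, y)

def k_sum(k1, k2):
    # sums of kernels are kernels
    return lambda x, y: k1(x, y) + k2(x, y)

def k_prod(k1, k2):
    # products of kernels are kernels
    return lambda x, y: k1(x, y) * k2(x, y)

# a degree-2 polynomial kernel (x.y + 1)^2, built purely from sums and products
k_poly2 = k_prod(k_sum(k_linear, lambda x, y: 1.0),
                 k_sum(k_linear, lambda x, y: 1.0))

# the exponential quadratic (Gaussian) kernel: infinitely many features,
# but the dot product has a closed form
def k_expquad(x, y, bandwidth=1.0):
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * bandwidth ** 2)))
```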
Positive definite functions
Kernel validity: kernels are positive definite functions. Proof via the non-negativity of the squared norm in the Hilbert space.
This condition is both necessary and sufficient: every positive definite function is the inner product in a unique Hilbert space.
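A quick numerical sanity check of this (a sketch only: for a valid kernel, the Gram matrix on any set of points should have no eigenvalue below zero, up to floating-point error):

```python
import numpy as np

def gram(k, X):
    # Gram matrix K[i, j] = k(X[i], X[j])
    return np.array([[k(x, y) for y in X] for x in X])

def looks_positive_definite(k, X, tol=1e-8):
    # smallest eigenvalue of the (symmetric) Gram matrix should be >= 0 numerically
    return np.linalg.eigvalsh(gram(k, X)).min() >= -tol

X = np.random.randn(50, 3)
k_gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2)
print(looks_positive_definite(k_gauss, X))                     # expected: True
print(looks_positive_definite(lambda x, y: -np.dot(x, y), X))  # expected: False, not a kernel
```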
The REPRODUCING Kernel Hilbert Space (RKHS)
You can write a function as a linear combination of infinitely many features, but then you need infinitely many coefficients, which is super inconvenient. Remember how we avoided the infinity by taking dot products. Solution: instead of using infinitely many coefficients, approximate the function using a finite number of expansion points (say m). The infinite linear combination of coefficients and features then becomes a finite sum of kernels. The choice of m depends on the problem you're solving; in SVMs, m is typically around 10% of the training points.
So in actual fact a function in the RKHS can be a very simple object, just a finite weighted sum of kernels:
2 defining features of RKHS
1. The reproducing property (a.k.a. the kernel trick):
2. The feature map of every point is a function: the kernel k(x, y) can be viewed as a function of x, with the particular function determined by y (the feature map of y is k(·, y)).
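Written out in standard notation (for $f$ in the RKHS $\mathcal{H}$ with kernel $k$):

$$
f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}},
\qquad
k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}},
$$

so the feature map of $y$ is simply $\varphi(y) = k(\cdot, y)$.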
Understanding smoothness of RKHS
Consider a Fourier Series:
We want to use Fourier series to approximate the "top hat" function:
(Here, in the basis expansion of f(x), all the sine terms disappear because the top hat is symmetric (even), while sine is antisymmetric (odd).)
The more cosine terms you add, the closer the approximation. The approximation is a sum of features (Fourier features), which can be written in kernel form as:
Comparing the plain dot product of Fourier coefficients with the inner product in the Hilbert space (RKHS):
We can see that the RKHS version has the kernel's Fourier coefficients in the denominator, and the resulting squared norm is what enforces smoothness. This is because if we take the inner product of a function (e.g. the kernel itself) with itself, we have:
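Written out (a sketch in standard notation, assuming a translation-invariant kernel with Fourier coefficients $\hat{k}_\ell$ and a function with coefficients $\hat{f}_\ell$):

$$
\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell} \frac{\hat{f}_\ell\, \hat{g}_\ell}{\hat{k}_\ell},
\qquad
\|f\|_{\mathcal{H}}^{2} = \sum_{\ell} \frac{\hat{f}_\ell^{2}}{\hat{k}_\ell},
$$

so any energy that $f$ puts at frequencies where $\hat{k}_\ell$ is tiny gets heavily penalised.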
Using this principle we can determine that the top hat function is not smooth enough for this RKHS: its Fourier coefficients (the numerator) decay polynomially while the kernel's (the denominator) decay exponentially, so the norm blows up and the top hat is not in the RKHS.
In short, the core property of an RKHS is that smoothness is enforced by keeping the RKHS norm small.
Reproducing property:
A unique property of RKHS
The L2 Hilbert space does not have this property. The reproducing property is equivalent to the positive definiteness condition used when defining the RKHS.
A "well behaved" (smooth) function to maximise the mean discrepency. But when two distributions are not too different, we'd have:
Which is still not too bad because P and Q are very similar.
How does MMD "maximise the mean discrepancy"?
The objective is maximised when f points in the same direction as the mean discrepancy (the difference between the two mean embeddings); this optimal f is the witness function.
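Written out (the standard integral-probability-metric form of MMD):

$$
\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \Big[ \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \Big] = \|\mu_P - \mu_Q\|_{\mathcal{H}},
$$

where $\mu_P = \mathbb{E}_{X \sim P}\, k(\cdot, X)$ is the mean embedding; the optimal $f$ (the witness) is proportional to $\mu_P - \mu_Q$.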
Divergences
Question: do we look at the difference or the ratio?
Two sample testing with MMD
Statistical testing language
Example: P and Q are different
Draw n=200 i.i.d. samples from P and Q
Repeat this process 300 times to get 150 MMD estimates; this is the histogram you get for the MMD:
Looks like a Gaussian! This can be proven: when P and Q are different, the MMD estimate is asymptotically Gaussian. (P and Q can have the same mean and different variances; it will still look like a Gaussian. The variance of this Gaussian depends on the choice of kernel.)
Example: P and Q are the same
Side note: notice from above that the estimated MMD can be negative! Because you're estimating the discrepancy between two samples from the same distribution, and the MMD estimator is unbiased with a true value of zero, the estimate must fluctuate around zero and is therefore sometimes negative.
The null distribution is an infinite sum of chi-squared variables (weighted by the kernel eigenvalues), centred to have zero mean.
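A minimal sketch of the statistic being histogrammed here: the unbiased MMD² estimator with an exponentiated quadratic kernel (X and Y are (n, d) arrays; the bandwidth is a free choice):

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), computed for all pairs
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    # unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q (can be negative)
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    # drop the diagonal terms of the within-sample averages for unbiasedness
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```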
Summary
How to get test threshold c?
Now we permute the dataset:
And this is how we get the threshold c: it's a quantile (e.g. the 95th percentile) of the null distribution estimated from the permuted MMDs, i.e. the cut-off beyond which an MMD is no longer plausible under the null.
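A sketch of that permutation procedure, reusing mmd2_unbiased from the sketch above (n_perms and alpha are arbitrary choices here):

```python
import numpy as np

def permutation_threshold(X, Y, bandwidth=1.0, n_perms=500, alpha=0.05, rng=None):
    # estimate the threshold c as the (1 - alpha) quantile of the permutation null
    rng = np.random.default_rng() if rng is None else rng
    Z = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(n_perms):
        perm = rng.permutation(len(Z))
        null_stats.append(mmd2_unbiased(Z[perm[:m]], Z[perm[m:]], bandwidth))
    return np.quantile(null_stats, 1.0 - alpha)

# reject H0 (P = Q) if mmd2_unbiased(X, Y, bw) > permutation_threshold(X, Y, bw)
```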
Choosing kernels
maximising test power == minimising false negatives
we want the blue area to be as small as possible.
Computing the second term is extremely hard: it's a function of our kernel in an extremely complicated way (eigenvalues, the chi-squared null distribution, and so on).
But, luckily:
So, to maximise test power, we just need to maximise the first term. (check out code)
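The actual criterion maximises the ratio of MMD² to its standard deviation under the alternative. As a rough stand-in only, this sketch estimates that ratio by subsampling (reusing mmd2_unbiased from the earlier sketch) and grid-searches the bandwidth; in a real test the kernel must be chosen on data held out from the final test, and the candidate bandwidths and n_splits here are arbitrary:

```python
import numpy as np

def choose_bandwidth(X, Y, candidates, n_splits=20, eps=1e-8, rng=None):
    # pick the bandwidth maximising a crude power proxy: mean(MMD^2) / std(MMD^2)
    rng = np.random.default_rng() if rng is None else rng
    best_bw, best_score = None, -np.inf
    for bw in candidates:
        stats = []
        for _ in range(n_splits):
            ix = rng.choice(len(X), size=len(X) // 2, replace=False)
            iy = rng.choice(len(Y), size=len(Y) // 2, replace=False)
            stats.append(mmd2_unbiased(X[ix], Y[iy], bw))
        score = np.mean(stats) / (np.std(stats) + eps)
        if score > best_score:
            best_bw, best_score = bw, score
    return best_bw
```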
GAN: real vs. fake
Distinguishing real digits from fake ones:
Our kernel has a bandwidth for each pixel, so that it ignores the unimportant pixels and focuses on the important ones.
So while humans are not able to distinguish the differences, statistical tests are very confident.
Tutorial take-away
See tutorial and solution
In a traditional t-test we have to take many samples to say with confidence that two statistics are different, but with MMD we can tell with only two samples (one set drawn from each distribution).
GAN: training
Recap: previously we talked about two different families of distances between distributions: integral probability metrics (Wasserstein, MMD) and f-divergences (KL).
F-divergence as critic
Unhelpful: the following two cases both have D_JS = log 2
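Why log 2: write $M = \tfrac{1}{2}(P + Q)$. When $P$ and $Q$ have disjoint supports, $M = P/2$ on the support of $P$ (and $Q/2$ on the support of $Q$), so

$$
D_{\mathrm{JS}}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M)
= \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2 = \log 2,
$$

no matter how far apart the two distributions are, which is why it gives no useful gradient.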
In practice, we:
Use variational approximation to the critic, alternate generator and critic training
Add instance noise to the reference and generator observations (or a gradient penalty for the variational critic)
Wasserstein as critic
Helpful, the distance decreases when we get closer to the real distribution
MMD as critic
Wide kernels are helpful, narrow kernels are unhelpful:
To use MMD as a GAN critic, we need features that are "image specific", because MMD itself doesn't really know what an image looks like -- so we use NN!
How to train?
Reminder: the witness function is obtained by subtracting the mean kernel embedding of the red points from that of the blue points.
So now, when we push the inputs through the NN first, look at the witness function:
There are lots of distortions!
MMD: helpful vs. unhelpful example
If the MMD gives a powerful test (can accurately separate blue and red) it will not be a good critic. To see this, let's look at a 2D example.
The gradient when using kernels k(x, y):
kernel:
This is when MMD is helpful: there are gradients everywhere so we always know which way to optimise. Now we can have a look at the gradient when using CNN features, kernels k(h(x), h(y)):
Kernels:
This is not helpful! The red points mostly get no gradient telling them where to move; the test power is too strong.
What to do with MMD? Regularisation!
We can add a gradient regulariser (an approximation of it because the exact form is too computationally expensive O(n^3)):
This is now a bit more helpful:
Early stages: diffused gradient, tells you where to move your distribution
Late stages: concentrated gradient
The regulariser forces the kernels to adapt to the shape of the target distribution, so that they can move around and capture it.
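As a rough PyTorch-style sketch of the idea (this is a WGAN-GP-style penalty on the kernel witness computed on NN features h, not the exact regulariser from the lecture; inputs are assumed to be flattened vectors, with real and fake batches the same size):

```python
import torch

def rbf(A, B, bandwidth=1.0):
    # exponentiated quadratic kernel matrix between feature batches A and B
    return torch.exp(-torch.cdist(A, B) ** 2 / (2.0 * bandwidth ** 2))

def witness(x, real, fake, h, bandwidth=1.0):
    # witness(x) = mean_i k(h(x), h(real_i)) - mean_j k(h(x), h(fake_j))
    hx = h(x)
    return rbf(hx, h(real), bandwidth).mean(dim=1) - rbf(hx, h(fake), bandwidth).mean(dim=1)

def witness_gradient_penalty(real, fake, h, bandwidth=1.0):
    # penalise the witness's input gradients at random interpolates between real and fake
    eps = torch.rand(real.size(0), 1)
    x = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    w = witness(x, real, fake, h, bandwidth).sum()
    grad = torch.autograd.grad(w, x, create_graph=True)[0]
    return ((grad.norm(dim=1) - 1.0) ** 2).mean()

# critic step: maximise MMD^2(h(real), h(fake)) minus lambda * witness_gradient_penalty(...)
```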
Observations on GAN
To have a successful GAN, you don't use an off-the-shelf critic function; you need a data-specific gradient estimator. This is true for the MMD critic we just developed, and it's also true for the traditional GAN with the KL divergence (it's not the KL or the MMD that is doing the work, it's the data-dependent regularisation).
DO NOT TRAIN CRITIC TO CONVERGENCE
WGAN-GP
Evaluation of GAN: inception score
Based on the classification output p(y|x) of the inception model:
High when the predictive label distribution p(y|x) has low entropy and the marginal label distribution p(y) has high entropy. But it basically can't be used if the classifier's labels don't cover what you're generating; it relies on a trained classifier and can't be used on new categories.
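For reference, the standard definition:

$$
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g} \big[ \mathrm{KL}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad
p(y) = \mathbb{E}_{x \sim p_g}\, p(y \mid x),
$$

which is large exactly when $p(y\mid x)$ is confident (low entropy) and the marginal $p(y)$ is spread out (high entropy).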
Evaluation of GAN: Frechet inception distance (FID)
It won't go to zero even if you calculate the FID between two sets of real images (the estimator is biased). Experiments found that you need around 100,000 samples for the ordering of FID scores to faithfully reflect image quality.
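For reference, FID fits a Gaussian to the Inception features of each sample set, $(\mu_r, \Sigma_r)$ for real and $(\mu_g, \Sigma_g)$ for generated, and computes

$$
\mathrm{FID} = \|\mu_r - \mu_g\|_2^{2} + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big).
$$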
Evaluation of GAN: kernel inception distance (KID)
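KID is the unbiased MMD² computed on Inception features; a minimal sketch using the cubic polynomial kernel k(x, y) = (x·y / d + 1)³ commonly used for KID (feat_real and feat_gen are assumed to be (n, d) Inception feature matrices; in practice the estimate is usually averaged over several subsets):

```python
import numpy as np

def kid(feat_real, feat_gen):
    # unbiased MMD^2 with the cubic polynomial kernel on Inception features
    d = feat_real.shape[1]
    poly = lambda A, B: (A @ B.T / d + 1.0) ** 3
    Kxx, Kyy, Kxy = poly(feat_real, feat_real), poly(feat_gen, feat_gen), poly(feat_real, feat_gen)
    m, n = len(feat_real), len(feat_gen)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.mean())
```

Because the estimator is unbiased, KID is zero in expectation when comparing two sets of real images, unlike FID.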
Dependence testing
slides
To measure the dependence between text and images, we can use paired image-caption data as P, randomly re-paired image-caption data as Q, calculate MMD(P, Q), and use it as a dependence measure.
Next question: what kernel should we use? We can use a kernel k on the image feature space and a kernel l on the sentence feature space; the kernel for the dependence analysis can then just be the product of the two, and the statistic is obtained by taking a trace of the resulting kernel matrices. If the images and captions are dependent, the trace will be large.
BUT: this is not very interpretable, and taking the product of the two kernels feels somewhat arbitrary.
Dependence is not correlation: two variables can be dependent, but if the dependence is not linear the correlation can still be very low.
However, you can always find smooth transformations of the data under which the correlation between the two becomes high.
To this end, we can define the Hilbert–Schmidt independence criterion (HSIC):
We take a range of smooth functions (all orthogonal to each other), compute the correlation under each, and add the squared correlations together.
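A minimal sketch of the (biased) HSIC estimate, reusing gaussian_kernel from the MMD sketch above (any valid kernels on the two feature spaces would do; X holds image features and Y the row-aligned caption features):

```python
import numpy as np

def hsic_biased(X, Y, kernel_x, kernel_y):
    # HSIC estimate: trace(K H L H) / (n - 1)^2, with H the centring matrix
    n = len(X)
    K = kernel_x(X, X)
    L = kernel_y(Y, Y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# e.g. hsic_biased(image_feats, caption_feats, gaussian_kernel, gaussian_kernel)
```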