Kernels
slides
Kernels and feature space
Why
Kernel methods make it possible to separate XOR-like cases that cannot be dealt with by any linear classifier.
Kernel methods can control smoothness and avoid overfitting/underfitting.
What
Hilbert space: an inner product space that contains the limits of all its Cauchy sequences (i.e. it is complete).
Kernel: a dot product between features in a Hilbert space (the features can come from anywhere, e.g. an NN; we don't care how they are produced).
New kernels from old
If k_a and k_b are kernels and c >= 0 is a constant, then c*k_a and k_a + k_b are both kernels (proof via positive definiteness).
The difference of two kernels might not be a kernel: a kernel must give non-negative squared norms k(x, x) >= 0, but the difference of two kernels can violate that.
Products of kernels are kernels
Rules 1 and 3 (sums/scalings and products) give polynomial kernels, since a polynomial of kernels expands into sums and products of kernels.
Exponential quadratic kernel
As long as the algorithm only uses dot products, and those dot products can be written in closed form, the feature space (the things you take inner products between) can be infinite-dimensional (though if you use an NN to construct the features, the feature space will of course be finite-dimensional).
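As a small illustration (plain Python, hypothetical helper names), here is how new kernels can be composed from old ones, ending with the polynomial and exponentiated quadratic examples:

```python
import numpy as np

def k_linear(x, y):
    # the plain dot product is itself a kernel
    return float(np.dot(x, y))

def k_scale(c, k):
    # c * k is a kernel for any constant c >= 0
    return lambda x, y: c * k(x, y)

def k_sum(k1, k2):
    # sums of kernels are kernels
    return lambda x, y: k1(x, y) + k2(x, y)

def k_prod(k1, k2):
    # products of kernels are kernels
    return lambda x, y: k1(x, y) * k2(x, y)

# a degree-2 polynomial kernel (x.y + 1)^2, built purely from sums and products
k_poly2 = k_prod(k_sum(k_linear, lambda x, y: 1.0),
                 k_sum(k_linear, lambda x, y: 1.0))

# the exponential quadratic (Gaussian) kernel: infinitely many features,
# but the dot product has a closed form
def k_expquad(x, y, bandwidth=1.0):
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * bandwidth ** 2)))
```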
Positive definite functions
Kernel validity: kernels are positive definite functions. Proof via the non-negativity of the squared norm in the Hilbert space.
This condition is both necessary and sufficient: every positive definite function is the inner product in a unique Hilbert space.
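A quick numerical sanity check of this (a sketch only: for a valid kernel, the Gram matrix on any set of points should have no eigenvalue below zero, up to floating-point error):

```python
import numpy as np

def gram(k, X):
    # Gram matrix K[i, j] = k(X[i], X[j])
    return np.array([[k(x, y) for y in X] for x in X])

def looks_positive_definite(k, X, tol=1e-8):
    # smallest eigenvalue of the (symmetric) Gram matrix should be >= 0 numerically
    return np.linalg.eigvalsh(gram(k, X)).min() >= -tol

X = np.random.randn(50, 3)
k_gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2)
print(looks_positive_definite(k_gauss, X))                     # expected: True
print(looks_positive_definite(lambda x, y: -np.dot(x, y), X))  # expected: False, not a kernel
```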
The REPRODUCING Kernel Hilbert Space (RKHS)
You can write a function as a linear combination of infinitely many features, but then you need infinitely many coefficients, which is super inconvenient. Remember how we avoided the infinity by taking dot products. Solution: instead of using infinitely many coefficients, approximate the function using a finite number of expansion points (say m). The infinite linear combination of coefficients and features then becomes a finite sum of kernels. The choice of m depends on the problem you're solving; in SVMs, m is typically around 10% of the training points.
So in actual fact a function in the RKHS can be a very simple object, just a finite weighted sum of kernels:
2 defining features of RKHS
1. The reproducing property (a.k.a. the kernel trick):
2. The feature map of every point is a function: the kernel k(x, y) can be viewed as a function of x, with the particular function determined by y (the feature map of y is k(·, y)).
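Written out in standard notation (for $f$ in the RKHS $\mathcal{H}$ with kernel $k$):

$$
f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}},
\qquad
k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}},
$$

so the feature map of $y$ is simply $\varphi(y) = k(\cdot, y)$.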
Understanding smoothness of RKHS
Consider a Fourier Series:
We want to use Fourier series to approximate the "top hat" function:
(Here, in the basis expansion of f(x), all the sine terms disappear because the top hat is symmetric (even), while sine is antisymmetric (odd).)
The more cosine terms you add, the closer the approximation. The approximation is a sum of features (Fourier features), which can be written in kernel form as:
Comparing the plain dot product of Fourier coefficients with the inner product in the Hilbert space (RKHS):
We can see that the RKHS version has the kernel's Fourier coefficients in the denominator, and the resulting squared norm is what enforces smoothness. This is because if we take the inner product of a function (e.g. the kernel itself) with itself, we have:
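Written out (a sketch in standard notation, assuming a translation-invariant kernel with Fourier coefficients $\hat{k}_\ell$ and a function with coefficients $\hat{f}_\ell$):

$$
\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell} \frac{\hat{f}_\ell\, \hat{g}_\ell}{\hat{k}_\ell},
\qquad
\|f\|_{\mathcal{H}}^{2} = \sum_{\ell} \frac{\hat{f}_\ell^{2}}{\hat{k}_\ell},
$$

so any energy that $f$ puts at frequencies where $\hat{k}_\ell$ is tiny gets heavily penalised.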
Using this principle we can determine that the top hat function is not smooth enough for this RKHS: its Fourier coefficients (the numerator) decay polynomially while the kernel's (the denominator) decay exponentially, so the norm blows up and the top hat is not in the RKHS.
In short, the core property of an RKHS is that smoothness is enforced by keeping the RKHS norm small.
Reproducing property:
A unique property of RKHS
The L2 Hilbert space does not have this property. The reproducing property is equivalent to the positive definiteness condition used when defining the RKHS.
A "well behaved" (smooth) function to maximise the mean discrepency. But when two distributions are not too different, we'd have:
Which is still not too bad because P and Q are very similar.
How does MMD "maximise the mean discrepancy"?
The objective is maximised when f points in the same direction as the mean discrepancy (the difference between the two mean embeddings); this optimal f is the witness function.
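Written out (the standard integral-probability-metric form of MMD):

$$
\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \Big[ \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \Big] = \|\mu_P - \mu_Q\|_{\mathcal{H}},
$$

where $\mu_P = \mathbb{E}_{X \sim P}\, k(\cdot, X)$ is the mean embedding; the optimal $f$ (the witness) is proportional to $\mu_P - \mu_Q$.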
Divergences
Question: do we look at the difference or the ratio?
Two sample testing with MMD
Statistical testing language
Example: P and Q are different
Draw n=200 i.i.d. samples from P and Q
Repeat this process 300 times to get 150 MMD estimates; this is the histogram you get for the MMD:
Looks like a Gaussian! This can be proven: when P and Q are different, the MMD estimate is asymptotically Gaussian. (P and Q can have the same mean and different variances; it will still look like a Gaussian. The variance of this Gaussian depends on the choice of kernel.)
Example: P and Q are the same
Side note: notice from above that the estimated MMD can be negative! Because you're estimating the discrepancy between two samples from the same distribution, and the MMD estimator is unbiased with a true value of zero, the estimate must fluctuate around zero and is therefore sometimes negative.
The null distribution is an infinite sum of chi-squared variables (weighted by the kernel eigenvalues), centred to have zero mean.
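A minimal sketch of the statistic being histogrammed here: the unbiased MMD² estimator with an exponentiated quadratic kernel (X and Y are (n, d) arrays; the bandwidth is a free choice):

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), computed for all pairs
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    # unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q (can be negative)
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    # drop the diagonal terms of the within-sample averages for unbiasedness
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```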
Summary
How to get test threshold c?
Now we permute the dataset:
And this is how we get the threshold c: it's a quantile (e.g. the 95th percentile) of the null distribution estimated from the permuted MMDs, i.e. the cut-off beyond which an MMD is no longer plausible under the null.
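A sketch of that permutation procedure, reusing mmd2_unbiased from the sketch above (n_perms and alpha are arbitrary choices here):

```python
import numpy as np

def permutation_threshold(X, Y, bandwidth=1.0, n_perms=500, alpha=0.05, rng=None):
    # estimate the threshold c as the (1 - alpha) quantile of the permutation null
    rng = np.random.default_rng() if rng is None else rng
    Z = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(n_perms):
        perm = rng.permutation(len(Z))
        null_stats.append(mmd2_unbiased(Z[perm[:m]], Z[perm[m:]], bandwidth))
    return np.quantile(null_stats, 1.0 - alpha)

# reject H0 (P = Q) if mmd2_unbiased(X, Y, bw) > permutation_threshold(X, Y, bw)
```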
Choosing kernels
maximising test power == minimising false negatives
we want the blue area to be as small as possible.
Computing the second term is extremely hard: it's a function of our kernel in an extremely complicated way (eigenvalues, the chi-squared null distribution, and so on).
But, luckily:
So, to maximise test power, we just need to maximise the first term. (check out code)
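The actual criterion maximises the ratio of MMD² to its standard deviation under the alternative. As a rough stand-in only, this sketch estimates that ratio by subsampling (reusing mmd2_unbiased from the earlier sketch) and grid-searches the bandwidth; in a real test the kernel must be chosen on data held out from the final test, and the candidate bandwidths and n_splits here are arbitrary:

```python
import numpy as np

def choose_bandwidth(X, Y, candidates, n_splits=20, eps=1e-8, rng=None):
    # pick the bandwidth maximising a crude power proxy: mean(MMD^2) / std(MMD^2)
    rng = np.random.default_rng() if rng is None else rng
    best_bw, best_score = None, -np.inf
    for bw in candidates:
        stats = []
        for _ in range(n_splits):
            ix = rng.choice(len(X), size=len(X) // 2, replace=False)
            iy = rng.choice(len(Y), size=len(Y) // 2, replace=False)
            stats.append(mmd2_unbiased(X[ix], Y[iy], bw))
        score = np.mean(stats) / (np.std(stats) + eps)
        if score > best_score:
            best_bw, best_score = bw, score
    return best_bw
```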
GAN: real vs. fake
Distinguishing real digits from fake ones:
Our kernel has a bandwidth for each pixel, so that it ignores the unimportant pixels and focuses on the important ones.
So while humans are not able to distinguish the differences, statistical tests are very confident.
Tutorial take-away
See tutorial and solution
In a traditional t-test we have to take many samples to say with confidence that two statistics are different, but with MMD we can tell with only two samples (one set drawn from each distribution).
GAN: training
Recap: previously we talked about two different families of distances between distributions: integral probability metrics (Wasserstein, MMD) and f-divergences (KL).
F-divergence as critic
Unhelpful: the following two cases both have D_JS = log 2
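Why log 2: write $M = \tfrac{1}{2}(P + Q)$. When $P$ and $Q$ have disjoint supports, $M = P/2$ on the support of $P$ (and $Q/2$ on the support of $Q$), so

$$
D_{\mathrm{JS}}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M)
= \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2 = \log 2,
$$

no matter how far apart the two distributions are, which is why it gives no useful gradient.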
In practice, we:
Use variational approximation to the critic, alternate generator and critic training
Add instance noise to the reference and generator observations (or a gradient penalty for the variational critic)
Wasserstein as critic
Helpful, the distance decreases when we get closer to the real distribution
MMD as critic
Wide kernels are helpful, narrow kernels are unhelpful:
To use MMD as a GAN critic, we need features that are "image specific", because MMD itself doesn't really know what an image looks like -- so we use NN!
How to train?
Reminder: the witness function is obtained by subtracting the mean kernel embedding of the red points from that of the blue points.
So now, when we push the inputs through the NN first, look at the witness function:
There are lots of distortions!
MMD: helpful vs. unhelpful example
If the MMD gives a powerful test (can accurately separate blue and red) it will not be a good critic. To see this, let's look at a 2D example.
The gradient when using kernels k(x, y):
kernel:
This is when MMD is helpful: there are gradients everywhere so we always know which way to optimise. Now we can have a look at the gradient when using CNN features, kernels k(h(x), h(y)):
Kernels:
This is not helpful! The red points mostly get no gradient telling them where to move; the test power is too strong.
What to do with MMD? Regularisation!
We can add a gradient regulariser (an approximation of it because the exact form is too computationally expensive O(n^3)):
This is now a bit more helpful:
Early stages: diffused gradient, tells you where to move your distribution
Late stages: concentrated gradient
The regulariser forces the kernels to adapt to the shape of the target distribution, so that they can move around and capture it.
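As a rough PyTorch-style sketch of the idea (this is a WGAN-GP-style penalty on the kernel witness computed on NN features h, not the exact regulariser from the lecture; inputs are assumed to be flattened vectors, with real and fake batches the same size):

```python
import torch

def rbf(A, B, bandwidth=1.0):
    # exponentiated quadratic kernel matrix between feature batches A and B
    return torch.exp(-torch.cdist(A, B) ** 2 / (2.0 * bandwidth ** 2))

def witness(x, real, fake, h, bandwidth=1.0):
    # witness(x) = mean_i k(h(x), h(real_i)) - mean_j k(h(x), h(fake_j))
    hx = h(x)
    return rbf(hx, h(real), bandwidth).mean(dim=1) - rbf(hx, h(fake), bandwidth).mean(dim=1)

def witness_gradient_penalty(real, fake, h, bandwidth=1.0):
    # penalise the witness's input gradients at random interpolates between real and fake
    eps = torch.rand(real.size(0), 1)
    x = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    w = witness(x, real, fake, h, bandwidth).sum()
    grad = torch.autograd.grad(w, x, create_graph=True)[0]
    return ((grad.norm(dim=1) - 1.0) ** 2).mean()

# critic step: maximise MMD^2(h(real), h(fake)) minus lambda * witness_gradient_penalty(...)
```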
Observations on GAN
To have a successful GAN, you don't use an off-the-shelf critic function; you need a data-specific gradient estimator. This is true for the MMD critic we just developed, and it's also true for the traditional GAN with the KL divergence (it's not the KL or the MMD that is doing the work, it's the data-dependent regularisation).
DO NOT TRAIN CRITIC TO CONVERGENCE
WGAN-GP
Evaluation of GAN: inception score
Based on the classification output p(y|x) of the inception model:
High when the predictive label distribution p(y|x) has low entropy and the marginal label distribution p(y) has high entropy. But it basically can't be used if the classifier's labels don't cover what you're generating; it relies on a trained classifier and can't be used on new categories.
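For reference, the standard definition:

$$
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g} \big[ \mathrm{KL}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad
p(y) = \mathbb{E}_{x \sim p_g}\, p(y \mid x),
$$

which is large exactly when $p(y\mid x)$ is confident (low entropy) and the marginal $p(y)$ is spread out (high entropy).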
Evaluation of GAN: Frechet inception distance (FID)
It won't go to zero even if you calculate the FID between two sets of real images (the estimator is biased). Experiments found that you need around 100,000 samples for the ordering of FID scores to faithfully reflect image quality.
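For reference, FID fits a Gaussian to the Inception features of each sample set, $(\mu_r, \Sigma_r)$ for real and $(\mu_g, \Sigma_g)$ for generated, and computes

$$
\mathrm{FID} = \|\mu_r - \mu_g\|_2^{2} + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big).
$$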
Evaluation of GAN: kernel inception distance (KID)
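KID is the unbiased MMD² computed on Inception features; a minimal sketch using the cubic polynomial kernel k(x, y) = (x·y / d + 1)³ commonly used for KID (feat_real and feat_gen are assumed to be (n, d) Inception feature matrices; in practice the estimate is usually averaged over several subsets):

```python
import numpy as np

def kid(feat_real, feat_gen):
    # unbiased MMD^2 with the cubic polynomial kernel on Inception features
    d = feat_real.shape[1]
    poly = lambda A, B: (A @ B.T / d + 1.0) ** 3
    Kxx, Kyy, Kxy = poly(feat_real, feat_real), poly(feat_gen, feat_gen), poly(feat_real, feat_gen)
    m, n = len(feat_real), len(feat_gen)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.mean())
```

Because the estimator is unbiased, KID is zero in expectation when comparing two sets of real images, unlike FID.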
Dependence testing
slides
To measure the dependence between text and images, we can use paired image-caption data as P, randomly re-paired image-caption data as Q, calculate MMD(P, Q), and use it as a dependence measure.
Next question: what kernel should we use? We can use a kernel k on the image feature space and a kernel l on the sentence feature space; the kernel for the dependence analysis can then just be the product of the two, and the statistic is obtained by taking a trace of the resulting kernel matrices. If the images and captions are dependent, the trace will be large.
BUT: this is not very interpretable, and taking the product of the two kernels feels somewhat arbitrary.
Dependence is not correlation: two variables can be dependent, but if the dependence is not linear the correlation can still be very low.
However, you can always find smooth transformations of the data under which the correlation between the two becomes high.
To this end, we can define the Hilbert–Schmidt independence criterion (HSIC):
We take a range of smooth functions (all orthogonal to each other), compute the correlation under each, and add the squared correlations together.
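A minimal sketch of the (biased) HSIC estimate, reusing gaussian_kernel from the MMD sketch above (any valid kernels on the two feature spaces would do; X holds image features and Y the row-aligned caption features):

```python
import numpy as np

def hsic_biased(X, Y, kernel_x, kernel_y):
    # HSIC estimate: trace(K H L H) / (n - 1)^2, with H the centring matrix
    n = len(X)
    K = kernel_x(X, X)
    L = kernel_y(Y, Y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# e.g. hsic_biased(image_feats, caption_feats, gaussian_kernel, gaussian_kernel)
```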