
Memory collapses with precomputed block matrix #180

Closed
fsvbach opened this issue Apr 18, 2021 · 17 comments
Labels
bug Something isn't working

Comments

@fsvbach

fsvbach commented Apr 18, 2021

Expected behaviour

When I run t-SNE on a symmetric 200x200 block distance matrix such as this one
[image: block distance matrix]
I expect t-SNE to return 4 distinct clusters (actually only 4 visible points). Sklearn yields this.

Actual behaviour

Using openTSNE, the process consumes all available memory and crashes (about 50% of the time). If it survives, the clusters are visible, but the result is not as clean.

Steps to reproduce the behavior

matrix = ...  # the 200x200 block distance matrix from above
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='bh')
embedding = tsne.fit(matrix)
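For reproducibility, such a matrix can be sketched as follows (assuming 4 groups of 50 points, zero distance within a group and unit distance between groups; the actual matrix.csv may differ):

```python
import numpy as np

# Hypothetical stand-in for the attached matrix: 4 groups of 50 points,
# zero distance within a group and constant distance between groups.
labels = np.repeat(np.arange(4), 50)   # group label for each of the 200 points
matrix = (labels[:, None] != labels[None, :]).astype(float)
```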

NOTE: I am using the current GitHub version, installed this morning.

@pavlin-policar
Owner

Please provide a minimal code example where this fails so we can reproduce and debug this. There really isn't anything obvious that would cause the memory to balloon.

Does the same thing happen when using negative_gradient_method="fft"?

@fsvbach
Author

fsvbach commented Apr 18, 2021

Okay, with negative_gradient_method="fft" it works, but I sometimes get:

RuntimeWarning: invalid value encountered in log
  kl_divergence_ += sum_P * np.log(sum_Q + EPSILON)

A minimal example would be:

from openTSNE import TSNE
import numpy as np
matrix = np.genfromtxt('matrix.csv', delimiter=',')
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='bh')
embedding = tsne.fit(matrix)

with the matrix in matrix.csv
matrix.csv

Note: It doesn't always crash, but every third or fourth time, so I think it's due to some initialization?

@pavlin-policar
Owner

What I suspect may be happening is that if the points are too close together in the embedding, the BH tree balloons. However, I should have some safeguards in place so that this doesn't happen. Maybe those safeguards aren't enough. This might be plausible since you're trying to force all the points super close together. But I'd need to investigate it further. I suggest you use fft for the time being.

@dkobak
Contributor

dkobak commented Apr 18, 2021

I expect TSNE to return 4 distinct clusters (actually 4 points only). Sklearn yields this.

Actually, the points should not be on top of each other in the embedding. When some points have zero distance in the original space, they get finite affinities and should end up close, but not overlapping, in the embedding.

If you get overlapping points in sklearn, that would be weird.

Can you post a picture of sklearn embedding, and also of what you get from openTSNE when it does not crash?

@fsvbach
Author

fsvbach commented Apr 19, 2021

Yes.
[image: openTSNE embedding, BH]
This is the result of openTSNE with the bh setting when it doesn't crash.

[image: openTSNE embedding, FFT]
This is the result of openTSNE with the fft setting (where I get the above warning).

[image: sklearn embedding]
And this (similar) plot is what I get with sklearn every time, without warnings.

@dkobak
Contributor

dkobak commented Apr 19, 2021

Wow. None of that makes any sense to me! Unless I am missing something, the points should not overlap at all, so we should see 200 separate points. BH-openTSNE makes the most sense, but even there far fewer than 200 points are visible. In FFT-openTSNE there is 1e11 on the y-axis?! And sklearn simply shows 4 points... Very odd.

@pavlin-policar
Owner

I agree, none of these plots seem correct. The repulsive forces should push the points apart at least a little bit. However, the spans of the embeddings are quite big, so maybe it could be a visual plotting thing that they look like a single point. Are the final coordinates really just a single point, or are they just really close to one another and the large point sizes and large spans just make it look that way?

@dkobak
Contributor

dkobak commented Apr 19, 2021

It seems this has nothing to do with precomputed specifically, but with how we treat identical points.

import numpy as np
from openTSNE import TSNE

np.random.seed(42)

# four blocks of 50 identical rows each
X = np.concatenate((
    np.array(list(np.random.randn(1, 10)) * 50),
    np.array(list(np.random.randn(1, 10)) * 50),
    np.array(list(np.random.randn(1, 10)) * 50),
    np.array(list(np.random.randn(1, 10)) * 50),
))

Z = TSNE(negative_gradient_method='bh').fit(X)

print(np.unique(np.sum(Z, axis=1)).size)

This prints 6 🤯

Update: with sklearn I get points that are almost overlapping but are not identical. However, this result seems fishy to me too.

Update 2: with FFT I get 7 distinct points, so the points are exactly overlapping.

Update 3: I checked, and the same thing happens with uniform affinities, so it's not a perplexity-calibration problem.

Also note that adding a minuscule amount of noise, e.g. X += np.random.randn(200, 10) / 10000, makes the problem go away.

@fsvbach
Author

fsvbach commented Apr 19, 2021

The points in FFT-openTSNE for example are something like

      ...      [  1.76457933,  13.81758939],
               [  2.02415051,  13.85560307],
               [  1.94253379,  13.86327763],
               [  2.01328632,  13.71940575],
               [  2.04201515,  13.7215927 ],
               [  2.17067327,  13.8131649], ...

It makes sense to me that they are very close, so that matplotlib overlaps them even with point size 1. What doesn't make sense to me: why doesn't BH-openTSNE show this behaviour, i.e. why are some points very close and others fairly far apart (and why does it crash, of course)?

Note: In this run I didn't get the RuntimeWarning, and the plot (with a smaller point size) looks like this:
[image: openTSNE embedding, FFT, smaller point size]
Now the y-axis is fine, too.

@dkobak
Contributor

dkobak commented Apr 20, 2021

I realized that I was wrong when I said that the optimal embedding should not have overlapping points. On reflection, it should be possible to reduce the KL divergence to zero by making all points from each class overlap exactly: all pairs of points that overlap in high dimensions have identical affinities, and if they also overlap in 2D, they have identical low-dimensional similarities.
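The argument can be put in symbols (a sketch, not a proof): the KL divergence

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \;\geq\; 0$$

is zero precisely when $q_{ij} = p_{ij}$ for all pairs. If the points of each block coincide in both spaces, then $p_{ij}$ and $q_{ij}$ depend only on the block labels of $i$ and $j$, so only a handful of distinct within-block and between-block values need to match, which the embedding has enough freedom to achieve.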

If so, this means that sklearnTSNE returns the correct result.

But I see a lot of things going weird when optimizing it with openTSNE...

Using the same data as above

np.random.seed(42)
X = np.concatenate((
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50)
    ))

It works fine with starting from random initialization:

TSNE(negative_gradient_method='bh', initialization='random', verbose=1).fit(X)
TSNE(negative_gradient_method='fft', initialization='random', verbose=1).fit(X)

both reduce KL to near-zero and make the points roughly (but not exactly) overlap.

But using PCA initialization (where points in the initialization overlap exactly), the optimization goes wild.

TSNE(negative_gradient_method='bh', initialization='pca', verbose=1).fit(X)

BH yields negative KL values, which is mathematically impossible!

===> Running optimization with exaggeration=12.00, lr=200.00 for 250 iterations...
Iteration   50, KL divergence -5.1860, 50 iterations in 0.0159 sec
Iteration  100, KL divergence -6.2852, 50 iterations in 0.0147 sec
Iteration  150, KL divergence -7.0769, 50 iterations in 0.0148 sec
Iteration  200, KL divergence -7.6468, 50 iterations in 0.0149 sec
Iteration  250, KL divergence -8.0897, 50 iterations in 0.0147 sec
   --> Time elapsed: 0.08 seconds
===> Running optimization with exaggeration=1.00, lr=200.00 for 500 iterations...
Iteration   50, KL divergence -9.2070, 50 iterations in 0.0163 sec
Iteration  100, KL divergence -9.7631, 50 iterations in 0.0144 sec
Iteration  150, KL divergence -10.1641, 50 iterations in 0.0152 sec
Iteration  200, KL divergence -10.4828, 50 iterations in 0.0149 sec
Iteration  250, KL divergence -10.7496, 50 iterations in 0.0162 sec
Iteration  300, KL divergence -10.9802, 50 iterations in 0.0150 sec
Iteration  350, KL divergence -11.1838, 50 iterations in 0.0162 sec
Iteration  400, KL divergence -11.3666, 50 iterations in 0.0149 sec
Iteration  450, KL divergence -11.5325, 50 iterations in 0.0143 sec
Iteration  500, KL divergence -11.6847, 50 iterations in 0.0151 sec

While FFT...

TSNE(negative_gradient_method='fft', initialization='pca', verbose=1).fit(X)

...diverges!

===> Running optimization with exaggeration=12.00, lr=200.00 for 250 iterations...
Iteration   50, KL divergence 0.0446, 50 iterations in 0.6821 sec
Iteration  100, KL divergence 0.0292, 50 iterations in 0.7156 sec
Iteration  150, KL divergence 0.0227, 50 iterations in 0.7206 sec
Iteration  200, KL divergence 0.0240, 50 iterations in 0.7323 sec
Iteration  250, KL divergence 0.0195, 50 iterations in 0.6954 sec
   --> Time elapsed: 3.55 seconds
===> Running optimization with exaggeration=1.00, lr=200.00 for 500 iterations...
Iteration   50, KL divergence 0.0321, 50 iterations in 0.6926 sec
Iteration  100, KL divergence 0.0815, 50 iterations in 0.6904 sec
Iteration  150, KL divergence 0.1199, 50 iterations in 0.6936 sec
Iteration  200, KL divergence 0.1419, 50 iterations in 0.6955 sec
Iteration  250, KL divergence 8.1304, 50 iterations in 0.8760 sec
Iteration  300, KL divergence 13.9849, 50 iterations in 0.6924 sec
Iteration  350, KL divergence 14.7327, 50 iterations in 0.6963 sec
Iteration  400, KL divergence 15.2104, 50 iterations in 0.6898 sec
Iteration  450, KL divergence 15.4548, 50 iterations in 0.6824 sec
Iteration  500, KL divergence 15.5808, 50 iterations in 0.7088 sec

@dkobak
Contributor

dkobak commented Apr 20, 2021

Update: sklearn also shows negative KL values when using PCA initialization (but not with random initialization)!

Z = SklearnTSNE(verbose=2, init='pca').fit_transform(X)

results in

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 200 samples in 0.001s...
[t-SNE] Computed neighbors for 200 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 200 / 200
[t-SNE] Mean sigma: 0.000000
[t-SNE] Computed conditional probabilities in 0.034s
[t-SNE] Iteration 50: error = -59.8060608, gradient norm = 0.0029903 (50 iterations in 0.040s)
[t-SNE] Iteration 100: error = -75.7929688, gradient norm = 0.0015364 (50 iterations in 0.018s)
[t-SNE] Iteration 150: error = -85.3008423, gradient norm = 0.0010339 (50 iterations in 0.017s)
[t-SNE] Iteration 200: error = -92.0923309, gradient norm = 0.0007792 (50 iterations in 0.018s)
[t-SNE] Iteration 250: error = -97.3766937, gradient norm = 0.0006251 (50 iterations in 0.017s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: -97.376694
[t-SNE] Iteration 300: error = -10.6954212, gradient norm = 0.0005959 (50 iterations in 0.032s)
[t-SNE] Iteration 350: error = -10.9293900, gradient norm = 0.0005301 (50 iterations in 0.018s)
[t-SNE] Iteration 400: error = -11.2307148, gradient norm = 0.0004560 (50 iterations in 0.020s)
[t-SNE] Iteration 450: error = -11.5423441, gradient norm = 0.0003902 (50 iterations in 0.018s)
[t-SNE] Iteration 500: error = -11.8387394, gradient norm = 0.0003365 (50 iterations in 0.017s)
[t-SNE] Iteration 550: error = -12.1112652, gradient norm = 0.0002936 (50 iterations in 0.017s)
[t-SNE] Iteration 600: error = -12.3597937, gradient norm = 0.0002593 (50 iterations in 0.017s)
[t-SNE] Iteration 650: error = -12.5856562, gradient norm = 0.0002316 (50 iterations in 0.017s)
[t-SNE] Iteration 700: error = -12.7922802, gradient norm = 0.0002089 (50 iterations in 0.016s)
[t-SNE] Iteration 750: error = -12.9815378, gradient norm = 0.0001900 (50 iterations in 0.017s)
[t-SNE] Iteration 800: error = -13.1556988, gradient norm = 0.0001741 (50 iterations in 0.016s)
[t-SNE] Iteration 850: error = -13.3171625, gradient norm = 0.0001606 (50 iterations in 0.023s)
[t-SNE] Iteration 900: error = -13.4675283, gradient norm = 0.0001490 (50 iterations in 0.024s)
[t-SNE] Iteration 950: error = -13.6077213, gradient norm = 0.0001389 (50 iterations in 0.017s)
[t-SNE] Iteration 1000: error = -13.7389517, gradient norm = 0.0001301 (50 iterations in 0.027s)
[t-SNE] KL divergence after 1000 iterations: -13.738952

so it seems that sklearnTSNE and openTSNE have the same problem here.

@pavlin-policar
Owner

Hmm, this is all very strange, and I'll need to look into it some more. But I think that in my implementation of BH, one of the assumptions I made was that the embedding points would never actually overlap like that. So the BH one doesn't surprise me that much. And also, such things are pretty difficult to solve with BH, if memory serves. I am a bit surprised about FIt-SNE not working, since it doesn't make any assumptions about how the points are spaced out.
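The BH failure mode with coincident points can be illustrated with a toy sketch (hypothetical code, not openTSNE's actual tree): a quadtree split can never separate two identical points, so insertion keeps recursing until a depth cap is hit, or, without a cap, until memory runs out.

```python
def insert_depth(points, depth=0, max_depth=50):
    """Recursion depth reached while splitting `points` quadtree-style."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    # split on the midpoint of the bounding box, as a quadtree would
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx = (min(xs) + max(xs)) / 2
    my = (min(ys) + max(ys)) / 2
    quads = {}
    for p in points:
        quads.setdefault((p[0] > mx, p[1] > my), []).append(p)
    return max(insert_depth(q, depth + 1, max_depth) for q in quads.values())

print(insert_depth([(0.0, 0.0), (1.0, 1.0)]))  # distinct points separate quickly
print(insert_depth([(0.5, 0.5), (0.5, 0.5)]))  # identical points hit the depth cap
```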

Perhaps the negative KL values are the most puzzling of all.

I definitely consider this an edge case, and a very unusual usage. Adding just a little bit of noise to the distance matrix makes everything run smoothly.

@dkobak
Contributor

dkobak commented Apr 23, 2021

Perhaps the negative KL values are the most puzzling of all.

Aren't the KL values during BH optimization themselves computed via the BH approximation? If so, it's not surprising that they can end up being very wrong.
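One way to test this suspicion (a sketch; exact_kl is a hypothetical helper, not part of openTSNE): recompute the KL divergence exactly from the normalized affinity matrix P and the final embedding Z. Since P and Q are both proper distributions, the exact value is non-negative by Gibbs' inequality, so any negative reported value must come from the approximation.

```python
import numpy as np

def exact_kl(P, Z):
    """Exact KL(P || Q) for an embedding Z and normalized affinities P.

    Assumes P is a proper distribution over pairs (zero diagonal, sums to 1).
    """
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq)        # Student-t similarities
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()
    mask = P > 0                  # 0 * log(0) terms contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Comparing this exact value against the optimizer's reported one would show how far off the BH estimate is.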

And also, such things are pretty difficult to solve with BH, if memory serves.

I have no idea about this. But if it's really true, then a dirty workaround would be to add some tiny noise to the initialization (with a warning) if the repulsion method is BH and there are overlapping points... I'm just not sure what the best way is to detect overlapping points in the initialization.

I am a bit surprised about FIt-SNE not working, since it doesn't make any assumptions about how the points are spaced out.

In fact, we may do the same for FFT as well if we cannot otherwise find out what's going on there.

@dkobak
Contributor

dkobak commented Apr 26, 2021

I'm just not sure what's the best way to detect that there are overlapping points in the initialization.

Maybe something like

if initialization.shape[0] != np.unique(initialization.round(decimals=6), axis=0).shape[0]

can work to detect whether there are duplicates in the initialization array...

@dkobak
Contributor

dkobak commented May 5, 2021

Somewhat fleshing out this workaround:

import numpy as np

X = np.random.randn(1_000_000,2)
X[1000] = X[2000]

def jitter(X, precision=6):
    _, ind_unique = np.unique(X.round(decimals=precision), axis=0, return_index=True)
    ind_duplicates = np.setdiff1d(np.arange(X.shape[0]), ind_unique)
    if ind_duplicates.size > 0:
        print(f'Found {ind_duplicates.size} near-duplicate(s). Adding some jitter to them.')
        X[ind_duplicates] += np.random.randn(ind_duplicates.size, X.shape[1]) * 10**-precision
    return X

%time X = jitter(X)

works in under a second for n = 1 million:

Found 1 near-duplicate(s). Adding some jitter to them.
CPU times: user 872 ms, sys: 12 ms, total: 884 ms
Wall time: 884 ms

So my suggestion would be to run this on any initialization (explicitly provided by the user or computed using PCA) and print a warning when adding jitter.

@pavlin-policar
Owner

Thanks for this solution, I'm actually amazed it works this fast!

At first I was kind of on the fence about this, thinking: "How often would this actually happen?" A block distance matrix is something you'll rarely ever see in practice. However, duplicate entries aren't rare at all; even in the iris data set there are two identical rows. For iris this isn't a problem: the two identical data points get embedded into the exact same position, and that seems to work okay. The question then becomes: why doesn't it work when there are a lot of duplicate entries?

For iris, I tried converting a number of rows into duplicates and seeing where it failed. Interestingly enough, it failed once I used 30 duplicate points. This doesn't correlate with the perplexity: with 30 duplicate rows, increasing the perplexity to e.g. 70 still results in negative KL values. So it seems like the problem has always been there, but the KL values don't always reach negative territory, and things work as long as the number of duplicate entries isn't too big.

I'm fine with using your fix as a temporary solution. Clearly there is a bug somewhere that needs to be found, though I don't think this is something that would happen very often. If you want to open a PR with this temporary fix, I'll be happy to merge it.

@pavlin-policar pavlin-policar added the bug Something isn't working label May 9, 2021
@dkobak
Contributor

dkobak commented Jul 19, 2021

I'm fine with using your fix as a temporary solution. Clearly, there is a bug somewhere, and it needs to be found. However, this isn't something that would happen very often though I think. If you want to open a PR with this temporary fix, I'll be happy to merge it.

Just want to comment here briefly that I implemented this in a branch, but then observed some odd behaviour (some "near-duplicates" were detected in the initialization when there should not have been any) and haven't gotten around to investigating it further since.
