
Memory collapses with precomputed block matrix #180

Closed
fsvbach opened this issue Apr 18, 2021 · 17 comments
Labels
bug Something isn't working

Comments

@fsvbach

fsvbach commented Apr 18, 2021

Expected behaviour

When I run t-SNE on a symmetric 200x200 block distance matrix such as this one
[image: block distance matrix]
I expect t-SNE to return 4 distinct clusters (actually only 4 visible points). Sklearn yields this.

Actual behaviour

Using openTSNE, the process consumes all available memory and crashes (about 50% of the time). If it survives, the clusters are visible, but the result is not as clean.

Steps to reproduce the behavior

matrix = ...  # the 200x200 block distance matrix from above
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='bh')
embedding = tsne.fit(matrix)
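For reproducibility, such a matrix can be sketched as follows (assuming 4 groups of 50 points, zero distance within a group and unit distance between groups; the actual matrix.csv may differ):

```python
import numpy as np

# Hypothetical stand-in for the attached matrix: 4 groups of 50 points,
# zero distance within a group and constant distance between groups.
labels = np.repeat(np.arange(4), 50)   # group label for each of the 200 points
matrix = (labels[:, None] != labels[None, :]).astype(float)
```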

NOTE: I am using the current GitHub version, installed this morning.

@pavlin-policar
Owner

Please provide a minimal code example where this fails so we can reproduce and debug this. There really isn't anything obvious that would cause the memory to balloon.

Does the same thing happen when using negative_gradient_method="fft"?

@fsvbach
Author

fsvbach commented Apr 18, 2021

Okay, with negative_gradient_method="fft" it works, but I sometimes get:

RuntimeWarning: invalid value encountered in log
  kl_divergence_ += sum_P * np.log(sum_Q + EPSILON)

A minimal example would be:

from openTSNE import TSNE
import numpy as np
matrix = np.genfromtxt('matrix.csv', delimiter=',')
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='bh')
embedding = tsne.fit(matrix)

with the matrix in matrix.csv
matrix.csv

Note: It doesn't always crash, but every third or fourth time, so I think it's due to some initialization?

@pavlin-policar
Owner

What I suspect may be happening is that if the points are too close together in the embedding, the BH tree balloons. However, I should have some safeguards in place so that this doesn't happen. Maybe those safeguards aren't enough. This might be plausible since you're trying to force all the points super close together. But I'd need to investigate it further. I suggest you use fft for the time being.

@dkobak
Contributor

dkobak commented Apr 18, 2021

I expect TSNE to return 4 distinct clusters (actually 4 points only). Sklearn yields this.

Actually, the points should not be on top of each other in the embedding. When some points have zero distance in the original space, they get finite affinities and should end up close, but not overlapping, in the embedding.

If you get overlapping points in sklearn, that would be weird.

Can you post a picture of sklearn embedding, and also of what you get from openTSNE when it does not crash?

@fsvbach
Author

fsvbach commented Apr 19, 2021

Yes.
[image: openTSNE embedding, BH]
This is the result of openTSNE with the bh setting when it doesn't crash.

[image: openTSNE embedding, FFT]
This is the result of openTSNE with the fft setting (where I get the above warning).

[image: sklearn embedding]
And this (similar) plot is what I get with sklearn every time, without warnings.

@dkobak
Contributor

dkobak commented Apr 19, 2021

Wow. None of that makes any sense to me! Unless I am missing something, the points should not overlap at all, so we should see 200 separate points. BH-openTSNE makes the most sense, but even there far fewer than 200 points are visible. In FFT-openTSNE there is 1e11 on the y-axis?! And sklearn simply shows 4 points... Very odd.

@pavlin-policar
Owner

I agree, none of these plots seem correct. The repulsive forces should push the points apart at least a little bit. However, the spans of the embeddings are quite big, so maybe it could be a visual plotting thing that they look like a single point. Are the final coordinates really just a single point, or are they just really close to one another and the large point sizes and large spans just make it look that way?

@dkobak
Contributor

dkobak commented Apr 19, 2021

It seems this has nothing to do with precomputed specifically, but with how we treat identical points.

import numpy as np
from openTSNE import TSNE

np.random.seed(42)

# four blocks of 50 identical rows each
X = np.concatenate((
    np.array(list(np.random.randn(1, 10)) * 50),
    np.array(list(np.random.randn(1, 10)) * 50),
    np.array(list(np.random.randn(1, 10)) * 50),
    np.array(list(np.random.randn(1, 10)) * 50),
))

Z = TSNE(negative_gradient_method='bh').fit(X)

print(np.unique(np.sum(Z, axis=1)).size)

This prints 6 🤯

Update: with sklearn I get points that are almost overlapping but are not identical. However, this result seems fishy to me too.

Update 2: with FFT I get 7 distinct points, so the points are exactly overlapping.

Update 3: I checked, and the same thing happens with uniform affinities, so it's not a perplexity-calibration problem.

Also note that adding a minuscule amount of noise, e.g. X += np.random.randn(200, 10) / 10000, makes the problem go away.

@fsvbach
Author

fsvbach commented Apr 19, 2021

The points in FFT-openTSNE for example are something like

      ...      [  1.76457933,  13.81758939],
               [  2.02415051,  13.85560307],
               [  1.94253379,  13.86327763],
               [  2.01328632,  13.71940575],
               [  2.04201515,  13.7215927 ],
               [  2.17067327,  13.8131649], ...

It makes sense to me that they are very close, so that matplotlib overlaps them even with point size 1. What doesn't make sense to me: why doesn't BH-openTSNE show this behaviour, i.e. why are some points very close and others fairly far apart (and why does it crash, of course)?

Note: In this run I didn't get the RuntimeWarning, and the plot (with a smaller point size) looks like this:
[image: openTSNE embedding, FFT, smaller point size]
Now the y-axis is fine, too.

@dkobak
Contributor

dkobak commented Apr 20, 2021

I realized that I was wrong when I said that the optimal embedding should not have overlapping points. On reflection, it should be possible to reduce the KL divergence to zero by making all points from each class overlap exactly: all pairs of points that overlap in high dimensions have identical affinities, and if they also overlap in 2D, they have identical low-dimensional similarities.
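The argument can be put in symbols (a sketch, not a proof): the KL divergence

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \;\geq\; 0$$

is zero precisely when $q_{ij} = p_{ij}$ for all pairs. If the points of each block coincide in both spaces, then $p_{ij}$ and $q_{ij}$ depend only on the block labels of $i$ and $j$, so only a handful of distinct within-block and between-block values need to match, which the embedding has enough freedom to achieve.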

If so, this means that sklearnTSNE returns the correct result.

But I see a lot of things going weird when optimizing it with openTSNE...

Using the same data as above

np.random.seed(42)
X = np.concatenate((
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50)
    ))

It works fine with starting from random initialization:

TSNE(negative_gradient_method='bh', initialization='random', verbose=1).fit(X)
TSNE(negative_gradient_method='fft', initialization='random', verbose=1).fit(X)

both reduce KL to near-zero and make the points roughly (but not exactly) overlap.

But using PCA initialization (where points in the initialization overlap exactly), the optimization goes wild.

TSNE(negative_gradient_method='bh', initialization='pca', verbose=1).fit(X)

BH yields negative KL values, which is mathematically impossible!

===> Running optimization with exaggeration=12.00, lr=200.00 for 250 iterations...
Iteration   50, KL divergence -5.1860, 50 iterations in 0.0159 sec
Iteration  100, KL divergence -6.2852, 50 iterations in 0.0147 sec
Iteration  150, KL divergence -7.0769, 50 iterations in 0.0148 sec
Iteration  200, KL divergence -7.6468, 50 iterations in 0.0149 sec
Iteration  250, KL divergence -8.0897, 50 iterations in 0.0147 sec
   --> Time elapsed: 0.08 seconds
===> Running optimization with exaggeration=1.00, lr=200.00 for 500 iterations...
Iteration   50, KL divergence -9.2070, 50 iterations in 0.0163 sec
Iteration  100, KL divergence -9.7631, 50 iterations in 0.0144 sec
Iteration  150, KL divergence -10.1641, 50 iterations in 0.0152 sec
Iteration  200, KL divergence -10.4828, 50 iterations in 0.0149 sec
Iteration  250, KL divergence -10.7496, 50 iterations in 0.0162 sec
Iteration  300, KL divergence -10.9802, 50 iterations in 0.0150 sec
Iteration  350, KL divergence -11.1838, 50 iterations in 0.0162 sec
Iteration  400, KL divergence -11.3666, 50 iterations in 0.0149 sec
Iteration  450, KL divergence -11.5325, 50 iterations in 0.0143 sec
Iteration  500, KL divergence -11.6847, 50 iterations in 0.0151 sec

While FFT...

TSNE(negative_gradient_method='fft', initialization='pca', verbose=1).fit(X)

...diverges!

===> Running optimization with exaggeration=12.00, lr=200.00 for 250 iterations...
Iteration   50, KL divergence 0.0446, 50 iterations in 0.6821 sec
Iteration  100, KL divergence 0.0292, 50 iterations in 0.7156 sec
Iteration  150, KL divergence 0.0227, 50 iterations in 0.7206 sec
Iteration  200, KL divergence 0.0240, 50 iterations in 0.7323 sec
Iteration  250, KL divergence 0.0195, 50 iterations in 0.6954 sec
   --> Time elapsed: 3.55 seconds
===> Running optimization with exaggeration=1.00, lr=200.00 for 500 iterations...
Iteration   50, KL divergence 0.0321, 50 iterations in 0.6926 sec
Iteration  100, KL divergence 0.0815, 50 iterations in 0.6904 sec
Iteration  150, KL divergence 0.1199, 50 iterations in 0.6936 sec
Iteration  200, KL divergence 0.1419, 50 iterations in 0.6955 sec
Iteration  250, KL divergence 8.1304, 50 iterations in 0.8760 sec
Iteration  300, KL divergence 13.9849, 50 iterations in 0.6924 sec
Iteration  350, KL divergence 14.7327, 50 iterations in 0.6963 sec
Iteration  400, KL divergence 15.2104, 50 iterations in 0.6898 sec
Iteration  450, KL divergence 15.4548, 50 iterations in 0.6824 sec
Iteration  500, KL divergence 15.5808, 50 iterations in 0.7088 sec

@dkobak
Contributor

dkobak commented Apr 20, 2021

Update: sklearn also shows negative KL values when using PCA initialization (but not with random initialization)!

Z = SklearnTSNE(verbose=2, init='pca').fit_transform(X)

results in

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 200 samples in 0.001s...
[t-SNE] Computed neighbors for 200 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 200 / 200
[t-SNE] Mean sigma: 0.000000
[t-SNE] Computed conditional probabilities in 0.034s
[t-SNE] Iteration 50: error = -59.8060608, gradient norm = 0.0029903 (50 iterations in 0.040s)
[t-SNE] Iteration 100: error = -75.7929688, gradient norm = 0.0015364 (50 iterations in 0.018s)
[t-SNE] Iteration 150: error = -85.3008423, gradient norm = 0.0010339 (50 iterations in 0.017s)
[t-SNE] Iteration 200: error = -92.0923309, gradient norm = 0.0007792 (50 iterations in 0.018s)
[t-SNE] Iteration 250: error = -97.3766937, gradient norm = 0.0006251 (50 iterations in 0.017s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: -97.376694
[t-SNE] Iteration 300: error = -10.6954212, gradient norm = 0.0005959 (50 iterations in 0.032s)
[t-SNE] Iteration 350: error = -10.9293900, gradient norm = 0.0005301 (50 iterations in 0.018s)
[t-SNE] Iteration 400: error = -11.2307148, gradient norm = 0.0004560 (50 iterations in 0.020s)
[t-SNE] Iteration 450: error = -11.5423441, gradient norm = 0.0003902 (50 iterations in 0.018s)
[t-SNE] Iteration 500: error = -11.8387394, gradient norm = 0.0003365 (50 iterations in 0.017s)
[t-SNE] Iteration 550: error = -12.1112652, gradient norm = 0.0002936 (50 iterations in 0.017s)
[t-SNE] Iteration 600: error = -12.3597937, gradient norm = 0.0002593 (50 iterations in 0.017s)
[t-SNE] Iteration 650: error = -12.5856562, gradient norm = 0.0002316 (50 iterations in 0.017s)
[t-SNE] Iteration 700: error = -12.7922802, gradient norm = 0.0002089 (50 iterations in 0.016s)
[t-SNE] Iteration 750: error = -12.9815378, gradient norm = 0.0001900 (50 iterations in 0.017s)
[t-SNE] Iteration 800: error = -13.1556988, gradient norm = 0.0001741 (50 iterations in 0.016s)
[t-SNE] Iteration 850: error = -13.3171625, gradient norm = 0.0001606 (50 iterations in 0.023s)
[t-SNE] Iteration 900: error = -13.4675283, gradient norm = 0.0001490 (50 iterations in 0.024s)
[t-SNE] Iteration 950: error = -13.6077213, gradient norm = 0.0001389 (50 iterations in 0.017s)
[t-SNE] Iteration 1000: error = -13.7389517, gradient norm = 0.0001301 (50 iterations in 0.027s)
[t-SNE] KL divergence after 1000 iterations: -13.738952

so it seems that sklearnTSNE and openTSNE have the same problem here.

@pavlin-policar
Owner

Hmm, this is all very strange, and I'll need to look into it some more. But I think that in my implementation of BH, one of the assumptions I made was that the embedding points would never actually overlap like that. So the BH one doesn't surprise me that much. And also, such things are pretty difficult to solve with BH, if memory serves. I am a bit surprised about FIt-SNE not working, since it doesn't make any assumptions about how the points are spaced out.
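The BH failure mode with coincident points can be illustrated with a toy sketch (hypothetical code, not openTSNE's actual tree): a quadtree split can never separate two identical points, so insertion keeps recursing until a depth cap is hit, or, without a cap, until memory runs out.

```python
def insert_depth(points, depth=0, max_depth=50):
    """Recursion depth reached while splitting `points` quadtree-style."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    # split on the midpoint of the bounding box, as a quadtree would
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx = (min(xs) + max(xs)) / 2
    my = (min(ys) + max(ys)) / 2
    quads = {}
    for p in points:
        quads.setdefault((p[0] > mx, p[1] > my), []).append(p)
    return max(insert_depth(q, depth + 1, max_depth) for q in quads.values())

print(insert_depth([(0.0, 0.0), (1.0, 1.0)]))  # distinct points separate quickly
print(insert_depth([(0.5, 0.5), (0.5, 0.5)]))  # identical points hit the depth cap
```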

Perhaps the negative KL values are the most puzzling of all.

I definitely consider this an edge case, and a very unusual usage. Adding just a little bit of noise to the distance matrix makes everything run smoothly.

@dkobak
Contributor

dkobak commented Apr 23, 2021

Perhaps the negative KL values are the most puzzling of all.

Aren't the KL values during BH optimization themselves computed via the BH approximation? If so, it's not surprising that they can end up being very wrong.
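One way to test this suspicion (a sketch; exact_kl is a hypothetical helper, not part of openTSNE): recompute the KL divergence exactly from the normalized affinity matrix P and the final embedding Z. Since P and Q are both proper distributions, the exact value is non-negative by Gibbs' inequality, so any negative reported value must come from the approximation.

```python
import numpy as np

def exact_kl(P, Z):
    """Exact KL(P || Q) for an embedding Z and normalized affinities P.

    Assumes P is a proper distribution over pairs (zero diagonal, sums to 1).
    """
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq)        # Student-t similarities
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()
    mask = P > 0                  # 0 * log(0) terms contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Comparing this exact value against the optimizer's reported one would show how far off the BH estimate is.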

And also, such things are pretty difficult to solve with BH, if memory serves.

I have no idea about this. But if it's really true, then a dirty workaround would be to add some tiny noise to the initialization (with a warning) if the repulsion method is BH and there are overlapping points... I'm just not sure what the best way is to detect overlapping points in the initialization.

I am a bit surprised about FIt-SNE not working, since it doesn't make any assumptions about how the points are spaced out.

In fact, we may do the same for FFT as well if we cannot otherwise find out what's going on there.

@dkobak
Contributor

dkobak commented Apr 26, 2021

I'm just not sure what's the best way to detect that there are overlapping points in the initialization.

Maybe something like

if initialization.shape[0] != np.unique(initialization.round(decimals=6), axis=0).shape[0]

can work to detect whether there are duplicates in the initialization array...

@dkobak
Contributor

dkobak commented May 5, 2021

Somewhat fleshing out this workaround:

import numpy as np

X = np.random.randn(1_000_000,2)
X[1000] = X[2000]

def jitter(X, precision=6):
    _, ind_unique = np.unique(X.round(decimals=precision), axis=0, return_index=True)
    ind_duplicates = np.setdiff1d(np.arange(X.shape[0]), ind_unique)
    if ind_duplicates.size > 0:
        print(f'Found {ind_duplicates.size} near-duplicate(s). Adding some jitter to them.')
        X[ind_duplicates] += np.random.randn(ind_duplicates.size, X.shape[1]) * 10**-precision
    return X

%time X = jitter(X)

works in under a second for n = 1 million:

Found 1 near-duplicate(s). Adding some jitter to them.
CPU times: user 872 ms, sys: 12 ms, total: 884 ms
Wall time: 884 ms

So my suggestion would be to run this on any initialization (explicitly provided by the user or computed using PCA) and print a warning when adding jitter.

@pavlin-policar
Owner

Thanks for this solution, I'm actually amazed it works this fast!

At first I was kind of on the fence about this, thinking: "How often would this actually happen?" A block distance matrix is something you'll rarely ever see in practice. However, duplicate entries aren't rare at all; even in the iris data set there are two identical rows. For iris this isn't a problem: the two identical data points get embedded into the exact same position, and that seems to work okay. The question then becomes: why doesn't it work when there are a lot of duplicate entries?

For iris, I tried converting a number of rows into duplicates and seeing where it failed. Interestingly enough, it failed once I used 30 duplicate points. This doesn't correlate with the perplexity: with 30 duplicate rows, increasing the perplexity to e.g. 70 still results in negative KL values. So it seems like the problem has always been there, but the KL values don't always reach negative territory, and things work as long as the number of duplicate entries isn't too big.

I'm fine with using your fix as a temporary solution. Clearly there is a bug somewhere that needs to be found, though I don't think this is something that would happen very often. If you want to open a PR with this temporary fix, I'll be happy to merge it.

@pavlin-policar pavlin-policar added the bug Something isn't working label May 9, 2021
@dkobak
Contributor

dkobak commented Jul 19, 2021

I'm fine with using your fix as a temporary solution. Clearly, there is a bug somewhere, and it needs to be found. However, this isn't something that would happen very often though I think. If you want to open a PR with this temporary fix, I'll be happy to merge it.

Just want to comment here briefly that I implemented this in a branch, but then observed some odd behaviour (some "near-duplicates" were detected in the initialization when there should not have been any) and haven't gotten around to investigating it further since.
