Auto early exaggeration to prevent collapse for small datasets #234
Conversation
I can confirm that the implemented test fails on the current master branch, and passes with the suggested fix.
It seems to me that since this only happens for BH and not FFT, there must be an issue with the BH implementation, and this is more of a band-aid than an actual fix to the underlying problem. But the problem does seem serious enough to warrant a band-aid, since running BH on datasets with N<10k is our default. However, if I understand correctly, this only happens when the dataset doesn't have any real structure, so it shouldn't be too common. Either way, is this ready to merge?
Yes, there is definitely some issue with the BH implementation which is triggered when embedding points get too close to each other or overlap exactly (either in the initialization or during the optimization). I have no idea how to fix that, and haven't even looked into it. At the same time, early exaggeration should not really lead to points collapsing on each other, and it seems that using a smaller early exaggeration value for small datasets prevents that. So my thinking is, it's a good idea anyway -- irrespective of the issues with BH. Of course using exag 4 for N<10k and 12 for N>10k seems a bit arbitrary, but then any of those values is a bit arbitrary, and it's all just heuristics. What I like about this specific suggestion is that both values (4 and 12) are taken from the literature.
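For reference, a minimal sketch of the heuristic being discussed (the function name, signature, and threshold parameter are mine, not the PR's actual code):

```python
def auto_early_exaggeration(n_samples, threshold=10_000):
    """Pick early exaggeration from dataset size: 4 (original t-SNE paper)
    for small datasets, 12 (the current default) for large ones."""
    return 4 if n_samples < threshold else 12
```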
Yes, I think triggering this bug is actually difficult. The current master probably works fine for any data with some structure in it. I would say it's ready to merge... But maybe we should first try it out on some real datasets, just to stress-test it a bit. I mean, comparing early exag 12 vs 4 on some real datasets, maybe Iris, maybe some small RNAseq data with <10k points, maybe MNIST subsets. I could try to run some tests.
I ran some checks using the following code:

```python
import matplotlib.pyplot as plt
from openTSNE import TSNE

def compare(X, exaggerations=[4, 12]):
    fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(8, 6), layout='constrained')

    def savekl(it, kldiv, emb):
        # Callback that stores the most recent KL divergence.
        global kl
        kl = kldiv

    for num, ex in enumerate(exaggerations):
        # Run only the early exaggeration phase (n_iter=0).
        Z = TSNE(early_exaggeration=ex, n_iter=0, callbacks=savekl).fit(X['data'])
        axs[0, num].scatter(*Z.T, c=X['target'], s=2)
        axs[0, num].set_title(f'early_exaggeration={ex}, n_iter=0')

        # Then run the standard optimization phase.
        Z = Z.optimize(n_iter=500)
        axs[1, num].scatter(*Z.T, c=X['target'], s=2)
        axs[1, num].set_title(f'n_iter=500')
        axs[1, num].text(.99, .99, f'KL: {kl:.4f}', transform=axs[1, num].transAxes,
                         ha='right', va='top')

    return fig
```

**sklearn.datasets.load_iris**
No difference. Both ex=4 and ex=12 lead to collapse of individual clusters, but this does not affect downstream results.

**sklearn.datasets.load_digits**
No difference, no collapse.

**sklearn.datasets.load_breast_cancer (standardized)**
No difference for the final result, but ex=12 leads to collapse to a 1-dimensional structure, whereas ex=4 does not.

**sklearn.datasets.load_wine (standardized)**
Both values lead to collapse to a 1-dimensional structure; the final result is not affected.

**np.random.randn(100, 10)**
Reproduces issue #233. Strong collapse using ex=12, final results strongly affected due to some problem with the BH optimization.

**Random data with some structure**

```python
X = {}
X['data'] = np.random.randn(100, 10)
X['data'][50:, 0] += 3
X['target'] = np.zeros(100)
X['target'][50:] += 1
```

The same problem as above despite some structure in the data.

My conclusion is that the PR makes sense.
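For anyone wanting to reproduce these checks, here is a hypothetical way to call the helper above (the standardization step and output filename are my additions, not from the original comment):

```python
import sklearn.datasets
from sklearn.preprocessing import StandardScaler

# Load one of the small datasets listed above and standardize its features.
wine = sklearn.datasets.load_wine()
wine['data'] = StandardScaler().fit_transform(wine['data'])

# Compare early exaggeration 4 vs 12 and save the resulting figure.
fig = compare(wine)
fig.savefig('wine_ee4_vs_ee12.png')
```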
Wonderful! Thanks for looking into this. Although, I can't say I'm altogether convinced. I've always considered the rationale for the early-exaggeration phase to be: "it's hard to untangle the data if we use no exaggeration; everything starts off mixed up, and clusters fall apart. Instead, let's squish neighboring points close together and untangle the clusters from one another. Then we can run with no exaggeration to decompress the clusters." So conceptually, having clusters collapse into a small ball during EE seems completely fine, as long as they are "decompressed" during the standard phase. If EE=4 did this for all the small data sets, that would be fine, but it seems that it doesn't.

I am also a bit reluctant to make this change because it will lead to different embeddings for small data sets than what we produce now. Yes, the embeddings will show the same things, but they won't be identical, and there will inevitably be questions from users as to what's going on. There doesn't really seem to be anything wrong with using EE=12 everywhere, other than that we trigger the BH bug on some rare, synthetic datasets; EE=4 just avoids triggering that particular bug. I would much rather fix the BH bug and keep things as they are, to avoid unnecessary downstream changes in the embeddings, so I'd rather have a crack at fixing the BH implementation first, to see if we can't fix the real bug there. If we can't figure it out, we'll merge this PR.
I agree. If the bug with the BH implementation is fixed, then there is no need for this PR. I was thinking that EE=4 is objectively better for small datasets than EE=12, but after looking at these examples I see that sometimes it's better, sometimes not, and sometimes both values are clearly not ideal. Ideally one would have a more sophisticated heuristic for setting the EE, but we don't have one, so we should rather keep the current EE=12 even if in some cases it's not great. However, I don't think I will have time to look into the BH optimization any time soon.
I think I've found the bug. All I had to do was change the threshold for duplicate detection in the BH implementation, and the problem seems to go away. @dkobak could you please confirm this fixes the issue on your end too? Note that you'll have to recompile the binaries with
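Purely to illustrate the idea being discussed (the actual openTSNE Barnes-Hut code is written in Cython and may differ; the epsilon value below is made up): "duplicate detection" here means treating two embedding points as the same point when all their coordinates are within some small threshold, so the quadtree merges them instead of trying to subdivide them apart indefinitely.

```python
import numpy as np

def is_duplicate(point_a, point_b, eps=1e-6):
    """Treat two embedding points as duplicates if every coordinate differs
    by less than eps; a larger eps makes near-overlapping points merge
    rather than triggering ever-deeper tree splits."""
    return bool(np.all(np.abs(np.asarray(point_a) - np.asarray(point_b)) < eps))
```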
Oh this is really cool! If
I did end up changing the test a bit. Your version checked for collapse after the EE phase, but mine checks for collapse only after the entire optimization process. It seems to me that it doesn't really matter if the embedding becomes really small at some point, as long as the final embedding is fine.
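A rough sketch of what such a test could look like (the function name, seed, and collapse threshold are mine; the PR's actual test may differ): embed a small, structure-less dataset with the BH method, run the full optimization, and only then check that the embedding has not shrunk to a tiny region.

```python
import numpy as np
from openTSNE import TSNE

def test_small_structureless_data_does_not_collapse():
    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)                   # small dataset with no real structure
    Z = TSNE(negative_gradient_method="bh", random_state=0).fit(X)
    span = Z.max(axis=0) - Z.min(axis=0)     # extent of the final embedding
    assert np.all(span > 1), "embedding collapsed to a tiny region"
```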
Yes, I agree.
Should fix #233 (in a way). The idea is to set the default early exaggeration to 4 for small-ish (n<10k) datasets, following the original t-SNE paper. I am adding a test that checks whether early exaggeration leads to collapse on small, structure-less datasets.
(Am going to push only the test at first, to see if it fails on the current master branch.)