Auto early exaggeration to prevent collapse for small datasets #234
Conversation
I can confirm that the implemented test fails on the current master branch, and passes with the suggested fix.
It seems to me that since this only happens for BH and not FFT, there must be an issue with the BH implementation, and this is more of a band-aid than an actual fix to the underlying problem. But the problem does seem serious enough to warrant a band-aid, since running BH on datasets with N<10k is our default. However, if I understand correctly, this only happens when the dataset doesn't have any real structure, so it shouldn't be too common. Either way, is this ready to merge?
Yes, there is definitely some issue with the BH implementation which is triggered when embedding points get too close to each other or overlap exactly (either in the initialization or during the optimization). I have no idea how to fix that, and haven't even looked into it. At the same time, early exaggeration should not really lead to points collapsing on each other, and it seems that using a smaller early exaggeration value for small datasets prevents that. So my thinking is, it's a good idea anyway -- irrespective of the issues with BH. Of course using exag 4 for N<10k and 12 for N>10k seems a bit arbitrary, but then any of those values is a bit arbitrary, and it's all just heuristics. What I like about this specific suggestion is that both values (4 and 12) are taken from the literature.
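For reference, a minimal sketch of the heuristic being discussed (the function name, signature, and threshold parameter are mine, not the PR's actual code):

```python
def auto_early_exaggeration(n_samples, threshold=10_000):
    """Pick early exaggeration from dataset size: 4 (original t-SNE paper)
    for small datasets, 12 (the current default) for large ones."""
    return 4 if n_samples < threshold else 12
```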
Yes, I think triggering this bug is actually difficult. The current master probably works fine for any data with some structure in it. I would say it's ready to merge... But maybe we should first try it out on some real datasets, just to stress-test it a bit. I mean, comparing early exag 12 vs 4 on some real datasets, maybe Iris, maybe some small RNAseq data with <10k points, maybe MNIST subsets. I could try to run some tests.
I ran some checks using the following code:

```python
import matplotlib.pyplot as plt
from openTSNE import TSNE

def compare(X, exaggerations=[4, 12]):
    fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(8, 6), layout='constrained')

    def savekl(it, kldiv, emb):
        # Callback that stores the most recent KL divergence.
        global kl
        kl = kldiv

    for num, ex in enumerate(exaggerations):
        # Run only the early exaggeration phase (n_iter=0).
        Z = TSNE(early_exaggeration=ex, n_iter=0, callbacks=savekl).fit(X['data'])
        axs[0, num].scatter(*Z.T, c=X['target'], s=2)
        axs[0, num].set_title(f'early_exaggeration={ex}, n_iter=0')

        # Then run the standard optimization phase.
        Z = Z.optimize(n_iter=500)
        axs[1, num].scatter(*Z.T, c=X['target'], s=2)
        axs[1, num].set_title(f'n_iter=500')
        axs[1, num].text(.99, .99, f'KL: {kl:.4f}', transform=axs[1, num].transAxes,
                         ha='right', va='top')

    return fig
```

**sklearn.datasets.load_iris**
No difference. Both ex=4 and ex=12 lead to collapse of individual clusters, but this does not affect downstream results.

**sklearn.datasets.load_digits**
No difference, no collapse.

**sklearn.datasets.load_breast_cancer (standardized)**
No difference for the final result, but ex=12 leads to collapse to a 1-dimensional structure, whereas ex=4 does not.

**sklearn.datasets.load_wine (standardized)**
Both values lead to collapse to a 1-dimensional structure; the final result is not affected.

**np.random.randn(100, 10)**
Reproduces issue #233. Strong collapse using ex=12, final results strongly affected due to some problem with the BH optimization.

**Random data with some structure**

```python
X = {}
X['data'] = np.random.randn(100, 10)
X['data'][50:, 0] += 3
X['target'] = np.zeros(100)
X['target'][50:] += 1
```

The same problem as above despite some structure in the data.

My conclusion is that the PR makes sense.
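For anyone wanting to reproduce these checks, here is a hypothetical way to call the helper above (the standardization step and output filename are my additions, not from the original comment):

```python
import sklearn.datasets
from sklearn.preprocessing import StandardScaler

# Load one of the small datasets listed above and standardize its features.
wine = sklearn.datasets.load_wine()
wine['data'] = StandardScaler().fit_transform(wine['data'])

# Compare early exaggeration 4 vs 12 and save the resulting figure.
fig = compare(wine)
fig.savefig('wine_ee4_vs_ee12.png')
```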
Wonderful! Thanks for looking into this. Although, I can't say I'm altogether convinced. I've always considered the rationale for the early-exaggeration phase to be: "it's hard to untangle the data if we use no exaggeration; everything starts off mixed up, and clusters fall apart. Instead, let's squish neighboring points close together and untangle the clusters from one another. Then we can run with no exaggeration to decompress the clusters." So conceptually, having clusters collapse into a small ball during EE seems completely fine, as long as they are "decompressed" during the standard phase. If EE=4 did this for all the small data sets, that would be fine, but it seems that it doesn't.

I am also a bit reluctant to make this change because it will lead to different embeddings for small data sets than what we produce now. Yes, the embeddings will show the same things, but they won't be identical, and there will inevitably be questions from users as to what's going on. There doesn't really seem to be anything wrong with using EE=12 everywhere, other than that we trigger the BH bug on some rare, synthetic datasets; EE=4 just avoids triggering that particular bug. I would much rather fix the BH bug and keep things as they are, to avoid unnecessary downstream changes in the embeddings, so I'd rather have a crack at fixing the BH implementation first, to see if we can't fix the real bug there. If we can't figure it out, we'll merge this PR.
I agree. If the bug with the BH implementation is fixed, then there is no need for this PR. I was thinking that EE=4 is objectively better for small datasets than EE=12, but after looking at these examples I see that sometimes it's better, sometimes not, and sometimes both values are clearly not ideal. Ideally one would have a more sophisticated heuristic for setting the EE, but we don't have one, so we should rather keep the current EE=12 even if in some cases it's not great. However, I don't think I will have time to look into the BH optimization any time soon.
I think I've found the bug. All I had to do was change the threshold for duplicate detection in the BH implementation, and the problem seems to go away. @dkobak could you please confirm this fixes the issue on your end too? Note that you'll have to recompile the binaries with
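Purely to illustrate the idea being discussed (the actual openTSNE Barnes-Hut code is written in Cython and may differ; the epsilon value below is made up): "duplicate detection" here means treating two embedding points as the same point when all their coordinates are within some small threshold, so the quadtree merges them instead of trying to subdivide them apart indefinitely.

```python
import numpy as np

def is_duplicate(point_a, point_b, eps=1e-6):
    """Treat two embedding points as duplicates if every coordinate differs
    by less than eps; a larger eps makes near-overlapping points merge
    rather than triggering ever-deeper tree splits."""
    return bool(np.all(np.abs(np.asarray(point_a) - np.asarray(point_b)) < eps))
```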
Oh this is really cool! If
I did end up changing the test a bit. Your version checked for collapse after the EE phase, but mine checks for collapse only after the entire optimization process. It seems to me that it doesn't really matter if the embedding becomes really small at some point, as long as the final embedding is fine.
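A rough sketch of what such a test could look like (the function name, seed, and collapse threshold are mine; the PR's actual test may differ): embed a small, structure-less dataset with the BH method, run the full optimization, and only then check that the embedding has not shrunk to a tiny region.

```python
import numpy as np
from openTSNE import TSNE

def test_small_structureless_data_does_not_collapse():
    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)                   # small dataset with no real structure
    Z = TSNE(negative_gradient_method="bh", random_state=0).fit(X)
    span = Z.max(axis=0) - Z.min(axis=0)     # extent of the final embedding
    assert np.all(span > 1), "embedding collapsed to a tiny region"
```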
Yes, I agree.
Should fix #233 (in a way). The idea is to set the default early exaggeration to 4 for small-ish (n<10k) datasets, following the original t-SNE paper. I am adding a test that checks whether early exaggeration leads to collapse on small, structure-less datasets.
(Am going to push only the test at first, to see if it fails on the current master branch.)