Why are there so many noise points (outliers)? #11

Open
Ibrokhimsadikov opened this issue Aug 26, 2024 · 3 comments
Comments

@Ibrokhimsadikov

Thank you for sharing this great tool with the community. I would say it is UMAP+HDBSCAN on steroids!

Quick question though: when I try to cluster 30k text embeddings, a lot of the texts are grouped as outliers. I have tried changing params like noise_level, base_min_cluster_size, and min_number_clusters, but about 10-15% of the population is still labeled as outliers. If I run UMAP+HDBSCAN manually, I get a significantly lower number of outliers.
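To make a comparison like the one described above concrete, it helps to compute the outlier share the same way for both pipelines. The following is a minimal sketch, assuming both clusterers mark noise points with the label -1 (the HDBSCAN convention); `summarize_labels` is a hypothetical helper name, not part of either library.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize a cluster label array, treating -1 as noise (assumed convention)."""
    labels = [int(label) for label in labels]
    counts = Counter(labels)
    # Remove the noise bucket so that only real clusters remain in `counts`.
    n_noise = counts.pop(-1, 0)
    return {
        "n_points": len(labels),
        "n_clusters": len(counts),
        "outlier_fraction": n_noise / len(labels) if labels else 0.0,
    }


# Toy example: one noise point out of four, two real clusters.
print(summarize_labels([-1, 0, 0, 1]))
```

Running this on the labels from each pipeline gives directly comparable numbers for the outlier fraction and the cluster count.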

@lmcinnes
Contributor

It is hard to say, as it is likely data dependent. EVoC effectively does UMAP to a higher-dimensional space (often around 15 dimensions). Since it isn't packing data together as tightly as it would in a low-dimensional space, you can end up with more outliers that are "between" other clusters. So that may be part of it.

Another possibility is that the actual loss function is a bit different, and this can result in less data being clustered. You can influence that via the noise_level parameter. Setting it to 0.0 should result in trying to cluster more data. Whether that is enough to remedy the issue is not clear though.
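One way to check how much noise_level matters for a given dataset is to sweep a few values and record the share of points left unclustered. This is only a sketch under assumptions: it assumes the clusterer exposes a scikit-learn style `fit_predict` and labels noise as -1, and that `EVoC` accepts `noise_level` as a constructor keyword (the thread names the parameter but does not show the call). `sweep_noise_level` and `make_evoc` are hypothetical helper names.

```python
from typing import Callable, Dict, Iterable


def sweep_noise_level(
    make_clusterer: Callable[[float], object],
    data,
    noise_levels: Iterable[float] = (0.0, 0.25, 0.5),
) -> Dict[float, float]:
    """Map each noise_level to the fraction of points labeled -1 (assumed noise label)."""
    results = {}
    for level in noise_levels:
        # Build a fresh clusterer per setting so runs do not share state.
        labels = list(make_clusterer(level).fit_predict(data))
        n_noise = sum(1 for label in labels if label == -1)
        results[level] = n_noise / len(labels) if labels else 0.0
    return results


def make_evoc(noise_level: float):
    # Assumption: the `evoc` package is installed and EVoC takes a
    # `noise_level` keyword. Imported lazily so the sweep helper itself
    # has no hard dependency on evoc.
    from evoc import EVoC

    return EVoC(noise_level=noise_level)


# Hypothetical usage, with `embeddings` being an (n_samples, n_dims) array:
# print(sweep_noise_level(make_evoc, embeddings))
```

If the outlier fraction barely moves across the sweep, that would suggest the extra noise comes from the data's structure in the higher-dimensional embedding rather than from the noise_level setting.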

@Ibrokhimsadikov
Author

Hi @lmcinnes, thank you for your reply. I tried both suggestions and played with different ranges for those params, but it is still giving me a lot of outliers.

@lmcinnes
Contributor

I'm afraid I'm really not too sure then. Sorry.

2 participants