Thank you for sharing this with the community. It is a great tool, and I would say it is UMAP+HDBSCAN on steroids!
Quick question though: when I try to cluster 30k text embeddings, a lot of the texts are being grouped as outliers. I have tried changing params like noise_level, base_min_cluster_size and min_number_clusters, but about 10-15% of the population is still labelled as outliers. If I run UMAP+HDBSCAN manually I get significantly fewer outliers.
It is hard to say, as it is likely data dependent. EVoC effectively does UMAP to a higher-dimensional space (often around 15 dimensions). Since it isn't packing data together as tightly as it would in a low-dimensional space, you can end up with more outliers that are "between" other clusters. So that may be part of it.
Another possibility is that the actual loss function is a bit different, and this can result in less data being clustered. You can influence that via the noise_level parameter. Setting it to 0.0 should result in trying to cluster more data. Whether that is enough to remedy the issue is not clear though.
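To compare runs at different noise_level settings, it can help to measure the outlier share directly. The following is a minimal sketch, not part of EVoC: the helper name `outlier_fraction` and the sample label lists are hypothetical, and it assumes noise points are labelled -1, as in HDBSCAN.

```python
def outlier_fraction(labels, noise_label=-1):
    """Return the share of points carrying the noise label (assumed to be -1)."""
    labels = list(labels)
    if not labels:
        return 0.0
    n_noise = sum(1 for label in labels if label == noise_label)
    return n_noise / len(labels)


# Hypothetical label vectors standing in for two clustering runs.
labels_default = [0, 0, -1, 1, 1, -1, 2, -1]
labels_low_noise = [0, 0, 0, 1, 1, -1, 2, 2]

print(f"default:   {outlier_fraction(labels_default):.1%}")
print(f"low noise: {outlier_fraction(labels_low_noise):.1%}")
```

In practice the label vectors would come from the clusterer itself, for example something along the lines of `EVoC(noise_level=0.0).fit_predict(embeddings)`, assuming noise_level is passed at construction time. Running the same helper on the UMAP+HDBSCAN labels gives a like-for-like comparison.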
Hi @lmcinnes, thank you for your reply. I tried both suggestions and played with different ranges for those params, but it is still giving me a lot of outliers.