Why are there so many noise points (outliers)? #11

Ibrokhimsadikov · 2024-08-26T16:02:13Z

Thank you for sharing this great tool with the community. I would say it is UMAP+HDBSCAN on steroids!

Quick question, though: when I cluster 30k text embeddings, a lot of the texts end up grouped as outliers. I have tried changing parameters such as noise_level, base_min_cluster_size, and min_number_clusters, but about 10-15% of the population is still labeled as outliers. If I run UMAP+HDBSCAN manually, I get significantly fewer outliers.

lmcinnes · 2024-08-27T16:59:23Z

It is hard to say, as it is likely data dependent. EVoC effectively does UMAP to a higher-dimensional space (often around 15 dimensions). Since it isn't packing data together as tightly as a low-dimensional embedding would, you can end up with more outliers that are "between" other clusters. So that may be part of it.

Another possibility is that the actual loss function is a bit different, and this can result in less data being clustered. You can influence that via the noise_level parameter. Setting it to 0.0 should result in trying to cluster more data. Whether that is enough to remedy the issue is not clear, though.
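
One way to check whether noise_level is actually moving the outlier rate is to measure the noise fraction at several settings. The sketch below is a minimal illustration, not part of EVoC: the helper names `outlier_fraction` and `sweep_noise_level` are hypothetical, and it assumes noise points are labeled -1, as in HDBSCAN.

```python
from typing import Callable, Dict, Iterable, Sequence


def outlier_fraction(labels: Sequence[int]) -> float:
    """Return the fraction of points labeled -1 (assumed to be the noise label)."""
    if len(labels) == 0:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def sweep_noise_level(
    fit_predict: Callable[[Sequence, float], Iterable[int]],
    data: Sequence,
    noise_levels: Iterable[float],
) -> Dict[float, float]:
    """Run a user-supplied clustering callable once per noise level.

    `fit_predict(data, noise_level)` is a hypothetical wrapper that must
    return one cluster label per data point.
    """
    results = {}
    for level in noise_levels:
        labels = list(fit_predict(data, level))
        results[level] = outlier_fraction(labels)
    return results


# Hypothetical usage, assuming EVoC accepts noise_level in its constructor
# and exposes a scikit-learn style fit_predict:
#
#   from evoc import EVoC
#   fractions = sweep_noise_level(
#       lambda X, nl: EVoC(noise_level=nl).fit_predict(X),
#       embeddings,
#       [0.0, 0.25, 0.5],
#   )
```

If the fraction barely changes across the sweep, the outliers are probably a property of the data or the embedding rather than of this one parameter.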

Ibrokhimsadikov · 2024-08-27T17:15:22Z

Hi @lmcinnes, thank you for your reply. I tried both suggestions and experimented with different ranges for those parameters, but I am still getting a lot of outliers.

lmcinnes · 2024-08-27T20:52:43Z

I'm afraid I'm really not too sure then. Sorry.
