[BUG] AgglomerativeClustering: Duplicate data samples cause incorrect cluster assignments #3801

Closed
rsshah1993 opened this issue Apr 28, 2021 · 1 comment · Fixed by #3824
Labels
bug Something isn't working

Comments

@rsshah1993

Describe the bug
AgglomerativeClustering returns unexpected cluster assignments when two of the input samples are identical.

Steps/Code to reproduce bug

import numpy as np
from cuml import AgglomerativeClustering

features = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
AgglomerativeClustering(n_clusters=2).fit_predict(features)
# Output: array([0, 1, 0], dtype=int32)

Expected behavior
The first two samples are identical and should share a label, so the expected output is [0, 0, 1] or [1, 1, 0].
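
Because cluster label values are arbitrary, a check of the expected behavior is easier to write as an invariant than as an exact array. The following is a minimal sketch in plain Python; the helper name `duplicates_share_label` is hypothetical and not part of cuML. It only verifies that identical samples receive the same label, not that distinct samples are separated.

```python
from collections import defaultdict


def duplicates_share_label(samples, labels):
    """Return True if every group of identical samples has a single label."""
    labels_by_sample = defaultdict(set)
    for row, label in zip(samples, labels):
        # Tuples are hashable, so identical rows fall into the same group.
        labels_by_sample[tuple(row)].add(int(label))
    return all(len(group) == 1 for group in labels_by_sample.values())


features = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]

print(duplicates_share_label(features, [0, 1, 0]))  # reported output -> False
print(duplicates_share_label(features, [0, 0, 1]))  # expected output -> True
print(duplicates_share_label(features, [1, 1, 0]))  # expected output -> True
```

The same helper should accept the NumPy arrays from the reproducer, since it only iterates over rows and labels.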

Environment details (please complete the following information):

  • Environment location: Docker
  • Linux Distro/Architecture: Ubuntu 18.04
  • GPU Model/Driver: V100/455.32.00
  • CUDA: 11.1
  • Method of cuDF & cuML install: conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults rapids=0.19 python=3.7 cudatoolkit=10.2 as part of a docker build.
rsshah1993 added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Apr 28, 2021
@cjnolet
Member

cjnolet commented Apr 28, 2021

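I believe I know the cause of this bug. We assume that a distance of exactly 0 in the pairwise distance matrix can only occur on the diagonal (a self-loop), so we set every zero distance to the max in order for the MST to converge to the correct solution. With duplicate data samples, the off-diagonal zero between the duplicates gets set to the max as well, so they end up looking far apart. A reasonable fix for this case would be to do this only for the diagonal elements, which would support duplicate data samples.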

Here's a small example of getting the correct solution by making the first two data samples slightly different.

>>> features = np.array([[0.0, 0.0, 0.001], [0.0, 0.0, 0.002], [2.0, 2.0, 2.0]])
>>> AgglomerativeClustering(n_clusters=2).fit_predict(features)
Label prop iterations: 3
Iterations: 1
2068,40,24,6,66,134
n_edges: 2
Finished dendrogram
array([0, 0, 1], dtype=int32)
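
The explanation above can be illustrated with a toy single-linkage clustering in pure Python. This is a minimal sketch, not the cuML or RAFT implementation: the function `single_linkage_labels` and its `mask_all_zeros` flag are hypothetical names, the MST is built with Kruskal's algorithm over a dense edge list, and ties are broken by index order, so which sample gets split off in the buggy variant may differ in the real library.

```python
import math
from itertools import combinations


def single_linkage_labels(points, n_clusters, mask_all_zeros):
    """Toy single-linkage clustering via Kruskal's algorithm (illustration only)."""
    n = len(points)
    edges = []
    for i, j in combinations(range(n), 2):  # off-diagonal pairs only (i < j)
        d = math.dist(points[i], points[j])
        # Assumed buggy behavior: every zero distance is treated as a
        # self-loop and pushed to the maximum, even off the diagonal.
        if mask_all_zeros and d == 0.0:
            d = math.inf
        edges.append((d, i, j))
    edges.sort()  # ascending distance, ties broken by sample indices

    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add the shortest edges until only n_clusters components remain.
    components = n
    for _, i, j in edges:
        if components == n_clusters:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            components -= 1

    # Map each component root to a compact label in order of first appearance.
    label_of_root = {}
    return [label_of_root.setdefault(find(i), len(label_of_root)) for i in range(n)]


features = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]

print(single_linkage_labels(features, 2, mask_all_zeros=True))   # [0, 1, 0]
print(single_linkage_labels(features, 2, mask_all_zeros=False))  # [0, 0, 1]
```

When every zero is masked, the edge between the two duplicates becomes the longest edge in the graph, so one duplicate is merged with the distant sample first. Keeping off-diagonal zeros lets the duplicates merge at distance 0, as expected.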

cjnolet removed the "? - Needs Triage" (Need team to review and classify) label on Apr 28, 2021
cjnolet changed the title from "[BUG] Unexpected results for AgglomerativeClustering" to "[BUG] AgglomerativeClustering: Duplicate data samples cause incorrect cluster assignments" on Apr 28, 2021
rapids-bot bot pushed a commit that referenced this issue May 20, 2021
…istances from self-loops (#3824)

Closes #3801 
Closes #3802 

Corresponding RAFT PR: rapidsai/raft#217

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #3824
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023