[BUG] AgglomerativeClustering: Duplicate data samples cause incorrect cluster assignments #3801

rsshah1993 · 2021-04-28T13:13:09Z

Describe the bug
AgglomerativeClustering returns unexpected cluster assignments when the input contains duplicate samples: two identical points are placed in different clusters.

Steps/Code to reproduce bug

import numpy as np
from cuml import AgglomerativeClustering

features = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
AgglomerativeClustering(n_clusters=2).fit_predict(features)
# Actual output: array([0, 1, 0], dtype=int32)

Expected behavior
The expected output is [0, 0, 1] or [1, 1, 0], with the two identical samples assigned to the same cluster.
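Since either labeling is acceptable, a check for this bug should compare partitions rather than raw label values. A minimal sketch of such a check follows; `same_partition` is a hypothetical helper written for illustration, not part of cuML.

```python
def same_partition(labels_a, labels_b):
    """Return True if two label sequences describe the same clustering,
    ignoring how the cluster ids happen to be numbered."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Every id in labels_a must map to exactly one id in labels_b,
        # and vice versa (a one-to-one relabeling).
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


expected = [0, 0, 1]
print(same_partition([1, 1, 0], expected))  # relabeled but equivalent
print(same_partition([0, 1, 0], expected))  # the output reported above
```

The first call treats [1, 1, 0] as equivalent to the expected result, while the second flags the reported [0, 1, 0] as a different partition.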

Environment details:

Environment location: Docker
Linux Distro/Architecture: Ubuntu 18.04
GPU Model/Driver: V100/455.32.00
CUDA: 11.1
Method of cuDF & cuML install: conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults rapids=0.19 python=3.7 cudatoolkit=10.2 as part of a docker build.


cjnolet · 2021-04-28T15:22:54Z

I believe I know the cause of this bug. We assume that any distance of exactly 0 in the pairwise distance matrix lies on the diagonal (a self-loop), so we set every zero distance to the max so that the MST converges to the correct solution. With duplicate data samples, the off-diagonal distance between the duplicates is also 0 and gets set to the max as well. A reasonable fix for this case would be to do this only for the diagonal elements, which would support duplicate data samples.
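To make the suspected cause concrete, here is a rough CPU sketch in NumPy. It is not the cuML/RAFT implementation: the dense distance matrix, the Prim-style MST, and all function names (`mask_distances`, `mst_edges`, `single_linkage_labels`) are assumptions made for illustration. It only compares the two masking strategies on the data from the report.

```python
import numpy as np


def pairwise_euclidean(X):
    # Dense n x n Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def mask_distances(D, diagonal_only):
    # Replace "self-loop" distances with infinity so the MST ignores them.
    D = D.copy()
    if diagonal_only:
        np.fill_diagonal(D, np.inf)   # proposed fix: mask only i == j
    else:
        D[D == 0.0] = np.inf          # suspected bug: also masks duplicates
    return D


def mst_edges(W):
    # Prim's algorithm on a dense weight matrix; returns (weight, i, j) tuples.
    n = W.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = W[0].copy()                # cheapest known edge into each vertex
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((float(best[j]), int(parent[j]), j))
        in_tree[j] = True
        closer = W[j] < best
        best = np.where(closer, W[j], best)
        parent = np.where(closer, j, parent)
    return edges


def single_linkage_labels(X, n_clusters, diagonal_only):
    # Single linkage: build the MST, drop the heaviest n_clusters - 1 edges,
    # and label the connected components that remain.
    W = mask_distances(pairwise_euclidean(X), diagonal_only)
    edges = sorted(mst_edges(W))
    keep = edges[: len(edges) - (n_clusters - 1)]
    root = list(range(len(X)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in keep:
        root[find(i)] = find(j)
    relabel = {}
    return np.array([relabel.setdefault(find(i), len(relabel)) for i in range(len(X))])


features = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
print(single_linkage_labels(features, 2, diagonal_only=False))  # duplicates split
print(single_linkage_labels(features, 2, diagonal_only=True))   # duplicates together
```

When every zero is masked, the edge between the two duplicates is never eligible for the MST, so both duplicates attach to the third point and cutting either edge separates them. Masking only the diagonal keeps the zero-weight edge and the duplicates stay together.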

Here's a small example of getting the correct solution by making the first two data samples slightly different.

>>> features = np.array([[0.0, 0.0, 0.001], [0.0, 0.0, 0.002], [2.0, 2.0, 2.0]])
>>> AgglomerativeClustering(n_clusters=2).fit_predict(features)
Label prop iterations: 3
Iterations: 1
2068,40,24,6,66,134
n_edges: 2
Finished dendrogram
array([0, 0, 1], dtype=int32)

…istances from self-loops (#3824)
Closes #3801
Closes #3802
Corresponding RAFT PR: rapidsai/raft#217
Authors:
- Corey J. Nolet (https://github.com/cjnolet)
Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)
URL: #3824


rsshah1993 added the "? - Needs Triage" and "bug" labels Apr 28, 2021

cjnolet removed the "? - Needs Triage" label Apr 28, 2021

cjnolet changed the title ~~[BUG] Unexpected results for AgglomerativeClustering~~ [BUG] AgglomerativeClustering: Duplicate data samples cause incorrect cluster assignments Apr 28, 2021

cjnolet mentioned this issue May 4, 2021

[REVIEW] AgglomerativeClustering support single cluster and ignore only zero distances from self-loops #3824

Merged

rapids-bot bot closed this as completed in #3824 May 20, 2021
