[REVIEW] Single-Linkage Hierarchical Clustering Python Wrapper #3631

Merged
merged 118 commits into from
Mar 30, 2021
Changes from 110 commits
43a8118
Checking in
cjnolet Dec 15, 2020
335e1f9
Getting MST to return results
cjnolet Dec 15, 2020
8dc58ba
Still trying to figure out why MST isn't returning expected results
cjnolet Dec 15, 2020
cef178a
Adding symmetrization to linkage
cjnolet Dec 16, 2020
1f00cbf
Fixing style
cjnolet Dec 16, 2020
209733b
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Dec 16, 2020
957192f
Test is executing end-to-end, need to verify results
cjnolet Dec 17, 2020
20e32e8
Adding new symmetrizaiton
cjnolet Dec 17, 2020
107b34d
Checking in
cjnolet Dec 25, 2020
5dd72ba
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Jan 11, 2021
3fe27c7
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Jan 11, 2021
48fbd1a
Adding final cluster extraction
cjnolet Jan 12, 2021
98dbf47
Fixing style
cjnolet Jan 12, 2021
567ab06
Fixing symmetrizatio bug
cjnolet Jan 12, 2021
7635b25
Output matches sklearn
cjnolet Jan 12, 2021
f3f45eb
Updating include check for test
cjnolet Jan 12, 2021
02125d6
Fixing style
cjnolet Jan 12, 2021
c8f1a58
Checking in
cjnolet Jan 12, 2021
20f57a4
Cleaning up logging
cjnolet Jan 12, 2021
3b08d79
Updating raft commit
cjnolet Jan 13, 2021
3e95e4b
Fixing style
cjnolet Jan 13, 2021
9525b80
Fixing style
cjnolet Jan 13, 2021
33b1948
Cleaning up log statements & fixing small bug in inherit_labels
cjnolet Jan 13, 2021
2ad0580
Adding test for outliers
cjnolet Jan 13, 2021
067ca4b
Merge branch 'branch-0.18' into fea-018-hdbscan
cjnolet Jan 14, 2021
c69d965
Updating test cmakelists
cjnolet Jan 14, 2021
7f78358
Adding benchmark for linkage
cjnolet Jan 14, 2021
e608d2b
Changes from benchmarking
cjnolet Jan 15, 2021
0a9cf0d
Updating changes
cjnolet Jan 22, 2021
f012f35
Checking in
cjnolet Jan 27, 2021
5f12acf
Fixing c++ style
cjnolet Feb 8, 2021
e63ad1d
Updating raft hash
cjnolet Feb 8, 2021
35d8a23
Removing sparse prims since they've been moved to raft
cjnolet Feb 9, 2021
5636caa
Merge branch 'branch-0.18' into imp-019-remove_sparse_prims
cjnolet Feb 9, 2021
31f1afc
Updating copyrights
cjnolet Feb 9, 2021
d564696
Updating raft hash
cjnolet Feb 10, 2021
2dfb89d
Merge branch 'branch-0.18' into imp-019-remove_sparse_prims
cjnolet Feb 10, 2021
5ca044b
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 10, 2021
cae115a
Setting libcumprims to 0.18 for now
cjnolet Feb 10, 2021
9492296
Getting a start on connected knn graph construction
cjnolet Feb 10, 2021
a27ce76
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 11, 2021
46af3a6
Making progress on fix connectivities
cjnolet Feb 11, 2021
4e58700
Making progress
cjnolet Feb 11, 2021
cefe5e8
gettting there
cjnolet Feb 12, 2021
7f2ce4e
Making progress on connectivity fixing
cjnolet Feb 16, 2021
ba24734
Checking in
cjnolet Feb 18, 2021
367b3c4
Debugging knn graph impl
cjnolet Feb 18, 2021
1de0f5b
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 18, 2021
54d1288
Very close.
cjnolet Feb 19, 2021
433cdea
knn graph connection algorithm runs end to end.
cjnolet Feb 20, 2021
560048b
Fixing style
cjnolet Feb 20, 2021
f51a078
Style update
cjnolet Feb 20, 2021
27eab25
Removing HDBSCAN to isolate changeset to SLHC
cjnolet Feb 22, 2021
87f4a87
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 22, 2021
1f4b90d
Merge branch 'imp-019-remove_sparse_prims' into fea-019-slhc
cjnolet Feb 22, 2021
d7e2d31
Fixing style
cjnolet Feb 22, 2021
903e85f
updating import
cjnolet Feb 22, 2021
c16012e
Fixies
cjnolet Feb 23, 2021
105424a
Using fused l2 nn from raft
cjnolet Feb 24, 2021
0c8663b
Fixing style
cjnolet Feb 24, 2021
4e37c30
Updating copyright
cjnolet Feb 24, 2021
0f158a8
Fixing style
cjnolet Feb 24, 2021
9f34bbd
Updating copyright years
cjnolet Feb 24, 2021
309ef27
Merge branch 'branch-0.19' into imp-019-remove_sparse_prims
cjnolet Feb 24, 2021
8a081d2
Merge branch 'imp-019-remove_sparse_prims' into imp-019-use_raft_fuse…
cjnolet Feb 24, 2021
4bf102b
Merge branch 'imp-019-use_raft_fused_l2_nn' into fea-019-slhc
cjnolet Feb 24, 2021
70148d5
Updating raft hash so ci will build
cjnolet Feb 24, 2021
e546579
Using raft hash to make CI build
cjnolet Feb 24, 2021
538891c
Moving cumlprims conda recipe back to minor_version
cjnolet Feb 24, 2021
3b43aad
Merge branch 'branch-0.19' into imp-019-use_raft_fused_l2_nn
cjnolet Mar 3, 2021
279d631
Updating style
cjnolet Mar 3, 2021
e0c9b1d
Updating raft hash to point to my branch until raft pr is merged
cjnolet Mar 3, 2021
8f0f709
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 3, 2021
8652239
Removing tests that are no longer needed
cjnolet Mar 4, 2021
23893c3
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 4, 2021
032887d
Merge branch 'branch-0.19' into fea-019-slhc
cjnolet Mar 4, 2021
427ebc6
Updating raft hash to branch-0.19
cjnolet Mar 4, 2021
72b3f10
Merge remote-tracking branch 'rapids/branch-0.19' into imp-19-use_raf…
cjnolet Mar 5, 2021
2039567
Updating raft hash
cjnolet Mar 6, 2021
7121a1c
Merge branch 'branch-0.19' into imp-19-use_raft_fused_l2_nn_2
cjnolet Mar 11, 2021
2ca9f49
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 15, 2021
ebde06c
Removing fix_connectivities since that's already in raft
cjnolet Mar 15, 2021
ad1fc15
Merge branch 'branch-0.19' into imp-19-use_raft_fused_l2_nn_2
cjnolet Mar 15, 2021
289fd78
Updating nccl version
cjnolet Mar 15, 2021
40bbb45
Merge branch 'branch-0.19' into imp-19-use_raft_fused_l2_nn_2
cjnolet Mar 16, 2021
4f397ed
Updating includes
cjnolet Mar 16, 2021
cd77fed
Removing files from bad merge
cjnolet Mar 16, 2021
0accca3
Merge branch 'imp-19-use_raft_fused_l2_nn_2' into fea-019-slhc
cjnolet Mar 16, 2021
64fa5e6
Updating based on recent RAFT changes
cjnolet Mar 16, 2021
0670cbd
Merge branch 'branch-0.19' into fea-019-slhc
cjnolet Mar 16, 2021
1a30afd
Cleanup
cjnolet Mar 16, 2021
b6f89e5
More cleanup
cjnolet Mar 16, 2021
28f88c1
Updating style
cjnolet Mar 16, 2021
950d077
Updates based on review feedback
cjnolet Mar 16, 2021
1f07e8b
Correclty modifying impl
cjnolet Mar 16, 2021
a0d3873
Updating conda recipes
cjnolet Mar 16, 2021
e4d6955
removing unecessary date change in pca test
cjnolet Mar 17, 2021
bc1a1d7
Merge branch 'branch-0.19' into fea-019-slhc
cjnolet Mar 17, 2021
e354abf
Beginning python wrapper for agglomerativeclustering
cjnolet Mar 18, 2021
3f8247b
Pairwise tests are passing. Kneighbors cluster extraction has a bug s…
cjnolet Mar 18, 2021
c902b0a
Merge branch 'branch-0.19' into fea-019-slhc_python
cjnolet Mar 22, 2021
a926566
Pytests seem to be working w/ 1k samples. Still figure out why 10k sa…
cjnolet Mar 22, 2021
3a10f27
Connectivity algorithm works scaled up to 1M points. Need to optimize…
cjnolet Mar 23, 2021
b5055fd
Still working through scaling knn graph
cjnolet Mar 24, 2021
40e605b
Updating test
cjnolet Mar 25, 2021
fecc509
Fixing style. Everything works now!
cjnolet Mar 26, 2021
9abc185
Testing both connectivity types
cjnolet Mar 26, 2021
98b6095
Checking in code w/ docs
cjnolet Mar 26, 2021
56c8596
Removing printlns
cjnolet Mar 26, 2021
bdf71b5
Updating raft hash
cjnolet Mar 26, 2021
6b8e910
Merge branch 'branch-0.19' into fea-019-slhc_python
cjnolet Mar 26, 2021
a741349
Changes based on Dante's review
cjnolet Mar 26, 2021
3f6217d
Adjusting raft hash to current RAFT head
cjnolet Mar 26, 2021
49baf6e
Update Dependencies.cmake
cjnolet Mar 29, 2021
50e08ec
Fixes for pickling
cjnolet Mar 30, 2021
c6ffce3
Updating copyright
cjnolet Mar 30, 2021
9d770e5
Merge branch 'branch-0.19' into fea-019-slhc_python
cjnolet Mar 30, 2021
db92ebd
Removing unecessary file
cjnolet Mar 30, 2021
4 changes: 2 additions & 2 deletions cpp/cmake/Dependencies.cmake
@@ -38,8 +38,8 @@ else(DEFINED ENV{RAFT_PATH})
   set(RAFT_DIR ${CMAKE_CURRENT_BINARY_DIR}/raft CACHE STRING "Path to RAFT repo")

   ExternalProject_Add(raft
-    GIT_REPOSITORY https://github.com/rapidsai/raft.git
-    GIT_TAG fc46618d76d70710b07d445e79d3e07dea6cad2f
+    GIT_REPOSITORY https://github.com/cjnolet/raft.git
+    GIT_TAG 7bffddfe69aaa370d2affb2b1bb4bf7735589c1f
     PREFIX ${RAFT_DIR}
     CONFIGURE_COMMAND ""
     BUILD_COMMAND ""
4 changes: 2 additions & 2 deletions cpp/include/cuml/cluster/linkage.hpp
@@ -33,9 +33,9 @@ namespace ML {
  * @param[in] X dense feature matrix on device
  * @param[in] m number of rows in X
  * @param[in] n number of columns in X
- * @param[out] out container object for output arrays
  * @param[in] metric distance metric to use. Must be supported by the
  *              dense pairwise distances API.
+ * @param[out] out container object for output arrays
  * @param[out] n_clusters number of clusters to cut from resulting dendrogram
  */
 void single_linkage_pairwise(const raft::handle_t &handle, const float *X,
@@ -55,9 +55,9 @@ void single_linkage_pairwise(const raft::handle_t &handle, const float *X,
  * @param[in] X dense feature matrix on device
  * @param[in] m number of rows in X
  * @param[in] n number of columns in X
- * @param[out] out container object for output arrays
  * @param[in] metric distance metric to use. Must be supported by the
  *              dense pairwise distances API.
+ * @param[out] out container object for output arrays
  * @param[out] c the optimal value of k is guaranteed to be at least log(n) + c
  * where c is some constant. This constant can usually be set to a fairly low
  * value, like 15, and still maintain good performance.
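Note on the `c` parameter documented above: the comment says the neighborhood size k is at least log(n) + c. A minimal sketch of how that bound could be turned into a concrete k is shown here. The helper name `knn_graph_k`, the use of the natural logarithm, and the clamp to `n_rows - 1` are assumptions made for illustration; none of them are taken from the RAFT implementation.

```python
import math


def knn_graph_k(n_rows: int, c: int = 15) -> int:
    """Hypothetical helper: pick a neighborhood size of at least log(n_rows) + c.

    Assumes the natural logarithm; the C++ doc comment does not state the base.
    """
    if n_rows < 2:
        raise ValueError("need at least two rows to build a neighbor graph")

    # Rounding the logarithm up keeps k >= log(n_rows) + c.
    k = math.ceil(math.log(n_rows)) + c

    # A point cannot have more neighbors than there are other points, so
    # very small inputs fall back to a fully connected neighborhood.
    return min(k, n_rows - 1)
```

With the suggested `c = 15`, k grows very slowly with the input size, which is why a small constant can be enough even for large datasets.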
1 change: 1 addition & 0 deletions python/cuml/cluster/__init__.py
@@ -16,3 +16,4 @@

 from cuml.cluster.dbscan import DBSCAN
 from cuml.cluster.kmeans import KMeans
+from cuml.cluster.agglomerative import AgglomerativeClustering
249 changes: 249 additions & 0 deletions python/cuml/cluster/agglomerative.pyx
@@ -0,0 +1,249 @@
#
# Copyright (c) 2019-2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# distutils: language = c++

from libc.stdint cimport uintptr_t

import numpy as np

from cuml.common.array import CumlArray
from cuml.common.base import Base
from cuml.common.doc_utils import generate_docstring
from cuml.raft.common.handle cimport handle_t
from cuml.common import input_to_cuml_array
from cuml.common.array_descriptor import CumlArrayDescriptor
from cuml.common.mixins import ClusterMixin
from cuml.common.mixins import CMajorInputTagMixin

from cuml.metrics.distance_type cimport DistanceType


cdef extern from "raft/sparse/hierarchy/common.h" namespace "raft::hierarchy":

    cdef cppclass linkage_output_int_float:
        int m
        int n_clusters
        int n_leaves
        int n_connected_components
        int *labels
        int *children

cdef extern from "cuml/cluster/linkage.hpp" namespace "ML":

    cdef void single_linkage_pairwise(
        const handle_t &handle,
        const float *X,
        size_t m,
        size_t n,
        linkage_output_int_float *out,
        DistanceType metric,
        int n_clusters
    ) except +

    cdef void single_linkage_neighbors(
        const handle_t &handle,
        const float *X,
        size_t m,
        size_t n,
        linkage_output_int_float *out,
        DistanceType metric,
        int c,
        int n_clusters
    ) except +


_metrics_mapping = {
    'l1': DistanceType.L1,
    'cityblock': DistanceType.L1,
    'manhattan': DistanceType.L1,
    'l2': DistanceType.L2SqrtExpanded,
    'euclidean': DistanceType.L2SqrtExpanded,
    'cosine': DistanceType.CosineExpanded
}


class AgglomerativeClustering(Base, ClusterMixin, CMajorInputTagMixin):

    """
    Agglomerative Clustering

    Recursively merges the pair of clusters that minimally increases a
    given linkage distance.

    Parameters
    ----------
    handle : cuml.Handle
        Specifies the cuml.handle that holds internal CUDA state for
        computations in this model. Most importantly, this specifies the CUDA
        stream that will be used for the model's computations, so users can
        run different models concurrently in different streams by creating
        handles in several streams.
        If it is None, a new one is created.
    verbose : int or boolean, default=False
        Sets logging level. It must be one of `cuml.common.logger.level_*`.
        See :ref:`verbosity-levels` for more info.

    n_clusters : int (default = 2)
        The number of clusters to find.
    affinity : str, default='euclidean'
        Metric used to compute the linkage. Can be "euclidean", "l1",
        "l2", "manhattan", or "cosine". If connectivity is "knn" only
        "euclidean" is accepted.
    linkage : {"single"}, default="single"
        Which linkage criterion to use. The linkage criterion determines
        which distance to use between sets of observations. The algorithm
        will merge the pairs of clusters that minimize this criterion.
        - 'single' uses the minimum of the distances between all
          observations of the two sets.
    n_neighbors : int (default = 10)
        The number of neighbors to compute when connectivity = "knn".
        Must be between 2 and 1023.
    connectivity : {"pairwise", "knn"}, (default = "knn")
        The type of connectivity matrix to compute.
        - 'pairwise' will compute the entire fully-connected graph of
          pairwise distances between each set of points. This is the
          fastest to compute and can be very fast for smaller datasets
          but requires O(n^2) space.
        - 'knn' will sparsify the fully-connected connectivity matrix to
          save memory and enable much larger inputs. "n_neighbors" will
          control the amount of memory used and the graph will be connected
          automatically in the event "n_neighbors" was not large enough
          to connect it.
    output_type : {'input', 'cudf', 'cupy', 'numpy', 'numba'}, default=None
        Variable to control output type of the results and attributes of
        the estimator. If None, it'll inherit the output type set at the
        module level, `cuml.global_settings.output_type`.
        See :ref:`output-data-type-configuration` for more info.
    """

    labels_ = CumlArrayDescriptor()
    children_ = CumlArrayDescriptor()

    def __init__(self, n_clusters=2, affinity="euclidean", linkage="single",
                 handle=None, verbose=False, connectivity='knn',
                 n_neighbors=10, output_type=None):

        super(AgglomerativeClustering, self).__init__(handle,
                                                      verbose,
                                                      output_type)

        # Compare by value: `is not` tests object identity and only works
        # for string literals by accident of interning.
        if linkage != "single":
            raise ValueError("Only single linkage clustering is "
                             "supported currently")

        if connectivity not in ["knn", "pairwise"]:
            raise ValueError("'connectivity' can only be one of "
                             "{'knn', 'pairwise'}")

        if n_clusters <= 0:
            raise ValueError("'n_clusters' must be >= 1")

        if n_neighbors > 1023 or n_neighbors < 2:
            raise ValueError("'n_neighbors' must be a positive number "
                             "between 2 and 1023")

        if affinity not in _metrics_mapping:
            raise ValueError("'affinity' %s is not supported." % affinity)

        self.n_clusters = n_clusters
        self.affinity = affinity
        self.linkage = linkage
        self.n_neighbors = n_neighbors
        self.connectivity = connectivity

        self.labels_ = None
        self.n_clusters_ = None
        self.n_leaves_ = None
        self.n_connected_components_ = None
        self.children_ = None
        self.distances_ = None

    @generate_docstring()
    def fit(self, X, y=None):
        """
        Fit the hierarchical clustering from features.
        """

        # The C++ entry points take float32 only, so float64 input is cast
        # down rather than reinterpreted through the <float*> pointer below.
        X_m, n_rows, n_cols, self.dtype = \
            input_to_cuml_array(X, order='C',
                                check_dtype=[np.float32, np.float64],
                                convert_to_dtype=np.float32)

        if self.n_clusters > n_rows:
            raise ValueError("'n_clusters' must be <= n_samples")

        cdef uintptr_t input_ptr = X_m.ptr

        cdef handle_t* handle_ = <handle_t*><size_t>self.handle.getHandle()

        # Hardcode n_connected_components_ to 1 for single linkage. This
        # will not be the case for other linkage types.
        self.n_connected_components_ = 1
        self.n_leaves_ = n_rows
        self.n_clusters_ = self.n_clusters

        self.labels_ = CumlArray.empty(n_rows, dtype="int32")
        self.children_ = CumlArray.empty((2, n_rows), dtype="int32")
        cdef uintptr_t labels_ptr = self.labels_.ptr
        cdef uintptr_t children_ptr = self.children_.ptr

        cdef linkage_output_int_float* linkage_output = \
            new linkage_output_int_float()

        linkage_output.children = <int*>children_ptr
        linkage_output.labels = <int*>labels_ptr

        cdef DistanceType metric
        if self.affinity in _metrics_mapping:
            metric = _metrics_mapping[self.affinity]
        else:
            raise ValueError("'affinity' %s not supported." % self.affinity)

        if self.connectivity == 'knn':
            single_linkage_neighbors(
                handle_[0], <float*>input_ptr, <int> n_rows,
                <int> n_cols, <linkage_output_int_float*> linkage_output,
                <DistanceType> metric, <int>self.n_neighbors,
                <int> self.n_clusters)
        elif self.connectivity == 'pairwise':
            single_linkage_pairwise(
                handle_[0], <float*>input_ptr, <int> n_rows,
                <int> n_cols, <linkage_output_int_float*> linkage_output,
                <DistanceType> metric, <int> self.n_clusters)
        else:
            raise ValueError("'connectivity' can only be one of "
                             "{'knn', 'pairwise'}")

        self.handle.sync()

        # fit_predict chains on the return value, so fit must return self.
        return self

    @generate_docstring(return_values={'name': 'preds',
                                       'type': 'dense',
                                       'description': 'Cluster indexes',
                                       'shape': '(n_samples, 1)'})
    def fit_predict(self, X, y=None):
        """
        Fit the hierarchical clustering from features and return
        cluster labels.
        """
        return self.fit(X).labels_

    def get_param_names(self):
        # Every name listed here must be a constructor argument that is
        # stored as an attribute of the same name.
        return super().get_param_names() + [
            "n_clusters",
            "affinity",
            "linkage",
            "connectivity",
            "n_neighbors"
        ]
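For readers following the `connectivity='pairwise'` path, here is a minimal pure-Python sketch of what single linkage computes: sort all pairwise distances, merge the closest pair of clusters with a union-find structure (Kruskal's minimum spanning tree algorithm), and stop once `n_clusters` components remain. This is an illustration only and not the RAFT/cuML GPU implementation. The function name `single_linkage_labels` is hypothetical, and the label numbering it produces need not match cuML's.

```python
import math
from itertools import combinations


def single_linkage_labels(points, n_clusters):
    """Hypothetical sketch: flat single-linkage clusters for a list of points."""
    n = len(points)
    if not 1 <= n_clusters <= n:
        raise ValueError("'n_clusters' must be between 1 and n_samples")

    # 'pairwise' connectivity: every pair of points is a candidate edge,
    # which is where the O(n^2) memory cost comes from.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n_components = n
    for _, i, j in edges:
        if n_components == n_clusters:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Merging across the shortest remaining edge is exactly the
            # single-linkage criterion (minimum inter-cluster distance).
            parent[root_j] = root_i
            n_components -= 1

    # Relabel roots as consecutive integers in order of first appearance.
    root_to_label = {}
    return [
        root_to_label.setdefault(find(i), len(root_to_label))
        for i in range(n)
    ]
```

For example, `single_linkage_labels([(0, 0), (0, 1), (5, 5)], 2)` groups the first two points together and leaves the third on its own. The 'knn' path replaces the full edge list with a sparse neighbor graph, which is why it scales to larger inputs.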
82 changes: 82 additions & 0 deletions python/cuml/test/test_agglomerative.py
@@ -0,0 +1,82 @@
# Copyright (c) 2019-2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import pytest

from cuml.cluster import AgglomerativeClustering
from cuml.datasets import make_blobs

from cuml.metrics import adjusted_rand_score

from sklearn import cluster

import cupy as cp


@pytest.mark.parametrize('nrows', [100, 1000])
@pytest.mark.parametrize('ncols', [25, 50])
@pytest.mark.parametrize('nclusters', [2, 10, 50])
@pytest.mark.parametrize('k', [3, 5, 15])
@pytest.mark.parametrize('connectivity', ['knn', 'pairwise'])
def test_single_linkage_sklearn_compare(nrows, ncols, nclusters,
                                        k, connectivity):

    X, y = make_blobs(int(nrows),
                      ncols,
                      nclusters,
                      cluster_std=1.0,
                      shuffle=False)

    cuml_agg = AgglomerativeClustering(
        n_clusters=nclusters, affinity='euclidean', linkage='single',
        n_neighbors=k, connectivity=connectivity)

    cuml_agg.fit(X)

    sk_agg = cluster.AgglomerativeClustering(
        n_clusters=nclusters, affinity='euclidean', linkage='single')
    sk_agg.fit(cp.asnumpy(X))

    # Cluster assignments should be exact, even though the actual
    # labels may differ
    assert(adjusted_rand_score(cuml_agg.labels_, sk_agg.labels_) == 1.0)
    assert(cuml_agg.n_connected_components_ == sk_agg.n_connected_components_)
    assert(cuml_agg.n_leaves_ == sk_agg.n_leaves_)
    assert(cuml_agg.n_clusters_ == sk_agg.n_clusters_)


def test_invalid_inputs():

    # Test bad affinity
    with pytest.raises(ValueError):
        AgglomerativeClustering(affinity='doesntexist')

    with pytest.raises(ValueError):
        AgglomerativeClustering(linkage='doesntexist')

    with pytest.raises(ValueError):
        AgglomerativeClustering(connectivity='doesntexist')

    with pytest.raises(ValueError):
        AgglomerativeClustering(n_neighbors=1)

    with pytest.raises(ValueError):
        AgglomerativeClustering(n_neighbors=1024)

    with pytest.raises(ValueError):
        AgglomerativeClustering(n_clusters=0)

    with pytest.raises(ValueError):
        AgglomerativeClustering(n_clusters=500).fit(cp.ones((2, 5)))
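The comparison test relies on the adjusted Rand score being invariant to how clusters are numbered, which is why it can demand exactly 1.0 even though cuML and scikit-learn may assign different integer labels. A small pure-Python sketch of the standard pair-counting formula makes that property visible. The function name `adjusted_rand_index` is hypothetical and stands in for `cuml.metrics.adjusted_rand_score`; this is not the cuML implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hypothetical sketch of the adjusted Rand index via pair counting."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of samples placed together in both labelings (contingency cells).
    together_in_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together in each labeling on its own.
    together_in_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_in_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (together_in_both - expected) / (maximum - expected)
```

Only co-membership of sample pairs enters the formula, so relabeling clusters (for example `[0, 0, 1, 1, 2]` versus `[5, 5, 3, 3, 9]`) leaves the score unchanged, while a genuinely different partition scores below 1.0.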