2009.04552.md

Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.

Here is a detailed summary of the paper:

Problem: The paper addresses the problem of scaling density-based clustering algorithms such as DBSCAN to high-dimensional datasets. DBSCAN requires constructing an ε-nearest-neighbor (ε-NN) graph of the dataset, which becomes very expensive in high dimensions due to the curse of dimensionality. Existing parallel implementations of DBSCAN do not work well for high-dimensional data because constructing the ε-NN graph incurs quadratic memory and computation cost.
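To make the cost concrete, here is a minimal, illustrative sketch of brute-force ε-NN graph construction (not the paper's code; the function name and parameters are assumptions). It requires O(n²) distance computations and, in dense regions, stores up to O(n²) edges:

```python
import numpy as np

def epsilon_nn_graph(points, eps):
    """Brute-force eps-NN graph: O(n^2) distance computations,
    and up to O(n^2) edges when many points fall within eps."""
    n = len(points)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                graph[i].append(j)
                graph[j].append(i)
    return graph
```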

Proposed Solution: The paper proposes a modified DBSCAN algorithm, called KNN-DBSCAN, that uses an approximate k-nearest-neighbor (k-NN) graph instead of an exact ε-NN graph. The k-NN graph has linear memory complexity and can be constructed efficiently using randomized projection tree methods.
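A minimal sketch of the randomized-projection-tree idea follows; it is illustrative only (a practical construction would combine several trees and refinement passes), and the function name and parameters are assumptions, not the paper's API. It shows why the resulting k-NN graph needs only O(kn) memory:

```python
import numpy as np

def rp_tree_knn(points, k, leaf_size=64, rng=None):
    """Approximate k-NN graph from one random-projection tree:
    recursively split on a random direction, then run brute-force k-NN
    inside each leaf. The result uses O(k*n) memory.
    Assumes k is small relative to leaf_size."""
    rng = rng or np.random.default_rng(0)
    n, d = points.shape
    neighbors = np.full((n, k), -1, dtype=int)

    def build(idx):
        if len(idx) <= leaf_size:
            sub = points[idx]
            dist = np.linalg.norm(sub[:, None] - sub[None, :], axis=-1)
            order = np.argsort(dist, axis=1)[:, 1:k + 1]   # drop self
            neighbors[idx] = idx[order]
            return
        proj = points[idx] @ rng.normal(size=d)            # random direction
        split = np.argsort(proj)
        half = len(idx) // 2
        build(idx[split[:half]])
        build(idx[split[half:]])

    build(np.arange(n))
    return neighbors
```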

The authors prove theoretically that, under certain conditions, clustering the core points using the k-NN graph produces a correct subset clustering relative to the clustering obtained with an ε-NN graph. They introduce new definitions of reachability and connectivity on the asymmetric k-NN graph to enable density-based clustering.
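A rough sketch of how core points can be identified and merged on an asymmetric k-NN graph is shown below; it simplifies the paper's actual reachability and connectivity definitions, and the core-point test and names are assumptions made for illustration:

```python
import numpy as np

def cluster_core_points(knn_idx, knn_dist, eps):
    """Density-based clustering sketch on an asymmetric k-NN graph.
    knn_idx/knn_dist: (n, k) neighbor indices and distances, sorted by
    distance. Core point: all k neighbors lie within eps. Core points are
    merged whenever a directed k-NN edge of length <= eps connects them."""
    n, k = knn_idx.shape
    core = knn_dist[:, -1] <= eps          # k-th neighbor within eps
    parent = np.arange(n)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in np.flatnonzero(core):
        for j, dij in zip(knn_idx[i], knn_dist[i]):
            if core[j] and dij <= eps:
                parent[find(i)] = find(j)  # union the two components

    labels = np.full(n, -1)
    for i in np.flatnonzero(core):
        labels[i] = find(i)
    return labels
```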

An efficient parallel implementation using hybrid MPI+OpenMP is presented. It builds an inexact minimum spanning tree, first locally and then across MPI ranks, to cluster the core points, and it handles border points correctly.
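A toy sketch of the cross-rank merge step is given below, assuming mpi4py is available. It gathers all edges to one rank for clarity, which a scalable distributed implementation such as the paper's would avoid, and every name here is an assumption rather than the paper's API:

```python
# Toy sketch only: assumes mpi4py; gathers everything to rank 0 for clarity.
from mpi4py import MPI

def merge_core_clusters(local_edges, n_global):
    """local_edges: list of (i, j, w) k-NN edges between globally numbered
    core points owned by this rank. Returns cluster labels on rank 0."""
    comm = MPI.COMM_WORLD
    gathered = comm.gather(local_edges, root=0)
    if comm.rank != 0:
        return None

    parent = list(range(n_global))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Kruskal-style pass: lightest edges first; edges that join two
    # different components form an (inexact) spanning forest.
    edges = sorted((e for part in gathered for e in part), key=lambda e: e[2])
    for i, j, _ in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    return [find(i) for i in range(n_global)]
```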

Contributions:

  • Introduction and analysis of the KNN-DBSCAN algorithm, which enables DBSCAN-style clustering using approximate k-NN graphs
  • Proof that KNN-DBSCAN produces correct clusters under certain conditions
  • Efficient distributed memory parallel implementation that scales to large high-dimensional datasets
  • Evaluation on up to 65 billion points in 20 dimensions, demonstrating scalability; a comparison with the existing PDBSCAN code shows up to a 37x speedup

The algorithm enables density-based clustering of datasets that were previously out of reach for DBSCAN, while achieving similar accuracy. This opens up new applications for density-based clustering in high dimensions.