Streaming k-Clusters with Outliers Approximation

Implement k-cluster with outliers streaming algorithm.

It contains the implementation of two algorithms from these papers:

Section 3 from Algorithms for Facility Location Problems with Outliers, Moses Charikar, Samir Khuller, David M. Mount, Giri Narasimhan acm.
Algorithm 3.1 from Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity, Richard Matthew McCutchen, Samir Khuller springer. Where the requirement to an offline 4-approximation algorithm has been replaced with the above algorithm.

Building

Dependencies

The core code only requires a C++17/OpenMP capable compilator and the library jsoncpp.

Building maps and graphs requires python3 together with libraries glob, pillow and matplotlib.

Compilation

https://github.com/remi-dupre/cluster-outliers.git
cd cluster-outliers
make release

Usage

Running

# Use every logical threads as a default, but it seems to behave better without
# any hyperthreading going on.
export OMP_NUM_THREADS=4

# Cover the graph saved in dataset.txt using 50 clusters and 10 outliers.
./clustering dataset.txt 50 10

The input file must contain one vertex per line, beeing described with three columns separated with spaces. The first columns is ignored, and the second and last columns contains longitude and latitude of the point.

Testing

# Build 2D-maps in the directory maps/ from existing logs
scripts/map.py

# Plot a benchmark about existing logs
scripts/plot.py

Results

All tests were run on this dataset containing the geolocalisation of 1,000,000 tweets around the world.

Here are measures for several parameters:

clusters: the number of allowed clusters in the solution
outliers: number of allowed outliers
alpha: disks growth parameter in the algorithm
cost: max radius of the solution
min opt: a lower bound to the optimal solution
max opt: an upper bound to the optimal solution
time: run time of the algorithm

clusters	outliers	α	cost	min opt	max ratio	time
20	10	1.1	0.8282	0.3125	2.6502	2.6360s
20	50	1.1	0.6900	0.1992	3.4637	18.3627s
20	10	4	1.4160	0.3125	4.5312	1.9803s
20	20	4	0.8843	0.2666	3.3169	2.4288s
20	30	4	0.8665	0.2373	3.6514	2.4714s
20	40	4	1.3749	0.2178	6.3136	2.6495s
20	50	4	1.2991	0.1992	6.5210	3.1103s

For parameters clusters = 20, outliers = 50 and α = 1.1 the output looks like:

The bound on the optimal is computed by finding (clusters + outliers + 1) disks that don't intersect, this is how it looks like using previous parameters:

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
scripts		scripts
src		src
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Streaming k-Clusters with Outliers Approximation

Building

Dependencies

Compilation

Usage

Running

Testing

Results

About

Releases

Packages

Languages

remi-dupre/cluster-outliers

Folders and files

Latest commit

History

Repository files navigation

Streaming k-Clusters with Outliers Approximation

Building

Dependencies

Compilation

Usage

Running

Testing

Results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages