
Very Slow with greater than 10k obs #45

Closed
ashish1610dhiman opened this issue Jun 8, 2017 · 6 comments

Comments

@ashish1610dhiman

It does not get past this point...

Init: initializing centroids
Init: initializing clusters
Starting iterations...

Even on a GCS instance.

@nicodv
Owner

nicodv commented Jun 9, 2017

This is not due to the number of observations, but more likely to your specific data. Note that Cao's init method can be very slow when n_clusters is large; maybe that's the cause.

You can run the benchmark.py script in the examples directory (which has 10k observations) to see if that works for you.

Also, with this little information about the specifics of your problem, you're leaving me guessing...
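
For reference, a minimal sketch of that kind of check (the data shape, number of clusters and other parameter values below are illustrative assumptions, not the reporter's actual setup):

```python
import time
import numpy as np
from kmodes.kmodes import KModes

# 10k rows, 10 categorical columns with 8 levels each (purely synthetic)
X = np.random.randint(0, 8, size=(10000, 10))

# Compare the two init methods to see whether Cao is the bottleneck
for init in ('Cao', 'Huang'):
    km = KModes(n_clusters=20, init=init, n_init=1, verbose=1)
    start = time.time()
    km.fit(X)
    print('%s init: %.1f s, cost = %.0f' % (init, time.time() - start, km.cost_))
```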

@paulaceccon

Same here. I was able to run DBSCAN, but I'm struggling to run k-prototypes. I'm not sure what strategy to follow to get it running.

@nicodv
Owner

nicodv commented Oct 2, 2017

@paulaceccon, what are the dimensions of your problem? Also, please provide sample output from a run in verbose mode.
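
For illustration, verbose output could be produced with something like the sketch below; the toy data, cluster count and categorical column indices are placeholders, not the actual problem:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Toy mixed-type data: two numerical columns, two categorical columns
X = np.array([
    [1.5, 10.0, 'a', 'x'],
    [2.0, 12.5, 'b', 'x'],
    [0.5,  9.0, 'a', 'y'],
    [3.5, 11.0, 'b', 'y'],
], dtype=object)

kp = KPrototypes(n_clusters=2, init='Cao', verbose=2)
labels = kp.fit_predict(X, categorical=[2, 3])  # indices of the categorical columns
print(labels)
```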

@mpikoula

Any insight into why kprototypes is a lot slower than kmodes for similar datasets? Just trying to understand the algorithm better.

I've got only 3 numerical variables in my dataset (out of 12 total), and if I drop one and turn the other two into categories, kmodes runs almost instantaneously for n = 1 to 10.
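
A rough sketch of that workaround, i.e. binning the remaining numerical columns so that the whole dataset is categorical and plain kmodes can be used (column names, bin counts and cluster count are made up):

```python
import numpy as np
import pandas as pd
from kmodes.kmodes import KModes

# Synthetic stand-in: 2 numerical and 2 categorical columns
df = pd.DataFrame({
    'num_a': np.random.rand(1000) * 100,
    'num_b': np.random.rand(1000) * 10,
    'cat_a': np.random.choice(list('abc'), 1000),
    'cat_b': np.random.choice(list('xyz'), 1000),
})

# Discretise the numerical columns into 5 bins so every column is categorical
for col in ('num_a', 'num_b'):
    df[col] = pd.cut(df[col], bins=5, labels=False)

km = KModes(n_clusters=5, init='Huang', n_init=5, verbose=1)
labels = km.fit_predict(df.values)
```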

@nicodv
Owner

nicodv commented Nov 15, 2017

I've had some time to analyze this problem by profiling my code.

To determine the cluster centroids in the k-means part of the algorithm, we need to divide the sums of the attribute values by the number of points in the cluster. I realized I was caching the attribute sums correctly, but not the membership sums (the point counts) of the clusters.

The following commit resolves this:
1a6a7be

I'm seeing very significant speedups as a result. :)

Thanks to everyone for pointing it out.
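
For readers of the thread, here is a simplified sketch of the caching idea described above (not the actual code from the commit): keep per-cluster running sums of the numerical attributes and per-cluster membership counts, so that an affected centroid mean can be recomputed when a point moves, without a pass over the full dataset.

```python
import numpy as np

def move_point(point, from_c, to_c, cl_attr_sum, cl_memb_sum, centroids):
    """Move one point's numerical attributes between clusters,
    updating the cached attribute sums and membership counts."""
    cl_attr_sum[from_c] -= point
    cl_attr_sum[to_c] += point
    cl_memb_sum[from_c] -= 1
    cl_memb_sum[to_c] += 1
    # Only the two affected centroids need recomputing, and only from
    # the cached sums: no rescan of the dataset.
    for c in (from_c, to_c):
        if cl_memb_sum[c] > 0:
            centroids[c] = cl_attr_sum[c] / cl_memb_sum[c]

# Tiny demo with 2 clusters and 3 numerical attributes
cl_attr_sum = np.array([[6.0, 6.0, 6.0], [2.0, 2.0, 2.0]])
cl_memb_sum = np.array([3, 1])
centroids = cl_attr_sum / cl_memb_sum[:, None]
move_point(np.array([1.0, 1.0, 1.0]), 0, 1, cl_attr_sum, cl_memb_sum, centroids)
print(centroids)  # cluster means stay consistent after the move
```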

@nicodv nicodv closed this as completed Nov 15, 2017
@nicodv
Owner

nicodv commented Nov 15, 2017

Oh, and I've included a benchmark script specifically for k-prototypes in the examples folder.
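
The bundled script in the examples folder is the authoritative reference; the snippet below is only a rough, hypothetical harness in the same spirit, timing KPrototypes on synthetic mixed-type data.

```python
import time
import numpy as np
from kmodes.kprototypes import KPrototypes

n_rows, n_num, n_cat = 10000, 3, 9
data = np.empty((n_rows, n_num + n_cat), dtype=object)
data[:, :n_num] = np.random.rand(n_rows, n_num)             # numerical columns
data[:, n_num:] = np.random.randint(0, 5, (n_rows, n_cat))  # categorical columns

start = time.time()
kp = KPrototypes(n_clusters=8, init='Huang', n_init=1, verbose=1)
kp.fit(data, categorical=list(range(n_num, n_num + n_cat)))
print('k-prototypes on %d rows took %.1f s' % (n_rows, time.time() - start))
```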
