
Very Slow with greater than 10k obs #45

Closed
ashish1610dhiman opened this issue Jun 8, 2017 · 6 comments

Comments

@ashish1610dhiman

It does not get past this point...

Init: initializing centroids
Init: initializing clusters
Starting iterations...

Even on a GCS instance.

@nicodv
Owner

nicodv commented Jun 9, 2017

This is not due to the number of observations, but more likely to your specific data. Note that Cao's init method can be very slow when n_clusters is large; maybe that's the cause.

You can run the benchmark.py script in the examples directory (which has 10k observations) to see if that works for you.

Also, with this little information about the specifics of your problem, you're leaving me guessing...
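
For reference, a minimal sketch of that kind of check (the data shape, number of clusters and other parameter values below are illustrative assumptions, not the reporter's actual setup):

```python
import time
import numpy as np
from kmodes.kmodes import KModes

# 10k rows, 10 categorical columns with 8 levels each (purely synthetic)
X = np.random.randint(0, 8, size=(10000, 10))

# Compare the two init methods to see whether Cao is the bottleneck
for init in ('Cao', 'Huang'):
    km = KModes(n_clusters=20, init=init, n_init=1, verbose=1)
    start = time.time()
    km.fit(X)
    print('%s init: %.1f s, cost = %.0f' % (init, time.time() - start, km.cost_))
```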

@paulaceccon

Same here. I was able to run DBSCAN, but I'm struggling to run k-prototypes. I'm not sure what strategy to follow to get it running.

@nicodv
Owner

nicodv commented Oct 2, 2017

@paulaceccon, what are the dimensions of your problem? Also, please provide sample output from a run in verbose mode.
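
For illustration, verbose output could be produced with something like the sketch below; the toy data, cluster count and categorical column indices are placeholders, not the actual problem:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Toy mixed-type data: two numerical columns, two categorical columns
X = np.array([
    [1.5, 10.0, 'a', 'x'],
    [2.0, 12.5, 'b', 'x'],
    [0.5,  9.0, 'a', 'y'],
    [3.5, 11.0, 'b', 'y'],
], dtype=object)

kp = KPrototypes(n_clusters=2, init='Cao', verbose=2)
labels = kp.fit_predict(X, categorical=[2, 3])  # indices of the categorical columns
print(labels)
```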

@mpikoula

Any insight into why kprototypes is a lot slower than kmodes for similar datasets? Just trying to understand the algorithm better.

I've got only 3 numerical variables in my dataset (out of 12 total), and if I drop one and turn the other two into categories, kmodes runs almost instantaneously for n = 1 to 10.
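
A rough sketch of that workaround, i.e. binning the remaining numerical columns so that the whole dataset is categorical and plain kmodes can be used (column names, bin counts and cluster count are made up):

```python
import numpy as np
import pandas as pd
from kmodes.kmodes import KModes

# Synthetic stand-in: 2 numerical and 2 categorical columns
df = pd.DataFrame({
    'num_a': np.random.rand(1000) * 100,
    'num_b': np.random.rand(1000) * 10,
    'cat_a': np.random.choice(list('abc'), 1000),
    'cat_b': np.random.choice(list('xyz'), 1000),
})

# Discretise the numerical columns into 5 bins so every column is categorical
for col in ('num_a', 'num_b'):
    df[col] = pd.cut(df[col], bins=5, labels=False)

km = KModes(n_clusters=5, init='Huang', n_init=5, verbose=1)
labels = km.fit_predict(df.values)
```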

@nicodv
Owner

nicodv commented Nov 15, 2017

I've had some time to analyze this problem by profiling my code.

To determine the cluster centroids in the k-means part of the algorithm, we need to divide the sums of the attribute values by the number of points in the cluster. I realized I was caching the attribute sums correctly, but not the membership sums (the point counts) of the clusters.

The following commit resolves this:
1a6a7be

I'm seeing very significant speedups as a result. :)

Thanks to everyone for pointing it out.
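
For readers of the thread, here is a simplified sketch of the caching idea described above (not the actual code from the commit): keep per-cluster running sums of the numerical attributes and per-cluster membership counts, so that an affected centroid mean can be recomputed when a point moves, without a pass over the full dataset.

```python
import numpy as np

def move_point(point, from_c, to_c, cl_attr_sum, cl_memb_sum, centroids):
    """Move one point's numerical attributes between clusters,
    updating the cached attribute sums and membership counts."""
    cl_attr_sum[from_c] -= point
    cl_attr_sum[to_c] += point
    cl_memb_sum[from_c] -= 1
    cl_memb_sum[to_c] += 1
    # Only the two affected centroids need recomputing, and only from
    # the cached sums: no rescan of the dataset.
    for c in (from_c, to_c):
        if cl_memb_sum[c] > 0:
            centroids[c] = cl_attr_sum[c] / cl_memb_sum[c]

# Tiny demo with 2 clusters and 3 numerical attributes
cl_attr_sum = np.array([[6.0, 6.0, 6.0], [2.0, 2.0, 2.0]])
cl_memb_sum = np.array([3, 1])
centroids = cl_attr_sum / cl_memb_sum[:, None]
move_point(np.array([1.0, 1.0, 1.0]), 0, 1, cl_attr_sum, cl_memb_sum, centroids)
print(centroids)  # cluster means stay consistent after the move
```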

@nicodv nicodv closed this as completed Nov 15, 2017
@nicodv
Owner

nicodv commented Nov 15, 2017

Oh, and I've included a benchmark script specifically for k-prototypes in the examples folder.
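
The bundled script in the examples folder is the authoritative reference; the snippet below is only a rough, hypothetical harness in the same spirit, timing KPrototypes on synthetic mixed-type data.

```python
import time
import numpy as np
from kmodes.kprototypes import KPrototypes

n_rows, n_num, n_cat = 10000, 3, 9
data = np.empty((n_rows, n_num + n_cat), dtype=object)
data[:, :n_num] = np.random.rand(n_rows, n_num)             # numerical columns
data[:, n_num:] = np.random.randint(0, 5, (n_rows, n_cat))  # categorical columns

start = time.time()
kp = KPrototypes(n_clusters=8, init='Huang', n_init=1, verbose=1)
kp.fit(data, categorical=list(range(n_num, n_num + n_cat)))
print('k-prototypes on %d rows took %.1f s' % (n_rows, time.time() - start))
```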
