PB-k means something / hc issue #64
Hello, I agree that k-means clustering is very interesting. Concerning hierarchical clustering, Python might handle memory differently from R and could cope with many more conformations. I propose that we implement hierarchical clustering in Python first and then try k-means.
For clustering in Python, another library to consider is scikit-learn. It's quite a heavy dependency, but it provides more clustering methods (at least k-means, which scipy doesn't have). It's worth testing. The critical point of moving to Python (scipy or scikit-learn) is to get the same hierarchical clustering results as in R. @alexdb27, what do you mean you get no results from an 850 ns file? Is R crashing? Is it due to RAM usage, or to storing the distance matrix?
Indeed, here it is: http://nbviewer.ipython.org/gist/HubLot/9e0f76bc987489aedabe The downside, for now, is that it's not possible to get the medoid directly in scikit-learn with hierarchical clustering. I'm searching for an alternative way.
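For what it's worth, here is a minimal sketch (my own, not the notebook's code) of one way to recover a medoid once cluster labels are available; `dist_matrix` (a square pairwise distance matrix) and `labels` are assumed names, not part of the gist:

```python
import numpy as np

def medoid(dist_matrix, labels, cluster_id):
    """Return the index of the medoid of one cluster.

    The medoid is the cluster member whose summed distance to all
    other members of the same cluster is minimal.
    """
    members = np.where(labels == cluster_id)[0]
    # Restrict the distance matrix to the cluster members.
    sub = dist_matrix[np.ix_(members, members)]
    # Pick the member with the smallest total distance to the others.
    return members[sub.sum(axis=1).argmin()]
```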
Cool! I added scipy to the notebook and it gives the same clusters as the others for 3 clusters. For 4 clusters, however, scipy and scikit-learn agree with each other, but R disagrees a bit. http://nbviewer.ipython.org/gist/jbarnoud/7e9ea4362e948fe41dea
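For context, the scipy route in such a notebook boils down to something like this sketch (`dist_matrix` is an assumed square distance matrix, not a name from the gist):

```python
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# dist_matrix: assumed square, symmetric matrix of pairwise distances
# between snapshots. scipy expects the condensed (flattened upper
# triangle) form rather than the full square matrix.
condensed = squareform(dist_matrix)
Z = linkage(condensed, method='ward')
# Cut the dendrogram so that exactly 4 clusters remain.
labels = fcluster(Z, t=4, criterion='maxclust')
```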
Interesting.
Concerning the 850 ns trajectory... it runs... and runs... and runs... night and day, without any core dump, crash, or output...
I updated my notebook to include your medoid function. I also increased the number of requested clusters to 5, which shows more discrepancy between scipy and scikit-learn on one side and R on the other. It should be noted that I use R version 2.14.1 (2011-12-22), which only has the 'ward' method.
I updated my notebook again to use the newest version of R (3.2.0, 2015-04-16). The 'ward.D2' method gives an even more different result. Also, I ran into issue #66.
Ouch...
Strange. Could you send me the file? I could try to see where it fails.
Thanks.
To sum up the results about hierarchical clustering in R vs Python (scipy), I made a notebook.
Great test! I am quite disappointed by scipy, but there are other options to use Ward with Python outside of scipy. The question now is: what criterion should we use to compare the clustering methods and figure out which one is the most appropriate?
I updated the notebook with scikit-learn, as its input is different. This doesn't change the conclusion.
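For reference, a minimal sketch of the scikit-learn side, which illustrates why the input is different: Ward in scikit-learn works on raw feature vectors rather than on a precomputed distance matrix (`X` is an assumed name, one row per snapshot):

```python
from sklearn.cluster import AgglomerativeClustering

# X: assumed matrix of feature vectors, one row per snapshot.
# Ward linkage in scikit-learn needs raw features, not a precomputed
# distance matrix, hence the different input compared to scipy.
model = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = model.fit_predict(X)
```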
Excellent work @HubLot, I really like your notebook. So for me it is complete > average > Ward > linkage.
Very nice notebooks @jbarnoud and @HubLot! However, I am not sure hierarchical clustering is the best clustering method here. As mentioned by @alexdb27, it is visually very simple to handle, but I am not sure the visual we can get here is meaningful. Indeed, the distance we use is quite coarse, and I do not know how to interpret the fact that two clusters are close to each other and far from a third one. So what do you believe is the most useful to implement in PBxplore?
In any case, I advocate removing R from the clustering process. It will make PBxplore easier to install and maintain.
Both!!!!
Hi @jbarnoud. Is everything OK with the k-means implementation?
https://pl.wikipedia.org/wiki/Kmin_rzymski ('kmin rzymski' is Polish for cumin). A new implementation?
Hey! Here is a prototype of k-means for PBxplore: https://gist.github.com/jbarnoud/fc27c5048d6e8f394598 This notebook implements the k-means algorithm and tries to visualize the clusters that are produced. I am looking for a way to validate the clustering. Any ideas?
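One possible validation criterion, offered as a suggestion rather than anything the notebook already does, is the silhouette score, which scikit-learn can compute directly from a precomputed distance matrix and cluster labels (both names are assumptions):

```python
from sklearn.metrics import silhouette_score

# dist_matrix and labels are assumed inputs: the pairwise distance
# matrix between snapshots and the cluster assignment to validate.
# Scores near 1 indicate compact, well-separated clusters; scores
# near 0 indicate overlapping ones.
score = silhouette_score(dist_matrix, labels, metric='precomputed')
```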
@HubLot and @jbarnoud, you did a great job on this issue.
Hello, cc @pierrepo @jbarnoud @HubLot
As suggested by the Ammmmmaaaaaaziing Spider @jbarnoud, I am opening a new issue.
I'm a big fan of hierarchical clustering (hc), as it is visually very simple to handle. Nonetheless, I've expressed in other issues, such as #62 and #63, the fact that hc is perhaps not appropriate when we have thousands of snapshots to compare.
With hc, you need N × (N − 1) / 2 distance computations to create the distance matrix (and you need to store it); then, to build the dendrogram, you scan on the order of N²/2 pairs at each of the N merge steps, so we easily reach O(N³).
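To put concrete numbers on this (figures assumed for illustration, not taken from the thread): with N = 100,000 snapshots, the distance matrix alone holds N × (N − 1) / 2 ≈ 5 × 10⁹ distances, i.e. roughly 40 GB as 8-byte floats, before the dendrogram computation even starts.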
We currently manage to get results for simulations of 50 or 100 ns, but when we merge them into an 850 ns trajectory... no results.
The k-means algorithm is well known and appreciated. It needs a fixed number k of clusters (as does hc, in fact, at the point where we analyse the result), and then per iteration you compute only N × k distances (one per snapshot-centroid pair) plus k averages, for something like 20 iterations. So it is quite fast.
OK, the drawback is that it needs initial values for the cluster centres. In R, at the beginning, this was an issue; it no longer is, especially when you have a lot of data.
I would be pleased if it could be used in Python with scipy.
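As a minimal sketch of what the scipy route could look like (the feature matrix `X`, one row per snapshot, and the choice k = 4 are assumptions for illustration, not part of PBxplore):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(42)  # k-means depends on its random initialisation

X = np.random.rand(1000, 30)  # hypothetical: one feature vector per snapshot
k = 4                         # number of clusters, fixed in advance as with hc

# 'points' picks k random snapshots as the initial centroids.
centroids, labels = kmeans2(X, k, iter=20, minit='points')
# labels[i] is the cluster assigned to snapshot i; centroids holds
# the k cluster averages recomputed at each iteration.
```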
Please share your thoughts.
PS: a small idea. If the size of the data is an issue ;-), perhaps we can (i) fix a maximum and/or (ii) if there are more snapshots than a threshold, only keep one snapshot every x frames.
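For idea (ii), a minimal sketch (the cap of 10,000 snapshots and all names are mine, purely illustrative):

```python
max_snapshots = 10000  # hypothetical cap on the number of snapshots
if len(X) > max_snapshots:
    stride = -(-len(X) // max_snapshots)  # ceiling division
    X = X[::stride]  # keep one snapshot every `stride` frames
```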