
PB-k means something / hc issue #64

Open
alexdb27 opened this issue May 2, 2015 · 25 comments

@alexdb27
Contributor

alexdb27 commented May 2, 2015

Hello, cc @pierrepo @jbarnoud @HubLot

As suggested by the Ammmmmaaaaaaziing Spider @jbarnoud, I open a new issue.
I'm a big fan of hierarchical clustering (hc) as it is visually very simple to handle. Nonetheless, I've expressed in other issues such as #62 and #63 that hc is perhaps not appropriate when we have thousands of snapshots to compare.
With hc, you need N × (N−1)/2 comparisons to create the distance matrix (and you need to store it), then, on average, N × (N/2) computations to build the dendrogram, so we easily get to O(N^3).
We currently manage to get results for simulations of 50 or 100 ns, but when we merge them up to 850 ns... no results.
The k-means algorithm is well known and appreciated. It needs a fixed number k of clusters (as does hc, in fact, when we want to analyze the result), and then you compute only N distances about 20 times (plus an average each time), so it is quite fast.
OK, the drawback is that it needs initial values for the cluster centers. In R, at the beginning, this was an issue. It no longer is, especially when you have a lot of data.
I would be pleased if it could be used in Python with scipy.

Please share your thoughts.

PS: a small idea. If the size of the data is an issue ;-), perhaps we can (i) fix a maximum and/or (ii) if it is above the threshold, only take one snapshot every x snaps.
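To give an idea of the cost difference, here is a minimal sketch of k-means with scipy on synthetic data (the feature vectors, sizes, and k are placeholders for illustration, not PBxplore's actual input):

```python
# Minimal k-means sketch with scipy; the "snapshots" here are synthetic
# 4-D feature vectors, standing in for whatever per-frame descriptors
# PBxplore would provide (an assumption for illustration).
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(42)  # kmeans2 draws its initialization from numpy's global RNG
rng = np.random.RandomState(0)
# Two well-separated synthetic groups of 50 "snapshots" each.
data = np.vstack([rng.randn(50, 4) - 5, rng.randn(50, 4) + 5])

# Each iteration costs O(N * k) distance computations; no N x N
# distance matrix is ever built or stored, unlike with hc.
centroids, labels = kmeans2(data, 2, iter=20, minit='++')

print(len({int(l) for l in labels}))  # -> 2
```

The `minit='++'` option (k-means++ seeding) addresses exactly the initialization concern above: it spreads the initial centers far apart instead of picking them blindly.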

@pierrepo
Owner

pierrepo commented May 2, 2015

Hello,

I agree that k-means clustering is very interesting.

Concerning hierarchical clustering, Python might deal with memory differently from R and could be able to deal with a lot of conformations.

I propose that we implement the clustering with the hierarchical clustering in Python first and then try the k-means clustering.

@HubLot
Collaborator

HubLot commented May 4, 2015

For the clustering in Python, another library worth considering is scikit-learn. It's quite a dependency, but it provides more clustering methods (at least k-means, which scipy's hierarchy module doesn't have). It's worth testing.
And maybe in the future, it could be interesting to let the user choose the clustering algorithm (just an idea).

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good example.

@alexdb27, what do you mean, you don't get any results from an 850 ns file? Is R crashing? Is it due to RAM usage, or to the storage of the distance matrix?
It would be interesting to have the file to see how the Python implementation of the clustering handles it.

@HubLot
Collaborator

HubLot commented May 4, 2015

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good example.

At least, on psi_md_traj_1.pdb, scikit-learn & R gave the same results (just did a quick test)

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

On 04/05/15 17:01, Hub wrote:

The critical point of going to Python (scipy or scikit-learn) is
to have the same results as in R for the results (for HC).
psi_md_traj_1.pdb could be a good example.

At least, on |psi_md_traj_1.pdb|, scikit-learn & R gave the same
results (just did a quick test)



That's cool! Could you paste the code of your test? Knowing you it may
even be a notebook...

@HubLot
Collaborator

HubLot commented May 4, 2015

Indeed, here it is: http://nbviewer.ipython.org/gist/HubLot/9e0f76bc987489aedabe

The downside, for now, is that it's not possible to get the medoids directly in scikit-learn with hclust. I'm searching for an alternative way.
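For what it's worth, medoids can always be recovered after the fact from any labelling; a minimal numpy sketch (the function name and the tiny distance matrix are made up for illustration):

```python
# Recover per-cluster medoids from a square distance matrix and labels:
# the medoid is the member with the smallest summed distance to the
# other members of its cluster.
import numpy as np

def medoids(dist_matrix, labels):
    """Return {cluster_id: index of the cluster's medoid}."""
    labels = np.asarray(labels)
    result = {}
    for cluster in set(labels.tolist()):
        members = np.where(labels == cluster)[0]
        sub = dist_matrix[np.ix_(members, members)]
        result[cluster] = int(members[sub.sum(axis=1).argmin()])
    return result

# Tiny synthetic example: 4 points on a line, two obvious clusters.
points = np.array([0.0, 1.0, 10.0, 12.0])
dist = np.abs(points[:, None] - points[None, :])
print(medoids(dist, [0, 0, 1, 1]))  # -> {0: 0, 1: 2}
```

Since it only needs the distance matrix and the labels, this works identically for hclust and k-means output.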

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

Cool !

I added scipy to the notebook and it gives the same clusters as the others for 3 clusters. For 4 clusters, however, scipy and scikit-learn agree with each other but R disagrees a bit.

http://nbviewer.ipython.org/gist/jbarnoud/7e9ea4362e948fe41dea
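For reference, the scipy side of such a comparison boils down to a couple of calls; a sketch on synthetic data (the sizes and the cut at 2 clusters are arbitrary, chosen only for illustration):

```python
# Sketch of hierarchical clustering with scipy: build the linkage tree,
# then cut it at a requested number of clusters, as in the notebooks.
# The data is synthetic; real input would be PBxplore-derived features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
data = np.vstack([rng.randn(20, 3), rng.randn(20, 3) + 8])

# 'ward' in scipy works on the raw observations (Euclidean distances).
tree = linkage(data, method='ward')
labels = fcluster(tree, t=2, criterion='maxclust')

print(sorted({int(c) for c in labels}))  # -> [1, 2] (scipy ids start at 1)
```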

@HubLot
Collaborator

HubLot commented May 4, 2015

Interesting.
I computed the medoids in the same way as the R script. I updated the gist

@alexdb27
Contributor Author

alexdb27 commented May 4, 2015

Concerning the 850 ns... it works... it works... it works... night and day, without any core dump, crash, or output...
My guess: a RAM issue in R.

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

I updated my notebook to include your medoid function. I also increased the number of requested clusters to 5, showing more discrepancy between scipy and scikit-learn on one side and R on the other.

It should be noted that I use R version 2.14.1 (2011-12-22), which only has the 'ward' method.

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

I updated my notebook again to use the newest version of R (3.2.0, 2015-04-16). The 'ward.D2' method gives a result that differs even more.

Also, I encounter issue #66.

@HubLot
Collaborator

HubLot commented May 4, 2015

Ouch...
Looking at the source code, the ward hclust in scikit-learn is based on the scipy one, hence the same results. But for R...
After digging a little bit, maybe we misused scipy/sklearn; see:
http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage/18954990#18954990
scipy/scipy#2614
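If the misuse hypothesis is right, it is probably the condensed-versus-square input issue those links describe: scipy's linkage() interprets a 2-D array as raw observations, not as a distance matrix. A small sketch of the difference (synthetic data):

```python
# scipy's linkage() expects either raw observations or a *condensed*
# distance matrix (the flat upper triangle from pdist/squareform).
# Passing a square distance matrix makes linkage treat each row as a
# point, which silently produces a different clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
data = rng.randn(10, 4)

condensed = pdist(data)          # 45 pairwise distances for 10 points
square = squareform(condensed)   # the familiar 10 x 10 symmetric matrix

good = linkage(condensed, method='average')   # correct input
wrong = linkage(square, method='average')     # rows treated as 10-D points!

print(np.allclose(good[:, 2], wrong[:, 2]))  # -> False: merge heights differ
```

(Recent scipy versions at least emit a warning when the input looks like an uncondensed distance matrix, but they still proceed.)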

Concerning the 850 ns... it works ...it works ...it works ...it works ... night and day without any core dumped / crash or output...
My guess, RAM issue for R

Strange. Could you send me the file? I could try to see where it fails.

@alexdb27
Contributor Author

alexdb27 commented May 5, 2015

Ask Matthieu G., he has the files...
PS: it is not Ali G. (you can find 7 differences)

@HubLot
Collaborator

HubLot commented May 5, 2015

Thanks.
About the R methods, see #66

@HubLot
Collaborator

HubLot commented May 6, 2015

To sum up the results about hierarchical clustering in R vs Python (scipy), I made a notebook.

Basically:

  • The matrix input of the scipy functions differs from R's.
  • Ward with a distance matrix is not possible in scipy.
  • average and complete gave the same results in R and scipy.

@jbarnoud
Collaborator

jbarnoud commented May 6, 2015

Great test! I am quite disappointed by scipy, but there are other options to use ward with Python outside of scipy. The question now is: what criterion should we use to compare the clustering methods and figure out which one is the most appropriate?

@HubLot
Collaborator

HubLot commented May 6, 2015

I updated the notebook with scikit-learn, as its input is different. This doesn't change the conclusion.
I agree with the questions raised by @jbarnoud.

@alexdb27
Contributor Author

alexdb27 commented May 6, 2015

Excellent work @HubLot , I really like your notebook.
What is nice with Ward is that the clusters are well balanced.
What is nice with average is that the clusters are really based on a "natural rule", i.e., what is close stays close in terms of simple distance.
Complete is not too far from this one.
I'm not a big fan of single linkage, as it is like an onion, and onions make me cry like a river...

So for me it is complete > average > Ward > single

@pierrepo
Owner

pierrepo commented May 6, 2015

Very nice notebooks @jbarnoud and @HubLot
Since the complete method gives the same results for HC in R, Python/scipy, and Python/scikit-learn, I propose we implement the Python/scipy method in PBxplore (it is Python and has fewer dependencies than Python/scikit-learn).

However, I am not sure hierarchical clustering is the best clustering method here. As mentioned by @alexdb27, it is

visually very simple to handle

but I am not sure the visual we can get here is meaningful. Indeed, the distance we use is quite coarse, and I do not know how to interpret the fact that two clusters are close to each other and far from a third one.

So what do you believe is the most useful to implement in PBxplore?

  • HC/complete with Python/scipy
  • K-means with Python/scipy
  • both ?

In any case, I advocate to remove R from the clustering process. This will be easier to install and to maintain.

@alexdb27
Contributor Author

alexdb27 commented May 6, 2015

Both !!!!

@jbarnoud
Collaborator

jbarnoud commented Jul 7, 2015

After we discussed about it with @alexdb27 and @HubLot, I started to implement the k-means.

@pierrepo
Owner

Hi @jbarnoud. Is everything OK with the k-means implementation?
I'd like to get rid of R as soon as possible.

@alexdb27
Contributor Author

https://pl.wikipedia.org/wiki/Kmin_rzymski a new implementation ?

@jbarnoud
Collaborator

Hey! Here is a prototype of K-means for PBxplore: https://gist.github.com/jbarnoud/fc27c5048d6e8f394598

This notebook implements the K-means algorithm and tries to visualize the clusters that are produced. I am looking for a way to validate the clustering. Any idea?

@pierrepo @alexdb27 @HubLot
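One common validation criterion is the silhouette score: for each point, compare its mean distance a to its own cluster against its mean distance b to the nearest other cluster; (b − a) / max(a, b) approaches 1 for compact, well-separated clusters. A minimal numpy sketch (purely illustrative; scikit-learn ships a ready-made silhouette_score if that dependency is acceptable):

```python
# Silhouette score computed from a square distance matrix and labels.
# Scores near 1 mean tight, well-separated clusters; near 0, overlap.
import numpy as np

def silhouette(dist, labels):
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():          # singleton cluster: 0 by convention
            scores.append(0.0)
            continue
        a = dist[i][same].mean()
        b = min(dist[i][labels == other].mean()
                for other in set(labels.tolist()) - {int(labels[i])})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, distant groups on a line score close to 1.
points = np.array([0.0, 0.1, 10.0, 10.1])
dist = np.abs(points[:, None] - points[None, :])
print(round(silhouette(dist, [0, 0, 1, 1]), 2))  # -> 0.99
```

It also gives a way to compare k values: run k-means for several k and keep the one with the best mean silhouette.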

@jbarnoud
Collaborator

I updated the K-means notebook 2 days ago but I don't know if you got notified.

@pierrepo @alexdb27 @HubLot

@jbarnoud jbarnoud modified the milestone: 1.3 Oct 26, 2015
@jbarnoud jbarnoud mentioned this issue Oct 26, 2015
@pierrepo
Owner

pierrepo commented Apr 8, 2016

@HubLot and @jbarnoud you did a great job on this issue.
PR #106 is implementing the k-means method.
In order not to lose the work you previously did on hc clustering, could you please add to PBxplore a simple notebook explaining how to do hc clustering with the PBxplore API and either scipy or scikit-learn?
It could be a simple reformatting of this notebook:
http://nbviewer.jupyter.org/gist/jbarnoud/7e9ea4362e948fe41dea

@pierrepo pierrepo modified the milestones: 1.4, 1.3 Apr 8, 2016