
PB-k means something / hc issue #64

Open
alexdb27 opened this issue May 2, 2015 · 25 comments

@alexdb27
Contributor

alexdb27 commented May 2, 2015

Hello, cc @pierrepo @jbarnoud @HubLot

As suggested by the Ammmmmaaaaaaziing Spider @jbarnoud, I open a new issue.
I'm a big fan of hierarchical clustering (hc) as it is visually very simple to handle. Nonetheless, I've expressed in other issues such as #62 and #63 that hc is perhaps not appropriate when we have thousands of snapshots to compare.
With hc, you need N × (N−1)/2 comparisons to create the distance matrix (and you need to store it), then, on average, N × (N/2) computations to build the dendrogram, so we easily get to O(N^3).
We currently manage to get results for simulations of 50 or 100 ns, but when we merge them up to 850 ns... no results.
The k-means algorithm is well known and appreciated. It needs a fixed number k of clusters (as does hc, in fact, when we want to analyze the result), and then you compute only N distances about 20 times (plus an average each time), so it is quite fast.
OK, the drawback is that it needs initial values for the cluster centers. In R, at the beginning, this was an issue. It no longer is, especially when you have a lot of data.
I would be pleased if it could be used in Python with scipy.

Please share your thoughts.

PS: a small idea. If the size of the data is an issue ;-), perhaps we can (i) fix a maximum and/or (ii) if it is above the threshold, only take one snapshot every x snaps.
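To give an idea of the cost difference, here is a minimal sketch of k-means with scipy on synthetic data (the feature vectors, sizes, and k are placeholders for illustration, not PBxplore's actual input):

```python
# Minimal k-means sketch with scipy; the "snapshots" here are synthetic
# 4-D feature vectors, standing in for whatever per-frame descriptors
# PBxplore would provide (an assumption for illustration).
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(42)  # kmeans2 draws its initialization from numpy's global RNG
rng = np.random.RandomState(0)
# Two well-separated synthetic groups of 50 "snapshots" each.
data = np.vstack([rng.randn(50, 4) - 5, rng.randn(50, 4) + 5])

# Each iteration costs O(N * k) distance computations; no N x N
# distance matrix is ever built or stored, unlike with hc.
centroids, labels = kmeans2(data, 2, iter=20, minit='++')

print(len({int(l) for l in labels}))  # -> 2
```

The `minit='++'` option (k-means++ seeding) addresses exactly the initialization concern above: it spreads the initial centers far apart instead of picking them blindly.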

@pierrepo
Owner

pierrepo commented May 2, 2015

Hello,

I agree that k-means clustering is very interesting.

Concerning hierarchical clustering, Python might deal with memory differently from R and could be able to deal with a lot of conformations.

I propose that we implement the clustering with the hierarchical clustering in Python first and then try the k-means clustering.

@HubLot
Collaborator

HubLot commented May 4, 2015

For the clustering in Python, another library worth considering is scikit-learn. It's quite a dependency, but it provides more clustering methods (at least k-means, which scipy's hierarchy module doesn't have). It's worth testing.
And maybe in the future, it could be interesting to let the user choose the clustering algorithm (just an idea).

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good example.

@alexdb27, what do you mean, you don't get any results from an 850 ns file? Is R crashing? Is it due to RAM usage, or to the storage of the distance matrix?
It would be interesting to have the file to see how the Python implementation of the clustering handles it.

@HubLot
Collaborator

HubLot commented May 4, 2015

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good example.

At least, on psi_md_traj_1.pdb, scikit-learn & R gave the same results (just did a quick test)

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

On 04/05/15 17:01, Hub wrote:

The critical point of going to Python (scipy or scikit-learn) is
to have the same results as in R for the results (for HC).
psi_md_traj_1.pdb could be a good example.

At least, on |psi_md_traj_1.pdb|, scikit-learn & R gave the same
results (just did a quick test)



That's cool! Could you paste the code of your test? Knowing you it may
even be a notebook...

@HubLot
Collaborator

HubLot commented May 4, 2015

Indeed, here it is: http://nbviewer.ipython.org/gist/HubLot/9e0f76bc987489aedabe

The downside, for now, is that it's not possible to get the medoids directly in scikit-learn with hclust. I'm searching for an alternative way.
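For what it's worth, medoids can always be recovered after the fact from any labelling; a minimal numpy sketch (the function name and the tiny distance matrix are made up for illustration):

```python
# Recover per-cluster medoids from a square distance matrix and labels:
# the medoid is the member with the smallest summed distance to the
# other members of its cluster.
import numpy as np

def medoids(dist_matrix, labels):
    """Return {cluster_id: index of the cluster's medoid}."""
    labels = np.asarray(labels)
    result = {}
    for cluster in set(labels.tolist()):
        members = np.where(labels == cluster)[0]
        sub = dist_matrix[np.ix_(members, members)]
        result[cluster] = int(members[sub.sum(axis=1).argmin()])
    return result

# Tiny synthetic example: 4 points on a line, two obvious clusters.
points = np.array([0.0, 1.0, 10.0, 12.0])
dist = np.abs(points[:, None] - points[None, :])
print(medoids(dist, [0, 0, 1, 1]))  # -> {0: 0, 1: 2}
```

Since it only needs the distance matrix and the labels, this works identically for hclust and k-means output.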

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

Cool !

I added scipy to the notebook and it gives the same clusters as the others for 3 clusters. For 4 clusters, however, scipy and scikit-learn agree with each other but R disagrees a bit.

http://nbviewer.ipython.org/gist/jbarnoud/7e9ea4362e948fe41dea
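For reference, the scipy side of such a comparison boils down to a couple of calls; a sketch on synthetic data (the sizes and the cut at 2 clusters are arbitrary, chosen only for illustration):

```python
# Sketch of hierarchical clustering with scipy: build the linkage tree,
# then cut it at a requested number of clusters, as in the notebooks.
# The data is synthetic; real input would be PBxplore-derived features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
data = np.vstack([rng.randn(20, 3), rng.randn(20, 3) + 8])

# 'ward' in scipy works on the raw observations (Euclidean distances).
tree = linkage(data, method='ward')
labels = fcluster(tree, t=2, criterion='maxclust')

print(sorted({int(c) for c in labels}))  # -> [1, 2] (scipy ids start at 1)
```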

@HubLot
Collaborator

HubLot commented May 4, 2015

Interesting.
I computed the medoids in the same way as the R script. I updated the gist

@alexdb27
Contributor Author

alexdb27 commented May 4, 2015

Concerning the 850 ns... it works... it works... it works... night and day, without any core dump, crash, or output...
My guess: a RAM issue in R.

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

I updated my notebook to include your medoid function. I also increased the number of requested clusters to 5, showing more discrepancy between scipy and scikit-learn on one side and R on the other.

It should be noted that I use R version 2.14.1 (2011-12-22), which only has the 'ward' method.

@jbarnoud
Collaborator

jbarnoud commented May 4, 2015

I updated my notebook again to use the newest version of R (3.2.0, 2015-04-16). The 'ward.D2' method gives a result that differs even more.

Also, I encounter issue #66.

@HubLot
Collaborator

HubLot commented May 4, 2015

Ouch...
Looking at the source code, the ward hclust in scikit-learn is based on the scipy one, hence the same results. But for R...
After digging a little bit, maybe we misused scipy/sklearn; see:
http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage/18954990#18954990
scipy/scipy#2614
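If the misuse hypothesis is right, it is probably the condensed-versus-square input issue those links describe: scipy's linkage() interprets a 2-D array as raw observations, not as a distance matrix. A small sketch of the difference (synthetic data):

```python
# scipy's linkage() expects either raw observations or a *condensed*
# distance matrix (the flat upper triangle from pdist/squareform).
# Passing a square distance matrix makes linkage treat each row as a
# point, which silently produces a different clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
data = rng.randn(10, 4)

condensed = pdist(data)          # 45 pairwise distances for 10 points
square = squareform(condensed)   # the familiar 10 x 10 symmetric matrix

good = linkage(condensed, method='average')   # correct input
wrong = linkage(square, method='average')     # rows treated as 10-D points!

print(np.allclose(good[:, 2], wrong[:, 2]))  # -> False: merge heights differ
```

(Recent scipy versions at least emit a warning when the input looks like an uncondensed distance matrix, but they still proceed.)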

Concerning the 850 ns... it works ...it works ...it works ...it works ... night and day without any core dumped / crash or output...
My guess, RAM issue for R

Strange. Could you send me the file? I could try to see where it fails.

@alexdb27
Contributor Author

alexdb27 commented May 5, 2015

Ask Matthieu G., he has the files...
PS: it is not Ali G. (you can find 7 differences)

@HubLot
Collaborator

HubLot commented May 5, 2015

Thanks.
About the R methods, see #66

@HubLot
Collaborator

HubLot commented May 6, 2015

To sum up the results about hierarchical clustering in R vs Python (scipy), I made a notebook.

Basically:

  • The matrix input of the scipy functions differs from R's.
  • Ward with a distance matrix is not possible in scipy.
  • average and complete gave the same results in R and scipy.

@jbarnoud
Collaborator

jbarnoud commented May 6, 2015

Great test! I am quite disappointed by scipy, but there are other options to use ward with Python outside of scipy. The question now is: what criterion should we use to compare the clustering methods and figure out which one is the most appropriate?

@HubLot
Collaborator

HubLot commented May 6, 2015

I updated the notebook with scikit-learn, as its input is different. This doesn't change the conclusion.
I agree with the questions raised by @jbarnoud.

@alexdb27
Contributor Author

alexdb27 commented May 6, 2015

Excellent work @HubLot , I really like your notebook.
What is nice with Ward is that the clusters are well balanced.
What is nice with average is that the clusters are really based on a "natural rule", i.e., what is close stays close in terms of simple distance.
Complete is not too far from this one.
I'm not a big fan of single linkage, as it is like an onion, and onions make me cry like a river...

So for me it is complete > average > Ward > single

@pierrepo
Owner

pierrepo commented May 6, 2015

Very nice notebooks @jbarnoud and @HubLot
Since the complete method gives the same results for HC in R, Python/scipy, and Python/scikit-learn, I propose we implement the Python/scipy method in PBxplore (it is Python and has fewer dependencies than Python/scikit-learn).

However, I am not sure hierarchical clustering is the best clustering method here. As mentioned by @alexdb27, it is

visually very simple to handle

but I am not sure the visual we can get here is meaningful. Indeed, the distance we use is quite coarse, and I do not know how to interpret the fact that two clusters are close to each other and far from a third one.

So what do you believe is the most useful to implement in PBxplore?

  • HC/complete with Python/scipy
  • K-means with Python/scipy
  • both ?

In any case, I advocate to remove R from the clustering process. This will be easier to install and to maintain.

@alexdb27
Contributor Author

alexdb27 commented May 6, 2015

Both !!!!

@jbarnoud
Collaborator

jbarnoud commented Jul 7, 2015

After we discussed about it with @alexdb27 and @HubLot, I started to implement the k-means.

@pierrepo
Owner

Hi @jbarnoud. Is everything OK with the k-means implementation?
I'd like to get rid of R as soon as possible.

@alexdb27
Contributor Author

https://pl.wikipedia.org/wiki/Kmin_rzymski a new implementation ?

@jbarnoud
Collaborator

Hey! Here is a prototype of K-means for PBxplore: https://gist.github.com/jbarnoud/fc27c5048d6e8f394598

This notebook implements the K-means algorithm and tries to visualize the clusters that are produced. I am looking for a way to validate the clustering. Any idea?

@pierrepo @alexdb27 @HubLot
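One common validation criterion is the silhouette score: for each point, compare its mean distance a to its own cluster against its mean distance b to the nearest other cluster; (b − a) / max(a, b) approaches 1 for compact, well-separated clusters. A minimal numpy sketch (purely illustrative; scikit-learn ships a ready-made silhouette_score if that dependency is acceptable):

```python
# Silhouette score computed from a square distance matrix and labels.
# Scores near 1 mean tight, well-separated clusters; near 0, overlap.
import numpy as np

def silhouette(dist, labels):
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():          # singleton cluster: 0 by convention
            scores.append(0.0)
            continue
        a = dist[i][same].mean()
        b = min(dist[i][labels == other].mean()
                for other in set(labels.tolist()) - {int(labels[i])})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, distant groups on a line score close to 1.
points = np.array([0.0, 0.1, 10.0, 10.1])
dist = np.abs(points[:, None] - points[None, :])
print(round(silhouette(dist, [0, 0, 1, 1]), 2))  # -> 0.99
```

It also gives a way to compare k values: run k-means for several k and keep the one with the best mean silhouette.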

@jbarnoud
Collaborator

I updated the K-means notebook 2 days ago but I don't know if you got notified.

@pierrepo @alexdb27 @HubLot

@jbarnoud jbarnoud modified the milestone: 1.3 Oct 26, 2015
@jbarnoud jbarnoud mentioned this issue Oct 26, 2015
@pierrepo
Owner

pierrepo commented Apr 8, 2016

@HubLot and @jbarnoud you did a great job on this issue.
PR #106 is implementing the k-means method.
In order not to lose the work you previously did on hc clustering, could you please add to PBxplore a simple notebook explaining how to do hc clustering with the PBxplore API and either scipy or scikit-learn?
It could be a simple reformatting of this notebook:
http://nbviewer.jupyter.org/gist/jbarnoud/7e9ea4362e948fe41dea

@pierrepo pierrepo modified the milestones: 1.4, 1.3 Apr 8, 2016