Get rid of R #13

Closed
jbarnoud opened this issue Sep 25, 2014 · 11 comments

Comments

@jbarnoud
Collaborator

PBxplore uses R to draw figures and to perform hierarchical clustering. This implies a heavy dependence on R. It also means that maintaining the software requires knowing two languages instead of one. Finally, as R is called via subprocess rather than via rpy, intermediate scripts have to be written, which is error-prone and difficult to maintain.

Everything done with R could be done in Python.

Figures could be drawn using the matplotlib library. This would add a dependency on the matplotlib Python package, yet depending on a Python package is lighter in a Python environment than depending on an external interpreter. Also, the Python scientific stack is already required through the numpy package.
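As a rough illustration of the matplotlib route, the sketch below draws a per-position curve (for instance the Neq vs sequence plot mentioned later in this thread) and writes it to a PNG file without needing a display. The function name `plot_per_position`, the axis labels, and the values are made up for the example; this is not PBxplore code.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display required
import matplotlib.pyplot as plt
import numpy as np


def plot_per_position(positions, values, path, ylabel="Neq"):
    """Plot one value per residue position and save the figure to `path`.

    Hypothetical helper for illustration only.
    """
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(positions, values, marker="o")
    ax.set_xlabel("Residue position")
    ax.set_ylabel(ylabel)
    ax.set_ylim(bottom=0)
    fig.savefig(path, dpi=150)
    plt.close(fig)  # release the figure's memory
    return path


# Made-up values, only to have something to draw.
positions = np.arange(1, 21)
values = 1 + 3 * np.abs(np.sin(positions / 3.0))
out_path = plot_per_position(
    positions, values, os.path.join(tempfile.mkdtemp(), "neq.png")
)
```

Because the figure is produced in the same process, no intermediate R script has to be generated and run through subprocess.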

Hierarchical clustering is available in the scipy module: http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html. Like matplotlib, scipy is part of the classical Python scientific stack. Other Python modules implement hierarchical clustering (see http://nbviewer.ipython.org/github/OxanaSachenkova/hclust-python/blob/master/hclust.ipynb and http://bioinformatics.org.au/tools/hclust/) but they would create dependencies on less common Python modules.
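A minimal sketch of what the scipy route could look like, assuming the clustering starts from a precomputed symmetric distance matrix between sequences. The matrix below is made up, and average linkage is chosen only for the example (scipy's Ward linkage is meant for Euclidean distances); this is not the method PBxplore uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric distance matrix between 5 items:
# items 0, 1, 2 are close to each other, and so are items 3 and 4.
distances = np.array([
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.80, 0.95],
    [0.90, 0.85, 0.80, 0.00, 0.10],
    [0.80, 0.90, 0.95, 0.10, 0.00],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(distances)

# Build the tree, then cut it to obtain at most 2 flat clusters.
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
```

Here `labels` holds one cluster number per item, which plays roughly the same role as the output of `cutree` applied to an `hclust` result in R.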

R is called in two files: PBstat.py and PBclust.py.

@pierrepo
Owner

Thanks Jonathan.

You are absolutely right regarding the use of Scipy+Matplotlib versus R. This is even more true when you see the beautiful figures we can obtain with Matplotlib: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003833

Unfortunately, I don't know how to use either Matplotlib or Scipy. If you have some time I would be more than happy to merge your contributions.

@jbarnoud
Collaborator Author

jbarnoud commented Oct 1, 2014

I am a bit busy right now but I'll have a look. It should not be very difficult to do.

@pierrepo
Owner

pierrepo commented Oct 3, 2014

I agree. I'll publish a dev branch as soon as I have something interesting.

@jbarnoud
Collaborator Author

jbarnoud commented Oct 7, 2014

I am looking at the clustering. I'll try to send a pull request as soon as possible.

@alexdb27
Contributor

One question (I'm playing devil's advocate): Scipy+Matplotlib also evolve. R seems quite stable; is it the same for both?
We need a comparison on some simple examples...
PS: is it easy to use on a Mac?
PS2: isn't it a lot of work?

@jbarnoud
Collaborator Author

On 22/04/15 10:20, Alexandre G. de Brevern wrote:

One question (I'm playing devil's advocate): Scipy+Matplotlib also evolve.
R seems quite stable; is it the same for both?
We need a comparison on some simple examples...
PS: is it easy to use on a Mac?
PS2: isn't it a lot of work?

Scipy and Matplotlib are quite stable and well tested. They are as easy
to install on a Mac as most Python libraries, as they can be installed
using pip. Being very popular Python packages, they can also be installed
using package managers like MacPorts or Fink.

There are two main reasons to get rid of R:

  • without R, PBxplore will be easier to install. There will not be fewer
    dependencies, as R will be replaced by other dependencies. Yet, these
    new dependencies can be installed with the same process as other
    packages we already depend on (like numpy). Ultimately, we will be
    able to install PBxplore with a single command line.
  • having 2 languages to handle makes the software more difficult to maintain.

Removing R is some work. But it is not an awful lot either. It may not
be a priority but we will really benefit from it.

@pierrepo
Owner

I agree with @jbarnoud.
Actually, R is not that stable. See for instance the issue we had recently regarding the name of the Ward method to use in hclust (issue #31).
If users are able to install numpy (a major requirement for PBxplore) on their computer, they will be able to install scipy and matplotlib as well.

@alexdb27
Contributor

  • One point: R is stable for most of the common functions.
    hclust has been considered reliable for more than 10 years (is that the case for Python libraries such as Biopython or similar? It seems not). The Ward question is a specific case.
  • If scipy and matplotlib are simple to install for everyone, that is excellent.
  • My main question is the quality of the clustering with the various Python libraries. We need some tests to ensure it. I'm sure it will be fine.
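One possible way to write such a test, sketched under assumptions: compare the cluster assignments produced by two implementations on the same data, ignoring how the clusters are numbered. The helper `rand_index` and the label lists below are hypothetical; they only illustrate the idea of checking that a Python clustering reproduces a reference partition.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put the two items together,
    or both put them apart. A value of 1.0 means identical partitions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical assignments: a reference partition (e.g. exported from R)
# and two candidate partitions from another implementation.
reference = [1, 1, 2, 2, 3]
candidate = [2, 2, 3, 3, 1]  # same grouping, different cluster numbers
different = [1, 2, 2, 3, 3]  # genuinely different grouping

same_score = rand_index(reference, candidate)
diff_score = rand_index(reference, different)
```

A regression test could then assert that the score between the R reference and the Python result stays at 1.0 (or above an agreed threshold) on a few small, fixed inputs.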

@alexdb27
Contributor

So, where are we now?

@pierrepo
Owner

Regarding graphics, we are almost done. The PB distribution map is now in Python (see issue #52) and Neq vs sequence will soon be in Python too.
The last point is the clustering. Maybe we could implement just one clustering method in Python (hclust or k-means). Ping @jbarnoud?

@jbarnoud jbarnoud modified the milestone: 1.3 Oct 26, 2015
@jbarnoud
Collaborator Author

The only remaining dependency on R is the clustering, which we will revamp anyway. I am marking this issue as a duplicate of #64 and closing it.
