Error occurs when running dataset has less than 4096 columns #71

Open
Xiaojieqiu opened this issue May 23, 2018 · 21 comments

@Xiaojieqiu

Xiaojieqiu commented May 23, 2018

Hi Leland,

Thanks for this incredible method and algorithm. We love it! However, we ran into one annoying issue when running UMAP with metric="correlation" on an input matrix (in sparseMatrix format) that has fewer than 4096 columns. It throws this error:

Error in py_call_impl(callable, dots$args, dots$keywords) :
TypeError: scipy distance metrics do not support sparse matrices.

We found that this issue can be fixed by simply removing the following lines of code from umap_.py:

# Handle small cases efficiently by computing all distances
if X.shape[0] < 4096:
    dmat = pairwise_distances(X, metric=self.metric, **self.metric_kwds)
    self.graph = fuzzy_simplicial_set(
        dmat,
        self.n_neighbors,
        random_state,
        'precomputed',
        self.metric_kwds,
        self.angular_rp_forest,
        self.set_op_mix_ratio,
        self.local_connectivity,
        self.bandwidth,
        self.verbose
    )
else:

This issue may be related to #12.
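The error reported above can be reproduced outside UMAP. The following is a minimal sketch, assuming a recent scikit-learn and a small random matrix standing in for the real data (the matrix, the seed, and the helper name `try_correlation` are illustrative, not from the thread). It calls `pairwise_distances` with metric="correlation" on the same data in sparse and dense form.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import pairwise_distances

# Illustrative data: 20 samples, 50 features, roughly 70% zeros.
rng = np.random.default_rng(0)
X_dense = rng.random((20, 50))
X_dense[X_dense < 0.7] = 0.0
X_sparse = sp.csr_matrix(X_dense)


def try_correlation(X):
    """Return the distance matrix, or the error message if sklearn refuses."""
    try:
        return pairwise_distances(X, metric="correlation")
    except TypeError as exc:
        return str(exc)


sparse_result = try_correlation(X_sparse)  # expected to be an error message
dense_result = try_correlation(X_dense)    # expected to be a 20 x 20 array

print(type(sparse_result).__name__, type(dense_result).__name__)
```

Correlation is one of the metrics that scikit-learn delegates to scipy's dense distance routines, which is why only the sparse call fails.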

@lmcinnes
Owner

Hmm, I had assumed sklearn's pairwise_distances would support sparse matrices, but it looks like it doesn't for correlation. I can patch this directly by adding "and metric != 'correlation'" to the code, but I am actually starting to wonder whether the sparse correlation metric I have really works as it should. I think, more accurately, correlation is not really a metric supported for sparse matrices. Ultimately I need to fix that problem. In the meantime you can either convert the matrix to dense, or continue what you are doing if that works for you.
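The "convert the matrix to dense" workaround can be applied on the caller's side before the data reaches UMAP. This is only a sketch: the helper name `prepare_for_small_path` and the constant `SMALL_DATA_THRESHOLD` are hypothetical, and the 4096 threshold is copied from the snippet quoted earlier in the thread rather than read from any UMAP setting.

```python
import numpy as np
import scipy.sparse as sp

# Assumption: mirrors the `X.shape[0] < 4096` check quoted in the issue.
SMALL_DATA_THRESHOLD = 4096


def prepare_for_small_path(X, metric):
    """Hypothetical helper: densify small sparse input for the correlation metric.

    Larger inputs and other metrics are returned unchanged.
    """
    is_small = X.shape[0] < SMALL_DATA_THRESHOLD
    if is_small and sp.issparse(X) and metric == "correlation":
        return X.toarray()
    return X


X = sp.random(100, 30, density=0.2, format="csr", random_state=0)
X_for_correlation = prepare_for_small_path(X, "correlation")  # dense ndarray
X_for_cosine = prepare_for_small_path(X, "cosine")            # still sparse
```

Densifying costs rows × columns × 8 bytes for float64 data, so this is only reasonable for the small inputs that the brute-force branch targets.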

@Xiaojieqiu
Author

Hi Leland,

Thanks for your prompt response! If you can patch the repo by adding "and metric != 'correlation'" or something similar, that would be great.

However, I am a little concerned about your comment that correlation is not really a metric supported for sparse matrices. Could you please clarify that? Do you mean that correlation cannot be calculated from sparse matrices mathematically (which I don't believe is the case)? Or do you mean that your implementation of correlation for large datasets (more than 4096 samples) with sparse matrices is potentially problematic? Right now we find that the correlation metric gives us great results on large datasets with sparse matrices, although we hit this issue on small datasets. So I doubt that your implementation is problematic.

@lmcinnes
Owner

lmcinnes commented May 23, 2018 via email

@Xiaojieqiu
Author

Thank you, Leland! Please let me know what the potential issue is with your current sparse-matrix-based correlation distance calculation.

@lmcinnes
Owner

lmcinnes commented May 24, 2018 via email

lmcinnes added a commit that referenced this issue May 24, 2018
@lmcinnes
Owner

Okay, I believe the current master should now fix your issue, and correctly compute correlation distance for sparse matrix input data. It may be a little slower as the correct computation is a little more expensive. You may want to consider cosine distance as a faster alternative that will be similar.
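For context on why cosine is a natural stand-in: correlation distance is cosine distance computed after subtracting each vector's mean. The sketch below checks that identity with scipy on two illustrative random vectors (the data and variable names are assumptions, not from the thread).

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine

# Illustrative vectors.
rng = np.random.default_rng(42)
u = rng.random(100)
v = rng.random(100)

# Correlation distance as scipy defines it (vectors are centered internally).
corr_dist = correlation(u, v)

# Cosine distance on explicitly mean-centered copies of the same vectors.
centered_cos_dist = cosine(u - u.mean(), v - v.mean())

# Plain cosine distance, with no centering.
plain_cos_dist = cosine(u, v)

print(corr_dist, centered_cos_dist, plain_cos_dist)
```

Subtracting the mean turns every zero entry into a nonzero one, so a correct sparse correlation has to account for the means separately. That is one plausible reason for the extra cost mentioned above; the thread itself does not spell out the details.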

@Xiaojieqiu
Author

That is great. Thank you for the quick fix and the detailed explanation; both have been very helpful. I will check the updated version now. (And we may close this issue for now?)

@lmcinnes
Owner

lmcinnes commented May 24, 2018 via email

@Xiaojieqiu
Author

By the way, one quick suggestion: since you mentioned that the fix is slower, would it be possible to check the data first to see whether it contains zero values, apply the fix only in that case, and otherwise use what we had before?

@lmcinnes
Owner

lmcinnes commented May 24, 2018 via email

@Xiaojieqiu
Author

Thank you for pointing this out! I am closing the issue for now.

@Xiaojieqiu
Author

@lmcinnes Hi Leland, thanks again for fixing this issue! I wonder whether you could release umap with this fix to PyPI. We are planning to release another R package (I can provide more details later) that will depend on UMAP, and we would like users to be able to use the newest version of umap for their analysis directly from R. Unfortunately, R only provides ways to install Python packages from PyPI, not from GitHub.

@Xiaojieqiu Xiaojieqiu reopened this Jun 15, 2018
@lmcinnes
Owner

umap-learn 0.2.4 should now be on PyPI. Let me know if that works for you.

@Xiaojieqiu
Author

Xiaojieqiu commented Jun 15, 2018

That is cool. Thanks a lot for your prompt response! I really appreciate it! I am able to install 0.2.4 from PyPI in R now.

I am reopening this issue for now in case users have other questions that we can discuss here.

@Xiaojieqiu
Author

Xiaojieqiu commented Jun 18, 2018

@lmcinnes Hi Leland, I did a closer comparison of the correlation, cosine, and Euclidean metrics on a dataset with about 2k samples, using umap-learn 0.2.4. Interestingly, I found that the UMAP embeddings from the Euclidean and cosine metrics are very similar and form a continuous manifold, but the correlation metric gives a very different result, with the manifold split into separate groups (see figures below). I wonder whether there is a typo or something similar in your recent changes to the correlation metric for datasets with fewer than 4096 columns?
cosine: [figure: plot_cosine]
euclidean: [figure: plot_euclidian]
correlation: [figure: monocle_plot_3d_correlation]

@lmcinnes
Owner

The easiest way to check at this point would be to cast your matrix to dense as that will avoid running the new code and fall back to sklearn correlation distance. If that gives notably different results then yes, there is still a bug somewhere in the sparse correlation distance computation.
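The same sparse-versus-dense comparison can also be made directly on the distance matrices, before any embedding is computed. The sketch below is not umap-learn's code: `sparse_correlation_distance` is a hypothetical helper written for this illustration, and the random test matrix is an assumption. It computes all-pairs correlation distance from a CSR matrix without centering the data explicitly and compares it with a dense reference from `np.corrcoef`.

```python
import numpy as np
import scipy.sparse as sp


def sparse_correlation_distance(X):
    """All-pairs correlation distance between the rows of a sparse matrix.

    Uses sum_k (x_k - mx)(y_k - my) = x.y - n * mx * my, so the zeros
    never have to be filled in. Assumes no row is constant.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    n_features = X.shape[1]
    means = np.asarray(X.mean(axis=1)).ravel()
    dots = (X @ X.T).toarray()
    centered_dots = dots - n_features * np.outer(means, means)
    norms = np.sqrt(np.diag(centered_dots))
    return 1.0 - centered_dots / np.outer(norms, norms)


# Illustrative data: 30 samples, 200 features, roughly 80% zeros.
rng = np.random.default_rng(1)
dense = rng.random((30, 200))
dense[dense < 0.8] = 0.0
sparse = sp.csr_matrix(dense)

from_sparse = sparse_correlation_distance(sparse)
from_dense = 1.0 - np.corrcoef(dense)  # dense reference
max_abs_diff = np.abs(from_sparse - from_dense).max()
print(max_abs_diff)
```

If a sparse implementation disagrees with the dense reference by more than floating-point noise, the bug is in the distance computation rather than in the later embedding steps.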

@Xiaojieqiu
Author

Thanks for the suggestion! I double-checked and indeed found that converting the sparse matrix to a dense matrix gives a much more continuous, less separated low-dimensional embedding.
Correlation metric after converting the sparse matrix to a dense matrix: [figure: screenshot, 2018-06-18]

@lmcinnes
Owner

lmcinnes commented Jun 18, 2018 via email

@Xiaojieqiu
Author

Yeah, take your time to fix it. We will use the cosine metric for now.

@lmcinnes
Owner

Found it! New package is on PyPI (0.2.5) if that is helpful. Sorry if this was a little late.

@sleighsoft
Collaborator

@lmcinnes and @Xiaojieqiu did all problems get resolved? Can this be closed?
