Error occurs when running dataset has less than 4096 columns #71

Xiaojieqiu · 2018-05-23T20:24:24Z

Hi Leland,

Thanks for this incredible method and algorithm. We love it! However, we find one annoying issue when we run UMAP on an input matrix (in sparseMatrix format) with metric=“correlation” that has less than 4096 columns. It throws this error:

Error in py_call_impl(callable, dots$args, dots$keywords) :
TypeError: scipy distance metrics do not support sparse matrices.

We find that this issue can be fixed by simply removing the following lines of code from the umap_.py

# Handle small cases efficiently by computing all distances
       if X.shape[0] < 4096:
           dmat = pairwise_distances(X, metric=self.metric, **self.metric_kwds)
           self.graph = fuzzy_simplicial_set(
               dmat,
               self.n_neighbors,
               random_state,
               ‘precomputed’,
               self.metric_kwds,
               self.angular_rp_forest,
               self.set_op_mix_ratio,
               self.local_connectivity,
               self.bandwidth,
               self.verbose
           )
       else:

This issue maybe related to #12

The text was updated successfully, but these errors were encountered:

lmcinnes · 2018-05-23T21:53:58Z

Hmm, I had assumed sklearn's pairwise_distances would support sparse matrices, but it looks like correlation doesn't. I can patch this directly by adding and metric != 'correlation' to the code, but actually start to wonder if the sparse correlation metric I have really works as it should. I think more accurately correlation is not really a metric supported for sparse matrices. Ultimately I need to fix that problem. In the meantime you can either convert the matrix to dense, or continue what you are doing if that works for you.

Xiaojieqiu · 2018-05-23T22:43:44Z

Hi Leland,

Thanks for your prompt response! If you can patch the repo by adding and metric != 'correlation or something, that will be great.

However, I am a little concerned about your message on correlation is not really a metric supported for sparse matrices... Could you please clarify that? Do you mean correlation cannot be calculated from the sparse matrices mathematically (which I don't believe so)? or do you mean your implementation of correlation for large datasets (with more than 4096 samples) with sparse matrix is potentially problematic? Right now we find that using correlation metric for large dataset with sparse matrix give us great results although we have the issue for small datasets. So I don't think that your implementation could be problematic...

lmcinnes · 2018-05-23T23:51:32Z

I think the sparse version of correlation may not be a true correlation distance -- I would have to check to be sure. If it is giving your good results then it probably does work correctly, and I'm worrying over nothing. I'll fix the distance measure for now so it should work more correctly for your case.

…

On Wed, May 23, 2018 at 6:43 PM, Xiaojie Qiu ***@***.***> wrote: Hi Leland, Thanks for your prompt response! If you can patch the repo by adding and metric != 'correlation or something, that will be great. However, I am a little concerned about your message on correlation is not really a metric supported for sparse matrices... Could you please clarify that? Do you mean correlation cannot be calculated from the sparse matrices mathematically (which I don't believe so)? or do you mean your implementation of correlation for large datasets (with more than 4096 samples) with sparse matrix is potentially problematic? Right now we find that using correlation metric for large dataset with sparse matrix give us great results although we have the issue for small datasets. So I don't think that your implementation could be problematic... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#71 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALaKBZ3xoz29IDLJNQ3yg6uPFyR7Xw5xks5t1eYhgaJpZM4ULH_R> .

…tances

Xiaojieqiu · 2018-05-23T23:59:56Z

Thank you Leland! Please let me know what is the potential issue regarding your current sparse matrix based correlation distance calculation.

lmcinnes · 2018-05-24T00:06:48Z

My concern is as follows: for correlation one should shift the data by the mean value, and then essentially compute cosine distance of the shifted data -- I believe the shifting in the sparse case is ignoring the zeros (as most of the parse computations do -- or else this would become dense and potentially messy for very dimension sparse datasets). That makes the result not an accurate correlation distance. The error can potentially be small for many cases, but reviewing the code I cannot guarantee accuracy.

…

On Wed, May 23, 2018 at 7:59 PM, Xiaojie Qiu ***@***.***> wrote: Thank you Leland! Please let me know what is the potential issue regarding your current sparse matrix based correlation distance calculation. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#71 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALaKBfZ4XNYDug8EyN2ac2oP-gi_Q09sks5t1ff9gaJpZM4ULH_R> .

lmcinnes · 2018-05-24T00:43:10Z

Okay, I believe the current master should now fix your issue, and correctly compute correlation distance for sparse matrix input data. It may be a little slower as the correct computation is a little more expensive. You may want to consider cosine distance as a faster alternative that will be similar.

Xiaojieqiu · 2018-05-24T00:54:34Z

that is great. thank you for the quick fix and detailed explanation. That has been very helpful. I will check the updated version now. (and we may close this issue for now?)

lmcinnes · 2018-05-24T00:57:34Z

You can close it for now, but feel free to re-open it if this doesn't resolve the issue for you.

…

On Wed, May 23, 2018 at 8:54 PM, Xiaojie Qiu ***@***.***> wrote: that is great. thank you for the quick fix and detailed explanation. That has been very helpful. I will check the updated version now. (and we may close this issue for now?) — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#71 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALaKBdoeNkYlXUGhZInJpuLsXe1d3YR_ks5t1gTMgaJpZM4ULH_R> .

Xiaojieqiu · 2018-05-24T01:08:27Z

btw, one quick suggestion, since you mentioned that it is slower for performing this fix. is that possible to first check the data first to ensure there are zero values before applying the fix and otherwise we can use what we have before?

lmcinnes · 2018-05-24T02:01:46Z

If you don't have any zeros then you can convert he input to dense (use ``.toarray()``) with no loss and use the correlation distance there. Alternatively the quick check is if the value of the ``.nnz`` property on the input data is equal to the number of samples times the number of features.

…

On Wed, May 23, 2018 at 9:08 PM, Xiaojie Qiu ***@***.***> wrote: btw, one quick suggestion, since you mentioned that it is slower for performing this fix. is that possible to first check the data first to ensure there are zero values before applying the fix and otherwise we can use what we have before? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#71 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALaKBVtOQE1-1XHLJHYvIP5gSDsJkjLeks5t1ggNgaJpZM4ULH_R> .

Xiaojieqiu · 2018-05-24T02:05:17Z

thank you for point out this! I am closing the issue for now

Xiaojieqiu · 2018-06-15T02:44:47Z

@lmcinnes Hi Leland, thanks for fixing this issue again! I wonder whether you can help us release umap with this fix to PyPI. We are planning to release another R package (I can provide more details later) which will be dependent on UMAP. We would like to make sure the users to be able to use this newest version of umap for their analysis directly from R. Unfortunately R only provides ways to install python packages from PyPI but not from github .

lmcinnes · 2018-06-15T03:13:58Z

umap-learn 0.2.4 should now be on PyPI. Let me know if that works for you.

Xiaojieqiu · 2018-06-15T03:54:32Z

That is cool. Thanks a lot for your prompt response! I really appreciated that! I am able to install 0.2.4 from PyPi in R now.

I am reopening this issue for now in case the users may have any other potential questions and I can discuss them here.

Xiaojieqiu · 2018-06-18T19:00:48Z

@lmcinnes Hi Leland, I did a more close check between the correlation, cosine and the Euclidean metric for a dataset with about 2k samples with the umap-learn 0.2.4 version. Interestingly, I found UMAP reduced dimension from euclidean and cosine metric are very similar and forms a continuous manifold but the correlation metric gives way different result and the manifold is clustered into separate groups (See figures below). I wonder maybe there is a potential typo or something in your recent changes to the correlation metric for sample less than 4096 columns?
cosine:

euclidean:

correlation:

lmcinnes · 2018-06-18T20:56:48Z

The easiest way to check at this point would be to cast your matrix to dense as that will avoid running the new code and fall back to sklearn correlation distance. If that gives notably different results then yes, there is still a bug somewhere in the sparse correlation distance computation.

Xiaojieqiu · 2018-06-18T22:12:39Z

Thanks for the suggestions! I double checked and indeed found that converting the sparse matrix into dense matrix gives us much continuous and less separated low-dimensional embedding:
correlation metric after converting the sparse matrix to a dense matrix:

lmcinnes · 2018-06-18T22:53:26Z

Okay, there are still some bugs to be worked out then. Sparse correlation is hard, unfortunately.

…

On Mon, Jun 18, 2018 at 6:12 PM, Xiaojie Qiu ***@***.***> wrote: Thanks for the suggestions! I double checked and indeed found that converting the sparse matrix into dense matrix gives us much continuous and less separated low-dimensional embedding: correlation metric after converting the sparse matrix to a dense matrix: [image: screen shot 2018-06-18 at 15 08 59] <https://user-images.githubusercontent.com/7456281/41564974-99449666-7309-11e8-9e3d-38e5d98414c2.png> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#71 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALaKBZQqK_MeLIlYvb_fbG5LTXlfs_Wcks5t-CXZgaJpZM4ULH_R> .

Xiaojieqiu · 2018-06-19T17:23:22Z

yeah, take your time to fix it.... we will use cosine metric for now.

lmcinnes · 2018-06-20T14:21:08Z

Found it! New package is on PyPI (0.2.5) if that is helpful. Sorry if this was a little late.

sleighsoft · 2019-09-16T20:21:45Z

@lmcinnes and @Xiaojieqiu did all problems get resolved? Can this be closed?

lmcinnes added a commit that referenced this issue May 23, 2018

Fix for issue #71 -- sparse correlation not available in pairwise_dis…

9e199e4

…tances

lmcinnes added a commit that referenced this issue May 24, 2018

Issue #71 touch up

c86884d

lmcinnes added a commit that referenced this issue May 24, 2018

Correct sparse correlation distance per issue #71

d80f313

Xiaojieqiu closed this as completed May 24, 2018

Xiaojieqiu reopened this Jun 15, 2018

lmcinnes added a commit that referenced this issue Jun 20, 2018

Fix sparse correlation (at last) per issue #71

38c27be

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error occurs when running dataset has less than 4096 columns #71

Error occurs when running dataset has less than 4096 columns #71

Xiaojieqiu commented May 23, 2018 •

edited

Loading

lmcinnes commented May 23, 2018

Xiaojieqiu commented May 23, 2018

lmcinnes commented May 23, 2018 via email

Xiaojieqiu commented May 23, 2018

lmcinnes commented May 24, 2018 via email

lmcinnes commented May 24, 2018

Xiaojieqiu commented May 24, 2018

lmcinnes commented May 24, 2018 via email

Xiaojieqiu commented May 24, 2018

lmcinnes commented May 24, 2018 via email

Xiaojieqiu commented May 24, 2018

Xiaojieqiu commented Jun 15, 2018

lmcinnes commented Jun 15, 2018

Xiaojieqiu commented Jun 15, 2018 •

edited

Loading

Xiaojieqiu commented Jun 18, 2018 •

edited

Loading

lmcinnes commented Jun 18, 2018

Xiaojieqiu commented Jun 18, 2018

lmcinnes commented Jun 18, 2018 via email

Xiaojieqiu commented Jun 19, 2018

lmcinnes commented Jun 20, 2018

sleighsoft commented Sep 16, 2019

Error occurs when running dataset has less than 4096 columns #71

Error occurs when running dataset has less than 4096 columns #71

Comments

Xiaojieqiu commented May 23, 2018 • edited Loading

lmcinnes commented May 23, 2018

Xiaojieqiu commented May 23, 2018

lmcinnes commented May 23, 2018 via email

Xiaojieqiu commented May 23, 2018

lmcinnes commented May 24, 2018 via email

lmcinnes commented May 24, 2018

Xiaojieqiu commented May 24, 2018

lmcinnes commented May 24, 2018 via email

Xiaojieqiu commented May 24, 2018

lmcinnes commented May 24, 2018 via email

Xiaojieqiu commented May 24, 2018

Xiaojieqiu commented Jun 15, 2018

lmcinnes commented Jun 15, 2018

Xiaojieqiu commented Jun 15, 2018 • edited Loading

Xiaojieqiu commented Jun 18, 2018 • edited Loading

lmcinnes commented Jun 18, 2018

Xiaojieqiu commented Jun 18, 2018

lmcinnes commented Jun 18, 2018 via email

Xiaojieqiu commented Jun 19, 2018

lmcinnes commented Jun 20, 2018

sleighsoft commented Sep 16, 2019

Xiaojieqiu commented May 23, 2018 •

edited

Loading

Xiaojieqiu commented Jun 15, 2018 •

edited

Loading

Xiaojieqiu commented Jun 18, 2018 •

edited

Loading