Error occurs when running on a dataset with fewer than 4096 columns #71

Hi Leland,
Thanks for this incredible method and algorithm. We love it! However, we find one annoying issue when we run UMAP with metric="correlation" on an input matrix (in sparseMatrix format) that has fewer than 4096 columns. It throws this error:

We find that this issue can be fixed by simply removing the following lines of code from umap_.py:

This issue may be related to #12
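As an illustration (this reproduction is not from the original report; the data and parameters are hypothetical), a minimal sketch of the kind of call that triggered the error, assuming a SciPy CSR input and a umap-learn version prior to the fix:

```python
import scipy.sparse
import umap  # umap-learn

# Hypothetical input: a sparse matrix with fewer than 4096 columns.
X = scipy.sparse.random(500, 1000, density=0.1, format="csr", random_state=0)

# With metric="correlation", pre-fix versions raised the error described
# above on input like this; patched versions should succeed.
embedding = umap.UMAP(metric="correlation").fit_transform(X)
print(embedding.shape)  # (500, 2) with the default n_components
```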
Comments
Hmm, I had assumed sklearn's …
Hi Leland, thanks for your prompt response! If you can patch the repo by adding ``and metric != 'correlation'`` or something, that would be great.

However, I am a little concerned about your message that correlation is not really a metric supported for sparse matrices... Could you please clarify that? Do you mean correlation cannot be calculated from sparse matrices mathematically (which I don't believe is the case), or do you mean your implementation of correlation for large sparse datasets (with more than 4096 samples) is potentially problematic? Right now we find that using the correlation metric on large sparse datasets gives us great results, although we have this issue for small datasets. So I don't think your implementation could be problematic...
I think the sparse version of correlation may not be a true correlation distance -- I would have to check to be sure. If it is giving you good results then it probably does work correctly, and I'm worrying over nothing. I'll fix the distance measure for now so it should work more correctly for your case.
Thank you Leland! Please let me know what the potential issue is with your current sparse-matrix-based correlation distance calculation.
My concern is as follows: for correlation one should shift the data by the mean value, and then essentially compute the cosine distance of the shifted data -- I believe the shifting in the sparse case ignores the zeros (as most sparse computations do -- or else the data would become dense, which is potentially messy for very high-dimensional sparse datasets). That makes the result not an accurate correlation distance. The error may well be small in many cases, but reviewing the code I cannot guarantee accuracy.
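To make the concern concrete, here is a small illustrative sketch (my own example, not UMAP's actual implementation) contrasting the exact correlation distance with a naive sparse variant that centers only the stored nonzero entries:

```python
import numpy as np
from scipy import sparse

def correlation_distance(x, y):
    # Exact definition: cosine distance of the mean-centered vectors.
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def naive_sparse_correlation_distance(x, y):
    # Naive shortcut: subtract the mean of the *stored* entries from the
    # stored entries only, leaving implicit zeros untouched so the vectors
    # stay sparse. This deviates from the true correlation distance.
    xs, ys = x.copy(), y.copy()
    xs.data -= xs.data.mean()
    ys.data -= ys.data.mean()
    num = xs.multiply(ys).sum()
    den = np.sqrt(xs.multiply(xs).sum() * ys.multiply(ys).sum())
    return 1.0 - num / den

rng = np.random.default_rng(0)
x = rng.random(50) * (rng.random(50) < 0.3)  # vectors that are mostly zero
y = rng.random(50) * (rng.random(50) < 0.3)

print(correlation_distance(x, y))
print(naive_sparse_correlation_distance(sparse.csr_matrix(x),
                                        sparse.csr_matrix(y)))
# The two values differ whenever zeros are present, because true centering
# would turn every implicit zero into a nonzero (-mean) entry.
```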
Okay, I believe the current master should now fix your issue, and correctly compute correlation distance for sparse matrix input data. It may be a little slower as the correct computation is a little more expensive. You may want to consider cosine distance as a faster alternative that will be similar.
That is great. Thank you for the quick fix and detailed explanation; it has been very helpful. I will check the updated version now. (And we may close this issue for now?)
You can close it for now, but feel free to re-open it if this doesn't
resolve the issue for you.
By the way, one quick suggestion: since you mentioned that the fix makes things a little slower, would it be possible to first check the data to ensure there actually are zero values before applying the fix, and otherwise use the previous computation?
If you don't have any zeros then you can convert the input to dense (use ``.toarray()``) with no loss and use the correlation distance there. Alternatively, the quick check is whether the value of the ``.nnz`` property on the input data is equal to the number of samples times the number of features.
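A sketch of that check using standard SciPy sparse attributes (my wording, not code from the repo):

```python
import scipy.sparse

def has_no_zeros(X):
    # After eliminate_zeros(), .nnz counts genuinely nonzero stored entries,
    # so it equals rows * cols exactly when the matrix contains no zeros.
    X.eliminate_zeros()
    return X.nnz == X.shape[0] * X.shape[1]

X = scipy.sparse.random(100, 20, density=1.0, format="csr", random_state=0)
if has_no_zeros(X):
    X = X.toarray()  # densify losslessly and use the exact correlation path
```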
Thank you for pointing this out! I am closing the issue for now.
@lmcinnes Hi Leland, thanks for fixing this issue again! I wonder whether you can help us release umap with this fix to PyPI. We are planning to release another R package (I can provide more details later) which will depend on UMAP. We would like to make sure users are able to use this newest version of umap for their analysis directly from R. Unfortunately, R only provides ways to install Python packages from PyPI, not from GitHub.
umap-learn 0.2.4 should now be on PyPI. Let me know if that works for you.
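One generic way to confirm which version actually got installed (standard packaging tooling, nothing umap-specific):

```python
import pkg_resources

# Report the installed umap-learn distribution's version string.
print(pkg_resources.get_distribution("umap-learn").version)  # e.g. '0.2.4'
```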
That is cool. Thanks a lot for your prompt response! I really appreciate that! I am able to install 0.2.4 from PyPI in R now. I am reopening this issue for now in case users have any other potential questions, so I can discuss them here.
@lmcinnes Hi Leland, I did a closer check between the correlation, cosine and Euclidean …
The easiest way to check at this point would be to cast your matrix to dense, as that will avoid running the new code and fall back to sklearn's correlation distance. If that gives notably different results then yes, there is still a bug somewhere in the sparse correlation distance computation.
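A sketch of that comparison (the data here is a stand-in; assumes the matrix fits in memory once densified):

```python
import scipy.sparse
import umap

X = scipy.sparse.random(300, 1000, density=0.1, format="csr", random_state=0)

# Same data and metric: the sparse path runs the new code, while the dense
# path falls back to sklearn's correlation distance.
emb_sparse = umap.UMAP(metric="correlation", random_state=42).fit_transform(X)
emb_dense = umap.UMAP(metric="correlation", random_state=42).fit_transform(X.toarray())
# Noticeably different embeddings point to a bug in the sparse computation.
```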
Thanks for the suggestions! I double checked and indeed found that converting the sparse matrix into a dense matrix gives us a much more continuous and less separated low-dimensional embedding. Correlation metric after converting the sparse matrix to a dense matrix: https://user-images.githubusercontent.com/7456281/41564974-99449666-7309-11e8-9e3d-38e5d98414c2.png

Okay, there are still some bugs to be worked out then. Sparse correlation is hard, unfortunately.
Yeah, take your time to fix it... we will use the cosine metric for now.
Found it! The new package is on PyPI (0.2.5) if that is helpful. Sorry if this was a little late.
@lmcinnes and @Xiaojieqiu, did all problems get resolved? Can this be closed?