UMAP as a dimensionality reduction (umap.transform()) #40

mlaprise · 2018-02-01T15:23:14Z

First of all, thx for this method. It working so well !

So I have a general question about using UMAP as a dimensionality reduction step in a prediction pipeline. We have a classification model where using a UMAP as a first dimensionality reduction step seem to gives really good results. It fixes a lot of regularization issue we have with this specific model. Now I guess my question is more related to manifold training in general, but I usually fit the dim reduction model first on the train data and then use the same model for the inference/prediction in order to have a consistent lower-dimensional projection.

Now obviously, like t-SNE, the manifold itself is learned with the data so it’s hard to “transform” new incoming data so that’s why there is no umap.transform() method I guess. There was a closely related discussion on sklearn at some point on a possible parametric t-SNE that would make this projection easier (scikit-learn/scikit-learn#5361) but looks like it’s a non trivial task in t-SNE. Anyway, long story short, since it’s mentioned in the documentation that UMAP can be used as a “reduction technique as a preliminary step to other machine learning tasks”, I was wondering how a prediction pipeline using UMAP would like like ?

The method I found so far is to reduce the dimensionality of the training AND test data at the same time in a single umap.fit_transform(), then train the model on the reduced train data and predict with the reduced test data. It’s work well in the a test scenario, but obviously in a real world environnement it mean that we would have to perform the dim reduction of the incoming data alongside the entire training dataset every time.

Is there a more elegant way of doing this ?

Martin

lmcinnes · 2018-02-01T15:46:15Z

I think the answer to that is that all the requisite code to manage that has not been developed yet -- specifically the prediction aspect. There are a few ways to do that, but the most viable is something like parametric t-SNE where one trains a function approximator to learn a functional transform to match the embedding (in this case a neural network). I should note that in UMAPs case this would look somewhat akin to a "deep" word2vec type training. Other prediction modes are possible whereby one retrains the model holding the training points embedding locations fixed and optimizing only the locations of the new points.

In other words for now it is more an "in principle" statement -- none of this is hard in the sense that I believe there are no theoretical obstructions to making this work, so from an algorithmic research point of view it is "solved" on some level. In practice yes, there is code that needs to be written to make this actually practical, and some of that is potentially somewhat non-trivial.

mlaprise · 2018-02-01T16:16:59Z

ok thx for the explanation ! that's exactly what I thought, the method allows it but it's not really implemented yet. Just wanted to make sure. I would be happy to contribute on this at some point.

lmcinnes · 2018-02-01T16:25:32Z

Contributions are more than welcome -- especially on a parametric version as I have limited neural network experience.

lmcinnes · 2018-02-06T03:37:12Z

I have experimental code in some notebooks I wrote out of curiousity that can do a transform operation on new data under basic UMAP theory assumptions (place new data assuming the training embedding is fixed -- which is not different than, say, PCA). On my one test so far on MNIST digits it did great -- but then everything does great on MNIST digits. I think it should generalise though -- I'll have to put all the pieces together properly and try it on a few other datatsets. One downside is that is it "slow" -- based on timings of doing it piecemeal I think we're talking say ~20s for 10000 test points compared to 1m40s for 60000 training points fit time. Does this seem reasonable to you?

mlaprise · 2018-02-06T19:06:29Z

Nice, I'd say 20sec is really reasonable ! Interestingly I did an experiment as well. In fact I tried something really naive just out of curiosity. I first projected the data into the manifold using the UMAP code. Then I wrote a simple fully connected neural network and trained it on the result of the UMAP. Essentially learning the function that does the projection. Then I used that model to do the dim reduction in my predictive model (instead of the actual UMAP). Of course that model is totally specific to my problem/dataset, but the accuracy I get is similar to the one I got with the actual UMAP.

lmcinnes · 2018-02-06T19:26:41Z

Nice! It sounds like both approaches are viable. I'm going to try and clean up the notebooks and then get a working version of my current approach working as a transform operation within UMAP itself. I would certainly be interested in neural network approaches as well though (I just don't have much expertise in that area).

kylemcdonald · 2018-03-07T01:48:03Z

fwiw i ported laurens' parametric t-SNE implemented to keras a few years ago https://github.com/kylemcdonald/Parametric-t-SNE and i tried both approaches: training a net to produce the same embedding as a previously "solved" t-SNE, and training a net to optimize for t-SNE's loss function directly. both gave interesting and useful results.

it gets really exciting when you can start using domain-specific operations like convolution or recurrence. for example, imagine UMAP running on images in a way that is simultaneously optimizing a convolutional network for processing the images at the same time as optimizing the embedding.

lmcinnes · 2018-03-07T03:06:48Z

That certainly sounds like an interesting approach. I would be interested to know more about training the convolutional layers to process the image and give the embedding. In the meantime I am finishing up the required refactoring to get the simplified transform method in place. I think I'll have the leave the NN work to others.

JaeDukSeo · 2018-10-31T20:03:29Z

@lmcinnes just as a question do you have a tutorial covering the math behind umap?

lmcinnes · 2018-10-31T22:27:38Z

So the closest I have currently is [this explanation]( https://umap-learn.readthedocs.io/en/latest/how_umap_works.html), and [my talk at scipy](https://www.youtube.com/watch?v=nq6iPZVUxZU). Depending on what you are looking for this will either be too much, or not enough detail. If it is not enough detail then [the paper]( https://arxiv.org/abs/1802.03426) contains the math, though is not necessarily the best introduction to the overall algorithm. I'm working with John Healy and James Melville on a longer version of the paper that provides more details, which will hopefully get posted to arXiv in the next few months (all going well).

…

On Wed, Oct 31, 2018 at 8:07 PM Jae Duk Seo ***@***.***> wrote: @lmcinnes <https://github.com/lmcinnes> just as a question do you have a tutorial covering the math behind umap? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALaKBVJrLMVLkykC0PuJXnZhn0oALL0_ks5uqgMJgaJpZM4R1yId> .

JaeDukSeo · 2018-11-01T00:07:45Z

@lmcinnes Thank you so much! Not gonna lie I am studying your implementation in umap, and it is very impressive! Very amazed to see a mathematician who is also a profound programmer.

piplus2 · 2019-05-16T08:57:45Z

An easy implementation of a transform method would be what Laureen suggests for his t-SNE.

Using a multivariate regressor (a neural network?) fitted using the training dataset and its low-dimensional representation and use it to project new data onto the same manifold.
It is a naive approach but easy to implement.
The other way is the analogous parametric t-SNE approach that fits the model parameters using the t-SNE loss function directly.

@kylemcdonald implementation is a very good starting point. If you want we can work together on this.

piplus2 · 2019-05-16T09:48:08Z

So this is the implementation of the naive multivariate regressor to project new data on the fitted UMAP embedding space

https://github.com/paoloinglese/Parametric-t-SNE/blob/master/Parametric%20UMAP%20(Keras).ipynb

It's based on @kylemcdonald implementation using Keras.
In the next days I'll have a look at a more refined model, analogous to parametric t-SNE.

lmcinnes · 2019-05-16T22:11:00Z

@paoloinglese: This looks really interesting -- it would add some non-trivial dependencies, but is certainly worth looking into further. At the very least it would be very nice to have a documentation page simialr to your notebook demonstrating how to do this sort of thing. I look forward to hearing more.

piplus2 · 2019-05-16T22:13:58Z

@lmcinnes Ok great! I'll prepare something in the next few days.

piplus2 · 2019-05-17T12:04:00Z

@lmcinnes I've updated the notebook setting the tensorflow backend and added a simple k-nn classification using the UMAP embedding of the train set and predicting using the neural network predicted UMAP embedding for the test set

https://github.com/paoloinglese/Parametric-t-SNE/blob/master/Parametric%20UMAP%20(Keras).ipynb

Unfortunately, I don't have much time to put more text in the notebook. I guess the overall idea is pretty clear, as suggested previously in other messages here on Github.

JaeDukSeo · 2019-05-17T13:15:19Z

thanks for this

HyperPotatoNeo · 2019-08-31T16:27:18Z

So I had tried a simple Auto-Encoder on Cifar10 a few months ago,
here it is.
https://github.com/HyperPotatoNeo/Deep-learning-experiments-with-umap/blob/master/AE%20vs%20UMAP_%20Cifar10.ipynb

damithsenanayake · 2019-12-12T03:16:30Z

Hi All,
I've recently put my method SONG up on arXiv (https://arxiv.org/abs/1912.04896), which may be an alternative approach for 'Model Persistence with UMAP' in Leland's long-term road-map. Please give it a read and let me know what you guys think.

https://arxiv.org/abs/1912.04896

omaralvarez · 2020-05-21T07:49:02Z

Any updates on this? UMAP is giving me great results, but I want to run it in real-time for new unseen points and the transform method is slow. Anybody knows how to get around this?

lmcinnes · 2020-05-21T15:54:00Z

I think the SONG option listed above is not unreasonable. In practice real-time transforms are not something that will be available any time soon in UMAP.

omaralvarez · 2020-05-22T07:33:06Z

I think the SONG idea is pretty good. I would like to give it a try, but I have not found any code. I would like to reproduce its results and test it out with my datasets.

damithsenanayake · 2020-05-22T09:06:48Z

Hi Omar,
We will be soon making the SONG code available (in a couple of weeks). Please stay tuned. Cheers.

damithsenanayake · 2020-09-09T07:59:27Z

Hi Omar,
we have now released the SONG source code (after much ado...). You may find it in this link.
https://github.com/damithsenanayake/SONG

none0none · 2021-08-10T19:29:20Z

So this is the implementation of the naive multivariate regressor to project new data on the fitted UMAP embedding space

https://github.com/paoloinglese/Parametric-t-SNE/blob/master/Parametric%20UMAP%20(Keras).ipynb

It's based on @kylemcdonald implementation using Keras.
In the next days I'll have a look at a more refined model, analogous to parametric t-SNE.

Today I checked this link but it is not valid any more.

piplus2 · 2021-08-10T20:32:41Z

@none0none yes, I removed it after @lmcinnes et al. published a refined model for parametric UMAP https://arxiv.org/abs/2009.12981

lmcinnes mentioned this issue Feb 1, 2018

UMAP Roadmap #15

Open

21 tasks

lmcinnes mentioned this issue Mar 7, 2018

Transform and Unsupervised Data #55

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UMAP as a dimensionality reduction (umap.transform()) #40

UMAP as a dimensionality reduction (umap.transform()) #40

mlaprise commented Feb 1, 2018

lmcinnes commented Feb 1, 2018 •

edited

Loading

mlaprise commented Feb 1, 2018

lmcinnes commented Feb 1, 2018

lmcinnes commented Feb 6, 2018

mlaprise commented Feb 6, 2018

lmcinnes commented Feb 6, 2018

kylemcdonald commented Mar 7, 2018

lmcinnes commented Mar 7, 2018

JaeDukSeo commented Oct 31, 2018

lmcinnes commented Oct 31, 2018 via email

JaeDukSeo commented Nov 1, 2018

piplus2 commented May 16, 2019

piplus2 commented May 16, 2019 •

edited

Loading

lmcinnes commented May 16, 2019

piplus2 commented May 16, 2019

piplus2 commented May 17, 2019

JaeDukSeo commented May 17, 2019

HyperPotatoNeo commented Aug 31, 2019 •

edited

Loading

damithsenanayake commented Dec 12, 2019

omaralvarez commented May 21, 2020

lmcinnes commented May 21, 2020

omaralvarez commented May 22, 2020

damithsenanayake commented May 22, 2020

damithsenanayake commented Sep 9, 2020

none0none commented Aug 10, 2021

piplus2 commented Aug 10, 2021

UMAP as a dimensionality reduction (umap.transform()) #40

UMAP as a dimensionality reduction (umap.transform()) #40

Comments

mlaprise commented Feb 1, 2018

lmcinnes commented Feb 1, 2018 • edited Loading

mlaprise commented Feb 1, 2018

lmcinnes commented Feb 1, 2018

lmcinnes commented Feb 6, 2018

mlaprise commented Feb 6, 2018

lmcinnes commented Feb 6, 2018

kylemcdonald commented Mar 7, 2018

lmcinnes commented Mar 7, 2018

JaeDukSeo commented Oct 31, 2018

lmcinnes commented Oct 31, 2018 via email

JaeDukSeo commented Nov 1, 2018

piplus2 commented May 16, 2019

piplus2 commented May 16, 2019 • edited Loading

lmcinnes commented May 16, 2019

piplus2 commented May 16, 2019

piplus2 commented May 17, 2019

JaeDukSeo commented May 17, 2019

HyperPotatoNeo commented Aug 31, 2019 • edited Loading

damithsenanayake commented Dec 12, 2019

omaralvarez commented May 21, 2020

lmcinnes commented May 21, 2020

omaralvarez commented May 22, 2020

damithsenanayake commented May 22, 2020

damithsenanayake commented Sep 9, 2020

none0none commented Aug 10, 2021

piplus2 commented Aug 10, 2021

lmcinnes commented Feb 1, 2018 •

edited

Loading

piplus2 commented May 16, 2019 •

edited

Loading

HyperPotatoNeo commented Aug 31, 2019 •

edited

Loading