
Partial fit / Incremental UMAP #62

Open
parashardhapola opened this issue Apr 30, 2018 · 31 comments

@parashardhapola
Contributor

Hi @lmcinnes,

There has been quite a lot of excitement about UMAP in the scRNA-Seq area recently. However, fitting larger datasets may be an issue depending on the amount of available memory. I would love to see a 'partial_fit' method (akin to the one available in scikit-learn's IncrementalPCA) in UMAP, so that data could be lazily loaded and fitted. I realize that this might be non-trivial, if not theoretically infeasible, but your recent implementation of the 'transform' method got my hopes up. It would be nice to have your thoughts on this.

Thanks.
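For reference, the scikit-learn pattern I have in mind looks like the following (a minimal sketch using IncrementalPCA; the chunk generator is a stand-in for any lazy data source):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def load_chunks(n_chunks=10, chunk_size=1000, n_features=50):
    # Stand-in for lazily loading data from disk one chunk at a time.
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        yield rng.normal(size=(chunk_size, n_features))

ipca = IncrementalPCA(n_components=2)
for chunk in load_chunks():
    ipca.partial_fit(chunk)  # only one chunk is ever in memory

# The hope would be an analogous (currently hypothetical) UMAP API:
#   reducer = umap.UMAP()
#   for chunk in load_chunks():
#       reducer.partial_fit(chunk)
```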

@lmcinnes
Owner

If I understand you correctly, you want essentially an "out-of-core" fit method that doesn't require the whole dataset to be in memory at one time?

I believe that this is possible, but would likely be slow. Essentially what is really needed is a fast, out-of-core, approximate nearest neighbor algorithm. I believe what I have now may actually work with dask arrays, which would provide out-of-core support, but for the fact that check_array converts the input to a concrete array at the start, and modulo how numba handles dask (I suspect it doesn't). In other words, I think this is algorithmically quite tractable, but might require a fair amount of code to actually make it happen.
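For illustration, a minimal sketch of the check_array problem just described (assuming recent dask and scikit-learn versions):

```python
import dask.array as da
from sklearn.utils import check_array

# A lazily evaluated, chunked array that never needs to be fully in memory.
X = da.random.random((1_000_000, 100), chunks=(10_000, 100))

# check_array coerces its input to a concrete ndarray, which pulls the
# entire dask array into memory and defeats the out-of-core approach.
X_dense = check_array(X)
```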

If there is significant interest in this (and it seems there may be) I'll see if I can figure out how to actually make this happen.

@parashardhapola
Contributor Author

Yes indeed, I'm looking for an out-of-core fit method and I'm sure that it would be of interest to many.

Pardon me if this is a very naive suggestion, but I was thinking that maybe this functionality could initially be implemented downstream of the distance calculation. In effect, a user would provide a generator that yields a distance vector for each sample (of shape m, one row of an m x m distance matrix), instead of the complete pre-computed matrix itself. This might avoid, for now, the requirement of an out-of-core ANN algorithm.
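As a rough sketch of what I mean (the generator below is purely illustrative, not existing API):

```python
import numpy as np

def stream_distance_rows(data, chunk_size=1024):
    """Yield one row of the m x m Euclidean distance matrix at a time.

    `data` can be a memory-mapped array; distances are computed against
    chunks of it, so only O(chunk_size * n_features) values are in memory
    at once, never the full m x m matrix.
    """
    m = data.shape[0]
    for i in range(m):
        xi = np.asarray(data[i])
        row = np.empty(m)
        for start in range(0, m, chunk_size):
            chunk = np.asarray(data[start:start + chunk_size])
            row[start:start + chunk_size] = np.linalg.norm(chunk - xi, axis=1)
        yield row

# Hypothetical usage, if UMAP grew support for such a generator with
# metric="precomputed" (no such entry point exists today):
#   embedding = fit_from_distance_rows(stream_distance_rows(mmap_data))
```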

@stsievert

Essentially what is really needed is a fast, out-of-core, approximate nearest neighbor algorithm.

Is this what we were working on during the SciPy sprints on some branch/fork of https://github.com/lmcinnes/pynndescent? I'd like to see the code – I'm curious what progress has been made, and would be happy to help out more.

@lmcinnes
Owner

Yes, that's the one. I got the basic code hammered out, but hadn't fixed all the bugs yet. I also chatted with Matt Rocklin, and it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array that is required. Still, slow but tractable is better than impossible.

@stsievert

it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array required.

I believe that. It sounds like the proposed solution will be enabled by dask/dask#3409.

Is there anywhere I can review the code you've implemented?

@lmcinnes
Owner

lmcinnes commented Jul 20, 2018 via email

@mrocklin

On the dask side I think that dask/dask#3409 (comment) is probably the next step to try (assuming I understand the situation on the UMAP side (which is doubtful)) :)

@soupault

soupault commented Jun 25, 2019

Really looking forward to it! The computer vision domain would benefit quite a lot from an out-of-core implementation.

@ksangeeta2429

ksangeeta2429 commented Aug 30, 2019

Has something already been released for this? I am also looking for the same thing, for acoustic deep learning purposes. I am looking to use UMAP on some terabytes of data.

@lmcinnes
Owner

Currently this is still, at best, an "in the works" issue. The first real hurdle to be overcome, the nearest neighbor search, is looking good. There is a parallel NN search, and there is reasonable hope that a few more changes will ensure it is dask-enabled for distributed arrays -- this has all been put into the PyNNDescent library, which will become a dependency for umap in the future. The next step is distributed SGD. This is easier, but has had less work done. Speculatively, I would suggest that these are features that may arrive in version 0.5. Note, of course, that 0.4 is not even released yet. I suspect that means you may have to use subsampling for your project for now. The good news is that the structures found by umap are generally fairly robust to subsampling.
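For the subsampling workaround, a minimal sketch (the file path and sizes are placeholders):

```python
import numpy as np
import umap

# Memory-map the large dataset rather than loading it outright.
data = np.load("big_dataset.npy", mmap_mode="r")  # placeholder path

# Fit on a random subsample that fits in memory...
rng = np.random.default_rng(42)
subsample = rng.choice(data.shape[0], size=100_000, replace=False)
reducer = umap.UMAP().fit(np.asarray(data[subsample]))

# ...then embed the remaining points in batches via transform().
batch_size = 50_000
embedding = np.vstack([
    reducer.transform(np.asarray(data[start:start + batch_size]))
    for start in range(0, data.shape[0], batch_size)
])
```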

@stsievert

next step is distributed SGD.

dask/dask#3409 is related (and mentioned above). It kept popping up in my Dask distributed SGD implementation (dask/dask-glm#69).

dask/dask#3901 resolves this issue (thanks to @TomAugspurger). Now, shuffling a distributed Dask array is much quicker (roughly a 100x improvement: 13.7 s down to 140 ms).
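For context, the kind of operation those changes speed up looks roughly like this (a sketch; shuffling a distributed array amounts to slicing it with an unsorted index):

```python
import numpy as np
import dask.array as da

x = da.random.random((1_000_000, 50), chunks=(100_000, 50))

# Slicing with a random permutation forces data movement between chunks --
# the random(ish) access pattern that distributed SGD keeps running into.
perm = np.random.default_rng(0).permutation(x.shape[0])
shuffled = x[perm]
```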

@teenooy

teenooy commented Dec 11, 2019

I would really like to see an incremental training feature for UMAP. I work with both image and text data, and in both categories UMAP easily generates the best manifold with the least amount of tuning compared to everything else I've tried.

@lmcinnes
Owner

At present it looks like this is something that might be possible in the v0.5 to v0.6 time-frame. That's not soon, but it may yet happen.

@ghost

ghost commented Jan 31, 2020

I am writing to mention that I am very interested in this.

@ghost

ghost commented Apr 19, 2020

A possible method for doing a large-scale nearest neighbor search efficiently would be to use something like k-d trees or this. I attempted this a while ago and found that it works for lower-dimensional data, such as VGG16 features. I tried to use it on some very high-dimensional genomics applications and had no luck.

I’ll try to dig up this code or my notes on this.
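In the meantime, a minimal sketch of the k-d tree approach (using scipy; as noted, this only pays off in low dimensions, since k-d trees degrade toward brute force as the dimensionality grows):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 10))  # low-dimensional feature vectors

tree = cKDTree(data)                 # build once
dists, idx = tree.query(data, k=15)  # 15 nearest neighbours per point
```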

@galinaalperovich

galinaalperovich commented May 2, 2020

Would be happy to see this incremental UMAP feature!

@lmcinnes
Owner

lmcinnes commented May 3, 2020

Depending on what exactly you are looking for, this may now be available (in preliminary form) in the 0.5dev branch. To make it work you would use the new AlignedUMAP class and the update functionality therein. I have a notebook gist that provides a basic walkthrough of these sorts of features. If your goal is to incrementally add batches of new points to an existing embedding, updating the embedding as you go, this approach should suffice. It may not meet all of your needs, but hopefully it will solve some problems for some users.
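A minimal sketch of that incremental pattern (based on the 0.5dev branch, so details may shift; each relations dict maps row indices in one batch to the matching row indices in the next):

```python
import numpy as np
import umap

rng = np.random.default_rng(0)

# Build three batches where the last 100 rows of each batch reappear as
# the first 100 rows of the next, giving us known relations.
batches = [rng.normal(size=(500, 20))]
for _ in range(2):
    prev = batches[-1]
    batches.append(np.vstack([prev[-100:], rng.normal(size=(400, 20))]))

# Row 400 + i of batch t corresponds to row i of batch t + 1.
relations = [{400 + i: i for i in range(100)} for _ in range(len(batches) - 1)]

aligned = umap.AlignedUMAP().fit(batches, relations=relations)

# Incrementally add a new batch, updating the embedding as you go.
new_batch = np.vstack([batches[-1][-100:], rng.normal(size=(400, 20))])
aligned.update(new_batch, relations={400 + i: i for i in range(100)})

embeddings = aligned.embeddings_  # one embedding per batch
```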

@fedden

fedden commented Apr 20, 2021

Thanks for the early release and the consideration. I'm just wondering if there is an approach to take with regard to the relations dictionaries when successive instances in the dataset do not depend on each other (no overlap) and as such have no relations between successive frames.

E.g. I tried passing in empty dicts in the hope that they would be interpreted as no relations between successive instances:

reducer.fit([data], relations=[{} for _ in data][:-1])

But I get an error roughly saying that it expects non-zero-length dicts.

@lmcinnes
Owner

That's definitely a bug -- as you say, it should treat that as no relation. I presume you have a dataset you want aligned except at some intermediate slice where you lack relations? I'll have to dig in a little and see if this is easy to fix. Having at least some relation is baked into the assumptions right now, but I can see that occasionally having empty relations does make sense.

@irl-dan

irl-dan commented Jan 3, 2023

Did this ever make it into any released version of the package? Or are there any plans to do so? I'm also interested in this.

@jmdelahanty

I too have an application that this would potentially be very helpful for!

@AtomicCactus

Just want to +1 this request: an incremental out-of-core fit method would be awesome!

@SbstnErhrdt

Joining +1

@JdbermeoUZH

+1!

@ECburx

ECburx commented Feb 23, 2023

+1

@mahabenfares

Hello, I want to use UMAP for data streams. Do you offer an incremental version of UMAP that I can use, please? I saw that you talked about AlignedUMAP; I wanted to test it, but I don't know what to put in the relations dictionaries.

@lmcinnes
Owner

lmcinnes commented Mar 3, 2023

I think a lot of this is beyond what the basic UMAP approach can do, as it is essentially transductive. For those who really need this, I would suggest your best option is going to be the ParametricUMAP option, using new batches to retrain. If you need pure streaming you can use transform on ParametricUMAP for the streaming data, and store batches for batch-based retraining on a cycle that is reasonable given the retraining cost.
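A minimal sketch of that retrain-on-a-cycle pattern (the synthetic stream and the retraining interval are placeholders; ParametricUMAP requires tensorflow):

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
initial_data = rng.normal(size=(2000, 32))                # stand-in training set
stream = (rng.normal(size=(200, 32)) for _ in range(50))  # stand-in data stream

embedder = ParametricUMAP().fit(initial_data)
buffer = [initial_data]

for i, batch in enumerate(stream):
    # Pure streaming: embed new points with the trained encoder network.
    embedding = embedder.transform(batch)

    # Store batches, and retrain periodically; pick the interval to keep
    # the retraining cost reasonable for your data.
    buffer.append(batch)
    if (i + 1) % 10 == 0:
        embedder = ParametricUMAP().fit(np.vstack(buffer))
```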

@big-o

big-o commented Mar 7, 2023

I'm not sure I fully understand all of the issues, but is a scalable nearest neighbour search still a blocker? If so, I've had good success with the FAISS implementation of approximate NN search (went from days with the standard ball tree approach to seconds), and it can be wrapped fairly trivially in an sklearn API. The downside is that it limits you to Euclidean and cosine distance measures only - not sure if this limitation is due to the algorithm or because other distances just haven't been implemented yet.
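For reference, the wrapping I mean is roughly this (a sketch using an exact flat index; for approximate search at scale you would swap in one of FAISS's IVF or HNSW indexes):

```python
import faiss
import numpy as np

class FaissNN:
    """Minimal sklearn-style wrapper around a FAISS index."""

    def __init__(self, n_neighbors=15):
        self.n_neighbors = n_neighbors

    def fit(self, X):
        X = np.ascontiguousarray(X, dtype=np.float32)  # FAISS wants float32
        self.index_ = faiss.IndexFlatL2(X.shape[1])    # exact L2 search
        self.index_.add(X)
        return self

    def kneighbors(self, X):
        X = np.ascontiguousarray(X, dtype=np.float32)
        dists, idx = self.index_.search(X, self.n_neighbors)
        return np.sqrt(dists), idx  # IndexFlatL2 returns squared distances

# Usage: nn = FaissNN().fit(data); dists, idx = nn.kneighbors(queries)
```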

@lagunabravo

Thank you @lmcinnes for your great work.
I'm uncertain about how to use retraining with ParametricUMAP as you suggested above. I guess what I don't understand is how retraining modifies the embedder with regard to different batches of a large dataset. Should I split the dataset into batches, train/retrain across all of them, and then use the final embedder to apply transform across the whole dataset? Or, alternatively, should I apply transform after retraining for each individual batch?

@AlekseyFedorovich

I would love this feature!!

@123mitnik

+1
