
Partial fit / Incremental UMAP #62

Open
parashardhapola opened this issue Apr 30, 2018 · 31 comments

@parashardhapola
Contributor

Hi @lmcinnes,

There has been quite a lot of excitement about UMAP in the scRNA-Seq area recently. However, fitting larger datasets may be an issue depending on the amount of available memory. I would love to see a 'partial_fit' method (akin to the one available in scikit-learn's IncrementalPCA) in UMAP, so that data could be lazily loaded and fitted. I realize that this might be non-trivial, if not theoretically infeasible, but your recent implementation of the 'transform' method got my hopes up. It would be nice to have your thoughts on this.

Thanks.
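For reference, the scikit-learn pattern I have in mind looks like the following (a minimal sketch using IncrementalPCA; the chunk generator is a stand-in for any lazy data source):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def load_chunks(n_chunks=10, chunk_size=1000, n_features=50):
    # Stand-in for lazily loading data from disk one chunk at a time.
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        yield rng.normal(size=(chunk_size, n_features))

ipca = IncrementalPCA(n_components=2)
for chunk in load_chunks():
    ipca.partial_fit(chunk)  # only one chunk is ever in memory

# The hope would be an analogous (currently hypothetical) UMAP API:
#   reducer = umap.UMAP()
#   for chunk in load_chunks():
#       reducer.partial_fit(chunk)
```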

@lmcinnes
Owner

If I understand you correctly, you want essentially an "out-of-core" fit method that doesn't require the whole dataset to be in memory at one time?

I believe that this is possible, but would likely be slow. Essentially what is really needed is a fast, out-of-core, approximate nearest neighbor algorithm. I believe what I have now may actually work with dask arrays, which would provide out-of-core support, but for the fact that check_array converts the input to a concrete array at the start, and modulo how numba handles dask (I suspect it doesn't). In other words, I think this is algorithmically quite tractable, but might require a fair amount of code to actually make it happen.
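For illustration, a minimal sketch of the check_array problem just described (assuming recent dask and scikit-learn versions):

```python
import dask.array as da
from sklearn.utils import check_array

# A lazily evaluated, chunked array that never needs to be fully in memory.
X = da.random.random((1_000_000, 100), chunks=(10_000, 100))

# check_array coerces its input to a concrete ndarray, which pulls the
# entire dask array into memory and defeats the out-of-core approach.
X_dense = check_array(X)
```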

If there is significant interest in this (and it seems there may be) I'll see if I can figure out how to actually make this happen.

@parashardhapola
Contributor Author

Yes indeed, I'm looking for an out-of-core fit method and I'm sure that it would be of interest to many.

Pardon me if this is a very naive suggestion, but I was thinking that maybe this functionality could initially be implemented downstream of the distance calculation. In effect, a user would provide a generator that yields a distance vector for each sample (of shape m, one row of an m x m distance matrix), instead of the complete pre-computed matrix itself. This might avoid, for now, the requirement of an out-of-core ANN algorithm.
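As a rough sketch of what I mean (the generator below is purely illustrative, not existing API):

```python
import numpy as np

def stream_distance_rows(data, chunk_size=1024):
    """Yield one row of the m x m Euclidean distance matrix at a time.

    `data` can be a memory-mapped array; distances are computed against
    chunks of it, so only O(chunk_size * n_features) values are in memory
    at once, never the full m x m matrix.
    """
    m = data.shape[0]
    for i in range(m):
        xi = np.asarray(data[i])
        row = np.empty(m)
        for start in range(0, m, chunk_size):
            chunk = np.asarray(data[start:start + chunk_size])
            row[start:start + chunk_size] = np.linalg.norm(chunk - xi, axis=1)
        yield row

# Hypothetical usage, if UMAP grew support for such a generator with
# metric="precomputed" (no such entry point exists today):
#   embedding = fit_from_distance_rows(stream_distance_rows(mmap_data))
```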

@stsievert

Essentially what is really needed is a fast, out-of-core, approximate nearest neighbor algorithm.

Is this what we were working on during the SciPy sprints on some branch/fork of https://github.com/lmcinnes/pynndescent? I'd like to see the code – I'm curious what progress has been made, and would be happy to help out more.

@lmcinnes
Owner

Yes, that's the one. I got the basic code hammered out, but hadn't fixed all the bugs yet. I also chatted with Matt Rocklin, and it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array that is required. Still, slow but tractable is better than impossible.

@stsievert

it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array required.

I believe that. It sounds like the proposed solution will be enabled by dask/dask#3409.

Is there anywhere I can review the code you've implemented?

@lmcinnes
Owner

lmcinnes commented Jul 20, 2018 via email

@mrocklin

On the dask side I think that dask/dask#3409 (comment) is probably the next step to try (assuming I understand the situation on the UMAP side (which is doubtful)) :)

@soupault

soupault commented Jun 25, 2019

Really looking forward to it! The computer vision domain would benefit quite a lot from an out-of-core implementation.

@ksangeeta2429

ksangeeta2429 commented Aug 30, 2019

Has something already been released for this? I am also looking for the same thing, for acoustic deep learning purposes. I am looking to use UMAP on some terabytes of data.

@lmcinnes
Owner

Currently this is still, at best, an "in the works" issue. The first real hurdle to be overcome, the nearest neighbor search, is looking good. There is a parallel NN search, and there is reasonable hope that a few more changes will ensure it is dask-enabled for distributed arrays -- this has all been put into the PyNNDescent library, which will become a dependency for umap in the future. The next step is distributed SGD. This is easier, but has had less work done. Speculatively, I would suggest that these are features that may arrive in version 0.5. Note, of course, that 0.4 is not even released yet. I suspect that means you may have to use subsampling for your project for now. The good news is that the structures found by umap are generally fairly robust to subsampling.
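For the subsampling workaround, a minimal sketch (the file path and sizes are placeholders):

```python
import numpy as np
import umap

# Memory-map the large dataset rather than loading it outright.
data = np.load("big_dataset.npy", mmap_mode="r")  # placeholder path

# Fit on a random subsample that fits in memory...
rng = np.random.default_rng(42)
subsample = rng.choice(data.shape[0], size=100_000, replace=False)
reducer = umap.UMAP().fit(np.asarray(data[subsample]))

# ...then embed the remaining points in batches via transform().
batch_size = 50_000
embedding = np.vstack([
    reducer.transform(np.asarray(data[start:start + batch_size]))
    for start in range(0, data.shape[0], batch_size)
])
```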

@stsievert

next step is distributed SGD.

dask/dask#3409 is related (and mentioned above). It kept popping up in my Dask distributed SGD implementation (dask/dask-glm#69).

dask/dask#3901 resolves this issue (thanks to @TomAugspurger). Now, shuffling a distributed Dask array is much quicker (roughly a 100x improvement: 13.7 s down to 140 ms).
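For context, the kind of operation those changes speed up looks roughly like this (a sketch; shuffling a distributed array amounts to slicing it with an unsorted index):

```python
import numpy as np
import dask.array as da

x = da.random.random((1_000_000, 50), chunks=(100_000, 50))

# Slicing with a random permutation forces data movement between chunks --
# the random(ish) access pattern that distributed SGD keeps running into.
perm = np.random.default_rng(0).permutation(x.shape[0])
shuffled = x[perm]
```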

@teenooy

teenooy commented Dec 11, 2019

I would really like to see an incremental training feature for UMAP. I work with both image and text data, and in both categories UMAP easily generates the best manifold with the least amount of tuning compared to everything else I've tried.

@lmcinnes
Owner

At present it looks like this is something that might be possible in the v0.5 to v0.6 time-frame. That's not soon, but it may yet happen.

@ghost

ghost commented Jan 31, 2020

I am writing to mention that I am very interested in this.

@ghost

ghost commented Apr 19, 2020

A possible method for doing a large-scale nearest neighbor search efficiently would be to use something like k-d trees or this. I attempted this a while ago and found that it works for lower-dimensional data, such as VGG16 features. I tried to use it on some very high-dimensional genomics applications and had no luck.

I’ll try to dig up this code or my notes on this.
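In the meantime, a minimal sketch of the k-d tree approach (using scipy; as noted, this only pays off in low dimensions, since k-d trees degrade toward brute force as the dimensionality grows):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 10))  # low-dimensional feature vectors

tree = cKDTree(data)                 # build once
dists, idx = tree.query(data, k=15)  # 15 nearest neighbours per point
```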

@galinaalperovich

galinaalperovich commented May 2, 2020

Would be happy to see this incremental UMAP feature!

@lmcinnes
Owner

lmcinnes commented May 3, 2020

Depending on what exactly you are looking for, this may now be available (in preliminary form) in the 0.5dev branch. To make it work you would use the new AlignedUMAP class and the update functionality therein. I have a notebook gist that provides a basic walkthrough of these sorts of features. If your goal is to incrementally add batches of new points to an existing embedding, updating the embedding as you go, this approach should suffice. It may not meet all of your needs, but hopefully it will solve some problems for some users.
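A minimal sketch of that incremental pattern (based on the 0.5dev branch, so details may shift; each relations dict maps row indices in one batch to the matching row indices in the next):

```python
import numpy as np
import umap

rng = np.random.default_rng(0)

# Build three batches where the last 100 rows of each batch reappear as
# the first 100 rows of the next, giving us known relations.
batches = [rng.normal(size=(500, 20))]
for _ in range(2):
    prev = batches[-1]
    batches.append(np.vstack([prev[-100:], rng.normal(size=(400, 20))]))

# Row 400 + i of batch t corresponds to row i of batch t + 1.
relations = [{400 + i: i for i in range(100)} for _ in range(len(batches) - 1)]

aligned = umap.AlignedUMAP().fit(batches, relations=relations)

# Incrementally add a new batch, updating the embedding as you go.
new_batch = np.vstack([batches[-1][-100:], rng.normal(size=(400, 20))])
aligned.update(new_batch, relations={400 + i: i for i in range(100)})

embeddings = aligned.embeddings_  # one embedding per batch
```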

@fedden

fedden commented Apr 20, 2021

Thanks for the early release and the consideration. I'm just wondering if there is an approach to take with regard to the relations dictionaries when successive instances in the dataset do not depend on each other (no overlap) and as such have no relations between successive frames.

E.g. I tried passing in empty dicts in the hope that they would be interpreted as no relations between successive instances:

reducer.fit([data], relations=[{} for _ in data][:-1])

But I get an error roughly saying that it expects non-zero-length dicts.

@lmcinnes
Owner

That's definitely a bug -- as you say, it should treat that as no relation. I presume you have a dataset you want aligned except at some intermediate slice where you lack relations? I'll have to dig in a little and see if this is easy to fix. Having at least some relation is baked into the assumptions right now, but I can see that occasionally having empty relations does make sense.

@irl-dan

irl-dan commented Jan 3, 2023

Did this ever make it into any released version of the package? Or are there any plans to do so? I'm also interested in this.

@jmdelahanty

I too have an application that this would potentially be very helpful for!

@AtomicCactus

Just want to +1 this request: an incremental out-of-core fit method would be awesome!

@SbstnErhrdt

Joining +1

@JdbermeoUZH

+1!

@ECburx

ECburx commented Feb 23, 2023

+1

@mahabenfares

Hello, I want to use UMAP for data streams. Do you offer an incremental version of UMAP that I can use, please? I saw that you talked about AlignedUMAP; I wanted to test it, but I don't know what to put in the relations dictionaries.

@lmcinnes
Owner

lmcinnes commented Mar 3, 2023

I think a lot of this is beyond what the basic UMAP approach can do, as it is essentially transductive. For those who really need this, I would suggest your best option is going to be the ParametricUMAP option, using new batches to retrain. If you need pure streaming you can use transform on ParametricUMAP for the streaming data, and store batches for batch-based retraining on a cycle that is reasonable given the retraining cost.
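A minimal sketch of that retrain-on-a-cycle pattern (the synthetic stream and the retraining interval are placeholders; ParametricUMAP requires tensorflow):

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
initial_data = rng.normal(size=(2000, 32))                # stand-in training set
stream = (rng.normal(size=(200, 32)) for _ in range(50))  # stand-in data stream

embedder = ParametricUMAP().fit(initial_data)
buffer = [initial_data]

for i, batch in enumerate(stream):
    # Pure streaming: embed new points with the trained encoder network.
    embedding = embedder.transform(batch)

    # Store batches, and retrain periodically; pick the interval to keep
    # the retraining cost reasonable for your data.
    buffer.append(batch)
    if (i + 1) % 10 == 0:
        embedder = ParametricUMAP().fit(np.vstack(buffer))
```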

@big-o

big-o commented Mar 7, 2023

I'm not sure I fully understand all of the issues, but is a scalable nearest neighbour search still a blocker? If so, I've had good success with the FAISS implementation of approximate NN search (went from days with the standard ball tree approach to seconds), and it can be wrapped fairly trivially in an sklearn API. The downside is that it limits you to Euclidean and cosine distance measures only - not sure if this limitation is due to the algorithm or because other distances just haven't been implemented yet.
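For reference, the wrapping I mean is roughly this (a sketch using an exact flat index; for approximate search at scale you would swap in one of FAISS's IVF or HNSW indexes):

```python
import faiss
import numpy as np

class FaissNN:
    """Minimal sklearn-style wrapper around a FAISS index."""

    def __init__(self, n_neighbors=15):
        self.n_neighbors = n_neighbors

    def fit(self, X):
        X = np.ascontiguousarray(X, dtype=np.float32)  # FAISS wants float32
        self.index_ = faiss.IndexFlatL2(X.shape[1])    # exact L2 search
        self.index_.add(X)
        return self

    def kneighbors(self, X):
        X = np.ascontiguousarray(X, dtype=np.float32)
        dists, idx = self.index_.search(X, self.n_neighbors)
        return np.sqrt(dists), idx  # IndexFlatL2 returns squared distances

# Usage: nn = FaissNN().fit(data); dists, idx = nn.kneighbors(queries)
```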

@lagunabravo

Thank you @lmcinnes for your great work.
I'm uncertain about how to use retraining with ParametricUMAP as you suggested above. I guess what I don't understand is how retraining modifies the embedder with regard to different batches of a large dataset. Should I split the dataset into batches, train/retrain across all of them, and then use the final embedder to apply transform across the whole dataset? Or, alternatively, should I apply transform after retraining for each individual batch?

@AlekseyFedorovich

I would love this feature!!

@123mitnik

+1
