Partial fit / Incremental UMAP #62
If I understand you correctly, you want an "out-of-core" fit method that doesn't require the whole dataset to be in memory at one time? I believe this is possible, but it would likely be slow. What is really needed is a fast, out-of-core, approximate nearest neighbor algorithm. I believe what I have now may actually work with dask arrays, which would provide out-of-core support. If there is significant interest in this (and it seems there may be) I'll see if I can figure out how to actually make this happen. |
Yes indeed, I'm looking for an out-of-core fit method, and I'm sure it would be of interest to many. Pardon me if this is a very naive suggestion, but I was thinking that maybe this functionality could initially be implemented downstream of the distance calculation. In effect, a user would provide a generator that yields a distance vector for each sample (of shape m, for an m x m distance matrix) instead of the complete pre-computed matrix itself. This may avoid, for now, the requirement of an out-of-core ANN algorithm. |
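The idea above can be sketched without touching UMAP internals: consume one distance row at a time and keep only each sample's k nearest neighbors, so the full m x m matrix never has to exist in memory. This is a hypothetical illustration, not UMAP's API; the function name and the brute-force row generator are made up for the example.

```python
import numpy as np

def knn_from_distance_rows(row_gen, n_neighbors):
    """Consume one distance row per sample (length m) and keep only the
    k nearest neighbors of each sample, so the full m x m distance
    matrix is never materialized at once."""
    indices, distances = [], []
    for i, row in enumerate(row_gen):
        row = np.asarray(row, dtype=float).copy()
        row[i] = np.inf                       # exclude the point itself
        nn = np.argsort(row)[:n_neighbors]    # k smallest distances
        indices.append(nn)
        distances.append(row[nn])
    return np.array(indices), np.array(distances)

# Toy usage: distance rows computed lazily, one sample at a time.
data = np.random.RandomState(0).rand(100, 8)
rows = (np.linalg.norm(data - x, axis=1) for x in data)
idx, dist = knn_from_distance_rows(rows, n_neighbors=5)
```

The resulting (indices, distances) pair is exactly the shape of input that a k-NN-based embedding step needs, which is why the suggestion sidesteps the out-of-core ANN requirement at the cost of O(m^2) distance computations.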
Is this what we were working on during the SciPy sprints, on some branch/fork of https://github.com/lmcinnes/pynndescent? I'd like to see the code – I'm curious what progress has been made, and would be happy to help out more. |
Yes, that's the one. I got basic code hammered out, but hadn't fixed all the bugs. I also chatted with Matt Rocklin and it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array required. Still, slow but tractable is better than impossible. |
I believe that. It sounds like the proposed solution will be enabled by dask/dask#3409. Is there anywhere I can review the code you've implemented? |
I didn't get it working to a point of pushing it up anywhere. I have a colleague who may be picking up this work, so I might try and loop him in with all of this on Monday. |
On the dask side I think that dask/dask#3409 (comment) is probably the next step to try (assuming I understand the situation on the UMAP side (which is doubtful)) :) |
Really looking forward to it! The computer vision domain would benefit quite a lot from an out-of-core implementation. |
Was there something already released for this? I am also looking for the same thing, for acoustic deep learning purposes: I am looking at using UMAP over some TB of data. |
Currently this is still, at best, an "in the works" issue. The first real hurdle, the nearest neighbor search, is looking good: there is a parallel NN search, and there is reasonable hope that a few more changes will make it dask-enabled for distributed arrays. This has all been put into the PyNNDescent library, which will become a dependency for umap in the future. The next step is distributed SGD. This is easier, but has had less work done on it. Speculatively, I would suggest these are features that may arrive in version 0.5. Note, of course, that 0.4 is not even released yet, so I suspect you may have to use subsampling for your project for now. The good news is that the structures found by umap are generally fairly robust to subsampling. |
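The subsampling workaround suggested above can be sketched as: fit on a random subsample, then map the remaining points through transform() in chunks. This is an illustrative helper, not part of umap; the function name and defaults are made up, and for a runnable demo we substitute scikit-learn's PCA, which shares the fit/transform API that umap.UMAP exposes.

```python
import numpy as np
from sklearn.decomposition import PCA

def subsample_fit_transform(reducer, data, n_fit=10_000, chunk=5_000, seed=0):
    """Fit `reducer` on a random subsample, then push the full dataset
    through transform() in chunks, so only one chunk plus the fit sample
    is ever embedded at a time. `reducer` is any estimator with fit,
    transform, and an n_components attribute -- umap.UMAP() in practice."""
    rng = np.random.RandomState(seed)
    fit_idx = rng.choice(len(data), size=min(n_fit, len(data)), replace=False)
    reducer.fit(data[fit_idx])
    out = np.empty((len(data), reducer.n_components))
    for start in range(0, len(data), chunk):
        out[start:start + chunk] = reducer.transform(data[start:start + chunk])
    return out

# Illustrated with PCA on toy data; umap.UMAP() would slot in the same way.
data = np.random.RandomState(1).rand(1_200, 10)
embedding = subsample_fit_transform(PCA(n_components=2), data, n_fit=600, chunk=500)
```

This only controls the memory of the embedding pass; the raw data still needs to be readable in chunks (e.g. memory-mapped) for a TB-scale dataset.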
dask/dask#3409 is related (and mentioned above). It kept popping up in my Dask distributed SGD implementation (dask/dask-glm#69). dask/dask#3901 resolves this issue (thanks to @TomAugspurger). Now, shuffling to a distributed Dask array is much quicker (~100x improvement, from 13.7s down to 140ms). |
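For concreteness, the operation being sped up is a full shuffle of a chunked dask array via fancy indexing with a permutation. A minimal sketch (sizes chosen arbitrarily for illustration):

```python
import numpy as np
import dask.array as da

# A chunked dask array standing in for a distributed dataset.
x = da.from_array(np.arange(1_000_000, dtype=float).reshape(10_000, 100),
                  chunks=(1_000, 100))

# Fancy-index with a random permutation: a full shuffle across chunks,
# the access pattern that dask/dask#3901 made fast.
perm = np.random.RandomState(0).permutation(x.shape[0])
shuffled = x[perm]            # lazy; evaluated at compute time
result = shuffled.compute()
```

This is exactly the kind of random(ish) access over a large distributed array that was flagged earlier in the thread as the slow part.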
Would really like to see the incremental training feature for UMAP. I work with both image and text data, and UMAP easily generates the best manifold with the least amount of tuning compared to everything else I've tried in both categories. |
At present it looks like this is something that might be possible in the v0.5 to v0.6 time-frame. That's not soon, but it may yet happen. |
I am writing to mention that I am very interested in this. |
A possible way to do a large-scale nearest neighbor search efficiently would be to use something like k-d trees or this. I attempted this a while ago and found that it works for lower-dimensional data like VGG16 features. I tried to use it on some very high-dimensional genomics applications and had no luck. I'll try to dig up this code or my notes on it. |
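As a sketch of the k-d tree route, SciPy's cKDTree handles the low-dimensional case well; this is illustrative only, and the dimension/size here are arbitrary. The curse of dimensionality is why it breaks down for high-dimensional genomics features, as described above.

```python
import numpy as np
from scipy.spatial import cKDTree

# Low-dimensional features (e.g. pooled/PCA-reduced VGG16 activations):
# the tree is effective here, but query cost degrades toward brute force
# as dimensionality grows.
rng = np.random.RandomState(0)
data = rng.rand(5_000, 8)

tree = cKDTree(data)
dist, idx = tree.query(data, k=6)   # k=6: each point plus 5 neighbors
neighbors = idx[:, 1:]              # drop the self-match in column 0
```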
Would be happy to see this incremental UMAP feature! |
Depending on what exactly you are looking for, this may now be available (in preliminary form) in the 0.5dev branch. To make it work you would use the new, special, AlignedUMAP class and its relations dictionaries, which describe how samples in successive data slices correspond to one another. |
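To make the relations format concrete: each entry of the relations list is a dict mapping a row index in slice i to the index of the same sample in slice i+1. A small sketch, assuming a sliding window over a stream where the last rows of each slice reappear at the start of the next (the helper name and the window scheme are made up for illustration):

```python
def make_overlap_relations(slice_sizes, overlap):
    """Build the relations list: relations[i] maps a row index in slice i
    to the index of the same sample in slice i+1. Here we assume the last
    `overlap` rows of each slice are the first `overlap` rows of the next."""
    relations = []
    for size in slice_sizes[:-1]:
        relations.append({size - overlap + j: j for j in range(overlap)})
    return relations

relations = make_overlap_relations([100, 100, 100], overlap=50)
# Then, assuming the 0.5dev branch is installed, something like:
#   mapper = umap.AlignedUMAP().fit(list_of_slices, relations=relations)
```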
Thanks for the early release and consideration. I'm wondering if there is an approach to take with the relations dictionaries when successive instances in the dataset do not depend on each other (no overlap), and as such have no relations between successive frames. E.g., I tried passing in empty dicts in the hope they would be interpreted as no relations between successive instances:
reducer.fit([data], relations=[{} for _ in data][:-1])
but I get an error roughly saying it expects non-zero-length dicts. |
That's definitely a bug -- as you say, it should treat that as no relation. I presume you have a dataset you want aligned except at some intermediate slice where you lack relations? I'll have to dig in a little and see if this is easy to fix. Having at least some relation is baked into the assumptions right now, but I can see that having empty relations occasionally does make sense. |
Did this ever make it into any released version of the package? Or are there any plans to do so? I'm also interested in this. |
I too have an application for which this would potentially be very helpful! |
Just want to +1 this request, an incremental out-of-core fit method would be awesome! |
Joining +1 |
+1! |
+1 |
Hello, I want to use UMAP on data streams. Do you offer an incremental version of UMAP that I can use, please? I saw the discussion of AlignedUMAP above and wanted to test it, but I don't know what to put in the relations dictionaries. |
I think a lot of this is beyond what the basic UMAP approach can do, as it is essentially transductive. For those who really need this, I would suggest your best option is going to be the ParametricUMAP option, using new batches to retrain. If you need pure streaming you can use transform on ParametricUMAP for the streaming data, and store batches for batch-based retraining on a cycle that is reasonable given the retrain cost. |
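The transform-then-periodically-retrain pattern described above can be sketched as a small wrapper. This is an illustrative class, not part of umap; the class name, buffer policy, and retrain schedule are all made up, and the runnable demo substitutes scikit-learn's PCA for a ParametricUMAP instance since both expose fit/transform.

```python
import numpy as np
from sklearn.decomposition import PCA

class StreamingEmbedder:
    """Embed each incoming batch with transform() (the cheap streaming
    path), keep a bounded buffer of recent batches, and refit on the
    buffer every `retrain_every` batches (the expensive retrain path)."""

    def __init__(self, reducer, retrain_every=10, buffer_batches=20):
        self.reducer = reducer          # e.g. a ParametricUMAP instance
        self.retrain_every = retrain_every
        self.buffer_batches = buffer_batches
        self.buffer = []
        self.n_seen = 0

    def process(self, batch):
        self.buffer.append(batch)
        if len(self.buffer) > self.buffer_batches:
            self.buffer.pop(0)          # drop the oldest stored batch
        self.n_seen += 1
        if self.n_seen == 1 or self.n_seen % self.retrain_every == 0:
            self.reducer.fit(np.vstack(self.buffer))   # periodic retrain
        return self.reducer.transform(batch)           # streaming embed

# Demo with PCA standing in for ParametricUMAP.
embedder = StreamingEmbedder(PCA(n_components=2), retrain_every=3)
out = embedder.process(np.random.RandomState(0).rand(40, 6))
```

The retrain cadence and buffer size are the knobs the comment above refers to: pick them so that the retrain cost stays reasonable relative to how fast the stream drifts.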
I'm not sure I fully understand all of the issues, but is a scalable nearest neighbour search still a blocker? If so, I've had good success with the FAISS implementation of approximate NN search (it went from days with the standard ball tree to seconds), and it can be wrapped up fairly trivially in an sklearn API. The downside is that it limits you to Euclidean and cosine distance measures only – I'm not sure if this limitation is due to the algorithm or because other distances just haven't been implemented yet. |
Thank you @lmcinnes for your great work. |
I would love this feature!! |
+1 |
Hi @lmcinnes,
There has been quite a lot of excitement about UMAP in the scRNA-Seq area recently. However, fitting larger datasets might be an issue depending on the amount of available memory. I would love to see a 'partial_fit' method (akin to the one available in scikit-learn's IncrementalPCA) in UMAP, so that data could be lazily loaded and fitted. I realize this might be non-trivial, if not theoretically infeasible, but your recent implementation of the 'transform' method got my hopes up. It would be nice to have your thoughts on this.
Thanks.