Online NMF #2007
Good start @anotherbugmaster 👍
The main things you need to do now:
- Benchmark (add a notebook where you compare the current implementation with others on different tasks)
- Support for BoW format (feel free to drop numpy dense matrices)
- API (should be very similar to Lda/Lsi)
- Tests
1. Improved performance ~4x
2. LDA-like API
3. BoW compatibility
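For readers skimming this thread, here is a minimal sketch of what that Lda-like, BoW-based usage looks like (the `Nmf` constructor arguments and method names such as `show_topics` are assumed to match the merged gensim API):

```python
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf

documents = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "trees", "minors", "survey"],
]

dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]  # sparse BoW, no dense numpy

nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=2, passes=5, random_state=42)

print(nmf.show_topics())  # same calls you would make on LdaModel
print(nmf[corpus[0]])     # topic distribution for a single BoW document
```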
Time to merge, awesome work @anotherbugmaster 🚀💣🔥💣🚀
@anotherbugmaster can you share those TL;DR comparisons against other implementations (sklearn etc), as per my comment above (time, memory, quality)? I'd like to include that in the release notes. Thanks!
I found some numbers in the images at the bottom of the tutorial. Is the Gensim implementation really 6x slower than sklearn's?
Sure, here they are: https://github.com/anotherbugmaster/gensim/blob/e34b939e9a5f1f79f9582ef3d0618fd43bbd7be2/docs/notebooks/nmf_wikipedia.ipynb
Only with certain hyperparameters. It's 2-3x faster than sklearn in most cases, and it also has a better F1:
@anotherbugmaster thanks, but I don't know how to read any of these tables or what these numbers mean. There's almost no text in the tutorial. The part that was easy to interpret was the images at the end, which say Gensim is 6x slower than anything else :( Can you please post a TL;DR comparison against sklearn on the same dataset (wiki? images?): memory, time, quality? Why should someone use our NMF implementation, instead of other implementations?
Ok, Radim, how about the first table in the release notes? https://github.com/RaRe-Technologies/gensim/releases Also, here are the insights from the tutorial notebook:
Here is the RAM comparison on Wikipedia: NaN means that the metric wasn't computed for that particular model (coherence for sklearn's NMF, for example). F1 is the quality of a model on the downstream task, 20-newsgroups classification. Our NMF is online (you can't just run sklearn on Wikipedia, it won't fit in memory) and faster than sklearn's NMF on sparse and large datasets (which is the case for topic modeling).
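For context, here is a hedged sketch of what that downstream F1 evaluation could look like: NMF document-topic vectors used as features for a linear classifier on 20-newsgroups. The preprocessing and classifier choices here are illustrative assumptions, not the exact benchmark code from the notebook.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf
from gensim.utils import simple_preprocess
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

NUM_TOPICS = 50

data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
texts = [simple_preprocess(doc) for doc in data.data]
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.3)  # trim rare/common terms
corpus = [dictionary.doc2bow(text) for text in texts]

nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=NUM_TOPICS, random_state=0)

def topic_features(bow):
    """Dense document-topic distribution used as classifier features."""
    vec = np.zeros(NUM_TOPICS)
    for topic_id, weight in nmf[bow]:
        vec[topic_id] = weight
    return vec

X = np.vstack([topic_features(bow) for bow in corpus])
X_train, X_test, y_train, y_test = train_test_split(X, data.target, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1 (macro):", f1_score(y_test, clf.predict(X_test), average="macro"))
```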
@anotherbugmaster I already saw all these tables and notebook multiple times. They are not what I am asking. Nobody but you knows how the numbers relate, what's important, or even which way is up. I am asking for a short human summary of a head-to-head comparison of memory, time and quality with other NMF implementations (e.g. sklearn), on a concrete dataset.
I'm sure the code is fine, if @menshikh-iv OKed it and merged. That's not the issue. The issue is the documentation, especially with regard to motivation and user guidance. As a user, I don't understand where this NMF implementation stands, how it compares to other implementations, when I should use it (or not use it), what the parameters mean, and which ones I should change (or not touch). I can help with the language once I understand it myself, but I need some insights, not a huge table full of unexplained numbers and code cells without commentary.

@menshikh-iv do you understand what I'm asking? Can you help out here?
For clarity, here's an example of what I meant by "insights": something users may understand, that grounds them conceptually and guides their intuition about this implementation:

Gensim NMF should be used whenever you want to retrieve interpretable (non-negative factors) topics from a very large and sparse dataset. Its online incremental training allows you to update the NMF model in pieces, in constant memory. This is in stark contrast to other NMF implementations (such as in scikit-learn), where the entire dataset must be loaded into memory at once. It also allows resuming training with more data at a later time. Another application of this "online" architecture is joining NMF models built from partial data slices into a single model (e.g. individual NMF models from weekly time-slices combined into a single NMF model for the whole year).

In terms of memory, the Gensim NMF implementation scales linearly with the number of terms and topics. You also need to be able to load a partial chunk of documents into RAM at a time (the chunksize parameter).

In terms of CPU, the runtime is dominated by the coordinate descent in each update iteration. You can control the CPU-accuracy tradeoff by tweaking the ABC parameter. The defaults are set to work well on standard English texts (sparsity <1%), but if your dataset is dense, you may want to change it to EFG.

In terms of model quality, the algorithm implemented in Gensim NMF follows this paper. It achieves the online training capability by calculating only an approximate XYZ. On the English Wikipedia, this results in an L2 reconstruction error of ABC (compared to sklearn's DEF). For more information, see the paper above or our benchmarks here.

If you want to use NMF, check out our official tutorial here for a step-by-step code guide. The API parameters are documented here.

(Just an example: maybe the facts are wrong, or the implementation cannot do this -- I don't know. But this was our goal.)
Radim, to be clear, the Olivetti faces decomposition is added just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices. The main benchmark dataset is 20-newsgroups, and the huge table concerns that dataset. As for the quality, I can't entirely agree, because:
I see what you mean by insights. I'll try to make something similar to your example.
Gensim NMF should be used whenever you want to retrieve interpretable (non-negative factors) topics from a very large and sparse dataset. Its online incremental training allows you to update the NMF model in pieces, in constant memory. This is in stark contrast to other NMF implementations (such as in scikit-learn), where the entire dataset must be loaded into memory at once. It also allows resuming training at a later time.

In terms of memory, the Gensim NMF implementation scales linearly with the number of terms and topics. You also need to be able to load a partial chunk of documents into RAM at a time (the chunksize parameter). For example, on the English Wikipedia dataset, you'll need 150 MB of RAM for 50 NMF topics and a 100k vocabulary, updating the model with chunks of 2,000 documents at a time. See this notebook table for more details and benchmark numbers.

In terms of CPU, the runtime is dominated by the coordinate descent in each update iteration. You can control the CPU-accuracy tradeoff by tweaking the relevant parameters.

In terms of model quality, the algorithm implemented in Gensim NMF follows this paper. It achieves the online training capability by accumulating the document-topic matrices of each subsequent batch in a special way and then iteratively computing the topic-word matrix.

For more information, see the paper above or our benchmarks here. If you want to use NMF, check out our official tutorial here for a step-by-step code guide. The API parameters are documented here.
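To make the "update in pieces, resume later" claim concrete, here is a small sketch of that workflow. `tokenized_docs` and `new_tokenized_docs` are hypothetical placeholders for your own document streams (for real Wikipedia runs you'd use something like gensim.corpora.WikiCorpus), and `chunksize`/`update()` are assumed to behave as in gensim's other online models:

```python
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf

# `tokenized_docs` / `new_tokenized_docs` are hypothetical iterables of
# tokenized documents; substitute your own corpus streaming here.
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

nmf = Nmf(
    corpus=corpus,
    id2word=dictionary,
    num_topics=50,
    chunksize=2000,  # the model is updated on ~2,000 BoW documents at a time
)

# Later: resume training on new documents without retraining from scratch.
nmf.update([dictionary.doc2bow(doc) for doc in new_tokenized_docs])
```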
Thanks for your patience, but we need to improve the docs significantly before we really promote this exciting new model addition. Still missing: clear numbers from a single benchmark (ideally Wikipedia; 3 numbers: RAM + time + reconstruction error/loss), and a TL;DR comparison to sklearn (the same 3 calculated/estimated numbers, for a direct head-to-head). I don't know how else to say it, but we need a human-friendly TL;DR comparison of the NMF implementation in Gensim and other NMF implementations. The current nondescript table full of numbers and NaNs, in a notebook without comments, is insufficient.

@anotherbugmaster Can you improve the parameter intuition too, please? Simply enumerating the parameter names is not enough.

Try to see this from the user perspective, please. Users are not going to decode academic papers or pore over the code just to understand what this model is supposed to do and how it differs from their other options. We have to provide a basic overview and intuition.
That wasn't clear at all from the notebook. In fact, "Olivetti faces" is not even introduced / described anywhere. As a reader, I don't know what I'm looking at, why, or what I'm supposed to be seeing there.

Does this model support merging partial models built from independent chunks, or not? I see you removed this sentence from my example text which you used as a template (I completely made it up; are you sure the algo descriptions fit?), but the rest of the text makes it sound like it does support such partial training.
Radim, as I wrote before, I can't run sklearn's NMF on Wikipedia (at least on my machine); it takes too much RAM. I can either run it on a smaller corpus (like 20-newsgroups) or compare NMF with some other model, LDA for example (though it wouldn't be completely fair to compare L2 there). Do you have any ideas on how I can implement the right benchmark?
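One possible shape for such a head-to-head, sketched on a corpus that does fit in RAM (assumes `corpus` and `dictionary` built as in the 20-newsgroups sketch above; `tracemalloc` only tracks Python-level allocations, so treat the memory numbers as rough estimates):

```python
import time
import tracemalloc

from gensim.matutils import corpus2csc
from gensim.models.nmf import Nmf
from sklearn.decomposition import NMF as SklearnNMF


def measure(fit):
    """Return (model, wall seconds, peak Python-heap MB) for one training run."""
    tracemalloc.start()
    t0 = time.time()
    model = fit()
    elapsed = time.time() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return model, elapsed, peak / 1e6


# Both models see the same data: gensim as streamed BoW,
# sklearn as one docs-x-terms sparse matrix.
docs_x_terms = corpus2csc(corpus, num_terms=len(dictionary)).T.tocsr()

gensim_nmf, t_g, mem_g = measure(
    lambda: Nmf(corpus=corpus, id2word=dictionary, num_topics=50, random_state=0)
)
sklearn_nmf, t_s, mem_s = measure(
    lambda: SklearnNMF(n_components=50, random_state=0).fit(docs_x_terms)
)

print(f"gensim : {t_g:.1f}s, peak {mem_g:.0f} MB")
print(f"sklearn: {t_s:.1f}s, peak {mem_s:.0f} MB, "
      f"Frobenius reconstruction error {sklearn_nmf.reconstruction_err_:.1f}")
```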
Okay, I obviously need to revamp the notebooks and NMF's documentation. I'll try to do it this week.
Sure. I think I'll add more info to the module docstrings and describe what the W, h and r matrices mean and how exactly the algorithm works. Those parameters are for the estimation and maximization steps of the algorithm.
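As a starting point for that docstring, here is a hedged sketch of how those two parameter groups map onto the training loop (parameter names like `w_max_iter` are taken from the merged gensim API; the comments are intuition under that assumption, not a formal spec):

```python
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf

texts = [["sparse", "topic", "model"], ["topic", "model", "online"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

nmf = Nmf(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    # Maximization step: how hard to fit the topic-word matrix W on each chunk.
    w_max_iter=200,
    w_stop_condition=1e-4,  # stop early once W's improvement falls below this
    # Estimation step: how hard to fit each document-topic vector h.
    h_max_iter=50,
    h_stop_condition=1e-3,  # same early-stopping idea, for h
    kappa=1.0,              # step size of the W updates
)
```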
I see that a lot of things seem vague, I'll try to clear things up.
Fair enough. I can either elaborate more on this section or we can completely remove it to not confuse readers.
Yep, that's right.
No, the model doesn't support merging of partial chunks, and I have no idea how to implement that, even in theory. Maybe "updating in pieces" is not a good description of the model's behavior; it's more that it updates iteratively, which means we need to go through the corpus top-down, not build partial models and then merge them. Yes, I get that you made the example up, but it's actually quite close to the truth; I fixed the parts where it wasn't.
I understand, hence the word "estimated". Btw how much RAM would be needed? Perhaps we can run it on one of our machines (64 GB RAM).
I like that idea (showing a different non-text use case/workflow); I'd prefer to keep it. Being visual always helps! Expanding the high-level descriptions ("what am I looking at and why, how is it different from others") is really what is needed here, across the board. We went over the API docs with Ivan today, and we'll need to:
Thanks!
Online Robust NMF. Resolves #132. Based on this paper.