NMF metrics and wikipedia #2371
Conversation
This reverts commit 1c3a064
I understand that, but what does this mean for the user? Subsample? Increase precision somewhere? Surely "nothing is being trained" is not desirable training progress. It sounds like a waste, a warning sign, as opposed to a best practice.
Obviously, it doesn't need to learn iteratively, and due to the small size of the dataset that became an advantage. Things work differently on Wikipedia: a large corpus is learned much faster with an iterative approach.
Oh, I see. Yes, the user can either subsample the trainset or increase the precision of the model (
Yes, but see our other parallel thread, about the model not really learning anything new in reality. Is it really faster, on a large corpus where it learns something? Wouldn't a subsample learn similar/better topics, in faster time (and 5x faster with sklearn)? The tutorial nicely shows how to do the training, but the case for why to use Gensim's NMF is not yet convincing.
Hm, I see. I think that the most obvious advantage over sklearn is the huge difference in RAM footprint, but I'll definitely need to check the subsampling case.
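For context, a minimal sketch of the subsampling idea discussed above, assuming gensim's gensim.models.nmf.Nmf API; the corpus, sample size and parameter values are illustrative placeholders, not taken from this PR:

import random
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf

# Illustrative tokenized corpus and sample size (hypothetical values).
docs = [
    ["human", "machine", "interface", "computer"],
    ["graph", "trees", "survey", "minors"],
    ["user", "response", "time", "system"],
] * 2000
sample_size = 1000

# Subsample the training set instead of feeding the full corpus.
subsample = random.sample(docs, min(sample_size, len(docs)))
dictionary = Dictionary(subsample)
corpus = [dictionary.doc2bow(doc) for doc in subsample]

# Tightening the W/H stop conditions trades training time for precision.
nmf = Nmf(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    passes=5,
    w_stop_condition=1e-4,
    h_stop_condition=1e-3,
    random_state=42,
)
print(nmf.show_topics())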
@piskvorky My understanding is that you'd like to quietly merge this branch without mentioning anything in the change log. Is that correct? @anotherbugmaster Please let me know when this is ready to merge.
@mpenkov correct. Let's merge and release 3.7.2 ASAP (and finish the rest of NMF later).
We can mention these NMF fixes in the release log. It's not a channel many people actively follow, so it still counts as "silent" :) I'll hold off on real promo until the tutorial is more compelling.
@mpenkov Sure! Ok then, I'll check the possible caveats more thoroughly and report the results next week.
@anotherbugmaster any progress on the SVD (LSI) stats, for comparison? Plus investigating the "NMF not learning anything" – effects of subsampling, precision on speed vs memory. Especially with regard to our recommended "best practices" and comparison to sklearn. Cheers!
@anotherbugmaster I'm going to merge this so we can go ahead with the next release. Please continue working on this PR and pushing your changes. We'll merge again in the near future.
@piskvorky, sorry, I forgot to run an experiment :( I think I'll report something this weekend.
@anotherbugmaster please start a new branch from the
@anotherbugmaster what's the status here? I'd love to get NMF wrapped up nicely, so we can promote it.
@piskvorky Sorry, Radim, still haven't finished the research yet. I'll do it ASAP.
@anotherbugmaster how is it going? Let's get NMF finished while there's still momentum (or remove it).
@piskvorky, I've changed the algorithm so that the model updates every batch even if the error is already low. The good news is that it didn't affect performance; here are the results: I've also changed how the error is reported in the logs, now it looks like this: The lower the error, the better.
@anotherbugmaster What's your rationale for these changes? What was the train-time + L2_norm before (to easily compare to this latest one)? In short, the same questions as before:
I'm not sure how forcing near-zero updates answers them. Thanks!
@anotherbugmaster ping -- can we finish this please?
@piskvorky I'll run an LSI tomorrow.
@piskvorky Ok, here's the Wikipedia comparison: Seems like training an LSI takes the same amount of time as an LDA. L2 norm is a bit better, though coherence is not.
The model updates every batch now (it skipped a lot of them before, which wasn't correct behavior); I don't see any problem with the update rate now.
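For reference, a sketch of how such a coherence comparison can be reproduced with gensim's CoherenceModel, reusing the nmf, corpus and dictionary placeholders from the earlier sketch in this thread; the Wikipedia benchmark above may have used different settings:

from gensim.models import CoherenceModel

# Placeholder names: any trained gensim topic model (NMF, LDA or LSI)
# exposing get_topics() works here; 'u_mass' coherence needs the corpus.
cm = CoherenceModel(model=nmf, corpus=corpus, dictionary=dictionary, coherence="u_mass", topn=5)
print("u_mass coherence:", cm.get_coherence())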
Thanks @anotherbugmaster . Let me ping @mpenkov and @gojomo for a final review, especially of the new notebook -- are the motivation, evaluation and usage instructions convincing? Anything crucial missing, before we push this out "for real"? Thanks!
@anotherbugmaster can you open a new PR (starting from upstream
@anotherbugmaster There are some logging-related bugs. I suspect they are blocking the release of our conda feedstock. I'll highlight the problem lines in a separate review.
some bugs in logging calls
logger.info(
    "PROGRESS: pass %i, at document #%i/%i",
    pass_, chunk_idx * chunksize + chunk_len, lencorpus
)
lencorpus may be a float (inf) here, but the format string requires it to be an integer.
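One possible fix, sketched here rather than prescribing the actual change, is to format the total with %s and substitute a placeholder when the corpus length is unknown:

logger.info(
    # suggested fix (not necessarily the one applied): %s handles both
    # an integer count and the "?" placeholder for an infinite corpus
    "PROGRESS: pass %i, at document #%i/%s",
    pass_, chunk_idx * chunksize + chunk_len,
    lencorpus if lencorpus < np.inf else "?",
)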
logger.info(
    "running NMF training, %s topics, %i passes over the supplied corpus of %i documents, evaluating l2 norm "
    "every %i documents",
    self.num_topics, passes, lencorpus if lencorpus < np.inf else "?", evalafter,
)
The format string expects an integer, but you will end up passing a string if lencorpus is np.inf.
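Again, a sketch of one possible fix: change the %i for the document count to %s, so that both an integer and the "?" placeholder format cleanly:

logger.info(
    # suggested fix (not necessarily the one applied): use %s for the
    # document count, since the argument may be an int or the string "?"
    "running NMF training, %s topics, %i passes over the supplied corpus of %s documents, evaluating l2 norm "
    "every %i documents",
    self.num_topics, passes, lencorpus if lencorpus < np.inf else "?", evalafter,
)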
@anotherbugmaster ping. Do you plan to finish NMF?
Add cleanup and fixes on top of #2361: