
NMF metrics and wikipedia #2371

Merged (269 commits) into piskvorky:develop on Mar 20, 2019

Conversation

anotherbugmaster
Contributor

@anotherbugmaster anotherbugmaster commented Feb 4, 2019

Adds cleanup and fixes on top of #2361:

@piskvorky
Owner

piskvorky commented Mar 7, 2019

I can add a comment. If the W error diff equals zero, then it means the model already reconstructs the current batch as well as it can and isn't learning anything new.

I understand that, but what does this mean for the user? Subsample? Increase precision somewhere?

Surely "nothing is being trained" is not desirable training progress. It sounds like waste, a warning sign, rather than a best practice.

@anotherbugmaster
Contributor Author

What is your explanation for sklearn being ~5x faster?

Obviously, it doesn't need to learn iteratively, and on a dataset this small that's an advantage. Things work differently on Wikipedia: a large corpus is learned much faster with an iterative approach.

@anotherbugmaster
Contributor Author

I understand that, but what does this mean for the user? Subsample? Increase precision somewhere?

Surely "nothing is being trained" is not desirable training progress.

Oh, I see. Yes, the user can either subsample the training set or increase the precision of the model (the w_stop_iteration and h_stop_iteration parameters).
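For illustration, here is a minimal sketch of the subsampling option, assuming an in-memory bag-of-words corpus and its dictionary; the `corpus`, `dictionary`, and the Nmf keyword arguments below are assumptions for the example, not code from this PR:

```python
import random

from gensim.models.nmf import Nmf

# Sketch only: train on a random 10% subsample of an in-memory bag-of-words corpus.
# `corpus` (a list of BOW documents) and `dictionary` (the matching gensim Dictionary)
# are assumed to exist already.
random.seed(42)
docs = list(corpus)
subsample = random.sample(docs, k=max(1, len(docs) // 10))

nmf = Nmf(
    corpus=subsample,
    id2word=dictionary,
    num_topics=50,
    passes=5,
)
```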

@piskvorky
Owner

piskvorky commented Mar 7, 2019

Things work differently on Wikipedia: a large corpus is learned much faster with an iterative approach.

Yes, but see our other parallel thread about the model not really learning anything new. Is it really faster on a large corpus where it does learn something? Wouldn't a subsample learn similar or better topics in less time (and 5x faster with sklearn)?

The tutorial nicely shows how to do the training, but the case for why to use Gensim's NMF is not yet convincing.

@anotherbugmaster
Contributor Author

Hm, I see. I think the most obvious advantage over sklearn is the huge difference in RAM footprint, but I'll definitely need to check the subsampling case.

@mpenkov
Collaborator

mpenkov commented Mar 9, 2019

@piskvorky My understanding is that you'd like to quietly merge this branch without mentioning anything in the change log. Is that correct?

@anotherbugmaster Please let me know when this is ready to merge.

@piskvorky
Owner

piskvorky commented Mar 9, 2019

@mpenkov correct. Let's merge and release 3.7.2 ASAP (and finish the rest of NMF later).

without mentioning anything in the change log

We can mention these NMF fixes in the release log. It's not a channel many people actively follow, so it still counts as "silent" :) I'll hold off on real promotion until the tutorial is more compelling.

@anotherbugmaster
Contributor Author

@mpenkov Sure!

Ok then, I'll check the possible caveats more thoroughly and report the results next week.

@piskvorky
Owner

piskvorky commented Mar 19, 2019

@anotherbugmaster any progress on the SVD (LSI) stats, for comparison?

Plus the investigation into "NMF not learning anything": the effects of subsampling and precision on speed vs. memory, especially with regard to our recommended "best practices" and the comparison to sklearn. Cheers!

@mpenkov
Collaborator

mpenkov commented Mar 20, 2019

@anotherbugmaster I'm going to merge this so we can go ahead with the next release.

Please continue working on this PR and pushing your changes. We'll merge again in the near future.

@mpenkov mpenkov merged commit 9eb3933 into piskvorky:develop Mar 20, 2019
@anotherbugmaster
Contributor Author

@piskvorky, sorry, I forgot to run the experiment :(

I'll report something this weekend.

@piskvorky
Owner

piskvorky commented Mar 20, 2019

@anotherbugmaster please start a new branch from the develop branch, to have a clean PR. A previously squashed merge is showing as hundreds of new commits in this PR; let's avoid that in the new PR. Cheers.

@piskvorky
Owner

@anotherbugmaster what's the status here? I'd love to get NMF wrapped up nicely, so we can promote it.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Mar 28, 2019

@piskvorky Sorry, Radim, I haven't finished the research yet. I'll do it ASAP.

@piskvorky
Owner

piskvorky commented Apr 3, 2019

@anotherbugmaster how is it going? Let's get NMF finished while there's still momentum (or remove it).

@anotherbugmaster
Contributor Author

@piskvorky, I've changed the algorithm so that the model updates on every batch, even if the error is already low. The good news is that it didn't affect performance; here are the results:

[image: benchmark results]

I've also changed how the error is reported in the logs; it now looks like this:

[image: log output]

The lower the error, the better.

@piskvorky
Owner

piskvorky commented Apr 4, 2019

@anotherbugmaster What's your rationale for these changes? What were the training time and L2 norm before? (so we can easily compare against the latest 27:20 and 94.97)
Can you also give the reconstruction error for LSI on the same data, please, for anchoring.

In short, the same questions as before:

Would you mind adding LsiModel (SVD) and reporting the L2 error there? Just for comparison: LSI is another matrix decomposition, directly comparable to NMF (except it doesn't have the sparsity constraints).

How do you interpret the mostly-zero errors during training? What conclusions should users draw from that? Subsample their input? Increase precision? What is going on?

I'm not sure how forcing near-zero updates answers them.

Thanks!
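(For reference, one way such an LSI reconstruction error could be measured with gensim is sketched below. This is illustrative only, not the notebook's code; `corpus` and `dictionary` stand in for a small bag-of-words corpus and its Dictionary.)

```python
import numpy as np

from gensim import matutils
from gensim.models import LsiModel

# Sketch only: densify a small bag-of-words corpus, fit LSI, then measure how well
# projecting onto the topic space and back reconstructs it (Frobenius / L2 norm).
lsi = LsiModel(corpus, id2word=dictionary, num_topics=100)

dense = matutils.corpus2dense(corpus, num_terms=len(dictionary))  # terms x docs
u = lsi.projection.u                                              # terms x topics, orthonormal
approx = u @ (u.T @ dense)                                        # rank-k reconstruction
print("LSI L2 reconstruction error:", np.linalg.norm(dense - approx))
```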

@piskvorky
Owner

@anotherbugmaster ping -- can we finish this please?

@anotherbugmaster
Contributor Author

@piskvorky I'll run an LSI tomorrow.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Apr 12, 2019

@piskvorky Ok, here's the Wikipedia comparison:

[image: Wikipedia comparison results]

It seems training LSI takes about the same amount of time as LDA. The L2 norm is a bit better, though coherence is not.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Apr 12, 2019

I'm not sure how forcing near-zero updates answers them.

The model updates on every batch now (it skipped a lot of them before, which wasn't correct behavior), so I don't see any problem with the update rate anymore.

@piskvorky
Owner

piskvorky commented Apr 19, 2019

Thanks @anotherbugmaster . Let me ping @mpenkov and @gojomo for a final review, especially of the new notebook -- are the motivation, evaluation and usage instructions convincing? Anything crucial missing, before we push this out "for real"? Thanks!

@piskvorky
Owner

@anotherbugmaster can you open a new PR (starting from upstream develop), with your recent changes to the update process + the LSI comparison? Cheers.

@mpenkov
Collaborator

mpenkov commented Apr 23, 2019

@anotherbugmaster There are some logging-related bugs. I suspect they are blocking the release of our conda feedstock. I'll highlight the problem lines in a separate review.

@mpenkov mpenkov left a comment

some bugs in logging calls


```python
logger.info(
    "PROGRESS: pass %i, at document #%i/%i",
    pass_, chunk_idx * chunksize + chunk_len, lencorpus
)
```

lencorpus may be a float (inf) here, but the format string requires it to be an integer.
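A possible fix, sketched here rather than taken from the PR's actual patch (all names come from the surrounding gensim module): switch the total-count specifier to %s and substitute a placeholder when the corpus size is unknown, since "%i" % float("inf") raises an OverflowError at formatting time.

```python
# Sketch of a fix (not the PR's actual patch): %s accepts both an int and the "?" fallback.
logger.info(
    "PROGRESS: pass %i, at document #%i/%s",
    pass_, chunk_idx * chunksize + chunk_len,
    lencorpus if lencorpus < np.inf else "?",
)
```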

```python
logger.info(
    "running NMF training, %s topics, %i passes over the supplied corpus of %i documents, evaluating l2 norm "
    "every %i documents",
    self.num_topics, passes, lencorpus if lencorpus < np.inf else "?", evalafter,
)
```

The format string expects an integer, but you will end up passing a string if lencorpus is np.inf.
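The same style of fix could apply here, again as a sketch rather than the PR's actual patch: use %s for the corpus size so that both an integer and the "?" fallback format cleanly.

```python
# Sketch of a fix (not the PR's actual patch): %s handles both int and "?".
logger.info(
    "running NMF training, %s topics, %i passes over the supplied corpus of %s documents, "
    "evaluating l2 norm every %i documents",
    self.num_topics, passes,
    lencorpus if lencorpus < np.inf else "?",
    evalafter,
)
```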

@piskvorky
Owner

piskvorky commented May 1, 2019

@anotherbugmaster can you open a new PR (starting from upstream develop), with your recent changes to the update process + the LSI comparison? Cheers.

@anotherbugmaster ping. Do you plan to finish NMF?
