
NMF metrics and wikipedia #2371

Merged (269 commits) into piskvorky:develop on Mar 20, 2019

Conversation

anotherbugmaster
Contributor

@anotherbugmaster anotherbugmaster commented Feb 4, 2019

Adds cleanup and fixes on top of #2361:

@piskvorky
Owner

piskvorky commented Mar 7, 2019

I can add a comment. If the W error diff equals zero, then it means the model already reconstructs the current batch as well as it can and isn't learning anything new.

I understand that, but what does this mean for the user? Subsample? Increase precision somewhere?

Surely "nothing is being trained" is not desirable training progress. It sounds like waste, a warning sign, rather than a best practice.

@anotherbugmaster
Contributor Author

What is your explanation for sklearn being ~5x faster?

Obviously, it doesn't need to learn iteratively, and on a dataset this small that's an advantage. Things work differently on Wikipedia: a large corpus is learned much faster with an iterative approach.

@anotherbugmaster
Contributor Author

I understand that, but what does this mean for the user? Subsample? Increase precision somewhere?

Surely "nothing is being trained" is not desirable training progress.

Oh, I see. Yes, the user can either subsample the training set or increase the precision of the model (the w_stop_iteration and h_stop_iteration parameters).
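For illustration, here is a minimal sketch of the subsampling option, assuming an in-memory bag-of-words corpus and its dictionary; the `corpus`, `dictionary`, and the Nmf keyword arguments below are assumptions for the example, not code from this PR:

```python
import random

from gensim.models.nmf import Nmf

# Sketch only: train on a random 10% subsample of an in-memory bag-of-words corpus.
# `corpus` (a list of BOW documents) and `dictionary` (the matching gensim Dictionary)
# are assumed to exist already.
random.seed(42)
docs = list(corpus)
subsample = random.sample(docs, k=max(1, len(docs) // 10))

nmf = Nmf(
    corpus=subsample,
    id2word=dictionary,
    num_topics=50,
    passes=5,
)
```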

@piskvorky
Owner

piskvorky commented Mar 7, 2019

Things work differently on Wikipedia: a large corpus is learned much faster with an iterative approach.

Yes, but see our other parallel thread about the model not really learning anything new. Is it really faster on a large corpus where it does learn something? Wouldn't a subsample learn similar or better topics in less time (and 5x faster with sklearn)?

The tutorial nicely shows how to do the training, but the case for why to use Gensim's NMF is not yet convincing.

@anotherbugmaster
Contributor Author

Hm, I see. I think the most obvious advantage over sklearn is the huge difference in RAM footprint, but I'll definitely need to check the subsampling case.

@mpenkov
Collaborator

mpenkov commented Mar 9, 2019

@piskvorky My understanding is that you'd like to quietly merge this branch without mentioning anything in the change log. Is that correct?

@anotherbugmaster Please let me know when this is ready to merge.

@piskvorky
Owner

piskvorky commented Mar 9, 2019

@mpenkov correct. Let's merge and release 3.7.2 ASAP (and finish the rest of NMF later).

without mentioning anything in the change log

We can mention these NMF fixes in the release log. It's not a channel many people actively follow, so it still counts as "silent" :) I'll hold off on real promotion until the tutorial is more compelling.

@anotherbugmaster
Contributor Author

@mpenkov Sure!

Ok then, I'll check the possible caveats more thoroughly and report the results next week.

@piskvorky
Owner

piskvorky commented Mar 19, 2019

@anotherbugmaster any progress on the SVD (LSI) stats, for comparison?

Plus the investigation into "NMF not learning anything": the effects of subsampling and precision on speed vs. memory, especially with regard to our recommended "best practices" and the comparison to sklearn. Cheers!

@mpenkov
Collaborator

mpenkov commented Mar 20, 2019

@anotherbugmaster I'm going to merge this so we can go ahead with the next release.

Please continue working on this PR and pushing your changes. We'll merge again in the near future.

@mpenkov mpenkov merged commit 9eb3933 into piskvorky:develop Mar 20, 2019
@anotherbugmaster
Contributor Author

@piskvorky, sorry, I forgot to run the experiment :(

I'll report something this weekend.

@piskvorky
Owner

piskvorky commented Mar 20, 2019

@anotherbugmaster please start a new branch from the develop branch, to have a clean PR. A previously squashed merge is showing as hundreds of new commits in this PR; let's avoid that in the new PR. Cheers.

@piskvorky
Owner

@anotherbugmaster what's the status here? I'd love to get NMF wrapped up nicely, so we can promote it.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Mar 28, 2019

@piskvorky Sorry, Radim, I haven't finished the research yet. I'll do it ASAP.

@piskvorky
Owner

piskvorky commented Apr 3, 2019

@anotherbugmaster how is it going? Let's get NMF finished while there's still momentum (or remove it).

@anotherbugmaster
Contributor Author

@piskvorky, I've changed the algorithm so that the model updates on every batch, even if the error is already low. The good news is that it didn't affect performance; here are the results:

[image: benchmark results]

I've also changed how the error is reported in the logs; it now looks like this:

[image: log output]

The lower the error, the better.

@piskvorky
Owner

piskvorky commented Apr 4, 2019

@anotherbugmaster What's your rationale for these changes? What were the training time and L2 norm before? (so we can easily compare against the latest 27:20 and 94.97)
Can you also give the reconstruction error for LSI on the same data, please, for anchoring.

In short, the same questions as before:

Would you mind adding LsiModel (SVD) and reporting the L2 error there? Just for comparison: LSI is another matrix decomposition, directly comparable to NMF (except it doesn't have the sparsity constraints).

How do you interpret the mostly-zero errors during training? What conclusions should users draw from that? Subsample their input? Increase precision? What is going on?

I'm not sure how forcing near-zero updates answers them.

Thanks!
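(For reference, one way such an LSI reconstruction error could be measured with gensim is sketched below. This is illustrative only, not the notebook's code; `corpus` and `dictionary` stand in for a small bag-of-words corpus and its Dictionary.)

```python
import numpy as np

from gensim import matutils
from gensim.models import LsiModel

# Sketch only: densify a small bag-of-words corpus, fit LSI, then measure how well
# projecting onto the topic space and back reconstructs it (Frobenius / L2 norm).
lsi = LsiModel(corpus, id2word=dictionary, num_topics=100)

dense = matutils.corpus2dense(corpus, num_terms=len(dictionary))  # terms x docs
u = lsi.projection.u                                              # terms x topics, orthonormal
approx = u @ (u.T @ dense)                                        # rank-k reconstruction
print("LSI L2 reconstruction error:", np.linalg.norm(dense - approx))
```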

@piskvorky
Owner

@anotherbugmaster ping -- can we finish this please?

@anotherbugmaster
Contributor Author

@piskvorky I'll run an LSI tomorrow.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Apr 12, 2019

@piskvorky Ok, here's the Wikipedia comparison:

[image: Wikipedia comparison results]

It seems training LSI takes about the same amount of time as LDA. The L2 norm is a bit better, though coherence is not.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Apr 12, 2019

I'm not sure how forcing near-zero updates answers them.

The model updates on every batch now (it skipped a lot of them before, which wasn't correct behavior), so I don't see any problem with the update rate anymore.

@piskvorky
Owner

piskvorky commented Apr 19, 2019

Thanks @anotherbugmaster . Let me ping @mpenkov and @gojomo for a final review, especially of the new notebook -- are the motivation, evaluation and usage instructions convincing? Anything crucial missing, before we push this out "for real"? Thanks!

@piskvorky
Owner

@anotherbugmaster can you open a new PR (starting from upstream develop), with your recent changes to the update process + the LSI comparison? Cheers.

@mpenkov
Collaborator

mpenkov commented Apr 23, 2019

@anotherbugmaster There are some logging-related bugs. I suspect they are blocking the release of our conda feedstock. I'll highlight the problem lines in a separate review.

@mpenkov mpenkov left a comment

some bugs in logging calls


```python
logger.info(
    "PROGRESS: pass %i, at document #%i/%i",
    pass_, chunk_idx * chunksize + chunk_len, lencorpus
)
```

lencorpus may be a float (inf) here, but the format string requires it to be an integer.
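A possible fix, sketched here rather than taken from the PR's actual patch (all names come from the surrounding gensim module): switch the total-count specifier to %s and substitute a placeholder when the corpus size is unknown, since "%i" % float("inf") raises an OverflowError at formatting time.

```python
# Sketch of a fix (not the PR's actual patch): %s accepts both an int and the "?" fallback.
logger.info(
    "PROGRESS: pass %i, at document #%i/%s",
    pass_, chunk_idx * chunksize + chunk_len,
    lencorpus if lencorpus < np.inf else "?",
)
```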

```python
logger.info(
    "running NMF training, %s topics, %i passes over the supplied corpus of %i documents, evaluating l2 norm "
    "every %i documents",
    self.num_topics, passes, lencorpus if lencorpus < np.inf else "?", evalafter,
)
```

The format string expects an integer, but you will end up passing a string if lencorpus is np.inf.
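The same style of fix could apply here, again as a sketch rather than the PR's actual patch: use %s for the corpus size so that both an integer and the "?" fallback format cleanly.

```python
# Sketch of a fix (not the PR's actual patch): %s handles both int and "?".
logger.info(
    "running NMF training, %s topics, %i passes over the supplied corpus of %s documents, "
    "evaluating l2 norm every %i documents",
    self.num_topics, passes,
    lencorpus if lencorpus < np.inf else "?",
    evalafter,
)
```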

@piskvorky
Owner

piskvorky commented May 1, 2019

@anotherbugmaster can you open a new PR (starting from upstream develop), with your recent changes to the update process + the LSI comparison? Cheers.

@anotherbugmaster ping. Do you plan to finish NMF?
