Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NMF optimization & documentation #2361

Merged
merged 201 commits into from
Jan 31, 2019

Conversation

anotherbugmaster
Copy link
Contributor

@anotherbugmaster anotherbugmaster commented Jan 29, 2019

Massive performance improvements and better docs. Continues from PR #2007.

Takes less memory and 4-5 times faster now. Also metrics such as perplexity works as expected.

TODO:

  • Fix use_r functionality (invalid, dramatically slows down training, while not improving quality too much)
  • Make the load method work with the old implementation (invalid)
  • Revamp wikipedia benchmark and add sklearn implementation
  • Revamp corpus parameter docs, make it easier to grasp
  • Document or deprecate use_r parameter
  • More info in the tutorial notebook
  • Fix tests
  • Add useful parameters ranges
  • Document minimum_probability
  • Remove unused parameters from the tutorial notebook
  • Remove unused parameters from wiki notebook
  • Remove r from eval metrics
  • Change eval metric to relative error
  • Add RAM estimation

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions for language + code style. Looks much better overall 👍

gensim/models/nmf.py Outdated Show resolved Hide resolved
gensim/models/nmf.py Show resolved Hide resolved
gensim/models/nmf.py Outdated Show resolved Hide resolved
gensim/models/nmf.py Outdated Show resolved Hide resolved
gensim/models/nmf.py Outdated Show resolved Hide resolved
gensim/models/nmf.py Outdated Show resolved Hide resolved
first_doc = matutils.corpus2csc([first_doc], len(self.id2word))
self._h = None

if isinstance(corpus, scipy.sparse.csc.csc_matrix):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this some special undocumented case? Deserves a comment.

Copy link
Contributor Author

@anotherbugmaster anotherbugmaster Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corpus can be a csc matrix now, I'll add it to the docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see no comment about this code path. Why does this method accept two different input types? Is it some optimization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I added that to the __init__ docstring, but forgot to mention csc in the update. I'll fix it.

first_doc = corpus.getcol(0)
else:
first_doc_it = itertools.tee(corpus, 1)
first_doc = next(first_doc_it[0])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will still advance the original generator (lose the first document).

Isn't there a mismatch between expected corpus docs and actually seen corpus docs during iteration, after _setup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's something that I checked, itertools.tee copies generator and leaves the original one intact.

Copy link
Owner

@piskvorky piskvorky Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# python 3.6
corpus = (i for i in range(10))
first_doc_it = itertools.tee(corpus, 1)
first_doc = next(first_doc_it[0])
print("first doc:", first_doc)
for doc_no, doc in enumerate(corpus):
    print(f"doc #{doc_no}: {doc}")

# outputs:

first doc: 0
doc #0: 1
doc #1: 2
doc #2: 3
doc #3: 4
doc #4: 5
doc #5: 6
doc #6: 7
doc #7: 8
doc #8: 9

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, you're right, it works for iterators, but not for generators. I'll fix that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does NMF accept generators? Or does it require a "restartable" sequence in corpus?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add those corner cases into unit tests, too? To prevent regressions. (generator input, empty docs, corpus shape not divisible by chunksize, etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NMF loses the first document of a non-restartable generator for now, though non-restartable sequence should work fine with passes=1.

I'll fix that and add unit tests.

gensim/models/nmf.py Show resolved Hide resolved
gensim/models/nmf.py Outdated Show resolved Hide resolved
@piskvorky
Copy link
Owner

@anotherbugmaster regarding the notebook:

  1. Does trainset / testset need to be a np.array, instead of list? Storing objects (dicts) as numpy array is inefficient and looks strange. If really necessary, deserves a comment "why".

  2. Can we introduce the notebook cells with a bit of context / motivation? For example, under the section Coherence, mention why this section is here in the notebook + link to an explanation what coherence is?

  3. The density definition as def density(matrix): return (matrix > 0).mean() seems non-standard and deserves a comment (why density as "mean of above-zero elements"?)

@anotherbugmaster
Copy link
Contributor Author

anotherbugmaster commented Jan 30, 2019

1. Does `trainset` / `testset` need to be a `np.array`, instead of `list`? Storing objects (`dict`s) as numpy array is inefficient and looks strange. If really necessary, deserves a comment "why".

Yes, this is not efficient, but I wanted to fix the random seed during permutation while not messing with the global random seed.

2. Can we introduce the notebook cells with a bit of context / motivation? For example, under the section **Coherence**, mention why this section is here in the notebook + link to an explanation what coherence is?

Sure, I'll explain it in more details.

3. The density definition as `def density(matrix):    return (matrix > 0).mean()` seems non-standard and deserves a comment (why density as "mean of above-zero elements"?)

The (matrix > 0) is a bool matrix, mean function sums up True's for every cell in which condition stands, than divides it by the number of all elements in the matrix. Thus, it's an ordinary density, nothing fancy. :)

@menshikh-iv
Copy link
Contributor

@anotherbugmaster awesome 🔥

Release can't more wait, so, I merge PR right now.
Please don't forget to fix unchecked notes from first post.

@menshikh-iv menshikh-iv merged commit 366d8ae into piskvorky:develop Jan 31, 2019
v, self._W, r=self._r, h=self._h, v_max=self.v_max
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
grouper = (
corpus[:, col_idx:col_idx + self.chunksize]
Copy link
Owner

@piskvorky piskvorky Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do? Takes a chunk of matrix columns (features) instead of rows (documents)? And then shuffles the features??

Needs a strong comment, unusual process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corpus has shape (n_tokens, n_documents) in this case, so the grouper splits the input corpus in chunks by column (documents dimension) and then shuffles columns of every chunk (shuffles documents order in the chunk).

Copy link
Owner

@piskvorky piskvorky Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! Are you sure documents are columns? Where does this order originate from? (atypical, definitely needs a comment / docstring)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's unusual, but that's the format that's used in the original paper. I'll specify the shape of the corpus in the module docstring.

v, self._W, r=self._r, h=self._h, v_max=self.v_max
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
grouper = (
corpus[:, col_idx:col_idx + self.chunksize]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will raise an exception if corpus.shape[1] is not divisible by self.chunksize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

In [10]: foo = sparse.random(2, 10, density=0.5, format='csc')

In [11]: foo[:, 5:100500].toarray()
Out[11]: 
array([[0.98888343, 0.        , 0.        , 0.24760066, 0.        ],
       [0.24857359, 0.        , 0.        , 0.27890554, 0.16412464]])

The corpus[:, col_idx:col_idx + self.chunksize] is equivalent to corpus[:, col_idx:corpus.shape[1]] on the last iteration

Copy link
Owner

@piskvorky piskvorky Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be due to some changes in scipy, but your example results in

IndexError: index out of bounds: 0 <= 5 <= 10, 0 <= 100500 <= 10, 5 <= 100500

in scipy 0.19.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Seems like they added the support for that in the following versions, I'll do something like corpus[:, col_idx:min(col_idx + self.chunksize, chunk.shape[1])]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Plus add a comment, so someone doesn't accidentally remove it again in the future.

We may simplify it again once the older scipy versions become irrelevant in the future.

@piskvorky
Copy link
Owner

@anotherbugmaster let me know about the performance (RAM, speed, quality) in our target use-case of sparse input (text, not dense faces), especially in comparison to sklearn or other tools. Then we can publish this. Super cool feature!

@anotherbugmaster
Copy link
Contributor Author

@piskvorky Ok! I'll push the notebook with metrics and additional annotations later this weekend.

@anotherbugmaster
Copy link
Contributor Author

Here are the current wikipedia results:

image

@piskvorky
Copy link
Owner

piskvorky commented Feb 1, 2019

So, 0.17 GB vs 16 GB RAM compared to sklearn… nice! And 2x as fast 🐎

But the L2 norm is worrying, that's a big difference. Is that expected? Especially given the lower perplexity (?).

Can you post the resulting topics too? For visual, manual comparison.

Reordering the table to show model first, then train time, then RAM, then the quality metrics should make it easier to read and interpret. Thanks!

@anotherbugmaster
Copy link
Contributor Author

anotherbugmaster commented Feb 1, 2019

So, 0.17 GB vs 16 GB RAM compared to sklearn… nice! And 2x as fast racehorse

But the L2 norm is worrying, that's a big difference. Is that expected? Especially given the lower perplexity (?).

Checked the metrics functions again and found a bug in evaluation of l2 (I normalized the input corpus for the sklearn, but not for the NMF and the LDA, hence the invalid metrics). I've fixed it, re-running wikipedia notebook now. I expect changes only in the l2 norm of all models and in the perplexity of sklearn NMF.

Here are updated metrics for the tutorial:

image

Can you post the resulting topics too? For visual, manual comparison.

Sure, but I'm not sure how to present them. I think I'll take top-5 topics from each model to fit them in one scroll.

Reordering the table to show model first, then train time, then RAM, then the quality metrics should make it easier to read and interpret. Thanks!

Ok!

@anotherbugmaster
Copy link
Contributor Author

anotherbugmaster commented Feb 2, 2019

@piskvorky, here are the updated metrics on wikipedia:

image

So, 2x faster, 100x less memory, still better on l2 and perplexity.

@piskvorky
Copy link
Owner

piskvorky commented Feb 2, 2019

Awesome. Please update the notebook (and reorder the table columns) and the parameter ranges, and I'll "officially" announce our new model.

Great work @anotherbugmaster , this came together very nicely in the end.

@piskvorky
Copy link
Owner

@anotherbugmaster since this PR is already merged, can you push these updated metrics / tables / notebook in a new PR?

@anotherbugmaster anotherbugmaster mentioned this pull request Feb 4, 2019
15 tasks
@anotherbugmaster
Copy link
Contributor Author

@piskvorky Sure, here it is

@@ -1,21 +1,114 @@
"""Online Non-Negative Matrix Factorization."""
"""`Online Non-Negative Matrix Factorization. <https://arxiv.org/abs/1604.02634>`
Copy link
Owner

@piskvorky piskvorky Feb 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formatting is broken, see https://radimrehurek.com/gensim/models/nmf.html. RST hyperlinks must end in _.

How about this instead:

Online Non-Negative Matrix Factorization.
This implementation uses the efficient sparse incremental algorithm of Renbo Zhao, Vincent Y. F. Tan et al. `[PDF] <https://arxiv.org/abs/1604.02634>`_.

Rendered version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants