NMF optimization & documentation #2361
Conversation
1. Improved performance ~4x
2. LDA-like API
3. BOW compatibility
Minor suggestions for language + code style. Looks much better overall 👍
```python
first_doc = matutils.corpus2csc([first_doc], len(self.id2word))
self._h = None
```

```python
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
```
Is this some special undocumented case? Deserves a comment.
Corpus can be a csc matrix now; I'll add it to the docstring.
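For illustration, the two accepted input types can be told apart with a plain `isinstance` check; the helper name below is hypothetical, not part of the PR:

```python
import scipy.sparse

def infer_corpus_format(corpus):
    # Hypothetical helper, not part of the PR: update() accepts either a
    # pre-vectorized scipy CSC matrix of shape (n_tokens, n_documents),
    # or an iterable of BOW documents (lists of (token_id, count) pairs).
    if isinstance(corpus, scipy.sparse.csc_matrix):
        return 'csc'
    return 'bow'

csc_corpus = scipy.sparse.random(3, 2, density=0.5, format='csc')
bow_corpus = [[(0, 1), (2, 3)], [(1, 2)]]
print(infer_corpus_format(csc_corpus))  # csc
print(infer_corpus_format(bow_corpus))  # bow
```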
Fixed.
I still see no comment about this code path. Why does this method accept two different input types? Is it some optimization?
Oops, I added that to the `__init__` docstring, but forgot to mention csc in `update`. I'll fix it.
```python
    first_doc = corpus.getcol(0)
else:
    first_doc_it = itertools.tee(corpus, 1)
    first_doc = next(first_doc_it[0])
```
I think this will still advance the original generator (lose the first document). Isn't there a mismatch between expected corpus docs and actually seen corpus docs during iteration, after `_setup`?
No, that's something that I checked: `itertools.tee` copies the generator and leaves the original one intact.
```python
# python 3.6
corpus = (i for i in range(10))
first_doc_it = itertools.tee(corpus, 1)
first_doc = next(first_doc_it[0])
print("first doc:", first_doc)

for doc_no, doc in enumerate(corpus):
    print(f"doc #{doc_no}: {doc}")

# outputs:
# first doc: 0
# doc #0: 1
# doc #1: 2
# doc #2: 3
# doc #3: 4
# doc #4: 5
# doc #5: 6
# doc #6: 7
# doc #7: 8
# doc #8: 9
```
Aha, you're right, it works for iterators, but not for generators. I'll fix that.
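A minimal sketch of such a fix (illustrative names; it assumes the implementation keeps one `tee` child as the working corpus):

```python
import itertools

corpus = (i for i in range(5))  # a generator, consumable only once

# itertools.tee returns independent child iterators; peeking at the
# first document through one child does not advance the other child,
# so the model should keep iterating the second child afterwards,
# NOT the original (now shared) generator.
peek_it, corpus = itertools.tee(corpus)
first_doc = next(peek_it)
docs = list(corpus)

print("first doc:", first_doc)  # first doc: 0
print("all docs:", docs)        # all docs: [0, 1, 2, 3, 4]
```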
Does NMF accept generators? Or does it require a "restartable" sequence in `corpus`?
Can you add those corner cases into unit tests too, to prevent regressions? (generator input, empty docs, corpus shape not divisible by chunksize, etc.)
NMF loses the first document of a non-restartable generator for now, though a non-restartable sequence should work fine with `passes=1`. I'll fix that and add unit tests.
@anotherbugmaster regarding the notebook:
Yes, this is not efficient, but I wanted to fix the random seed during permutation while not messing with the global random seed.
Sure, I'll explain it in more detail.
The `(matrix > 0)` is a bool matrix; the `mean` function sums up the `True` values and divides by the total number of elements, giving the fraction of non-zero entries.
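A quick numpy check of that explanation:

```python
import numpy as np

matrix = np.array([[0.0, 0.5],
                   [0.0, 2.0]])
# (matrix > 0) is a boolean matrix; .mean() treats True as 1 and
# False as 0, so the result is the fraction of positive entries.
density = (matrix > 0).mean()
print(density)  # 0.5
```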
@anotherbugmaster awesome 🔥 The release can't wait any more, so I'm merging the PR right now.
```python
v, self._W, r=self._r, h=self._h, v_max=self.v_max
```

```python
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
    grouper = (
        corpus[:, col_idx:col_idx + self.chunksize]
```
What does this do? Takes a chunk of matrix columns (features) instead of rows (documents)? And then shuffles the features??
Needs a strong comment, unusual process.
The corpus has shape `(n_tokens, n_documents)` in this case, so the grouper splits the input corpus into chunks by column (the documents dimension) and then shuffles the columns of every chunk (i.e. shuffles the document order within each chunk).
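That chunk-and-shuffle behaviour can be sketched standalone (illustrative names; assumes a recent scipy, where a slice end past the matrix bounds is clamped):

```python
import numpy as np
import scipy.sparse

# Corpus matrix: documents are COLUMNS, shape (n_tokens, n_documents).
corpus = scipy.sparse.random(5, 10, density=0.5, format='csc', random_state=0)
chunksize = 4
rng = np.random.RandomState(42)

chunks = []
for col_idx in range(0, corpus.shape[1], chunksize):
    # Slice a chunk of up to `chunksize` documents (columns)...
    chunk = corpus[:, col_idx:col_idx + chunksize]
    # ...then shuffle the document order WITHIN the chunk.
    chunk = chunk[:, rng.permutation(chunk.shape[1])]
    chunks.append(chunk)

print([c.shape for c in chunks])  # [(5, 4), (5, 4), (5, 2)]
```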
Aha! Are you sure documents are columns? Where does this order originate from? (atypical, definitely needs a comment / docstring)
Yes, it's unusual, but that's the format that's used in the original paper. I'll specify the shape of the corpus in the module docstring.
```python
v, self._W, r=self._r, h=self._h, v_max=self.v_max
```

```python
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
    grouper = (
        corpus[:, col_idx:col_idx + self.chunksize]
```
This will raise an exception if `corpus.shape[1]` is not divisible by `self.chunksize`.
Why?

```python
In [10]: foo = sparse.random(2, 10, density=0.5, format='csc')

In [11]: foo[:, 5:100500].toarray()
Out[11]:
array([[0.98888343, 0.        , 0.        , 0.24760066, 0.        ],
       [0.24857359, 0.        , 0.        , 0.27890554, 0.16412464]])
```

`corpus[:, col_idx:col_idx + self.chunksize]` is equivalent to `corpus[:, col_idx:corpus.shape[1]]` on the last iteration.
This may be due to some changes in scipy, but your example results in

```
IndexError: index out of bounds: 0 <= 5 <= 10, 0 <= 100500 <= 10, 5 <= 100500
```

in scipy 0.19.0.
Oh, I see. Seems like they added support for that in later versions. I'll do something like `corpus[:, col_idx:min(col_idx + self.chunksize, corpus.shape[1])]`
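A minimal standalone demonstration of the clamped slice (variable names are illustrative):

```python
import scipy.sparse

corpus = scipy.sparse.random(2, 10, density=0.5, format='csc')
chunksize = 4
col_idx = 8  # last chunk: only 2 columns remain

# Clamp the end index explicitly; older scipy versions (e.g. 0.19)
# raise IndexError for slice ends past the matrix bounds, while newer
# ones silently clamp. The min() keeps both behaviours identical.
end = min(col_idx + chunksize, corpus.shape[1])
chunk = corpus[:, col_idx:end]
print(chunk.shape)  # (2, 2)
```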
Thanks. Plus add a comment, so someone doesn't accidentally remove it again in the future.
We may simplify it again once the older scipy versions become irrelevant in the future.
@anotherbugmaster let me know about the performance (RAM, speed, quality) in our target use-case of sparse input (text, not dense faces), especially in comparison to sklearn or other tools. Then we can publish this. Super cool feature!
@piskvorky Ok! I'll push the notebook with metrics and additional annotations later this weekend.
So, 0.17 GB vs 16 GB RAM compared to sklearn… nice! And 2x as fast 🐎 But the L2 norm is worrying, that's a big difference. Is that expected? Especially given the lower perplexity (?). Can you post the resulting topics too? For visual, manual comparison. Reordering the table to show model first, then train time, then RAM, then the quality metrics should make it easier to read and interpret. Thanks!
Checked the metrics functions again and found a bug in the evaluation of l2 (I normalized the input corpus for sklearn, but not for the NMF and the LDA, hence the invalid metrics). I've fixed it and am re-running the Wikipedia notebook now. I expect changes only in the l2 norm of all models and in the perplexity of sklearn NMF. Here are the updated metrics for the tutorial:
Sure, but I'm not sure how to present them. I think I'll take the top 5 topics from each model to fit them in one scroll.
Ok!
@piskvorky, here are the updated metrics on Wikipedia: So, 2x faster, 100x less memory, still better on l2 and perplexity.
Awesome. Please update the notebook (and reorder the table columns) and the parameter ranges, and I'll "officially" announce our new model. Great work @anotherbugmaster, this came together very nicely in the end.
@anotherbugmaster since this PR is already merged, can you push these updated metrics / tables / notebook in a new PR?
@piskvorky Sure, here it is.
```diff
@@ -1,21 +1,114 @@
-"""Online Non-Negative Matrix Factorization."""
+"""`Online Non-Negative Matrix Factorization. <https://arxiv.org/abs/1604.02634>`
```
This formatting is broken, see https://radimrehurek.com/gensim/models/nmf.html. RST hyperlinks must end in `_`.

How about this instead:

```rst
Online Non-Negative Matrix Factorization.

This implementation uses the efficient sparse incremental algorithm of Renbo Zhao, Vincent Y. F. Tan et al. `[PDF] <https://arxiv.org/abs/1604.02634>`_.
```
Massive performance improvements and better docs. Continues from PR #2007.
Takes less memory and is 4-5 times faster now. Metrics such as perplexity also work as expected.

TODO:

- `use_r` functionality (invalid: dramatically slows down training, while not improving quality too much)
- Make the `load` method work with the old implementation (invalid)
- `corpus` parameter docs, make it easier to grasp
- `use_r` parameter