Online NMF #2007
Merged
Changes from 4 commits
Commits
161 commits
343e46f
Implement first version of the algorithm
anotherbugmaster 3171be3
Fix variable names
anotherbugmaster bd325bc
Add support for streaming corpora
anotherbugmaster 19b3ba4
Add benchmark
anotherbugmaster 9e52399
Fix bugs, introduce batches, add images to the benchmark notebook
anotherbugmaster c54fc92
Update notebook
anotherbugmaster 6dc9d3e
Improve model
anotherbugmaster 0554b7b
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster 5f4b3d3
Add show topics, change API
anotherbugmaster 52fc956
Add more LDA-like API
anotherbugmaster ddebcf0
Fix logger name
anotherbugmaster 6d0a1b3
Add more LDA API
anotherbugmaster cf430fc
Remove redundant method
anotherbugmaster df5a6e9
Remove commented out lines
anotherbugmaster 25080b4
Fix flakes
anotherbugmaster 83b1a6b
Cythonize
anotherbugmaster 7f27f52
Dramatically improve performance
anotherbugmaster 405e12f
Add parameters, improve accuracy and speed
anotherbugmaster 7b45b23
Remove redundant W copying
anotherbugmaster a154a6e
Fix random seed again
anotherbugmaster e82628d
Optimize E/M step
anotherbugmaster 1ca33f8
Add an eval_every option, use softmax for normalization
anotherbugmaster f19e6ce
Fixes
anotherbugmaster 583cb15
Improve notebook examples a bit
anotherbugmaster fe0ab0a
Fix eval_every
anotherbugmaster 8e647a1
Return outliers
anotherbugmaster 89cc803
Optimizations
anotherbugmaster bbd3099
Experimenting with loss
anotherbugmaster f71ad89
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster 936e629
Fix PEP8
anotherbugmaster 1c3a064
Return nmf import
anotherbugmaster ce4b7ee
Revert "Return nmf import"
anotherbugmaster f8de1d9
Fix
anotherbugmaster df9b8c7
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster d159779
Fix minimum_probability & info -> debug logs
anotherbugmaster 3dcdedc
Compute metrics
anotherbugmaster f11f2e2
Count error on-the-fly
anotherbugmaster 8216541
Speed optimizations, changed error functions
anotherbugmaster ee3a7c7
Beat LDA
anotherbugmaster a3315f2
Outperform sklearn in speed (WTF)
anotherbugmaster 3a03ff9
Remove redundant arg
anotherbugmaster 70619e1
Add Olivietti faces
anotherbugmaster 8c47ce0
Remove redundant code
anotherbugmaster e291664
Add Topics
anotherbugmaster 3302b92
Make it pretty
anotherbugmaster 5616bd6
Fix wrapper
anotherbugmaster ed8f29f
Save corpus & dict, minor fixes
anotherbugmaster 2117c90
Add RandomCorpus
anotherbugmaster 950115d
Dense -> sparse
anotherbugmaster 54993c6
First doc2dense
anotherbugmaster 572dc6c
Fix csc again
anotherbugmaster d40d89f
Fix len
anotherbugmaster 7a3ef47
Experimenting
anotherbugmaster f94de09
Revert "Experimenting"
anotherbugmaster 9ed2167
Fix evaluation
anotherbugmaster ad9443f
Sparse speedup
anotherbugmaster 1a04660
Improve performance
anotherbugmaster 87981bf
Divide A and B again
anotherbugmaster 0b314c7
Fix A and B computation bug
anotherbugmaster b024dd6
Sparsify W init
anotherbugmaster 35d5406
Experimenting
anotherbugmaster 74acb37
New norm
anotherbugmaster 8b28675
Sparse threshold -> sparse coefficient
anotherbugmaster 588ef6a
Optimize residuals computation
anotherbugmaster 8f84758
Fix residuals bug
anotherbugmaster 8a67c44
W speedup
anotherbugmaster 560f2bf
Experiment
anotherbugmaster cac2590
Revert changes a bit
anotherbugmaster 060ab28
Fix corpus
anotherbugmaster cde937f
Fix init error|
anotherbugmaster 66b753f
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster 18dbb6b
Resolve conflict
anotherbugmaster 4b49d26
Fix corpus iteration issue
anotherbugmaster 9c6cbc6
Switch to numpy algos
anotherbugmaster b23d016
Merge upstream
anotherbugmaster 74ba37d
Train on wikipedia
anotherbugmaster c943264
Sparse coef -> density. More stable way to sparsify W matrix
anotherbugmaster a489807
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster a95e345
Return old sparse algo
anotherbugmaster 0f90484
Max
anotherbugmaster 6ae43e4
Optimizations
anotherbugmaster 335170b
Fix A and B computation
anotherbugmaster 4cc8f1b
Fix A and B normalization
anotherbugmaster 5c6fe60
Add random_state
anotherbugmaster dd459a2
Infer id2word
anotherbugmaster 5121d85
Fix tests
anotherbugmaster 5f4018a
Document __init__
anotherbugmaster dbd8474
Document whole nmf
anotherbugmaster 5904f10
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster cd4b9b0
Remove unnecessary comments
anotherbugmaster 53a02a9
Add tutorial notebook
anotherbugmaster 937e340
Document __init__
anotherbugmaster 26a87bd
Fix flake version
anotherbugmaster 261c13a
Fix flake warning
anotherbugmaster 0147afc
Remove comments, reverse parallelization order
anotherbugmaster 1ece3c1
Add NMF's cython extension to setup.py
anotherbugmaster e6409fa
Fix imports, add solve_r function
anotherbugmaster 0743624
Remove comments
anotherbugmaster fd8088b
Add docstrings
anotherbugmaster e4ba0de
Common corpus and common dictionary
anotherbugmaster 8537eef
Remove redundant test
anotherbugmaster d2e8385
Add signature flag
anotherbugmaster b72bf39
Add files to manifest
anotherbugmaster ed080a3
Fix flake8
anotherbugmaster 67f6e75
Fix atol value
anotherbugmaster ee4373d
Implement top topics
anotherbugmaster d01c88c
Add rst files
anotherbugmaster 8111080
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster 3de3646
Fix appveyor issue
anotherbugmaster 183ea2d
Fix cython error
anotherbugmaster d2ac199
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster 2d664c6
Fix fmax/fmin not being on win-python27
anotherbugmaster c9a3577
Add word transformation test
anotherbugmaster fd0de20
Improve readability of residuals computation
anotherbugmaster fa384f2
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster a811c67
Fix tests
anotherbugmaster d063a4f
A few fixes
anotherbugmaster b8f5d79
Blank line at the end of each docstring
anotherbugmaster 361d160
Add blank line
anotherbugmaster e214582
Add the paper reference
anotherbugmaster 9527f39
Fix long line
anotherbugmaster e1e1168
Add log_perplexity
anotherbugmaster 3bf5be3
Merge remote-tracking branch 'remotes/upstream/develop' into online_nmf
anotherbugmaster d1c6e3e
Add NMF and LDA comparison table
anotherbugmaster 7927b6b
Change the sign of log perplexity
anotherbugmaster 1c6517e
Add Sklearn NMF comparison
anotherbugmaster 278fb05
Merge sklearn and tm tables
anotherbugmaster a330327
Add F1
anotherbugmaster 7ba9b84
Remove _solve_r
anotherbugmaster a14bfd3
Merge tutorial and benchmark
anotherbugmaster d28aef3
Identation's back
anotherbugmaster 83ec0f6
Optimize optimizers
anotherbugmaster d25332f
Remove unnecessary pic
anotherbugmaster 0e711d9
Optimize memory consumption
anotherbugmaster cc3085c
Add docstring
anotherbugmaster b090b6b
Optimize get_topic_words
anotherbugmaster e05a1c6
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster ba8ce1c
Fix tests
anotherbugmaster 6d78f83
Fix flake8
anotherbugmaster b16c1dd
Add missing test
anotherbugmaster 7c1e240
Code review fixes
anotherbugmaster 667ae99
n_tokens -> num_tokens
anotherbugmaster 251d5f9
[skip ci] Add explicit normalize parameter
anotherbugmaster 7a3f358
[skip ci] Add explicit normalize parameter[2]
anotherbugmaster c663f33
[skip ci] Update tutorial notebook
anotherbugmaster 8e15cd4
[skip ci] [WIP] Update wikipedia notebook
anotherbugmaster 3c76171
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster 4941745
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster c4d6ebd
Add more description and metrics
anotherbugmaster 3b1195d
[skip ci] Fix log_probabiliy
anotherbugmaster 5edec1b
Multiple format fixes in notebook, outputs cleared til tomorrow
anotherbugmaster 33ce1a3
Merge remote-tracking branch 'upstream/develop' into online_nmf
menshikh-iv 1806bf6
Train on full corpus
anotherbugmaster 3b9b8ea
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster 3f1af1d
[skip ci] Remove disclaimer
anotherbugmaster 38143a9
Add RAM usage stats
anotherbugmaster 72a02db
Native 20-newsgroups and additional text
anotherbugmaster 7cf80e1
Truncate outputs
anotherbugmaster 72178c0
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster 467a2ad
Fix last cell formatting
anotherbugmaster e34b939
[skip ci] Change model hyperparameters back
anotherbugmaster
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,168 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Comparison between sklearn's and gensim's implementations of NMF" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from gensim.models.nmf import NMF as GensimNmf\n", | ||
"from gensim.parsing.preprocessing import preprocess_documents\n", | ||
"from sklearn.decomposition import NMF as SklearnNmf\n", | ||
"from sklearn.datasets import fetch_20newsgroups\n", | ||
"from sklearn.feature_extraction.text import CountVectorizer\n", | ||
"import numpy as np\n", | ||
"from matplotlib import pyplot as plt" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"vectorizer = CountVectorizer()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"bow_matrix = vectorizer.fit_transform(fetch_20newsgroups().data)\n", | ||
"bow_matrix = bow_matrix[:100].toarray()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Sklearn NMF" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"CPU times: user 54.4 s, sys: 38 s, total: 1min 32s\n", | ||
"Wall time: 1min 29s\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"%%time\n", | ||
"\n", | ||
"sklearn_nmf = SklearnNmf(n_components=5, tol=1e-5, max_iter=int(1e9))\n", | ||
"\n", | ||
"W = sklearn_nmf.fit_transform(bow_matrix)\n", | ||
"H = sklearn_nmf.components_" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"184.40183405328017" | ||
] | ||
}, | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"np.linalg.norm(bow_matrix - W.dot(H), 'fro')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Gensim NMF" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"CPU times: user 2min 7s, sys: 8.22 s, total: 2min 15s\n", | ||
"Wall time: 2min 5s\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"%%time\n", | ||
"\n", | ||
"gensim_nmf = GensimNmf(n_components=5)\n", | ||
"\n", | ||
"n_samples = np.array(bow_matrix).shape[0]\n", | ||
"\n", | ||
"gensim_nmf.fit(np.array(bow_matrix))\n", | ||
"W, H = gensim_nmf.get_factor_matrices()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"353.4495647218574" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"np.linalg.norm(bow_matrix - W.dot(H).T, 'fro')" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
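The two error cells in the notebook compare reconstructions by Frobenius norm, and the orientation of the factors matters for that comparison. sklearn returns `W` with shape `(n_samples, n_components)` and `components_` with shape `(n_components, n_features)`, whereas the class added in this PR builds `W` with shape `(n_features, n_components)` and stacks the per-sample codes into an `H` of shape `(n_components, n_samples)`. The following is a minimal NumPy-only sketch of the two layouts on made-up random factors. The variable names and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_components = 6, 4, 2

# Sample-major layout (sklearn): W is (n_samples, n_components) and
# H is (n_components, n_features), so W @ H has the same shape as X.
W_s = rng.random((n_samples, n_components))
H_s = rng.random((n_components, n_features))
X = W_s @ H_s  # toy data that is exactly rank 2

# Feature-major layout (the class in this diff): W is (n_features, n_components)
# and H is (n_components, n_samples), so W @ H is the transpose of X.
W_f, H_f = H_s.T, W_s.T

err_sample_major = np.linalg.norm(X - W_s @ H_s, 'fro')
err_feature_major = np.linalg.norm(X - (W_f @ H_f).T, 'fro')

print(err_sample_major, err_feature_major)
```

Both errors are zero up to rounding here because the toy data is an exact product of the factors. With real factors the same transpose is needed before the two numbers can be compared.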
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
from itertools import chain | ||
|
||
import numpy as np | ||
from scipy.stats import halfnorm | ||
|
||
|
||
class NMF(object): | ||
"""Online Non-Negative Matrix Factorization. | ||
|
||
Attributes | ||
---------- | ||
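_W : numpy.ndarray, shape (n_features, n_components) | ||
Dictionary matrix, initialized on the first call to `fit`. | ||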
|
||
""" | ||
|
||
def __init__(self, n_components, lambda_=1., kappa=1.): | ||
""" | ||
|
||
Parameters | ||
---------- | ||
n_components : int | ||
Number of components in resulting matrices. | ||
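lambda_ : float | ||
Soft-thresholding level applied to the residuals `r`. | ||
kappa : float | ||
Step-size multiplier for the gradient updates of `h` and `W`. | ||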
""" | ||
self.n_features = None | ||
self.n_components = n_components | ||
self.lambda_ = lambda_ | ||
self.kappa = kappa | ||
self._H = [] | ||
self.R = None | ||
self.is_fitted = False | ||
|
||
def _setup(self, X): | ||
self.h, self.r = None, None | ||
X_ = iter(X) | ||
x = next(X_) | ||
n_features = len(x) | ||
avg = np.sqrt(x.mean() / n_features) | ||
X = chain([x], X_) | ||
|
||
self.n_features = n_features | ||
|
||
self._W = np.abs(avg * halfnorm.rvs(size=(self.n_features, self.n_components)) / | ||
np.sqrt(self.n_components)) | ||
|
||
self.A = np.zeros((self.n_components, self.n_components)) | ||
self.B = np.zeros((self.n_features, self.n_components)) | ||
return X | ||
|
||
def fit(self, X, batch_size=None): | ||
""" | ||
|
||
Parameters | ||
---------- | ||
X : matrix or iterator | ||
Matrix to factorize. | ||
batch_size : int or None | ||
If None, samples are processed one at a time. | ||
""" | ||
if self.n_features is None: | ||
X = self._setup(X) | ||
|
||
prod = np.outer | ||
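if batch_size is not None: | ||
prod = np.dot | ||
# _setup() may have wrapped X in an iterator, so materialize it before splitting. | ||
X = np.asarray(list(X)) | ||
length = X.shape[0] | ||
n_batches = max(length // batch_size, 1) | ||
X = np.array_split(X, n_batches, axis=0) | ||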
r, h = self.r, self.h | ||
for v in X: | ||
h, r = self._solveproj(v, self._W, self.lambda_, self.kappa, r=r, h=h) | ||
self._H.append(h) | ||
if self.R is not None: | ||
self.R.append(r) | ||
|
||
self.A += prod(h, h.T) | ||
self.B += prod((v.T - r), h.T) | ||
self._solve_w() | ||
self.r = r | ||
self.h = h | ||
|
||
self.is_fitted = True | ||
|
||
def _solve_w(self): | ||
eta = self.kappa / np.linalg.norm(self.A, 'fro') | ||
n = 0 | ||
lasttwo = np.zeros(2) | ||
while n <= 2 or (np.abs( | ||
(lasttwo[1] - lasttwo[0]) / lasttwo[0]) > 1e-5 and n < 1e9): | ||
self._W -= eta * (np.dot(self._W, self.A) - self.B) | ||
self._W = self._transform(self._W) | ||
n += 1 | ||
lasttwo[0] = lasttwo[1] | ||
lasttwo[1] = 0.5 * np.trace(self._W.T.dot(self._W).dot(self.A)) - \ | ||
np.trace(self._W.T.dot(self.B)) | ||
|
||
def transform(self, X, return_r=False): | ||
H = [] | ||
if return_r: | ||
R = [] | ||
|
||
W = self._W | ||
lambda_ = self.lambda_ | ||
kappa = self.kappa | ||
for v in X: | ||
h, r = self._solveproj(v, W, lambda_, kappa, v_max=np.inf) | ||
H.append(h.copy()) | ||
if return_r: | ||
R.append(r.copy()) | ||
|
||
H = np.stack(H, axis=-1) | ||
if return_r: | ||
return H, np.stack(R, axis=-1) | ||
else: | ||
return H | ||
|
||
def get_factor_matrices(self): | ||
if len(self._H) > 0: | ||
if len(self._H[0].shape) == 1: | ||
H = np.stack(self._H, axis=-1) | ||
else: | ||
H = np.concatenate(self._H, axis=1) | ||
return self._W, H | ||
else: | ||
return self._W, 0 | ||
|
||
@staticmethod | ||
def _thresh(X, lambda_, v_max): | ||
res = np.abs(X) - lambda_ | ||
np.maximum(res, 0.0, out=res) | ||
res *= np.sign(X) | ||
np.clip(res, -v_max, v_max, out=res) | ||
return res | ||
|
||
@staticmethod | ||
def _mrdivide(B, A): | ||
"""Solve xA = B for x (the analogue of MATLAB's B / A). | ||
""" | ||
if len(A.shape) == 2 and A.shape[0] == A.shape[1]: | ||
return np.linalg.solve(A.T, B.T).T | ||
else: | ||
return np.linalg.lstsq(A.T, B.T, rcond=None)[0].T | ||
|
||
def _transform(self, W): | ||
newW = W.copy() | ||
np.maximum(newW, 0, out=newW) | ||
sumsq = np.sqrt(np.sum(newW ** 2, axis=0)) | ||
np.maximum(sumsq, 1, out=sumsq) | ||
return self._mrdivide(newW, np.diag(sumsq)) | ||
|
||
def _solveproj(self, v, W, lambda_, kappa=1, h=None, r=None, v_max=None, max_iter=1e9): | ||
m, n = W.shape | ||
v = v.T | ||
if v_max is None: | ||
v_max = v.max() | ||
if len(v.shape) == 2: | ||
batch_size = v.shape[1] | ||
rshape = (m, batch_size) | ||
hshape = (n, batch_size) | ||
else: | ||
rshape = m, | ||
hshape = n, | ||
if h is None or h.shape != hshape: | ||
h = np.zeros(hshape) | ||
|
||
if r is None or r.shape != rshape: | ||
r = np.zeros(rshape) | ||
|
||
eta = kappa / np.linalg.norm(W, 'fro') ** 2 | ||
|
||
iters = 0 | ||
|
||
while True: | ||
iters += 1 | ||
# Solve for h | ||
htmp = h | ||
h = h - eta * np.dot(W.T, np.dot(W, h) + r - v) | ||
np.maximum(h, 0.0, out=h) | ||
|
||
# Solve for r | ||
rtmp = r | ||
r = self._thresh(v - np.dot(W, h), lambda_, v_max) | ||
|
||
# Stop conditions | ||
stoph = np.linalg.norm(h - htmp, 2) | ||
stopr = np.linalg.norm(r - rtmp, 2) | ||
stop = max(stoph, stopr) / m | ||
if stop < 1e-5 or iters > max_iter: | ||
break | ||
|
||
return h, r |
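For readers skimming the diff, the inner solver `_solveproj` alternates a projected gradient step on the code `h` with a soft-thresholding step on the residual `r`, so a large, isolated deviation in the input can be absorbed by `r` instead of distorting `h`. The following is a standalone sketch of that idea on a toy problem, not the PR's code. The names `soft_threshold` and `solve_h_r`, the fixed iteration count, and the toy data are assumptions for illustration, and the sketch leaves out the `v_max` clipping and the convergence test used in the diff.

```python
import numpy as np


def soft_threshold(x, lam):
    """Shrink every entry of x toward zero by lam; entries within lam of zero become zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def solve_h_r(v, W, lam, n_iter=500):
    """Alternate a non-negative gradient step on h with a thresholding step on r.

    Hypothetical helper: a fixed number of iterations replaces the stopping rule.
    """
    h = np.zeros(W.shape[1])
    r = np.zeros(W.shape[0])
    eta = 1.0 / np.linalg.norm(W, 'fro') ** 2  # step size, i.e. kappa = 1
    for _ in range(n_iter):
        # Gradient step on 0.5 * ||v - W h - r||^2, then projection onto h >= 0.
        h = np.maximum(h - eta * W.T @ (W @ h + r - v), 0.0)
        # Whatever W h cannot explain goes into r, shrunk by lam.
        r = soft_threshold(v - W @ h, lam)
    return h, r


# Toy dictionary with two unit-norm, non-negative columns.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])

# Clean signal W @ [2, 3] plus one outlier of size 5 in the third feature.
v = np.array([2.0, 3.0, 5.0, 0.0])

h, r = solve_h_r(v, W, lam=1.0)
print(h)  # close to [2, 3]
print(r)  # close to [0, 0, 4, 0]: the outlier, shrunk by lam
```

In this toy case the outlier sits in a feature that no column of `W` covers, so it ends up entirely in `r` (reduced by `lam`), while `h` recovers the clean coefficients.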
Still unresolved https://github.com/RaRe-Technologies/gensim/pull/2007/files#r235028697