-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Lda training visualization in visdom #1399
Conversation
…into tensorboard_logs
Updated this PR to use visdom, removed tensorboard code. There are few parameters (distance, coherence, texts, window_size, topn) in ldamodel required for two different fuctionalities (one for visualization in this PR and other for logging in #1381). Maybe we can define a dict based input for visualization related parameters differently, for example Viz={param1:value1, param2:value2, ..}. Or otherwise, what alternatives can be used? |
Please add the ipynb with instructions on how to setup visdom to show the live stats |
gensim/models/ldamodel.py
Outdated
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, | ||
iterations=50, gamma_threshold=0.001, minimum_probability=0.01, | ||
random_state=None, ns_conf={}, minimum_phi_value=0.01, | ||
per_word_topics=False, viz=False, env=None, distance="kulback_leibler", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move all parameters to metrics
by default, metrics={}
,
in another case, metrics
contains needed parameters for viz some things, for example
metrics = {
'diff_distance': 'jaccard',
'coherence_windows_size': 10,
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and don't forget about args validation (write a tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validation about every arg's object type? Or for the valid values of 'diff_distance' and 'coherence' (though, for which there is a test already in their respective functions of Diff and CoherenceModel)?
" Now, this graph along with the others explained below, can be used to decide if it's time to stop the training. We can see if the value stops changing after some epochs and that we are able to get the highest possible coherence of our model. \n", | ||
"\n", | ||
"\n", | ||
"2. **Perplexity**\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The numbers will not be rendered correctly, please remove them
gensim/models/ldamodel.py
Outdated
|
||
`env` defines the environment to use in visdom browser | ||
|
||
`distance` measure to be used for Diff plot visualization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit docstring in accordance with comment about viz
gensim/models/ldamodel.py
Outdated
gamma_threshold=None, chunks_as_numpy=False): | ||
def update(self, corpus, chunksize=None, decay=None, offset=None, passes=None, update_every=None, | ||
eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False, | ||
viz=None, env=None, distance=None, coherence=None, texts=None, window_size=None, topn=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add metrics
for all args about metrics
gensim/models/ldamodel.py
Outdated
if self.viz: | ||
# calculate coherence | ||
cm = gensim.models.CoherenceModel(model=self, corpus=corpus, texts=texts, coherence=coherence, window_size=window_size, topn=topn) | ||
Coherence = np.array([cm.get_coherence()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the variable name should be in lowercase (here and anywhere)
gensim/models/ldamodel.py
Outdated
if reallen != lencorpus: | ||
raise RuntimeError("input corpus size changed during training (don't use generators as input)") | ||
|
||
if self.viz: | ||
# calculate coherence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this block in separate function (with stat calc)
The error för Travis:
Seems like a real error. |
There's a typo in "kullback_leibler" in diff function here. I replaced it with correct spelling in this PR but forgot to make the replacement in tests also. Though this PR might get a bit delayed due to API decisions so I'll just make another PR for correcting this typo. |
@tmylk @menshikh-iv updated the API structure as discussed recently |
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, | ||
iterations=50, gamma_threshold=0.001, minimum_probability=0.01, | ||
random_state=None, ns_conf={}, minimum_phi_value=0.01, | ||
per_word_topics=False, callbacks=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add description for callbacks
in docstring
gensim/models/callbacks.py
Outdated
if any(isinstance(metric, (DiffMetric, ConvergenceMetric)) for metric in self.metrics): | ||
self.previous = copy.deepcopy(model) | ||
# store diff diagnols of previous epochs | ||
self.diff_mat = Queue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a reason to use queue (not list
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As there could be any no. of diff/convergence metric input (for ex. with different distance measures), so by using queue, we won't need to keep track of their no. and epoch count, and simply depend on the sequence order.
gensim/models/callbacks.py
Outdated
|
||
|
||
class CoherenceMetric(Metric): | ||
def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, window_size=None, topn=None, logger=None, viz_env=None, title=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add docstring for all parameters with description (here and anywhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's a good idea to use logger="shell" by default for all scalar metrics?
@@ -0,0 +1,189 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an example with logger="shell" in notebook (and show logging output in notebook)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally have some time for review, wrapping my head around the classes and callbacks here :) Your help will be appreciated.
gensim/models/callbacks.py
Outdated
|
||
def get_value(self, **parameters): | ||
""" | ||
Set the parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't get_value
a misnomer for setting parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll replace with set_parameters
"# define other remaining metrics available\n", | ||
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Coherence (u_mass)\")\n", | ||
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Diff (kullback_leibler)\")\n", | ||
"convergence_jc = ConvergenceMetric(distance=\"hellinger\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Convergence (jaccard)\")\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs say jaccard, but distance is hellinger.
"source": [ | ||
"# Training Logs\n", | ||
"\n", | ||
"We can also log the metric values after every epoch to the shell apart from visualizing them in Visdom. The only change is to define `logger=\"shell\"` instead of `\"visdom\"` in the input callbacks." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an example on how to get the values out programatically (no logging or plotting, using them from arbitrary Python code instead)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be done by using a loop to manually iterate over model and call metric classes at the end to store value:
model=LdaModel(passes=1)
perplexity=[]
for epoch in range(epochs):
model.update(passes=1)
pl = PerplexityMetric().get_value(model)
perplexity.append(pl)
and sure, I'll add an example for this in notebook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Is there a way to use the callbacks in a way that they collect this info?
I'm thinking a type of "logger" that instead of logging, appends the value to some internal list. Which other parts of the app can read from.
The idea is the interface would be the same as for logger=visdom/shell
, without a need for an explicit outer loop like in your example.
Is that possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can store them just after they are calculated in this step.
Maybe in a dict which could be an attribute of LdaModel? Structure could be:
metrics = {'PerplexityMetric':[val1, val2, ...], 'DiffMetric':[val1, val2, ...] }
on_epoch_end()
can be made to return the metric values from current epoch which could then be appended in metrics
dict after this step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used metrics
dict to save values as described above
"INFO:gensim.models.ldamodel:topic #3 (0.200): 0.041*\"the\" + 0.024*\"of\" + 0.021*\"to\" + 0.016*\"a\" + 0.016*\"and\" + 0.012*\"s\" + 0.010*\"in\" + 0.007*\"as\" + 0.006*\"with\" + 0.006*\"party\"\n", | ||
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n", | ||
"INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use same logging format as in LdaModel (INFO:gensim.models.ldamodel
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n", | ||
"INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n", | ||
"INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why perplexity showed twice (and different values)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One is for holdout corpus and other for test corpus. Though, updated the log statement now to define metric name from the input parameter title
"INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n", | ||
"INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n", | ||
"INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712 0.96670853 0.8036907 0.72372737 0.63141336]\n", | ||
"INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why always zero ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous state of model (needed to calculate Convergence
) was being saved at wrong place. Corrected it
"INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n", | ||
"INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712 0.96670853 0.8036907 0.72372737 0.63141336]\n", | ||
"INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n", | ||
"INFO:gensim.models.ldamodel:-7.121 per-word bound, 139.2 perplexity estimate based on a held-out corpus of 25 documents with 2182 words\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perplexity calculates twice, wdyt about it @parulsethi? So, you should disable evaluation flag in a notebook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disabled eval_every
['graph', 'trees', 'binary', 'widths']] | ||
""" | ||
# only one of the model or topic would be defined | ||
self.model = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should you do this assignment? (only in current Callback)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As both model
and topics
can be used to calculate Coherence, and only one of them would be defined in **kwargs
. So this assignment is just to avoid name not defined error for the other variable which is not in **kwargs
.
gensim/models/callbacks.py
Outdated
other_model : second topic model instance to calculate the difference from | ||
""" | ||
super(DiffMetric, self).set_parameters(**kwargs) | ||
diff_matrix, _ = self.model.diff(self.other_model, self.distance, self.num_words, self.n_ann_terms, self.normed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now you can use new version for diff (with diagonal
and annotation
flags)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code style comments.
"test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n", | ||
"lee_corpus = test_data_dir + os.sep + 'lee.cor'\n", | ||
"lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.path.join
more standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
gensim/models/ldamodel.py
Outdated
callback = Callback(self.callbacks) | ||
callback.set_model(self) | ||
# initialize metrics dict to store metric values after every epoch | ||
self.metrics = {} | ||
for metric in self.callbacks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dict comprehension more readable?
Also, defaultdict
might make the logic a little simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use defaultdict
gensim/models/callbacks.py
Outdated
# plot all metrics in current epoch | ||
for i, metric in enumerate(self.metrics): | ||
value = metric.get_value(topics=topics, model=self.model, other_model=self.previous) | ||
if metric.title is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this if/else, instead of this define __str__
or __repr__
method for each metric class, after it you can use label = str(metric)
gensim/models/callbacks.py
Outdated
""" | ||
Base Metric class for topic model evaluation metrics | ||
""" | ||
def __init__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to define empty __init__
in base class
LGTM I think, last minor changes and I'll merge it |
gensim/models/callbacks.py
Outdated
@@ -87,6 +93,9 @@ def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, win | |||
self.viz_env = viz_env | |||
self.title = title | |||
|
|||
def __str__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, if you have no method
in child-class, this method will be called from parent
class -> no need to call __str__
from each callback explicitly.
"\n", | ||
"def read_corpus(fname):\n", | ||
" texts = []\n", | ||
" with open(fname, encoding=\"ISO-8859-1\") as f:\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't work for python2 (because encoding
isn't supported in python2), replace it to smart_open.
"\n", | ||
"\n", | ||
"# Set file names for train and test data\n", | ||
"test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use large dataset for this? (download from link in notebook)
"test_dictionary = Dictionary(test_texts)\n", | ||
"\n", | ||
"training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]\n", | ||
"holdout_corpus = [holdout_dictionary.doc2bow(text) for text in holdout_texts]\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's 3 different mappings, mistake.
You should fit your Dictionary
on training_texts and use it for all conversions (for holdout/test too)
Could you please add to ipynb an actual screenshot of the entire Vizdom vis that your code generates? with both train/hold-out perplexity in particular. |
Congratulations @parulsethi, this summer you did a lot of viz for gensim. |
This PR adds an option to visualize LDA evaluation parameters in real-time while training/or after, using visdom.