Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CoherenceModel for LdaModel.top_topics #1427

Merged
merged 4 commits into from
Aug 18, 2017

Conversation

macks22
Copy link
Contributor

@macks22 macks22 commented Jun 18, 2017

This will resolve #1128.

Sweeney, Mack added 3 commits June 17, 2017 21:26
@menshikh-iv
Copy link
Contributor

@chinmayapancholi13 this PR fix problem with coherence and SklWrappers?

topic = topic / topic.sum() # normalize to probability distribution
bestn = matutils.argsort(topic, topn, reverse=True)
return [(id, topic[id]) for id in bestn]

def top_topics(self, corpus, num_words=20):
def top_topics(self, corpus=None, texts=None, dictionary=None, window_size=None,
coherence='u_mass', topn=20, processes=-1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should use all coherence types for this method, u_mass will be enough for it (for this reason please remove coherence parameters from arguments)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several papers have shown that sliding-window based coherence metrics have higher correlation with human judgements, and u_mass has lower correlations with human judgements than these methods. Why limit to u_mass when it is not the best technique for coherence calculations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, but for "sliding windows" approach user should pass corpus, texts and dictionary, it's slightly complicated.

Although perhaps you are right, if the default method is UMass I see no difference for users.

"""
Calculate the Umass topic coherence for each topic. Algorithm from
**Mimno, Wallach, Talley, Leenders, McCallum: Optimizing Semantic Coherence in Topic Models, CEMNLP 2011.**
Calculate the coherence for each topic; default is Umass coherence. See the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Googe-style docstring (here and everywhere)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you're looking for here, though I've noticed this comment in several of my PRs. To clarify, are you asking for the beginning line to start on the same line as the start quotes (""")? Or are you asking that I include the Args specification for this method and the others? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion, I mean Args and Returns sections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed throughout

@macks22
Copy link
Contributor Author

macks22 commented Jul 30, 2017

@menshikh-iv when you have a moment, I'd love your thoughts on my comments. Based on your comments, I think this may still need some changes, but I need your guidance on how to make them. Thanks!

@menshikh-iv
Copy link
Contributor

Sorry for waiting, please fix doc-strings and I'll merge this PR 👍

@macks22
Copy link
Contributor Author

macks22 commented Aug 13, 2017

@menshikh-iv I'm not sure how this error relates to my PR; maybe something with bad scikit version in build server? The current build is only failing due to this error in test_sklearn_api:

======================================================================
ERROR: testPipeline (gensim.test.test_sklearn_api.TestLdaSeqWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_sklearn_api.py", line 340, in testPipeline
    text_ldaseq.fit(corpus, test_target)
  File "/home/travis/miniconda2/envs/gensim-test/lib/python3.5/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/travis/miniconda2/envs/gensim-test/lib/python3.5/site-packages/sklearn/pipeline.py", line 226, in _fit
    self.steps[step_idx] = (name, fitted_transformer)
TypeError: 'tuple' object does not support item assignment

@menshikh-iv
Copy link
Contributor

@macks22 this problem isn't related to your changes. This problem connected with new sklearn release, we will fix it soon.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Aug 18, 2017

Nice work @macks22, you are very productive:+1:

@menshikh-iv menshikh-iv merged commit f24ec78 into piskvorky:develop Aug 18, 2017
fabriciorsf pushed a commit to LINE-PESC/gensim that referenced this pull request Aug 23, 2017
…iskvorky#1427)

* Add a `get_topics` method to all topic models, add test coverage for this, and update the `CoherenceModel` to use this for getting topics from models.

* Require topics returned from `get_topics` to be probability distributions for the probabilistic topic models.

* Replace code in `LdaModel.top_topics` with use of `CoherenceModel`.

* Fix docstrings to use Google style throughout PR changes and various LdaModel methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use CoherenceModel in top_topics in LDAModel to avoid code duplication
2 participants