Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] HDP changes #996

Closed
wants to merge 12 commits into from
Closed

[WIP] HDP changes #996

wants to merge 12 commits into from

Conversation

bhargavvader
Copy link
Contributor

@bhargavvader bhargavvader commented Nov 8, 2016

This is to address issues #262 , #901, #937, #945 - basically to attempt to clean up HDP as much as possible.

For starters, I've made HDP resemble ldamodel as much as possible - with regard to the imports, and the dirichlet_expectation method mirrors the ldamodel one.

Will keep making changes in this PR.

Edit: Would want to fix #952 with this as well.

@bhargavvader
Copy link
Contributor Author

@tmylk , I've also added a method for the HDP to LDA conversion, which returns an LDA model object.
Could you review that and the previous changes and see if it's fine?

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add changelog and tests for the new method.

Returns closest corresponding ldamodel object corresponding to current hdp model.
"""
alpha, beta = self.hdp_to_lda()
ldam = ldamodel.LdaModel(num_topics=150, alpha=alpha, id2word=self.id2word)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this number of topics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's 150 because the default number of topics in HDP before truncation is 150. This means all the internal matrices are of that size, and to make the expElogbeta matrice the same shape to accept the beta, we've to initialise ldam to that many topics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add text to the doc-string about this choice

@@ -34,11 +34,11 @@
from __future__ import with_statement

import logging, time
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is standard abbreviation. Please keep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LSI, LDA, DTM all use numpy instead of np in the code, so I was making it uniform.
Should I make it np in all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will make reviewing this PR much easier if this decorative change wasn't here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll make it back to np.
But I still think it should be uniform throughout the package - will open another PR for the other 3.

@bhargavvader
Copy link
Contributor Author

@tmylk what sort of tests would you want to see?
I removed the np to numpy change; PR #1000 works to make it uniform throughout. It should be easier to review this now though!

@bhargavvader
Copy link
Contributor Author

@tmylk could you have a look?

@tmylk
Copy link
Contributor

tmylk commented Nov 14, 2016

An example test: train HDP, create default LDA using the new method, check they are similar.

@bhargavvader
Copy link
Contributor Author

@tmylk , added tests, and a random_state method similar to LDA for more control during testing.

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just minor changes

return result.astype(alpha.dtype) # keep the same precision as input


def get_random_state(seed):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I instead call it from the ldamodel class? How would this rather be done?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better move to some shared class like util

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also move dirichlet_expectation to matutils?
Same code being used in ldamodel and hdpmodel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@@ -51,12 +52,29 @@ class TestHdpModel(unittest.TestCase, basetests.TestBaseTopicModel):
def setUp(self):
self.corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm'))
self.class_ = hdpmodel.HdpModel
self.model = self.class_(corpus, id2word=dictionary)
self.model = self.class_(corpus, id2word=dictionary, random_state=np.random.seed(0))

def testShowTopic(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but that only tests the types of the results; I thought it would be important to also test for particular values. The nature of my test is different. Should I maybe name it something else?

Also, how do I make this test file run the base class tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, give these test a different name.
it should run base class test automatically. can you see them in the log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, and yup the basetest tests show up in HDP.

@bhargavvader
Copy link
Contributor Author

Changed the test name, moved functions to utils and matutils, and changed all the numpy to np there as well (I had forgotten to last time, only had changed the models).

@bhargavvader
Copy link
Contributor Author

@tmylk , could you please review? Travis fails on all the python versions because of
AttributeError: 'KeyedVectors' object has no attribute 'syn0', which is a word2vec (unrelated) test.

I've added show_topic as well now.

@bhargavvader
Copy link
Contributor Author

Ping @tmylk ! Any clue what to do with the tests?

@tmylk
Copy link
Contributor

tmylk commented Nov 29, 2016

Did keyed vectors get merged in correctly from develop branch?

@bhargavvader
Copy link
Contributor Author

How do I check if it has been? My guess is it hasn't, because the test is failing again. But when I do git pull upstream develop from the HDP branch I'm working on, it says everything is up to date.

@bhargavvader
Copy link
Contributor Author

@tmylk , I checked, and syn0 is in the keyedvector class too. It's also identical to the file in the repo. Any clue why this is happening? And my changes shouldn't make the tests fail, anyway...

@bhargavvader
Copy link
Contributor Author

@tmylk how do I check if keyedvectors is merged in correctly?

@tmylk
Copy link
Contributor

tmylk commented Dec 22, 2016

@bhargavvader may I ask for a new pr with just HDP changes and nothing else? Would like to merge it to new release this week.

@bhargavvader
Copy link
Contributor Author

Sure, will close this and open 2 separate PRs this weekend.

@bhargavvader bhargavvader mentioned this pull request Dec 23, 2016
@bhargavvader bhargavvader deleted the HDP branch February 23, 2017 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add more tests for HdpModel
2 participants