infer method topic distribution of doc mostly zeros #49

Closed
ecoronado92 opened this issue May 19, 2020 · 6 comments
Labels
bug Something isn't working

Comments


ecoronado92 commented May 19, 2020

Hi -

I fitted an HDP model and tried to obtain the topic distribution for an unseen document. I do get a list, but most of the entries are zeros, so I suspect there might be a rounding issue in the code.

Here's an example of what it looks like:

token_list = ['strong', 'organization', 'rusnews', 'line',  'misery', 'write', 'faq', 'ever', 'get', 
'modify', 'define', 'strong', 'atheist', 'believe', 'word']

doc_inst = hdp_model.make_doc(token_list)
topic_dist, ll = hdp_model.infer(doc_inst)

topic_dist
[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0, ## <--- Here's the only non-zero element which is correct, but I'd like to get %'s
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

Here's some info on my environment:

Darwin-18.5.0-x86_64-i386-64bit
Python 3.7.6 (default, Dec 30 2019, 19:38:28) 
[Clang 11.0.0 (clang-1100.0.33.16)]
NumPy 1.18.1
SciPy 1.4.1
tomotopy 0.7.1
@valmirselmani

Hi,

I can confirm this behaviour. For most documents the topic distribution is mostly zeros, and only one or two topics have values greater than 0. That value is usually exactly 1, but it can also come out slightly above 1, for example 1.0000157356262207. It seems that HDP is very confident in its topic assignments.
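A minimal sketch of how such output could be checked, assuming the distribution is available as a plain list of floats like the one in the original report. The helpers `summarize_topic_dist` and `renormalize` are hypothetical names made up for this illustration and are not part of tomotopy. Renormalizing only hides a sum that drifts above 1; it does not address the inference bug itself.

```python
import math


def summarize_topic_dist(topic_dist, tol=1e-4):
    """Return simple diagnostics for one inferred topic distribution.

    Hypothetical helper, not part of tomotopy.
    """
    total = math.fsum(topic_dist)
    nonzero = [p for p in topic_dist if p > 0.0]
    return {
        "sum": total,
        "num_nonzero": len(nonzero),
        # Index of the topic that carries the most probability mass.
        "max_topic": max(range(len(topic_dist)), key=topic_dist.__getitem__),
        # True when all of the mass sits on a single topic.
        "is_one_hot": len(nonzero) == 1,
        # True when the entries sum to 1 within the given tolerance.
        "sums_to_one": abs(total - 1.0) <= tol,
    }


def renormalize(topic_dist):
    """Rescale the entries so that they sum to exactly 1 (up to rounding)."""
    total = math.fsum(topic_dist)
    if total <= 0.0:
        raise ValueError("topic distribution has no positive mass")
    return [p / total for p in topic_dist]


# The 23-entry output from the original report: all mass on topic 8.
reported = [0.0] * 23
reported[8] = 1.0

summary = summarize_topic_dist(reported)
print(summary["is_one_hot"], summary["max_topic"])
```

Running the summary over every inferred document gives a quick count of how many distributions are one-hot, which is one way to compare results before and after a fix.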

I am currently writing a bachelor's thesis in which we are building a topic model to recommend similar documents. It's important that few documents share the same topic distribution, so that we can rank them and thus improve the recommendations. The results of HDP are quite good, though.
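To illustrate why identical distributions hurt the ranking, here is a small sketch under stated assumptions: cosine similarity is used as the similarity measure (the thread does not say which measure the thesis uses), and all document names and numbers are made up for the example.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, dist))
              for name, dist in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up one-hot distributions, like the ones reported in this issue.
one_hot_query = [0.0, 1.0, 0.0]
one_hot_candidates = {
    "doc_a": [0.0, 1.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 0.0, 0.0],
}

# Made-up graded distributions for the same three documents.
graded_query = [0.2, 0.7, 0.1]
graded_candidates = {
    "doc_a": [0.1, 0.8, 0.1],
    "doc_b": [0.3, 0.6, 0.1],
    "doc_c": [0.7, 0.2, 0.1],
}

# doc_a and doc_b tie in the one-hot case, so their order is arbitrary.
print(rank_by_similarity(one_hot_query, one_hot_candidates))
# With graded distributions every candidate gets a distinct score.
print(rank_by_similarity(graded_query, graded_candidates))
```

With one-hot distributions every document assigned to the same topic receives the same score, so the recommender cannot order them. Graded distributions break those ties.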

I also use tomotopy 0.7.1 and have seen this behavior in several versions.

bab2min added the bug label May 21, 2020
Owner

bab2min commented May 21, 2020

Thank you for reporting a bug. I'll examine it.

@valmirselmani

Any plans on when you're going to make a release so I can test?

@valmirselmani

I saw that you released a test version. I installed it and ran it through a small number of documents. I was afraid that the topics would change, but that is not the case. The results look much better now. Thanks for the quick fix.

Owner

bab2min commented May 25, 2020

Oh, did you see the test version I released? It actually has a bug that causes segmentation faults. They don't always occur, but they occur often. So I will check a little more, fix the problem, and then include the fix in the next update.
Thanks for reporting it!

@valmirselmani

Yes, I installed it from test.pypi.org. However, I only tested inference on a small number of documents, so I did not notice the error at all. Keep up the good work!

I'll wait for the release to infer my 575k documents. 😅

bab2min added a commit that referenced this issue Jun 4, 2020
fixed HDP inference bug (#49)
implemented converting HDP to LDA (#50)
added used_vocabs (#54)
added g-DMR model
bab2min closed this as completed in 039e09d Jun 6, 2020