READ_ME_GUIDEDLDA.txt
This is a README on how to get guided_lda working when the package fails to build wheels.
It is in no way meant to undermine the author of the guided_lda package or to replace the package in any way;
it is simply a workaround until the issue can be fully resolved.
1. Pull down (clone) the GuidedLDA_WorkAround repository.
2. Install the original LDA package (pip install lda).
https://pypi.org/project/lda/
3. Drop the *.py files from the GuidedLDA_WorkAround repo into the lda folder under site-packages for your specific environment.
4. Profit...
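Steps 2 and 3 can also be scripted. The following is only a sketch, not part of the workaround repo: the helper names find_package_dir and copy_py_files are hypothetical, and the clone location ./GuidedLDA_WorkAround is an assumption. It locates the installed lda folder for the current environment and copies the top-level *.py files into it.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package_name):
    """Return the folder of an installed package, or None if it cannot be found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def copy_py_files(src_dir, dest_dir):
    """Copy every top-level *.py file from src_dir into dest_dir; return the copied names."""
    copied = []
    for py_file in sorted(Path(src_dir).glob("*.py")):
        shutil.copy2(py_file, Path(dest_dir) / py_file.name)
        copied.append(py_file.name)
    return copied


if __name__ == "__main__":
    # Assumption: the repository was cloned into ./GuidedLDA_WorkAround
    repo_dir = Path("GuidedLDA_WorkAround")
    lda_dir = find_package_dir("lda")
    if lda_dir is None:
        print("The lda package is not installed in this environment.")
    elif not repo_dir.is_dir():
        print("Repository folder not found:", repo_dir)
    else:
        print("Copied into", lda_dir, ":", copy_py_files(repo_dir, lda_dir))
```

Run it with the same Python interpreter (virtualenv or conda environment) that you will use for the example, so that the files land in the matching site-packages folder.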
EXAMPLE:
import numpy as np
from lda import guidedlda as glda
##from lda import glda_datasets as gldad
##X = gldad.load_data(gldad.NYT)
##vocab = gldad.load_vocab(gldad.NYT)
## !!!! Update courtesy of @senjed !!!!
import lda.datasets as gldad
X = gldad.load_reuters()
vocab = gldad.load_reuters_vocab()
word2id = dict((v, idx) for idx, v in enumerate(vocab))
print(X[:10])
print("TESTING....")
seed_topic_list = [['game', 'team', 'win', 'player', 'season', 'second', 'victory'],
                   ['percent', 'company', 'market', 'price', 'sell', 'business', 'stock', 'share'],
                   ['music', 'write', 'art', 'book', 'world', 'film'],
                   ['political', 'government', 'leader', 'official', 'state', 'country',
                    'american', 'case', 'law', 'police', 'charge', 'officer', 'kill', 'arrest', 'lawyer']]
model = glda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)
seed_topics = {}
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        seed_topics[word2id[word]] = t_id
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
n_top_words = 10
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
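# Example output (exact topics depend on the dataset and may differ for the Reuters data loaded above):
# Topic 0: game play team win season player second point start victory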
# Topic 1: company percent market price business sell executive pay plan sale
# Topic 2: play life man music place write turn woman old book
# Topic 3: official government state political leader states issue case member country
# Topic 4: school child city program problem student state study family group
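
NOTE: In the example above, word2id[word] raises a KeyError if a seed word is not in the loaded vocabulary, which can happen when a seed list written for one corpus is used with another. Below is a minimal sketch of a guarded version of the seed loop. build_seed_topics is a hypothetical helper, not part of lda or guidedlda, and the vocabulary is a toy one for illustration only.

```python
def build_seed_topics(seed_topic_list, word2id):
    """Map word ids to topic ids, skipping seed words that are not in the vocabulary."""
    seed_topics = {}
    missing = []
    for topic_id, seed_words in enumerate(seed_topic_list):
        for word in seed_words:
            if word in word2id:
                seed_topics[word2id[word]] = topic_id
            else:
                missing.append(word)
    return seed_topics, missing


# Toy vocabulary for illustration only (not the Reuters vocabulary).
vocab = ("game", "team", "market", "price", "music")
word2id = {word: idx for idx, word in enumerate(vocab)}
seed_topic_list = [["game", "team", "victory"], ["market", "price"]]

seed_topics, missing = build_seed_topics(seed_topic_list, word2id)
print(seed_topics)  # {0: 0, 1: 0, 2: 1, 3: 1}
print(missing)      # ['victory']
```

The returned seed_topics dict has the same shape as the one passed to model.fit in the example, and the missing list shows which seed words were dropped.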