Removing a topic from an HDPModel #152
Hi @bertomartin, here is a blueprint of the proposed purge feature:

```python
model = tp.HDPModel(...)
...
model.train(...)
# The model may have a lot of dead topics at this point, e.g.
# 0: live topic
# 1: live topic
# 2: dead topic
# 3: live topic
# 4: dead topic
# 5: dead topic

# Purge all dead topics and relocate the live topics.
relocate_result = model.purge_dead_topics()
# `relocate_result` is an array where `relocate_result[i]` holds the new topic id
# for old topic `i`, or -1 if old topic `i` was purged, e.g. [0, 1, -1, 2, -1, -1].
assert model.k == model.live_k
# At this point `model.k` equals `model.live_k`, e.g. model.k == 3, model.live_k == 3.
```
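For illustration, here is a minimal sketch of how the relocation mapping from the blueprint above could be applied to topic labels recorded before the purge. Note that `purge_dead_topics` is only the proposed API from the blueprint, and `old_labels` is a hypothetical array of per-document topic ids; only the example mapping `[0, 1, -1, 2, -1, -1]` comes from the blueprint itself.

```python
import numpy as np

# Hypothetical: topic ids assigned to documents before purging,
# expressed in the old (pre-purge) numbering.
old_labels = np.array([0, 3, 1, 5, 2])

# Example relocation mapping from the blueprint above.
relocate_result = np.array([0, 1, -1, 2, -1, -1])

# Map old topic ids to new ones; labels that pointed at purged topics become -1.
new_labels = relocate_result[old_labels]
print(new_labels)  # [ 0  2  1 -1 -1]
```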
@bab2min thanks for the response. Yes, I meant to purge them after the model has been built (training is already completed). Your blueprint makes sense to me. What I'm really after is a contiguous set of clean topics, so that I can do topic similarity without querying a 'dead' topic, and so that dead topics don't show up in the pyLDAvis output, where they don't really add anything.
Thank you! In the meantime I was wondering if I could somehow filter out these topics when I build the LDAvis display. So basically the plan is to construct the display as below:

```python
topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]

doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq
```

The problem is that the docs are not related to K, or at least I don't see how to relate them. Ideally I would only want docs that occur in live topics in order to get this to work.
@bertomartin

```python
live_topics = [k for k in range(mdl.k) if mdl.is_live_topic(k)]  # topics you want to visualize

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists[live_topics]  # select only `live_topics`
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)

doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists = doc_topic_dists[:, live_topics]  # select only `live_topics`
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq
...
```

I uploaded a new example combining pyLDAvis and HDPModel.
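For completeness, here is a minimal sketch (not from the thread, and separate from the uploaded example) of feeding the arrays built above into pyLDAvis. It assumes `topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab`, and `term_frequency` are in scope from the snippet above, that pyLDAvis is installed, and that your pyLDAvis version supports the `sort_topics` keyword; the output filename is arbitrary.

```python
import pyLDAvis

# Build the interactive visualization from the precomputed arrays.
prepared = pyLDAvis.prepare(
    topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency,
    sort_topics=False,  # keep the panel numbering aligned with `live_topics`
)
pyLDAvis.save_html(prepared, 'hdp_live_topics.html')
```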
Sweet! I figured out a hacky way, but this looks better. Thank you!
Original issue from @bertomartin: Hi, I have an HDP model and I was wondering if there's an easy way to remove a topic from the model. For instance, it's easy to check whether a topic is "live" or "dead", but can you update the model so that it no longer includes the dead topics and then re-save the model artifact? I guess this would also involve removing the tomotopy documents associated with the dead topics.
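As a quick illustration of the live/dead check mentioned here, a sketch assuming an already-trained tomotopy HDPModel named `mdl` (both `is_live_topic` and `live_k` are the attributes used elsewhere in this thread):

```python
# Assumes `mdl` is a trained tomotopy.HDPModel.
live = [k for k in range(mdl.k) if mdl.is_live_topic(k)]
dead = [k for k in range(mdl.k) if not mdl.is_live_topic(k)]
print(f'{len(live)} live topics (mdl.live_k == {mdl.live_k}), {len(dead)} dead topics')
```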