Removing a topic from a HDPModel #152

bertomartin · 2021-11-17T20:13:21Z

Hi, I have an HDP model and was wondering whether there's an easy way to remove a topic from it. For instance, it's easy to check whether a topic is "live" or "dead", but can you update the model so that it no longer includes the dead topics and then re-save the model artifact? I guess this would also involve removing the tomotopy documents associated with the dead topics.

bab2min · 2021-11-18T11:38:36Z

Hi @bertomartin
As you know, tomotopy currently has no feature for removing dead topics from HDP models.
This is because dead and live topics can swap during training, so removing them mid-training would cause frequent reallocations and slow down training overall.
But removing dead topics after training has finished seems like a pretty reasonable request. I'll try to implement it in the next update.

bab2min · 2021-11-18T11:52:06Z

Blueprint of the purge_dead_topics method for tomotopy.HDPModel:

import tomotopy as tp

model = tp.HDPModel(...)
...
model.train(...)

# model may have a lot of dead topics at this point, e.g.
#  0: live topic
#  1: live topic
#  2: dead topic
#  3: live topic
#  4: dead topic
#  5: dead topic

# purge all dead topics and renumber the live topics contiguously.
relocate_result = model.purge_dead_topics()

# `relocate_result` is an array where `relocate_result[i]` holds the new topic id for old topic `i`, or -1 if old topic `i` was purged.
# e.g. [0, 1, -1, 2, -1, -1]

# at this point, `model.k` should be equal to `model.live_k`, e.g. model.k == 3, model.live_k == 3
assert model.k == model.live_k
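
To make the meaning of the relocation array concrete, here is a minimal pure-Python sketch that reproduces the mapping described in the blueprint. It is not tomotopy's implementation: `build_relocation` and `remap_topic_ids` are hypothetical helper names, and the liveness flags are hard-coded to match the example above.

```python
from typing import List, Sequence


def build_relocation(is_live: Sequence[bool]) -> List[int]:
    """Map each old topic id to its new id, or -1 if the topic is dead.

    Hypothetical helper that mimics the array described in the blueprint.
    """
    relocation = []
    next_id = 0
    for live in is_live:
        if live:
            relocation.append(next_id)  # live topics get the next free slot
            next_id += 1
        else:
            relocation.append(-1)  # dead topics are marked as purged
    return relocation


def remap_topic_ids(old_ids: Sequence[int], relocation: Sequence[int]) -> List[int]:
    """Translate old topic ids to new ones, dropping ids of purged topics."""
    return [relocation[t] for t in old_ids if relocation[t] != -1]


# Same layout as the blueprint: topics 2, 4 and 5 are dead.
is_live = [True, True, False, True, False, False]
relocation = build_relocation(is_live)

# Topic ids saved before the purge can be translated afterwards.
remapped = remap_topic_ids([0, 3, 2, 1, 5], relocation)
```

Anything that stored topic ids before the purge (labels, notes, cached similarities) would need this kind of translation afterwards.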

bertomartin · 2021-11-18T15:56:12Z

@bab2min thanks for the response. Yes, I meant purging them after the model has been built (training is already complete). Your blueprint makes sense to me. What I'm really after is a contiguous set of clean topics, so that I can compute topic similarity without querying a 'dead' topic. Likewise, when outputting them in pyLDAvis, I don't want to see the dead topics, as they don't really add anything.
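
As an illustration of the topic-similarity use case, the following sketch computes cosine similarity between live topics only. It assumes the topic-word distributions are already available as a 2-D array and uses synthetic numbers; `live_topic_similarity` is a hypothetical helper, and cosine similarity is just one possible choice of measure. With tomotopy, the rows would presumably come from `get_topic_word_dist(k)` and the flags from `is_live_topic(k)`.

```python
import numpy as np


def live_topic_similarity(topic_word, is_live):
    """Return the ids of live topics and their pairwise cosine similarity.

    Hypothetical helper; `topic_word` has one row per topic.
    """
    topic_word = np.asarray(topic_word, dtype=float)
    live_ids = np.flatnonzero(np.asarray(is_live, dtype=bool))
    live = topic_word[live_ids]  # drop rows of dead topics
    unit = live / np.linalg.norm(live, axis=1, keepdims=True)  # unit-length rows
    return live_ids, unit @ unit.T


# Synthetic topic-word distributions over a 3-word vocabulary (topic 2 is dead).
topic_word = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [1 / 3, 1 / 3, 1 / 3],
    [0.6, 0.3, 0.1],
]
is_live = [True, True, False, True]

live_ids, sim = live_topic_similarity(topic_word, is_live)
# sim[i, j] compares original topics live_ids[i] and live_ids[j].
```

Returning `live_ids` alongside the matrix keeps the link back to the original topic numbering, which matters as long as the model itself still contains the dead topics.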

bertomartin · 2021-11-30T21:35:58Z

Thank you! In the meantime, I was wondering whether I could somehow filter out these topics when building the pyLDAvis display. The plan is to construct the display inputs as below:

import numpy as np

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

The problem is that the docs are not tied to K, or at least I don't see how to relate them. Ideally, I would only want docs that occur in live topics for this to work.

bab2min · 2021-12-02T16:25:19Z

@bertomartin
You can filter out dead topics using numpy indexing like:

import numpy as np

live_topics = [k for k in range(mdl.k) if mdl.is_live_topic(k)] # topics you want to visualize

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists[live_topics] # select only `live_topics`
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)

doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists = doc_topic_dists[:, live_topics] # select only `live_topics`
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq
...

I uploaded a new example that combines pyLDAvis with HDPModel:
https://github.com/bab2min/tomotopy/blob/main/examples/hdp_visualization.py
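
For readers who want to see the indexing in isolation, here is a small self-contained sketch of the same filter-then-renormalize step on synthetic arrays, with no trained model required. `keep_live_topics` is a hypothetical helper name and the numbers are made up. Note that no documents are dropped: only the topic rows and columns are selected.

```python
import numpy as np


def keep_live_topics(topic_term, doc_topic, live_topics):
    """Select live topics and renormalize both matrices so rows sum to 1.

    Hypothetical helper wrapping the indexing shown above.
    """
    topic_term = np.asarray(topic_term, dtype=float)[live_topics]  # rows = topics
    topic_term = topic_term / topic_term.sum(axis=1, keepdims=True)

    doc_topic = np.asarray(doc_topic, dtype=float)[:, live_topics]  # columns = topics
    doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    return topic_term, doc_topic


# Synthetic data: 3 topics over 3 words, 2 documents; topic 1 is treated as dead.
topic_term = np.array([[0.5, 0.3, 0.2],
                       [0.2, 0.2, 0.6],
                       [0.1, 0.8, 0.1]])
doc_topic = np.array([[0.6, 0.1, 0.3],
                      [0.2, 0.2, 0.6]])
live_topics = [0, 2]

tt, dt = keep_live_topics(topic_term, doc_topic, live_topics)
```

The second renormalization is needed because removing columns leaves each document's remaining topic weights summing to less than 1. A document with zero weight on every live topic would cause a division by zero here, so real data may need a guard for that case.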

bertomartin · 2021-12-02T20:04:06Z

Sweet! I figured out a hacky way but this looks better. Thank you!

bab2min added the enhancement New feature or request label Nov 18, 2021

bab2min added a commit that referenced this issue Jul 17, 2022

implemented HDPModel.purge_dead_topics (#152)

b38437f

bab2min mentioned this issue Jul 17, 2022

Dev 0.12.3 #176

Merged

bab2min closed this as completed Jan 22, 2023
