Removing a topic from a HDPModel #152

Closed
bertomartin opened this issue Nov 17, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@bertomartin

bertomartin commented Nov 17, 2021

Hi, I have an HDP model and I was wondering if there's an easy way to remove a topic from it. For instance, it's easy to check whether a topic is "live" or "dead", but can you update the model to exclude the dead topics and then re-save the model artifact? I guess this would also involve removing the tomotopy documents associated with the dead topics.

bab2min added the enhancement (New feature or request) label Nov 18, 2021
@bab2min
Owner

bab2min commented Nov 18, 2021

Hi @bertomartin
As you know, tomotopy currently has no feature for removing dead topics from HDP models.
This is because dead and live topics can swap during training, so removing them mid-training would cause frequent reallocations and slow down the whole training procedure.
But removing dead topics after training has finished seems like a pretty reasonable request. I'll try to implement it in the next update.

@bab2min
Owner

bab2min commented Nov 18, 2021

Blueprint of the purge_dead_topics method of tomotopy.HDPModel:

import tomotopy as tp

model = tp.HDPModel(...)
...
model.train(...)

# model may have a lot of dead topics at this point, e.g.
#  0: live topic
#  1: live topic
#  2: dead topic
#  3: live topic
#  4: dead topic
#  5: dead topic

# purge all dead topics and relocate live topics.
relocate_result = model.purge_dead_topics()

# `relocate_result` is an array where `relocate_result[i]` holds the new topic id for old topic `i`, or -1 if old topic `i` was purged.
# e.g. [0, 1, -1, 2, -1, -1]

assert model.k == model.live_k
# at this point, `model.k` should be equal to `model.live_k`, e.g. model.k == 3, model.live_k == 3
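
The relocation array described in the blueprint can be illustrated without tomotopy at all. The following is only a sketch of the mapping idea, not the library's implementation: `build_relocation` and `remap_topic_ids` are hypothetical helper names, and the liveness flags are hard-coded to match the six-topic example above.

```python
from typing import List, Sequence


def build_relocation(is_live: Sequence[bool]) -> List[int]:
    """Map each old topic id to a new contiguous id, or -1 if the topic is dead."""
    relocation = []
    next_id = 0
    for live in is_live:
        if live:
            relocation.append(next_id)
            next_id += 1
        else:
            relocation.append(-1)
    return relocation


def remap_topic_ids(old_ids: Sequence[int], relocation: Sequence[int]) -> List[int]:
    """Translate old topic ids to new ones, dropping ids of purged topics."""
    return [relocation[k] for k in old_ids if relocation[k] != -1]


# Liveness flags matching the blueprint example (topics 2, 4 and 5 are dead).
is_live = [True, True, False, True, False, False]
relocation = build_relocation(is_live)
print(relocation)  # [0, 1, -1, 2, -1, -1]

# Old ids stored elsewhere (e.g. in saved results) can be translated the same way.
print(remap_topic_ids([3, 2, 0], relocation))  # [2, 0]
```

Any topic ids saved before the purge (in reports, caches, and so on) would need this kind of translation afterwards, which is why returning the mapping is useful.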

@bertomartin
Author

@bab2min thanks for the response. Yes, I meant purging them after the model has been built (training is already completed). Your blueprint makes sense to me. What I'm really after is a contiguous set of clean topics, so that I can do topic similarity without querying a 'dead' topic. The same goes for output in pyLDAvis: I don't want to see the dead topics, since they don't really add anything.

@bertomartin
Author

Thank you! In the meantime I was wondering if I could somehow filter out these topics when I build the pyLDAvis display. The plan is to construct the inputs as below:

import numpy as np

# mdl is a trained HDPModel
topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

The problem is that the docs are not tied to a topic index k, or at least I don't see how to relate them. Ideally I would only want docs that occur in live topics, so that I can get this to work.

@bab2min
Owner

bab2min commented Dec 2, 2021

@bertomartin
You can filter out dead topics using numpy indexing like:

import numpy as np

# mdl is a trained HDPModel
live_topics = [k for k in range(mdl.k) if mdl.is_live_topic(k)] # topics you want to visualize

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists[live_topics] # select only `live_topics`
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)

doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists = doc_topic_dists[:, live_topics] # select only `live_topics`
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq
...

I uploaded a new example that combines pyLDAvis and HDPModel:
https://github.com/bab2min/tomotopy/blob/main/examples/hdp_visualization.py
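
The select-then-renormalize step above can be tried on small synthetic arrays, with no trained model required. This is a minimal sketch under stated assumptions: `select_live_topics` is a hypothetical helper name, the numbers are made up, and leaving a document row as all zeros when it has no mass on any live topic is one possible choice rather than something the thread prescribes.

```python
import numpy as np


def select_live_topics(topic_term, doc_topic, live_topics):
    """Keep only the live topics and renormalize every row to sum to 1."""
    topic_term = np.asarray(topic_term, dtype=float)[live_topics]
    topic_term = topic_term / topic_term.sum(axis=1, keepdims=True)

    doc_topic = np.asarray(doc_topic, dtype=float)[:, live_topics]
    row_sums = doc_topic.sum(axis=1, keepdims=True)
    # Assumption: rows with zero mass on live topics stay all-zero
    # instead of triggering a division by zero.
    doc_topic = np.divide(
        doc_topic, row_sums, out=np.zeros_like(doc_topic), where=row_sums > 0
    )
    return topic_term, doc_topic


# Synthetic data: 3 topics over 4 words, 2 documents; topic 1 plays the dead topic.
topic_term = [[0.1, 0.2, 0.3, 0.4],
              [0.25, 0.25, 0.25, 0.25],
              [0.4, 0.3, 0.2, 0.1]]
doc_topic = [[0.6, 0.1, 0.3],
             [0.2, 0.0, 0.8]]
live_topics = [0, 2]

tt, dt = select_live_topics(topic_term, doc_topic, live_topics)
print(tt.shape, dt.shape)  # (2, 4) (2, 2)
print(dt.sum(axis=1))
```

Because the same `live_topics` list indexes the rows of the topic-term matrix and the columns of the document-topic matrix, the two outputs stay aligned topic by topic.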

@bertomartin
Author

Sweet! I figured out a hacky way but this looks better. Thank you!
