Adding document metadata into the DMR-Model #107

Closed
hhagedorn opened this issue Mar 29, 2021 · 9 comments
Labels
enhancement New feature or request

@hhagedorn

hhagedorn commented Mar 29, 2021

Hi everybody,
in a current project I am using various models from this great package. One model that seemed particularly interesting is the DMR topic model, but when I tried to use it I failed to include the documents' metadata properly.

From my understanding of the paper the model is based on, each document can be linked to an arbitrary number of metadata labels from a given set, e.g. a list of authors. For example, page two of the Mimno and McCallum paper says:

For each document d, let x_d be a vector containing features that encode metadata values. For example, if the observed features are indicators for the presence of authors, then x_d would include a 1 in the positions for each author listed on document d, and a 0 otherwise.
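
To make the quoted encoding concrete, here is a minimal sketch in plain Python of how such an indicator vector x_d could be built. The helper name `multi_hot` and the author vocabulary are hypothetical and chosen only for illustration; nothing here is taken from the paper's code or from tomotopy.

```python
def multi_hot(doc_labels, vocabulary):
    """Return x_d: 1 where a vocabulary label is attached to the document, else 0."""
    present = set(doc_labels)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"labels not in vocabulary: {sorted(unknown)}")
    return [1 if label in present else 0 for label in vocabulary]


# Hypothetical author vocabulary; positions in x_d follow this order.
authors = ["Blei", "McCallum", "Mimno", "Ng"]

x_single = multi_hot(["Mimno"], authors)             # one author: a one-hot vector
x_multi = multi_hot(["Mimno", "McCallum"], authors)  # two authors: a multi-hot vector

print(x_single)  # [0, 0, 1, 0]
print(x_multi)   # [0, 1, 1, 0]
```

A single-author document is just the special case in which exactly one position is set.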

Looking at the signature of the DMRModel's add_doc() method, it seems possible to add only a single metadata string per document. Am I right that only single-label documents can be included in this version of the DMRModel, i.e. that only single-author documents can be considered and no lists or binary vectors of metadata values can be passed in?

Thank you in advance!

@bab2min bab2min added the enhancement New feature or request label Mar 30, 2021
@bab2min
Owner

bab2min commented Mar 30, 2021

Hello @hhagedorn
As you said, the original version of DMR introduced in the paper can accept multi-hot vectors as metadata. But the DMRModel in the current version of tomotopy is optimized for one-hot vectors only. In other words, it can accept only a single label.
(This is because I expected it would mainly receive one-hot input, so I optimized it for only that case during initial development.)

Fortunately, it doesn't seem difficult to extend the DMRModel to accept multi-hot input. So I will update it for multi-label metadata in the next update.
Thank you for your suggestion.
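
As a rough illustration of the difference between one-hot and multi-hot input, the paper defines the document-specific Dirichlet parameter for topic t as alpha_{d,t} = exp(x_d^T lambda_t). The sketch below evaluates that formula in plain Python. It is not tomotopy's implementation, and the weights in `lambdas` are made-up numbers.

```python
import math


def dmr_alpha(x_d, lambdas):
    """Return alpha_{d,t} = exp(x_d . lambda_t) for every topic t."""
    return [
        math.exp(sum(x * w for x, w in zip(x_d, lambda_t)))
        for lambda_t in lambdas
    ]


# Hypothetical weights: 2 topics (rows) x 3 metadata features (columns).
lambdas = [
    [0.0, 1.0, -1.0],
    [0.5, 0.0, 2.0],
]

alpha_one_hot = dmr_alpha([0, 1, 0], lambdas)    # only feature 1 is active
alpha_multi_hot = dmr_alpha([0, 1, 1], lambdas)  # features 1 and 2 are active
```

With a one-hot x_d the dot product picks out a single weight per topic, while a multi-hot x_d sums the weights of all active features before exponentiating.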

@hhagedorn
Author

Hello @bab2min,
thank you for the quick response. I am glad to hear that. I think the DMRModel is quite promising in a number of cases with mixed/multi-label data, as it is a less restrictive approach than, for example, the PLDAModel.

May I ask whether you already have a rough plan for when you might release the next update? Will it be within a couple of weeks, or more like a couple of months? I'm asking because in the first case I might wait for it before I conclude my current work.

Thank you very much anyways!

@bab2min
Owner

bab2min commented Mar 30, 2021

@hhagedorn
In fact, this development is not my main job, so it is difficult to confirm the schedule.
I think it will be completed in April. If you need this feature quickly, I can provide a test version to you in advance rather than a release version. The test version is expected to be available within 2 weeks.

@hhagedorn
Author

Hi,
that sounds great, thank you. I have time to wait until the end of April, so there is no need to rush a test version. Nevertheless, thank you very much for the offer. This not being your main job makes all your effort and support even more remarkable.

bab2min added a commit that referenced this issue Apr 25, 2021
improved DMR & GDMR (#107)
improved GDMR's performance
fixed wrong topic_id for excluded words
copy() method for topic models
typed Python exceptions & warnings
refactored code based on c++14
@bab2min
Owner

bab2min commented Apr 26, 2021

@hhagedorn, sorry that the update is a little later than scheduled.

A test version of DMRModel with multiple metadata labels has just been uploaded.
You can install it like this:

$ pip install -U --index-url https://test.pypi.org/simple/ tomotopy==0.12.0

Multiple labels are supported through the multi_metadata argument of the add_doc and make_doc methods.
The single-label metadata argument remains for backwards compatibility.

A new method, get_topic_prior(), was also added. It estimates the topic prior from the given metadata labels.

See more details in https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py
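
A minimal sketch of how the multi_metadata argument and get_topic_prior() described above might be used together. It assumes tomotopy 0.12.0 or later is installed and that get_topic_prior() accepts the same multi_metadata keyword as add_doc(); the toy documents, the label names and the function name `topic_prior_demo` are invented for illustration. The linked example file remains the authoritative reference.

```python
def topic_prior_demo(k=2, iterations=20, seed=0):
    """Train a tiny DMR model on multi-label documents and return one topic prior."""
    import tomotopy as tp  # assumed: tomotopy >= 0.12.0

    # Toy corpus: (words, metadata labels). Both are made up for this sketch.
    docs = [
        (["topic", "model", "prior", "metadata"], ["author_a", "author_b"]),
        (["neural", "network", "training", "gradient"], ["author_b"]),
        (["topic", "prior", "gradient", "model"], ["author_a", "author_c"]),
        (["network", "metadata", "training", "topic"], ["author_c"]),
    ]

    mdl = tp.DMRModel(k=k, seed=seed)
    for words, labels in docs:
        # Several labels per document go through multi_metadata.
        mdl.add_doc(words, multi_metadata=labels)

    mdl.train(iterations)

    # Assumed keyword: prior over the k topics for a combination of labels.
    return mdl.get_topic_prior(multi_metadata=["author_a", "author_b"])
```

Calling `topic_prior_demo()` should return one value per topic. A corpus this small only demonstrates the call sequence, not meaningful estimates.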

@hhagedorn
Author

No worries, thank you very much for all of your efforts.

It looks great and I will try it out within the next couple of days!

@hhagedorn
Author

Hi,

in the meantime I have trained all the models, and the new functionality in DMR works great, thank you!

I just have one small remark, and I am not sure it is even worth mentioning. When I want to inspect priors for given metadata in DMR, everything works exactly as described in the documentation. However, the method is not "known" to the Python implementation, i.e. my IDE tells me it does not exist.

@bab2min
Owner

bab2min commented May 28, 2021

@hhagedorn, I'm glad the new functionality works well.
For IDE issues, can you tell me which IDE and linter you are using?

@hhagedorn
Author

hhagedorn commented May 30, 2021

I am using PyCharm (Community Edition), but I am not sure whether the problem is linked to the IDE. When I inspect the DMRModel class, all the methods and parameters are there except the newly added ones, i.e. there is no get_topic_prior() or multi_metadata_dict. Since they are not shown but I can call them nevertheless, maybe it is just an issue with the version upgrade on my machine? (I upgraded via pip, by the way.)
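
One way to narrow this down is to ask the running interpreter directly which version is installed and which attributes exist, independently of what the IDE displays. The sketch below uses only the standard library. The helper names are hypothetical, and `StandIn` is a placeholder class so the snippet runs without tomotopy; the thread does not establish the actual cause.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def missing_at_runtime(obj, names):
    """Return the subset of names that obj does not expose at runtime."""
    return [name for name in names if not hasattr(obj, name)]


class StandIn:
    """Placeholder for a model class; replace with tomotopy.DMRModel in practice."""

    def get_topic_prior(self):
        return []


expected = ["get_topic_prior", "multi_metadata_dict"]
print(installed_version("tomotopy"))          # version seen by this interpreter, or None
print(missing_at_runtime(StandIn, expected))  # ['multi_metadata_dict']
```

If the interpreter selected in the IDE reports the expected version and an empty list for `tomotopy.DMRModel`, the installation itself is probably fine and the IDE's view of the package may simply be out of date.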

@bab2min bab2min closed this as completed Oct 26, 2021