
Adding similarity for two out of training sentences to doc2vec #707

Merged
merged 10 commits into from
May 27, 2016

Conversation

ellolo
Contributor

@ellolo ellolo commented May 25, 2016

I added a method to DocvecsArray that computes the similarity between two sentences that are not in the training set. I'm not sure whether the method would be better placed under Doc2Vec.
Please bear with me, this is my first contribution to an open source project.
I included a simple test.

@tmylk
Contributor

tmylk commented May 25, 2016

Thanks. Looks good to me. Pinging @gojomo for review.

@ellolo
Contributor Author

ellolo commented May 25, 2016

@gojomo could you please review this feature?

@ellolo
Contributor Author

ellolo commented May 25, 2016

@tmylk I got some errors in the tests, but they seem unrelated to my changes.

@gojomo
Collaborator

gojomo commented May 25, 2016

I can see people needing to do this, but as it's just a one-liner, not sure it needs an API convenience method.

As an API method name, the "oot" abbreviation is somewhat cryptic. (I see what it means from this PR title, but not even the method comment uses the same language. It's generally good to avoid abbreviations unless they are very, very pervasively used.)

Since the heart of the method is inference, and our existing inference steps/alpha defaults are somewhat wild-guesses that many people change, allowing the specification of non-default steps/alpha to be used for both inferences would make sense.
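A minimal sketch of what such a convenience method could look like, with the inference parameters passed through so callers can override the defaults. The name `doc_similarity`, the default values, and the `FakeModel` stand-in are illustrative assumptions here, not gensim's actual API:

```python
import numpy as np

def doc_similarity(model, doc_words1, doc_words2,
                   alpha=0.1, min_alpha=0.0001, steps=5):
    """Infer vectors for two unseen documents and return their cosine
    similarity, forwarding the inference parameters to the model."""
    d1 = model.infer_vector(doc_words1, alpha=alpha,
                            min_alpha=min_alpha, steps=steps)
    d2 = model.infer_vector(doc_words2, alpha=alpha,
                            min_alpha=min_alpha, steps=steps)
    return float(np.dot(d1, d2) /
                 (np.linalg.norm(d1) * np.linalg.norm(d2)))

# Stand-in model for demonstration only: its "inference" is deterministic,
# unlike a real Doc2Vec model, whose infer_vector is stochastic.
class FakeModel:
    def infer_vector(self, words, alpha, min_alpha, steps):
        rng = np.random.RandomState(len(words) + steps)
        return rng.rand(20)

model = FakeModel()
sim = doc_similarity(model, ['a', 'b'], ['a', 'b'])
```

With the deterministic stand-in, identical token lists yield identical vectors and thus a similarity of exactly 1.0; with a real model that would not be guaranteed.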


model = doc2vec.Doc2Vec(min_count=1)
model.build_vocab(corpus)
self.assertEqual(int(model.docvecs.oot_similarity(model, '', '')), 1)
Collaborator

This is not testing what you think it is. Repeated calls to inference with the exact same tokens don't necessarily result in identical vectors, because of randomness in the Doc2Vec algorithm. But it works here because these empty-strings result in no inference at all, which means each vector stays its initial untrained randomized value (which we have tried to make start in the same place for the same inputs). With actual tokens, you might not get a 1.0 similarity.
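A small numpy illustration of the point above (the noise here is an arbitrary stand-in for Doc2Vec's stochastic inference, not the real algorithm): two inferences of the same document generally produce different vectors, so a robust test should assert high similarity rather than exact equality.

```python
import numpy as np

rng = np.random.default_rng(42)
true_vec = rng.normal(size=50)   # stand-in for the document's "true" vector

# two independent "inferences" of the same tokens: same target, different noise
v1 = true_vec + rng.normal(scale=0.05, size=50)
v2 = true_vec + rng.normal(scale=0.05, size=50)

cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

assert not np.allclose(v1, v2)   # the vectors are not identical...
assert cos > 0.9                 # ...but their cosine similarity is high
```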

@ellolo
Contributor Author

ellolo commented May 25, 2016

@gojomo I incorporated your comments. I hope the method name is clear enough now. As for the test, I had misinterpreted the infer_vector method; it should be fine now.

@tmylk
Contributor

tmylk commented May 26, 2016

Maybe similarity_of_unseen_docs is better?

@ellolo
Contributor Author

ellolo commented May 27, 2016

@tmylk I changed the name according to your suggestion.


Document should be a list of (word) tokens.
"""
d1 = model.infer_vector(doc_words1, alpha, min_alpha, steps)
Owner
Use named parameters (rather than positional).

It's safer in case the argument order inside infer_vector changes in the future.
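A toy illustration of why keyword arguments are safer (these functions are hypothetical stand-ins, not gensim's real infer_vector): if a later version reorders the parameters, positional calls silently misassign values, while keyword calls keep working.

```python
# hypothetical "old" signature
def infer_vector_old(doc_words, alpha, min_alpha, steps):
    return {'alpha': alpha, 'min_alpha': min_alpha, 'steps': steps}

# hypothetical "new" signature with the argument order changed
def infer_vector_new(doc_words, steps, alpha, min_alpha):
    return {'alpha': alpha, 'min_alpha': min_alpha, 'steps': steps}

positional_old = infer_vector_old(['w'], 0.1, 0.0001, 5)
positional_new = infer_vector_new(['w'], 0.1, 0.0001, 5)   # silently wrong now
keyword_new = infer_vector_new(['w'], alpha=0.1, min_alpha=0.0001, steps=5)

assert positional_old == {'alpha': 0.1, 'min_alpha': 0.0001, 'steps': 5}
assert positional_new != positional_old   # values landed in the wrong slots
assert keyword_new == positional_old      # keywords survive the reorder
```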

@tmylk tmylk merged commit b171a2d into piskvorky:develop May 27, 2016
@tmylk
Contributor

tmylk commented May 27, 2016

Thanks for the PR!

@gojomo
Collaborator

gojomo commented May 28, 2016

Each name suggestion has been better, but I'd make it even more literally descriptive, perhaps: inferred_similarity. (There's no need to qualify as 'unseen'.)
