Adding similarity for two out of training sentences to doc2vec #707
Thanks. Looks good to me. Pinging @gojomo for review.
@gojomo could you please review this feature?
@tmylk I got some errors in the tests, but they seem unrelated to my changes.
I can see people needing to do this, but as it's just a one-liner, I'm not sure it needs an API convenience method. As an API method name, the "oot" abbreviation is somewhat cryptic. (I see what it means from this PR title, but not even the method comment uses the same language. It's generally good to avoid abbreviations unless they are very, very pervasively used.) Since the heart of the method is inference, and our existing inference steps/alpha defaults are somewhat wild guesses that many people change, it would make sense to allow non-default steps/alpha to be specified and used for both inferences.
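For reference, the "one-liner" with the suggested steps/alpha pass-through might look roughly like the sketch below. This is not the PR's code: `inferred_similarity`, `cosine_similarity`, and the `FixedVectorModel` stand-in are hypothetical names, the default values are placeholders, and the keyword names `alpha`, `min_alpha`, and `steps` simply follow the `infer_vector` call quoted in this PR (other gensim releases may name them differently).

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def inferred_similarity(model, doc_words1, doc_words2,
                        alpha=0.1, min_alpha=0.0001, steps=5):
    """Infer a vector for each token list and compare the two.

    Hypothetical helper: the same inference settings are forwarded to
    both infer_vector calls so the two vectors are comparable.
    """
    d1 = model.infer_vector(doc_words1, alpha=alpha, min_alpha=min_alpha, steps=steps)
    d2 = model.infer_vector(doc_words2, alpha=alpha, min_alpha=min_alpha, steps=steps)
    return cosine_similarity(d1, d2)


class FixedVectorModel:
    """Stand-in for a trained model (assumption: only infer_vector is needed)."""

    def __init__(self, vectors):
        self.vectors = vectors
        self.last_kwargs = None

    def infer_vector(self, doc_words, alpha, min_alpha, steps):
        # Record the settings so the pass-through can be checked.
        self.last_kwargs = {"alpha": alpha, "min_alpha": min_alpha, "steps": steps}
        return self.vectors[tuple(doc_words)]


model = FixedVectorModel({("a",): [3.0, 4.0], ("b",): [-4.0, 3.0]})
print(inferred_similarity(model, ["a"], ["b"]))            # orthogonal vectors
print(inferred_similarity(model, ["a"], ["a"], steps=20))  # identical vectors
```

Because the helper only relies on an `infer_vector` method, the stand-in could be swapped for a trained model object, provided its `infer_vector` accepts the same keyword names.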
model = doc2vec.Doc2Vec(min_count=1)
model.build_vocab(corpus)
self.assertEqual(int(model.docvecs.oot_similarity(model, '', '')), 1)
This is not testing what you think it is. Repeated calls to inference with the exact same tokens don't necessarily result in identical vectors, because of randomness in the Doc2Vec algorithm. But it works here because these empty-strings result in no inference at all, which means each vector stays its initial untrained randomized value (which we have tried to make start in the same place for the same inputs). With actual tokens, you might not get a 1.0 similarity.
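The effect described in this comment can be imitated with a toy model. The sketch below is not Doc2Vec: `toy_infer` is a hypothetical stand-in that starts from an initial vector determined by the tokens and then adds random noise once per token per step, which is only an assumption made to mimic "randomness during inference". With no tokens there is no update, so both calls return the same initial vector; with real tokens the two results drift apart.

```python
import math
import random

DIM = 16
# Shared noise source; it advances on every call, so repeated
# "inference" over the same tokens sees different random draws.
_noise = random.Random(0)


def toy_infer(tokens, steps=5):
    """Toy stand-in for inference (NOT the real algorithm)."""
    # Same tokens -> same starting vector.
    init = random.Random(" ".join(tokens))
    vec = [init.uniform(-0.5, 0.5) for _ in range(DIM)]
    # One noisy update per token per step; nothing happens for empty input.
    for _ in range(steps):
        for _token in tokens:
            vec = [x + _noise.gauss(0.0, 0.1) for x in vec]
    return vec


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


same_empty = cosine(toy_infer([]), toy_infer([]))
same_tokens = cosine(toy_infer(["rome", "italy"]), toy_infer(["rome", "italy"]))
print(same_empty, same_tokens)
```

In this toy, only the empty-input case yields a similarity of (approximately) 1. Note also that wrapping a similarity in `int()` truncates toward zero, so a value such as 0.9999 would become 0; a tolerance-based comparison is less brittle.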
@gojomo I incorporated your comments. I hope the name of the method is clear enough. As for the test, I had misinterpreted the infer_vector method; it should be fine now.
Maybe
@tmylk I changed the name according to your suggestion.
Document should be a list of (word) tokens.
"""
d1 = model.infer_vector(doc_words1, alpha, min_alpha, steps)
Use named parameters (rather than positional). It's safer in case the argument order of infer_vector changes in the future.
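A small, self-contained illustration of that risk follows. The two functions are hypothetical and only stand for "before" and "after" a parameter reordering; they are not gensim code, and their default values are placeholders.

```python
def infer_original(doc_words, alpha=0.1, min_alpha=0.0001, steps=5):
    """Hypothetical signature before a refactor."""
    return {"alpha": alpha, "min_alpha": min_alpha, "steps": steps}


def infer_reordered(doc_words, steps=5, alpha=0.1, min_alpha=0.0001):
    """Hypothetical signature after a refactor moved `steps` forward."""
    return {"alpha": alpha, "min_alpha": min_alpha, "steps": steps}


doc = ["some", "tokens"]

# Positional call written against the original order: it still runs after
# the reordering, but every value silently lands in the wrong parameter.
positional = infer_reordered(doc, 0.05, 0.001, 20)

# Keyword call: unaffected by the reordering.
named = infer_reordered(doc, alpha=0.05, min_alpha=0.001, steps=20)

print(positional)
print(named)
```

The positional call raises no error, which is what makes the bug easy to miss; the keyword call gives the same result against either signature.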
Thanks for the PR!
Each name suggestion has been better, but I'd make it even more literally descriptive, perhaps:
I added a method to DocvecsArray that computes the similarity between two sentences that were not part of the training data. I'm not sure whether the method would be better placed under Doc2Vec.
Please bear with me, this is my first contribution to an open source project.
I included a simple test.