
Adding similarity for two out of training sentences to doc2vec #707

Merged
merged 10 commits into from
May 27, 2016

Conversation

ellolo
Contributor

@ellolo ellolo commented May 25, 2016

I added a method to DocvecsArray that computes the similarity between two sentences that are not in the training set. I'm not sure whether the method would be better placed under Doc2Vec.
Please bear with me, this is my first contribution to an open source project.
I included a simple test.

@tmylk
Contributor

tmylk commented May 25, 2016

Thanks. Looks good to me. Pinging @gojomo for review.

@ellolo
Contributor Author

ellolo commented May 25, 2016

@gojomo could you please review this feature?

@ellolo
Contributor Author

ellolo commented May 25, 2016

@tmylk I got some errors in the tests, but they seem unrelated to my changes.

@gojomo
Collaborator

gojomo commented May 25, 2016

I can see people needing to do this, but as it's just a one-liner, not sure it needs an API convenience method.

As an API method name, the "oot" abbreviation is somewhat cryptic. (I see what it means from this PR title, but not even the method comment uses the same language. It's generally good to avoid abbreviations unless they are very, very pervasively used.)

Since the heart of the method is inference, and our existing inference steps/alpha defaults are somewhat wild-guesses that many people change, allowing the specification of non-default steps/alpha to be used for both inferences would make sense.
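A minimal sketch of what such a convenience method could look like, with the inference parameters passed through so callers can override the defaults. The name `doc_similarity`, the default values, and the `FakeModel` stand-in are illustrative assumptions here, not gensim's actual API:

```python
import numpy as np

def doc_similarity(model, doc_words1, doc_words2,
                   alpha=0.1, min_alpha=0.0001, steps=5):
    """Infer vectors for two unseen documents and return their cosine
    similarity, forwarding the inference parameters to the model."""
    d1 = model.infer_vector(doc_words1, alpha=alpha,
                            min_alpha=min_alpha, steps=steps)
    d2 = model.infer_vector(doc_words2, alpha=alpha,
                            min_alpha=min_alpha, steps=steps)
    return float(np.dot(d1, d2) /
                 (np.linalg.norm(d1) * np.linalg.norm(d2)))

# Stand-in model for demonstration only: its "inference" is deterministic,
# unlike a real Doc2Vec model, whose infer_vector is stochastic.
class FakeModel:
    def infer_vector(self, words, alpha, min_alpha, steps):
        rng = np.random.RandomState(len(words) + steps)
        return rng.rand(20)

model = FakeModel()
sim = doc_similarity(model, ['a', 'b'], ['a', 'b'])
```

With the deterministic stand-in, identical token lists yield identical vectors and thus a similarity of exactly 1.0; with a real model that would not be guaranteed.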


model = doc2vec.Doc2Vec(min_count=1)
model.build_vocab(corpus)
self.assertEqual(int(model.docvecs.oot_similarity(model, '', '')), 1)
Collaborator

This is not testing what you think it is. Repeated calls to inference with the exact same tokens don't necessarily result in identical vectors, because of randomness in the Doc2Vec algorithm. But it works here because these empty-strings result in no inference at all, which means each vector stays its initial untrained randomized value (which we have tried to make start in the same place for the same inputs). With actual tokens, you might not get a 1.0 similarity.
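A small numpy illustration of the point above (the noise here is an arbitrary stand-in for Doc2Vec's stochastic inference, not the real algorithm): two inferences of the same document generally produce different vectors, so a robust test should assert high similarity rather than exact equality.

```python
import numpy as np

rng = np.random.default_rng(42)
true_vec = rng.normal(size=50)   # stand-in for the document's "true" vector

# two independent "inferences" of the same tokens: same target, different noise
v1 = true_vec + rng.normal(scale=0.05, size=50)
v2 = true_vec + rng.normal(scale=0.05, size=50)

cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

assert not np.allclose(v1, v2)   # the vectors are not identical...
assert cos > 0.9                 # ...but their cosine similarity is high
```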

@ellolo
Contributor Author

ellolo commented May 25, 2016

@gojomo I incorporated your comments. I hope the method name is clear enough now. As for the test, I had misinterpreted the infer_vector method; it should be fine now.

@tmylk
Contributor

tmylk commented May 26, 2016

Maybe similarity_of_unseen_docs is better?

@ellolo
Contributor Author

ellolo commented May 27, 2016

@tmylk I changed the name according to your suggestion.


Document should be a list of (word) tokens.
"""
d1 = model.infer_vector(doc_words1, alpha, min_alpha, steps)
Owner
Use named parameters (rather than positional).

It's safer in case the argument order inside infer_vector changes in the future.
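A toy illustration of why keyword arguments are safer (these functions are hypothetical stand-ins, not gensim's real infer_vector): if a later version reorders the parameters, positional calls silently misassign values, while keyword calls keep working.

```python
# hypothetical "old" signature
def infer_vector_old(doc_words, alpha, min_alpha, steps):
    return {'alpha': alpha, 'min_alpha': min_alpha, 'steps': steps}

# hypothetical "new" signature with the argument order changed
def infer_vector_new(doc_words, steps, alpha, min_alpha):
    return {'alpha': alpha, 'min_alpha': min_alpha, 'steps': steps}

positional_old = infer_vector_old(['w'], 0.1, 0.0001, 5)
positional_new = infer_vector_new(['w'], 0.1, 0.0001, 5)   # silently wrong now
keyword_new = infer_vector_new(['w'], alpha=0.1, min_alpha=0.0001, steps=5)

assert positional_old == {'alpha': 0.1, 'min_alpha': 0.0001, 'steps': 5}
assert positional_new != positional_old   # values landed in the wrong slots
assert keyword_new == positional_old      # keywords survive the reorder
```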

@tmylk tmylk merged commit b171a2d into piskvorky:develop May 27, 2016
@tmylk
Contributor

tmylk commented May 27, 2016

Thanks for the PR!

@gojomo
Collaborator

gojomo commented May 28, 2016

Each name suggestion has been better, but I'd make it even more literally descriptive, perhaps: inferred_similarity. (There's no need to qualify as 'unseen'.)
