NamedVectors refactor for word2vec #819

Closed
wants to merge 4 commits into from
Conversation

droudy (Contributor) commented Aug 9, 2016

Addresses #549. Refactored syn0, syn0norm, vocab, and index2word into their own class. The NamedVectors instance of a model can be retrieved as follows:


model = Word2Vec(example_corpus)
retrieved_vecs = model.named_vecs 

@gojomo please review

gojomo (Collaborator) commented Aug 9, 2016

Yes, this is the right direction!

Ultimately the reason for the pain-of-refactoring is to move all existing and future related functionality (save/load to other formats, similarity calcs, indexing, translations/projections, etc.) to the helper class, and make it usable for other purposes and algorithms than just the current Word2Vec/Doc2Vec. So eventually this class would move to its own file, and when the related methods also move, they wouldn't be accessing vectors via a self.named_vectors property, but just via self.

Thoughts on names:

  • After more thought, I'm now leaning toward 'KeyedVectors' as a better name for this functionality. 'Names' are a bit more of a loaded concept than look-up keys.
  • Removed from Word2Vec, there's no reason to call the array-of-vectors property syn0 – a legacy of the internal neural-network naming conventions. Maybe just vectors? (It'd still be appropriate to mention that this is equivalent to syn0 in the original word2vec.c code, or even use syn0 someplace where the word2vec-algorithm code refers to the same array.)
  • Within the Word2Vec/Doc2Vec models, a property name based on role rather than type would be more readable. So rather than named_vectors, perhaps word_vectors or even wv for compactness (see the sketch below).
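
If it helps make the naming concrete, here is a minimal sketch under the names suggested above; it's a sketch only, not the final API, and every attribute name here is an assumption:

class KeyedVectors(object):
    # hypothetical standalone container for vectors looked up by string key
    def __init__(self):
        self.vectors = None       # 2D array of vectors; what word2vec.c calls syn0
        self.vectors_norm = None  # unit-normalized copy (the old syn0norm)
        self.index2word = []      # position -> key
        self.key2index = {}       # key -> position

    def __getitem__(self, key):
        # look up the single vector stored for `key`
        return self.vectors[self.key2index[key]]

    def __contains__(self, key):
        return key in self.key2index

# inside Word2Vec/Doc2Vec, the instance would then sit under a role-based name, e.g. self.wv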

Overall, this could be a pretty big and disruptive set of changes. Among other things, it will require extra code to maintain the capability to load-and-convert older models. It may even make sense to build the refactored classes alongside the old, still-operative classes, to make both the verification-of-identical-behavior and the load-and-convert code easier. That is: do it all as Word2Vec2, Doc2Vec2 first, then when parity/compatibility with the old classes is verified, do the swap.

droudy (Contributor, Author) commented Aug 16, 2016

@gojomo If the save/load functions, similarity calcs, etc. were moved into the NamedVectors/KeyedVectors class, calls such as trained_model.most_similar() wouldn't be viable anymore; they would have to be made as trained_model.named_vectors.most_similar(). Would it make sense to move the save/load functions, similarity calcs, etc. into their own class, w2v_no_training or something along those lines, and have the existing Word2Vec class inherit from it, with both of them accessing vectors and vocab from the NamedVectors/KeyedVectors class, so that calls like trained_model.most_similar() remain viable?
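
For illustration only, the arrangement being asked about might look roughly like this (W2VNoTraining and the attribute names are placeholders, not a settled design):

class NamedVectors(object):
    # stub for the container from this PR; would hold vectors, vocab, index2word
    def most_similar(self, word, topn=10):
        raise NotImplementedError

class W2VNoTraining(object):
    # hypothetical home for post-training operations (save/load, similarity calcs, ...)
    def most_similar(self, word, topn=10):
        # forward to the shared vector container
        return self.named_vectors.most_similar(word, topn=topn)

class Word2Vec(W2VNoTraining):
    def __init__(self, sentences=None):
        self.named_vectors = NamedVectors()
        # ... training would populate self.named_vectors ...

# trained_model.most_similar('warm') keeps working because the inherited method
# forwards to trained_model.named_vectors internally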

gojomo (Collaborator) commented Aug 16, 2016

Except for backward-compatibility, I don't think it matters whether trained_model.most_similar() works. It's the set-of-vectors that's interesting to run various kinds of downstream methods on, and there's no need to have those methods linked to the one Word2Vec algorithm when they work on vector-sets from many algorithms. For example:

my_model = Word2Vec(my_corpus, ...)
print(my_model.wv.most_similar('warm'))

Or even, when you mainly care about working with the output vectors:

my_vectors = my_model.wv
print(my_vectors.most_similar('hot'))
my_vectors.save_word2vec_format('words0001.vec', binary=False)
googlenews_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz')

...and then some hypothetical new functionality...

# some measure of whether same-words-have-same-top-N-neighbors
my_vectors.neighbors_rank_correlation(googlenews_vectors)

# translating/extending via method of https://github.com/RaRe-Technologies/gensim/wiki/Word2Vec-&-Doc2Vec-Wishlist#implement-translation-matrix-of-exploiting-similarities-among-languages-for-machine-translation
tm = googlenews_vectors.learn_translation_matrix_to(my_vectors)
my_vectors.import_unique(googlenews_vectors, tm)

jayantj (Contributor) commented Aug 17, 2016

Sorry for jumping in so late - I was just wondering if it'd be a good idea to keep the properties and methods of NamedVectors completely independent of word2vec/doc2vec. Specifically:

  1. NamedVectors shouldn't be responsible for vocab, just an index2word and a word2index, since vocab seems specific to w2v/doc2v. Maybe have the word2vec/doc2vec class take care of vocab, and have NamedVectors simply use that to create index2word and word2index (rough sketch after this list)?
  2. Some naming (like vocab, index2word) could probably be made more generic - again, independent of w2v/doc2v.
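
A rough sketch of that split (all names, and the build_from_vocab helper, are illustrative rather than existing gensim code):

class NamedVectors(object):
    # generic container: only keys and vectors, no w2v/d2v-specific vocab objects
    def __init__(self):
        self.vectors = None
        self.index2word = []   # could become index2key for full generality
        self.word2index = {}

    def build_from_vocab(self, vocab):
        # derive the key <-> index mappings from a w2v/d2v-style vocab dict,
        # whose values are assumed to carry an .index attribute
        self.index2word = sorted(vocab, key=lambda word: vocab[word].index)
        self.word2index = {word: i for i, word in enumerate(self.index2word)}

class Word2Vec(object):
    def __init__(self):
        self.vocab = {}   # frequency counts, huffman codes etc. stay with the algorithm
        self.named_vectors = NamedVectors()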

@@ -1167,20 +1209,20 @@ def intersect_word2vec_format(self, fname, binary=False, encoding='utf8', unicod
word.append(ch)
word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
weights = fromstring(fin.read(binary_len), dtype=REAL)
- if word in self.vocab:
+ if word in self.named_vectors.vocab:
jayantj (Contributor) commented on this diff, Aug 17, 2016

Simply word in self.named_vectors would work fine too, since you've defined __contains__ on NamedVectors.
Same for lines 519, 1222, 1271, 1412

piskvorky (Owner) commented Aug 17, 2016

What is the difference between NamedVectors and a similarity index?

Just the fact that vectors have an additional string id = key (rather than being referenced by their ordinal position)?

If so (but I might be wrong and missing something), isn't it better to add string ids to vectors in our similarity interface, rather than duplicate the functionality under a different name?

See also #732.

gojomo (Collaborator) commented Aug 17, 2016

@jayantj - Yes, a word2index (or more generically key2index) facility is absolutely essential to the class. Whether the class takes on any of the other per-key metadata that Word2Vec needs and stores in vocab (like frequency counts or huffman-encoding details) is a thornier issue. (Maybe it's a useful extension of KeyedVectors with info that travels with the vectors; maybe it stays closer to the algorithm.)

@piskvorky - Do you mean like gensim's Similarity class? I can't quite see how that would drop into Word2Vec/Doc2Vec as the actual source of vectors during training, nor would I expect users to think of the set-of-vectors (whether trained or loaded) as a 'similarity index'. (Would you see a Similarity implementation offering a load_word2vec_format() method?) Also, as I've mentioned in other contexts, by the usual idioms, I'd expect bracket/__getitem__ access to look up a single vector, not perform a broad calculation and return a large list-of-similarities (as appears to be the case with current Similarity classes).
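
Roughly, the contrast being drawn (keyed_vectors and similarity_index are illustrative names, and the Similarity behavior is as I understand the current query idiom):

vec = keyed_vectors['coffee']           # expected KeyedVectors idiom: one key in, one vector back
sims = similarity_index[query_vector]   # Similarity idiom: one query in, a whole array of similarities back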

piskvorky (Owner) commented Aug 17, 2016

Yeah, I see a large overlap there. Both are meant to store vectors, both are meant to retrieve similar vectors when queried either by key or by another vector.

What the API should look like is another question. But I still don't see how they're different, conceptually. And we don't want to be maintaining two pieces of code with ±identical functionality, just named differently.

tmylk (Contributor) commented Aug 18, 2016

To finish something soon, let's split the refactoring into two parts.

Stage 1. Implement what is needed to have 3 word2vec training modes: Gensim, TensorFlow and FastText. Hopefully @droudy can finish it this week.

Stage 2. Integrate with the Similarity classes. Specifically, it would be good not to have separate word2vec- and doc2vec-specific lines in the Annoy similarity class. This can be done later.

gojomo (Collaborator) commented Aug 18, 2016

I can see the value of getting TensorFlow-trained or fastText-trained vectors into a common format, and loaded into gensim-operational objects.

I don't understand the goal of wrapping TensorFlow/fastText word2vec codebases in gensim as alternate training-modes. Folding slightly-different cores under the same interface can make it hard to use whatever's uniquely valuable about any of them, and increase the maintenance hassle and failure/confusion modes when there are slight differences in behavior.

Note that #549 was initially proposed to regroup and extend functionality, unrelated to TensorFlow or fastText. (To the extent that by separating training and post-training-operations, it could make loading external vectors cleaner, that's great. But working with externally-trained vectors wasn't the only planned benefit.)

There's certainly overlap with the most-similar/nearest-neighbors functionality of the Similarity classes. But, my hoped-for core-capabilities of KeyedVectors are primary access by key, direct use during Word2Vec/Doc2Vec training, and use as a shared-model with other vector sources. I see those as somewhat hard to mix with the existing Similarity interface/implementation choices. So, I'd prefer to just aim for consistent name/signature conventions, and reuse of utility code as appropriate, across separate classes for the time being. Unification of the underlying implementations could come later, if in practice they appear to have unnecessary duplication.

droudy (Contributor, Author) commented Aug 19, 2016

Opened a new PR addressing the same issue here: #833, because the new API is different and needed a different branch.

droudy closed this Aug 19, 2016