NamedVectors refactor for word2vec #819

Closed
wants to merge 4 commits into from
Conversation

droudy (Contributor) commented Aug 9, 2016

Addresses #549. Refactored syn0, syn0norm, vocab, and index2word into their own class. The NamedVectors instance of a model can be retrieved as follows:


model = Word2Vec(example_corpus)
retrieved_vecs = model.named_vecs 

@gojomo please review

gojomo (Collaborator) commented Aug 9, 2016

Yes, this is the right direction!

Ultimately the reason for the pain-of-refactoring is to move all existing and future related functionality (save/load to other formats, similarity calcs, indexing, translations/projections, etc.) to the helper class, and make it usable for other purposes and algorithms than just the current Word2Vec/Doc2Vec. So eventually this class would move to its own file, and when the related methods also move, they wouldn't be accessing vectors via a self.named_vectors property, but just via self.

Thoughts on names:

  • After more thought, I'm now leaning toward 'KeyedVectors' as a better name for this functionality. 'Names' are a bit more of a loaded concept than look-up keys.
  • Removed from Word2Vec, there's no reason to call the array-of-vectors property syn0 – a legacy of the internal neural-network naming conventions. Maybe just vectors? (It'd still be appropriate to mention that this is equivalent to syn0 in the original word2vec.c code, or even use syn0 someplace where the word2vec-algorithm code refers to the same array.)
  • Within the Word2Vec/Doc2Vec models, a property name based on role rather than type would be more readable. So rather than named_vectors, perhaps word_vectors or even wv for compactness (see the sketch below).
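
If it helps make the naming concrete, here is a minimal sketch under the names suggested above; it's a sketch only, not the final API, and every attribute name here is an assumption:

class KeyedVectors(object):
    # hypothetical standalone container for vectors looked up by string key
    def __init__(self):
        self.vectors = None       # 2D array of vectors; what word2vec.c calls syn0
        self.vectors_norm = None  # unit-normalized copy (the old syn0norm)
        self.index2word = []      # position -> key
        self.key2index = {}       # key -> position

    def __getitem__(self, key):
        # look up the single vector stored for `key`
        return self.vectors[self.key2index[key]]

    def __contains__(self, key):
        return key in self.key2index

# inside Word2Vec/Doc2Vec, the instance would then sit under a role-based name, e.g. self.wv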

Overall, this could be a pretty big and disruptive set of changes. Among other things, it will require extra code to maintain the capability to load-and-convert older models. It may even make sense to build the refactored classes alongside the old, still-operative classes, to make both the verification-of-identical-behavior and the load-and-convert code easier. That is: do it all as Word2Vec2, Doc2Vec2 first, then when parity/compatibility with the old classes is verified, do the swap.

droudy (Contributor, Author) commented Aug 16, 2016

@gojomo If the save/load functions, similarity calcs, etc. were moved into the NamedVectors/KeyedVectors class, calls such as trained_model.most_similar() wouldn't be viable anymore; they would have to be made as trained_model.named_vectors.most_similar(). Would it make sense to move the save/load functions, similarity calcs, etc. into their own class, w2v_no_training or something along those lines, and have the existing Word2Vec class inherit from it, with both of them accessing vectors and vocab from the NamedVectors/KeyedVectors class, so that calls like trained_model.most_similar() remain viable?
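
For illustration only, the arrangement being asked about might look roughly like this (W2VNoTraining and the attribute names are placeholders, not a settled design):

class NamedVectors(object):
    # stub for the container from this PR; would hold vectors, vocab, index2word
    def most_similar(self, word, topn=10):
        raise NotImplementedError

class W2VNoTraining(object):
    # hypothetical home for post-training operations (save/load, similarity calcs, ...)
    def most_similar(self, word, topn=10):
        # forward to the shared vector container
        return self.named_vectors.most_similar(word, topn=topn)

class Word2Vec(W2VNoTraining):
    def __init__(self, sentences=None):
        self.named_vectors = NamedVectors()
        # ... training would populate self.named_vectors ...

# trained_model.most_similar('warm') keeps working because the inherited method
# forwards to trained_model.named_vectors internally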

gojomo (Collaborator) commented Aug 16, 2016

Except for backward-compatibility, I don't think it matters whether trained_model.most_similar() works. It's the set-of-vectors that's interesting to run various kinds of downstream methods on, and there's no need to have those methods linked to the one Word2Vec algorithm when they work on vector-sets from many algorithms. For example:

my_model = Word2Vec(my_corpus, ...)
print(my_model.wv.most_similar('warm'))

Or even, when you mainly care about working with the output vectors:

my_vectors = my_model.wv
print(my_vectors.most_similar('hot'))
my_vectors.save_word2vec_format('words0001.vec', binary=False)
googlenews_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz')

...and then some hypothetical new functionality...

# some measure of whether same-words-have-same-top-N-neighbors
my_vectors.neighbors_rank_correlation(googlenews_vectors)

# translating/extending via method of https://github.com/RaRe-Technologies/gensim/wiki/Word2Vec-&-Doc2Vec-Wishlist#implement-translation-matrix-of-exploiting-similarities-among-languages-for-machine-translation
tm = googlenews_vectors.learn_translation_matrix_to(my_vectors)
my_vectors.import_unique(googlenews_vectors, tm)

jayantj (Contributor) commented Aug 17, 2016

Sorry for jumping in so late - I was just wondering if it'd be a good idea to keep the properties and methods of NamedVectors completely independent of word2vec/doc2vec. Specifically:

  1. NamedVectors shouldn't be responsible for vocab, just an index2word and a word2index, since vocab seems specific to w2v/doc2v. Maybe have the word2vec/doc2vec class take care of vocab, and have NamedVectors simply use that to create index2word and word2index (rough sketch after this list)?
  2. Some naming (like vocab, index2word) could probably be made more generic - again, independent of w2v/doc2v.
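
A rough sketch of that split (all names, and the build_from_vocab helper, are illustrative rather than existing gensim code):

class NamedVectors(object):
    # generic container: only keys and vectors, no w2v/d2v-specific vocab objects
    def __init__(self):
        self.vectors = None
        self.index2word = []   # could become index2key for full generality
        self.word2index = {}

    def build_from_vocab(self, vocab):
        # derive the key <-> index mappings from a w2v/d2v-style vocab dict,
        # whose values are assumed to carry an .index attribute
        self.index2word = sorted(vocab, key=lambda word: vocab[word].index)
        self.word2index = {word: i for i, word in enumerate(self.index2word)}

class Word2Vec(object):
    def __init__(self):
        self.vocab = {}   # frequency counts, huffman codes etc. stay with the algorithm
        self.named_vectors = NamedVectors()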

@@ -1167,20 +1209,20 @@ def intersect_word2vec_format(self, fname, binary=False, encoding='utf8', unicod
word.append(ch)
word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
weights = fromstring(fin.read(binary_len), dtype=REAL)
- if word in self.vocab:
+ if word in self.named_vectors.vocab:
jayantj (Contributor) commented on this diff, Aug 17, 2016

Simply word in self.named_vectors would work fine too, since you've defined __contains__ on NamedVectors.
Same for lines 519, 1222, 1271, 1412

piskvorky (Owner) commented Aug 17, 2016

What is the difference between NamedVectors and a similarity index?

Just the fact that vectors have an additional string id = key (rather than being referenced by their ordinal position)?

If so (but I might be wrong and missing something), isn't it better to add string ids to vectors in our similarity interface, rather than duplicate the functionality under a different name?

See also #732.

gojomo (Collaborator) commented Aug 17, 2016

@jayantj - Yes, a word2index (or more generically key2index) facility is absolutely essential to the class. Whether the class takes on any of the other per-key metadata that Word2Vec needs and stores in vocab (like frequency counts or huffman-encoding details) is a thornier issue. (Maybe it's a useful extension of KeyedVectors with info that travels with the vectors; maybe it stays closer to the algorithm.)

@piskvorky - Do you mean like gensim's Similarity class? I can't quite see how that would drop into Word2Vec/Doc2Vec as the actual source of vectors during training, nor would I expect users to think of the set-of-vectors (whether trained or loaded) as a 'similarity index'. (Would you see a Similarity implementation offering a load_word2vec_format() method?) Also, as I've mentioned in other contexts, by the usual idioms, I'd expect bracket/__getitem__ access to look up a single vector, not perform a broad calculation and return a large list-of-similarities (as appears to be the case with current Similarity classes).
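
Roughly, the contrast being drawn (keyed_vectors and similarity_index are illustrative names, and the Similarity behavior is as I understand the current query idiom):

vec = keyed_vectors['coffee']           # expected KeyedVectors idiom: one key in, one vector back
sims = similarity_index[query_vector]   # Similarity idiom: one query in, a whole array of similarities back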

piskvorky (Owner) commented Aug 17, 2016

Yeah, I see a large overlap there. Both are meant to store vectors, both are meant to retrieve similar vectors when queried either by key or by another vector.

What the API should look like is another question. But I still don't see how they're different, conceptually. And we don't want to be maintaining two pieces of code with ±identical functionality, just named differently.

tmylk (Contributor) commented Aug 18, 2016

To finish something soon, let's split the refactoring into two parts.

Stage 1. Implement what is needed to have 3 word2vec training modes: Gensim, TensorFlow and FastText. Hopefully @droudy can finish it this week.

Stage 2. Integrate with the Similarity classes. Specifically, it would be good not to have separate word2vec- and doc2vec-specific lines in the Annoy similarity class. This can be done later.

gojomo (Collaborator) commented Aug 18, 2016

I can see the value of getting TensorFlow-trained or fastText-trained vectors into a common format, and loaded into gensim-operational objects.

I don't understand the goal of wrapping TensorFlow/fastText word2vec codebases in gensim as alternate training-modes. Folding slightly-different cores under the same interface can make it hard to use whatever's uniquely valuable about any of them, and increase the maintenance hassle and failure/confusion modes when there are slight differences in behavior.

Note that #549 was initially proposed to regroup and extend functionality, unrelated to TensorFlow or fastText. (To the extent that by separating training and post-training-operations, it could make loading external vectors cleaner, that's great. But working with externally-trained vectors wasn't the only planned benefit.)

There's certainly overlap with the most-similar/nearest-neighbors functionality of the Similarity classes. But, my hoped-for core-capabilities of KeyedVectors are primary access by key, direct use during Word2Vec/Doc2Vec training, and use as a shared-model with other vector sources. I see those as somewhat hard to mix with the existing Similarity interface/implementation choices. So, I'd prefer to just aim for consistent name/signature conventions, and reuse of utility code as appropriate, across separate classes for the time being. Unification of the underlying implementations could come later, if in practice they appear to have unnecessary duplication.

droudy (Contributor, Author) commented Aug 19, 2016

Opened a new PR addressing the same issue here: #833, because the new API is different and needed a different branch.

droudy closed this Aug 19, 2016