NamedVectors refactor for word2vec #819
Conversation
doc2vec NamedVectors refactor
Yes, this is the right direction! Ultimately the reason for the pain-of-refactoring is to move all existing and future related functionality (save/load to other formats, similarity calcs, indexing, translations/projections, etc.) to the helper class, and make it usable for other purposes and algorithms than just the current Word2Vec/Doc2Vec. So eventually this class would move to its own file, and when the related methods also move, they wouldn't be accessing vectors via a

Thoughts on names:
Overall, this could be a pretty big and disruptive set of changes. Among other things, it will require extra code to maintain the capability to load-and-convert older models. It may even make sense to build the refactored classes alongside the old, still-operative classes, to make both the verification-of-identical-behavior and the load-and-convert code easier. That is: do it all as Word2Vec2, Doc2Vec2 first, then when parity/compatibility with the old classes is verified, do the swap.
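A minimal sketch of what such a load-and-convert shim might look like, assuming the helper class and attribute are named NamedVectors / named_vectors as elsewhere in this thread (all names here are assumptions, not a finished gensim API):

```python
# Hypothetical load-and-convert shim; NamedVectors / named_vectors are
# assumed names from this discussion, not a finished gensim API.
from gensim.models import Word2Vec
from gensim.models.word2vec import NamedVectors  # assumed import location

def load_and_convert(fname):
    model = Word2Vec.load(fname)              # also handles older pickled models
    if not hasattr(model, 'named_vectors'):   # pre-refactor flat attributes
        nv = NamedVectors()
        nv.syn0 = model.syn0
        nv.vocab = model.vocab
        nv.index2word = model.index2word
        model.named_vectors = nv              # attach the refactored helper
    return model
```

One design option is to keep the old flat attributes as thin properties that delegate to the helper, so existing external code keeps working.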
@gojomo If the save/load functions, similarity calcs, etc. were moved into the
Except for backward-compatibility, I don't think it matters whether
Or even, when you mainly care about working with the output vectors:
...and then some hypothetical new functionality...
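A hedged sketch of the access patterns being contrasted above, assuming the helper is exposed as model.named_vectors and carries familiar methods like most_similar and save_word2vec_format (assumptions from this thread, not a confirmed API):

```python
# Hedged sketch: `named_vectors`, `most_similar`, and `save_word2vec_format`
# are assumed names from this discussion, not a confirmed gensim API.
from gensim.models import Word2Vec

sentences = [['the', 'dog', 'barks'], ['the', 'cat', 'meows']]
model = Word2Vec(sentences, min_count=1)

# ...whether you go through the model:
vec = model.named_vectors['dog']

# Or, when you mainly care about working with the output vectors,
# keep only the helper and drop the (much heavier) training model:
word_vecs = model.named_vectors
del model
print(word_vecs.most_similar('dog'))

# ...and then some hypothetical new functionality, e.g. saving just the
# vectors without the training apparatus:
word_vecs.save_word2vec_format('vectors.bin', binary=True)
```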
Sorry for jumping in so late - I was just wondering if it'd be a good idea to keep the properties and methods of
@@ -1167,20 +1209,20 @@ def intersect_word2vec_format(self, fname, binary=False, encoding='utf8', unicod
                     word.append(ch)
                 word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
                 weights = fromstring(fin.read(binary_len), dtype=REAL)
-                if word in self.vocab:
+                if word in self.named_vectors.vocab:
Simply word in self.named_vectors would work fine too, since you've defined __contains__ on NamedVectors. Same for lines 519, 1222, 1271, 1412.
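For reference, a minimal sketch of the kind of __contains__ definition this comment relies on (illustrative only; the PR's actual class body may differ):

```python
class NamedVectors(object):
    """Sketch of the refactored container for syn0/vocab/index2word."""

    def __init__(self):
        self.vocab = {}        # word -> vocabulary entry
        self.syn0 = None       # 2D array of word vectors
        self.syn0norm = None   # unit-normalized copy of syn0
        self.index2word = []   # position in syn0 -> word

    def __contains__(self, word):
        # enables `word in named_vectors` as shorthand for
        # `word in named_vectors.vocab`
        return word in self.vocab
```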
What is the difference between this and our existing similarity interface? Just the fact that vectors have an additional string id = key (rather than being referenced by their ordinal position)? If so (but I might be wrong and missing something), isn't it better to add string ids to vectors in our similarity interface, rather than duplicate the functionality under a different name? See also #732.
@jayantj - Yes, a

@piskvorky - Do you mean like gensim's
Yeah, I see a large overlap there. Both are meant to store vectors, both are meant to retrieve similar vectors when queried either by key or by another vector. What the API should look like is another question. But I still don't see how they're different, conceptually. And we don't want to be maintaining two pieces of code with ±identical functionality, just named differently.
To finish something soon, let's split the refactoring into two parts.

Stage 1. Implement what is needed to have 3 word2vec training modes: Gensim, TensorFlow and FastText. Hopefully @droudy can finish it this week.

Stage 2. Integrate with
I can see the value of getting TensorFlow-trained or fastText-trained vectors into a common format, and loaded into gensim-operational objects. I don't understand the goal of wrapping TensorFlow/fastText word2vec codebases in gensim as alternate training-modes. Folding slightly-different cores under the same interface can make it hard to use whatever's uniquely valuable about any of them, and increase the maintenance hassle and failure/confusion modes when there are slight differences in behavior. Note that #549 was initially proposed to regroup and extend functionality, unrelated to TensorFlow or fastText. (To the extent that by separating training and post-training-operations, it could make loading external vectors cleaner, that's great. But working with externally-trained vectors wasn't the only planned benefit.) There's certainly overlap with the most-similar/nearest-neighbors functionality of the
Opened a new PR addressing the same issue here: #833, because the new API is different and needed a different branch.
Addresses #549. Refactored syn0, syn0norm, vocab, and index2word into their own class. The NamedVecs of a model can be retrieved as follows:
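A hedged sketch of that retrieval, assuming the accessor attribute is named named_vectors as in the review comments above (the dict-style lookup and membership test are assumptions):

```python
# Hedged sketch; `named_vectors` and dict-style lookup are assumed here.
from gensim.models import Word2Vec

# Train a toy model; real corpora would be far larger.
model = Word2Vec([['dog', 'barks'], ['cat', 'meows']], min_count=1)

nv = model.named_vectors   # the refactored NamedVecs/NamedVectors object
vec = nv['dog']            # vector lookup by string key (assumed __getitem__)
print('dog' in nv)         # membership test via NamedVectors.__contains__
```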
@gojomo please review