Showing 1 changed file with 8 additions and 0 deletions.
e69e112
This better covers implications for users, thanks! But for clarity, I would lead with the "main motivation", and cover the workaround (for testing if a word is OOV) inline in point (1).
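For illustration, that OOV test could be shown inline with something like the following rough sketch. It assumes the gensim 3.x API, where only full words seen in training appear in `model.wv.vocab`, while `in model.wv` reflects whether a vector can be produced at all (possibly synthesized from char-ngrams):

```python
from gensim.models import FastText

# Rough sketch (gensim 3.x API assumed): distinguish a word that is truly
# in the trained vocabulary from one whose vector would only be
# synthesized from its char-ngrams.
sentences = [["human", "interface", "computer"], ["survey", "user", "computer"]]
model = FastText(sentences, size=10, min_count=1, min_n=3, max_n=5)

word = "computation"
print(word in model.wv.vocab)  # False: not a full word seen in training, i.e. OOV
print(word in model.wv)        # whether a vector can still be produced from ngrams
```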
Separately, I'm not sure the original only-retain-seen-ngrams behavior really was a memory or speed optimization for important workloads, like larger corpuses. For example:
(1) With large corpuses, the collision-oblivious fixed-size hashtable of ngrams could have most of its slots trained by one or more ngrams. So, slimming it to just used buckets wouldn't save much memory. (It's only when the default `bucket` count is far out-of-proportion to the number of ngrams seen that the prior optimization would save a lot of memory.)

(2) Meanwhile, maintaining/consulting that list/slot-lookup-dict of seen-ngrams required some new memory (and computation) not needed for the FastText-conformant behavior. And while with a small corpus an OOV word might have few ngram hits, so skipping the vector-lookup/vector-averaging of all the missing potential ngrams might speed things up, in larger corpuses an ever-greater proportion of ngrams will in fact be 'known'.
So I wouldn't be surprised if that particular "optimization" helped with toy-sized data (like, say, 100MB `text8`), but used more memory and computation on other corpuses (like, say, 10GB+ Wikipedia dumps). Looking at last year's #1916, I think the reported speedups there were from cythonization and other tricks, not from retaining the seen-ngrams. I'm a bit confused because the descriptive text at #1916 suggests that PR was already, as of May 2018, removing the model's cache of seen-ngrams. Yet the persistence of `KeyError: all ngrams for word X absent from model` errors through gensim 3.7.1 suggests that PR #1916 did not have that effect. In any case, I wouldn't characterize this fix as necessarily a memory/speed hit unless that's freshly demonstrated on real workloads.
So I'd word the change notes as:
e69e112
OK, I added a new commit, please have a look.