-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
smaller&faster neg-sampling table; reduce cython duplication; feedback tweaks #373
Conversation
Forgot to mention: this also creates a one-time-seeded numpy.random.RandomState that's local to the model, to isolate the model's RNG from anything else using the (shared global) numpy random instance. And, for seeded_vector, it draws from a deterministically-seeded one-time-use RandomState. (Previous behavior was a bit off – all randoms from the global source, which happened to be re-seeded on each seeded_vector call – but that just happened to work alright in the usual one-model-at-a-time case.) |
Looks great! How much of the changed functionality is covered by unit tests? Any expected incompatibilities? |
No new functionality here: just smaller/faster/clearer implementations – so baseline is, all unit tests still pass (and much faster!) Only incompatibility risks would be if someone was reaching into the old bulky neg-sampling table themselves for their own reasons, or was somehow dependent on perfect reproducibility of the prior flaky randomization approach. (Both very unlikely.) |
Great. Let's merge then 👍 |
smaller&faster neg-sampling table; reduce cython duplication; feedback tweaks
Done, thanks! |
The C modules no longer compile for me now. The error is:
(OS X, python 2.7.6, LLVM 6.1.0, latest |
Btw was there a rebase? Normally after merging, the pull request is automatically closed by github. But I see it's still open here..? |
Never mind, must be some weird github caching, the PR is gone now again. |
Removing the inline statement from I'm thinking if it gives trouble to (some) compilers, and the performance penalty is not too big, probably safer to leave it out. |
Hmm, I'm on OSX and it's working for me. But, definitely want a setup that works everywhere. Can you try removing the (Re: github stale caches - yes, I've even had clones/fetches grab stuff that was hours/days old... but a few retries usually find a fresh version.) |
Yes, pxd is enough. I've pushed the commit to |
Replaces 400MB negative-sampling table (which took ~30 seconds to build no matter how small the vocabulary) with a cumulative-distribution table which is proportional to the vocabulary size. For a ~100K word vocabulary, will take 400KB (1000x smaller) and is created almost instantly (making each class's tests finish 3+ minutes faster). Draws use a binary-search into this table; in my tests, sampling has also been faster for vocabularies at least up through 250K words (and probably into the millions).
Uses PXD file to share declarations from word2vec_inner to doc2vec_inner, reducing duplication (fixes #367 ).
Integrates small changes suggested on prior PR #356 from @cscorley and @e9t.