Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smaller&faster neg-sampling table; reduce cython duplication; feedback tweaks #373

Merged
merged 3 commits into from
Jun 30, 2015

Conversation

gojomo
Copy link
Collaborator

@gojomo gojomo commented Jun 30, 2015

Replaces 400MB negative-sampling table (which took ~30 seconds to build no matter how small the vocabulary) with a cumulative-distribution table which is proportional to the vocabulary size. For a ~100K word vocabulary, will take 400KB (1000x smaller) and is created almost instantly (making each class's tests finish 3+ minutes faster). Draws use a binary-search into this table; in my tests, sampling has also been faster for vocabularies at least up through 250K words (and probably into the millions).

Uses PXD file to share declarations from word2vec_inner to doc2vec_inner, reducing duplication (fixes #367 ).

Integrates small changes suggested on prior PR #356 from @cscorley and @e9t.

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 30, 2015

Forgot to mention: this also creates a one-time-seeded numpy.random.RandomState that's local to the model, to isolate the model's RNG from anything else using the (shared global) numpy random instance. And, for seeded_vector, it draws from a deterministically-seeded one-time-use RandomState. (Previous behavior was a bit off – all randoms from the global source, which happened to be re-seeded on each seeded_vector call – but that just happened to work alright in the usual one-model-at-a-time case.)

@piskvorky
Copy link
Owner

Looks great!

How much of the changed functionality is covered by unit tests? Any expected incompatibilities?

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 30, 2015

No new functionality here: just smaller/faster/clearer implementations – so baseline is, all unit tests still pass (and much faster!)

Only incompatibility risks would be if someone was reaching into the old bulky neg-sampling table themselves for their own reasons, or was somehow dependent on perfect reproducibility of the prior flaky randomization approach. (Both very unlikely.)

@piskvorky
Copy link
Owner

Great. Let's merge then 👍

gojomo added a commit that referenced this pull request Jun 30, 2015
smaller&faster neg-sampling table; reduce cython duplication; feedback tweaks
@gojomo gojomo merged commit 3cdc43c into piskvorky:develop Jun 30, 2015
@gojomo
Copy link
Collaborator Author

gojomo commented Jun 30, 2015

Done, thanks!

@piskvorky
Copy link
Owner

The C modules no longer compile for me now.

The error is:

./gensim/models/doc2vec_inner.c:1153:8: error: 'inline' can only appear on functions
static CYTHON_INLINE unsigned PY_LONG_LONG (*__pyx_f_6gensim_6models_14word2vec_inner_bisect_left)(__pyx_t_5numpy_uint32_t *, unsigned PY_LONG_LONG, u...
       ^
./gensim/models/doc2vec_inner.c:174:27: note: expanded from macro 'CYTHON_INLINE'
    #define CYTHON_INLINE __inline__
                          ^

(OS X, python 2.7.6, LLVM 6.1.0, latest develop at 3cdc43c)

@piskvorky
Copy link
Owner

Btw was there a rebase? Normally after merging, the pull request is automatically closed by github. But I see it's still open here..?

@piskvorky
Copy link
Owner

Never mind, must be some weird github caching, the PR is gone now again.

@piskvorky
Copy link
Owner

Removing the inline statement from bisect_left fixes the error. How important is that inline for performance?

I'm thinking if it gives trouble to (some) compilers, and the performance penalty is not too big, probably safer to leave it out.

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 30, 2015

Hmm, I'm on OSX and it's working for me. But, definitely want a setup that works everywhere. Can you try removing the inline in the word2vec_inner.pxd file, but leave it in the pyx, and see if that works (or generates a new error)?

(Re: github stale caches - yes, I've even had clones/fetches grab stuff that was hours/days old... but a few retries usually find a fresh version.)

@piskvorky
Copy link
Owner

Yes, pxd is enough. I've pushed the commit to develop: 570f08a .

@gojomo gojomo deleted the bdv_followups_pr branch July 9, 2015 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

top-level __init__.py confuses cython; can we remove?
2 participants