smaller&faster neg-sampling table; reduce cython duplication; feedback tweaks #373

gojomo · 2015-06-30T02:20:44Z

Replaces the 400MB negative-sampling table (which took ~30 seconds to build regardless of vocabulary size) with a cumulative-distribution table whose size is proportional to the vocabulary. For a ~100K-word vocabulary, it takes 400KB (1000x smaller) and is created almost instantly (making each class's tests finish 3+ minutes faster). Draws use a binary search into this table; in my tests, sampling has also been faster for vocabularies at least up through 250K words (and probably into the millions).
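For readers who want to see the idea in isolation, here is a minimal sketch of a cumulative-distribution table with binary-search draws. It is not the code from this PR: the function names, the 0.75 smoothing power, and the `2**31 - 1` integer domain are assumptions chosen for illustration. With one `uint32` entry per word, a 100K-word vocabulary needs about 400KB.

```python
import numpy as np

def make_cum_table(counts, power=0.75, domain=2**31 - 1):
    """Build one uint32 entry per word: the scaled cumulative weight.

    `power` and `domain` are illustrative assumptions, not values
    confirmed by this PR.
    """
    weights = np.asarray(counts, dtype=np.float64) ** power
    cum = np.cumsum(weights) / weights.sum()
    table = np.round(cum * domain).astype(np.uint32)
    table[-1] = domain  # guard against rounding in the last slot
    return table

def draw_negatives(table, rng, size):
    """Draw word indexes by binary search into the cumulative table."""
    r = rng.randint(0, int(table[-1]), size=size)
    # Word i owns the half-open interval [table[i-1], table[i]).
    return np.searchsorted(table, r, side="right")

counts = [100, 10, 1]                # hypothetical word frequencies
table = make_cum_table(counts)
rng = np.random.RandomState(1)
samples = draw_negatives(table, rng, 10000)
freq = np.bincount(samples, minlength=len(counts))
```

Each draw costs O(log V) for a vocabulary of V words, while the table itself grows only linearly with V.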

Uses a PXD file to share declarations from word2vec_inner to doc2vec_inner, reducing duplication (fixes #367).

Integrates small changes suggested on prior PR #356 from @cscorley and @e9t.

gojomo · 2015-06-30T07:22:03Z

Forgot to mention: this also creates a one-time-seeded numpy.random.RandomState that's local to the model, to isolate the model's RNG from anything else using the (shared global) numpy random instance. And, for seeded_vector, it draws from a deterministically-seeded one-time-use RandomState. (Previous behavior was a bit off – all randoms from the global source, which happened to be re-seeded on each seeded_vector call – but that just happened to work alright in the usual one-model-at-a-time case.)
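The RNG isolation described here can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `TinyModel` is a hypothetical stand-in class, and hashing the seed string with `zlib.crc32` is an assumption made so the example is reproducible across processes.

```python
import zlib
import numpy as np

class TinyModel:
    """Hypothetical stand-in for a model that owns its own RNG."""

    def __init__(self, seed=1, vector_size=4):
        self.vector_size = vector_size
        # Seeded once; independent of the shared global numpy.random state.
        self.random = np.random.RandomState(seed)

    def seeded_vector(self, seed_string):
        # One-time-use RandomState seeded deterministically from the string.
        # crc32 is an assumed hash, chosen here only for reproducibility.
        seed = zlib.crc32(seed_string.encode("utf8")) & 0xFFFFFFFF
        once = np.random.RandomState(seed)
        return (once.rand(self.vector_size) - 0.5) / self.vector_size

a = TinyModel(seed=1)
b = TinyModel(seed=1)
np.random.seed(12345)        # disturbing the global RNG...
np.random.rand(10)
draw_a = a.random.rand(3)    # ...does not affect either model's stream
draw_b = b.random.rand(3)
vec_1 = a.seeded_vector("word0")
vec_2 = b.seeded_vector("word0")
```

Because neither the model's stream nor `seeded_vector` touches the global instance, two models in one process no longer interfere with each other's randomization.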

piskvorky · 2015-06-30T08:11:01Z

Looks great!

How much of the changed functionality is covered by unit tests? Any expected incompatibilities?

gojomo · 2015-06-30T09:01:51Z

No new functionality here: just smaller/faster/clearer implementations – so baseline is, all unit tests still pass (and much faster!)

Only incompatibility risks would be if someone was reaching into the old bulky neg-sampling table themselves for their own reasons, or was somehow dependent on perfect reproducibility of the prior flaky randomization approach. (Both very unlikely.)

piskvorky · 2015-06-30T09:31:52Z

Great. Let's merge then 👍

gojomo · 2015-06-30T09:39:40Z

Done, thanks!

piskvorky · 2015-06-30T13:14:58Z

The C modules no longer compile for me now.

The error is:

./gensim/models/doc2vec_inner.c:1153:8: error: 'inline' can only appear on functions
static CYTHON_INLINE unsigned PY_LONG_LONG (*__pyx_f_6gensim_6models_14word2vec_inner_bisect_left)(__pyx_t_5numpy_uint32_t *, unsigned PY_LONG_LONG, u...
       ^
./gensim/models/doc2vec_inner.c:174:27: note: expanded from macro 'CYTHON_INLINE'
    #define CYTHON_INLINE __inline__
                          ^

(OS X, python 2.7.6, LLVM 6.1.0, latest develop at 3cdc43c)

piskvorky · 2015-06-30T13:18:49Z

Btw was there a rebase? Normally after merging, the pull request is automatically closed by github. But I see it's still open here..?

piskvorky · 2015-06-30T13:19:23Z

Never mind, must be some weird github caching, the PR is gone now again.

piskvorky · 2015-06-30T13:22:34Z

Removing the inline qualifier from bisect_left fixes the error. How important is that inline for performance?

If it gives (some) compilers trouble and the performance penalty is not too big, it's probably safer to leave it out.

gojomo · 2015-06-30T13:41:38Z

Hmm, I'm on OSX and it's working for me. But, definitely want a setup that works everywhere. Can you try removing the inline in the word2vec_inner.pxd file, but leave it in the pyx, and see if that works (or generates a new error)?

(Re: github stale caches - yes, I've even had clones/fetches grab stuff that was hours/days old... but a few retries usually find a fresh version.)

piskvorky · 2015-06-30T13:55:37Z

Yes, removing it from the pxd alone is enough. I've pushed the commit to develop: 570f08a.

gojomo added 3 commits June 29, 2015 18:44

share declarations from word2vec_inner.[pyx|pxd]

b8dc13a

cumulative table for neg-samples; local RandomState

1a393b8

super(); LabeledSentence deprecation; intersect error louder

b31e94f

This was referenced Jun 30, 2015

big doc-vector refactor/enhancements #356

Merged

top-level __init__.py confuses cython; can we remove? #367

Closed

Improves peak memory usage of Word2Vec on vocabulary creation #370

Closed

gojomo added a commit that referenced this pull request Jun 30, 2015

Merge pull request #373 from gojomo/bdv_followups_pr

3cdc43c

smaller&faster neg-sampling table; reduce cython duplication; feedback tweaks

gojomo merged commit 3cdc43c into piskvorky:develop Jun 30, 2015

gojomo deleted the bdv_followups_pr branch July 9, 2015 12:18
