
big doc-vector refactor/enhancements #356

Merged (51 commits) on Jun 28, 2015

Conversation

@gojomo (Collaborator) commented Jun 10, 2015

Ready for review/testing!

The headline changes are optimized doc-vector inference, and separating doc-vectors (during training or comparison) from the word-vocabulary – allowing many more docs (via memmap-backing) than words, and avoiding some confusion.

Smaller changes include a DM-concatenative-context mode (as recommended in the Paragraph Vectors paper), an optional 'lock_factor' to attenuate training of some vectors, and other optimizations and cleanup.

See gensim/test/test_doc2vec.py to do a quick check on a new system or survey some API possibilities. See docs/notebook/doc2vec-IMDB.ipynb for a walkthrough of reproducing the PV paper IMDB sentiment experiment.

If you used Doc2Vec previously, a few key changes to note:

  • LabeledSentence is now TaggedDocument, and the (one or more) document labels that correspond to vectors are now referred to as 'tags'. The greatest memory efficiency is possible by using only int tags, contiguous and ascending from 0.
  • Doc vectors are stored, accessed, and compared through a constituent '.docvecs' field of the Doc2Vec model, rather than through the old accessors/comparison methods (which still work for words). So use d2v_model.docvecs[doc_tag] or d2v_model.docvecs.most_similar(doc_tag), rather than calling these on d2v_model directly. (See the sketch just after this list.)
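
A minimal sketch of the new API, assuming a tiny in-memory corpus (the corpus contents, parameter values, and variable names are illustrative only, not taken from the PR):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# plain int tags, contiguous and ascending from 0, give the most memory-efficient storage
corpus = [TaggedDocument(words=['human', 'computer', 'interface'], tags=[0]),
          TaggedDocument(words=['graph', 'minors', 'survey'], tags=[1])]

model = Doc2Vec(corpus, size=100, window=5, min_count=1, workers=2)

vec = model.docvecs[0]                # doc vectors live under .docvecs, not on the model itself
sims = model.docvecs.most_similar(0)  # similarity queries likewise go through .docvecs
new_vec = model.infer_vector(['human', 'interface'])  # optimized inference for an unseen text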

Some top needs:

  • testing on diverse, larger datasets and systems – though the ability to handle doc-vector sets much larger than RAM is theoretically there, I haven't forced an overflow yet (the memmap option is sketched just after this list)
  • improving save() to externalize numpy arrays as with the prior models – maybe a recursive utils.SaveLoad?
  • refactoring the DocvecsArray similarity-testing methods – currently, just a quick-and-dirty copy-and-adapt from Word2Vec, but would ideally share code (and the optimizations pending in other @sebastien-j / @KCzar PRs) with word-vecs
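
A rough sketch of how the memmap backing is requested; the file name and tiny stand-in corpus below are just examples:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=['just', 'an', 'example'], tags=[0])]  # stand-in for a very large corpus

# docvecs_mapfile backs the (potentially huge) doc-vector array with a numpy memmap on disk,
# so the number of documents is no longer bounded by RAM
model = Doc2Vec(size=300, min_count=1, docvecs_mapfile='/tmp/docvecs.mmap')
model.build_vocab(corpus)
model.train(corpus)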

gojomo added 30 commits June 10, 2015 15:02
…for padding; layer1_size potentially different than vector_size; parameter renames for clarity; one-time neg_lables precalc
…ence_dm_concat, infer_vector_dm_concat methods
pep8 & python2 fixes to doc2vec notebook
@gojomo (Collaborator, Author) commented Jun 28, 2015

@piskvorky the notebook & most other work was done in py3.4/OSX first, but I've also regularly run the notebook and other tests on py2.7/ubuntu. Thanks for the PR, merged!

@akhudek commented Jun 28, 2015

@gojomo The vocabulary phase definitely isn't the problem: I had logging on during the attempt and saw that it completed building the vocabulary with about the same memory usage as word2vec. From memory (I lost the exact numbers in the crash; I should have logged to a file), there are about 4 million words before pruning and far fewer afterwards. The crash happened immediately after a log message about resetting the layer weights.

The training script is very straightforward, as you can see below. A single line from the input data definitely won't blow memory.

import sys
import gzip
import logging
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class TaggedLineSentence(object):
   def __init__(self, filename):
      self.filename = filename

   def __iter__(self):
      for uid, line in enumerate(gzip.open(self.filename, 'rb')):
         yield TaggedDocument(words=line.split(), tags=[uid])

sentences = TaggedLineSentence(sys.argv[1])

model = Doc2Vec(alpha=0.025, min_alpha=0.025, docvecs_mapfile='mapfile')  # use fixed learning rate
model.build_vocab(sentences)

for epoch in range(10):
   print epoch
   model.train(sentences)
   model.alpha -= 0.002  # decrease the learning rate
   model.min_alpha = model.alpha  # fix the learning rate, no decay

# store the model 
model.save(sys.argv[2])

@gojomo (Collaborator, Author) commented Jun 28, 2015

Hmm. If build_vocab() completes I think you would see 'mapfile' appear in the working directory: Doc2Vec should have written random initialization vectors across its entire extent. (Do you see the '0' epoch print?)

@akhudek commented Jun 28, 2015

No, it doesn't make it that far. I've just tried with a small subset of the data. After finishing the vocab with the small data, it's using ~100mb of memory. Somewhere between resetting layer weights and starting training, memory usage jumps to 1.3gb. This probably makes sense since OS X is no doubt loading the entire mmap'd file into memory. In this case the mmap'd file does get created.

I'm beginning to suspect that this might be OS X behaving poorly with the mmap'd file, trying to load the entire thing into its notion of virtual memory and causing swap-file death. I've tried to limit the process's memory via ulimit, but it doesn't seem to work.

I'm going to try this on a linux system to see if it's OS related.

2015-06-28 17:10:58,509 : INFO : collected 143273 word types from a corpus of 31045724 words and 1000000 documents
2015-06-28 17:10:58,571 : INFO : total 29599 word types after removing those with count<5
2015-06-28 17:10:58,572 : INFO : constructing a huffman tree from 29599 words
2015-06-28 17:10:59,766 : INFO : built huffman tree with maximum node depth 23
2015-06-28 17:10:59,795 : INFO : resetting layer weights
0
2015-06-28 17:11:44,514 : INFO : training model with 1 workers on 29599 vocabulary and 300 features, using 'skipgram'=0 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-06-28 17:11:45,519 : INFO : PROGRESS: at 0.73% words, alpha 0.02500, 225655 words/s
2015-06-28 17:11:46,523 : INFO : PROGRESS: at 1.46% words, alpha 0.02500, 225018 words/s
2015-06-28 17:11:47,532 : INFO : PROGRESS: at 2.19% words, alpha 0.02500, 223860 words/s

@piskvorky (Owner)

Let me merge this PR into develop.

We can continue the discussion here, as well as open new PRs for fixes / improvements.

piskvorky added a commit that referenced this pull request Jun 28, 2015
big doc-vector refactor/enhancements
@piskvorky piskvorky merged commit 1d5bd88 into piskvorky:develop Jun 28, 2015
@piskvorky (Owner)

And it goes without saying -- massive thanks to @gojomo for his epic refactor!

@craigpfeifer

Great changes! When will these be available via pip?

@gojomo (Collaborator, Author) commented Jun 29, 2015

I don't know @piskvorky's plans for a numbered release, but you can always pip install from a github branch. For example, this should do the trick (because 'develop' is the default branch for /piskvorky/gensim):

pip install git+https://github.com/piskvorky/gensim.git
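
(Pip's VCS syntax can also name a branch explicitly, e.g. git+https://github.com/piskvorky/gensim.git@develop, though with develop already the default that isn't needed here.)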

@e9t commented Jun 29, 2015

Are you sure you want to totally remove LabeledSentence?

Some versioning schemes (e.g. semver) expect minor version changes to be backwards-compatible, so people used to that convention may be confused, since this feature merge arrives as only a minor version bump.
Rather than raising an AttributeError,

AttributeError: 'module' object has no attribute 'LabeledSentence'

how about using a DeprecationWarning?
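
For instance, a backwards-compatible alias could look roughly like this (just a sketch of the idea, not the actual change):

import warnings
from gensim.models.doc2vec import TaggedDocument

class LabeledSentence(TaggedDocument):
    def __new__(cls, *args, **kwargs):
        # still constructs a TaggedDocument, but warns about the old name
        warnings.warn("LabeledSentence has been replaced by TaggedDocument", DeprecationWarning)
        return super(LabeledSentence, cls).__new__(cls, *args, **kwargs)

Old code constructing LabeledSentence positionally would then keep working while emitting a warning, instead of failing with the AttributeError above.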

Anyways, thanks for the great work!

@gojomo (Collaborator, Author) commented Jun 30, 2015

Remaining @cscorley & @e9t suggestions handled on #373 – thanks for review!

@e9t commented Jun 30, 2015

@gojomo, I got your notebook code to work on my Macbook and reproduced similar results. (Awesome!)
The code also ran just fine when I tried my own dataset with sequential integer tags, as you suggested. However, when I replaced the tags with strings (I wanted the tags to be actual document IDs), I got a segmentation fault during the doc-vector training phase, as shown below:

# Load corpus
labeledTrainData_clean.tsv
testData_clean.tsv
unlabeledTrainData_clean.tsv
# Set-up Doc2Vec Training & Evaluation Models
Doc2Vec(dm/c,d100,n5,w5,mc2,t8)
Doc2Vec(dbow,d100,n5,mc2,t8)
Doc2Vec(dm/m,d100,n5,w10,mc2,t8)
# Bulk Training
START 2015-07-01 01:20:52.256642
*0.392200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 54.5s 0.5s
*0.392000 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 54.5s 2.1s
Segmentation fault: 11

I'm using a Macbook, so this is probably related to numpy/numpy#4007.
Do you think we can fix this?

@gojomo (Collaborator, Author) commented Jun 30, 2015

I wouldn't be confident it's related to the numpy issue – I've not had random/numpy segfaults while developing this on OSX. (All the many segfaults I've worked through have been traceable to my own genuine bugs.)

I didn't think any Macbooks had 8 cores; did you manually set that number of workers? (I'd recommend just the number of real cores, though that shouldn't risk a segfault.)

Does it happen every time, at roughly the same time, when using string tags? And never when using int IDs?

The string tags should definitely work – they just may prove unwieldy if you get to tens-of-millions of docs. Are your tags relatively short, and just one tag to a document?

Is there a chance any of your documents are > 10,000 words? (That's currently the hard limit, though I hope to loosen it, and going over it should only cause truncation rather than crashes.)

If you can reliably trigger segfaults with a small test case, I'll have other ideas for details to collect. (First among them: enabling core dumps and checking exactly where the fault is happening.)

@e9t commented Jun 30, 2015

I didn't think any Macbooks had 8 cores; did you manually set that number of workers? (I'd recommend just the number of real cores, though that shouldn't risk a segfault.)

My Macbook has an i7 CPU with 4 physical cores, but hyperthreading gives me 8 logical cores, which is probably why multiprocessing.cpu_count() returned 8.
I just manually set the number of workers to 4.

Does it happen every time, at roughly the same time, when using string tags? And never when using int IDs?

So far, yes. (The sample isn't very big, but I have run each case 5+ consecutive times.)

The string tags should definitely work – they just may prove unwieldy if you get to tens-of-millions of docs. Are your tags relatively short, and just one tag to a document?

Yes and yes, max length of the tags is 6 and just one tag per document.

Is there a chance any of your documents are > 10,000 words? (That's currently the hard limit, though I hope to loosen it, and going over it should only cause truncation rather than crashes.)

No, document lengths are all shorter than 10,000 words.

If you can reliably trigger segfaults with a small test case, I'll have other ideas for details to collect. (First among them: enabling core dumps and checking exactly where the fault is happening.)

I'll see if I can create a toy case, and see what I can do to spot the problem. It may just be a special case, or my own bug as you suggested, but if I find anything noteworthy, I'll open a new issue. Thanks!

@gojomo (Collaborator, Author) commented Jun 30, 2015

I'd somehow overlooked that the i7 has hyperthreading (I usually do this dev in an OSX VM with 4 virtual cores)... it turns out 8 virtual cores do incrementally help: about a 14% speedup over my previous configuration of 4, in a quick test. So by all means keep using 8.

Unless you're doing your own C/cython work, any segfaulting bug is far more likely to come from the optimized cython/BLAS code, which can easily run past array boundaries or hold pointers past memory reuse. So please keep me posted on any triggering patterns you find. (There is one other report of a segfault on the discussion list.)

@e9t commented Jul 1, 2015

You are referring to this discussion, right?

I looked at my crash logs and found the same exception type and codes EXC_BAD_ACCESS (SIGSEGV) and KERN_INVALID_ADDRESS.
However, the traceback looks a bit different from the previous segfault report:

Thread 2 Crashed:
0   libBLAS.dylib                   0x00007fff9325bc95 cblas_sdot + 976
1   libBLAS.dylib                   0x00007fff9322ee8d SDOT + 16
2   doc2vec_inner.so                0x00000001102b71ea __pyx_f_5trunk_6gensim_6models_13doc2vec_inner_our_dot_double + 10 (doc2vec_inner.c:1455)
3   doc2vec_inner.so                0x00000001102c41fa __pyx_f_5trunk_6gensim_6models_13doc2vec_inner_fast_document_dbow_hs + 202 (doc2vec_inner.c:1678)
4   doc2vec_inner.so                0x00000001102c3909 __pyx_pw_5trunk_6gensim_6models_13doc2vec_inner_1train_document_dbow + 13769 (doc2vec_inner.c:4102)
5   org.python.python               0x000000010086091a PyEval_EvalFrameEx + 21166
6   org.python.python               0x00000001007e39e5 gen_send_ex + 169
7   org.python.python               0x00000001007c86f1 PyIter_Next + 16
8   org.python.python               0x0000000100859c31 builtin_sum + 378
9   org.python.python               0x0000000100860869 PyEval_EvalFrameEx + 20989
10  org.python.python               0x000000010085b4b5 PyEval_EvalCodeEx + 1622
11  org.python.python               0x0000000100863b38 fast_function + 321
12  org.python.python               0x00000001008606fb PyEval_EvalFrameEx + 20623
13  org.python.python               0x000000010085b4b5 PyEval_EvalCodeEx + 1622
14  org.python.python               0x00000001007e922f function_call + 372
15  org.python.python               0x00000001007c8e2a PyObject_Call + 103
16  org.python.python               0x0000000100860daf PyEval_EvalFrameEx + 22339
17  org.python.python               0x0000000100863ac2 fast_function + 203
18  org.python.python               0x00000001008606fb PyEval_EvalFrameEx + 20623
19  org.python.python               0x0000000100863ac2 fast_function + 203
20  org.python.python               0x00000001008606fb PyEval_EvalFrameEx + 20623
21  org.python.python               0x000000010085b4b5 PyEval_EvalCodeEx + 1622
22  org.python.python               0x00000001007e922f function_call + 372
23  org.python.python               0x00000001007c8e2a PyObject_Call + 103
24  org.python.python               0x00000001007da54c method_call + 136
25  org.python.python               0x00000001007c8e2a PyObject_Call + 103
26  org.python.python               0x00000001008631c4 PyEval_CallObjectWithKeywords + 93
27  org.python.python               0x000000010089445b t_bootstrap + 70
28  libsystem_pthread.dylib         0x00007fff8f102268 _pthread_body + 131
29  libsystem_pthread.dylib         0x00007fff8f1021e5 _pthread_start + 176
30  libsystem_pthread.dylib         0x00007fff8f10041d thread_start + 13

I've run the code three times in a row, and the crash logs all point to the same place.
So perhaps numpy wasn't the problem after all?

@gojomo (Collaborator, Author) commented Jul 1, 2015

Yes, that's the discussion. That is a different crash location... but if the bug is in some other code, running a little earlier and clobbering unintended addresses with illegal values, the ultimate crash could happen in a variety of places. (However, the other thread reports not seeing the crash again, perhaps specifically since not using the 'sample' parameter... which it doesn't appear you're using at all. So it's still unclear whether the incidents are related.)

In your original output, it looked like at least one pass in one training mode (dm/c) completed, but then the crash occurred during the 1st pass in the second (dbow) mode. So is that the repeated 'same place' where the crash occurs: always that mode, when run second? If you set logging to DEBUG, does it indicate about the same amount of progress through the data at each crash? How about if you only run the DBOW mode – still crashing the same proportion of the way through? And if you leave out DBOW mode entirely – does it crash somewhere else?

(I'm somewhat skeptical the numpy issue is related – that would seem to necessarily trigger earlier if present. If you step through the triggering steps outlined in the original gensim-related-report, at #131 (comment), can you get that error?)

@gojomo (Collaborator, Author) commented Jul 5, 2015

@e9t – FYI, I found a way to reliably trigger segfaults locally, and fixed the bug responsible – see the commits about "str doctags trigger bad indexes" on PR #380. Specifically, when using string doc tags, if some tags repeat before all tags have been discovered, some tags could be assigned too-high indexes into the vector array (because of a bad assumption of exactly one training example per tag). That would eventually lead to out-of-bounds accesses or writes.

Using repeated tags, while not exactly the mode described in the PV paper, is definitely a supported use-case. It might be reasonable to do so to create tag vectors for tags representing metadata/set-membership that many documents share. But also, feeding the training process "[tag-A] 1 2 3 . 4 5 6 ." as one large example, or "[tag-A] 1 2 3 ." then "[tag-A] 4 5 6 ." as two smaller examples, will cause approximately the same training for [tag-A] to occur. (In pure DBOW, the training is essentially identical; in other modes where the 'window' setting comes into play, there will be differences related to when the window reaches across the sentence boundary.)
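
A sketch of that equivalence (the tag and token values here are made up):

from gensim.models.doc2vec import TaggedDocument

# two smaller examples sharing one tag...
split = [TaggedDocument(words=['1', '2', '3', '.'], tags=['tag-A']),
         TaggedDocument(words=['4', '5', '6', '.'], tags=['tag-A'])]

# ...cause approximately the same training for 'tag-A' as one larger example
combined = [TaggedDocument(words=['1', '2', '3', '.', '4', '5', '6', '.'], tags=['tag-A'])]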

Can you check if the fix in that PR resolves your crash? (The essential change is just the one line: https://github.com/piskvorky/gensim/pull/380/files#diff-e71d1aecc3d6bb450f077300f2cf763dR293 )

@e9t commented Jul 5, 2015

Sorry for the late reply.

Well, I've enabled DEBUG logging and found that the crash does not occur at exactly the "same place".
Running the code three times, each run crashed during the 1st pass in the third (dm/m) mode, but at a different job each time: 3, 7, and 11.

The GOOD NEWS is that yes, that bug fix #380 did the trick.
My code now runs perfectly. Thanks for figuring it out!

@piskvorky (Owner)

@gojomo @cscorley I think this release warrants a version bump to 0.12 (rather than just 0.11.2)... what do you think?

Changelog: https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt

@gojomo (Collaborator, Author) commented Jul 5, 2015

👍 – version increments are free, and features (and API changes) warrant it. (About to make a few small CHANGELOG tweaks.)

@gojomo gojomo deleted the bigdocvec_pr branch July 9, 2015 12:18
@vierja commented Aug 21, 2015

@gojomo is the infer_vector only applicable to a doc2vec model? Or can it be used with word2vec?

@gojomo (Collaborator, Author) commented Aug 21, 2015

It only works for Doc2Vec. (And, if presenting text for inference that contains new words, those words are treated like any other unknown words – dropped before analysis.)
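
For example (a sketch; 'd2v_model' stands in for any already-trained Doc2Vec model):

new_vec = d2v_model.infer_vector(['some', 'new', 'document', 'words'])
# any words absent from the model's training vocabulary are simply dropped before inference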

It's possible a similar mechanism could be offered for inferring word vectors. In many ways the work in #435 supports similar goals.

@vierja commented Aug 21, 2015

@gojomo thanks for the reply.

I meant: is it possible to generate a paragraph-vector representation from a Word2Vec model, using the same technique/code as infer_vector? I have a Word2Vec model; can I use it to infer vectors for a list of paragraphs without training a Doc2Vec model from scratch?

@gojomo (Collaborator, Author) commented Aug 21, 2015

The Doc2Vec algorithms (from the 'Paragraph Vectors' paper) do not start with word vectors, then create doc vectors. Rather, they train doc vectors from text (and only sometimes, in some training methods, generate word vectors as part of that process).

So: no. The inference requires a trained up Doc2Vec model. Having preexisting word vectors isn't a typical/required input for creating doc-vectors (in this algorithm).

(It's intuitively plausible that seeding some kinds of Doc2Vec models with pre-existing word vectors might offer a benefit, but a few small experiments I've done in that direction have had mixed results. There's other research about ways to create sentence/document vectors which do in fact require word vectors first; those algorithms aren't currently in gensim.)
