big doc-vector refactor/enhancements #356
Conversation
…for padding; layer1_size potentially different than vector_size; parameter renames for clarity; one-time neg_lables precalc
…ence_dm_concat, infer_vector_dm_concat methods
… ('...lock factor')
pep8 & python2 fixes to doc2vec notebook
@piskvorky the notebook & most other work was done in py3.4/OSX first, but I've also regularly run the notebook and other tests on py2.7/ubuntu.

Thanks for the PR, merged!
@gojomo The vocabulary phase definitely isn't the problem. I had logging on during the attempt and saw that it completed building the vocabulary with about the same memory usage as word2vec. From memory (I lost the exact numbers due to the crash; I should have logged to a file), there are about 4 million words before pruning, much fewer afterwards. The crash happened immediately after seeing a log about resetting the layer weights. The training script is very straightforward, as you can see below. A single line from the input data definitely won't blow memory.

```python
import sys
import gzip
import logging

from gensim.models.doc2vec import TaggedDocument, Doc2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class TaggedLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for uid, line in enumerate(gzip.open(self.filename, 'rb')):
            yield TaggedDocument(words=line.split(), tags=[uid])


sentences = TaggedLineSentence(sys.argv[1])

model = Doc2Vec(alpha=0.025, min_alpha=0.025, docvecs_mapfile='mapfile')  # use fixed learning rate
model.build_vocab(sentences)

for epoch in range(10):
    print(epoch)
    model.train(sentences)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# store the model
model.save(sys.argv[2])
```
Hmm. If build_vocab() completes, I think you would see 'mapfile' appear in the working directory: Doc2Vec should have written random initialization vectors across its entire extent. (Do you see the '0' epoch print?)
No, it doesn't make it that far. I've just tried with a small subset of the data. After finishing the vocab with the small data, it's using ~100 MB of memory. Somewhere after that it blows up; I'm beginning to suspect this might be OS X behaving poorly with the mmap'd file, trying to load the entire thing into its concept of virtual memory, resulting in swap-file death. I've tried to limit the process's memory via ulimit, but it doesn't seem to work. I'm going to try this on a Linux system to see if it's OS-related.
Let me merge this PR. We can continue the discussion here, as well as open new PRs for fixes / improvements.
And it goes without saying -- massive thanks to @gojomo for his epic refactor!

Great changes! When will these be available via pip?
I don't know @piskvorky's plans for a numbered release, but you can always pip install from a github branch. For example, this should do the trick (because 'develop' is the default branch for /piskvorky/gensim):
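The exact command was elided above. As an illustration only (the URL and branch name are assumptions, based on the repository and default branch mentioned in this thread), a pip install from a GitHub branch generally looks like:

```shell
# Install gensim straight from the 'develop' branch on GitHub
# (repository path assumed from the discussion above).
pip install git+https://github.com/piskvorky/gensim.git@develop
```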
Are you sure you want to totally remove it? Some versioning schemes (e.g. semver) expect minor version changes to be backwards-compatible, and people used to this may be confused, since this feature merge is only a minor version bump.

How about using a DeprecationWarning? Anyway, thanks for the great work!
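A deprecation shim along the suggested lines could look like this minimal sketch. The class names come from this PR's renaming, but the implementation here is illustrative, not gensim's actual code:

```python
import warnings


class TaggedDocument(object):
    """The new class name (per this PR's renaming)."""
    def __init__(self, words, tags):
        self.words = words
        self.tags = tags


class LabeledSentence(TaggedDocument):
    """Deprecated alias: still works, but warns callers to migrate."""
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "LabeledSentence is deprecated; use TaggedDocument instead",
            DeprecationWarning, stacklevel=2)
        super(LabeledSentence, self).__init__(*args, **kwargs)
```

Existing code keeps running for a release cycle, while `python -W error::DeprecationWarning` surfaces every remaining call site.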
@gojomo, I got your notebook code to work on my Macbook and reproduced similar results. (Awesome!)

```
# Load corpus
labeledTrainData_clean.tsv
testData_clean.tsv
unlabeledTrainData_clean.tsv
# Set-up Doc2Vec Training & Evaluation Models
Doc2Vec(dm/c,d100,n5,w5,mc2,t8)
Doc2Vec(dbow,d100,n5,mc2,t8)
Doc2Vec(dm/m,d100,n5,w10,mc2,t8)
# Bulk Training
START 2015-07-01 01:20:52.256642
*0.392200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 54.5s 0.5s
*0.392000 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 54.5s 2.1s
Segmentation fault: 11
```

I'm using a Macbook, so this is probably related to numpy/numpy#4007.
I wouldn't be confident it's related to the numpy issue – I've not had random/numpy segfaults while developing this on OSX. (All the many segfaults I've worked through have been traceable to my own genuine bugs.)

I didn't think any Macbooks had 8 cores; did you manually set that number of workers? (I'd recommend just the number of real cores, though that shouldn't risk a segfault.)

Does it happen every time, at roughly the same time, when using string tags? And never when using int IDs? The string tags should definitely work – they just may prove unwieldy if you get to tens-of-millions of docs. Are your tags relatively short, and just one tag to a document?

Is there a chance any of your documents are > 10,000 words? (That's currently the hard limit, though I hope to loosen it, and going over it should only cause truncation rather than crashes.)

If you can reliably trigger segfaults with a small test case, I'll have other ideas for details to collect. (First among them: enabling core dumps and checking exactly where the fault is happening.)
My Macbook has an i7 CPU with 4 physical cores, but hyperthreading gives me 8 logical cores, which is probably the reason.

Until now, yes. (The sample isn't so big, but I have run each 5+ consecutive times.)

Yes and yes; the max length of the tags is 6, and there's just one tag per document.

No, document lengths are all shorter than 10,000 words.

I'll see if I can create a toy case and see what I can do to spot the problem. It may just be a special case, or my own bug as you suggested, but if I find anything, I'll open a new issue. Thanks!
I'd somehow overlooked that the i7 had hyperthreading (I usually do this dev in an OSX VM with 4 virtual cores)... it turns out 8 virtual cores do incrementally help: about a 14% speedup over my previous configuration of 4, in a quick test. So by all means keep using 8. Unless you're doing C/cython stuff, any segfaulting bug is far more likely to come from the optimized cython/BLAS code, which can easily ignore boundaries or hold pointers past memory reuse. So please keep me posted on any triggering patterns you find. (There is one other report of a segfault on the discussion list.)
You are referring to this discussion, right? I looked at my crash logs and found the same exception type and codes: EXC_BAD_ACCESS (SIGSEGV) and KERN_INVALID_ADDRESS.

```
Thread 2 Crashed:
0   libBLAS.dylib            0x00007fff9325bc95 cblas_sdot + 976
1   libBLAS.dylib            0x00007fff9322ee8d SDOT + 16
2   doc2vec_inner.so         0x00000001102b71ea __pyx_f_5trunk_6gensim_6models_13doc2vec_inner_our_dot_double + 10 (doc2vec_inner.c:1455)
3   doc2vec_inner.so         0x00000001102c41fa __pyx_f_5trunk_6gensim_6models_13doc2vec_inner_fast_document_dbow_hs + 202 (doc2vec_inner.c:1678)
4   doc2vec_inner.so         0x00000001102c3909 __pyx_pw_5trunk_6gensim_6models_13doc2vec_inner_1train_document_dbow + 13769 (doc2vec_inner.c:4102)
5   org.python.python        0x000000010086091a PyEval_EvalFrameEx + 21166
6   org.python.python        0x00000001007e39e5 gen_send_ex + 169
7   org.python.python        0x00000001007c86f1 PyIter_Next + 16
8   org.python.python        0x0000000100859c31 builtin_sum + 378
9   org.python.python        0x0000000100860869 PyEval_EvalFrameEx + 20989
10  org.python.python        0x000000010085b4b5 PyEval_EvalCodeEx + 1622
11  org.python.python        0x0000000100863b38 fast_function + 321
12  org.python.python        0x00000001008606fb PyEval_EvalFrameEx + 20623
13  org.python.python        0x000000010085b4b5 PyEval_EvalCodeEx + 1622
14  org.python.python        0x00000001007e922f function_call + 372
15  org.python.python        0x00000001007c8e2a PyObject_Call + 103
16  org.python.python        0x0000000100860daf PyEval_EvalFrameEx + 22339
17  org.python.python        0x0000000100863ac2 fast_function + 203
18  org.python.python        0x00000001008606fb PyEval_EvalFrameEx + 20623
19  org.python.python        0x0000000100863ac2 fast_function + 203
20  org.python.python        0x00000001008606fb PyEval_EvalFrameEx + 20623
21  org.python.python        0x000000010085b4b5 PyEval_EvalCodeEx + 1622
22  org.python.python        0x00000001007e922f function_call + 372
23  org.python.python        0x00000001007c8e2a PyObject_Call + 103
24  org.python.python        0x00000001007da54c method_call + 136
25  org.python.python        0x00000001007c8e2a PyObject_Call + 103
26  org.python.python        0x00000001008631c4 PyEval_CallObjectWithKeywords + 93
27  org.python.python        0x000000010089445b t_bootstrap + 70
28  libsystem_pthread.dylib  0x00007fff8f102268 _pthread_body + 131
29  libsystem_pthread.dylib  0x00007fff8f1021e5 _pthread_start + 176
30  libsystem_pthread.dylib  0x00007fff8f10041d thread_start + 13
```

I've run the code three times in a row, and the crash logs all pointed at the same place.
Yes, that's the discussion. That is a different crash location... but if the bug is some other code, running a little earlier and clobbering unintentional addresses with illegal values, the ultimate crash could happen in a variety of places. (However, the other thread reports not seeing the crash again, perhaps specifically since not using the 'sample' parameter... which it doesn't appear you're using at all. So it's still unclear whether the incidents are related.)

In your original output, it looked like at least one pass in one training mode (dm/c) completed, but then the crash occurred during the 1st pass in the second (dbow) mode. So, is that the repeated 'same place' a crash occurs: always that mode when run second? If you put logging to DEBUG, does it indicate about the same amount of progress through the data each crash? How about if you only run the DBOW mode – still crashing the same proportion of the way through? How about if you leave out DBOW mode entirely – does it crash another place?

(I'm somewhat skeptical the numpy issue is related – that would seem to necessarily trigger earlier if present. If you step through the triggering steps outlined in the original gensim-related report, at #131 (comment), can you get that error?)
@e9t – FYI, I found a way to reliably trigger segfaults locally, and fixed the bug responsible – see the commits about "str doctags trigger bad indexes" on PR #380. Specifically, if both using string doc tags, and some tags repeat before all tags are discovered, some tags could be assigned too-high indexes into the vector array (because of a bad assumption of exactly one training example per tag). That'd eventually lead to out-of-bounds accesses or writes.

Using repeated tags, while not exactly the mode described in the PV paper, is definitely a supported use-case. It might be reasonable to do so to create tag vectors for tags representing metadata/set-membership that many documents share. But also, feeding the training process "[tag-A] 1 2 3 . 4 5 6 ." as one large example, or "[tag-A] 1 2 3 ." then "[tag-A] 4 5 6 ." as two smaller examples, will cause approximately the same training for [tag-A] to occur. (In pure DBOW, the training is essentially identical; in other modes where the 'window' setting comes into play, there will be differences related to when the window reaches across the sentence boundary.)

Can you check if the fix in that PR resolves your crash? (The essential change is just the one line: https://github.com/piskvorky/gensim/pull/380/files#diff-e71d1aecc3d6bb450f077300f2cf763dR293 )
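To make the repeated-tags equivalence concrete, here's a tiny sketch, using a namedtuple stand-in for gensim's TaggedDocument and an invented toy corpus. In pure DBOW, both feeding styles yield the same (word, tag) training pairs for [tag-A]:

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# One large example for tag-A...
one_big = [TaggedDocument(words="1 2 3 . 4 5 6 .".split(), tags=["tag-A"])]

# ...or the same words split into two smaller examples sharing the repeated tag.
two_small = [
    TaggedDocument(words="1 2 3 .".split(), tags=["tag-A"]),
    TaggedDocument(words="4 5 6 .".split(), tags=["tag-A"]),
]

def dbow_pairs(corpus):
    """Enumerate the (word, tag) training pairs a pure-DBOW pass would see."""
    return [(w, t) for doc in corpus for t in doc.tags for w in doc.words]
```

Both corpora produce the identical pair sequence, which is why the two feeding styles train [tag-A] the same way in DBOW (window-based modes differ only where the window would reach across the sentence boundary).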
Sorry for the late reply. Well, I've enabled DEBUG, and found that the crash does not occur at exactly the "same place". The GOOD NEWS is that yes, the bug fix in #380 did the trick.
@gojomo @cscorley I think this release warrants a version bump to 0.12 (rather than just 0.11.2)... what do you think? Changelog: https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt
👍 – version increments are free, and features (and API changes) warrant it. (About to make a few small CHANGELOG tweaks.)
@gojomo is the |
It only works for Doc2Vec. (And, if presenting text for inference that contains new words, those words are treated like any other unknown words – dropped before analysis.) It's possible a similar mechanism could be offered for inferring word vectors. In many ways the work in #435 supports similar goals.
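The unknown-word handling described here amounts to a simple vocabulary filter applied before inference; a minimal sketch (the function name and toy vocabulary are invented for illustration):

```python
def drop_unknown(words, vocab):
    """Keep only words present in the model's vocabulary, mirroring how
    inference text containing new words is treated: unknowns are dropped."""
    return [w for w in words if w in vocab]

# Invented toy vocabulary standing in for a trained model's word set.
vocab = {"the", "cat", "sat"}
```

Any word outside the vocabulary simply contributes nothing to the inferred vector.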
@gojomo thanks for the reply. I meant like, is it possible to generate a paragraph vector representation using a Word2Vec model, using the same technique/code in
The Doc2Vec algorithms (from the 'Paragraph Vectors' paper) do not start with word vectors, then create doc vectors. Rather, they train doc vectors from text (and only sometimes, in some training methods, generate word vectors as part of that process). So: no. The inference requires a trained-up Doc2Vec model. Having preexisting word vectors isn't a typical/required input for creating doc-vectors (in this algorithm).

(It's intuitively plausible that seeding some kinds of Doc2Vec models with pre-existing word vectors might offer a benefit, but a few small experiments I've done in that direction have had mixed results. There's other research about ways to create sentence/document vectors which do in fact require word vectors first; those algorithms aren't currently in gensim.)
Ready for review/testing!
The headline changes are optimized doc-vector inference, and separating doc-vectors (during training or comparison) from the word-vocabulary – allowing many more docs (via memmap-backing) than words, and avoiding some confusion.
Many smaller changes include a DM-concatenative-context mode (as recommended in the Paragraph Vectors paper), an optional 'lock_factor' to attenuate training of some vectors, and other optimizations/cleanup.
See gensim/test/test_doc2vec.py to do a quick check on a new system or survey some API possibilities. See docs/notebook/doc2vec-IMDB.ipynb for a walkthrough of reproducing the PV paper IMDB sentiment experiment.
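The memmap-backed doc-vector storage mentioned above (cf. the docvecs_mapfile parameter used earlier in this thread) can be sketched with a plain numpy memmap; the file name and sizes here are invented for illustration, not gensim's internals:

```python
import os
import tempfile

import numpy as np

# Invented sizes: many docs, modest vector dimensionality.
n_docs, dim = 10000, 100

path = os.path.join(tempfile.mkdtemp(), "doctag_vectors.mmap")

# Vectors live on disk; the OS pages them in on demand, so the number of
# docs can exceed what would fit comfortably in RAM.
vecs = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_docs, dim))
vecs[3] = 0.5  # write one doc-vector's worth of values
vecs.flush()

# Re-open read-only, as a later comparison/query phase might.
readback = np.memmap(path, dtype=np.float32, mode="r", shape=(n_docs, dim))
```

Mode "w+" creates the file zero-filled, so untouched rows read back as zeros until training overwrites them.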
If you used Doc2Vec previously, a few key changes to note:

- `LabeledSentence` is now `TaggedDocument`, and the (one or more) document labels that correspond to vectors are now referred to as 'tags'. Greatest memory efficiency is possible by using only int tags, contiguous and ascending from 0.
- Use `d2v_model.docvecs[doc_tag]` or `d2v_model.docvecs.most_similar(doc_tag)`, rather than `d2v_model` directly.

Some top needs: