conversion function naming #1270

Closed
amueller opened this issue Apr 10, 2017 · 30 comments
Labels
difficulty easy (Easy issue: required small fix) · documentation (Current issue related to documentation)

Comments

@amueller
Contributor

amueller commented Apr 10, 2017

Hey. I'm trying to go from the CSR format used in scikit-learn to the gensim format and I'm a bit confused.
There are some instructions here:
https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy

But the naming seems odd. Why is "corpus to CSC" the inverse of "sparse to corpus"?
Looking at the helper functions here is even more confusing imo.

Does "corpus" mean an iterator over lists of tuples or what is the interface here?
There are some other functions like:


gensim.matutils.sparse2full(doc, length)

    Convert a document in sparse document format (=sequence of 2-tuples) into a dense np array (of size length).

and full2sparse. In this context "sparse" means sequence of 2-tuples, while in the "Sparse2Corpus" the "sparse" means "scipy sparse matrix".
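To make that first meaning concrete, here is a minimal pure-Python sketch of the round trip those two helpers describe. The names `sparse_to_full` and `full_to_sparse` and the `eps` threshold are assumptions made for illustration; this is not gensim's implementation, which works with NumPy arrays.

```python
def sparse_to_full(doc, length):
    """Expand a gensim-style sparse document into a dense list of floats."""
    full = [0.0] * length
    for feature_id, weight in doc:
        full[feature_id] = float(weight)
    return full


def full_to_sparse(vec, eps=1e-9):
    """Keep only entries whose magnitude exceeds eps, as (id, weight) pairs."""
    return [(i, float(w)) for i, w in enumerate(vec) if abs(w) > eps]


# "Sparse" in the gensim sense: a sequence of (feature_id, weight) 2-tuples.
doc = [(0, 1.0), (3, 2.5)]
dense = sparse_to_full(doc, length=5)
round_trip = full_to_sparse(dense)
```

Here `dense` is a plain list of length 5 and `round_trip` recovers the original 2-tuples.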

Is it possible to explain what "sparse", "scipy", "dense" and "corpus" mean in all these functions? It seems to me like there is no consistent convention.

@tmylk
Contributor

tmylk commented Apr 10, 2017

In this context corpus is a list/iterator/generator of tuples in bag of words format.

There is more context in any2sparse

What is the context of this conversion?
We have sklearn pipeline interface for LDA and LSI in gensim as described in this ipynb

@amueller
Contributor Author

I want to use the word2vec though ;)
Maybe it would be good to have a general wrapper that can be applied to any transformation?

The context is that I'm trying to teach my students about word2vec using gensim and we have only used the sklearn representation so far. I think I got the representation but I'm still confused by the naming.

So in any2sparse, sparse again means the gensim format, so the opposite from "SparseToCorpus", i.e. these go into the same direction even though the naming suggests they go in opposite directions?
any2sparse also only works on lists of vectors it seems, which makes sense for streaming but is not what you'd have in sklearn.

@tmylk
Contributor

tmylk commented Apr 10, 2017

The input to word2vec is not a corpus aka a list of tuples, but an iterable of lists of words - sentences.
For example LineSentence
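As a rough illustration of what "an iterable of lists of words" looks like, the sketch below streams one tokenized sentence per line. The class name `LineSentences` and the `open_stream` callable are hypothetical and only mimic the idea; this is not gensim's LineSentence class.

```python
import io


class LineSentences:
    """Restartable iterable yielding one list of tokens per non-empty line."""

    def __init__(self, open_stream):
        # open_stream is a zero-argument callable returning a fresh text stream,
        # so the data can be iterated more than once (a one-shot generator cannot).
        self.open_stream = open_stream

    def __iter__(self):
        with self.open_stream() as stream:
            for line in stream:
                tokens = line.split()
                if tokens:
                    yield tokens


text = "first sentence\n\nsecond sentence\n"
sentences = LineSentences(lambda: io.StringIO(text))
first_pass = list(sentences)
second_pass = list(sentences)
```

Making the object restartable matters because training typically walks the data several times: once to collect the vocabulary and again for each epoch.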

@tmylk
Contributor

tmylk commented Apr 10, 2017

Actually the simplest gensim - sklearn word2vec integration code is in shorttext package
https://pythonhosted.org/shorttext/tutorial_sumvec.html

@amueller
Contributor Author

I only want to transform, not train, so then the interface is word-based, right?

@amueller
Contributor Author

thanks for the hint for shorttext. That doesn't have paragraph2vec, though, right? Btw, is there a pretrained model for that?

@tmylk
Contributor

tmylk commented Apr 10, 2017

model.wv[['office', 'products']] returns the vector representation as in shorttext here

Not aware of large doc2vec pre-trained model. This week there will be a small trained doc2vec model with tensorboard viz in this PR by @parulsethi

@amueller
Contributor Author

@tmylk awesome, thanks!
Still think you need to work on your conversion function naming ;)

@tmylk added the "documentation" and "difficulty easy" labels on Apr 10, 2017
@amueller
Contributor Author

pretrained doc2vec here: https://github.com/jhlau/doc2vec though unclear if that's applicable to other domains.

@amueller
Contributor Author

somewhat unrelated, have you thought about including the feature of using a pretrained word model for the doc2vec as done here jhlau@9dc0f79 ?

@tmylk
Contributor

tmylk commented Apr 10, 2017

Initializing word vectors from pre-trained ones can be done manually on the main branch without that fork, though @gojomo has debated on the mailing list whether that's helpful or not.

@amueller
Contributor Author

Hm, I upgraded to 1.0.1 and model.wv still doesn't exist.

@tmylk
Contributor

tmylk commented Apr 10, 2017

That is strange

import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
model['first']
model.wv['first']

@amueller
Contributor Author

Ah, it's probably because I load the model?

from gensim import models
w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)

@tmylk
Contributor

tmylk commented Apr 10, 2017

That is not the model, just the vectors from the model :) You cannot train them, only run read-only queries against them.
w['first'] will work then

@gojomo
Collaborator

gojomo commented Apr 11, 2017

Looking at this issue history, I see @amueller comments that seem to be reacting to @tmylk answers... but no @tmylk comments at all. Some github bug?

If you're loading directly into a KeyedVectors, no need to access the .wv property - your object is already the vectors.
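One way to write code that works with either kind of object is a tiny accessor that falls back to the object itself when there is no .wv attribute. This is only a sketch: `as_keyed_vectors`, `ToyVectors` and `ToyModel` are made-up names, and the toy classes merely mimic the attribute layout described above rather than any gensim class.

```python
def as_keyed_vectors(obj):
    """Return obj.wv when it exists, otherwise assume obj already holds the vectors."""
    return getattr(obj, "wv", obj)


# Toy stand-ins, NOT gensim classes: they only mimic the attribute layout.
class ToyVectors(dict):
    """Plays the role of a bare set of word vectors (word -> vector)."""


class ToyModel:
    """Plays the role of a trainable model that keeps its vectors under .wv."""

    def __init__(self, vectors):
        self.wv = vectors


vectors = ToyVectors(first=[0.1, 0.2])
from_model = as_keyed_vectors(ToyModel(vectors))   # goes through .wv
from_vectors = as_keyed_vectors(vectors)           # already the vectors
```

Both calls end up with the same lookup object, so `from_model["first"]` and `from_vectors["first"]` behave identically.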

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format(). And folks who study the source/structures can try stitching such a model together (similar to the @jhlau code referenced). But it seems to me a lot more people want to do that, than have a good reason or understanding for why it should be done, and so I'd like to see some experiments/write-ups demonstrating the value (and limits) of such an approach before adding any further explicit support. (One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.)

@jhlau

jhlau commented Apr 11, 2017

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format(). And folks who study the source/structures can try stitching such a model together (similar to the @jhlau code referenced). But it seems to me a lot more people want to do that, than have a good reason or understanding for why it should be done, and so I'd like to see some experiments/write-ups demonstrating the value (and limits) of such an approach before adding any further explicit support. (One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.)

We've done just that. It's all documented in this paper: https://arxiv.org/abs/1607.05368

Long story short, pre-trained word embeddings help most when you are using a small document collection (e.g. a special domain of text) when training doc2vec.

@piskvorky
Owner

piskvorky commented Apr 12, 2017

I can also see only @amueller 's side of the conversation.

Sparse2Corpus should probably be called Scipy2Sparse, for consistency.

The confusion comes from the fact that both scipy and gensim have been calling their data structure "sparse", for almost a decade now... :( In scipy, it denotes a sparse matrix in CSR / CSC / whatever; in gensim it's anything that you can iterate over, yielding iterables of (feature_id, feature_weight) 2-tuples.
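The two meanings can be put side by side in a small pure-Python sketch. It assumes a CSC layout with one document per column, stored as the usual three arrays (data, indices, indptr), and converts to and from the gensim-style stream of 2-tuples. The function names are hypothetical and hand-rolled for illustration; they are not the gensim.matutils helpers themselves.

```python
def csc_columns_to_corpus(data, indices, indptr):
    """Yield one gensim-style document per column of a CSC matrix."""
    for start, end in zip(indptr, indptr[1:]):
        yield list(zip(indices[start:end], data[start:end]))


def corpus_to_csc_arrays(corpus):
    """Build CSC arrays (data, indices, indptr) with one column per document."""
    data, indices, indptr = [], [], [0]
    for doc in corpus:
        for feature_id, weight in sorted(doc):
            indices.append(feature_id)
            data.append(weight)
        indptr.append(len(data))
    return data, indices, indptr


# 3 features (rows) x 2 documents (columns):
#   document 0 has feature 0 (weight 1.0) and feature 2 (weight 2.0)
#   document 1 has feature 1 (weight 3.0)
corpus = [[(0, 1.0), (2, 2.0)], [(1, 3.0)]]
data, indices, indptr = corpus_to_csc_arrays(corpus)
restored = list(csc_columns_to_corpus(data, indices, indptr))
```

A scikit-learn matrix normally has documents as rows rather than columns, so under this sketch's assumptions it would need transposing before being read column by column.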

Maybe call it "gensim-sparse" vs "scipy-sparse"?

I'm also +1 on renaming the generic gensim structure to something else entirely. "Sparse" is taken (scipy). "Corpus" is taken (NLP). Any other ideas?

@gojomo
Collaborator

gojomo commented Apr 12, 2017

@jhlau Thanks for your comment & analysis - but I found some of the parameter-choices and evaluations/explanations in your paper confusing, to the point of not being convinced of that conclusion. Some of my observations are in the gensim forum messages at https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/lBKGf7WNDwAJ and https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/j5OKViKzEgAJ. As an example, the claim in section 5 – "More importantly, using pre-trained word embeddings never harms the performance" – is directly contradicted by the above-referenced table, where on several of the subcollections, the non-pretrained DBOW outperforms either one or the other choice of pretrained word-vectors. (And on the 'programmers' forum, it outperforms both.)

@jhlau

jhlau commented Apr 12, 2017

but I found some of the parameter-choices and evaluations/explanations in your paper confusing, to the point of not being convinced of that conclusion.

Not sure what you are confused about, but looking at your comments on the links:

Pure PV-DBOW (dm=0, dbow_words=0) mode is fast and often a great performer on downstream tasks. It doesn't consult or create the traditional (input/'projection-layer') word-vectors at all. Whether they are zeroed-out, random, or pre-loaded from word-vectors created earlier won't make any difference.

PV-DBOW with concurrent skip-gram training (dm=0, dbow_words=1) will interleave wordvec-training with docvec-training. It can start with random word-vectors, just like plain word-vector training, and learn all the word-vectors/doc-vectors together, based on the current training corpus. The word-vectors and doc-vectors will influence each other, for better or worse, via the shared hidden-to-output-layer weights. (The training is slower, and the doc-vectors essentially have to 'share the coordinate space' with the word-vectors, and with typical window values the word vectors are in-aggregate getting far more training cycles.)

PV-DM (dm=1) inherently mixes word- and doc-vectors during every training example, but also like PV-DBOW+SG, can start with random word-vectors and learn all that's needed, from the current corpus, concurrently during training.

That seems to correspond to my understanding of doc2vec. What we found is that pure PV-DBOW ('dm=0, dbow_words=0') is pretty bad. PV-DBOW is generally the best option ('dm=0, dbow_words=1'), and PV-DM ('dm=1') at best performs on-par with PV-DBOW, but is often slightly worse and requires more training iterations (since its parameter size is much larger).

Feel free to ask any specific questions that you feel are not clear. I wasn't aware of any of these discussions as no one has tagged me.

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format().

This function does not really work, as it uses pre-trained embeddings only for words that are in the model. The forked version of gensim that I've built on the other hand also loads new word embeddings. That is the key difference.

in section 5 – "More importantly, using pre-trained word embeddings never harms the performance" – is directly contradicted by the above-referenced table, where on several of the subcollections, the non-pretrained DBOW outperforms either one or the other choice of pretrained word-vectors. (And on the 'programmers' forum, it outperforms both.)

In section 5, table 6, what we really meant is that adding pre-trained word vectors doesn't harm performance substantially. Overall, we see that using pre-trained embeddings is generally beneficial for small training collections, and in the worst case, it'd give similar performance, and therefore there's little reason not to do it.

@piskvorky
Owner

Nice thread hijacking! 😆

Perhaps mailing list better?

@amueller
Contributor Author

I take all the blame for mixing about 10 issues into one.

The confusion comes from the fact that both scipy and gensim have been calling their data structure "sparse", for almost a decade now

exactly, that was confusing for me.
If you use "corpus" in some consistent way that would be fine for me, but I'm not an NLP person, and an NLP person might be confused by that. Not sure what kind of data structures nltk has, for example.

@amueller
Contributor Author

One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.

Can you give a reference for that - even how that works? That's not described in the original paper, right? [Sorry for hijack-continuation, I'm already on too many mailing lists. Maybe a separate issue?]

@gojomo
Collaborator

gojomo commented Apr 12, 2017

@piskvorky - Could discuss on gensim list if @jhlau would also like that forum, but keeping full context here for now.

@jhlau -

For background, I am the original implementor of the dm_concat and dbow_words options in gensim, and the intersect_word2vec_format() method.

That seems to correspond to my understanding of doc2vec. What we found is that pure PV-DBOW ('dm=0, dbow_words=0') is pretty bad. PV-DBOW is generally the best option ('dm=0, dbow_words=1'), and PV-DM ('dm=1') at best performs on-par with PV-DBOW, but is often slightly worse and requires more training iterations (since its parameter size is much larger).

I didn't see any specific measurements in the paper about pure PV-DBOW – am I misreading something? (There, as here, I only see statements to the effect of, "we tried it but it was pretty bad".)

As mentioned in my 2nd-referenced-message, comparing pure PV-DBOW with arguments like dm=0, dbow_words=0, iter=n against PV-DBOW-plus-skip-gram with arguments like dm=0, dbow_words=1, window=15, iter=n may not be checking as much the value of words, but the value of the 16X-more training effort (which happens to be mostly focused on words). A more meaningful comparison would be dm=0, dbow_words=0, iter=15*n vs dm=0, dbow_words=1, window=15, iter=n – which I conjecture would have roughly the same runtime. With no indication such an apples-to-apples comparison was made, I can't assign much weight to the unquantified "pretty bad" assessment.
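The "16X" figure can be reproduced with back-of-envelope arithmetic. The sketch below assumes, as a rough model only, that every token costs one doc-vector prediction and that dbow_words=1 adds about `window` skip-gram predictions per token on average. The function `approx_updates` is hypothetical and counts predictions; it does not measure gensim's actual runtime.

```python
def approx_updates(n_tokens, epochs, dbow_words, window):
    """Rough count of prediction updates under the stated assumptions."""
    per_token = 1                # the doc-vector predicts the token (PV-DBOW)
    if dbow_words:
        per_token += window      # assumed average number of skip-gram pairs
    return n_tokens * epochs * per_token


n, tokens = 10, 1_000_000
pure_same_epochs = approx_updates(tokens, n, dbow_words=0, window=15)
with_words = approx_updates(tokens, n, dbow_words=1, window=15)
pure_more_epochs = approx_updates(tokens, 15 * n, dbow_words=0, window=15)

effort_ratio = with_words / pure_same_epochs      # extra effort from word training
budget_ratio = pure_more_epochs / with_words      # 15x epochs vs. word training
```

Under these assumptions the word-training run does 16 times the updates at equal epochs, while pure PV-DBOW with 15 times the epochs lands at roughly the same budget.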

From the paper's description & your posted code, it appears all pvdm tests were done with the non-default dm_concat=1 mode. As noted in my message, I've not yet found any cases where this mode is worth the massive extra time/memory overhead. (It's unfortunate that the original Mikolov/Le paper touts this method, but implementations are rare, and so people may think it's the key to their non-reproducible results.) I try to warn all but the most adventurous, rigorous users away from this mode, and perhaps the gensim doc-comment should be even more discouraging. But the upshot is that if all your paper's pvdm tests were with dm_concat=1, they are unlikely to generalize to the more practical and commonly-used dm=1, dm_concat=0 mode.
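A quick calculation suggests where that overhead comes from. The sketch assumes that a concatenating PV-DM input layer stacks the doc-tag vector(s) and the 2*window context word vectors end to end, while the averaging mode keeps a single vector-sized input. `pv_dm_input_width` and `tag_count` are hypothetical names for this illustration, not a gensim API.

```python
def pv_dm_input_width(vector_size, window, dm_concat, tag_count=1):
    """Width of the PV-DM input layer under the stated sizing assumption."""
    if dm_concat:
        # Doc-tag vectors plus `window` context words on each side, concatenated.
        return (tag_count + 2 * window) * vector_size
    # Averaging (or summing) keeps the input at a single vector's width.
    return vector_size


averaged = pv_dm_input_width(vector_size=300, window=5, dm_concat=0)
concatenated = pv_dm_input_width(vector_size=300, window=5, dm_concat=1)
growth = concatenated // averaged
```

With these example numbers the concatenated input is 11 times wider, and the output weights that consume it grow by the same factor.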

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format().

This function does not really work, as it uses pre-trained embeddings only for words that are in the model. The forked version of gensim that I've built on the other hand also loads new word embeddings. That is the key difference.

Yes, but if someone is only computing doc-vectors over a current corpus C, and will be doing further training over just examples from current corpus C, and further inference just using documents from corpus C, why would any words that never appear in C be of any value? Sure, earlier larger corpus P may have pre-trained lots of other words. But any training/inference on C will never update or even consult those slots in the vector array, so why load them?

Now, there might be some vague intuition that bringing in such words could help later, when you start presenting new documents for inference, say from some new set D, that have words that are outside the vocabulary of C, but were in P. But there are problems with this hope:

  • Important aspects of some Doc2Vec modes – especially negative-sampling, and frequent-word-downsampling – require word-frequency information, which is perfectly available for the current corpus C, but (usually so far) not available for P (as in the case of GoogleNews or many other wordvec-sets in word2vec.c-format). So these word-frequency-sensitive modes will do something arbitrary/undefined whenever dealing with these penumbral words. (I see that your version of reset_weights(), later than what @amueller references, at https://github.com/jhlau/gensim/blob/d40a83ab2708b0cc04b03eb919b7b18741c75f90/gensim/models/word2vec.py#L1009 chooses to exempt such P-only words from both downsampling and negative-sampling. That might be a good arbitrary choice! But I've not seen any comparison against the simpler option of just ignoring only-in-P words.) And note that even if you had word-frequencies for P, you'd face the choice of whether you weigh C-frequencies or P-frequencies more as a fair estimate of D-frequencies.
  • Since P is just the input/projection vectors, there are no output-weights for the corresponding extra words/word-encodings. So the syn1 (HS-coding) or syn1neg (negative-sampling) layers have no way to express these extra words, nor has further 'tuning' training on the corpus C created such values (since such words are not seen during that training). We'd thus need a theoretical extension to gensim that rather than just skipping unknown (or downsampled) words on both the 'input' and 'output' sides of the prediction NN, would elide such 'half-known' words as prediction-targets, but still use them as part of input-contexts. (This would be possible, and is analogous to the way that there are categories on the output-side of Facebook's FastText training that aren't part of the word-only input-side, but doesn't yet exist.) While I haven't tested your code, it looks like inference on such D documents (which have words from P not in C) is likely to error in HS mode (since the Vocab entries have no codes/points for P-word HS encodings) and just inject some nonsense-calculation in NS mode (since a P-word's slot in syn1neg is always just the untrained zero-vector). That is to say, the import of these frequency-less, output-weight-less words may not be helping as hoped – testing I've not yet seen would have to be done.
  • Since some words from P have been further adjusted by followup-training over C, but (many?) others have not, the relative distances between those that have been adjusted, and those that have not, can become arbitrarily less meaningful with more training over C (such as the 100 training epochs in your experiments). With enough training over corpus C, any residual influence of the original P-positions of shared words might be diluted to become arbitrarily small - as the vectors become optimized for just the occurrences/word-senses/domain-specialized-meanings in C. At that point, distance-comparisons with P-only-words may no longer have the meaning that was originally induced by competitive/interleaved training in the original same P-session. (This is why intersect_word2vec_format() offers, and by default enables, an optional 'locking' of the imported vectors into their original positions, under the assumption that if you're so confident of the P-vectors you trust them more than your current corpus, they should be the fixed anchors of your new training. The words unique to C, starting from their randomly-initialized positions, are thus forced to move to positions compatible with the original larger P set, rather than the P vectors becoming less P-like with more re-training over C. But this lockf=0.0 default can be changed to 1.0 if you prefer P-sourced vectors to drift. Both are worth testing before assuming either approach is better, and the answer likely varies with corpus and choice of iteration-count.)
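The 'locking' idea from the last point can be pictured as a per-vector multiplier applied to each training update. This is a toy sketch of that idea, assuming the lock factor simply scales the update; `apply_update` is a hypothetical helper, not gensim's training code.

```python
def apply_update(vector, gradient_step, lockf):
    """Move a vector by its training step, scaled by a lock factor in [0, 1]."""
    return [v + lockf * g for v, g in zip(vector, gradient_step)]


imported = [0.5, -0.5]      # a vector brought in from the pretrained set P
step = [0.25, 0.125]        # the update that training on C would apply

frozen = apply_update(imported, step, lockf=0.0)   # stays at its P position
drifting = apply_update(imported, step, lockf=1.0) # free to follow corpus C
```

With lockf=0.0 the imported vector acts as a fixed anchor; with lockf=1.0 it drifts like any other word vector.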

These subtle issues are why I'm wary of a superficially-simple API to "bring in pretrained embeddings". That would make that step seem like an easy win, when I don't yet consider the evidence for that (including your paper) to be strong. And it introduces tradeoffs and unintuitive behaviors with regard to the P-but-not-C vocabulary words, and the handling of D examples with such words.

I see the limits and lock-options of intersect_word2vec_format() as somewhat protecting users from unwarranted assumptions and false intuitions about what imported-vectors might achieve. And even with all this said, if a user really wants words in their model imported from P that have made-up frequency values, and can't be meaningfully tuned by training over C, and may inject some arbitrary randomness in later inference over documents like those in D, I would still suggest leveraging intersect_word2vec_format(). For example, they could add a few synthetic texts to their C corpus, with the extra P words – and these noise docs are unlikely to have much effect on the overall model quality. Or, they can call the three submethods of build_vocab() – scan_vocab(), scale_vocab(), finalize_vocab() – separately, and manually add entries for the extra P words just after scan_vocab(). These few lines of code outside the Word2Vec model can achieve the same effects, but avoid the implied endorsement of an option that presents a high risk of "shooting-self-in-foot".
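The synthetic-texts workaround might look roughly like the sketch below. It assumes that repeating each extra word min_count times in its own noise document is enough for the word to survive the frequency filter. `synthetic_texts` and the example word lists are made up for illustration.

```python
def synthetic_texts(extra_words, min_count):
    """One noise document per extra word, repeating it min_count times."""
    return [[word] * min_count for word in extra_words]


corpus = [["first", "sentence"], ["second", "sentence"]]   # stands in for C
pretrained_only = ["office", "products"]                   # in P but not in C

# The padded corpus is what vocabulary discovery would then see.
padded = corpus + synthetic_texts(pretrained_only, min_count=5)
```

After the vocabulary is built from the padded corpus, the extra words have slots that a later vector import can fill.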

In section 5, table 6, what we really meant is that adding pre-trained word vectors doesn't harm performance substantially. Overall, we see that using pre-trained embeddings is generally beneficial for small training collections, and in the worst case, it'd give similar performance, and therefore there's little reason not to do it.

The benefits in that table generally look small to me, and I suspect they'd be even smaller with the fairer training-time comparison I suggest above. But "never harms" (with italicized emphasis!) was an unsupportable word choice if in fact you really meant 'substantially', and the adjacent data table provides actual examples where pre-trained embeddings harmed the evaluation score. Such a mismatch also lowers my confidence in all nearby claims.

@gojomo
Collaborator

gojomo commented Apr 12, 2017

@amueller

One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.

Can you give a reference for that - even how that works? That's not described in the original paper, right? [Sorry for hijack-continuation, I'm already on too many mailing lists. Maybe a separate issue?]

The original Paragraph Vectors paper only describes that PV-DBOW mode: the doc-vector-in-training, alone, is optimized to predict each word in turn. It's not averaged with any word-vectors, nor does the paper explicitly describe training word-vectors at the same time – though it's a naturally composable approach, given how analogous PV-DBOW is with skip-gram words, with the PV-DBOW doc-vector being like a magic pseudo-word that, within one text example, has an 'infinite' effective window, floating into every context.
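The analogy can be sketched by listing the (input, target) pairs each scheme would train on. This is a simplified illustration: it ignores sampling, random window shrinking and the network itself, and the function names and the "DOC_0" tag are hypothetical.

```python
def pv_dbow_pairs(doc_tag, words):
    """Pure PV-DBOW: the doc-vector alone predicts every word of the text."""
    return [(doc_tag, word) for word in words]


def skipgram_pairs(words, window):
    """Skip-gram: each word predicts its neighbours within the window."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        pairs.extend((center, words[j]) for j in range(lo, hi) if j != i)
    return pairs


words = ["the", "cat", "sat"]
dbow = pv_dbow_pairs("DOC_0", words)
sg = skipgram_pairs(words, window=1)

# Interleaving both sets corresponds to the combined doc + word training.
combined = dbow + sg
```

The doc tag behaves like a word whose window covers the whole text, which is why it shows up as the input for every target word.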

That 'floating word' is indeed how Mikolov's small patch to word2vec.c, adding a -sentence-vector option to demonstrate Paragraph Vectors, worked. The first token on any line was treated as this special, contributes-to-every-context word – and regular word-vector training was always still happening. So by my interpretation, that demonstration was not a literal implementation of the original Paragraph Vectors paper algorithm, but a demo of combined, interleaved PV & word-vector training.

The followup paper, "Document Embeddings with Paragraph Vector" (https://arxiv.org/abs/1507.07998) seems to share my interpretation, because it observes that word-vector training was an extra option they chose (section 3 paragraph 2):

We also jointly trained word embeddings with the paragraph vectors since preliminary experiments showed that this can improve the quality of the paragraph vectors.

However, in the only places this paper compares "PV w/out word-training" against PV-with-word-training, figures 4 and 5, the without-word-training variant is very similar in evaluation score, and even better at 1-out-of-4 comparison points (lower dimensionality in figure 4). And I suspect the same conjecture I've made about @jhlau's results, that using some/all of the time saved from not-training-words to do more iterations of pure-DBOW, would be a fairer comparison and further improve plain PV-DBOW's relative performance.

@jhlau

jhlau commented Apr 12, 2017

I didn't see any specific measurements in the paper about pure PV-DBOW – am I misreading something? (There, as here, I only see statements to the effect of, "we tried it but it was pretty bad".)

Indeed. Its performance is far worse than PV-DBOW with SG, so we omitted those results entirely.

As mentioned in my 2nd-referenced-message, comparing pure PV-DBOW with arguments like dm=0, dbow_words=0, iter=n against PV-DBOW-plus-skip-gram with arguments like dm=0, dbow_words=1, window=15, iter=n may not be checking as much the value of words, but the value of the 16X-more training effort (which happens to be mostly focused on words). A more meaningful comparison would be dm=0, dbow_words=0, iter=15*n vs dm=0, dbow_words=1, window=15, iter=n – which I conjecture would have roughly the same runtime. With no indication such an apples-to-apples comparison was made, I can't assign much weight to the unquantified "pretty bad" assessment.

I disagree that is a fairer comparison. What would be a fairer comparison, though, is that you extract the best performance from both methods. If PV-DBOW without SG takes longer to converge to optimal performance, then yes I agree that one should train it more (but not by arbitrarily setting some 'standardised' epoch number). I did the same when comparing with PV-DM - it uses many more training epochs but the key point is finding its best performance. I might go back and run PV-DBOW without SG to check if this is the case.

From the paper's description & your posted code, it appears all pvdm tests were done with the non-default dm_concat=1 mode. As noted in my message, I've not yet found any cases where this mode is worth the massive extra time/memory overhead. (It's unfortunate that the original Mikolov/Le paper touts this method, but implementations are rare, and so people may think it's the key to their non-reproducible results.) I try to warn all but the most adventurous, rigorous users away from this mode, and perhaps the gensim doc-comment should be even more discouraging. But the upshot is that if all your paper's pvdm tests were with dm_concat=1, they are unlikely to generalize to the more practical and commonly-used dm=1, dm_concat=0 mode.

The intention is about checking the original paragraph vector, so yes I only experiment with dm_concat=1 option. In terms of observations we found what you've seen, that the increased number of parameters is hardly worth it.

Yes, but if someone is only computing doc-vectors over a current corpus C, and will be doing further training over just examples from current corpus C, and further inference just using documents from corpus C, why would any words that never appear in C be of any value? Sure, earlier larger corpus P may have pre-trained lots of other words. But any training/inference on C will never update or even consult those slots in the vector array, so why load them?

Not quite, because often there is a vocab filter for low-frequency words. A word might have been filtered out due to this frequency threshold and excluded from the dataset, but it could be included back again when you are importing it from a larger pre-trained word embeddings model.
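The effect described here can be shown with a small counting sketch: apply a min_count-style frequency threshold to the corpus, then see which of the dropped words a pretrained vocabulary could bring back. `split_by_min_count` and the tiny `pretrained_vocab` set are hypothetical stand-ins, not gensim code or real data.

```python
from collections import Counter


def split_by_min_count(corpus, min_count):
    """Return (kept, dropped) word sets under a frequency threshold."""
    counts = Counter(word for doc in corpus for word in doc)
    kept = {word for word, count in counts.items() if count >= min_count}
    dropped = set(counts) - kept
    return kept, dropped


corpus = [["first", "sentence"], ["second", "sentence"], ["rare", "sentence"]]
pretrained_vocab = {"rare", "sentence", "office"}   # made-up stand-in for P

kept, dropped = split_by_min_count(corpus, min_count=2)
rescuable = dropped & pretrained_vocab   # filtered from C, but available in P
```

Only "rare" is both below the threshold in the corpus and present in the pretrained set, so it is the one word such an import would rescue.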

Now, there might be some vague intuition that bringing in such words could help later, when you start presenting new documents for inference, say from some new set D, that have words that are outside the vocabulary of C, but were in P. But there are problems with this hope:

That wasn't quite the intention behind why the new vocab is included, for all the reasons you pointed out below.

The benefits in that table generally look small to me, and I suspect they'd be even smaller with the fairer training-time comparison I suggest above. But "never harms" (with italicized emphasis!) was an unsupportable word choice if in fact you really meant 'substantially', and the adjacent data table provides actual examples where pre-trained embeddings harmed the evaluation score. Such a mismatch also lowers my confidence in all nearby claims.

Fair point. The wording might have been a little strong but I stand by what I said previously and the key point is to take a step back and look at the bigger picture. Ultimately the interpretation is up to the users - they can make the choice whether they want to incorporate pre-trained embeddings or not.

@gojomo
Collaborator

gojomo commented Apr 13, 2017

Indeed. Its performance is far worse than PV-DBOW with SG, so we omitted those results entirely.

My concern is that without seeing the numbers, & knowing what parameters were tested, it's hard to use this observation to guide future work.

I disagree that is a fairer comparison. What would be a fairer comparison, though, is that you extract the best performance from both methods. If PV-DBOW without SG takes longer to converge to optimal performance, then yes I agree that one should train it more (but not by arbitrarily setting some 'standardised' epoch number). I did the same when comparing with PV-DM - it uses many more training epochs but the key point is finding its best performance. I might go back and run PV-DBOW without SG to check if this is the case.

Sure, never mind any default epoch-counts (or epoch-ratios). The conjecture is that even though PV-DBOW-without-SG may benefit from more epochs, these are so much faster (perhaps ~15X in the window=15 case) that a deeper search of its parameters may still show it to be a comparable- or top-performer on both runtime and final-scoring. (So it doesn't "take longer" in tangible runtime, just more iterations in same-or-less time.)

If you get a chance to test that, in a comparable set-up to the published results, I'd love to see the numbers and it'd give me far more confidence in any conclusions. (Similarly, the paper's reporting of 'optimal' parameters in Table 4, §3.3, and footnotes 9 & 11 would be far more informative if it also reported the full range of alternative values tried, and in what combinations.)

The intention is about checking the original paragraph vector, so yes I only experiment with dm_concat=1 option. In terms of observations we found what you've seen, that the increased number of parameters is hardly worth it.

I understand that choice. But given the dubiousness of the original paper's PV-DM-with-concatenation results, comparative info about the gensim default PV-DM-with-averaging mode could be more valuable. That mode might be competitive with PV-DBOW, especially on large datasets. So if you're ever thinking of a followup paper...

Not quite, because often there is a vocab filter for low-frequency words. A word might have been filtered out due to this frequency threshold and excluded from the dataset, but it could be included back again when you are importing it from a larger pre-trained word embeddings model.

I see. That's an interesting subset of the combined vocabulary – but raises the same concerns about vector-quality-vs-frequency as come into play in picking a min_count, or deciding if imported vectors should be tuned by the new examples, or frozen in place (perhaps because the pretraining corpus is assumed to be more informative). In some cases discarding words that lack enough training examples to induce 'good' vectors improves the quality of the surviving words, by effectively shrinking the distances between surviving words, and removing de facto interference from the lower-resolution/more-idiosyncratic words. So I could see the bulk import, and thus 'rescue' of below-min_count words in the current corpus, as either helping or hurting – it'd need testing to know. It's even within the realm-of-outside-possibility that the best policy might be to only pre-load some lowest-frequency words – trusting that more-frequent words are best trained from their plentiful domain-specific occurrences. Such policies could be explored by users by direct-tampering with the model's vocabulary between the scan_vocab() and scale_vocab() initialization steps.

@amueller
Contributor Author

@gojomo Ah, in the original paper I thought they implemented Figure 2 but they actually implemented Figure 3 (I only skimmed).

@gojomo
Collaborator

gojomo commented Apr 13, 2017

@amueller - I'd describe it as, figure-2 is PV-DM (available in gensim as dm=1, with potential submodes controlled by dm_mean and dm_concat), and figure-3 is PV-DBOW (available in gensim as dm=0, with potential skip-gram training interleaved with dbow_words=1).

@tmylk
Contributor

tmylk commented May 2, 2017

Closing as resolved open-ended discussion.
