>>> import gensim, logging
>>> from gensim import utils
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> full_sentence = [str(x) + 'times' for x in range(1, 5)]
>>> sentences = [full_sentence, full_sentence[1:], full_sentence[2:], full_sentence[3:]]
>>> print(sentences)
[['1times', '2times', '3times', '4times'], ['2times', '3times', '4times'], ['3times', '4times'], ['4times']]
>>> def print_vocab(model):
...     for x in model.index2word:
...         print(x)
...         print(model.vocab[x])
...
>>> # trim rule: keep any word starting with "1", even below min_count
>>> def my_rule(word, count, min_count):
...     if word.startswith("1"):
...         return gensim.utils.RULE_KEEP
...     else:
...         return gensim.utils.RULE_DEFAULT
...
>>> model = gensim.models.Word2Vec(sentences, min_count=3, trim_rule=my_rule)
>>> # the trim rule works when the corpus is passed to the constructor
>>> print_vocab(model)
2016-08-12 09:31:47,100 : INFO : collecting all words and their counts
2016-08-12 09:31:47,100 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-08-12 09:31:47,100 : INFO : collected 4 word types from a corpus of 10 raw words and 4 sentences
2016-08-12 09:31:47,100 : INFO : min_count=3 retains 3 unique words (drops 1)
2016-08-12 09:31:47,101 : INFO : min_count leaves 8 word corpus (80% of original 10)
2016-08-12 09:31:47,101 : INFO : deleting the raw counts dictionary of 4 items
2016-08-12 09:31:47,101 : INFO : sample=0.001 downsamples 3 most-common words
2016-08-12 09:31:47,101 : INFO : downsampling leaves estimated 0 word corpus (5.6% of prior 8)
2016-08-12 09:31:47,101 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2016-08-12 09:31:47,101 : INFO : resetting layer weights
2016-08-12 09:31:47,102 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-08-12 09:31:47,102 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
4times
Vocab(count:4, index:0, sample_int:200666711)
3times
Vocab(count:3, index:1, sample_int:233244404)
1times
Vocab(count:1, index:2, sample_int:418513292)
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-08-12 09:31:47,104 : INFO : training on 50 raw words (2 effective words) took 0.0s, 2544 effective words/s
2016-08-12 09:31:47,104 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
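For reference, the values a trim rule may return are module-level constants in gensim.utils. A minimal standalone rule exercising all three (the word prefixes here are purely illustrative):

from gensim import utils

def example_rule(word, count, min_count):
    # RULE_KEEP: retain the word even when its count is below min_count
    if word.startswith('1'):
        return utils.RULE_KEEP
    # RULE_DISCARD: drop the word even when its count reaches min_count
    if word.startswith('4'):
        return utils.RULE_DISCARD
    # RULE_DEFAULT: defer to the ordinary min_count comparison
    return utils.RULE_DEFAULT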
>>> # I want to separate initialization from vocabulary building
>>> model = gensim.models.Word2Vec(min_count=3, trim_rule=my_rule)
>>> model.build_vocab(sentences)
>>> # the trim rule does NOT take effect in this case
>>> print_vocab(model)
2016-08-12 09:31:59,707 : INFO : collecting all words and their counts
2016-08-12 09:31:59,708 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-08-12 09:31:59,708 : INFO : collected 4 word types from a corpus of 10 raw words and 4 sentences
2016-08-12 09:31:59,708 : INFO : min_count=3 retains 2 unique words (drops 2)
2016-08-12 09:31:59,708 : INFO : min_count leaves 7 word corpus (70% of original 10)
2016-08-12 09:31:59,708 : INFO : deleting the raw counts dictionary of 4 items
2016-08-12 09:31:59,708 : INFO : sample=0.001 downsamples 2 most-common words
2016-08-12 09:31:59,708 : INFO : downsampling leaves estimated 0 word corpus (4.7% of prior 7)
2016-08-12 09:31:59,708 : INFO : estimated required memory for 2 words and 100 dimensions: 2600 bytes
2016-08-12 09:31:59,708 : INFO : resetting layer weights
4times
Vocab(count:4, index:0, sample_int:187187565)
3times
Vocab(count:3, index:1, sample_int:217488221)
>>> model.train(sentences)
>>> print_vocab(model)
2016-08-12 09:32:08,306 : INFO : training model with 3 workers on 2 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-08-12 09:32:08,306 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-08-12 09:32:08,308 : INFO : training on 50 raw words (0 effective words) took 0.0s, 0 effective words/s
2016-08-12 09:32:08,308 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
4times
Vocab(count:4, index:0, sample_int:187187565)
3times
Vocab(count:3, index:1, sample_int:217488221)
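If I read the build_vocab() signature correctly, it accepts its own trim_rule argument, so a workaround is to pass the rule at vocabulary-building time. A sketch reusing the session's sentences, my_rule, and print_vocab:

model = gensim.models.Word2Vec(min_count=3)
model.build_vocab(sentences, trim_rule=my_rule)  # rule applied during the vocab scan
model.train(sentences)
print_vocab(model)  # '1times' should now survive despite count=1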
Looks like a bug / documentation confusion to me. -- re-opening issue.
At the very least, if the user passes a trim_rule to the constructor without a corpus, we should log a warning that the trim_rule is being ignored, or even raise an exception.
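A minimal sketch of such a check (the helper name and warning text are hypothetical; only the sentences/trim_rule constructor arguments match the real API):

import logging

logger = logging.getLogger(__name__)

def warn_if_trim_rule_ignored(sentences, trim_rule):
    # Hypothetical helper mirroring the proposed check at the top of
    # Word2Vec.__init__: the rule is only applied while scanning a corpus,
    # so with no corpus given it silently does nothing today.
    if sentences is None and trim_rule is not None:
        logger.warning(
            "trim_rule supplied without a corpus; it will be ignored until "
            "build_vocab() is called -- pass trim_rule to build_vocab() too."
        )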
I can see how the current API could be confusing. CC @gojomo
…ky#1186)
* no corpus in init, but trim_rule in init: logged a warning that the trim_rule is being ignored when model initialization and vocabulary building are separated
* log the warning only when a trim_rule is specified
Is this a bug or a feature?