steps to reproduce
First create the following Corpus, save it to disk, and note that upon reloading you can still get word doc counts:
import textacy
corpus = textacy.Corpus('en', ['Pittsburgh', 'slated for. Stacey designated as moderator'])
corpus.save('foo.textacy')
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())
But then open a new Python shell, load the same corpus from disk, and get an error about a word ID missing from the vocab:
import textacy
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())
Traceback (most recent call last):
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-dff4867a4989>", line 3, in <module>
    print(corpus.word_doc_counts())
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/corpus.py", line 494, in word_doc_counts
    normalize=normalize, weighting="binary", as_strings=as_strings
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/spacier/doc_extensions.py", line 511, in to_bag_of_words
    lex = vocab[wid]
  File "vocab.pyx", line 237, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 44, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 152, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 138, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'."
context
The example above was narrowed down from larger texts. Strangely, at this point removing any more words seems to make the bug go away. E.g., the following all work:
['Pittsburgh', 'slated for. Stacey designated moderator']
['Pittsburgh', 'slated. Stacey designated as moderator']
['Pittsburgh', 'for. Stacey designated as moderator']
['slated for. Stacey designated as moderator']
['this is doc one', 'this is doc two']
I've run into this with several different corpora (I'm trying to build IDF models).
possible solution?
I'm guessing it has something to do with trying to access the lemmas of words? Maybe the Vocab needs to be serialized along with the docs themselves? explosion/spaCy#2419
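To make the guess above concrete, here is a toy model of the failure mode, not spaCy's actual StringStore. It assumes only what the traceback shows: docs hold integer hashes, and a lookup fails when the store in the current process never saw the matching string. The class `ToyStringStore` and all of its methods are hypothetical names invented for this sketch.

```python
# Toy model only: this is NOT spaCy's StringStore, just the same idea in miniature.
import hashlib


class ToyStringStore:
    """Maps 64-bit hashes to strings; only strings that were added can be looked up."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Deterministic 64-bit hash (a stand-in for whatever the real library uses).
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def add(self, text):
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        try:
            return self._by_hash[key]
        except KeyError:
            raise KeyError(f"Can't retrieve string for hash '{key}'.") from None

    def to_list(self):
        return sorted(self._by_hash.values())

    @classmethod
    def from_list(cls, strings):
        store = cls()
        for s in strings:
            store.add(s)
        return store


# "Session 1": processing text adds strings to the store; the doc keeps only hashes.
session1 = ToyStringStore()
doc_ids = [session1.add(w) for w in ["Stacey", "designated", "as", "moderator"]]
saved_strings = session1.to_list()  # what would have to be saved alongside the doc

# "Session 2" without the saved strings: the hashes cannot be resolved.
fresh = ToyStringStore()
try:
    fresh[doc_ids[0]]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# "Session 2" with the saved strings restored: lookups work again.
restored = ToyStringStore.from_list(saved_strings)
words = [restored[i] for i in doc_ids]
```

If the real cause is similar, reloading in the same process would keep working because the strings are still in memory, while a fresh shell would start with a store that lacks some of them. Whether the missing strings are lemmas, as guessed above, is not established here.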
environment
platform: darwin
python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
spacy: 2.1.3
spacy_models: ['en']
textacy: 0.7.1
After upgrading textacy and spacy, the error now seems to be intermittent (or maybe it was before?), so you may have to try loading it in a new shell a few times before it fails.
platform: darwin
python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
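Since the failure is intermittent, it may be easier to retry the load in fresh interpreters from a script than by hand. The sketch below is one way to do that. The helper names `build_load_script` and `count_failures` are hypothetical, and the textacy calls are copied from the report, so their signatures may differ in other textacy versions.

```python
import subprocess
import sys

# Script run in each fresh interpreter; it uses only the calls shown in the report.
LOAD_SCRIPT = (
    "import textacy\n"
    "corpus = textacy.Corpus.load('en', {path!r})\n"
    "print(corpus.word_doc_counts())\n"
)


def build_load_script(path):
    """Return the source code that loads the corpus at `path` and prints the counts."""
    return LOAD_SCRIPT.format(path=path)


def count_failures(path, attempts=10):
    """Run the load in `attempts` fresh interpreters; return how many exited non-zero.

    Any non-zero exit is counted, so an ImportError or a missing file would also
    show up as a failure, not only the KeyError from the report.
    """
    failures = 0
    for _ in range(attempts):
        result = subprocess.run(
            [sys.executable, "-c", build_load_script(path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

After saving the corpus as in the steps to reproduce, `count_failures('foo.textacy')` gives a rough failure rate across new processes.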