(un)serializing a doc object #152
Thanks. So, first I would say that you should try to have your worker node reuse the NLP object, if possible. But if each worker process only needs to work with a single document, then yes, the loading time will be significant. Your plan to reduce it seems like a good way to go. It also looks like there's a bug in the check I introduced to warn users about accessing the sentence boundaries when the parse isn't set. I've now corrected that. As a temporary work-around, until the new version is released, you can set the two flags yourself:
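A minimal sketch of that work-around, assuming the two flags are Doc.is_tagged and Doc.is_parsed (the attribute names may differ in this spaCy version) and using the Doc(nlp.vocab).from_bytes() pattern described later in the thread:

```python
from spacy.en import English
from spacy.tokens.doc import Doc   # import path assumed for this version

nlp = English()
doc = nlp(u"One sentence here. Another sentence here.")

# Round-trip through bytes, then set the flags by hand so that
# sentence boundaries become accessible again.
restored = Doc(nlp.vocab).from_bytes(doc.to_bytes())
restored.is_tagged = True   # assumed flag name
restored.is_parsed = True   # assumed flag name
sentences = list(restored.sents)
```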
Thanks for the quick reply! Manually setting the flags worked. However, nlp.vocab still requires "nlp = English()" to be initialized each time. Is there a way to deserialize without nlp.vocab? Alternatively, what would be the best way to have the worker node reuse the NLP object?
If you just want to load the vocab, you should be able to do that without loading the rest of the pipeline:

nlp.vocab.from_dir(path.join(English.default_data_dir(), 'vocab'))

You could also load the arguments to …

Reloading the vocab on each document is still a bad idea, though. You'll want the worker nodes to load it once. I find a global variable to be the most explicit way to do that, but some people hold the state in a generator function, or a class variable. Another way to do it is to distribute a batch of documents to the workers, and have the task be … See here: https://github.com/honnibal/spaCy/blob/master/bin/get_freqs.py
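For the "load once per worker, reuse via a global" pattern, a rough sketch along those lines (the pool wiring and the names init_worker / process_text are illustrative, not part of spaCy):

```python
import multiprocessing

from spacy.en import English

NLP = None  # one pipeline per worker process, loaded once


def init_worker():
    # Pay the English() load cost a single time, when the worker starts.
    global NLP
    NLP = English()


def process_text(text):
    # Reuse the module-level pipeline for every document this worker sees.
    doc = NLP(text)
    return [(token.orth_, token.tag_) for token in doc]


if __name__ == '__main__':
    texts = [u"First document.", u"Second document.", u"Third document."]
    pool = multiprocessing.Pool(processes=2, initializer=init_worker)
    print(pool.map(process_text, texts))
```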
It seems that other data is also lost in deserialization and is not recovered by setting the flags. I can't access the lemmas for tokens in the deserialized doc. doc.sents also only ever generates one sentence for a deserialized Doc, regardless of the original number of sentences.
Hmm. Thanks. Will fix.
I think the data loss is fixed in v0.98, but I haven't written all the tests yet.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Some background: to avoid the minute or so required for spaCy's initialization, I plan to run it on a local server on my machine and pass strings to / get docs from it using zeromq. To do that, I need to serialize the doc object, and if I'm reading http://spacy.io/docs correctly, spaCy has methods for that: Doc.to_bytes(doc) and Doc.from_bytes(bytearray). There are two places where spaCy behaves unexpectedly, best illustrated by pasting the code. One, a TypeError:
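Since the original snippet didn't survive in this thread, here is a sketch of the kind of round trip that triggers the TypeError (spaCy 0.97-era API assumed, import paths may differ):

```python
from spacy.en import English
from spacy.tokens.doc import Doc   # import path assumed for this version

nlp = English()
doc = nlp(u"This is one sentence. This is another sentence.")

data = Doc.to_bytes(doc)         # serialize, as per the docs
restored = Doc.from_bytes(data)  # calling this on the class itself is what raises the TypeError here
```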
Two - following the code sample in http://spacy.io/docs and writing Doc(nlp.vocab).from_bytes() instead of Doc.from_bytes() removes the error message, but I get another on trying to work with the deserialized object:
doc.sents worked for the original doc, but throws the error above for the deserialized one.
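Roughly what the second attempt looks like, continuing the sketch above (again with the 0.97-era API assumed):

```python
# Construct an empty Doc around the shared vocab, then deserialize into it.
restored = Doc(nlp.vocab).from_bytes(data)   # no TypeError this time

list(doc.sents)       # works on the original document
list(restored.sents)  # raises on the deserialized one, as described above
```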
I'd appreciate someone pointing out what I'm doing wrong; I'm using the latest version (v0.97) of spaCy.