(un)serializing a doc object #152

Closed
belihub opened this issue Oct 27, 2015 · 7 comments

@belihub

belihub commented Oct 27, 2015

Some background: to avoid the minute or so required for spaCy's initialization, I plan to run it on a local server on my machine and pass strings to / get docs from it using ZeroMQ. To do that, I need to serialize the doc object, and if I'm reading http://spacy.io/docs correctly, spaCy has methods for that: Doc.to_bytes(doc) and Doc.from_bytes(bytearray). There are two places where spaCy behaves unexpectedly, best illustrated by pasting the code. One, a TypeError:

In [9]: mydoc = nlp(txt)

In [10]: mybs = Doc.to_bytes(mydoc)

In [11]: mynewdoc = Doc.from_bytes(mybs)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-11-eeef45863a37> in <module>()
    ----> 1 mynewdoc = Doc.from_bytes(mybs)

TypeError: descriptor 'from_bytes' requires a 'spacy.tokens.doc.Doc' object but received a 'bytearray'

Two: following the code sample in http://spacy.io/docs and writing Doc(nlp.vocab).from_bytes() instead of Doc.from_bytes() removes the error message, but I get another error when trying to work with the deserialized object:

In [23]: mynewdoc = Doc(nlp.vocab).from_bytes(mybs)

In [24]: sents = [sent.orth_ for sent in mynewdoc.sents]
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-24-237283605709> in <module>()
    ----> 1 sents = [sent.orth_ for sent in mynewdoc.sents]

spacy/tokens/doc.pyx in sents (spacy/tokens/doc.cpp:7908)()

ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. If you haven't done so, run:
    python -m spacy.en.download all
    to install the data

.sents worked for the original doc, but throws the error above for the deserialized one.
I'd appreciate someone pointing out what I'm doing wrong; I'm using the latest version (v0.97) of spaCy.
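
For concreteness, the server half of the plan described above might look roughly like this. This is only a sketch: it assumes pyzmq, and the socket address and REQ/REP framing are arbitrary choices rather than anything spaCy prescribes. The client side still needs a vocab to rebuild the Doc, which is what the rest of this thread is about.

    import zmq
    from spacy.en import English
    from spacy.tokens.doc import Doc

    nlp = English()                          # the slow load, paid once when the server starts

    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://127.0.0.1:5555")      # illustrative address

    while True:
        text = socket.recv_string()          # client sends plain text
        doc = nlp(text)
        socket.send(bytes(Doc.to_bytes(doc)))  # reply with the serialized doc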

@honnibal
Member

Thanks.

So, first I would say that you should try to have your worker node reuse the NLP object, if possible. But if each worker process only needs to work with a single document, then yes, the loading time will be significant. Your plan to reduce it seems like a good way to go.

It looks like there's also a bug in the check I introduced to warn users about accessing sentence boundaries when the parse isn't set. I've now corrected that.

As a temporary work-around, until the new version is released, you can set the two flags yourself:

doc.is_parsed = True
doc.is_tagged = True
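
Putting the pieces from this thread together, a full round-trip with the work-around would look roughly like this (a sketch against v0.97; the sample text is arbitrary):

    from spacy.en import English
    from spacy.tokens.doc import Doc

    nlp = English()
    doc = nlp(u"This is one sentence. This is another.")

    byte_string = Doc.to_bytes(doc)                   # serialize

    new_doc = Doc(nlp.vocab).from_bytes(byte_string)  # deserialize
    new_doc.is_parsed = True                          # temporary work-around
    new_doc.is_tagged = True

    sentences = list(new_doc.sents)                   # no longer raises the ValueError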

@belihub
Author

belihub commented Oct 31, 2015

Thanks for the quick reply! Manually setting the flags worked. However, getting nlp.vocab still requires initializing nlp = English() each time. Is there a way to deserialize without nlp.vocab?

Alternatively, what would be the best way to have the worker node reuse the NLP object?

@honnibal
Member

honnibal commented Nov 1, 2015

If you just want to load the vocab, you should be able to do that without loading the rest of English. Try:

nlp.vocab.from_dir(path.join(English.default_data_dir(), 'vocab'))

You could also load the arguments to Vocab directly, and use the constructor.

Reloading the vocab on each document is still a bad idea, though. You'll want the worker nodes to load it once. I find a global variable to be the most explicit way to do that, but some people hold the state in a generator function, or a class variable.
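
As an illustration of the global-variable approach, a worker module might look like this (just a sketch; get_nlp and process are made-up names):

    from spacy.en import English

    NLP = None

    def get_nlp():
        # Load English once per worker process, then reuse it.
        global NLP
        if NLP is None:
            NLP = English()
        return NLP

    def process(text):
        doc = get_nlp()(text)
        return [token.orth_ for token in doc]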

Another way to do it is to distribute a batch of documents to the workers, and have the task be do_batch. This is what I did when processing the Reddit comments corpus: each month of comments was its own task.

See here: https://github.com/honnibal/spaCy/blob/master/bin/get_freqs.py
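
A rough sketch of the batch-per-task idea, with a do_batch function as the unit of work. The multiprocessing setup below is illustrative and not taken from get_freqs.py:

    from multiprocessing import Pool

    from spacy.en import English

    NLP = None

    def do_batch(texts):
        # One English() load per worker process, amortized over the whole batch.
        global NLP
        if NLP is None:
            NLP = English()
        return [len(list(NLP(text))) for text in texts]   # e.g. token counts per text

    if __name__ == '__main__':
        batches = [
            [u"First document.", u"Second document."],
            [u"Third document."],
        ]
        pool = Pool(2)
        results = pool.map(do_batch, batches)
        pool.close()
        pool.join()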

@thorsonlinguistics

It seems that other data is also lost in deserialization and is not recovered by setting the flags.

I can't access the lemmas for tokens in the deserialized doc.

doc.sents also only ever generates one sentence with a deserialized Doc, regardless of the original number of sentences.

@honnibal
Member

honnibal commented Nov 2, 2015

Hmm. Thanks. Will fix.

@honnibal
Member

honnibal commented Nov 3, 2015

I think the data loss is fixed in v0.98, but I haven't written all the tests yet.
