(un)serializing a doc object #152

Closed
belihub opened this issue Oct 27, 2015 · 7 comments

@belihub

belihub commented Oct 27, 2015

Some background: to avoid the minute or so required for spaCy's initialization, I plan to run it on a local server on my machine and pass strings to / get docs from it using ZeroMQ. To do that, I need to serialize the doc object, and if I'm reading http://spacy.io/docs correctly, spaCy has methods for that: Doc.to_bytes(doc) and Doc.from_bytes(bytearray). There are two places where spaCy behaves unexpectedly, best illustrated by pasting the code. One, a TypeError:

In [9]: mydoc = nlp(txt)

In [10]: mybs = Doc.to_bytes(mydoc)

In [11]: mynewdoc = Doc.from_bytes(mybs)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-11-eeef45863a37> in <module>()
    ----> 1 mynewdoc = Doc.from_bytes(mybs)

TypeError: descriptor 'from_bytes' requires a 'spacy.tokens.doc.Doc' object but received a 'bytearray'

Two: following the code sample in http://spacy.io/docs and writing Doc(nlp.vocab).from_bytes() instead of Doc.from_bytes() removes the error message, but I get another error when trying to work with the deserialized object:

In [23]: mynewdoc = Doc(nlp.vocab).from_bytes(mybs)

In [24]: sents = [sent.orth_ for sent in mynewdoc.sents]
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-24-237283605709> in <module>()
    ----> 1 sents = [sent.orth_ for sent in mynewdoc.sents]

spacy/tokens/doc.pyx in sents (spacy/tokens/doc.cpp:7908)()

ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. If you haven't done so, run:
    python -m spacy.en.download all
    to install the data

.sents worked for the original doc, but throws the error above for the deserialized one.
I'd appreciate someone pointing out what I'm doing wrong; I'm using the latest version (v0.97) of spaCy.
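
For concreteness, the server half of the plan described above might look roughly like this. This is only a sketch: it assumes pyzmq, and the socket address and REQ/REP framing are arbitrary choices rather than anything spaCy prescribes. The client side still needs a vocab to rebuild the Doc, which is what the rest of this thread is about.

    import zmq
    from spacy.en import English
    from spacy.tokens.doc import Doc

    nlp = English()                          # the slow load, paid once when the server starts

    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://127.0.0.1:5555")      # illustrative address

    while True:
        text = socket.recv_string()          # client sends plain text
        doc = nlp(text)
        socket.send(bytes(Doc.to_bytes(doc)))  # reply with the serialized doc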

@honnibal
Member

Thanks.

So, first I would say that you should try to have your worker node reuse the NLP object, if possible. But if each worker process only needs to work with a single document, then yes, the loading time will be significant. Your plan to reduce it seems like a good way to go.

It looks like there's also a bug in the check I introduced to warn users about accessing sentence boundaries when the parse isn't set. I've now corrected that.

As a temporary work-around, until the new version is released, you can set the two flags yourself:

doc.is_parsed = True
doc.is_tagged = True
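
Putting the pieces from this thread together, a full round-trip with the work-around would look roughly like this (a sketch against v0.97; the sample text is arbitrary):

    from spacy.en import English
    from spacy.tokens.doc import Doc

    nlp = English()
    doc = nlp(u"This is one sentence. This is another.")

    byte_string = Doc.to_bytes(doc)                   # serialize

    new_doc = Doc(nlp.vocab).from_bytes(byte_string)  # deserialize
    new_doc.is_parsed = True                          # temporary work-around
    new_doc.is_tagged = True

    sentences = list(new_doc.sents)                   # no longer raises the ValueError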

@belihub
Author

belihub commented Oct 31, 2015

Thanks for the quick reply! Manually setting the flags worked. However, getting nlp.vocab still requires initializing nlp = English() each time. Is there a way to deserialize without nlp.vocab?

Alternatively, what would be the best way to have the worker node reuse the NLP object?

@honnibal
Member

honnibal commented Nov 1, 2015

If you just want to load the vocab, you should be able to do that without loading the rest of English. Try:

nlp.vocab.from_dir(path.join(English.default_data_dir(), 'vocab'))

You could also load the arguments to Vocab directly, and use the constructor.

Reloading the vocab on each document is still a bad idea, though. You'll want the worker nodes to load it once. I find a global variable to be the most explicit way to do that, but some people hold the state in a generator function, or a class variable.
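
As an illustration of the global-variable approach, a worker module might look like this (just a sketch; get_nlp and process are made-up names):

    from spacy.en import English

    NLP = None

    def get_nlp():
        # Load English once per worker process, then reuse it.
        global NLP
        if NLP is None:
            NLP = English()
        return NLP

    def process(text):
        doc = get_nlp()(text)
        return [token.orth_ for token in doc]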

Another way to do it is to distribute a batch of documents to the workers, and have the task be do_batch. This is what I did when processing the Reddit comments corpus: each month of comments was its own task.

See here: https://github.com/honnibal/spaCy/blob/master/bin/get_freqs.py
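
A rough sketch of the batch-per-task idea, with a do_batch function as the unit of work. The multiprocessing setup below is illustrative and not taken from get_freqs.py:

    from multiprocessing import Pool

    from spacy.en import English

    NLP = None

    def do_batch(texts):
        # One English() load per worker process, amortized over the whole batch.
        global NLP
        if NLP is None:
            NLP = English()
        return [len(list(NLP(text))) for text in texts]   # e.g. token counts per text

    if __name__ == '__main__':
        batches = [
            [u"First document.", u"Second document."],
            [u"Third document."],
        ]
        pool = Pool(2)
        results = pool.map(do_batch, batches)
        pool.close()
        pool.join()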

@thorsonlinguistics

It seems that other data is also lost in deserialization and is not recovered by setting the flags.

I can't access the lemmas for tokens in the deserialized doc.

doc.sents also only ever generates one sentence with a deserialized Doc, regardless of the original number of sentences.

@honnibal
Member

honnibal commented Nov 2, 2015

Hmm. Thanks. Will fix.

@honnibal
Member

honnibal commented Nov 3, 2015

I think the data loss is fixed in v0.98, but I haven't written all the tests yet.
