
Doc.sents value changes when converted to bytes and back to Doc #1799

Closed
Mindful opened this issue Jan 4, 2018 · 2 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


Mindful commented Jan 4, 2018

First of all, I can only reproduce this problem with one specific sentence. There may be other sentences with the same issue, but I have only found one so far. I only noticed it because I have code that reads a large number of reviews, and I got (un)lucky.

Additionally, my laptop runs a slightly older version of spaCy and does not have this problem. Here's the environment information for my setup where the bug does not occur:

spaCy version: 1.9.0
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.1
Installed models: en

And here is the environment information for my setup where the bug does occur:

spaCy version: 2.0.5
Models: en
Platform: Linux-4.4.0-92-generic-x86_64-with-Ubuntu-16.04-xenial
Python version: 3.5.2

I have a gist that illustrates the problem here, but basically, the issue is that if I run this code in the aforementioned problem environment:

import spacy

nlp = spacy.load('en')  # the 'en' model, as listed in the environment info above

problem_sentence = 'Just what I was looking for, a retro mobile that fits my old car.'
doc = nlp(problem_sentence)
doc_bytes = doc.to_bytes()
doc_from_bytes = spacy.tokens.Doc(nlp.vocab).from_bytes(doc_bytes)

original_doc_sentences = list(doc.sents)
byte_doc_sentences = list(doc_from_bytes.sents)

print("Original doc sentence count:", len(original_doc_sentences))
print("original doc sentences:", ["<" + str(sent) + ">" for sent in original_doc_sentences])
print("New doc (loaded from bytes) sentence count:", len(byte_doc_sentences))
print("New doc (loaded from bytes) sentence:", ["<" + str(sent) + ">" for sent in byte_doc_sentences])

I get this as output:

Original doc sentence count: 1
original doc sentences: ['<Just what I was looking for, a retro mobile that fits my old car.>']
New doc (loaded from bytes) sentence count: 2
New doc (loaded from bytes) sentence: ['<Just what>', '<I was looking for, a retro mobile that fits my old car.>']

When I convert the doc to bytes and then read it back again, what should be one sentence is split into two.
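
If it helps, here is a minimal round-trip check (just a sketch, reusing doc and doc_from_bytes from the snippet above) that compares sentence boundaries before and after serialization; in the problem environment the assertion fails:

# Compare sentence boundaries before and after the bytes round trip.
original_spans = [(sent.start, sent.end) for sent in doc.sents]
roundtrip_spans = [(sent.start, sent.end) for sent in doc_from_bytes.sents]
# Fails in the problem environment, where the round-tripped Doc yields two spans.
assert original_spans == roundtrip_spans, (original_spans, roundtrip_spans)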

honnibal added the bug label (Bugs and behaviour differing from documentation) on Jan 12, 2018

honnibal commented Jan 22, 2018

Thanks for the report -- got to the bottom of this.

Upon deserializing, the secondary parse attributes l_kids, r_kids, l_edge, r_edge and sent_start are reconstructed from the HEAD array. However, this logic assumes that the parse is projective -- that it doesn't contain any crossing branches. The sentence you give as an example is one of the relatively rare cases of a non-projective structure in English. Non-projectivity is fairly common in other languages, though, so we'd be seeing this problem more often with other treebanks.
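
To make the projectivity point concrete, here is a small helper (a sketch only, not part of spaCy's API) that flags crossing dependency arcs in a parsed Doc; given the explanation above, it should report the example sentence's parse as non-projective in the v2.0.5 environment:

import spacy

def has_crossing_arcs(doc):
    # A parse is non-projective if any two dependency arcs cross, i.e. exactly
    # one endpoint of one arc lies strictly inside the span of the other.
    arcs = [(min(t.i, t.head.i), max(t.i, t.head.i))
            for t in doc if t.head.i != t.i]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return True
    return False

nlp = spacy.load('en')
doc = nlp('Just what I was looking for, a retro mobile that fits my old car.')
print(has_crossing_arcs(doc))  # expected: True, if the parse is non-projective as described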

honnibal added a commit that referenced this issue Jan 22, 2018

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 8, 2018