
Doc.sents value changes when converted to bytes and back to Doc #1799

Closed
Mindful opened this issue Jan 4, 2018 · 2 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


Mindful commented Jan 4, 2018

First of all, I can only reproduce this problem with one specific sentence. There may be other sentences with the same issue, but I have only found one so far. I only noticed it because I have code that reads a large number of reviews, and I got (un)lucky.

Additionally, my laptop runs a slightly older version of spaCy and does not have this problem. Here's the environment information for my setup where the bug does not occur:

spaCy version: 1.9.0
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.1
Installed models: en

And here is the environment information for my setup where the bug does occur:

spaCy version: 2.0.5
Models: en
Platform: Linux-4.4.0-92-generic-x86_64-with-Ubuntu-16.04-xenial
Python version: 3.5.2

I have a gist that illustrates the problem here, but basically, the issue is that if I run this code in the aforementioned problem environment:

import spacy

nlp = spacy.load('en')  # the 'en' model, as listed in the environment info above

problem_sentence = 'Just what I was looking for, a retro mobile that fits my old car.'
doc = nlp(problem_sentence)
doc_bytes = doc.to_bytes()
doc_from_bytes = spacy.tokens.Doc(nlp.vocab).from_bytes(doc_bytes)

original_doc_sentences = list(doc.sents)
byte_doc_sentences = list(doc_from_bytes.sents)

print("Original doc sentence count:", len(original_doc_sentences))
print("original doc sentences:", ["<" + str(sent) + ">" for sent in original_doc_sentences])
print("New doc (loaded from bytes) sentence count:", len(byte_doc_sentences))
print("New doc (loaded from bytes) sentence:", ["<" + str(sent) + ">" for sent in byte_doc_sentences])

I get this as output:

Original doc sentence count: 1
original doc sentences: ['<Just what I was looking for, a retro mobile that fits my old car.>']
New doc (loaded from bytes) sentence count: 2
New doc (loaded from bytes) sentence: ['<Just what>', '<I was looking for, a retro mobile that fits my old car.>']

When I convert the doc to bytes and then read it back again, what should be one sentence is split into two.
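
If it helps, here is a minimal round-trip check (just a sketch, reusing doc and doc_from_bytes from the snippet above) that compares sentence boundaries before and after serialization; in the problem environment the assertion fails:

# Compare sentence boundaries before and after the bytes round trip.
original_spans = [(sent.start, sent.end) for sent in doc.sents]
roundtrip_spans = [(sent.start, sent.end) for sent in doc_from_bytes.sents]
# Fails in the problem environment, where the round-tripped Doc yields two spans.
assert original_spans == roundtrip_spans, (original_spans, roundtrip_spans)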

honnibal added the bug label (Bugs and behaviour differing from documentation) on Jan 12, 2018

honnibal commented Jan 22, 2018

Thanks for the report -- got to the bottom of this.

Upon deserializing, the secondary parse attributes l_kids, r_kids, l_edge, r_edge and sent_start are reconstructed from the HEAD array. However, this logic assumes that the parse is projective -- that it doesn't contain any crossing branches. The sentence you give as an example is one of the relatively rare cases of a non-projective structure in English. Non-projectivity is fairly common in other languages, though, so we'd be seeing this problem more often with other treebanks.
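
To make the projectivity point concrete, here is a small helper (a sketch only, not part of spaCy's API) that flags crossing dependency arcs in a parsed Doc; given the explanation above, it should report the example sentence's parse as non-projective in the v2.0.5 environment:

import spacy

def has_crossing_arcs(doc):
    # A parse is non-projective if any two dependency arcs cross, i.e. exactly
    # one endpoint of one arc lies strictly inside the span of the other.
    arcs = [(min(t.i, t.head.i), max(t.i, t.head.i))
            for t in doc if t.head.i != t.i]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return True
    return False

nlp = spacy.load('en')
doc = nlp('Just what I was looking for, a retro mobile that fits my old car.')
print(has_crossing_arcs(doc))  # expected: True, if the parse is non-projective as described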

honnibal added a commit that referenced this issue Jan 22, 2018

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 8, 2018