First of all, I can only reproduce this problem for a specific sentence. There may be other sentences which have the same issue, but I have only found one so far. I noticed only because I have code that reads a large number of reviews, and I got (un)lucky.
Additionally, my laptop runs a slightly older version of spaCy and does not have this problem. Here's the environment information for my setup where the bug does not occur:
And here is the environment information for my setup where the bug does occur:
spaCy version: 2.0.5
Models: en
Platform: Linux-4.4.0-92-generic-x86_64-with-Ubuntu-16.04-xenial
Python version: 3.5.2
I have a gist that illustrates the problem here, but basically, the issue is that if I run this code in the aforementioned problem environment:
import spacy

# Load the English model listed in the environment info above ("en").
nlp = spacy.load('en')

problem_sentence = 'Just what I was looking for, a retro mobile that fits my old car.'
doc = nlp(problem_sentence)
# Serialize the parsed doc to bytes, then load it back into a fresh Doc.
doc_bytes = doc.to_bytes()
doc_from_bytes = spacy.tokens.Doc(nlp.vocab).from_bytes(doc_bytes)
# Compare sentence segmentation before and after the round trip.
original_doc_sentences = list(doc.sents)
byte_doc_sentences = list(doc_from_bytes.sents)
print("Original doc sentence count:", len(original_doc_sentences))
print("original doc sentences:", ["<" + str(sent) + ">" for sent in original_doc_sentences])
print("New doc (loaded from bytes) sentence count:", len(byte_doc_sentences))
print("New doc (loaded from bytes) sentence:", ["<" + str(sent) + ">" for sent in byte_doc_sentences])
I get this as output:
Original doc sentence count: 1
original doc sentences: ['<Just what I was looking for, a retro mobile that fits my old car.>']
New doc (loaded from bytes) sentence count: 2
New doc (loaded from bytes) sentence: ['<Just what>', '<I was looking for, a retro mobile that fits my old car.>']
When I change the doc into bytes and then read it back again, what should be one sentence has split into two.
Thanks for the report -- got to the bottom of this.
Upon deserializing, the secondary parse attributes l_kids, r_kids, l_edge, r_edge and sent_start are reconstructed from the HEAD array. However, this logic assumes that the parse is projective, that is, that it doesn't contain any crossing branches. The sentence you give as an example is one of the relatively rare cases in English of a non-projective structure. Non-projectivity is fairly common in other languages, though, so we would see this problem more often with other treebanks.
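To make "crossing branches" concrete, here is a minimal sketch of a projectivity check over a head array. It is not spaCy's internal code. The function names `crossing_arcs` and `is_projective` are hypothetical, and the sketch assumes each entry of `heads` is the absolute index of that token's head, with a root pointing to itself. The head arrays in the usage lines are toy data, not the actual parse of the sentence in this report.

```python
from typing import List, Tuple

Arc = Tuple[int, int]


def crossing_arcs(heads: List[int]) -> List[Tuple[Arc, Arc]]:
    """Return every pair of dependency arcs that cross each other.

    Assumption: heads[i] is the absolute index of token i's head,
    and a root token is its own head.
    """
    # Represent each non-root arc as a (left, right) span over token positions.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    crossings = []
    for idx, (a_left, a_right) in enumerate(arcs):
        for b_left, b_right in arcs[idx + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if a_left < b_left < a_right < b_right or b_left < a_left < b_right < a_right:
                crossings.append(((a_left, a_right), (b_left, b_right)))
    return crossings


def is_projective(heads: List[int]) -> bool:
    """A parse is treated as projective here if no two arcs cross."""
    return not crossing_arcs(heads)


# Toy data: tokens 0 and 2 attach to token 1, which is the root.
print(is_projective([1, 1, 1]))
# Toy data: the arc 0-2 crosses the arc 1-3, so this tree is non-projective.
print(crossing_arcs([2, 3, 2, 2]))
```

For a real parse, the head array could be built with something like `[token.head.i for token in doc]` and passed to `is_projective` to find out whether a given sentence is one of the affected cases.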