inconsistent sentence boundaries before and after serialization #322
Comments
Thanks, there's definitely something wrong here. |
(atting @wbwseeker because we were talking about this bug on Slack)

I've just gone back over the code and realised that I'd forgotten how my transition system works, with respect to the Break transition. It's really not written down anywhere, and it's in fact rather different from the paper that the code cites as inspiration. So, I'll give some background here.

The intention is that all sentences are connected trees, so there's one word per sentence that is its own head, and that has the label ROOT. Mostly, sentence boundaries are inserted by the Break action. The Break action flags the first word of the buffer as the start of the next sentence. The parser then acts as though the buffer is exhausted until the stack has only one word. That is, it continues parsing using the "Unshift" action to connect the stack, until only one word is left. That word then becomes the root of the sentence, it's popped, and parsing continues.

There is however another way that we can get a sentence boundary. If the buffer is fully exhausted (i.e. we're really at the end of the sentence), the parser might end up with two root words on the stack. It's then allowed to join them with a left or right arc, using the label ROOT. This should be interpreted as saying "These are both root words, of different sentences. Insert a sentence boundary between them."

In the code, there's a flag
At some point, the code to actually insert the sentence boundaries during this

Below you can find the transition sequence taken by the current model for the example sentence. You can see the final R-ROOT action, which connects the two root words. Note that the tokenization problem "IKEA." is the underlying cause for the model's initial mistake here, which is how it ends up trying to use this error-correction mechanism to arrive at the correct parse.
Another important part of the post-mortem here is that it's really noticeable that I've got a lot of fairly intricate logic in the transition system that has only been supported by informal experiments, and hasn't been written up anywhere. This isn't very satisfying. I really wanted to have a paper that explained the joint sentence boundary detection and parsing mechanism, and presented the whole-document evaluations. But I never got the CoreNLP comparison done, and the priority was always to keep developing. The decisions should at least be written up somewhere, with whatever results are available.

>>> import spacy
>>> nlp = spacy.load('en')
>>> string = u"I bought a couch from IKEA. It wasn't very comfortable."
>>> doc = nlp.tokenizer(string)
>>> nlp.tagger(doc)
>>> with nlp.parser.step_through(doc) as state:
...     while not state.is_final:
...         action = state.predict()
...         print(action)
...         state.transition(action)
...
L-nsubj
S
L-det
R-dobj
D
R-prep
R-pobj
S
L-nsubj
D
D
S
R-neg
S
L-advmod
D
R-acomp
D
R-punct
R-ROOT |
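To make the second mechanism from the comment above concrete (two root words joined by an arc labelled ROOT, read as "insert a sentence boundary between them"), here is a minimal sketch in plain Python. It is not spaCy's implementation. The function name, the head-index encoding (a root is its own head), and the hand-written heads for the example sentence are all assumptions made for illustration, and the sketch assumes a projective, cycle-free tree.

```python
def sentence_starts_from_root_arcs(heads, labels):
    """Return token indices that start a sentence.

    heads[i] is the index of token i's head; a root is its own head.
    An arc labelled "ROOT" is treated as a marker that the dependent is
    itself a root of a separate sentence (assumption for this sketch).
    """
    heads = list(heads)
    # Cut every ROOT-labelled arc so the dependent becomes its own head.
    for i, label in enumerate(labels):
        if label == "ROOT":
            heads[i] = i

    def root_of(i):
        # Follow head pointers upward; bail out if the input has a cycle.
        for _ in range(len(heads)):
            if heads[i] == i:
                return i
            i = heads[i]
        raise ValueError("head indices contain a cycle")

    roots = [root_of(i) for i in range(len(heads))]
    # A sentence starts wherever the governing root changes.
    return [i for i in range(len(roots)) if i == 0 or roots[i] != roots[i - 1]]


# Hand-written analysis of the example (illustrative, not parser output).
# Note "IKEA." is a single token, as in the thread.
words = ["I", "bought", "a", "couch", "from", "IKEA.",
         "It", "was", "n't", "very", "comfortable", "."]
heads = [1, 1, 3, 1, 1, 4, 7, 1, 7, 10, 7, 7]   # "was" is attached to "bought"
labels = ["nsubj", "ROOT", "det", "dobj", "prep", "pobj",
          "nsubj", "ROOT", "neg", "advmod", "acomp", "punct"]

print(sentence_starts_from_root_arcs(heads, labels))  # [0, 6]
```

Under these assumptions the ROOT arc from "bought" to "was" is read as a boundary, so the second sentence starts at "It".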
This bug is hurting one of my projects too that relies on being able to serialize large docs to avoid re-parsing when they are utilized later. Is there any workaround on the outside or internal patch that you can think of? |
Haven't tested this yet but you could replace the call to doc.sents with
def iter_sents(doc):
This might not do what you want though --- this inserts extra sentence
On Thursday, April 21, 2016, Robert Clewley [email protected]
|
Yes, I don't want the extra boundaries. Could I extract these edges before serializing and force new sentences with these edges after using |
Thank you, this does seem to work for now. For future reference, you just need the |
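The general shape of the workaround discussed above (record the sentence boundaries before serializing, then reuse them after loading) could look something like the following sketch. It deliberately uses only the Python standard library and treats the serialized document as an opaque byte string. The helper names `pack_doc`, `unpack_doc`, and `split_sentences` are hypothetical, and the sketch does not show how to re-apply the boundaries to a spaCy Doc, since that detail is cut off in this thread.

```python
import base64
import json


def pack_doc(doc_bytes, sent_starts):
    """Bundle an opaque serialized doc with its sentence start indices."""
    return json.dumps({
        "doc": base64.b64encode(doc_bytes).decode("ascii"),
        "sent_starts": list(sent_starts),
    })


def unpack_doc(blob):
    """Inverse of pack_doc: returns (doc_bytes, sent_starts)."""
    record = json.loads(blob)
    return base64.b64decode(record["doc"]), record["sent_starts"]


def split_sentences(tokens, sent_starts):
    """Slice a token sequence into sentences using the saved start indices."""
    bounds = list(sent_starts) + [len(tokens)]
    return [tokens[start:end] for start, end in zip(bounds, bounds[1:])]


# Usage with stand-in data (the bytes here are a placeholder, not a real doc).
blob = pack_doc(b"\x00\x01payload", [0, 6])
doc_bytes, sent_starts = unpack_doc(blob)
```

Keeping the boundaries next to the payload means the sentences seen after loading are the ones computed before serialization, regardless of how the loader re-derives them.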
…responsibility for setting sentence boundaries. Re Issue #322
I've been running into a problem where a parse's sentence boundaries change after converting it to a bytestring:
This happened to be one where the sentence boundaries were more correct after the conversion, but I have other examples where it actually breaks the parse.
EDIT: the parse is already broken; in the original, two ROOTs appear in the same sentence, whereas in the from_bytes version, the ROOTs are forced to be in different sentences. Not sure if this means there is a bug in the serialization or in the initial sentence boundary detection!
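A quick way to spot the symptom described in the EDIT (two ROOTs inside one sentence) is to count self-headed tokens per sentence. The sketch below works on plain lists instead of spaCy objects so it stays self-contained. The function name and the head-index encoding (a root is its own head) are assumptions for illustration, and the example heads are hand-written, not parser output.

```python
def count_roots_per_sentence(heads, sent_starts):
    """Count tokens that are their own head within each sentence span."""
    bounds = list(sent_starts) + [len(heads)]
    counts = []
    for start, end in zip(bounds, bounds[1:]):
        counts.append(sum(1 for i in range(start, end) if heads[i] == i))
    return counts


# Twelve tokens with two roots (indices 1 and 7).
heads = [1, 1, 3, 1, 1, 4, 7, 7, 7, 10, 7, 7]

print(count_roots_per_sentence(heads, [0]))     # [2]  one sentence, two roots
print(count_roots_per_sentence(heads, [0, 6]))  # [1, 1]  one root per sentence
```

Any count other than 1 flags a sentence whose boundaries and tree disagree, which matches the before/after difference reported here.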