-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Setting the ZZZ connector length_limit to 1 #214
Comments
Sure, seems plausible to me. I don't understand what EMPTY-WORD does, given that the only ZZZ+ is at the start of the sentence ... do we really need EMPTY-WORD? |
I'm not sure I understand what you don't understand... Here is a summary on EMPTY-WORD - it surely has nothing new for you: The ZZZ+ on the LEFT-WALL was added in order to support the quotation mark tokens, that for now has ZZZ- (as if they were EMPTY-WORDs). So we still need it. Since EMPTY-WORD and quotation mark tokens has ZZZ- and every word has ZZZ+, the parser may try to match the ZZZ- with word distance of more than 1, which is a useless try. (I don't know about any sentence example in which such links of ZZZ- and ZZZ+ with a word distance > 1 are actually created, but if this would happen it would of course be a bogus linkage.) So for now I will (eventually) send a pull request to set the ZZZ distance to 1 for sentences in which short_len is getting set, for the slight speedup it makes possible. Later I would like to try a more efficient solution, of doing the whole setting of length_limit once in sentence-size O(1) complexity (there are 2 ways to do that, I will have to test both of them). |
I think I forget that every word automatically gets an optional ZZZ+ . I wonder if this is documented on the website. |
I think it is not documented there at all. BTW, I found an example of ZZZ links that get linked with a distance > 1, in case there are null words in the linkage:
(Not related to this issue: Such sentences could parse correctly if we allowed "parallel regexes", |
There is now documentation for ZZZ on the website. Its minimal. I did a lot of work last night on a QU link type, which handles simpler paraphrasing quotations correctly, although it still messes up on more complex cases. That means that ZZZ is no longer used for those cases. The place where ZZZ does get used is for these "random quotes" where people just "randomly" use "quotes" for "no reason". Its OK to limit distance to 1 if the underlying sentence isn't parsing anyway. Such a limit does not make things worse. |
issue #632 cause work in this area to happen. Is this issue still alive, or can it be closed? |
If this is not considered a problem, then yes, it can be closed. Anyway, I still investigate handling of stray quotes as a totally unlinked tokens, with not even null links. |
In en/corpus-fixes.batch there are about 22K ZZZ actual connector matchings to other ZZZ connectors with distance>1.
In ru/corpus-basic.batch there are about 7K such comparisons.
These comparisons, which naturally succeed and add to the parsing time, could be eliminated.
However, for short sentences (< short_len) it is currently better to avoid altogether the length_limit setup in the standard parser, because it is done in a loop (set_connector_length_limits()) which takes more CPU time than the saving.
But in case of long sentences, in which the said loop is done anyway, we can also set the ZZZ length_limit to 1 with a negligible overhead (in the SAT solver - even for short sentences).
The comparison to ZZZ can be done now using string_set_cmp(). However, there is a need to perform
(one time per sentence) string_set_add(EMPTY_CONNECTOR, sent->dict->string_set);
in order to get its string-set address.If all of that seems as a good idea I can send a pull request.
The text was updated successfully, but these errors were encountered: