Setting the ZZZ connector length_limit to 1 #214

Closed
ampli opened this issue Nov 26, 2015 · 7 comments

ampli commented Nov 26, 2015

In en/corpus-fixes.batch there are about 22K actual matchings of ZZZ connectors to other ZZZ connectors with distance > 1.
In ru/corpus-basic.batch there are about 7K such comparisons.

These comparisons, which naturally succeed and only add to the parsing time, could be eliminated.
However, for short sentences (< short_len) it is currently better to skip the length_limit setup in the standard parser altogether, because it is done in a loop (set_connector_length_limits()) that takes more CPU time than it saves.

But for long sentences, in which that loop is done anyway, we can also set the ZZZ length_limit to 1 with negligible overhead (in the SAT solver - even for short sentences).

The comparison to ZZZ can now be done using string_set_cmp(). However, string_set_add(EMPTY_CONNECTOR, sent->dict->string_set) needs to be called (once per sentence) in order to get its string-set address.
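To make it concrete, here is a rough sketch of the kind of change I have in mind. The struct and field names below are from memory and only illustrative; this is not the actual patch:

    /* Illustrative sketch, not the actual patch. The field names
     * (word[].d, d->left/right, c->string, c->length_limit) are
     * assumptions from memory, and string_set_cmp() is assumed to
     * compare two strings of the same string-set by address.
     * To be called only when set_connector_length_limits() runs anyway,
     * i.e. for long sentences (or always, in the SAT solver). */
    static void set_zzz_length_limit(Sentence sent)
    {
        /* One string_set_add() per sentence, to get the string-set address. */
        const char *zzz = string_set_add(EMPTY_CONNECTOR, sent->dict->string_set);

        for (size_t w = 0; w < sent->length; w++)
        {
            for (Disjunct *d = sent->word[w].d; d != NULL; d = d->next)
            {
                for (Connector *c = d->left; c != NULL; c = c->next)
                    if (string_set_cmp(c->string, zzz)) c->length_limit = 1;
                for (Connector *c = d->right; c != NULL; c = c->next)
                    if (string_set_cmp(c->string, zzz)) c->length_limit = 1;
            }
        }
    }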

If all of that seems like a good idea, I can send a pull request.


linas commented Nov 26, 2015

Sure, seems plausible to me. I don't understand what EMPTY-WORD does, given that the only ZZZ+ is at the start of the sentence ... do we really need EMPTY-WORD?


ampli commented Nov 26, 2015

I'm not sure I understand what you don't understand...

Here is a summary of EMPTY-WORD - it surely contains nothing new for you:
EMPTY-WORD was originally added to support Russian stems with a null suffix. It is still needed (in all the languages now) to handle alternatives with a different number of words in the word array that the parser uses. If the parser could work on the word-graph directly, we would not need the EMPTY-WORD device.

The ZZZ+ on the LEFT-WALL was added in order to support the quotation-mark tokens, which for now have ZZZ- (as if they were EMPTY-WORDs).

So we still need it. Since EMPTY-WORD and quotation-mark tokens have ZZZ- and every word has ZZZ+, the parser may try to match the ZZZ- at a word distance of more than 1, which is a useless try. (I don't know of any example sentence in which such ZZZ- to ZZZ+ links with a word distance > 1 are actually created, but if that happened it would of course be a bogus linkage.)

So for now I will (eventually) send a pull request that sets the ZZZ distance to 1 for sentences in which short_len gets set, for the slight speedup this makes possible. Later I would like to try a more efficient solution: doing the whole length_limit setup once, in O(1) complexity in the sentence size (there are 2 ways to do that; I will have to test both of them).


linas commented Nov 27, 2015

I think I forgot that every word automatically gets an optional ZZZ+. I wonder if this is documented on the website.


ampli commented Dec 5, 2015

I think it is not documented there at all.

BTW, I found an example of ZZZ links that get created with a distance > 1 when there are null words in the linkage:

We should distinguish between "who" and "whom" but who fucking cares?

Found 12 linkages (8 had no P.P. violations) at null count 4
    Linkage 1, cost vector = (UNUSED=4 DIS= 2.00 LEN=43)

    +------------------------------------------------Xp-----------------------------------------------+
    +-------------------------------------------->WV-------------------------------------------->+    |
    |      +-----------------------CC----------------------+----------Ws---------+------Ss*w-----+    |
    +--Wd--+--Sp-+-----I----+-------ZZZ-------+--ZZZ--+    +ZZZ+---ZZZ--+        +--EL--+        |    |
    |      |     |          |                 |       |    |   |        |        |      |        |    |
LEFT-WALL we should.v distinguish.v [between] " [who] " and.ij " [whom] " [but] who fucking.e cares.v ? 

(Not related to this issue: Such sentences could parse correctly if we allowed "parallel regexes" (a regex like /"[^"]*"/), whose experimental support has already been removed. However, additional infrastructure is needed in order to prevent the spurious parses this may cause. It seems to me it has something in common with the zero-word support, and with the problem of splitting words when their component morphemes are in the dictionary but the linkage between them is not supported by the dictionary. I guess it is better to open a new issue about that.)


linas commented Dec 5, 2015

There is now documentation for ZZZ on the website. It's minimal.

I did a lot of work last night on a QU link type, which handles simpler paraphrasing quotations correctly, although it still messes up on more complex cases. That means that ZZZ is no longer used for those cases.

The place where ZZZ does get used is for these "random quotes" where people just "randomly" use "quotes" for "no reason".

It's OK to limit the distance to 1 if the underlying sentence isn't parsing anyway. Such a limit does not make things worse.


linas commented Feb 27, 2018

Issue #632 caused work in this area to happen. Is this issue still alive, or can it be closed?


ampli commented Feb 27, 2018

> It's OK to limit the distance to 1 if the underlying sentence isn't parsing anyway. Such a limit does not make things worse.

If this is not considered a problem, then yes, it can be closed.

Anyway, I am still investigating the handling of stray quotes as totally unlinked tokens, without even null links.
In that case this last usage of ZZZ will not be needed.

ampli closed this as completed Feb 27, 2018