Implementing connector length_limit per connector name #632
This is an interesting idea, but premature. To establish the correct costs to use would require a detailed statistical analysis, which I am not prepared to do at this point. I also suspect that the hand-authored English dictionary has so many gross errors that this kind of fine adjustment would be washed out.
It should not depend on options. My goal is to minimize/avoid the use of options, with certain "reasonable" exceptions.
Probably not. Some of these optimizations aren't obvious till you stare at the code. Also, the code has changed enough that certain optimizations are now much more obvious.
I'd prefer to handle this the same way that …
Avoid using the equals sign for now; we might need it later. The above is very simple and not at all flexible, but that is all we need for now; I'd rather not over-engineer this. We can do something fancier when we know more.
I implemented it [ignoring the cost for now]. It caused a drastic speedup of the Russian handling - about a 50% CPU reduction on the batch [EDIT: (*)] and only 65 seconds/1.3 GB (instead of about one hour/6.6 GB) on the problematic sentence of issue #537. I used … However, … So I need your advice regarding this problem.
Regarding different costs for longer length_limits: EDIT: See the next post... Since the connector length limit has to be recomputed for each connector (currently separately in expression pruning and power pruning - unless expressions get changed to use the Connector struct), and all of this twice if there is no complete linkage, there is a benefit to using a connector length-limit cache. A natural place to put such a cache is the Dictionary struct. However, it is not available at the point it is needed for the length-limit computation.
(1) and (2) are easy. (3) is cumbersome and adds overhead of its own. Please advise...
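For illustration only, here is a minimal C sketch (all names hypothetical, not the library's actual code) of the kind of per-connector length-limit cache being discussed: the limit is computed once per connector descriptor after the dictionary is read, and both pruning passes then just read it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the actual library code: one cache slot per
 * connector descriptor, so each descriptor's length limit is computed
 * once (after the dictionary is read) and then shared by expression
 * pruning and power pruning instead of being recomputed every time. */

typedef struct {
    const char *name;     /* connector name, e.g. "SXI" */
    int length_limit;     /* cached effective limit */
} ConDesc;

/* Stand-in for whatever rule decides the limit of one connector name
 * (LENGTH-LIMIT-n entries, UNLIMITED-CONNECTORS, short_length, ...). */
static int compute_limit(const char *name, int short_length)
{
    if (strncmp(name, "ZZZ", 3) == 0) return 1;   /* made-up rule */
    if (strncmp(name, "PH", 2) == 0) return 1;    /* made-up rule */
    return short_length;
}

/* Fill the cache once; the descriptor's ordinal number is its index. */
static void build_length_limit_cache(ConDesc *tbl, int n, int short_length)
{
    for (int i = 0; i < n; i++)
        tbl[i].length_limit = compute_limit(tbl[i].name, short_length);
}

int main(void)
{
    ConDesc table[] = { { "ZZZ", 0 }, { "PHv", 0 }, { "MVp", 0 } };
    build_length_limit_cache(table, 3, /*short_length=*/10);
    for (int i = 0; i < 3; i++)
        printf("%-3s -> %d\n", table[i].name, table[i].length_limit);
    return 0;
}
```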
Updates (see also EDIT):
EDIT (more info):
Regarding … : Yes, in English, both the …
My apologies for not answering this question much earlier:
Is this really true? I see …
So: …
Next: …
The SAT code is already a mess in this area. I did not study it carefully. Making it messier doesn't seem wrong, just yet. So I would prefer to have the cache in the Dict, and not elsewhere. ... unless I misunderstand the problem.
Where should I put it?
Here is a documentation draft for the added dictionary LENGTH-LIMIT-n definitions. It is not included in the PR because I don't know where to put it (possible places are NEWS or a new README about the dict syntax). I have no idea where in the current documents it should reside - I didn't find a section dedicated to the syntax of the dictionary. For context, I started the description with text copied from "6.2 Using Parse Options".

Setting the length limit of connectors

As noted in section "6.2 Using Parse Options", the short_length parameter determines how long links are allowed to be. The intended use of this is to speed up parsing by not considering very long links for most connectors, since they are very rarely used in a correct parse. An entry for UNLIMITED-CONNECTORS in the dictionary specifies which connectors are exempt from the length limit. A new feature in version 5.4.4 is the ability to specify in the dictionary a specific length limit for connectors. An entry for LENGTH-LIMIT-n is used for that, where n is the desired length limit.
As with UNLIMITED-CONNECTORS, only … For setting connectors according to a connector-name prefix, a …
(Add here about possible ambiguity of this …)

In addition to a possible parse speedup, this feature can also prevent possible bogus parses in sentences that have more than one word with such explicitly limited connectors and that can be parsed only with null links to at least one of these words. Example: …
The short_length parse option doesn't change the length limit of such explicitly limited connectors unless its value is less than their defined length limit.
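A rough C sketch of the behaviour described in this draft (made-up names, not the library's actual code): an explicitly limited connector keeps its dictionary-defined limit unless short_length is even smaller, and UNLIMITED-CONNECTORS connectors ignore short_length altogether.

```c
#include <stdio.h>

/* Hypothetical sketch of how a connector's effective length limit
 * could be resolved, per the behaviour described above. */

#define UNLIMITED 255   /* "no limit" sentinel, bounded only by sentence length */

static int effective_limit(int dict_limit,   /* from LENGTH-LIMIT-n, or 0 if none  */
                           int is_unlimited, /* listed under UNLIMITED-CONNECTORS? */
                           int short_length) /* the short_length parse option      */
{
    if (is_unlimited) return UNLIMITED;
    if (dict_limit > 0)
        /* short_length only shortens an explicitly limited connector,
         * it never lengthens it. */
        return (short_length < dict_limit) ? short_length : dict_limit;
    return short_length;  /* ordinary connectors follow short_length */
}

int main(void)
{
    printf("%d\n", effective_limit(1, 0, 10));  /* explicitly limited: 1 */
    printf("%d\n", effective_limit(0, 0, 10));  /* ordinary: 10          */
    printf("%d\n", effective_limit(0, 1, 10));  /* unlimited: 255        */
    return 0;
}
```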
I couldn't set the length-limit of …
Instead I have set …
I guess that PH can be added now to the length-limit=1 setup.
Writing rules to make sure that PH links are always just length-one is hard to do. So the length-limit constraint is nice.
I believe that ZZZ is also length-limit=1 always for English, right?
Just now, ZZZ is only used for "random" quotation marks. However, the idea of attaching ZZZ+ only to words before quotes is somewhat problematic when that word is a null word. In that case the quote may become a null word too, because it cannot attach to anything. A solution could be to add ZZZ+ to all the words before a quote (and then length-limit=1 for ZZZ cannot be used). BTW: …
See https://www.abisource.com/projects/link-grammar/dict/introduction.html#MORPH (section 1.2.12) for docs.
This is an interesting example, as it is a meta-example -- both of the quoted words should not be looked up in the dictionary, but instead, should be treated as unknown nouns:
which does parse correctly. So, somehow ignore all of the connectors on a single quoted word, and use the …
Because frequently a sentence can be reasonably parsed ignoring the quotes on quoted words (e.g. …)
Well, yes. There are (at least) three types of distinct quote usage: …
Both case 3 and case 4 can probably be handled the same way.
Just to be clear: as a native English speaker, when I read …
Note:
I'm assuming that the new way of working with connectors means that they can no longer be deleted in this particular way.
To sum up the possibilities for "word" tokenization:
(I'm for (2), because I think that (1) disregards possible info in "word" that may still be interesting.)

This issue of "connector length_limit per connector name" can be closed now, but I don't know what to do with the issue of handling quotes that got mentioned here.
Option 2.
I copied one part of the above discussion into the very old issue #42. If you want to copy/add more info there, feel free. Close these whenever you are ready.
Also, I'm guessing that issue #214 is now a dead issue?
(Issue #214: Setting the ZZZ connector length_limit to 1.) BTW, ZZZ is still used for quotes. Replacing it by another mechanism (as well as better quote handling) is on my TODO list.
I mostly finished the connector-enumeration modification and, as expected, it gives a significant speedup. Currently it uses a 32-byte Connector struct (instead of 64 bytes).
Later it may allow further shrinking of the Connector struct (even 4 bytes is possible if desired - the connector descriptor number, a bool:1 multi flag, and length_limit). It will also allow reducing the size of the Table_connector struct, whose memory access is a major bottleneck.
Since I changed parts of the library to use connector descriptors instead of connectors (their ordinal number serves as a perfect hash), I also changed the expression pruning accordingly. On the same occasion I added a length_limit check (as you wrote here). This significantly speeds up the power pruning, which results in a speedup on every (not trivially short) sentence.
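As a rough illustration of the direction described here (the layout is hypothetical, not the actual struct), the per-instance connector can shrink to a descriptor index plus a few small fields, with the shared per-type data kept in a table indexed by that ordinal number:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layouts, for illustration of the idea only. */

/* Shared, per-connector-type data: one entry per distinct connector
 * in the dictionary, indexed by its ordinal ("perfect hash") number. */
typedef struct {
    const char *string;     /* connector name, e.g. "MVp" */
    uint8_t length_limit;   /* per-type limit, from LENGTH-LIMIT-n etc. */
} condesc_t;

/* Per-instance data: kept small so large connector arrays stay cache-friendly. */
typedef struct {
    uint32_t desc;          /* index into the condesc_t table   */
    uint8_t  nearest_word;  /* per-instance pruning bookkeeping */
    uint8_t  farthest_word;
    unsigned multi : 1;     /* the "@" multi-connector flag     */
} connector_t;

int main(void)
{
    printf("sizeof(connector_t) = %zu bytes\n", sizeof(connector_t));
    return 0;
}
```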
Now I would like to add morphology-connector length limits to solve problems like issue #537 (Russian sentences that are not very long take a very large amount of memory and CPU time). This may also solve some strange cases of linkages with null words, in which a stem of one word is connected to a suffix of another one.
(BTW, I suppose that - and + connectors never need different length_limits; the original code had a partial implementation for that, which I once removed.)

There are several questions regarding it, among them:
The previous discussion of defining length limit per connector type is here.
1. Whether to implement length_limit weighting, and if so, how.
You said there:
I didn't understand how you intend to make such a weighting, and responded with a suggestion to make a length_limit table (per connector type) which depends on the cost option, e.g. … I.e., higher costs allow using a longer length_limit.
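Purely as a hypothetical illustration (connector names and numbers are made up), such a cost-dependent length_limit table could look like this:

```c
#include <stdio.h>

/* Hypothetical illustration only: a per-connector-type table in which
 * a higher "cost" setting unlocks a longer length_limit. */
typedef struct {
    const char *prefix;   /* connector-name prefix */
    int limit_by_cost[4]; /* limit at cost level 0..3 */
} cost_limit_t;

static const cost_limit_t table[] = {
    /* prefix   cost=0  1   2   3  */
    { "PH",   {    1,   1,  1,  1 } },
    { "SXI",  {    3,   5,  8, 12 } },
    { "MV",   {    6,  10, 16, 24 } },
};

int main(void)
{
    int cost = 2;  /* e.g. taken from the parse options */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        printf("%-4s -> length_limit %d at cost %d\n",
               table[i].prefix, table[i].limit_by_cost[cost], cost);
    return 0;
}
```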
But now I guess that you intended that each "extra length" would get translated to an added (disjunct, or a different metric?) cost, probably according to a predefined table. This would reduce the pruning efficiency, but maybe not by much.
As you said, a good option is to ignore length_limit weighting for now.
2. Which file format to use.
Maybe a format like this, in the affix file (# is a wildcard that includes uppercase): …
And an extension for weighting: …
(Maybe specifying several connectors can be allowed, like in UNLIMITED-CONNECTORS.)

3. How to implement length_limit caching.
If the length_limit doesn't depend on the options, this can just be done in the Dictionary struct. (If it needs to depend on the options, it is much more complicated to implement its caching, because there is no natural place to put it.)
Last thing:
For the expression pruning, instead of a length_limit field I used a farthest_word field! This is because the word number of each connector is known, and this saves the overhead of computing the word difference before each easy-match.
I don't know why it has not been generally done - did I miss something?
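A toy C sketch of this idea (field and function names are made up): because the word a connector sits on is known, the limit can be folded once into an absolute farthest_word, turning the per-candidate range test into a single comparison instead of a subtraction each time.

```c
#include <stdio.h>

/* Hypothetical sketch of the farthest_word idea described above. */
typedef struct {
    int word;           /* word this connector belongs to        */
    int length_limit;   /* per-type limit                        */
    int farthest_word;  /* word + length_limit, precomputed once */
} conn_t;

/* With only a length_limit, every candidate match needs a subtraction: */
static int in_range_by_limit(const conn_t *c, int other_word)
{
    return (other_word - c->word) <= c->length_limit;
}

/* With farthest_word precomputed, it is a single comparison: */
static int in_range_by_farthest(const conn_t *c, int other_word)
{
    return other_word <= c->farthest_word;
}

int main(void)
{
    conn_t c = { .word = 3, .length_limit = 2 };
    c.farthest_word = c.word + c.length_limit;  /* done once, up front */
    for (int w = 4; w <= 7; w++)
        printf("word %d: limit-check=%d farthest-check=%d\n",
               w, in_range_by_limit(&c, w), in_range_by_farthest(&c, w));
    return 0;
}
```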