
Implementing connector length_limit per connector name #632

Closed
ampli opened this issue Jan 11, 2018 · 26 comments

ampli commented Jan 11, 2018

I mostly finished the connector-enumeration modification, and as expected it gives a significant speedup. Currently it uses a 32-byte Connector struct (instead of 64).

Later it may allow further shrinking of the Connector struct (even 4 bytes is possible if desired: the connector descriptor number, a bool:1 multi flag, and the length_limit). It will also allow reducing the size of the Table_connector struct, whose memory access is a major bottleneck.
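To illustrate the kind of packing meant here, a minimal C sketch of a possible 4-byte layout (field names and bit widths are illustrative assumptions, not the actual link-grammar definitions):

/* Hypothetical 4-byte Connector: the descriptor ordinal plus the
 * per-connector attributes packed into 32 bits. Illustrative only. */
typedef struct {
    unsigned desc_num     : 20; /* ordinal of the connector descriptor */
    unsigned multi        : 1;  /* the '@' multi-connector flag */
    unsigned length_limit : 11; /* max link length */
} PackedConnector;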

Since I changed parts of the library to use connector descriptors instead of connectors (their ordinal number serves as a perfect hash), I also changed the expression pruning. On the same occasion I added a length_limit check (as you wrote here). This significantly speeds up the power-pruning, which results in a speedup on every (not trivially short) sentence.

Now I would like to add morphology-connector length limits to solve problems like issue #537 (Russian sentences that are not very long consume a great deal of memory and CPU time). This may also solve some strange cases of linkages with null words, in which a stem of one word is connected to a suffix of another one.

(BTW, I suppose that - and + connectors never need different length_limits; the original code had a partial implementation of that, which I removed at some point.)

There are several questions regarding it, among them:

  1. Whether to initially implement length_limit weighting, and if so, how.
  2. Which file format to use. It would be good to use a flexible and extendable format, so it can be extended with easy backward compatibility (so old dicts can still be used with new libraries, e.g. for speed comparisons).
  3. Where to implement length_limit caching. Caching is needed so the length_limit definitions are not parsed and recomputed for each sentence, and so the result is not computed twice (once in expression pruning and once in power pruning).

The previous discussion of defining length limit per connector type is here.

1. Whether to implement length_limit weighting, and if so, how.
You said there:

Even if LL-style connectors were weighted, their length would probably still be restricted to 2 or 3, and so pruning would still mostly work. But maybe we should ignore weights until later; I'm not ready yet with complex morphology.

I didn't understand how you intend to make such weighting, and responded with a suggestion to make a length_limit table (per connector type) which depends on the cost option, e.g.

% For the connector PLx, the length limit depends on the cost, as follows:
% length_limit=1 for cost<=2.7
% length_limit=3 for 2.7<cost<4
% length_limit=5 for cost>4
PLx  1 2.7   3 4   5: LENGTH_LIMIT+;

I.e., higher costs allow using a longer length_limit.
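If such a cost-dependent table were ever implemented, it might be consulted roughly like this (a minimal sketch mirroring the PLx example above; all names and thresholds are illustrative, not actual library code):

#include <float.h>
#include <stddef.h>

/* Illustrative cost-threshold table for one connector type. */
typedef struct {
    double max_cost;     /* step applies while cost <= max_cost */
    int    length_limit;
} LimitStep;

static int limit_for_cost(const LimitStep *steps, size_t n, double cost)
{
    for (size_t i = 0; i < n; i++)
        if (cost <= steps[i].max_cost)
            return steps[i].length_limit;
    return steps[n - 1].length_limit; /* highest-cost bucket */
}

/* The PLx example: limit 1 up to cost 2.7, then 3 up to 4, then 5. */
static const LimitStep plx_steps[] = { {2.7, 1}, {4.0, 3}, {DBL_MAX, 5} };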

But now I guess that you intended that each "extra length" would get translated to an added cost (a disjunct cost, or a different metric?), probably according to a predefined table. This would reduce the pruning efficiency, but maybe not by much.

As you said, a good option is to ignore length_limit weighting for now.

2. Which file format to use.
Maybe a format like this, in the affix file (# is a wildcard that also matches uppercase letters):

length_limit=1:  LL#+;
length_limit=2:  PL#+;

And an extension for weighting:

length_limit=1:  LL#+;
length_limit=2:  [LL#+]0.5;
length_limit=3:  [LL#+]1.7;

(Maybe specifying several connectors can be allowed, like in UNLIMITED-CONNECTORS.)

3. How to implement length_limit caching.
If the length_limit doesn't depend on the options, this can simply be done in the Dictionary struct. (If it needs to depend on the options, its caching is much more complicated to implement, because there is no natural place to put it.)
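A minimal sketch of what a Dictionary-resident cache could look like, using the connector-descriptor ordinal mentioned above as the index (all names here are illustrative, not the real structs):

#include <stddef.h>

/* Per-dictionary cache of resolved length limits, indexed by the
 * connector-descriptor ordinal (which serves as a perfect hash). */
typedef struct {
    int    *limit_by_desc; /* length limit per descriptor ordinal */
    size_t  n_desc;
} LengthLimitCache;

/* Filled once at dictionary-load time; afterwards both expression
 * pruning and power pruning just index into it. */
static int cached_limit(const LengthLimitCache *c, size_t desc_ordinal)
{
    return c->limit_by_desc[desc_ordinal];
}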

Last thing:
For the expression pruning, instead of a length_limit field I used a farthest_word field!
This works because the word number of each connector is known, and it saves the overhead of computing the word difference before each easy-match.
I don't know why this hasn't been done in general; did I miss something?
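To make the idea concrete, here is a sketch (with illustrative fields, not the actual Connector struct) of how the precomputation turns the per-match length check into a single comparison:

#include <stdbool.h>

typedef struct {
    int word;           /* word number this connector belongs to */
    int length_limit;   /* maximum allowed link length */
    int farthest_word;  /* precomputed once: word + length_limit */
} ConnSketch;

/* At connector-setup time, for a '+' (right-going) connector: */
static void set_farthest(ConnSketch *c)
{
    c->farthest_word = c->word + c->length_limit;
}

/* In the matching loop, one comparison replaces the subtraction
 * plus comparison done before each easy-match: */
static bool within_limit(const ConnSketch *c, int candidate_word)
{
    return candidate_word <= c->farthest_word;
}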


linas commented Jan 11, 2018

length_limit table (per connector type) which depends on the cost option,

This is an interesting idea, but premature. Establishing the correct costs to use would require a detailed statistical analysis, which I am not prepared to do at this point. I also suspect that the hand-authored English dictionary has so many gross errors that this kind of fine adjustment would be washed out.

  3. How to implement length_limit caching. ... the length_limit doesn't depend on the options

It should not depend on the options. My goal is to minimize/avoid the use of options, with certain "reasonable" exceptions.

farthest_word ... did I miss something?

Probably not. Some of these optimizations aren't obvious until you stare at the code. Also, the code has changed enough that certain optimizations are now much more obvious.

  2. Which file format to use.

I'd prefer to handle this the same way that UNLIMITED-CONNECTORS is currently handled. So, in the main 4.0.dict file:

LENGTH-LIMIT-1:  LL+;
LENGTH-LIMIT-3:  [LL+];

Avoid using the equals sign for now; we might need it later. The above is very simple and not at all flexible, but that is all we need for now; I'd rather not over-engineer this. We can do something fancier when we know more.


ampli commented Jan 12, 2018

I implemented it [ignoring the cost for now]. It caused a drastic speedup in the Russian handling: ~50% CPU reduction on the batch [EDIT: (*)] and only 65 seconds/1.3GB (instead of about one hour/6.6GB) on the problematic sentence of issue #537.
EDIT: The above speed report is for the classic parser. A more detailed report follows in an additional post.

I used * as an uppercase wildcard:
LENGTH-LIMIT-1: LL*+;
The idea of using such a wildcard is that one can set the length-limit of, for example, connector A without affecting AA and so on.

However, * just after the uppercase part is a valid regular-connector combination (the English dict uses it), so this may, in principle, create ambiguity.
I originally wanted to use #, but then we would have to accept it (in check_connector()) as a valid connector character, even though it is not supposed to appear in a regular connector. Fixing that seems to have non-negligible overhead, since at the point of the said check we don't yet know that this is a length-limit definition (unless it is fixed in a "crazy" way, e.g. storing a tentative error and reporting it later if this turns out not to be a length-limit definition).

So I need your advice regarding this problem.
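For reference, the intended semantics of the * wildcard, as a minimal sketch (the non-wildcard branch is a simplification; real connector matching has its own rules):

#include <stdbool.h>
#include <string.h>

/* "LL*" matches LL, LLA, LLSX, ...; a plain "A" entry matches A
 * but not AA. Illustrative only. */
static bool limit_pattern_matches(const char *pat, const char *name)
{
    size_t n = strlen(pat);
    if (n > 0 && pat[n - 1] == '*')
        return strncmp(pat, name, n - 1) == 0; /* prefix match */
    return strcmp(pat, name) == 0;             /* exact match */
}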


ampli commented Jan 12, 2018

Regarding different costs for longer length_limits:
I think it is beneficial to apply a cost cutoff when setting connector lengths.
So the actual setup of connector lengths depends on the cost option and also on the short and all_short options.

EDIT: See the next post...

Since the connector length limit has to be recomputed for each connector (currently separately in expression pruning and in power pruning, unless expressions are changed to use the Connector struct), and all of this twice if there is no complete linkage, there is a benefit in using a connector-length cache.

A natural place to put such a cache is the Dictionary struct. However, it is not available at the point where it is needed for the length-limit computation.
Possible solutions for that cache:

  1. Since Parse_Options is available (it is needed anyway for the length_limit computation), the cache can be stored in it.
  2. A TLS variable.
  3. Adding the Dictionary struct as an argument all the way down to the places where the connector setup is done (which are deep).

(1) and (2) are easy. (3) is cumbersome and adds overhead of its own.

Please advise...


ampli commented Jan 14, 2018

Updates (see also EDIT):

  1. Regarding the length-limit cache: I will skip it for now, because its cost is negligible relative to other overhead that stands out in the profiling.

  2. Regarding the connector prefix format: I still don't have a good solution, so I use LL*.
    I would like a better solution before I post the PR (but if desired I can post it as-is and we can change it later).

  3. Regarding different weights for different length limits: I thought of creating some infrastructure for that, but there are so many open questions that I think it is better to postpone it until we have a good use case to guide us.

  4. The reduction in CPU for Russian applies to both the classic and SAT parsers. Most of the contribution for the classic parser is due to length-limit=1 for LL, but I don't have exact numbers for that since I made other optimizations too.
    The total speedup so far is more than 2x for Russian in the classic parser, and 8% for the SAT parser.
    The SAT parser is still faster for Russian.
    The English speedup is about 5% for both parsers (from the other optimizations).

EDIT (more info):
For long sentences the speedup is more drastic. In a previous post I noted the vast improvement (maybe > 50x) for the Russian sentence of issue #537 (mainly due to the length-limit addition).
The English speedup for long sentences comes from the other optimizations, mainly the connector enumeration.
For the test sentence "The problem is..." (44 words + 9 punctuation tokens), the speedup is about 15%.

  5. Are there connectors in the English dictionary that can be set to length_limit=1?
    (I guess that YS can.)


linas commented Jan 22, 2018

Regarding LL#+ -- just use LL*+ instead. The hash-mark idea was nice, but the ambiguity of using * for this is minimal. The documentation explaining length limits will have to be written ...

Yes, in English, both YS and YP would be length 1. Also PH would be length 1; it's kind of the whole point of PH.


linas commented Jan 22, 2018

My apologies for not answering this question much earlier:

A natural place to put such a cache is the Dictionary struct. However, it is not available at the point where it is needed for the length-limit computation.

Is this really true? I see length_limit appearing in four places:

parse/preparation.c
parse/fast-match.c
parse/prune.c
sat-solver/word-tag.cpp

So: prune_context points to Sent, which points to Dict, so no problem there. For extra speed, copy Dict into prune_context.

Next: fast_matcher_t does not have Dict in it, but there is no reason it could not be placed there. Dict is available when fast_matcher_t is alloced.

parse/preparation.c seems to have Dict readily available.

The SAT code is already a mess in this area. I did not study it carefully. Making it messier doesn't seem wrong, just yet.

So I would prefer to have the cache in the Dict, and not elsewhere... unless I misunderstand the problem.


ampli commented Jan 22, 2018

The documentation explaining length limits will have to be written ...

Where should I put it?


linas commented Jan 22, 2018

The documentation explaining length limits will have to be written ...
Right now it's on svn for abiword, and you don't have access to that. I can write it, or you can write a rough draft and I can copy it in.


ampli commented Feb 14, 2018

Here is a documentation draft for the added dictionary LENGTH-LIMIT-n definitions. It is not included in the PR because I don't know where to put it (possible places are NEWS or a new README about the dict syntax).

I have no idea where it should reside in the current documents; I didn't find a section dedicated to the syntax of the dictionary. For context, I started the description with text copied from "6.2 Using Parse Options".


Setting the length limit of connectors

As noted in section "6.2 Using Parse Options", the short_length parameter determines how long the links are allowed to be. The intended use of this is to speed up parsing by not considering very long links for most connectors, since they are very rarely used in a correct parse. An entry for UNLIMITED-CONNECTORS in the dictionary specifies which connectors are exempt from this length limit.

A new feature in version 5.4.4 is the ability to specify in the dictionary a specific length limit for particular connectors. An entry for LENGTH-LIMIT-n is used for that, where n is the desired length limit.
Example:

LENGTH-LIMIT-1: YS+ & YP+ & PH+;

As with UNLIMITED-CONNECTORS, only + connectors are used in the definition, and the definition applies to all the other connectors that match them according to the connector matching rules.

For setting connectors according to a connector-name prefix, * is used as a wildcard character.
Example:

LENGTH-LIMIT-1: LL*+;

(Add here about possible ambiguity of this * notation?)

In addition to a possible parse speedup, this feature can also prevent bogus parses in sentences that contain more than one word with such explicitly limited connectors and that can be parsed only with null links to at least one of these words.

Example:

    +-------------------------------------Xp------------------------------------+
    +-------------Wd------------+                                               |
    |         +-------Sp3-------+--------------------MVv-------------------+    |
    |         |       +--LLFXI--+         +--------------LLDNF-------------+    |
    |         |       |         |         |                                |    |
LEFT-WALL они.m3pi окруж.= =или.vspdpp плот.= [ным] [кольцом] [кроват] =ь.ndfsv .

The short_length parse option doesn't change the length limit of such explicitly limited connectors unless its value is less than their defined length limit.
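In other words, as a one-line illustration rather than the library's actual code:

/* Effective limit of an explicitly limited connector under the
 * short_length parse option. */
static int effective_limit(int defined_limit, int short_length)
{
    return (short_length < defined_limit) ? short_length : defined_limit;
}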


ampli commented Feb 14, 2018

I couldn't set the length-limit of PH to 1, due to:

                +-----------Ost----------+
                |  +--------Ds**x--------+
    +---->WV--->+  +-----PHc-----+       |
    +->Wd--+-Ss-+  |     +---EA--+---A---+
    |      |    |  |     |       |       |
LEFT-WALL she 's.v a really.e good.a player.n

Instead, I set PHv.


linas commented Feb 14, 2018

Re: she's a really good player -- that's a bug, broken: the PH links must never be greater than one in length.


ampli commented Feb 14, 2018

I guess that PH can now be added to the length-limit=1 setup.
This has the benefit of omitting the "bad" linkages altogether (2 and 5 in the case of this sentence).
(But it will hide such problems if they happen again.)


linas commented Feb 15, 2018

Writing rules to make sure that PH links are always just length-one is hard to do. So the length-limit constraint is nice.


linas commented Feb 15, 2018

I believe that ZZZ is also always length-limit=1 for English, right?


ampli commented Feb 15, 2018

Currently ZZZ is only used for "random" quotation marks.
To save linkage time, an optional ZZZ+ is attached only to the word before them.
In that regard, length-limit=1 for ZZZ is desired.

However, the idea of attaching ZZZ+ only to the word before a quote is somewhat problematic when that word is a null word. In that case the quote may become a null word too, because it cannot attach to anything. A solution could be to add ZZZ+ to all the words before a quote (and then length-limit=1 for ZZZ cannot be used).

BTW:
I am still thinking about a better solution for quotes. I had an idea to add the quoted text, as one big token with the disjuncts of a special UNKNOWN-WORD that has a big cost, as an alternative to its individual tokens. It is still not clear to me how to do this in the general case (especially how to display it), but it seems doable for "word" (i.e. a single quoted word), which can be added as an alternative to the 3 tokens " word ".
So this would get a full and correct parse:
We should distinguish between "who" and "whom" but who fucking cares?


linas commented Feb 15, 2018

We should distinguish between "who" and "whom" but who fucking cares?

This is an interesting example, as it is a meta-example: both of the quoted words should not be looked up in the dictionary, but should instead be treated as unknown nouns:

We should distinguish between foobar and barfoo but who fucking cares?

which does parse correctly. So, somehow ignore all of the connectors on a single quoted word, and use the UNKNOWN-WORD.n connectors on it.


ampli commented Feb 15, 2018

Because a sentence can frequently be reasonably parsed while ignoring the quotes on a quoted word (e.g. He "agreed" by waving his hand), I thought that maybe the UNKNOWN-WORD should be an alternative with a higher cost, so that the "plain text" parse takes precedence.


linas commented Feb 15, 2018

Well, yes. There are (at least) three types of distinct quote usage:

  1. direct quotation: "Let's go to the movies", John said. These mostly work already.

  2. the "whom" example, where the word is quoted because the writer knows that it doesn't make grammatical sense, and is trying to keep the reader from getting confused. In this case, the quoted word should be treated as an unknown noun.

  3. Some "people" who "don't" know how to write "clearly" seem to use "quotes" for "emphasis". I'm not sure what to make of this. Quotes used in this way are usually meant to be sarcastic, indicating that the writer knows the statement is false, and wants the reader to pick up on the sarcasm more easily.

  4. For case three, I guess we could have the notion of "sarcasm quotes". Which brings us to case 4: a new term is being introduced and defined, so it is placed in quotes. These quotes are not sarcastic but definitional, although this is exactly the device that sarcastic writers enjoy using.

Both case 3 and case 4 can probably be handled the same way.


linas commented Feb 15, 2018

Just to be clear: as a native English speaker, when I read He "agreed" by waving his hand and see the word "agreed" in quotes, the mental image in my mind is that someone is flipping off, giving the bird, to someone else. That kind of agreement. These are sarcasm quotes at their finest.


linas commented Feb 16, 2018

Note: test/mem-leak now crashes with a double-free:

#6  free_connectors (e=<optimized out>) at ../../link-grammar/connectors.c:34
#7  0x00007ffff7b70fd5 in free_disjuncts (c=0x55555646fda0)
    at ../../link-grammar/disjunct-utils.c:33
#8  0x00007ffff7b79eca in classic_parse (sent=sent@entry=0x5555563cda00, 
    opts=opts@entry=0x55555576a4e0) at ../../link-grammar/parse/parse.c:364

I'm assuming that the new way of working with connectors means that they can no longer be deleted in this particular way.


ampli commented Feb 24, 2018

To sum up the possibilities for "word" tokenization:

  1. Tokenize it, including the quotes, as UNKNOWN-WORD.
  2. Tokenize it as now (separating the quotes) in case the word is used in a grammatical context, and add an UNKNOWN-WORD alternative for it (including the quotes).

(I'm for (2), because I think that (1) disregards possible info in "word" that may still be interesting.)


This issue of "connector length_limit per connector name" can be closed now, but I don't know what to do with the quote-handling issue that came up here.


linas commented Feb 27, 2018

To sum up the possibilities for "word" tokenization:
2. Tokenize ...

Option 2.


linas commented Feb 27, 2018

I copied one part of the above discussion into the very old issue #42. If you want to copy/add more info there, feel free. Close these whenever you are ready.


linas commented Feb 27, 2018

Also, I'm guessing that issue #214 is now a dead issue?

ampli closed this as completed Mar 7, 2018

ampli commented Oct 12, 2019

Also, I'm guessing that issue #214 is now a dead issue?

(Issue 214: Setting the ZZZ connector length_limit to 1)

BTW, ZZZ is still used for quotes. Replacing it with another mechanism (as well as better quote handling) is on my TODO list.
