
Missing linkages at null-count>0 due to empty-words or optional-word #527

Open
ampli opened this issue Feb 25, 2017 · 14 comments

@ampli
Member

ampli commented Feb 25, 2017

Until and including 5.3.13, the library used the empty-word device.
We already know that the null-count of linkages (when it is >0) may often be bogus for sentences that got tokenized using empty-words, because unlinked empty-words are counted.

What we may not have noticed is that entire good linkages sometimes went missing.
For example (version 5.1.13):

linkparser> As it was commanded, so it shall be done
No complete linkages found.
Found 26 linkages (10 had no P.P. violations) at null count 1
[...]
LEFT-WALL as.e it was.v-d commanded.v-d , [so] it shall.v be.v done.a 
[...]
LEFT-WALL as.e it was.v-d commanded.v-d , so [it] shall.v be.v done.a 

Note that the following doesn't appear:
LEFT-WALL [as] it was.v-d commanded.v-d , so it shall.v be.v done.a

However, it appears if the first word is not capitalized, as in:
as it was commanded, so it shall be done

This happens because As got separated by the tokenizer's unit-separation into "A s", in addition to "As" and "as" (i.e. 3 alternatives for the word "As"). This required issuing an empty-word, because the alternatives were imbalanced. However, in order for "as" to be a null-word, the empty-word needs to be a null-word too (since the ZZZ distance is set to 1 it cannot be linked to the LEFT-WALL, and even the ZZZ+ at the LEFT-WALL is a relatively recent addition), and this increases the null count to 2, so this linkage is not generated, since we are at null count 1.

This can maybe be fixed (I have not validated that) by removing the ZZZ distance=1 restriction and allowing empty-words to chain to each other, most probably at some loss of speed.

The new optional-word device that replaces the empty-word device has a similar problem, for which I don't have a good fix. For example (as of version 5.3.15), using a specially crafted bogus sentence:
*let's him do it (which needs an optional word due to the imbalanced alternatives let's and let 's).

linkparser> let's him do it
No complete linkages found.
[...]
LEFT-WALL let's [him] do.v it 

However, in version 5.3.13 (using the empty-word device) we get an additional linkage:

linkparser>  let's him do it
No complete linkages found.
[...]
LEFT-WALL let's [him] do.v it 
[...]
LEFT-WALL let.v-d ['s] him do.v it 

(Let's ignore the issue of whether this extra alternative is needed, or whether this specific extra linkage is helpful, because this is a demo of the problem [but I think both are generally good ideas]).

The reason for the problem when using the optional-word device

[EDITED]
In the example above, the 's token is in the slot of the optional word, and thus is always eliminated.
In addition, since optional-word slots are never considered as null-words, an additional null-word results
(so the total null-words == 1). In the case of this example, this linkage is generated: LEFT-WALL let.v-d {} him do.v it [] , i.e. the RIGHT-WALL is a null-word.
(The resulting linkage is then rejected by sane_linkage_morphism() because there is no such path in the word-graph.)

Possible solution:

[EDITED]
I was able to introduce a small fix that preserves such linkages in simple cases and some more complex ones (like the combined sentence As it was commanded, so it shall be done, let's him do it). It works by producing some more potential linkages, and not removing some null-words in remove_empty_words().

However:

  • This fix is the result of an intuition that it can solve such problems, but I don't have a proof that it fixes all cases, and I suspect it doesn't (but I guess that if such cases are found, a better fix can be found too).
  • I have not yet extended this fix to also cover islands-ok=1.
  • I am still thinking of more complex examples to test.
    (I can send a PR.)

EDIT: I have made more extensive checks by now, and the fix, though somewhat strange, seems to work on the current corpus files.

As far as I can see, this problem doesn't happen with the current ru and he dicts, and maybe not with any other existing dicts (except en), but it seems disturbing to me, and to eliminate it in a provable way (if unlimited ZZZ distance and chainable empty-words indeed fix it) we may need to revert to the empty-word device.

Also, no such problem exists in the SAT-parser, as it doesn't support parsing with null-count>0 at all. However, I'm working on extending it to null-count>0, and it is not expected to have this problem (since its handling of optional-words is totally different).

(2017-3-17: Further edited to fix some typos and clarify.)

@ampli
Member Author

ampli commented Feb 25, 2017

For clarification, I edited two sections in my original issue posting (marked with [EDITED]).

@ampli
Member Author

ampli commented Mar 19, 2017

It turns out that eliminating the empty word in favor of "optional words" was not a good idea after all, since it works fine only with null_count==0.

After much thinking and trying, I couldn't make it work reasonably well with null_count>0. The problem is that, in general, I don't know the exact number of real null words when n words are skipped. So far I have only found a way to determine it when n=1, but this restores only some of the missing linkages.

So I propose to declare my "optional word" fix a failed try, and return the empty-word device.
I will send a PR for that.

(BTW, I still think it is possible to eliminate the empty-word and have null-count>0 work fine, maybe using a corresponding change in mk_parse_set().)

@ampli
Member Author

ampli commented Apr 12, 2017

After thinking about this problem for some time, I was able to fix it without reintroducing the empty-word device. However, the fix is now not as simple as before, so I'm not sure it is better than reintroducing empty words.

The idea is that the wordgraph contains the information about which "optional" words can be null-linked and which cannot. Suppose that words lw and rw are not adjacent and le==NULL && re==NULL: word lw+1 can be null-linked to lw only if it is adjacent to it in the wordgraph. It is not adjacent if it belongs to another alternative, and in that case it should not be counted as a null-word.

The main fix is in num_optional_words(). It no longer blindly disregards all the "optional" words.
Instead, it checks the index of the word that follows lw in the wordgraph. If some words are skipped, those words cannot be null-linked. (The rest of the words are assumed to be null-linked regardless of whether they are marked as "optional". This handling may need a change if "optional" words are ever used to introduce "phantom" words.)

This change by itself doesn't solve the problem. The reason is its bad interaction with the memoization: of two identical word ranges with le==NULL && re==NULL, one can be parsable and the other not, depending on the particular disjunct of lw that is used (the disjuncts can come from different alternatives that are mapped to the same word slot index).

My first solution was to use the disjunct address instead of le when le==NULL. With that fix, everything works correctly. However, since there are many disjuncts, many table entries are used and the result is not efficient (do_count() and mk_parse_set() run considerably slower). Some speed is restored by using the corresponding Gword address instead. However, some sentences are still parsed much slower (I'm not sure why). The batch run times can be restored by changing the memoization this way only when an "optional" word is encountered, but this means that there can still be pathological slow cases.

The code also contains minor changes in sane_linkage_morphism() and remove_empty_words() in order to remove only the "optional" words that are not null-linked.

I tested the new code by producing detailed linkages of the batch files and comparing them to those of version 5.3.13 (the last one that uses the empty word device). The results are the same also for null_count>0.

I have not completely cleaned up the new code, but here is the new num_optional_words(), to show some of the added complexity.

sent_wordidx below is a new Gword member that is set to the corresponding 2D array word index.
The new argument ld of num_optional_words() is a new argument of do_count(), and corresponds to le there.

/**
 * Return the 2D-array word index following the gword of the given disjunct.
 */
static int gword_nextword(Sentence sent, Disjunct *d)
{
	Gword *g = (NULL == d) ? sent->wordgraph->next[0] :
	                         d->originating_gword->o_gword;
	return (int)g->next[0]->sent_wordidx;
}

/**
 * Return the number of optional words after ld.
 */
static int num_optional_words(count_context_t *ctxt, int w1, int w2,
                              Disjunct *ld)
{
	int n;
	if (!ctxt->local_sent->word[w1+1].optional) return 0;
	n = MIN(gword_nextword(ctxt->local_sent, ld), w2) - w1 - 1;
	return n;
}

The memoization calls are now like the following:

	t = find_table_pointer(ctxt, lw, rw, NCtoD(lw, le, ld), re, null_count);

NCtoD (I need to change its name and make it an inline function) is very complex, but it essentially uses the Gword address when needed (instead of le) in order not to share memoizations between different cases that have the same parameters when an "optional" word is involved.

#define NCtoD(lw, le, d) \
	((NULL == le && lw+1 < (int)ctxt->local_sent->length && \
	  ctxt->local_sent->word[lw+1].optional) ? \
	 (d ? (Connector *)d->originating_gword->o_gword : NULL) : le)

The new code still doesn't work for islands_ok=1, and I still don't know why (I have not debugged this yet).

My question now is what to do next:
Option A: Return the empty-word device.
Option B: Send a PR for the new code (supposing that I can fix it for islands_ok=1).

@ampli
Member Author

ampli commented Apr 12, 2017

In the post above I said:

I tested the new code by producing detailed linkages of the batch files and comparing them to those of version 5.3.13 (the last one that uses the empty word device). The results are the same also for null_count>0.

Regretfully, the result for this sentence is also the same:

linkparser> As it was commanded, so it shall be done

i.e., the following still doesn't appear:

LEFT-WALL [as] it was.v-d commanded.v-d , so it shall.v be.v done.a
But the reason this time is different.
The tokenization is:

word0: LEFT-WALL
alt0: LEFT-WALL
word1: As
alt0: A {s}
alt1: a
alt2: as
word3: test
alt0: test
word4: RIGHT-WALL
alt0: RIGHT-WALL

When parsed, as (slot 1) is indeed null-linked, but so is s (slot 2). They are counted as 2 nulls, and thus the sentence is not considered parsable with null_count==1. Such problems most probably can be fixed by fully analyzing the wordgraph of each block of null-linked words, but this will introduce some more overhead (only when needed) compared to using the empty-word device.

@linas
Member

linas commented Apr 18, 2017

Well, the core difficulty is that the parser itself does not know what the word-graph is; it is just trying to create attachments between "words" in a linear array. If it were possible to replace the linear array by the word-graph, then the problem would "go away".

To replace the linear array, you would have to replace constructs like sent->word[i] by some kind of pointer into the word-graph, and replace i+1 with some kind of "get next in the wordgraph" function.

Unfortunately, this has at least two very undesirable side-effects: you'd have to change vast quantities of existing code, and the change replaces a very fast array dereference by some slow method to find previous/next in the word-graph.

So, switching to a "pure wordgraph" parser means:

  • performance is a problem, may be a huge problem
  • code that has always been difficult/complex now gets more complex
  • ...are there any theoretical or conceptual advantages?

This last item is one that has me stumped. There is something appealing about replacing the linear array by a generic left-right ordered graph, but is this enough to justify the associated penalties?

@ampli
Member Author

ampli commented Apr 18, 2017

To replace the linear array, you would have to replace constructs like sent->word[i] by some kind of pointer into the word-graph, and replace i+1 with some kind of "get next in the wordgraph" function.

Unfortunately, this has at least two very undesirable side-effects: you'd have to change vast quantities of existing code, and the change replaces a very fast array dereference by some slow method to find previous/next in the word-graph.

It is possible to create, in advance, a data structure with a space complexity of O(n**2) (for n words), which includes, for each word, a linear array of the words to its left and right (and some other needed data). However, with the current type of algorithm this will still create "insane-morphism" unless it is checked in the parsing algo (an overhead).

What I'm trying to do instead is to mimic the empty-word device without empty-words.
The empty-word solution places an empty-word in each slot for which skipping the slot entirely (aside from linking it to the empty-word) may still produce a valid linkage.

If we don't use an empty-word in such a slot, the slot will be "null-linked", creating a null-word (or an island, but we can ignore islands_ok=1 for now). If we parse with null_count==0, we can just disregard such null-linked slots. However, if we would like accurate parsing with minimal null_count we need to find out which null-linked slots are to be counted toward the desired null_count.

The idea that I'm trying to investigate is that slots cannot even be null-linked if the candidate words (to be linked) are from different alternatives. This can be found out when a candidate null-block is encountered (the do_count() code sections in which both connectors are NULL and lw+1 != rw). It is simple and cheap to find out, and it is memoized, so the overhead of this idea is small (my test code runs in the same time as the solution with empty-words).

I found a few examples for which the solutions are not exactly those obtained when actually using empty-words (for example, 2 out of 9 linkages are missing), and I'm now investigating whether this is just an implementation bug or a hole in this idea (of null-link counting). So far all the differences were due to implementation bugs, which I fixed, but I still have some cases to investigate.

In the long run, I think that the SAT-parser can be better. I even found out how to parse with null words (not yet with islands_ok=true), but I still need to try to implement that (a big obstacle is the post-processing). What is better with the SAT-parser is that it is easier to make experiments like allowing cross-links, different link-colors, etc. (because such changes involve small "local" code changes, e.g. to lift the cross-linking limitation one only needs to ifdef-out the section of code that encodes this restriction).

@linas
Member

linas commented Apr 18, 2017

Ah, OK, I rather skimmed through what you had written above, without thinking hard about it. I can't really recommend the best way, without spending a few hours or more, thinking it all through.

@ampli ampli added the bug label Jun 15, 2017
@ampli
Member Author

ampli commented Sep 29, 2017

Since by eliminating the empty-word device I created a problem of inaccurate null counts in some cases of null_count!=0 (there are no such examples in the corpus files, since the number of sentences in them that generate alternatives is relatively small), I devoted time to finding a solution that doesn't require reverting that change (i.e. using the empty-word device again).

On Apr 18 I wrote:

The idea that I'm trying to investigate is that slots cannot even be null-linked if the candidate words (to be linked) are from different alternatives. This can be found out when a candidate null-block is encountered (the do_count() code sections in which both connectors are NULL and lw+1 != rw). It is simple and cheap to find out, and it is memoized, so the overhead of this idea is small (my test code runs in the same time as the solution with empty-words).

I didn't like this solution because it needed:

  1. Adding the disjuncts as arguments to do_count().
  2. Manipulating the connector addresses used in memoization so that NULL values will not be used.
  3. A complex and hard-to-understand num_optional_words() function, which essentially needs to apply "sane_morphism" to the null-word block in order to find out which optional words are indeed null-words.

I finally found another solution that works well. However, it is not clear to me that it is better than using the empty-word device.

This solution mainly consists of small changes in 3 places:

  1. do_count() and mk_parse_set().
  2. sane_linkage_morphism().
  3. remove_empty_words() (now a misnomer, since empty words are no longer used).

The idea is based on the observation that using an inequality in the check for null words is exactly equivalent to the previous use of the empty-word device: the null words that are not counted serve as an exact replacement for empty-words, and sane_linkage_morphism() rejects the bad combinations.

1)

Instead of using

/* If we don't allow islands (a set of words linked together
 * but separate from the rest of the sentence) then the
 * null_count of skipping n words is just n. */
if (null_count == (rw-lw-1) - num_optional_words(ctxt, lw, rw))

I used:

/* The null_count of skipping n words is just n.
 * In case the unparsable range contains optional words, we
 * don't know here how many of them are actually skipped, because
 * they may belong to different alternatives and essentially just
 * be ignored.  Hence the inequality - sane_linkage_morphism()
 * will discard the linkages with extra null words. */
if ((null_count <= unparseable_len) &&
     (null_count >= unparseable_len - nopt_words))

In the case of islands_ok==1, a loop (of at most 2 iterations) over the corresponding code is needed
in order to implement this idea (in both do_count() and mk_parse_set()):

for (int opt = 0; opt <= !!ctxt->local_sent[w].optional; opt++)
{
   null_count += opt;
   EXISTING CODE
}

These changes turn out to create the same number of linkages that using the empty-word device would generate.

2)

I changed sane_linkage_morphism() to count the real null-words (or the number of islands in the case of islands_ok==1) and reject linkages that don't have the null count reported by do_count().
(The code for counting islands also finds which words belong to each island, and an API access for that can be provided if desired.)

3)

The new remove_empty_words() now removes only the optional words that are really optional (and doesn't remove the ones that happen to be real null words).

To sum up:

I think that in order to solve the incorrect null count problem that I introduced by eliminating the empty-word device, one of the following is needed:

  1. Use my new "null word inequality" implementation (already done and tested).
    In that case I will remove the remaining empty-word code that still exists.

  2. Return to using the empty-word device, and also try to fix several related bugs that cause an incorrect null count when using it.

@linas
Member

linas commented Oct 1, 2017

I have very little that I can add to this; I'd have to review the code, and carefully ponder. Either 1 or 2 is OK by me; maybe 1. is more elegant.

When you say this:

using an inequality in the check for null words is exactly equivalent to the previous use of the empty-word device

Hah! I think this is called the "wall or mirror principle": in programming, either you can't do something (wall), or there are many solutions that are mirror-images of one-another, like in a maze where the walls are all mirrors.

Right now, a "null-word inequality" that does not use an array for storage does sound more elegant. But I can't really say.

@ampli
Member Author

ampli commented Oct 3, 2017

I will send it for review after the current PRs are applied (so I can sync my code without conflicts).

ampli added a commit to ampli/link-grammar that referenced this issue Oct 4, 2017
Also move and fix the comments.

Extensive benchmarks don't find that separating the case of adjacent
words when null_count==0 (marked with #if 1) is really faster.
(However, its attached comment adds insight.)

This code still has the same problem of not finding all the cases of a
given null_count in the presence of optional words (issue opencog#527).
@ampli
Member Author

ampli commented Oct 4, 2017

I have sent it (PR #588) for your review.

@ampli
Member Author

ampli commented Oct 7, 2017

I will now try to eliminate the empty-word as far as possible.
In order to eliminate it altogether, I will need to find a way to allow the quotation marks that currently have a ZZZ linkage to have no links at all. The idea is to allow them to be null-words but not to count them as such (by marking them in the Word struct and using that mark in do_count() and sane_linkage_morphism()).

I see more than one problem in doing that, so I'm not sure how easy it is to implement it.

@ampli
Member Author

ampli commented Mar 11, 2018

Status summary of this issue:

  • The bug is fixed by PR Empty-word simulation by null-count range check #588.
  • The empty-word device is still used for quotation marks (and BTW, due to ZZZ distance=1, this doesn't work when the word before the quotation mark is a null-word). The proposed solution (in my post above) is still a WIP (it needs rethinking).
  • The source code still contains old empty-word comments and unused code that I need to clean up.

Hence I leave this issue open.

@ampli
Member Author

ampli commented Sep 22, 2018

Status update:

  • The empty-word device is still used for quotation marks (and BTW, due to ZZZ distance=1, this doesn't work when the word before the quotation mark is a null-word). The proposed solution (in my post above) is still a WIP (it needs rethinking).
  • The source code still contains old empty-word comments and unused code that I need to clean up.
