Missing linkages at null-count>0 due to empty-words or optional-word #527
For clarification, I edited two sections in my original issue posting (marked with [EDITED]).
It turns out that elimination of the empty word in favor of "optional words" was not a good idea after all, since it works fine only with null_count==0. After much thinking and trying, I couldn't make it work reasonably well with null_count>0. The problem is that, in general, I don't know the exact number of real null words when n words are skipped. At most, I found a way to find it when n==1, but this restores only some of the missing linkages. So I propose to declare my "optional word" fix a failed try, and restore the empty-word device. (BTW, I still think it is possible to eliminate the empty word and have null_count>0 work fine, maybe using a corresponding change in `mk_parse_set()`.)
After thinking on this problem for some time, I was able to fix it without reintroducing the empty-word device. However, the fix is now not as simple as before, so I'm not sure that it is better than reintroducing empty words.

The idea is that the wordgraph contains the information about which "optional" words can be null-linked and which cannot. Suppose that words

The main fix is in

This change by itself doesn't solve the problem. The reason is its bad interaction with the memoization: two identical word ranges with

My first solution was to use the disjunct address instead of

The code also contains minor changes in

I tested the new code by producing detailed linkages of the batch files and comparing them to those of version 5.3.13 (the last one that uses the empty-word device). The results are the same, also for null_count>0. I have not completely cleaned up the new code, but here is the new
```c
/**
 * Return the 2D-array word index following the gword of the given disjunct.
 */
static int gword_nextword(Sentence sent, Disjunct *d)
{
	Gword *g = (NULL == d) ? sent->wordgraph->next[0] :
	           d->originating_gword->o_gword;
	return (int)g->next[0]->sent_wordidx;
}
```
```c
/**
 * Return the number of optional words after ld.
 */
static int num_optional_words(count_context_t *ctxt, int w1, int w2,
                              Disjunct *ld)
{
	int n;

	if (!ctxt->local_sent->word[w1+1].optional) return 0;
	n = MIN(gword_nextword(ctxt->local_sent, ld), w2) - w1 - 1;
	return n;
}
```

The memoization calls are now like the following:

```c
t = find_table_pointer(ctxt, lw, rw, NCtoD(lw, le, ld), re, null_count);
```

NCtoD (I need to change its name and make it an inline function) is very complex, but it essentially uses the Gword address when needed (instead of `le`):

```c
#define NCtoD(lw, le, d) \
	((NULL == le && lw+1 < (int)ctxt->local_sent->length && \
	  ctxt->local_sent->word[lw+1].optional) ? \
	 (d ? (Connector *)d->originating_gword->o_gword : NULL) : le)
```

The new code still doesn't work for

My question now is what to do next:
In the post above I said:

Regretfully, the result for this sentence is also the same, i.e., the following still doesn't appear:

word0: LEFT-WALL

When parsed,
Well, the core difficulty is that the parser itself does not know what the word-graph is; it is just trying to create attachments between "words" in a linear array. If it were possible to replace the linear array by the word-graph, then the problem would "go away". To replace the linear array, you would have to replace constructs like

Unfortunately, this has at least two very undesirable side-effects: you'd have to change vast quantities of existing code, and the change replaces a very fast array dereference by some slow method to find previous/next in the word-graph. So, switching to a "pure wordgraph" parser means:
This last item is one that has me stumped. There is something appealing about replacing the linear array by a generic left-right ordered graph, but is this enough to justify the associated penalties?
It is possible to create, in advance, a data structure with a space complexity of O(n**2) (for n words), which includes a linear array of the words to the left and right of each word (and some other needed data). However, with the current type of algorithm this will still create "insane morphism" unless it is checked in the parsing algo (an overhead).

What I'm trying to do instead is to mimic the empty-word device without empty words. If we don't use an empty word in such a slot, the slot will be "null-linked", creating a null word (or an island, but we can ignore islands_ok=1 for now). If we parse with null_count==0, we can just disregard such null-linked slots. However, if we would like accurate parsing with minimal null_count, we need to find out which null-linked slots are to be counted toward the desired null_count. The idea that I'm trying to investigate is that slots cannot even be null-linked if the candidate words (to be linked) are from different alternatives. This can be found out when a candidate null-block is found (the

I found a few examples for which the solutions are not exactly those produced when actually using empty words (for example, 2 out of 9 linkages are missing), and I'm now investigating whether this is just an implementation bug or a hole in this idea (of null-link counting). So far all the differences were due to implementation bugs, which I fixed, but I still have some cases to investigate.

In the long run, I think that the SAT-parser can be better. I even found out how to parse with null words (not yet with islands_ok=true), but I still need to try to implement that (a big obstacle is the post-processing). What is better with the SAT-parser is that it is easier to make tests like allowing cross-links, different link colors, etc., because such changes involve "local" small code changes; e.g., to lift the cross-linking limitation one only needs to ifdef-out the section of code that encodes this restriction.
Ah, OK, I rather skimmed through what you had written above, without thinking hard about it. I can't really recommend the best way, without spending a few hours or more, thinking it all through.
Since by eliminating the empty-word device I created a problem of inaccurate null counts in some cases of

On Apr 18 I wrote:
I didn't like this solution because it needed:
I finally found another solution that works well. However, it is not clear to me that it is better than using the empty-word device. This solution mainly consists of small changes in 3 places:
The idea is based on the observation that using an inequality in the check for null words is exactly equivalent to the previous use of the empty-word device: the null words that are not counted serve as an exact replacement of empty-words, and

1) Instead of using:

```c
/* If we don't allow islands (a set of words linked together
 * but separate from the rest of the sentence) then the
 * null_count of skipping n words is just n. */
if (null_count == (rw-lw-1) - num_optional_words(ctxt, lw, rw))
```

I used:

```c
/* The null_count of skipping n words is just n.
 * In case the unparsable range contains optional words, we
 * don't know here how many of them are actually skipped, because
 * they may belong to different alternatives and essentially just
 * be ignored. Hence the inequality - sane_linkage_morphism()
 * will discard the linkages with extra null words. */
if ((null_count <= unparseable_len) &&
    (null_count >= unparseable_len - nopt_words))
```

In the case of

```c
for (int opt = 0; opt <= !!ctxt->local_sent[w].optional; opt++)
{
	null_count += opt;
	EXISTING CODE
}
```

These changes are supposed to create the same number of linkages that using the empty-word device would generate.

2) I changed

3) The new

To sum up: I think that in order to solve the incorrect null count problem I introduced by eliminating the empty-word device, one of the following is needed:
I have very little that I can add to this; I'd have to review the code, and carefully ponder. Either 1 or 2 is OK by me; maybe 1. is more elegant. When you say this:
Hah! I think this is called the "wall or mirror principle": in programming, either you can't do something (wall), or there are many solutions that are mirror-images of one another, like in a maze where the walls are all mirrors. Right now, a "null-word inequality" that does not use an array for storage does sound more elegant. But I can't really say.
I will send it for review after the current PRs are applied (so I can sync my code with no conflicts).
Also move and fix the comments. Extensive benchmarks don't find that separating the case of adjacent words when null_count==0 (marked with #if 1) is really faster. (However, its attached comment adds insight.) This code still has the same problem of not finding all the cases of a given null_count in the presence of optional words (issue opencog#527).
I have sent it (PR #588) for your review.
I will try now to eliminate the empty-word as far as possible. I see more than one problem in doing that, so I'm not sure how easy it is to implement it.
Status summary of this issue:
Hence I leave this issue open.
Status update:
Until and including 5.3.13, the library used the empty-word device.
We already know that the linkage null counts (when >0), for sentences that got tokenized using empty-words, may often be bogus due to counting unlinked empty-words.
What we may not have noticed is that entire good linkages sometimes went missing.
For example (version 5.3.13):
Note that the following doesn't appear:
LEFT-WALL [as] it was.v-d commanded.v-d , so it shall.v be.v done.a
However, it appears if the first one is not capitalized, as in:
as it was commanded, so it shall be done
This happens because `As` got separated by the tokenizer unit-separation to "A s", in addition to "As" and "as" (i.e., 3 alternatives to the word "As"). This required issuing an empty-word because the alternatives were imbalanced. However, in order that "as" will be a null-word, the empty-word needs to be a null-word too (since the ZZZ distance is set to 1 it cannot be linked to the LEFT-WALL, and even the ZZZ+ at the LEFT-WALL is a relatively recent addition), and this increases the null count to 2, so this linkage is not generated, since we are at null count 1.

This can maybe be fixed (I have not validated that) by removing the ZZZ distance=1 restriction and allowing empty-words to chain to each other, most probably at some loss of speed.
The new optional-word device that replaces the empty-word device has a similar problem, for which I don't have a good fix. For example (as of version 5.3.15), using a specially crafted bogus sentence:
*let's him do it
(that needs an optional word due to the imbalanced alternatives `let's` and `let 's`).

However, in version 5.3.13 (using the empty-word device) we get an additional linkage:
(Let's ignore the issue of whether this extra alternative is needed, or whether this specific extra linkage is helpful, because this is a demo of the problem [but I think both are generally good ideas]).
The reason for the problem when using the optional-word device
[EDITED]
In the example above, the `'s` token is in the slot of the optional word, and thus is always eliminated. In addition, since optional-word slots are never counted as null-words, an additional null-word results (so the total null-word count == 1). In the case of this example, this linkage is generated:

LEFT-WALL let.v-d {} him do.v it []

i.e., the RIGHT-WALL is a null-word. (The resulting linkage is then rejected by `sane_linkage_morphism` because there is no such path in the word-graph.)

Possible solution:
[EDITED]
I was able to introduce a small fix that preserves such linkages in simple cases and some more complex ones (like the combined sentence "As it was commanded, so it shall be done, let's him do it"). It works by producing some more potential linkages, and not removing some null-words in `remove_empty_words()`.

However:

islands_ok=1

(I can send a PR.)
EDIT: I made more extensive checks by now, and the fix, though somewhat strange, seems to work on the current corpus files.
As far as I can see, this problem doesn't happen with the current `ru` and `he` dicts, and maybe not with any other existing dicts (but `en`), but it seems disturbing to me, and to eliminate it in a provable way (if unlimited ZZZ distance and chainable empty-words fix it) we may need to revert to the empty-word device.

Also, no such problem exists in the SAT-parser, as it doesn't support parsing with null-count>0 at all. However, I'm working to extend it to null-count>0, and it is not expected to have this problem (since its handling of optional words is totally different).
(2017-3-17: Further edited to fix some typos and clarify.)