Add !!word functionality #1083

Merged
merged 17 commits into from
Jan 21, 2020

@ampli
Member

ampli commented Jan 19, 2020

This PR makes extensive changes to help with debugging dictionary expressions.
This is important for the upcoming changes, which may need such debugging.

The added functionality is documented in !help !.
Also, I added the following in command-line.c:

  • A short description in the command table (displayed as first line of !help !).
  • An additional line in !help.
    (Changes may be needed to make it clearer.)

I looked at the macro annotation of test.n and test.v, and several macros appear more than once.
It may be intentional, but it looks suspicious.
The disjunct list also has many entries that seem strange to me (especially those with duplicate disjuncts).
Note also some very strange high costs. This happens due to the "strange" algo used for cost cutoff. I checked in 5.0.8 and the generated disjunct lists are the same (for the same expressions) for a given !cost-max (i.e. the same high costs). If such disjuncts are not useful, they just slow down the parsing.

Main changes:

  • Disjunct display with optional regex filtering. It depends on !cost-max.
  • An option for macro annotation in expression display.
  • An option for low level expression display, including macro and dialect indications.
  • Expressions are displayed after applying dialect info (and thus affected by !dialect).
  • The expression stringifying code has been rewritten. Now extra (redundant) parens at the same level are always shown. I chose to still show macros inside parens (which are an artifact of the expression construction, because macros must be wrapped in a unary AND).
  • Help text for !.
  • !<macro> clean display.

Forced-push to update the help text.

@ampli
Member Author

ampli commented Jan 19, 2020

Here is more info on the disjunct display (that is not yet documented in the help file).

linkparser> !!test.n//
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" disjuncts:
    test.n 4273/4501 disjuncts

          test.n: [0]2.000= @AN @A <--> AN
          test.n: [1]2.000= @AN @A <--> AN NMa
...

The displayed disjuncts (4273 of them) are those left after duplicate elimination.
The number before duplicate elimination is 4501. It is less than 8509 due to the cost cutoff.
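
The relationship between the three counts can be sketched as follows. This is only an illustration in Python, not the library's code: the `(cost, connectors)` tuples, the function name, the cutoff value, and the rule that two disjuncts count as duplicates when their connector strings are equal are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (cost, connector string).
Disjunct = Tuple[float, str]


def cutoff_then_dedupe(disjuncts: List[Disjunct],
                       cost_max: float) -> Tuple[int, List[Disjunct]]:
    """Return (count after the cost cutoff, list after duplicate elimination)."""
    # Step 1: drop everything above the cost cutoff.
    within_cost = [d for d in disjuncts if d[0] <= cost_max]

    # Step 2: keep the first disjunct seen for each connector string
    # (assumed duplicate criterion, for illustration only).
    seen = set()
    unique = []
    for cost, connectors in within_cost:
        if connectors not in seen:
            seen.add(connectors)
            unique.append((cost, connectors))
    return len(within_cost), unique


sample = [
    (2.0, "@AN @A <--> AN"),
    (2.0, "@AN @A <--> AN NMa"),
    (2.0, "@AN @A <--> AN"),      # duplicate of the first entry
    (5.0, "@AN @AN @A <--> AN"),  # above the cutoff used below
]
before, unique = cutoff_then_dedupe(sample, cost_max=2.7)
# Same "after/before" shape as the "4273/4501 disjuncts" line.
print(f"{len(unique)}/{before} disjuncts")
```

In this toy data, 4 disjuncts are generated, 3 survive the cutoff, and 2 remain after duplicate elimination, mirroring the 8509, 4501 and 4273 figures.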

Flags may come after the last / but there are no useful ones for now.
It is possible to add a flag to display without duplicate elimination but I didn't find it useful.
It is also possible to add a flag to sort according to cost, or to always sort according to cost.
It is possible (and simple) to add a paginator in link-parser (that can serve for other things too).

I can add a flag to show the source macro for each connector. However, I don't know in which format to display that. One possibility is to use the rest of the line (after the connectors). This will cause the lines to be very long and to fold, but it will make filtering by /regex/ easier.

For expression display, there are 2 flags (documented in the help files). I can add more, but I don't know how useful that would be. For example:

  1. Marking shallow and deep connectors. In case a word has several links, the link from a shallow connector is longer than the ones from deep connectors.
  2. Simplify the expressions.

@ampli
Member Author

ampli commented Jan 19, 2020

I said above:

It is possible (and simple) to add a paginator in link-parser (that can serve for other things too).

The code for wordgraph display can be reused in link-parser to invoke a pager.

BTW, I now think it could be a design error to invoke the wordgraph display process from within the library. Instead, the API call could just return a string in the DOT language and leave it for the UI (link-parser in that case) to invoke the display process.
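
A rough sketch of that split, in Python for brevity: a library-side function only builds the DOT text, and the caller decides how to show it. The function name and the edge-list input are hypothetical and do not correspond to any existing link-grammar API.

```python
from typing import Iterable, Tuple


def wordgraph_to_dot(edges: Iterable[Tuple[str, str]]) -> str:
    """Return a DOT-language description of the graph.

    Nothing is displayed here; that is left to the caller.
    Node names are assumed not to contain double quotes.
    """
    lines = ["digraph G {"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# The UI, not the library, decides what to do with the string:
# print it, save it, or pipe it to a viewer or a pager.
dot_text = wordgraph_to_dot([("LEFT-WALL", "this"), ("this", "is")])
print(dot_text)
```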

@ampli
Member Author

ampli commented Jan 19, 2020

Forced-push with 2 cleanups.
However, I noticed a strange regression in the non-debug compilation: the dialect tags are not printed in the expressions. I still cannot understand how this can happen (in debug mode all is fine).
So you may want to hold off on applying this PR (but you can apply it nevertheless and I will submit a fix when I find the problem).

@ampli
Member Author

ampli commented Jan 19, 2020

The created expressions are different in DEBUG and non-DEBUG modes, so something bad is happening.
I am investigating that.
Please don't apply this PR.

@linas
Member

linas commented Jan 19, 2020

ok

@ampli
Member Author

ampli commented Jan 20, 2020

I fixed the problem (introduced in PR #1079) in commit "make_expression(): Fix comparing to the wrong Exp field".
I also introduced several improvements, including a full description in !help ! (please check if it can be made clearer).

BTW, in the forced-push commit display, the commits are not in the same order as in my git branch. I hope this doesn't have bad implications.

This PR can be applied now.

@ampli
Member Author

ampli commented Jan 20, 2020

For these I still need your input:

I can add a flag to show the source macro for each connector. However, I don't know in which format to display that. One possibility is to use the rest of the line (after the connectors). This will cause the lines to be very long and to fold, but it will make filtering by /regex/ easier.

For expression display, there are 2 flags (documented in the help files). I can add more, but I don't know how useful that would be. For example:

  1. Marking shallow and deep connectors. In case a word has several links, the link from a shallow connector is longer than the ones from deep connectors.
  2. Simplify the expressions.

@linas linas merged commit 9da1368 into opencog:master Jan 21, 2020
@linas
Member

linas commented Jan 21, 2020

So, !!test.n// shows a numbered list, but what is the numbering? There's no apparent sort order.

@linas
Member

linas commented Jan 21, 2020

Marking shallow and deep connectors.

What would that marking look like? Isn't shallow/deep already more-or-less explicit with !!test.n// ?

@linas
Member

linas commented Jan 21, 2020

I can add a flag to show the source macro for each connector.

Yes, this would be nice. One could show the deepest macro only, or show the whole macro chain. So, for example,

linkparser> this is a big test
Found 8 linkages (8 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-0.10 LEN=9)

                   +-----Ost-----+
    +----->WV----->+  +---Ds**x--+
    +-->Wd---+-Ss*b+  +PHc+---A--+
    |        |     |  |   |      |
LEFT-WALL this.p is.v a big.a test.n

Let's pretend this is wrong. Why is it wrong?

linkparser> !dis
Display of disjuncts used turned on.
linkparser> this is a big test
Found 8 linkages (8 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-0.10 LEN=9)

                   +-----Ost-----+
    +----->WV----->+  +---Ds**x--+
    +-->Wd---+-Ss*b+  +PHc+---A--+
    |        |     |  |   |      |
LEFT-WALL this.p is.v a big.a test.n

            LEFT-WALL     0.000  hWd+ hWV+ RW+
               this.p     0.000  Wd- Ss*b+
                 is.v     0.000  Ss- dWV- O*t+
                    a     0.000  PHc+ Ds**x+
                big.a    -0.100  PHc- A+
               test.n     0.000  @A- Ds**x- Os-
           RIGHT-WALL     0.000  RW-

Gee, I think that test.n 0.000 @A- Ds**x- Os- is wrong, or strange, I'm not sure. So I want to know: where did @A- Ds**x- Os- come from? Let's try

!!test.n/@A- Ds**x- Os-/m

Well, that doesn't work because I cannot cut-n-paste from the disjunct display to the regex search ... But let's pretend that this worked ... ideally, I would see something similar to this:

<common-const-noun>: (
    <common-phonetic>: (
        <noun-modifiers>: @A-) &
     <nn-modifiers>: Ds**x  &
     <noun-main-s>: (
            <CLAUSE>: Os- ))

I used both indentation and parentheses above, but I think only one or the other is needed (the indentation is the same as the open-paren count). It's probably easier to read without the parens. With the above printout, I can immediately jump to the correct locations in the dictionary, and study how or why they might be right/wrong.
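
The indentation-only variant of that printout could be produced along these lines. This is a minimal Python sketch, assuming the macro chain is already available as a nested `(name, children)` structure; that structure and the function name are hypothetical, and the data is copied from the example above.

```python
from typing import List


def format_macro_chain(node, depth: int = 0, indent: str = "    ") -> List[str]:
    """Render a (name, children) tree using indentation instead of parens.

    A child is either a connector string or another (name, children) tuple.
    """
    name, children = node
    pad = indent * depth
    # A macro that holds only connectors fits on one line.
    if all(isinstance(child, str) for child in children):
        return [f"{pad}{name}: {' & '.join(children)}"]
    lines = [f"{pad}{name}:"]
    for child in children:
        if isinstance(child, str):
            lines.append(f"{pad}{indent}{child}")
        else:
            lines.extend(format_macro_chain(child, depth + 1, indent))
    return lines


chain = ("<common-const-noun>", [
    ("<common-phonetic>", [("<noun-modifiers>", ["@A-"])]),
    ("<nn-modifiers>", ["Ds**x"]),
    ("<noun-main-s>", [("<CLAUSE>", ["Os-"])]),
])
print("\n".join(format_macro_chain(chain)))
```

The indentation depth carries the same information as the open-paren count, so the parentheses can be dropped without losing the nesting.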

The biggest problem with this suggestion is that !!test.n/@A- Ds**x- Os-/m is not a regex. And I would hate to have to insert escape-backslashes to turn it into a valid regex. So maybe it should be ... !!test.n{@A- Ds**x- Os-}m or !!test.n[@A- Ds**x- Os-]m or !!test.n#@A- Ds**x- Os-#m or !!test.n;@A- Ds**x- Os-;m .. I dunno. There is also the question: how can I search for !!test.n/@A- Ds**x- (wildcard)/m ? Maybe that is not important, maybe the existing !!test.n/@A Ds .*<---/m is enough? Except it doesn't work:

linkparser> !!test.n/@A Ds .* <-/m
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" disjuncts:
    test.n 4273/4501 disjuncts

(0 disjuncts matched)

so I'm confused about that (let's pretend I'm a naive user and never really learned how to use a regex, except for very simple ones...)

linkparser> !!test.n/@A Ds/m
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" disjuncts:
    test.n 4273/4501 disjuncts

(0 disjuncts matched)

linkparser> !!test.n/A Ds/m
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" disjuncts:
    test.n 4273/4501 disjuncts

(0 disjuncts matched)

oh wait, is deep-shallow ordering reversed?

linkparser> !!test.n/D @A/
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" disjuncts:
    test.n 4273/4501 disjuncts
          test.n: [3500]2.000= DD @AN @A <--> GN
          test.n: [3513]0.000= DD @A <--> GN
          test.n: [3526]0.100= DD @AN <--> GN
          test.n: [3539]3.100= DD @AN @A @AN <--> GN
          test.n: [3552]1.100= DD @A @AN <--> GN

(5 disjuncts matched)

linkparser> !!test.n/Ds**x @A/
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" disjuncts:
    test.n 4273/4501 disjuncts

(0 disjuncts matched)

Argh! regex is confusing.

BTW, I think the typical use case will be to search for one disjunct only. I think that wild-card searches will be uncommon, and I cannot imagine using regex, even if it worked well... (but it's late at night and I'm sleepy so my imagination is impaired).

@ampli
Member Author

ampli commented Jan 21, 2020

So, !!test.n// shows a numbered list, but what is the numbering? There's no apparent sort order.

This is the order as produced by build_disjunct(). The numbering is just sequential.
Flags can be added to sort it according to:

  1. Cost.
  2. LHS connectors and RHS connectors (with a selection which is the main sort key).
  3. Number of connectors.

When you filter by regex, change max_cost, or apply a dialect, the numbering is redone.
Edit:
Code could be added to preserve the numbering when filtering by regex (but not in the other cases). E.g. if you see disjunct number 1234, it will always be the exact same disjunct regardless of the regex filter you use (this may ease debugging disjuncts).
No need - this is already done by default.
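
The idea of numbers that survive filtering can be sketched like this: number the list first, then filter. This is a Python illustration with made-up sample data and a hypothetical function name, not the code in the PR.

```python
import re
from typing import List, Tuple


def filter_numbered(disjuncts: List[str], pattern: str) -> List[Tuple[int, str]]:
    """Number first, filter second, so each number stays tied to its disjunct."""
    regex = re.compile(pattern)
    return [(n, d) for n, d in enumerate(disjuncts) if regex.search(d)]


# Hypothetical, already cost-filtered and de-duplicated list.
sample = [
    "@AN @A <--> AN",
    "@AN @A <--> AN NMa",
    "DD @AN @A <--> GN",
    "DD @A <--> GN",
]
for number, disjunct in filter_numbered(sample, r"DD @A"):
    print(f"[{number}] {disjunct}")
```

Changing the cost cutoff or the dialect changes the list itself, so in those cases the numbers are naturally redone.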

@ampli
Member Author

ampli commented Jan 21, 2020

What would that marking look like? Isn't shallow/deep already more-or-less explicit with !!test.n//

In !!test.n//, the leftmost connector on each jet is the shallow one.
However, in the expression display it is usually unclear which ones are shallow or deep. Many of the connectors can be both shallow and deep in the generated disjuncts, but some of them are only shallow or only deep. If you know that a connector is only shallow, then when there are several connectors on a jet, it must connect to the farthest position (the deeper one must connect to the nearest position).
However:

  1. I still don't know if this is useful when authoring a dict.
  2. I hoped that by using this info I would be able to make expression_prune() faster, because it would be able to drop connectors without even trying to match them, even if they would match otherwise (as done in power_prune()), and then power_prune() would have less work to do. But for some reason I didn't get a net speedup (marking takes time too).
  3. My current demo code only detects shallow connectors and not the deepest ones. Of course it can be extended. It also marks as shallow all the connectors that may be shallow on at least one disjunct. This is not useful. It would be most useful to have these marks instead: surely shallow, maybe shallow, surely deepest, maybe deepest.

Example from the demo output (the ones ending with the d mark cannot be shallow, all the rest may be):
(XXXGIVEN+s) or ((((((((@A-s & {[[@AN-s]]}) or [@AN-s]0.100 or ([[@AN-d]0.100 & @A-s] & {[[@AN-s]]}) or ())) & (((({@M+d} & dSJls+s) or ({[@M+s]} & dSJrs-s))) or (GN+s & (DD-s or [()])) or Us-s or ({Ds-d} & [Wa-s]0.050 & ({Mf+s} or {NM+s})))) or ((((@A-s & {[[@AN-s]]}) or [@AN-s]0.100 or ([[@AN-d]0.100 & @A-s] & {[[@AN-s]]}))) & (({NMa+d} & AN+s)

@ampli
Member Author

ampli commented Jan 21, 2020

It is also possible to show the expressions with cost cutoff (same as done with disjuncts), in which case they are more compact. This may be useful because it will show only what is actually used.

@ampli
Member Author

ampli commented Jan 21, 2020

!!test.n/@A- Ds**x- Os-/m

Well, that doesn't work because I cannot cut-n-paste from the disjunct display to the regex search

In any case the m flag is still unsupported for the disjunct list because I didn't know what the desired output is. Now I know, and will try to implement that.

And I would hate to have to insert escape-backslashes to turn it into a valid regex.

Because the default regex engine is PCRE (I guess it is installed on your system; actually PCRE2 if installed), you can precede the regex with \Q in order to make it literal. E.g., this would work:
!!test.n/\QDs**x/
Moreover, you can use both literal mode and regex mode by ending the literal part with \E.
But see also the r flag suggestion below.
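
To illustrate what \Q and \E do, here is a small Python sketch. Python's `re` module does not support \Q...\E, so the helper below (a hypothetical name, not part of link-grammar or of `re`) emulates it by escaping the quoted spans with `re.escape`. It is deliberately naive and does not handle an escaped backslash directly before the Q.

```python
import re


def pcre_quote_to_python(pattern: str) -> str:
    r"""Emulate PCRE's \Q...\E by escaping the quoted spans.

    Simplification: a literal backslash directly before 'Q' is not handled.
    """
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])       # no more quoted spans
            break
        out.append(pattern[pos:start])      # regex part before \Q
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # \Q without \E quotes everything up to the end of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)


line = "test.n: [3487]2.600= dRJrc @hCOd Ds**x @A @AN <--> Ss*s Bs R NM"

# Fully literal search.
print(bool(re.search(pcre_quote_to_python(r"\QDs**x"), line)))

# Literal part followed by a regex part after \E.
print(bool(re.search(pcre_quote_to_python(r"\QDs**x\E @A\b"), line)))
```

Without the quoting, `Ds**x` is not a valid pattern in Python's `re` (it raises `re.error`), which shows why connector names containing `*` need to be quoted before they can be searched for literally.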

because I cannot cut-n-paste from the disjunct display to the regex search

The problem is that the disjunct display doesn't include the connector signs. Possible solutions:

  1. Always add the connector signs, e.g. instead of displaying
    test.n: [3487]2.600= dRJrc @hCOd Ds**x @A @AN <--> Ss*s Bs R NM
    display it as:
    test.n: [3487]2.600= dRJrc- @hCOd- Ds**x- @A- @AN- <--> Ss*s+ Bs+ R+ NM+
  2. Like 1, but when a flag for that is used.
  3. Like 1, but only when the pattern is quoted and includes connector signs (but it looks too crazy).

Using solution 2, one could do (using flag s for "use connector signs"):
!!test.n/@A- Ds**x- Os-/ms
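
As a rough illustration of solution 1, the following Python sketch adds the signs to the connector part of a displayed disjunct: `-` on the left of `<-->` and `+` on the right. The function name is hypothetical, and it assumes the input is only the connector part of the line (what comes after the `=`).

```python
def add_connector_signs(connectors: str) -> str:
    """Append '-' to left-side connectors and '+' to right-side ones."""
    left, _, right = connectors.partition("<-->")
    left_signed = " ".join(c + "-" for c in left.split())
    right_signed = " ".join(c + "+" for c in right.split())
    return f"{left_signed} <--> {right_signed}".strip()


print(add_connector_signs("dRJrc @hCOd Ds**x @A @AN <--> Ss*s Bs R NM"))
```

Note that this does not reorder anything, so it does not address the deep-shallow ordering difference between expressions and the disjunct display.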

oh wait, is deep-shallow ordering reversed?

The deep-shallow ordering of expressions and disjuncts is indeed different... It was always so in the library disjunct and connector-list display code. This creates a problem for cut&paste, unless, for example, the s flag above auto-reverses the order. This is indeed a problematic complication.

BTW, I think the typical use case will be to search for one disjunct only. I think that wild-card searches will be uncommon, and I cannot imagine using regex, even if it worked well...

Here are some useful regex searches:

  1. Show duplicate sequential connectors:
    !!test.v/( @?\w+\b)\1\b/
    (note the subtlety of the need for \b ...)
    You get something like:
...
          test.v: [66112]0.000= dIV B*m dIV I @E <--> VC O VC VC
...
          test.v: [70679]0.000= B*j @E @E VJrpi I @E <--> O
...
  2. Show very long connector lists (6 or more):
    !!test.v/^\s*([\w.-]+):? ([^>]*> )?\S+(:? @?[a-z]?[A-Z]+[a-z*]*){6,}/
    (You will find here lists of up to 9 connectors. In the dictionary there are words with lists of up to 13 connectors.)
  3. Search connector sequences with slight variations:
    !!test.n/ J[sk] D[\w*]+c/
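
The first and last patterns can be tried outside link-parser, for example with Python's `re` module, since they only use constructs that it shares with PCRE. The sketch below uses disjunct lines quoted in this thread, plus one hypothetical line (marked as such) for the `J[sk]` pattern, and shows why the trailing \b matters.

```python
import re

# Pattern 1: two identical connectors in a row.
DUPLICATE_NEIGHBOURS = re.compile(r"( @?\w+\b)\1\b")
# Same pattern without the trailing \b, to show what goes wrong.
NO_TRAILING_BOUNDARY = re.compile(r"( @?\w+\b)\1")
# Pattern 3: connector sequences with slight variations.
VARIATIONS = re.compile(r" J[sk] D[\w*]+c")

with_duplicates = [
    "test.v: [66112]0.000= dIV B*m dIV I @E <--> VC O VC VC",
    "test.v: [70679]0.000= B*j @E @E VJrpi I @E <--> O",
]
no_duplicates = "test.n: [3552]1.100= DD @A @AN <--> GN"

for text in with_duplicates:
    print(DUPLICATE_NEIGHBOURS.search(text).group(0))

# " @A @AN" is not a duplicate; only the trailing \b rules it out.
print(DUPLICATE_NEIGHBOURS.search(no_duplicates))
print(NO_TRAILING_BOUNDARY.search(no_duplicates).group(0))

# Hypothetical disjunct line, made up for this example.
hypothetical = "test.n: [0]0.000= Js Ds**c <--> GN"
print(VARIATIONS.search(hypothetical).group(0))
```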

However, if typical searches are expected to be for literals, a flag r can be added for regex search.

@linas
Member

linas commented Jan 21, 2020

OK, clarifications:

deep-shallow ordering

So, this is a bug: the printout should match the same order as what is in 4.0.dict. I thought I fixed that bug, once, but maybe not. Connector ordering is already confusing; to have to mentally reorder it, some of the time, for some cases .. that's just bad.

regex

I'm fairly certain that 90% of all dictionary debugging flow will work like in the "this is a big test" example. So this is the flow that must be natural, easy-to-use, and obvious. I don't mind using \Q to quote -- this is a good idea. But what I do mind is having to open a browser tab, to search for regex documentation, to read the regex documentation, and then go back to what I was doing. It's a complete waste of time and mental energy. So complex regex patterns are almost useless to me; I will probably never-ever use them. (basically, my brain is already fully occupied trying to solve a linguistic problem; I don't want to also, at the same time, solve a regex problem.)

By contrast, these two I like:

!!test.n/\QDs**x/

and

!!test.n/ J[sk] D[\w*]+c/

and both should be mentioned in !help !

disjunct display doesn't include the connector signs

I think I like solution 1 the best, mostly because it takes the least amount of work/effort (least amount of reading the docs).

@ampli
Member Author

ampli commented Jan 22, 2020

So, this is a bug: the printout should match the same order as what is in 4.0.dict. I thought I fixed that bug, once, but maybe not. Connector ordering is already confusing; to have to mentally reorder it, some of the time, for some cases .. that's just bad.

I will make the printout of disjuncts and connector lists consistent with the expression order. Only debug output is involved - nothing in the official API output.

I will also implement the rest according to your above post.
