Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recent 5x multiplication of verb disjuncts #1072

Open
ampli opened this issue Dec 28, 2019 · 15 comments
Open

Recent 5x multiplication of verb disjuncts #1072

ampli opened this issue Dec 28, 2019 · 15 comments

Comments

@ampli
Copy link
Member

ampli commented Dec 28, 2019

From 5.7.0:

linkparser> !!test.v
Token "test.v" matches:
    test.v                       18101  disjuncts

Now (248a613):

linkparser> !!test.v
Token "test.v" matches:
    test.v                       93621  disjuncts

This also seems to cause a noticeable speed regression.

@linas
Copy link
Member

linas commented Dec 30, 2019

Yeah. I'm guessing that this is due to the recent addition of (Wi- & Xc+ & SI*i+ & {Xc+}) to <verb-ico> which is a part of <verb-pl,i> which is used in the definition of VERB_PLI which is of the form <verb-pl,i> & ($1), and $1 is some verb difinition, e.g. for test.v it's

<vc-predict>:
  (<vc-trans> & {{Xc+} & QI+})
  or ({@MV+} & (<embed-verb> or TH+ or RSe+ or Zs- or VC+))
  or ({@MV+} & (<QI+pref> & {MV+}));

So test.v used to have the above, anded with whatever the old <verb-ico> used to be. Now, its the old stuff, and in addition, the above anded with (Wi- & Xc+ & SI*i+ & {Xc+}) .. I think I also added (Xd- & (EI- or S**i-)) which explodes another factor of two.

There's a combinatoric explosion, there's no particularly obvious way to limit it.

That's one reason why machine learning is interesting.

It might be interesting to parse a jizzilion sentences, and see which disjuncts never show up. But how to remove them is unclear: although some disjunct might never show up for one word, it might show up for another similar one, and then what? It's a morass.

@linas
Copy link
Member

linas commented Dec 30, 2019

This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster...

@ampli
Copy link
Member Author

ampli commented Dec 30, 2019

I produced the disjuncts of all the unique dict expressions of words.
There are ~170M disjuncts in this disjunct list.
Many of them don't make sense for me.
For example:

  1. Long connector lists (~3M are >6, there are up to 13).
  2. Repeating sequential connectors in a connector list. Especially strange is @E @E.
  3. Some words have a very large number of disjuncts (were.v-d maybe have the most - ~450K).

Examples:
From the parses of the fixes batch, only one connector list of >5 connectors has been observed, on sentences like:

    +------------------------Xp------------------------+
    +-------------------->WV-------------------->+     |
    +--------------Xp--------------+             |     |
    +----------->WV---------->+    |             |     |
    |         +------Bsw------+    |             |     |
    +--->Wq---+    +---Pg*b---+    +->Wd--+      |     |
    +-ZZZ-+   +-Rw-+SIpx+     |    +ZZZ+  +--Ss--+     |
    |     |   |    |    |     |    |   |  |      |     |
LEFT-WALL " what are.v you doing.v ?   " she asked.v-d .

            LEFT-WALL     2.000  ZZZ+ hWq+ hWV+ Xp+ hWV+ Xp+

There are ~700K jets with 2-3 duplicate sequential connectors, like:

somebody: [21](0.000)  <--> Ss*s @MXs @MXs @M EL
is.v: [7545](0.000) dWV dWV Qw Iq Xd <--> THb O*m SFIs VC
is.v: [11305](0.000) dCV dCV Qw Iq Xd <--> THb O*m SFIs VC
is.v: [15065](0.000) dIV dIV Qw Iq Xd <--> THb O*m SFIs VC
were.v-d: [32485](0.000) dIV dIV dIV Qe <--> VC THb @MV O*m @EBm SFIp
were.v-d: [33528](0.000) dIV dIV B**t dIV Qe <--> @MV VC @EBm SIpx VC
come.v: [130](0.000) @E dIV B*m <--> dVJlh VC VC
come.v: [839](0.000) dIV B*m dWV PP @E <--> VC VC VC
LEFT-WALL: [2629](3.000)  <--> QUc Xp Xp hWd QUd
different.a: [108](0.000) EA EA <--> dAJlc
night.n: [55](0.400)  <--> dSJl @M @M NM
do.v: [6459](0.000) dWV Wi @E @E <--> VC @MV @MV O
do.v: [1327](0.000) Rw <--> I*d N N SIp
be.v: [31228](0.050) dIV dIV Wi @E @E <--> Pa

Maybe some macros are included more than once?

The ones with duplicate sequential multi-connectors (like @MXs @MXs which is common) can be deleted before power_prune() as it will not delete them if there is more than one match.

@ampli
Copy link
Member Author

ampli commented Dec 30, 2019

There's a combinatoric explosion, there's no particularly obvious way to limit it.

Maybe there is a need to make more efforts in pruning. I thought of more complex algos, that may be enabled on long sentences only (because they are relatively time consuming - the processing shortcuts I used cannot be done in them). I also have speedup demos that I still need to convert to production code.

This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster...

Yes.

And pseudo-cross-links will add even more overhead (I know as I already mostly implemented that, and I try now to reduce the overhead). BTW, I called them in the code fxlink (from fake-cross-link), but maybe another term is more appropriate (e.g. pxlink for pseudo-cross-link).

@linas
Copy link
Member

linas commented Jan 2, 2020

The multiple @E @E is just due to sloppiness in the dictionary, the problem being that it's hard/tedious to clean up.

The complexity of "is, was, were`" is not surprising...

The duplicated dCV dCV and likewise is due to sloppiness in <vc-be-no-obj> which has <verb-wall> appearing more than once in a disjunct. I'm going to try to clean it up now ....

linas added a commit that referenced this issue Jan 2, 2020
@linas
Copy link
Member

linas commented Jan 2, 2020

I fixed the duplicated wall connecteors in were.v here: e8f862c

@linas
Copy link
Member

linas commented Jan 2, 2020

The duplicated VC is a bug ... I don't know what to say .. maintaining the English dict by hand is very tedious...

@ampli
Copy link
Member Author

ampli commented Jan 8, 2020

In a WIP I have the following mode of printing expressions (can use a flag).
I did it in a hope that it may help in finding out the duplicate connectors and other suspected connector sequences in disjuncts. If it seems useful I can submit that in a PR (after the disjunct printing PR).

linkparser> !!test.n
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" expressions:
    test.n
<marker-common-entity>: (XXXGIVEN+) or 
<common-const-noun>: ((
    <common-phonetic>: (((
        <noun-modifiers>: (((@A- & {[[@AN-]]}) or [@AN-]0.100 or ([[@AN-]0.100 & @A-] & {[[@AN-]]}) or ())) & (
        <noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))) or (GN+ & (DD- or [()])) or Us- or ({Ds-} & [Wa-]0.050 & ({Mf+} or {NM+})))) or (
        <nn-modifiers>: (((@A- & {[[@AN-]]}) or [@AN-]0.100 or ([[@AN-]0.100 & @A-] & {[[@AN-]]}))) & (({NMa+} & AN+) or ((NM+ or ({[NM+]1.500} & (Ds**x- or 
        <no-det-null>: ([(())]headline)))) & ((
        <noun-rel-s>: (({@M+} & {R+ & Bs+ & {[[@M+]]}} & {@MXs+})) & (
        <noun-main-s>: (((Ss*s+ & 
            <CLAUSE>: {({@hCOd-} & (C- or ((dRJrc- or dRJlc+)))) or ({@hCO-} & Wd-) or [Rn-]}) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os- or 
            <post-nominal-s>: ([{[Bsj+]} & Xd- & (Xc+ or 
                <costly-null>: ([[[[()]]]])) & MX-]0.100) or 
            <costly-null>: ([[[[()]]]]))) or 
        <rel-clause-s>: (({Rw+} & Bsm+)))) or 
        <noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or (YS+ & Ds**x-))))) or ({NMa+} & AN+) or ((NM+ or ({[NM+]1.500} & (Ds**c- or 
    <no-det-null>: ([(())]headline)))) & ((
    <noun-rel-s>: (({@M+} & {R+ & Bs+ & {[[@M+]]}} & {@MXs+})) & (
    <noun-main-s>: (((Ss*s+ & 
        <CLAUSE>: {({@hCOd-} & (C- or ((dRJrc- or dRJlc+)))) or ({@hCO-} & Wd-) or [Rn-]}) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os- or 
        <post-nominal-s>: ([{[Bsj+]} & Xd- & (Xc+ or 
            <costly-null>: ([[[[()]]]])) & MX-]0.100) or 
        <costly-null>: ([[[[()]]]]))) or 
    <rel-clause-s>: (({Rw+} & Bsm+)))) or 
    <noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or (YS+ & Ds**c-)))

The extra parens (also with regular expression printing) seem as an artifact that I need to find out and eliminate.
BTW, this works by using macros as a second type of expression tag and printing them in front of subexpressions (instead of as a cost, as done with dialect components) with an indentation proportional to their nesting level.

@ampli
Copy link
Member Author

ampli commented Jan 8, 2020

In the above, the parts that are before, between and after between macros need to be folded with indentation too so it will be clear what the content of each macro is.

@linas
Copy link
Member

linas commented Jan 18, 2020

Yes, that could be useful. Even more useful would be an inverse-lookup: given a disjunct, where did it come from? I'm often fumbling trying to figure this out, to see if it came from <rel-clause-s> or from somewhere else...

@ampli
Copy link
Member Author

ampli commented Jan 18, 2020

Even more useful would be an inverse-lookup: given a disjunct, where did it come from?

This can be easily implemented... I will add this.

I need good syntax for expression display flags.
For now the syntax is:
!!word for the current expression display output.
!!word/regex and !!word/regex/flags for disjunct display.
But I need a way to specify flags for expression display (e.g. by macros).
I thought of the following way:
!!word/flags for expression display.
!!word/regex/flags for disjunct display.

There is also a problem of displaying tens of thousands of disjuncts. An internal paginator (like in git and gdb) may help.

@ampli
Copy link
Member Author

ampli commented Mar 2, 2021

Most of that and more (the discussion continue at PR #1083) have got implemented by now (PR #1085).
However, an internal paginator for the disjunct list has not been implemented.

I said above regarding of duplicate multi-disjuncts:

The ones with duplicate sequential multi-connectors (like @MXs @MXs which is common) can be deleted before power_prune() as it will not delete them if there is more than one match.

I tried to eliminate the duplicate copies but the speedup was unnoticeable. So I didn't include that in any PR.
A more interesting experiment can be to eliminate them and also eliminate the multiple VCs and the very long disjuncts, and then propagate the result into the original expressions, simplify them and use `!!word/m' to see which parts of the expression got eliminated. (I have WIP that can propagate removed disjuncts into their originating expressions in order to simplify the expressions.)

@ampli
Copy link
Member Author

ampli commented Mar 3, 2021

There is a bug here regarding the link to LEFT-WALL:

          test.v: [71186]0.000= @E- Qa- dWV- B*d- <> VC+ VC+
<verb-pl,i>:
    <verb-ico>: @E- &
        <verb-why>: Qa- &
            <verb-wall>: dWV- & VC+
   &
<vc-predict>:
    <vc-trans>:
        <b-minus>: B*d-
           &
        <mv-coord>: VC+

dWV- should have been a shallow connector. Here B*d- tries to connect to words before LEFT-WALL.

@linas
Copy link
Member

linas commented Mar 22, 2021

very long disjuncts

My general impression is that there should never be any disjuncts longer than about 6 connectors long, in the English dict. I don't really know, and I have never looked at the statistics of disjunct-length versus frequency. If those stats were available, then we could probably cut anything longer than 6 (or whatever the "real world" max is.

@ampli
Copy link
Member Author

ampli commented Mar 22, 2021

and I have never looked at the statistics of disjunct-length versus frequency

I looked and I think I reported elsewhere but I cannot find it for now. In the en corpora batches I found 6 disjuncts only once. In any case power_prune() is now so efficient it removes them in a negligible time.

It may be interesting to remove them and propagate the result back to the expressions, to see which macros got affected and whether any of them are made redundant. I may try to do it when I return to this disjuncts-to-expression feedback code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants