Flood-counting #1451

Open · linas opened this issue Feb 24, 2023 · 35 comments

linas commented Feb 24, 2023

One of the "deep" conceptual problems with LG parsing is that there are usually either not enough parses (i.e. null words) or too many (count overflow), and it's hard to strike a middle ground. This particularly affects linkage generation: there is (currently) no way to know the cost of a linkage without actually computing it.

This issue is meant to be a place to daydream about possible solutions to this.

I have only one very vague and not really workable suggestion for right now. During counting, instead of counting 1 for each link, do a weighted count of $2^{-cost}$ per link.

Counting this way replaces a grand total by a weighted count, hinting at how many of the potential linkages are low-cost vs. how many are high-cost. We could even make it smell more like simulated annealing by using $\exp(-\beta \cdot cost)$ instead of $2^{-cost}$, where $\beta$ is a user-supplied "inverse temperature" parameter $\beta = 1/kT$.
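Concretely, a minimal sketch of the weighted recurrence (the names here are illustrative, not from the LG source; this assumes the count becomes a float and the cost of the link being added is at hand):

#include <math.h>

/* Weighted counting sketch: each link contributes exp(-beta * cost)
 * instead of 1, and sub-counts multiply, just as in the ordinary
 * dynamic-programming count. beta = ln 2 (about 0.693) reproduces
 * the 2^(-cost) weighting. */
static double weighted_count(double left_count, double right_count,
                             double link_cost, double beta)
{
    return left_count * right_count * exp(-beta * link_cost);
}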

What is not yet clear to me is how to use this to replace the random sampling of linkages (when there is a count overflow) by a weighted random sampling that would make lower-cost linkages more likely to be sampled.

ampli commented Feb 24, 2023

Can we just ignore the count overflow?
As each linkage is a path in the parse-set/parse-choice graph, maybe we can apply a heuristic algorithm to find the shortest (lowest-cost) paths. The problem is known to be NP-hard, but I understand there are approximation algorithms that are fast and give results close to the optimum.


From Wikipedia:

Various heuristics and approximation algorithms, which quickly yield good solutions, have been devised. These include the Multi-fragment algorithm. Modern methods can find solutions for extremely large problems (millions of cities) within a reasonable time which are with a high probability just 2–3% away from the optimal solution.


(The parse-set graph may also have special properties that enable shortcuts in these algorithms, but I guess it could be very time-consuming to investigate that.)

linas added a commit to linas/link-grammar that referenced this issue Feb 24, 2023
Something like this could almost work, if we had left and right
cumulative costs.  See issue opencog#1451

linas commented Feb 24, 2023

Idea two: issue links in sorted order.

Currently, in extract-links.c, extract_links() calls list_links() with an index. This walks the left and right trees using a modulo numbering scheme, which always explores the left parse-set exhaustively before taking the next step on the right. I thought it could perhaps work in quasi-sorted order, by stepping left or right depending on which one has lower cost. So I tried this:

--- a/link-grammar/parse/extract-links.c
+++ b/link-grammar/parse/extract-links.c
@@ -838,8 +845,16 @@ static void list_links(Linkage lkg, Parse_set * set, int index)
        }
        assert(pc != NULL, "walked off the end in list_links");
        issue_links_for_choice(lkg, pc, set);
+
+float lcost = pc->set[0]->first->md->cost;
+float rcost = pc->set[1]->first->md->cost;
+if (lcost < rcost) {
        list_links(lkg, pc->set[0], index % pc->set[0]->count);
        list_links(lkg, pc->set[1], index / pc->set[0]->count);
+} else {
+       list_links(lkg, pc->set[0], index / pc->set[1]->count);
+       list_links(lkg, pc->set[1], index % pc->set[1]->count);
+}
 }
 
 static void list_random_links(Linkage lkg, unsigned int *rand_state,

This is not quite enough to make it work: the linked lists at pc->first would need to be kept in cost-sorted order, and Parse_set_struct does not hold a cost. I did experiment with sorting based on the md disjunct cost, but that was not enough. The full code of what I tried is here: https://github.com/linas/link-grammar/tree/sorting-experiment
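(For reference, the modulo/divide device mentioned above decomposes a single linkage index into a left index and a right index, like the digits of a mixed-radix number. A toy illustration, not actual LG code:

/* Split an index in [0, left_count * right_count) into a pair, so
 * that the left parse-set is exhausted before the right one advances
 * by one step. */
static void split_index(int index, int left_count,
                        int *left_idx, int *right_idx)
{
    *left_idx  = index % left_count;   /* cycles fastest */
    *right_idx = index / left_count;   /* advances once per left cycle */
}

The sorting experiment above tries to swap which side gets the fast-cycling role, based on cost.)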

linas commented Feb 24, 2023

I have a very dumb question; excuse me if it is incredibly dumb: why are we counting, anyway? Is there some kind of info being recorded during counting? Where? If not, why can't we just jump straight into extraction? I feel like there is something obvious that I should know but have forgotten. It's embarrassing.

ampli commented Feb 24, 2023

Here is how I see it:
In the counting stage, there are many dead ends in which one of the multiplication terms is 0 and the other is not, but the result is kept in Table_tracon to save recomputation in case it is needed again. So the table is big.
In the extraction stage, all the pair results are already known, and only those with both terms non-zero are used, resulting in a smaller data table.

BTW, it seems possible to just point the parse-set elements at the corresponding Table_tracon elements, and save memory. However, I don't know whether that would thrash the CPU cache (I would have to try it).

linas commented Feb 24, 2023

I made some progress. The code in the sorting-experiment branch now gives an almost-but-not-quite sorted sequence of linkages.

It is not obvious to me whether this can be improved on (while keeping the general idea implemented there). It's not clear that I've handled totcost correctly, or whether it should be a totmincost or some other function of the left and right branches. I'm done for tonight. @ampli, you seem to understand this code better than I do; can you see what it is that I am trying to do here, and can you see any way of fixing it to provide a true total order?

Here are two clues as to what is going wrong:

  • It seems as if the even-numbered and odd-numbered linkages got swapped. The costs alternate up/down with each linkage. (I'm looking at the duuude i= prints, which are printed just after the linkages are completely enumerated, but before they are sorted.)
  • The order in which the linkages are enumerated seems to (almost) not depend on the inequality chosen at line 895 -- either greater-than or less-than gives almost the same results. Which seems surprising.

linas commented Feb 24, 2023

Hmm. My "almost works" code seems to almost not work, depending on the sentence. So for example:

this is an exceptionally long wired test that perhaps might have more linkages than you might think

generates linkages in "almost sorted order". However,

Let's try some other kind of long strange sentence that might be something or maybe it can be something else

does not do very well, giving almost randomized results, i.e. almost not sorted at all. So, clearly, I've been fooling myself about what I thought I was doing here. What am I doing wrong? Is this fixable?

ampli commented Feb 24, 2023

So, clearly, I've been fooling myself about what I thought I was doing here.

I think so.

What am I doing wrong?

A local sorting of disjunct costs cannot find the linkages with the lowest total cost: the nested parse-sets (inside a parse-choice) lead to other parse-choice lists that contain arbitrary, unknown costs, so local cost has zero predictive power over the costs that will be found in the rest of the graph.

This is similar to the Travelling Salesman Problem, in which following the shortest link to the next node doesn't, in general, lead to the shortest overall path.

Is this fixable?

I think it is not fixable the way you did it; it has to be done using one of the shortest-path-finding algorithms.

ampli commented Feb 24, 2023

BTW, I forgot to add that I looked at your sorting-experiment changes.

ampli commented Feb 24, 2023

You once added a comment in extract-links.c on getting the lowest-cost parse, with a partial implementation under #ifdef FINISH_THIS_IDEA_MAYBE_LATER.

/**
* Sort the match-list into ascending disjunct cost. The goal here
* is to issue the lowest-cost disjuncts first, so that the parse
* set ends up quasi-sorted. This is not enough to get us a totally
* sorted parse set, but it does guarantee that at least the very
* first parse really will be the lowest cost.
*/

I finished its implementation 4 years ago in commit c1b4d78. The code is enabled by -test=sort-match-list. It doesn't seem to do anything useful! I think this is for the same reason. So you may want to revisit this comment.

linas commented Feb 25, 2023

sort-match-list

Hmm. In that case, the code for sort-match-list, and the comment, should be removed. Any objections to removing it?

ampli commented Feb 25, 2023

It can indeed be removed.

linas commented Feb 25, 2023

A local sorting of disjunct costs cannot find the linkages with the lowest total cost: the nested parse-sets (inside a parse-choice) lead to other parse-choice lists that contain arbitrary, unknown costs, so local cost has zero predictive power over the costs that will be found in the rest of the graph.

I'm still struggling to understand the parse set. To get started, I made some measurements, like so:

uint64_t pset_size(Parse_set * pset)
{
   if (pset->first == NULL) return 0;

   /* Count every Parse_choice reachable from this set, with
    * repetition: shared sub-sets are counted once per path
    * that reaches them. */
   uint64_t tot = 0;
   for (Parse_choice *pc = pset->first; pc != NULL; pc = pc->next) {
      tot ++;
      tot += pset_size(pc->set[0]);
      tot += pset_size(pc->set[1]);
   }
   return tot;
}

For the Bainite test sentence, I get a tree size of 1153392 -- about 1.1 million -- while there are 10201832 linkages -- 10M (Pset_bucket_pool size = 1321 and Parse_choice_pool size = 2698).

For the Tagalog test sentence, I get a parse tree size of 1.74e9 and a parse-count overflow. I changed count_t to 64-bit, and it seems there are 2.4e13 linkages (Pset_bucket_pool size= 3483 and Parse_choice_pool size= 13150)

For the Cortes sentence, with 32-bit counts, I get an overflow and a 1.2e10 tree size. With 64-bit counts, Cortes gives

49128286530384 linkages -- 49T = 4.9e13 -- null-count=0
3065210334826656 linkages -- 3065T = 3.0P = 3.0e15 -- null-count=1
85107253873160528 linkages -- 85P = 8.5e16 -- null-count=2
1432754044150117546 linkages -- 1.4E = 1.4e18 -- null-count=3

and I don't know how big the parse tree is -- it was still counting as I wrote this. At least 1e12, if not more.

Obviously, counting such humongous trees does NOT require lots of RAM -- most of the branches in the tree seem to be rearranged copies of one another, pulled out of hash tables. (The RSS of LG during the Cortes counting is 278 MBytes; the Pset_bucket_pool size is 90571 and the Parse_choice_pool size is 874347; added together, these two pools total less than a million entries.)

Thus, even though the count is explosive, the Parse_set tree is much smaller, although it is still huge. But this tree has vast amounts of repetition in it.

Rephrased as a traveling salesman problem, it's as if the salesman has to visit certain houses in certain towns in certain counties in certain states, but most of the towns are identical to one another, most of the counties are identical to one another, and most of the states are identical, just with permutations that skyrocket the number of possibilities. Somehow, this sameness should be exploitable.

(counting code is here)

linas commented Feb 25, 2023

Even without the sameness, it seems as if the branches of the tree should be compactly "describable": for each branch, it should be possible (and quick & easy) to compute and store (during mk_parse_set()):

  • The cost of the least-cost choice in that tree
  • The cost of the most expensive choice in that tree
  • The average of all costs in that tree
  • The weighted average of all costs in that tree (with e.g. some Boltzmann weighting)

Let's assume the above is possible (it might not be; see further below).

If it is possible to compute & store mins and averages, then it seems to imply several statements:

  • It should always be "trivial" to find the least-cost tree. Right? Or am I imagining things again?
  • By knowing the (weighted) average of costs in a tree, we can know the likelihood of finding a low-cost path in that tree, and so searches for low-cost trees should be made mostly in trees that have a low average cost. Right? Or is this another delusion?

Perhaps it is somehow not possible to ("easily") compute and store these? If not, why not?

  • Certainly, it is easy to add a few floats to Parse_choice_struct and Parse_set_struct so storage is not a problem.
  • Every time record_choice() is called, we would have to recompute the mins, maxes, and averages. At first, this seems easy: just walk over the parse-choice chain and recompute them.

But what if a parse-set in a parse-choice was changed at some earlier time? Would that mean the averages stored in the parse_choice are no longer valid? To find out if they are valid, would we have to walk the tree all the way to the leaves? That would be infeasible. Is this where the problem lies?

I'm still very confused. I'm having trouble envisioning the "shape" of that tree.
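To make the storage side concrete, here is a purely illustrative sketch of the summaries such a scheme might keep per parse-set (none of these fields exist in the current structs; all names are invented):

/* Hypothetical per-Parse_set cost summaries, rolled up from the
 * leaves during mk_parse_set(). */
typedef struct
{
    float min_cost;    /* cost of the cheapest linkage in this set    */
    float max_cost;    /* cost of the most expensive linkage          */
    float avg_cost;    /* plain average over all linkages             */
    float wavg_cost;   /* Boltzmann-weighted average, exp(-beta*cost) */
} Cost_summary;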

linas added a commit to linas/link-grammar that referenced this issue Feb 25, 2023
Per comments in opencog#1451 and from direct experiments from last night,
it does not seem to make any difference.

ampli commented Feb 26, 2023

  • By knowing the (weighted) average of costs in a tree, we can know the likelihood of finding a low-cost path in that tree, and so searches for low-cost trees should be made mostly in trees that have a low average cost. Right? Or is this another delusion?

Note that to extract the links, you don't traverse a tree from its start point to a single endpoint. The link-listing algorithm traverses it recursively along many paths, to the end of each one. So I cannot see how your algorithm would work.

But what if a parse-set in a parse-choice was changed at some earlier time? Would that mean the averages stored in the parse_choice are no longer valid? to find out if they are valid, would we have to walk the tree all the way to the leaves? That would be infeasible. Is this where the problem lies?

Instead of doing it in mk_parse_set(), you can scan the x-table after it finishes and handle it element by element, skipping elements that you have already handled.

linas commented Feb 26, 2023

how your algorithm would work.

Here's how to find one low-cost parse. Starting from the top, there are L parse choices on the left, and R to the right. These can be ranked according to lowest-to-highest average cost. Pick one each, on left and right, having the lowest average cost. Then recurse. This should give you some linkage, which should have a low cost, because, at each step, you were picking from a collection having the lowest average.
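In code, the descent might look like this minimal sketch, assuming each Parse_set carried a precomputed avg_cost field (it does not today):

/* Greedy descent: at each level, pick the Parse_choice whose two
 * sub-sets have the lowest combined average cost, then recurse into
 * both of its sub-sets. avg_cost is a hypothetical field. */
static Parse_choice *pick_cheapest(Parse_set *set)
{
    Parse_choice *best = set->first;
    for (Parse_choice *pc = best->next; pc != NULL; pc = pc->next)
    {
        double c = pc->set[0]->avg_cost + pc->set[1]->avg_cost;
        double b = best->set[0]->avg_cost + best->set[1]->avg_cost;
        if (c < b) best = pc;
    }
    return best;
}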

To get the second-cheapest parse, you'd advance with the same modulo calculation as today, but alternate which side you advance first, based on picking which side has the lowest average cost.

This is similar to my earliest experiment, which attempted to maintain a total cost instead of an average cost, and to keep the lists of parse choices in cost-sorted order. For one sample sentence, it seemed to almost work; for another sample sentence, it was disappointing. I still don't really understand why the experiment didn't work. I can't tell if it was a bug in my code, or if the concept is flawed, or what the flaw is.

It did seem to work for a few short sentences, and fail on the longer ones. Maybe I should have tested more short sentences? Draw out, by hand, pen and paper, how the traversal actually works? Perhaps that would expose the issue ...

ampli commented Feb 26, 2023

which should have a low cost, because, at each step, you were picking from a collection having the lowest average.

Are there path fragments that may be included in both high-cost and low-cost parses, due to the costs in the rest of the paths?
If so, what would be the averages stored in their elements?

linas commented Feb 26, 2023

Are there path fragments that

I'm trying to figure this out. My words below may contain errors and misstatements:

Conceptually, the parse-set is laid out as a tree of lists, having, at each level, a list of left and a list of right entries. A linkage is a selection of a pair of entries at each level, one from the left list and one from the right list. That is, a single linkage is a single, true binary tree.

As the above parse-set size measurements indicate, the tree-of-lists is very large. The hashing means that most of the lists are shared. That is, the tree-of-lists is a DAG. I think you are asking "how do I assign a cost when there are several different inequivalent ways of getting to the same place?"

Each left-list and right-list is unique, because Parse_set::first points to a chain of Parse_choice elements, linked by Parse_choice::next, which terminates. The sharing happens at the Parse_choice::set[2] level, where many different Parse_choice::set[0] entries might all point at the same Parse_set.

This still allows costs and averages to be rolled upwards, from the leaves. Each Parse_set remains unique, even if many different Parse_choice::set[0] or Parse_choice::set[1] entries point at it. Therefore, we can store a total cost in Parse_set::tot_cost.

To compute this total, start at Parse_set::first and walk the list Parse_choice::next, totaling Parse_choice::set[0]->tot_cost + Parse_choice::set[1]->tot_cost.

But we wanted an average, not a total, so redo the same walk, counting 1 for each Parse_choice and storing that count in Parse_set::num_choices. Then the average is Parse_set::tot_cost / Parse_set::num_choices.

Now that we know how to walk the tree and where to store totals, other kinds of totals, with weights, can also be stored.
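A minimal sketch of that walk (assuming the hypothetical tot_cost and num_choices fields were added to Parse_set, and that both sub-sets of every choice have already been processed -- the DAG sharing means each Parse_set should be visited only once, bottom-up):

static void roll_up(Parse_set *set)
{
    set->tot_cost = 0.0;
    set->num_choices = 0;
    for (Parse_choice *pc = set->first; pc != NULL; pc = pc->next)
    {
        /* Children are assumed to hold valid totals already. */
        set->tot_cost += pc->set[0]->tot_cost + pc->set[1]->tot_cost;
        set->num_choices++;
    }
    /* The average is then set->tot_cost / set->num_choices. */
}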

If you see any mistakes in the above, let me know. I think this is now sufficiently unambiguous for me to code it up and try it... I might try that... well, not today. Maybe in a few days.

linas commented Feb 26, 2023

I really want to try this, ASAP. But I have a lot of errands I have to run first.

If you try it, then the correct total to keep would be

prtot += exp ( -beta * cost);

where beta = ln 2 ≈ 0.693 (so that exp(-beta * cost) reproduces the original 2^(-cost) weighting) is a good place to start. Based on prior experience with other systems, beta = 2.5, or anywhere in the range of 2 to 5, might work even better.

Then pravg = prtot / nsubtrees;

The list of parse_choices would be kept in sorted order, from most to least pravg; my test branch has code that does this. Then, to walk the linkages, the modulo/divide device would be used on either the left or the right side, depending on which has the higher pravg.
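A sketch of that accumulation over one parse-choice list (prtot and pravg as above; the field names first, next, and md->cost follow the existing structs, the rest is assumed):

#include <math.h>

static double boltzmann_avg(const Parse_set *set, double beta)
{
    double prtot = 0.0;
    int nsubtrees = 0;
    for (const Parse_choice *pc = set->first; pc != NULL; pc = pc->next)
    {
        prtot += exp(-beta * pc->md->cost);  /* weight of this choice */
        nsubtrees++;
    }
    return prtot / nsubtrees;  /* pravg */
}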

linas commented Feb 27, 2023

I added some ascii-art here:

* Each "level" in the parse-set S consists of a linked list of
* Parse_choice elements, denoted by Cₘ in the diagram below (running
* from C₀ thru Cₖ). Each Parse_choice contains pointers to two
* Parse_set elements, denoted below as S0ₘ and S1ₘ. The structure is
* recursive, so that S0ₘ and S1ₘ in turn point to linked lists of
* Parse_choice. This is shown in ASCII-art below.
*
*                  S
*                  |
*    C₀-----------C₁----------C₂--------C₃- ... --Cₖ
*   /  \         /  \        /  \      /  \
*  /    \       /    \      /    \
* S0₀   S1₀   S0₁   S1₁   S0₂   S1₂
*  |     |
*  |     C----C-----C
*  |
*  C---C----C----C
*
* A single linkage is (conceptually) a tree of Parse_choice, selected
* from the Parse_set as follows. Starting from the top S, pick one Cₘ.
* This becomes the linkage root. Under it are S0ₘ and S1ₘ. Pick some
* C (any C) from the list of C's given by S0ₘ. Likewise, pick some C
* from the S1ₘ list. These two become the left and right sides under
* the linkage root. Continue recursively, until leaves are reached.
*

linas commented Feb 27, 2023

Well, I have no self-control, so I tried coding up what I said above. It was difficult, and seems not to work, and I don't understand why: a bug in the code? A bug in the theory? Unfortunately, this needs a sit-down with paper & pencil to debug.

Same branch as before: https://github.com/linas/link-grammar/tree/sorting-experiment

ampli commented Feb 27, 2023

755  * \k num_linkages_found-1 (by extract_links(), see process_linkages()).
...
794  * within the range of 0 to \k set->count-1 in order to extract all the

Just note that \k should remain \c, because it is actually a Doxygen prefix for a code snippet. In vim / gvim (which I assume you use) there is highlighting of Doxygen constructs if you append doxygen to the file type. My file type for C files is set to c.doxygen (cpp.doxygen for C++), so I see the effect of \c.

ampli commented Feb 27, 2023

some ascii-art

The various S0 and S1 can also be shared between the C elements. This is hard to draw.
I think this may happen only at the same level.

ampli commented Feb 27, 2023

If you see any mistakes in the above

In an old analysis I once did, I became convinced that you cannot find the lowest-cost path by assigning costs to parse-set elements. But I don't remember the details, and maybe I made a mistake, since your description is convincing.

I will try to redo this analysis and recall the details.

linas commented Feb 28, 2023

In an old analysis,

In the old analysis, I sorted by the cost of the current disjunct, which might hide some high-cost disjuncts behind it. So indeed, this would not be effective.

To overcome this, in my second experiment, I summed totals, and that seemed to work almost-ish, but I think I was sloppy with my summing (since I was still confused about what I was doing).

In the current form, I'm taking averages over all possibilities, and so high-cost disjuncts hiding behind low-cost ones should become "visible" in the average. However, again, this only "sort-of works". I'll try to figure out why, later.

linas added a commit that referenced this issue Mar 4, 2023

linas commented Mar 4, 2023

A different, (un-)related problem I'm having is that during build_parse_set, I'm seeing calls to make_choice with anywhere from 50GB up to 100GB of RAM getting allocated. It all gets freed; there's no leak; but the usage is voracious, and for now I can't figure out how to make it stop.

ampli commented Mar 4, 2023

What is x_table_size in that case?
How many Pset_bucket and Parse_choice elements are used?
It may also be interesting to know how many Parse_set elements point only to a single Parse_choice element.
Most important: How many of them have an ANY connector?
In addition, what is the size of the counting table, and how many Tracon_table elements are used?

for now, I can't figure out how to make it stop.

What do you mean by "stop"?
It seems it is possible to reduce the data structure size by maybe 40% at the expense of CPU and much code complexity, but I don't think it would help much.

linas commented Mar 4, 2023

I'm collecting data now. Prior to some of the worst cases, I see this:

What is x_table_size in that case?

x_table_size is in the range of 65K to 256K, ...

The average number of parse_choice allocations is about 2 million; the worst case so far is 95 million.

About 6K disjuncts on average. The sentence that had 95 million parse_choice allocations had 22K disjuncts in it.

I don't have the other numbers, yet.

What do you mean by "stop"?

Several choices: 1) predict in advance that there will be an explosive number of parse_choices, and skip creating the parse_set structure entirely (i.e. no linkages, just fail); or 2) trim the number of parse_choice elements, e.g. by discarding those with high cost, even as they are being computed; or 3) something else...

Choice 2 sounds ... interesting....

linas commented Mar 5, 2023

How many Pset_bucket and Parse_choice elements are used?

I created all new graphs, but they are more or less the same as the earlier graphs. I won't repeat them here; they're not that interesting. Results:

  • Tracon pool size varies from 1K to 30M
  • Parse_choice pool size varies from 10K to 400M, and goes as $(\text{num tracon})^{1.3}$

The most interesting new result is the max number of parse choices in a parse-set, for a fixed parse. That is, a fixed parse-set is a linked list of parse choices, given by Parse_set::num_pc. This is surprisingly large, varying from 100 to a little over 7K. It's never less than 100, averaging around 1.5K or so.

linas commented Mar 5, 2023

how many Tracon_table elements are used?

The size of the tracon pool and the number of pset_buckets are very nearly equal, to within 5% or 10%, over three orders of magnitude. The one can be used as an excellent estimator of the size of the other.

ampli commented Mar 6, 2023

Several choices: 1) predict in advance that there will be an explosive number of parse_choices, and skip creating the parse_set structure entirely (i.e. no linkages, just fail).

The number of Parse_choice elements in each parse_set can be calculated in do_count() and kept in Table_tracon.
The parse set can then be built in two steps:

  1. Calling a simpler version of the current mk_parse_set() that builds the Parse_set elements but only counts the Parse_choice elements. If there are too many, it can just return a failure indication.
  2. Then calling mk_parse_set(), which would use the ready table of Parse_set elements and would allocate the whole array of Parse_choice elements at once. This has 3 benefits: A. It's more efficient to allocate an array of elements than to allocate them one by one. B. Eight bytes less per Parse_choice -- no next pointer. C. Most important: the Parse_choice elements of a Parse_set will all be in contiguous memory, which means access by index instead of traversing a linked list (whose elements sit at random, far-away memory locations, each possibly in a different memory page, so that traversing the list clobbers the CPU cache). See the sketch below.

Maybe this will even save CPU, even though a count needs to be added and stored in do_count().
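A sketch of what the flattened layout might look like (all names here are hypothetical; this is the shape of the idea, not working LG code):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Step 2 of the proposal: each set's choices become one contiguous
 * array, indexed directly, instead of a linked list. num_pc comes
 * from the counting pass in step 1. */
typedef struct
{
    struct Parse_choice_struct *choice;  /* array of num_pc elements */
    uint32_t num_pc;                     /* from the counting pass   */
} Parse_set_flat;

static bool alloc_choices(Parse_set_flat *set, uint32_t budget,
                          size_t choice_size)
{
    if (set->num_pc > budget) return false;  /* too many: just fail */
    set->choice = malloc((size_t)set->num_pc * choice_size);
    return set->choice != NULL;
}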

how many Tracon_table elements are used?

The size of the tracon pool and the number of pset_buckets are very nearly equal, to within 5% or 10%, over three orders of magnitude. The one can be used as an excellent estimator of the size of the other.

Since only Table_tracon elements with count > 0 may create a Pset_bucket, I guess that in your MST parses, most of them have a non-zero count. This is contrary to en and other natural-language parses, in which Table_tracon elements rarely have a non-zero count. It also seems that another property of your MST parses is that there are only very few, if any, "disconnected" Table_tracon elements, i.e. ones that happen not to participate in the parse.

Most important: How many of them have an ANY connector?

It is still interesting to know what percentage of Parse_choice elements have one ANY connector, and what percentage have two.

linas commented Mar 8, 2023

The number of pars_choice elements in each parse_set can be calculated in do_count() and kept in Table_tracon.

I like this idea, I think. I did a run with 82K sentences in it. The average size of PSet_bucket_pool was 55K entries; the largest seen was 9M entries.

By comparison, the average size of parse_choice_pool was 1.5M, and the largest seen was 645M. This blew up the resident set size to 74 GB for this run.

Does this idea work if there is a count overflow? I don't know how often I get count overflows.

Since only Table_tracon elements with count > 0 may create a Pset_bucket, I guess that in your MST parses, most of them have a non-zero count.

Yes, exactly, by design. Let me explain the process more clearly. First (hours/days/weeks earlier), a dataset of word pairs is created. These are pairs of words that have been seen near each other, together with how often they've been seen.

You might think that this dataset is huge, but it's not. For a vocabulary of 50K distinct words, you might think that 50K x 50K = 2500M word pairs are possible. In fact, only about 1 in a thousand are actually seen (so maybe 10M or 25M, ballpark). The distribution is Zipfian -- the word "the" pairs with almost all other words; the word "bicycle" only pairs with a few -- "handlebars", "seat", etc. For each word pair, the pointwise mutual information (https://en.wikipedia.org/wiki/Pointwise_mutual_information) is computed. "bicycle seat" has a high MI; "the she" has a negative MI. This is all done days in advance (although I'm working on doing it "on the fly").
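(For reference, the standard definition: the pointwise mutual information of a word pair $(w_1, w_2)$ is $\text{MI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$, so pairs that co-occur more often than chance get a positive MI, and pairs that co-occur less often than chance get a negative one.)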

During MST parsing, disjuncts are created on the fly from these pairs, as a Cartesian product. What this means is this: let's say some word w in a sentence might link to other words, with links A or B or C... The expression that is created is then four copies of (A+ or B+ or C+ ... or ()) and'ed together. So

(A+ or B+ or C+ ... or ()) and (A+ or B+ or C+ ... or ()) and (A+ or B+ or C+ ... or ()) ...

which obviously blows up and allows some huge number of disjuncts. Almost all of these are pruned immediately. But those that are left will connect almost any word to almost any other. So it is still a relatively "free" parse ("free" in the sense of "anything goes", or, mathematically, a "free object", having few constraints). Lots of very different linkages are possible.
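To put illustrative numbers on the blow-up: if a word has 20 candidate link types, then each of the four conjuncts picks one of 21 options (including the empty ()), so the raw Cartesian product allows up to $21^4 = 194481$ disjuncts for that one word, before pruning.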

If you think about this situation a bit, it's not hard to see why the size of the tracon table and the size of pset_buckets might be the same.

I failed to see how often count overflows happen. I think it's constrained enough that they do not happen, but I dunno. I'm confused about this, and didn't measure.

Each disjunct has a cost, which is the sum total of the MI of the individual word pairs. The lowest-cost parse is then a "minimum spanning tree" (MST): it consists of word pairs whose grand-total sum minimizes the cost.

In principle, I could reject high-cost disjuncts from the get-go. This would relieve much of the pressure, I suppose. I haven't explored this yet. It's a parameter-setting problem. Should be easy...

linas commented Mar 8, 2023

ANY connector

In this run, I think there are zero. This is configurable, and I think I configured it off, so as to avoid issues. The goal is to use ANY connectors only if nothing else works, but I don't have a protocol for this yet... I don't know if I should have it on by default, or off by default and retry later if it fails, or what, exactly. There are half-a-dozen parameters in this, half-a-dozen on-off switches; I don't know what the best ones are, and I don't have a plan for how to explore them. It's all very seat-of-the-pants stuff.

BTW, I'm building Docker containers for this, so that you can try it with minimal fuss. The first one is ready. The second one is almost ready. The third one, I haven't started. I... hmm, just realized that I should add an LG demo to the first container. Why didn't I do that before? Dohhh.

I don't know if you know Docker. Basically, you download it, press a few buttons, and bingo, everything works.

part one:
https://github.com/opencog/docker/tree/master/opencog/lang-pairs

part two:
https://github.com/opencog/docker/tree/master/opencog/lang-mst

linas commented Mar 8, 2023

The number of parse_choice elements in each parse_set can be calculated in do_count() and kept in Table_tracon.

I am still hoping that it will be possible to assign min and max cost bounds to Parse_choice elements, so that these can be sorted, and trimmed if the lists get too long. I still don't understand why my naive hacks to do this didn't work.

ampli commented Mar 8, 2023

Does this idea work if there is a count overflow?

I think so.

linas commented Mar 9, 2023

I added a graphviz dump of the parse-set. It's in the same experimental tree as before:
https://github.com/linas/link-grammar/tree/sorting-experiment

I am unable to control the graphviz layout sufficiently; it goes crazy once the graph is bigger than a few nodes. But it seems to be accurate. The file is written to /tmp/parse-choice.dot

I think I finally understand why the sorting code sort-of works, and sort-of doesn't. Given any fixed parse-set, it does correctly walk from lowest to highest cost at that level. The problem is that the next-lowest-cost linkage may be in a parse-set at a higher level, and there's no obvious way to backtrack there first, before exhausting the current level.

This can be seen in the debug printout, where the costs print in a sawtooth-ascending fashion: the costs get higher, until the walk backtracks up a level, and then the cost drops. Then it resumes the climb till the next backtrack.
