Flood-counting #1451
Can we just ignore the count overflow? From Wikipedia: "Various heuristics and approximation algorithms, which quickly yield good solutions, have been devised. These include the Multi-fragment algorithm. Modern methods can find solutions for extremely large problems (millions of cities) within a reasonable time which are with a high probability just 2–3% away from the optimal solution." (The parse-set graph may also have special properties that enable shortcuts in these algos, but I guess it could be very time-consuming to investigate that.)
Something like this could almost work, if we had left and right cumulative costs. See issue opencog#1451
Idea two: issue-links in sorted order. Currently, in
This is not quite enough to make it work. The linked lists
I have a very dumb question; excuse me if it is incredibly dumb: why are we counting, anyway? Is there some kind of info being recorded, during counting? Where? If not, why can't we just jump straight into extract? I feel like there is something obvious that I should know but have forgotten. It's embarrassing.
Here is how I see it: BTW, it seems possible to just point the parse-set elements to the corresponding
I made some progress with the code. It is not obvious to me whether this can be improved on or not (while keeping the general idea implemented there). It's not clear that I've handled everything correctly. Here are two clues as to what is going wrong:
Hmm. My "almost works" code seems to almost not work, depending on the sentence. So for example:
generates linkages in "almost sorted order". However,
does not do very well, with almost randomized results, i.e. almost not sorted at all. So, clearly, I've been fooling myself, with what I thought I was doing here. What am I doing wrong? Is this fixable?
I think so.
A local sorting of disjunct costs cannot find the linkages with the lowest total cost: the nested parse-sets (inside a parse-choice) lead to other parse-choice lists that contain arbitrary, unknown costs, so a local sort has zero ability to predict the costs that will be encountered through the rest of the graph. This is similar to the travelling salesman problem, in which following the shortest link to the next node does not, in general, lead to the shortest overall path.
I think that it is not fixable in the way you did it, but has to be done using one of the shortest-path-finding algorithms.
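To make this concrete, here is a tiny stand-alone illustration (made-up numbers, not link-grammar code) of why ranking by the immediate disjunct cost alone can fail: a locally cheap choice can hide an expensive continuation behind it.

```c
/* Toy illustration (not link-grammar code): greedily picking the lowest
 * *immediate* cost can miss the lowest *total* cost, because a cheap-looking
 * choice may hide an expensive continuation in its nested parse-set. */
#include <stdio.h>

struct choice {
    const char *name;
    double immediate;   /* cost of the disjunct chosen right now */
    double hidden;      /* cost buried in the nested parse-set behind it */
};

int main(void)
{
    struct choice c[] = {
        { "A", 1.0, 5.0 },   /* cheapest locally, expensive overall */
        { "B", 2.0, 0.5 },   /* pricier locally, cheapest overall */
    };
    int greedy = 0, best = 0;
    for (int i = 1; i < 2; i++) {
        if (c[i].immediate < c[greedy].immediate) greedy = i;
        if (c[i].immediate + c[i].hidden < c[best].immediate + c[best].hidden)
            best = i;
    }
    printf("greedy pick: %s (total %.1f)\n", c[greedy].name,
           c[greedy].immediate + c[greedy].hidden);
    printf("best pick:   %s (total %.1f)\n", c[best].name,
           c[best].immediate + c[best].hidden);
    return 0;
}
```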
BTW, I forgot to add that I looked at your
You once added a comment in link-grammar/link-grammar/parse/extract-links.c, lines 358 to 364 in acb424d:
I finished its implementation 4 years ago in commit c1b4d78. The code is enabled by
Hmm. In that case, the code for
It can indeed be removed.
I'm still struggling to understand the parse set. To begin doing this, I started by making some measurements, like so:
For the Bainite test sentence, I get that the tree size is 1153392 -- about 1.1 million -- while there are 10201832 linkages -- about 10M.
For the Tagalog test sentence, I get a parse-tree size of 1.74e9 and a parse-count overflow. I changed the counts to 64-bit.
For the Cortes sentence and 32-bit counts, I get an overflow and a 1.2e10 tree size. For 64-bit counts, Cortes gives a huge linkage count, and I don't know how big the parse-tree is -- it is still counting, as I write this. At least 1e12, if not more.

Obviously, counting such humongous trees does NOT require lots of RAM -- most of the branches in the tree seem to be rearranged copies of one another, pulled out of hash tables (the RSS of LG during the Cortes counting is 278 MBytes). Thus, even though the count is explosive, the memory use is not.

Rephrased as a traveling salesman problem: it's as if the salesman has to visit certain houses in certain towns in certain counties in certain states, but most of the towns are identical to one another, most of the counties are identical to one another, and most of the states are identical, just with permutations that skyrocket the number of possibilities. Somehow, the sameness should be exploitable. (Counting code is here.)
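For what it's worth, a minimal sketch of why the count explodes while memory does not: because branches are shared through the hash table, the "tree" is really a DAG, and a memoized count visits each node only once. The structs here are simplified, hypothetical stand-ins, not the real Parse_set/Parse_choice from extract-links.c.

```c
/* Sketch: memoized linkage counting over a hash-shared parse-set DAG.
 * Simplified, hypothetical structures -- not the actual extract-links.c code. */
#include <stdint.h>
#include <stddef.h>

#define MAX_CHOICES 8            /* arbitrary bound, for illustration only */

typedef struct pset pset;
typedef struct pchoice {
    pset *left, *right;          /* nested parse-sets of this choice */
} pchoice;

struct pset {
    pchoice choice[MAX_CHOICES];
    size_t num_choices;
    uint64_t count;              /* memoized linkage count below this set */
    int counted;                 /* memo flag */
};

static uint64_t count_linkages(pset *s)
{
    if (s == NULL) return 1;            /* nothing on this side: one way */
    if (s->counted) return s->count;    /* shared (hashed) node: reuse result */
    uint64_t total = 0;
    for (size_t i = 0; i < s->num_choices; i++)
        total += count_linkages(s->choice[i].left) *
                 count_linkages(s->choice[i].right);
    s->counted = 1;
    return s->count = total;            /* explosive number, linear-time work */
}
```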
Even without the sameness, it seems as if the branches of the tree should be compactly "describable": for each branch, it should be possible (and quick & easy) to compute and store (during
Let's assume the above is possible (it might not be; see further below). If it is possible to compute & store mins and averages, then it seems to imply several statements:
Perhaps it is somehow not possible to ("easily") compute and store these? If not, why not?
But what if a parse-set in a parse-choice was changed at some earlier time? Would that mean the averages stored in the parse_choice are no longer valid? To find out if they are valid, would we have to walk the tree all the way to the leaves? That would be infeasible. Is this where the problem lies? I'm still very confused. I'm having trouble envisioning the "shape" of that tree.
Per the comments in opencog#1451, and from direct experiments last night, it does not seem to make any difference.
Note that to extract the links, you don't traverse a tree from its start point to a single endpoint. The link-listing algorithm traverses it recursively along many paths, to the end of each one. So I cannot see how your algorithm would work.
Instead of in
Here's how to find one low-cost parse. Starting from the top, there are L parse choices on the left, and R to the right. These can be ranked according to lowest-to-highest average cost. Pick one each, on left and right, having the lowest average cost. Then recurse. This should give you some linkage, which should have a low cost, because, at each step, you were picking from a collection having the lowest average. To get the second-cheapest parse, you'd advance with the same modulo calculation as today, but alternate which side you advance first, based on picking which side has the lowest average cost.

This is similar to my earliest experiment, which attempted to maintain a total cost instead of an average cost, and to keep the lists of parse choices in cost-sorted order. For one sample sentence, it seemed to almost work; for another sample sentence, it was disappointing. I still don't really understand why the experiment didn't work. I can't tell if it was a bug in my code, or if the concept is flawed, or what the flaw is. It did seem to work for a few short sentences, and fail on the longer ones.

Maybe I should have tested more short sentences? Draw out, by hand, pen and paper, how the traversal actually works? Perhaps that would expose the issue ...
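Here is one way to read that greedy idea as code, assuming (hypothetically) that each nested parse-set already carries a precomputed average cost of everything below it. The structs are simplified stand-ins, not the real extractor data structures.

```c
/* Sketch of the greedy descent: always take the parse choice whose disjunct
 * cost plus the average costs of its nested sets is lowest, then recurse.
 * Hypothetical, simplified structures -- not the actual extract-links.c ones. */
#include <stddef.h>
#include <float.h>

typedef struct pset pset;
typedef struct pchoice {
    pset *left, *right;   /* nested parse-sets */
    double cost;          /* cost of the disjunct used by this choice */
} pchoice;

struct pset {
    pchoice *choice;      /* the choices at this level */
    size_t num_choices;
    double avg_cost;      /* assumed precomputed: average cost below this set */
};

/* Return the total cost of one (hopefully low-cost) linkage. */
static double greedy_linkage_cost(const pset *s)
{
    if (s == NULL || s->num_choices == 0) return 0.0;

    size_t best = 0;
    double best_est = DBL_MAX;
    for (size_t i = 0; i < s->num_choices; i++) {
        const pchoice *c = &s->choice[i];
        double est = c->cost
            + (c->left  ? c->left->avg_cost  : 0.0)
            + (c->right ? c->right->avg_cost : 0.0);
        if (est < best_est) { best_est = est; best = i; }
    }
    const pchoice *c = &s->choice[best];
    return c->cost + greedy_linkage_cost(c->left)
                   + greedy_linkage_cost(c->right);
}
```

To enumerate the second-cheapest and later linkages, one would still need the modulo-style index advancement mentioned above; this sketch only extracts a single candidate.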
Are there path fragments that may be included in both high-cost and low-cost parses, due to the costs in the rest of the paths?
I'm trying to figure this out. My words below may contain errors and mis-statements.

Conceptually, the parse-set is laid out as a tree of lists, having, at each level, a list of left and a list of right entries. A linkage is a selection of a pair of entries at each level, one from the left list and one from the right list. That is, a single linkage is a single, true binary tree.

As the above parse-set size measurements indicate, the tree-of-lists is very large. The hashing means that most of the lists are shared. That is, the tree-of-lists is a DAG. I think you are asking "how do I assign a cost when there are several different inequivalent ways of getting to the same place?" Each left-list and right-list is unique, because of the hash-table deduplication. This still allows costs and averages to be rolled upwards, from the leaves. Each parse-choice can store such a total. To compute this total, start at the leaves and work upwards. But we wanted an average, not a total, so redo the same walk, counting 1 for each linkage. Now that we know how to walk the tree, and where to store totals, other kinds of totals, with weights, can also be stored.

If you see any mistakes in the above, let me know. I think this is now sufficiently unambiguous for me to be able to code it up and try it... I might try that ... well, not today. Maybe in a few days.
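A sketch of that bottom-up roll-up, under the same caveats as before (hypothetical simplified structs, not the actual extract-links.c ones); counts are kept as doubles since the integer counts overflow, and the average below any set is then total_cost / count.

```c
/* Sketch: roll linkage counts and total costs up from the leaves, memoized
 * so that shared (hashed) sub-DAGs are visited only once.
 * Hypothetical, simplified structures -- not the real extractor code. */
#include <stddef.h>

typedef struct pset pset;
typedef struct pchoice {
    pset *left, *right;       /* nested parse-sets */
    double cost;              /* cost of this choice's disjunct */
} pchoice;

struct pset {
    pchoice *choice;
    size_t num_choices;
    int done;                 /* memo flag */
    double count;             /* number of linkages below (double: may be huge) */
    double total_cost;        /* summed cost of all those linkages */
};

static void roll_up(pset *s)
{
    if (s == NULL || s->done) return;
    s->count = 0.0;
    s->total_cost = 0.0;
    for (size_t i = 0; i < s->num_choices; i++) {
        pchoice *c = &s->choice[i];
        roll_up(c->left);
        roll_up(c->right);
        double lc = c->left  ? c->left->count       : 1.0;
        double rc = c->right ? c->right->count      : 1.0;
        double lt = c->left  ? c->left->total_cost  : 0.0;
        double rt = c->right ? c->right->total_cost : 0.0;
        /* Every (left, right) combination uses this choice's disjunct once;
         * each left sub-linkage cost appears once per right sub-linkage,
         * and vice versa. */
        s->count      += lc * rc;
        s->total_cost += lc * rc * c->cost + rc * lt + lc * rt;
    }
    s->done = 1;
    /* average cost below s = s->total_cost / s->count   (when count > 0) */
}
```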
I really want to try this, ASAP. But I have a lot of errands I have to run first. If you try it, then the correct total to keep would be
where ... Then the list of parse_choices would be kept in sorted order, most to least.
I added some ascii-art here: link-grammar/link-grammar/parse/extract-links.c, lines 764 to 788 in dcd6cc6.
Well, I have no self-control, so I tried coding up what I said above. It was difficult, and seems not to work, and I don't understand why: a bug in the code? A bug in the theory? Unfortunately, this needs a sit-down with paper & pencil to debug. Same branch as before: https://github.com/linas/link-grammar/tree/sorting-experiment
Just note that
The various
In an old analysis I once did, I became convinced that you cannot find the lowest-cost path by assigning costs to parse-set elements. But I don't remember the details, and maybe I made a mistake, since your description is convincing. I will try to redo that analysis and recall.
In the old analysis, I sorted by cost of the current disjunct, which might be hiding some high-cost disjuncts behind it. So indeed, this would not be effective. To overcome this, in my second experiment, I summed totals, and that seemed to work almost-ish, but I think I was sloppy with my summing (since I was still confused about what I was doing). In the current form, I'm taking averages over all possibilities, and so high-cost disjuncts hiding behind low-cost ones should become "visible" in the average. However, again, this only "sort-of works". I'll try to figure out why, later.
A different (un-)related problem I'm having is that during
What is
What do you mean by "stop"?
I'm collecting data now. Prior to some of the worst-case cases, I see this:
average number of parse_choice allocs is about 2 million; worst case so far is 95 million. About 6K disjuncts on average. The one that had 95 million parse_choice allocs had 22K disjuncts in it. I don't have the other numbers, yet.
Several choices: 1) predict in advance that there will be an explosive number of parse_choices, and skip creating the parse_set structure entirely (i.e. no linkages, just fail); 2) trim the number of parse_choices, e.g. by discarding those with high cost, even as they are being computed; 3) something else... Choice 2 sounds ... interesting....
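A rough sketch of what option 2 might look like: keep each parse-choice list in cost-sorted order and cap its length, discarding the most expensive entry whenever the cap is exceeded. The list node and the cap value are illustrative assumptions, not actual extract-links.c structures or an existing LG parameter.

```c
/* Sketch of cost-based trimming: insert each new choice in ascending cost
 * order, and drop the most expensive entry if the list grows past a cap.
 * Hypothetical node layout and cap -- for illustration only. */
#include <stdlib.h>

#define MAX_CHOICES_PER_SET 64   /* example cap, not an actual LG parameter */

typedef struct pc_node {
    struct pc_node *next;
    double cost;                 /* cost associated with this parse choice */
    /* ... rest of the parse-choice payload ... */
} pc_node;

static pc_node *insert_trimmed(pc_node *head, pc_node *nu, size_t *len)
{
    pc_node **pp = &head;
    while (*pp && (*pp)->cost <= nu->cost)   /* find ascending-cost position */
        pp = &(*pp)->next;
    nu->next = *pp;
    *pp = nu;

    if (++*len > MAX_CHOICES_PER_SET) {      /* over the cap: drop the tail */
        pc_node **last = &head;
        while ((*last)->next) last = &(*last)->next;
        free(*last);   /* illustrative; the real code uses a memory pool */
        *last = NULL;
        --*len;
    }
    return head;
}
```

If the newly inserted entry is itself the most expensive one, it is the entry that gets dropped, so the list never grows past the cap.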
I created all new graphs, but they are more or less the same as the earlier graphs. I won't repeat them here, they're not that interesting. Results:
The most interesting new result is the max number of parse choices in a parse-set, for a fixed parse. That is, a fixed parse-set is a linked list of parse choices, given by
The size of the tracon pool, and the number of pset_buckets are very nearly equal, to some 5% or 10%, over three orders of magnitude. The one can be used as an excellent estimator for the size of the other. |
The number of parse_choice elements in each parse_set can be calculated in
Maybe this will even save CPU, even though a count needs to be added/stored in
Since only
It is still interesting to know what percentage of
I like this idea, I think. I did a run with 82K sentences in it. The average size of
By comparison, the average size of
Does this idea work if there is a count overflow? I don't know how often I get count overflows.
Yes, exactly, by design. Let me explain the process more clearly.

First (hours/days/weeks earlier) a dataset of word-pairs is created. These are pairs of words that have been seen near each other, together with how often they've been seen. You might think that this dataset is huge, but it's not. For a vocabulary of 50K distinct words, you might think that 50K x 50K = 2500M word pairs are possible. In fact, only about 1 in a thousand are actually seen (so maybe 10M or 25M, ballpark). The distribution is Zipfian -- the word "the" pairs with almost all other words. The word "bicycle" only pairs with a few -- "handlebars", "seat", etc.

For each word-pair, the https://en.wikipedia.org/wiki/Pointwise_mutual_information is computed. "bicycle seat" has a high MI. "the she" has a negative MI. This is all done days in advance (although I'm working on doing it "on the fly").

During MST parsing, disjuncts are created on the fly from these pairs, as a Cartesian product. What this means is this: let's say some word w in a sentence might link to other words, with links A or B or C... the expression that is created is then four copies of
which obviously blows up and allows some huge number of disjuncts. Almost all of these are pruned, immediately. But those that are left -- they will connect almost any word to almost any other. So it is still a relatively "free" parse ("free" in the sense of "anything goes", or, mathematically, a "free object", having few constraints). Lots of very different linkages are possible.

If you think about this situation a bit, it's not hard to see why the size of the tracon table and the size of pset_buckets might be the same.

I failed to see how often count overflows happen. I think it's constrained enough that they do not happen, but I dunno. Confused about this; didn't measure.

Each disjunct has a cost, which is the sum-total of the MI of the individual word-pairs. The lowest-cost parse is then a "minimum spanning tree" (MST): it consists of word pairs whose grand-total sum minimizes the cost.

In principle, I could reject high-cost disjuncts from the get-go. This would reduce much of the pressure, I suppose. I haven't explored this yet. It's a parameter-setting problem. Should be easy...
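For reference, the pointwise mutual information referred to above is the standard quantity (this is the textbook definition; the exact cost convention used in the MST code may differ, e.g. by a sign or a scale):

$$\mathrm{PMI}(x, y) \;=\; \log_2 \frac{p(x, y)}{p(x)\,p(y)}$$

It is positive when a pair co-occurs more often than chance (e.g. "bicycle seat") and negative when it co-occurs less often (e.g. "the she"), which is why summing it over the word-pairs of a disjunct gives a useful cost-like score.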
In this run, I think there are zero. This is configurable, and I think I configured it off, so as to avoid issues. The goal is to use ANY connectors only if nothing else works, but I don't have a protocol for this yet. I don't know if I should have it on by default, or off by default and retry later if it fails, or what, exactly. There's half-a-dozen parameters in this, half-a-dozen on-off switches; I don't know what the best ones are, and I don't have a plan for how to explore them. It's all very seat-of-the-pants stuff.

BTW, I'm building docker containers for this, so that you can try it with minimal fuss. First one is ready. Second one is almost ready. Third one, I haven't started. I ... .... hmm. just realized that ... I should add an LG demo to the first container. Why didn't I do that before? Dohhh.

I don't know if you know docker. Basically, you download, press a few buttons, and bingo, everything works.

part one:
part two:
I am still hoping that it will be possible to assign min and max cost bounds to parse_choice elts so that these can be sorted, and trimmed if too long. I still don't understand why my naive hacks to do this didn't work.
I think so. |
I added a graphviz rendering of the parse-set. It's in the same experimental tree as before:
I am unable to control the graphviz layout sufficiently; it goes crazy once the graph is bigger than a few nodes. But it seems to be accurate. The file is written to

I think I finally understand why the sorting code sort-of works, and sort-of doesn't. Given any fixed parse-set, it does correctly walk from lowest to highest cost at that level. The problem is that the next-lowest-cost linkage may be in a parse-set at a higher level, and there's no obvious way to back-track to there first, before exhausting the current level. This can be seen in the debug printout, where the costs print in a sawtooth-ascending fashion: the costs get higher until the walk backtracks up a level, and then the cost drops. Then it resumes the climb till the next backtrack.
One of the "deep" conceptual problems with LG parsing is that there are usually either not enough parses (i.e. null words) or there are too many (count overflow) and its hard to strike a middle ground. This particularly affects linkage generation: there (currently) isn't any way to know the cost of a linkage, without actually computing it.
This issue is meant to be a place to daydream about possible solutions to this.
I have only one very vague and not really workable suggestion for right now. During counting, instead of counting 1 for each link, do a weighted count of $2^{-cost}$ for each link.
Counting in this way replaces a grand total by a weighted count, hinting at how many of these potential linkages are low-cost vs. how many are high-cost. We could even make it smell more like simulated annealing by using $\exp(-\beta \times cost)$ instead of $2^{-cost}$, where $\beta$ is a user-supplied "inverse temperature" parameter $\beta = 1/kT$.
What is not yet clear to me is how to use this to replace the random sampling of linkages (when there is a count overflow) by a weighted random sampling, that would make lower-cost linkages more likely to be sampled.
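One possible shape for that weighted sampling, sketched under assumptions (simplified stand-in structures and a plain rand() for brevity; not the actual extractor): give each parse choice a weight of $2^{-cost}$ times the weights of its nested sets, and select a choice with probability proportional to its weight, in place of the current uniform modulo-style index selection.

```c
/* Sketch: weighted random selection among parse choices, where each weight
 * is 2^(-cost) times the weights of the choice's nested sets.
 * Hypothetical, simplified structures -- not the actual extractor code. */
#include <stdlib.h>
#include <math.h>

typedef struct wchoice {
    double cost;      /* cost of this choice's disjunct */
    double weight;    /* 2^(-cost) times the nested sets' weights */
} wchoice;

/* The per-link weighted count suggested above. */
static double link_weight(double cost) { return pow(2.0, -cost); }

/* Pick index i with probability weight[i] / total_weight. */
static size_t sample_choice(const wchoice *c, size_t n, double total_weight)
{
    double r = ((double)rand() / ((double)RAND_MAX + 1.0)) * total_weight;
    for (size_t i = 0; i < n; i++) {
        if (r < c[i].weight) return i;
        r -= c[i].weight;
    }
    return n - 1;     /* guard against floating-point round-off */
}
```

With this, lower-cost linkages would be sampled more often, and raising $\beta$ (if the annealing-style weight is used) would concentrate the samples ever more tightly on the cheapest linkages.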