Fixing the SAT solver #188
The link-grammar.def file is not needed by Linux; it's used only by Windows and MacOS, where it plays a vital role. Basically, it lists the functions that are publicly visible from a shared library. I guess you are suggesting that we declare these functions as public API functions in various header files, but I really, really don't want to do that, since they really are not meant to be used by the public. A better solution would be to make the sat library a STATIC library (not shared) and then link it directly into the LG library, if SAT compilation is enabled. That way, I believe we could remove most of the crud from the link-grammar.def file.
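For readers unfamiliar with the format: a module-definition (.def) file is just a list of symbols to export from the library. A minimal sketch of what one looks like (the names below are real public LG API functions; the actual file lists many more, including the SAT-solver internals under discussion):

```
; link-grammar.def (sketch)
LIBRARY link-grammar
EXPORTS
    dictionary_create_lang
    dictionary_delete
    sentence_create
    sentence_parse
    linkage_create
```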
On Linux I just had to add these functions to link-grammar.def - there was no need to declare them as public API. If so, isn't this link-grammar.def addition enough also for Windows and MacOS? In any case, I suggested above that we solve the need to add these to link-grammar.def in another way: Alternatively (instead of a static library), why not just link the sat-solver files along with the rest of the link-grammar library files?
Well, yes, since we've configured the lib to use link-grammar.def, it's used by Linux too. If the functions are small then yes, maybe inline code in a header file might be OK, but is the cure worse than the disease? Yes, we could link the SAT code straight into the LG library, but that would require writing some strange, convoluted, hard-to-debug autoconf code ... which, because it's weird and hard to debug, I would not want to do. So, suggesting that it be a static lib gives you the same effect, but with less work and less confusion...
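In automake terms, the static-lib idea might look roughly like this (a sketch; the target names and source list are illustrative, not the actual Makefile.am contents):

```
# link-grammar/sat-solver/Makefile.am: a libtool "convenience" library,
# built but never installed on its own.
noinst_LTLIBRARIES = libsat-solver.la
libsat_solver_la_SOURCES = sat-encoder.cpp variables.cpp

# link-grammar/Makefile.am: fold its objects into the main library
# when SAT support is enabled.
if WITH_SAT_SOLVER
liblink_grammar_la_LIBADD = sat-solver/libsat-solver.la
endif
```

Since the convenience library's objects end up inside liblink-grammar itself, its internal symbols never cross a shared-library boundary and need no .def entries.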
Here is a summary of problems that are still open after the latest fixes:
Of all these problems, I guess that fixing the cost problem has the highest priority.
An additional thing related to point (4) above. The reason is that by default !null=1, so if no complete linkage is found, link-parser tries to find a linkage with min_null_count=1 and max_null_count=sentence_length(). My suggested fix, until an encoding that allows parsing with nulls is implemented (I am trying to find out how to do that in an incremental way):
2. If min_null_count==0 but max_null_count>0 (never happens with link-parser), try to parse anyway, but if no complete linkage is found, also issue the said message. I will send a pull request that implements this, but if you think it should be done in another way I can of course change it.
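A minimal sketch of the suggested control flow; the two getters are from the public LG options API, while sat_parse_sentence() and the message wording are placeholders:

```cpp
// Sketch only: report the limitation instead of silently failing when
// null links are requested, until null-link parsing is SAT-encoded.
int sat_parse_with_nulls_check(Sentence sent, Parse_Options opts)
{
    int min_nulls = parse_options_get_min_null_count(opts);
    int max_nulls = parse_options_get_max_null_count(opts);

    if (min_nulls > 0) {
        prt_error("Error: SAT parser: parsing with null links "
                  "is not yet supported.\n");
        return 0;                               /* no linkages */
    }

    /* Hypothetical internal entry point: full parse, no nulls. */
    int num = sat_parse_sentence(sent, opts);

    if ((num == 0) && (max_nulls > 0)) {
        /* The caller would now have fallen back to null-link parsing. */
        prt_error("Error: SAT parser: no complete linkage found, and "
                  "parsing with null links is not yet supported.\n");
    }
    return num;
}
```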
Benchmarking newer Minisat (and Minisat-based) code. Description and conclusion.
I tried LG with the following libraries:
Here is a summary. It turns out that my previous finding, that Minisat 2.2+ performs the same as the current Minisat version, doesn't hold for en/4.0.fixes-long.batch. The tests were mainly with the default SAT-library parameters (I didn't test much with non-default ones, and the changes I tested didn't cause a speed-up). At first glance it seemed that the comparison to the standard LG parser is not completely fair, since the standard parser finds and sorts up to 1000 linkages per sentence even when running in batch mode, while the SAT parser only finds up to one valid linkage, since it currently finds linkages on demand only. However, further tests showed that the overhead of fetching linkages by the standard parser is relatively negligible (for the SAT parser it is the main overhead). In the tests described below, the SAT parser only finds the first valid linkage (if there is one).
Conclusion: My recommendation is to replace the current Minisat version by Glucose-4.0 (3). Also, I would first like to check the impact of improvements in the low-level part of generating the CNF (I don't yet understand the higher-level part of the LG linkage encoding).
Would you still happen to have the raw numbers? Could you post them? One reason the SAT work stagnated is that the original author found that he could not beat LG for normal-sized sentences. One could argue this was well-known -- SAT was first deployed in the chip-design industry, and it could not beat established tools for "ordinary"-sized ICs. When ICs got larger, SAT started winning by orders of magnitude: but you have to have big problems to see the pay-off. So, also, it seems, in LG... I've often thought of a "hybrid" mode: using the usual parser for sentences under 30-50 words, and then switching to SAT for anything longer. Is the Glucose license compatible with the LGPL? The biggest reason NOT to use SAT is that we need a parser that can handle conversations, where words dribble in one at a time, or where several people are talking. For that, Viterbi would produce useful results, although it would be a lot SLOWER than the current LG. If you are fishing for something to do that would be useful & cool, there are several things worth doing. One is partial-sentence support: implement a "virtual left/right wall" that captures the state (the unfulfilled disjuncts) in mid-sentence; then later, when the rest of the words come in, parsing resumes at this "virtual wall". Another would be to handle quotations correctly, and/or multiple speakers: if you have two people saying things, can you untangle them into distinct streams? It would need probabilistic hints ...
My relevant diary is attached here:
I am now trying to optimize the low-level clause generation by using PBlib. However, for really short sentences it is most probably not going to help, since there is not much to optimize in the low level there.
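For reference, here is roughly what using PBlib looks like, going by its documentation (a sketch, not verified against the actual headers): a cardinality constraint goes in, DIMACS-style clauses come out, ready to hand to the SAT solver.

```cpp
#include <vector>
#include "pb2cnf.h"   // PBlib's encoder front-end

// Encode "at most 2 of these four literals are true" as CNF.
PB2CNF pb2cnf;
std::vector<int32_t> literals = {1, 2, 3, 4};
std::vector<std::vector<int32_t>> formula;  // receives the generated clauses
int32_t first_fresh_var = 5;                // first unused variable number
first_fresh_var =
    pb2cnf.encodeAtMostK(literals, 2, formula, first_fresh_var) + 1;
// Each inner vector of 'formula' is now one clause, in DIMACS numbering.
```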
It has a type of "free" license. I couldn't find the license online, so I attach it here, taken from the extracted Glucose-4.0 sources (renamed from LICENSE to LICENSE.txt for the purpose of this attachment):
For that I will need some hints and help on how exactly to approach this task, because I have already looked at it and couldn't conclude how it should be done.
I also started to investigate this problem. There is an article in my reading list that will hopefully give me useful insights: Punctuation in Quoted Speech.
It seems to me that in order to do that in a good manner, the program needs to "understand" much more than the current LG library can. [BTW, I meanwhile haven't forgotten the tokenizing problem, and even the Hebrew dictionary.]
Could you create a directory called "diary" and place fix-long-benchmark.txt into it? The LICENSE seems to be a BSD-type license, so it's compatible.
Also, I noticed that the Russian tests have very different timing behavior than the English ones.
For the mid-sentence "wall", think of it this way: if you cut the final parse of a sentence mid-way, you will cut some links, i.e. generate some connectors. On the first half of the sentence, the connectors all point right; on the last half, they all point left. Thus, a cut sentence looks as if it could be "just some word" with connectors on it, i.e. a disjunct. During parsing, there is more than one possibility: so when you cut, there will be multiple possible disjuncts attached to the half-sentence ...
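A conceptual sketch of that state, just to pin down the data involved (all the types here are hypothetical, not existing LG structures):

```cpp
#include <string>
#include <vector>

// One cut link leaves behind one unfulfilled connector.
struct CutConnector {
    std::string label;    // e.g. "S", "O", "MV"
    bool points_right;    // true on the first half of the sentence
};

// One way the half-sentence could still be completed: a candidate disjunct.
using CandidateDisjunct = std::vector<CutConnector>;

// The "virtual wall": the half-sentence summarized as a single pseudo-word
// carrying all its possible disjuncts. Parsing resumes against this.
struct VirtualWall {
    std::vector<CandidateDisjunct> candidates;
};
```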
I will do that. I would also like to add the "ru" results. For Russian there is a problem of mismatched alternatives ("insane linkages"). Finding such "bogus" solutions is a significant overhead. However, it is possible to SAT-encode the constraint "select only words from the same alternative".
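A sketch of how that constraint could be expressed through the Minisat API; word_var, alt_of and token_pos are hypothetical bookkeeping arrays, not existing LG structures:

```cpp
#include <vector>
#include "minisat/core/Solver.h"

using namespace Minisat;

// Forbid mixing words from different tokenization alternatives: for every
// pair of words that cover the same token position but belong to different
// alternatives, add the binary clause (~w_i | ~w_j), so at most one of the
// two can appear in a linkage.
void forbid_mixed_alternatives(Solver& S,
                               const std::vector<Var>& word_var,
                               const std::vector<int>& alt_of,
                               const std::vector<int>& token_pos)
{
    for (size_t i = 0; i < word_var.size(); i++)
        for (size_t j = i + 1; j < word_var.size(); j++)
            if (token_pos[i] == token_pos[j] && alt_of[i] != alt_of[j])
                S.addClause(~mkLit(word_var[i]), ~mkLit(word_var[j]));
}
```

This is quadratic in the number of words per position, but alternative counts are small, so the added clauses should be cheap next to the cost of enumerating insane linkages.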
Well, be careful about "constraints": SAT is a constraint solver; in general, it's best if you tell SAT what the constraint is, and let it solve it. Or at least, that is the principle. The practice, that's another matter, I guess.
Regarding adding constraints to eliminate "insane" linkages, I forgot to add:
Here are my results:
Hmm. OK. Yes, switch to Glucose, I guess. Sure, add constraints. For sentences of 20-50 words, sometimes SAT is faster, sometimes the orig algo. One nutty idea is to start both, one in each thread, and the first one to stop wins. Of course, if the timeout is small, the orig algo will always win... the answer seems to be "use SAT if the sentence has a lot of words and the timeout is large" ...
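The threshold variant of the idea is trivial to express with the existing public options API (a sketch; the 40-word and 10-second cut-offs are made up and would need tuning against the benchmarks above):

```cpp
#include <link-grammar/link-includes.h>

// Pick a parser per sentence: use SAT only when the problem is big
// enough to pay off and the caller can afford to wait.
static void choose_parser(Sentence sent, Parse_Options opts)
{
    int words   = sentence_length(sent);
    int timeout = parse_options_get_max_parse_time(opts);

    parse_options_set_use_sat_parser(opts, (words > 40) && (timeout > 10));
}
```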
This is indeed something that needs exploring. Another thing I would like to explore is SAT parallelism. I still need to try the Glucose-4.0 parallelism. |
Comparison of running the data/en batches and the data/ru batch files, using the standard parser and 4 different SAT libraries (including the currently used one). See also the discussion at issue opencog#188.
ampli commented on Sep 19, 2015
The main problem is expressions like this:
The current code assigns costs only to connectors, so I have no idea what to do with the costs of null expressions.
Ah, good question. The costly-null is almost always used in the form ...
Thus, as a corollary, expressions such as ...
I guess that if I want to fix it, I will need to add the concept of a "null connector", because changing the SAT encoding code to put the cost on disjuncts seems too ambitious to me (and maybe unneeded). The question is what the matching rules of a "null connector" would be.
That seems like a dangerous idea. A simpler approach would be to always expand out any terms that have a null link in them, and bump the cost of the expanded form. Surely the SAT solver already has some mechanism to expand null links, so you don't have to re-invent that. All you have to do is propagate that cost onto some other ... any other ... connector in the expanded form. And the SAT solver already has to deal with expressions like these anyway.
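To make the expansion concrete with a made-up expression (not taken from the actual dicts): suppose a word carries B- & (A+ or [()]). Distributing the B- over the or gives two disjuncts, and the cost of the null branch then has a natural place to land:

```
B- & (A+ or [()])        % original: the null branch carries cost 1
    ==> (B- & A+)        % disjunct 1, cost 0
    or  [B-]             % disjunct 2: the null vanishes, its cost moves to B-
```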
If I have an expression ...
I already answered that.
See also issue opencog#188.
Update:
Hi Linas!
I have mostly finished the full integration of the SAT solver.
In addition to the previous fixes, which are listed in the comment in #187, I have fixed several more aspects of the SAT solver:
6. The maximum number of shown linkages is now !limit instead of the arbitrary 222.
Remaining to fix:
However, this will require adding more functions to link-grammar.def (I have already added many).
Alternatively, and maybe better, the shared functions can be moved to the corresponding .h files and removed from link-grammar.def. This would solve the problem of the many non-API symbols that are exported by the LG lib.
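That is, something along these lines (a toy example with hypothetical names): a small shared helper becomes a static inline function in a header, so each translation unit gets its own copy and nothing needs to be exported at all.

```cpp
/* sat-shared.h (hypothetical): helpers shared between the core library
 * and the SAT code, with no exported symbol and no .def entry needed. */
#ifndef _SAT_SHARED_H_
#define _SAT_SHARED_H_

static inline int connector_labels_match(const char *a, const char *b)
{
    /* ... the real matching logic would live here ... */
    return 0;
}

#endif /* _SAT_SHARED_H_ */
```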