Prune speedup #845
Conversation
Instead of copying the list (in reverse order) while making the deletions, just delete in place, preserving the list order. (Preserving the order is important for the next change, which depends on the shallow connectors always being first in the list.)
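A minimal sketch of the in-place deletion pattern described above, using illustrative types rather than the actual prune.c structures:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct Node_s Node;
struct Node_s
{
	Node *next;
	int payload;             /* stand-in for the real connector data */
};

static bool keep(const Node *n)
{
	return n->payload != 0;  /* stand-in for the real pruning test */
}

/* Delete unwanted nodes in place, keeping the survivors in their
 * original order (no reversed copy of the list is built). */
static Node *prune_in_place(Node *head)
{
	Node **link = &head;
	while (*link != NULL)
	{
		Node *n = *link;
		if (keep(n))
			link = &n->next;   /* keep: advance past this node */
		else
		{
			*link = n->next;   /* delete: unlink and free */
			free(n);
		}
	}
	return head;
}
```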
Connectors cannot match when both are non-shallow. Until now the list of connectors contained mixed shallow/non-shallow entries, so the whole list had to be scanned, even though each match try against another non-shallow connector just returns False. Implement the following to stop earlier in the left/right_table_search() matching (possible_connection()) loop:
1. In power_table_new(), arrange for the shallow connectors to be first in the table, by inserting them last (they are inserted in reverse order).
2. In left/right_table_search(), if we are matching a non-shallow connector, stop when there are no more shallow ones to try.
Note that in a previous commit cleanup_table() was changed not to alter the table order, which is critical for this change.
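A simplified sketch of this idea with stand-in types (the real code lives in prune.c's power table and left/right_table_search()):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Connector_s
{
	bool shallow;
	/* ... label, word position, etc. ... */
} Connector;

typedef struct c_entry_s c_entry;
struct c_entry_s
{
	c_entry *next;
	Connector *c;
};

/* Insert at the head of a table bucket.  If the shallow connectors
 * are inserted LAST (after all the deep ones), they end up FIRST in
 * the bucket list, which the early stop below relies on. */
static void bucket_insert(c_entry **bucket, c_entry *e)
{
	e->next = *bucket;
	*bucket = e;
}

/* Stand-in for the real match test (label/word checks, etc.).
 * The rule used here: two non-shallow connectors never match. */
static bool possible_connection(const Connector *a, const Connector *b)
{
	return a->shallow || b->shallow;
}

/* Return true if some connector in the bucket can match `c`. */
static bool table_search(const c_entry *bucket, const Connector *c)
{
	for (const c_entry *e = bucket; e != NULL; e = e->next)
	{
		/* Early stop: a non-shallow `c` cannot match another
		 * non-shallow connector, and all shallow entries come
		 * first, so nothing further down the list can match. */
		if (!c->shallow && !e->c->shallow) return false;

		if (possible_connection(e->c, c)) return true;
	}
	return false;
}
```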
This allows the code to compile if defined. It still causes slowness so it remains undefined.
Also delete disjuncts in-place instead of copying the remaining ones to a new list.
Validate that all BAD_WORDS have been removed.
This saves some CPU by avoiding a long zero-byte padding.
(GCC would most probably inline it anyway, but I don't know about other compilers.)
While at it:
- Reformat to the new code style.
- Localize variables.
looks good.
So. Stupid computer tricks. Just to see how things are going, I thought to measure the speed of the ten-year-old version 4.2.5 ... I had to do some hacking to get past a broken file-search-path design. I was shocked to find that it is 3x faster. Let that sink in.

Basically, let's try a different experiment: run today's parser with the old dictionary. Easy to say. Hard to do. For multiple reasons. Conclusion: today's parser is almost 2x faster than the old parser.

What to conclude? The "almost 2x faster" result is ... great! But that also means that the more complex dict is 6x slower (since 6 = 2x3). So what does this mean? Well, part of me wants to say that this is like the fractal "length of the coastline of England" -- the more complex the dictionary, the more of the fractal language coastline it can cover, more accurately. But that's just hand-wavey meaningless blather. It's still interesting that a more accurate dictionary might even result in exponentially slower parsing? Is that possible?
Yes. It is more subtle than it seems at first glance. I made the following tests.

Version 4.2.5:

So 2772 sentences fail immediately due to at least one word with 0 disjuncts. For the current version (5.2.1):

sed '/verbosity=/d' data/en/corpus-fixes.batch | link-parser -v=9 -debug=prune.c | grep -F -A1 'After pp_prune' | grep -cF '(0)'
But as you can see for yourself (if you change ...).

Another thing to consider is parsing with FAT links. All the complex (and parse-costly) conjunction links didn't exist in 4.2.5. I don't know much about FAT links, but it may be that the effective sentence length was smaller. It is better to compare to 4.8.6 (it is slower than 5.5.1 on the batch benchmarks even when each version uses its own dict).

As I noted in #250, the current aggressive unit split, which didn't exist in 4.2.5 and not even in 4.8.6, causes much slowness. As I also pointed out there, this aggressive split is actually mostly useless, because it is mostly not supported by the dict. The said aggressive split (that you once added) adds unit alternatives for many words, e.g. ...

Proposal: Turn off the multi-unit split (or add support for it, e.g. so that multiple units are handled as a single unit).
Several times I noticed a large slowdown due to what seemed to be small dict fixes.

ALSO: No regexes in 4.2.5!

Last thing: 5.5.1 using the 4.2.5 dict (after adding an empty ...
I fixed some typos in my previous post, so you may want to read it on the web (and not in email).
Using the current parser with the 4.2.5 dicts will not work with an empty regex; to get approximately backwards-compatible behavior, you need at least this:
grepping reveals this:
and
Yeah, maybe turning off aggressive unit stripping is a good idea. Or maybe making it into a flag. At one point, I was trying to parse biochem papers, and at that time, it seemed like a good idea.
More generally, if a sentence does not parse quickly and easily, what is the fallback strategy? Currently it's ad hoc:
Should we have unit stripping turned off by default, and then try aggressive unit stripping before trying to ignore some words? Maybe not a bad idea... Then an oldie but a goodie: what about newspaper headlines, which are short on determiners?
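For illustration only, a hypothetical sketch of that fallback order; none of these function names exist in link-grammar, they merely name the steps being proposed:

```c
#include <stdbool.h>

/* Hypothetical hooks -- these do not exist in link-grammar; they
 * only label the stages discussed above. */
typedef struct Sentence_s Sentence;                  /* opaque stand-in */
extern bool parse_plain(Sentence *);                 /* unit stripping off */
extern bool parse_with_unit_stripping(Sentence *);   /* aggressive unit split */
extern bool parse_with_null_words(Sentence *);       /* ignore some words */

/* Proposed order: cheap parse first, aggressive unit stripping
 * second, ignoring words only as the last resort. */
static bool parse_with_fallbacks(Sentence *s)
{
	if (parse_plain(s)) return true;
	if (parse_with_unit_stripping(s)) return true;
	return parse_with_null_words(s);
}
```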
I checked the overhead of multi-unit stripping again, and at least in the fixes batch it is now only a few percent (maybe due to the automatic subscript addition that was recently added). I guess that for very long sentences it can be more. But I'm for leaving the multi-split as is, and instead fixing the dict to use it. I thought of using it as a test case of "general idiom definitions".
And maybe also something like this, if the split is done even more aggressively:
(The
#402 is in a WIP stage. I based it on an implementation of sub-expression tags, where the tags in that case are used for per-dialect costs. (They can also be used for other expression manipulations, like the dict-capitalization.) We will also need to find out how to set the dialect if a sentence is not parsed. For example, how can we know that the reason is that it is short on determiners? An easy case may be if there are no determiners at all. But: ...

We can also always parse with a high disjunct_cost (say disjunct_cost=10) and implement a cost cutoff as postprocessing. The cost_max for this cutoff will usually still be the normal one (2.7), and the higher-scored sentences will be considered only if there are none <= cost_max. Such a strategy may also automatically identify the dialects (by looking back at which sub-expressions the chosen_disjuncts got generated from).
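A rough sketch of the cost-cutoff-as-postprocessing idea, assuming the linkages and their total disjunct costs are already available; the Linkage struct and function name here are illustrative, not the library API:

```c
#include <stddef.h>

typedef struct Linkage_s
{
	double cost;            /* total disjunct cost of this linkage */
	/* ... */
} Linkage;

#define NORMAL_COST_MAX 2.7  /* the usual cutoff; parsing itself would
                                run with a high cutoff, e.g. 10 */

/* Keep only linkages with cost <= NORMAL_COST_MAX; if none qualify,
 * keep everything (higher-cost parses are better than no parse).
 * Filtering is done in place; returns the new count. */
static size_t cost_cutoff_postprocess(Linkage *lk, size_t n)
{
	size_t kept = 0;
	for (size_t i = 0; i < n; i++)
	{
		if (lk[i].cost <= NORMAL_COST_MAX)
			lk[kept++] = lk[i];
	}
	return (kept > 0) ? kept : n;
}
```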
I like all these ideas. However, my intent is to never hand-write dictionaries again, and just automate everything. The automated data will have costs attached to everything, including morphology splits. Sorry I don't have dicts for you to play with; I keep getting distracted by other things.
Corpus batch benchmarks (10x10 runs):
en basic: 2+%
en fixes: 3+%
ru: ~0.5%
Individual runs of a small sample of very long sentences show a ~5% speedup.
Runs with -verbosity=2 show a big power_prune() speedup. However, power_prune() currently takes only a small fraction of the total CPU, so the total saving is not as big. The largest speedup is due to the shallow/deep connector ordering that allows an early termination of the loop in *_table_search(), which is implemented in the following commit: "prune.c: Matching shallow connectors: Stop earlier".
Some other changes didn't show a noticeable saving in their benchmarks, so they are not included here:
- pp_prune().
- power_table_new().
But now I may know better and can improve their implementation, so I may include them in a future PR.
Some other changes in this PR:
- power_prune() is now static.
- lc_easy_match() is defined as inline (I guess it got inlined anyway).

(I also have major exprune.c changes with a greater speedup impact, to be sent soon.)
BTW, here is a power_prune() speedup idea for a future check: Eliminate bool shallow in struct c_list_s, so that it will be exactly half a CPU cache line in size. This should benefit the *_table_search() loop.
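A tiny illustration of checking the entry size against the cache line; the field layout of c_list_s here is an assumption made for the sketch, not the actual prune.c definition:

```c
#include <stdio.h>

#define CACHE_LINE_SIZE 64   /* typical on x86-64; verify per target */

typedef struct Connector_s Connector;   /* opaque for this sketch */

/* Hypothetical slimmed-down table entry: the bool shallow flag is
 * gone, since shallow entries are known to come first in each list. */
typedef struct c_list_s C_list;
struct c_list_s
{
	Connector *c;
	C_list *next;
};

int main(void)
{
	/* Report how entries pack into cache lines; the goal stated
	 * above is an entry size that divides the line evenly, so no
	 * entry straddles a line boundary in the *_table_search() loop. */
	printf("sizeof(C_list) = %zu, entries per %d-byte line = %zu\n",
	       sizeof(C_list), CACHE_LINE_SIZE,
	       CACHE_LINE_SIZE / sizeof(C_list));
	return 0;
}
```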