Recent 5x multiplication of verb disjuncts #1072
Comments
Yeah. I'm guessing that this is due to the recent addition of
So there's a combinatoric explosion, and there's no particularly obvious way to limit it. That's one reason why machine learning is interesting. It might be interesting to parse a gazillion sentences and see which disjuncts never show up. But how to remove them is unclear: although some disjunct might never show up for one word, it might show up for another similar one, and then what? It's a morass. |
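A minimal sketch of the "see which disjuncts never show up" bookkeeping, in Python. Everything here is an assumption for illustration: the input shapes, the function name `unused_disjuncts`, and the sample words and disjunct strings are hypothetical, and nothing below calls the link-grammar library.

```python
from collections import defaultdict

def unused_disjuncts(dict_disjuncts, observed):
    """Return, per word, the dictionary disjuncts never seen in any parse.

    dict_disjuncts: mapping word -> set of disjunct strings (hypothetical input)
    observed: iterable of (word, disjunct) pairs collected from parses
    """
    seen = defaultdict(set)
    for word, disjunct in observed:
        seen[word].add(disjunct)
    # Set difference per word: what the dictionary offers minus what was used.
    return {word: djs - seen[word] for word, djs in dict_disjuncts.items()}

# Hypothetical data: two similar words sharing the same dictionary entry.
dict_disjuncts = {
    "ran": {"Ss-", "Ss- & MVp+"},
    "walked": {"Ss-", "Ss- & MVp+"},
}
observed = [("ran", "Ss-"), ("walked", "Ss- & MVp+")]
unused = unused_disjuncts(dict_disjuncts, observed)
```

In this toy data each disjunct is unused for one word yet used by its sibling, which is exactly the case where pruning the shared dictionary entry is not obviously safe.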
This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster... |
I produced the disjuncts of all the unique dict expressions of words.
Examples:
There are ~700K jets with 2-3 duplicate sequential connectors, like:
Maybe some macros are included more than once? The ones with duplicate sequential multi-connectors (like |
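For reference, a rough Python sketch of how jets with duplicate sequential connectors could be flagged. It assumes a jet is given as a plain list of connector strings; the helper names and the sample connectors are made up for illustration and are not taken from the dictionary or the library.

```python
from itertools import groupby

def longest_repeat(jet):
    """Length of the longest run of identical adjacent connectors in a jet."""
    return max((sum(1 for _ in run) for _, run in groupby(jet)), default=0)

def has_duplicate_sequential(jet):
    """True if some connector appears at least twice in a row."""
    return longest_repeat(jet) >= 2

# Hypothetical jets; "@" marks a multi-connector.
jets = [
    ["@MV+", "@MV+", "O+"],
    ["A+", "B+", "A+"],   # same connector twice, but not adjacent
    [],
]
flagged = [jet for jet in jets if has_duplicate_sequential(jet)]
```

Only adjacent repeats are counted, so a connector that reappears later in the same jet is not flagged.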
Maybe there is a need to put more effort into pruning. I thought of more complex algorithms that could be enabled on long sentences only (they are relatively time consuming, because the processing shortcuts I used cannot be applied in them). I also have speedup demos that I still need to convert to production code.
Yes. And pseudo-cross-links will add even more overhead (I know because I have already mostly implemented that, and I am now trying to reduce the overhead). BTW, I called them fxlink in the code (from fake-cross-link), but maybe another term is more appropriate (e.g. pxlink for pseudo-cross-link). |
The multiple The complexity of "is, was, were" is not surprising... The duplicated |
I fixed the duplicated wall connectors in |
The duplicated |
In a WIP I have the following mode of printing expressions (can use a flag).
The extra parens (also with regular expression printing) seem to be an artifact that I need to track down and eliminate. |
In the above, the parts that come before, between, and after the macros need to be folded with indentation too, so it will be clear what the content of each macro is. |
Yes, that could be useful. Even more useful would be an inverse-lookup: given a disjunct, where did it come from? I'm often fumbling trying to figure this out, to see if it came from |
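A toy sketch of the inverse-lookup idea, in Python. This is not the library's expression representation: the nested-tuple format, the `expand` and `origins_of` helpers, and the macro names `<macro-a>` and `<macro-b>` are all hypothetical. The point is only that if every connector carries its macro path during expansion, then "where did this disjunct come from" becomes a simple filter.

```python
from itertools import product

def expand(expr, origin=()):
    """Expand an expression into disjuncts, tagging each connector with its macro path.

    expr is a connector string, or a tuple of one of these (assumed) forms:
      ("or", e1, e2, ...), ("and", e1, e2, ...), ("macro", name, body)
    Returns a list of disjuncts; each disjunct is a tuple of (connector, origin) pairs.
    """
    if isinstance(expr, str):
        return [((expr, origin),)]
    op = expr[0]
    if op == "macro":
        _, name, body = expr
        return expand(body, origin + (name,))
    parts = [expand(e, origin) for e in expr[1:]]
    if op == "or":
        # Alternatives: the union of the sub-expansions.
        return [dj for part in parts for dj in part]
    if op == "and":
        # Conjunction: concatenate one disjunct from each part.
        return [sum(combo, ()) for combo in product(*parts)]
    raise ValueError(f"unknown operator: {op!r}")

def origins_of(expr, disjunct):
    """Macro paths of every expansion whose connectors equal `disjunct`."""
    return [
        tuple(o for _, o in dj)
        for dj in expand(expr)
        if tuple(c for c, _ in dj) == disjunct
    ]

# Hypothetical word expression built from two macros.
expr = ("and",
        ("macro", "<macro-a>", ("or", "Ss-", "Sp-")),
        ("macro", "<macro-b>", ("or", "MVp+", "O+")))
where = origins_of(expr, ("Ss-", "O+"))
```

Full expansion is exactly what blows up combinatorially, so a real implementation would presumably need to avoid materializing every disjunct; this only illustrates the tagging.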
This can be easily implemented... I will add this. I need good syntax for expression display flags. There is also a problem of displaying tens of thousands of disjuncts. An internal paginator (like in |
Most of that, and more, has been implemented by now (PR #1085; the discussion continues at PR #1083). Regarding the duplicate multi-disjuncts, I said above:
I tried to eliminate the duplicate copies but the speedup was unnoticeable. So I didn't include that in any PR. |
There is a bug here regarding the link to
|
My general impression is that there should never be any disjuncts longer than about 6 connectors in the English dict. I don't really know, and I have never looked at the statistics of disjunct length versus frequency. If those stats were available, then we could probably cut anything longer than 6 (or whatever the "real world" max is). |
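A sketch of how the length-versus-frequency statistics could be tallied, assuming usage counts per disjunct are already available from parsing a corpus. The input format (a mapping from a tuple of connectors to a count), the function names, and the sample numbers are invented for illustration.

```python
from collections import Counter

def length_histogram(disjunct_counts):
    """Map connector count -> total observed frequency.

    disjunct_counts: mapping from a disjunct (tuple of connector strings)
    to the number of times it was used in parses (hypothetical input).
    """
    hist = Counter()
    for disjunct, n in disjunct_counts.items():
        hist[len(disjunct)] += n
    return hist

def coverage_cutoff(hist, coverage):
    """Smallest length L such that disjuncts of length <= L reach `coverage` of all usage."""
    total = sum(hist.values())
    running = 0
    for length in sorted(hist):
        running += hist[length]
        if running >= coverage * total:
            return length
    return None  # empty histogram

# Made-up counts, only to exercise the functions.
counts = {
    ("Ss-",): 90,
    ("Ss-", "O+"): 9,
    ("Ss-", "O+", "MVp+", "MVp+"): 1,
}
hist = length_histogram(counts)
cutoff = coverage_cutoff(hist, 0.95)
```

With real counts, a cutoff like this would give the "real world" maximum for a chosen coverage level; whether cutting everything above it is safe is a separate question.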
I looked, and I think I reported it elsewhere, but I cannot find it right now. In the It may be interesting to remove them and propagate the result back to the expressions, to see which macros are affected and whether any of them become redundant. I may try to do it when I return to this disjuncts-to-expression feedback code. |
From 5.7.0:
Now (248a613):
This also seems to cause a noticeable speed regression.