spell-guessing mis-handles capitalized words #404
It doesn't even reach the spell-guess step. By design (following the original code), the lookup order of the tokenizer is as follows, and it stops at the first successful lookup (from memory):
This means that a word that matches a regex never gets a spell correction. Similar problems happen with these sentences: "therfor, I declare ...", "elefants are big". From tokenizer.c:
If the answer to the above is "yes", I can implement that. Also note that words that match a regex lose the opportunity to be handled by the unknown-word device! For handling such things we may need an additional kind of post-processing, plus maybe programmatically assigned costs. |
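For reference, here is a minimal sketch (from memory, with hypothetical helper names - not the actual tokenizer.c code) of the stop-at-first-success behavior described above:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real lookups; only the control flow matters here. */
static bool dict_lookup(const char *tok);
static bool regex_match(const char *tok);
static bool spell_guess(const char *tok);
static void use_unknown_word(const char *tok);

/* Stop-at-first-success: once a token matches a regex, the spell-guesser
 * and the unknown-word device are never consulted. */
static void lookup_token(const char *tok)
{
    if (dict_lookup(tok)) return;   /* 1. plain dictionary lookup       */
    if (regex_match(tok)) return;   /* 2. regex lookup (numbers, etc.)  */
    if (spell_guess(tok)) return;   /* 3. spell-guess alternatives      */
    use_unknown_word(tok);          /* 4. unknown-word device           */
}
```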
Hmm. If a word is capitalized and unknown, we should spell-guess the lower-case version. In other cases, we should not do so? The point is, I don't want to run roman numerals through the spell-guesser, since I already think I know what they are. Maybe in some ideal world, we should allow the dictionary to contain something like
and even assign costs:
so that the BAr+ connector gets priority, else the spell-checked version. Something like that. This would allow common typos to be automatically handled: e.g.
which would add "their" and "they're" as possibilities, but at a large cost. |
I think we should always spell-guess words that match a regex, but discard the linkages with the guesses if the guessed words are unlinked. If the original word (that matched a regex) is linked then
If you write an exact specification, I can implement that.
But this is not exactly a spell guess - I don't know of a spell-guesser that gives guesses for correctly spelled words. |
Post-processing needs to be avoided at all costs. Post-processing is a fail. I cannot write an exact spec just right now. My train of thought, though, is that everything we know about a word should be encodable in the dictionary -- I don't want to invent some new extra-dictionary format. So, the question becomes: how can I tell the tokenizer that capitalized words should be spell-guessed? The only obvious answer, for now, is to invent a new connector: say
However, having costs is a problem: the Gword_struct does not have a cost field in it, and that would need to be added. Then, during parsing, if this particular alternative is used, the cost of the alternative would have to be added to all of the connectors on it. There's an unrelated, nice reason for having costs on alternatives: some spelling alternatives could be given lower costs than others. That way, if we also had a list of word frequencies, we could make the cost be -log(frequency), so that among the alternatives for "teh", "the" would have a low cost, "tea" a higher cost, and "Tet" a very high cost. |
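A tiny sketch of that cost idea, with made-up frequencies (nothing here is existing library code):

```c
#include <math.h>
#include <stdio.h>

/* Sketch: turn a (hypothetical) relative word frequency into a cost,
 * so that likelier spelling alternatives get lower costs. */
static double alternative_cost(double relative_frequency)
{
    return -log(relative_frequency);   /* cost = -log(frequency) */
}

int main(void)
{
    /* Illustrative numbers only, not real corpus counts. */
    printf("the %.1f\n", alternative_cost(0.05));     /* ~3.0,  low cost       */
    printf("tea %.1f\n", alternative_cost(0.001));    /* ~6.9,  higher cost    */
    printf("Tet %.1f\n", alternative_cost(0.00001));  /* ~11.5, very high cost */
    return 0;
}
```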
For alternatives to correctly-spelled words ... not sure. We would need to add some sort of alternatives file. I don't see how to jam alternatives into the current dictionary format. So maybe we do need some kind of new file format, something that would contain entries like:
The last one is interesting, as it suggests that "all_of" is a valid alternative to "all", for example: "I ate all (of) the cookies.", which solves the zero-word/phantom-word problem from issue #224 |
One super-hacky way of indicating alternatives is to use word-subscripts. So, for example:
means that "yisser" is an acceptable Irish word, and it really means "your". If we used a hash-mark instead of a period, it could mean "alternative":
meaning that if you see "their" during tokenization, you add "there" as an alternative, so that, after parsing, "It is hot there" would be printed instead of "It is hot their." This is kind of fun, because it can be used as a crude language-translation mechanism: "yisser#your" tells you how to translate the strange Irish word "yisser" into standard English. |
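If such a hash-mark convention were adopted, splitting the subscript is trivial; here is a self-contained sketch (the function and the convention itself are only the proposal above, not existing code):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: split a "surface#replacement" token into its two halves. */
static void show_alternative(const char *entry)
{
    const char *hash = strchr(entry, '#');
    if (hash == NULL)
    {
        printf("%s (no alternative)\n", entry);
        return;
    }
    printf("surface \"%.*s\" -> alternative \"%s\"\n",
           (int)(hash - entry), entry, hash + 1);
}

int main(void)
{
    show_alternative("yisser#your");   /* Irish "yisser" really means "your" */
    show_alternative("their#there");   /* typo alternative                   */
    return 0;
}
```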
But I think that doing a kind of post-processing of sentences is a natural thing people do when they read text, especially problematic text (e.g. text with errors).
If you would like to encode more info, e.g. for Unification Link Grammar or context-sensitive parsing (which we have never discussed), it seems some different format is needed somewhere. To avoid introducing too many changes, I once suggested a two-step lookup, in which you first look up the token in a special dictionary and consult the link-grammar dictionary only to find the disjuncts.
By "during parsing", do you mean while performing do_count()/mk_parse_set()? |
I don't know if you considered my (tested) old suggestion (from the LG group). For example, suppose we add these:
Then we get:
The sentence-checker can then mark the original sentence word "then" as having a correction "than". This can work for the "all_of" example too, if a bug in the handling of idioms (that I mentioned in my said group post) is fixed. Note that this also works fine if a word has several possible corrections (as demonstrated by my example). The above method can be used even now. The "post-processing" (which in this case means displaying which word can be replaced by which words) can be done by the application that uses the library. (Maybe the library can eventually include a higher-level part - maybe even implemented as a separate library - for doing such things.)
This needs an additional mechanism beyond what you detail - when we add such a word as an alternative, we need to somehow indicate that it is not an original sentence word but an alternative. We can do it in several ways. In my suggestion above, it is encoded in the subscript. It can also be encoded in another way. Please indicate a preferred way to do that, and I will try to write a demo. |
Ah hah! Silly me, you are right! This works today, without any changes at all to the C code:
which gives
Cool! I did not realize that! The |
Re: post-processing: Let me explain it this way: if there is some optional "post-processing" utility that maybe tags words with some additional info, or does some other light-weight processing, that is fine. What I do NOT want is some complex heavy-weight mechanism that interacts with parsing, or does complex filtering, like the current post-processor.

Here's why: The abstract theory of link-grammar is very similar to, almost identical to, the concept of a "pregroup grammar" (see wikipedia) or a "categorial grammar" or a non-symmetric monoidal category (of tensor products), etc. The "of tensor products" means that one can attach a cost system that is distributive the same way that tensoring is -- you can pull out the cost as an overall factor. This concept is important, because it makes the costs look like a Markov model -- or like a hidden Markov model, where the links between words are what is "hidden", and the costs are the Markov chain weights.

To summarize: this way of viewing the parsing process makes it fit very naturally into a generic theory of Markov networks, as well as fitting into various other generic theories of grammar. So I very much want to preserve this naturalness (in the sense of the wikipedia article on "naturalness" as a "natural transformation" between categories). The current link-parser post-processor destroys naturalness. I don't want any other post-processors that destroy naturalness. Of course, it is possible to have a post-processor that is natural, so ... that would be OK. |
So: pull request #425 implements some basic typo support. Things that don't work (and can't work without wordgraph support): replacing "there" by "they're", replacing "all" by "all_of". |
There is a slight problem: it makes "bad sentences" parsable, with no way to switch this feature off (as is possible with spell guessing). A way to load sub-dictionaries on demand could solve that, but it seems overkill to implement it only for this. |
Well, instead of sub-dictionaries, the problem would be solved by "dialect support" (#402): turn off the "bad speling" dialect, and then these rules no longer apply. |
I would like to fix the "idiom problem", and since that fix touches code shared with the implementation of the above, I would also like to implement your proposal for spell-guess dictionary markers. |
Huh. So the proposal is that there are some "special" words, such as |
The Russian "idiom problem" is not so specific. The current tokenizing algorithm (like the original one) doesn't look up a word using more than one method - it stops after the first lookup that succeeds. For example, supposing the dict doesn't already include the word '22', then "he found 22 dollars" couldn't be parsed in the presence of "catch_22" in the dict. A possible fix for the problem mentioned above is to define a new function for this. In the case of a regex tag, I first thought there would be a need to change the code, but I think that all of that indicates a much deeper problem. |
Similar examples are when a word is subject to a spell correction but none of the suggested words can be linked, or when a word matches a regex but the result is not linked (there are more examples like that). In such cases it would be beneficial to use the unknown-word device, but the current algorithm doesn't use it if an earlier lookup succeeds. |
Ah! OK, yes, very good. Right, the a-priori, use-the-first-one-found approach is not sufficient. Yes, for any given input, we should try all three -- spell-guessing, regex, and standard lookup. The cost system is that thing which is supposed to help eliminate "failed tries" -- the failed tries are the ones with the highest cost. The cost system has both a-priori costs and parse costs. The a-priori costs are that regex and spell should have higher costs than hits in the (abridged) main dictionary. The parse costs are that some word combinations are just unlikely. Both are supposed to work together to find the most-likely interpretation. The "dialect" proposal is supposed to help fine-tune this, by allowing the relative weighting of the regex, spell, and other approaches to be adjusted in a simpler and more direct fashion. |
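A rough sketch of what "try all three" could look like, reusing the hypothetical helpers from the earlier sketch, with purely illustrative a-priori costs:

```c
#include <stdbool.h>

/* Hypothetical helpers again; here each one adds its alternatives to the
 * word-graph together with an a-priori cost, instead of winning outright. */
static bool dict_lookup(const char *tok, double apriori_cost);
static bool regex_match(const char *tok, double apriori_cost);
static bool spell_guess(const char *tok, double apriori_cost);
static void use_unknown_word(const char *tok);

/* "Try all three": every method contributes alternatives, and the parse-time
 * costs then pick the most likely interpretation.  The numbers are made up. */
static void lookup_token_all(const char *tok)
{
    bool any = false;
    any |= dict_lookup(tok, 0.0);   /* dictionary hits are cheapest   */
    any |= regex_match(tok, 1.0);   /* regex guesses cost a bit more  */
    any |= spell_guess(tok, 2.0);   /* spell guesses cost the most    */
    if (!any) use_unknown_word(tok);
}
```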
As part of handling word location (issue #420), I wanted the code to account for the "i" capitalization of the TR and AZ UTF-8 locales (lowercasing a capital dotted "İ"). Currently, when the code doesn't succeed in splitting a token, it lowercases it and tries again. There is no way in the current code (without extreme hacks) to mark the word with the WS_FIRSTUPPER flag, which is needed in order to later account for a possible token length change without extra overhead. So I removed this lowercasing. Because the lowercase version of the token is now in the wordgraph, this fixes the said spelling problem. However:
This I have not implemented yet. Possible solutions:
Please tell me what to implement. (In any case, it could be nice if the spell-guess cost were related to the edit distance of the fix, which we can compute; see the sketch after this comment.) Examples of the current unrestricted guess-corrections: this one adds a likely guess (the 3rd one).
This adds a really bad guess (maybe
In this case there is no spell-guess try at all:
This is because currently a lowercase token is not generated for capitalized tokens that are not in capitalized positions. Finally, a problem I found that we have not thought about: Nonsense guesses for all-capitalized tokens. For example:
Note that |
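Regarding the parenthetical remark above about relating the spell-guess cost to the edit distance of the fix, here is a minimal, ASCII-only sketch of such a computation (real code would have to count UTF-8 code points, and the cost scaling itself is left open):

```c
#include <string.h>

/* Sketch: plain Levenshtein edit distance between a token and a guess.
 * Assumes short tokens (< 64 bytes), which is fine for an illustration. */
static int edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int d[64][64];
    for (size_t i = 0; i <= la; i++) d[i][0] = (int)i;
    for (size_t j = 0; j <= lb; j++) d[0][j] = (int)j;
    for (size_t i = 1; i <= la; i++)
        for (size_t j = 1; j <= lb; j++)
        {
            int sub = d[i-1][j-1] + (a[i-1] != b[j-1]);
            int del = d[i-1][j] + 1;
            int ins = d[i][j-1] + 1;
            int min = sub < del ? sub : del;
            d[i][j] = min < ins ? min : ins;
        }
    return d[la][lb];
}

/* e.g. edit_distance("Therefo", "Therefore") == 2, so that guess could be
 * given a higher cost than a distance-1 correction. */
```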
Spell-guessing has many other problems (that I have encountered), some of which we have never discussed. |
For regex we already had such test code (the "parallel regex" feature), which I removed because it added a lot of false positives (maybe mostly false positives), especially for Russian. I can bring this code back. Maybe we could add a parse option for "more aggressive guesses"?
How? In particular, how can it work for spell-guesses? |
It turns out the "parallel regex" feature is not the desired feature here, as it referred to making a regex guess on words that can also be split, in preparation for cases in which the split words have no linkage. In any case, I already have a ready version that I can send as a PR. I said:
I forgot to mention: this is what I implemented for now. Anything else we discussed can be implemented as an extension. |
The concept of spell-guessing unknown capitalized words is still problematic, and the cost system doesn't usually help with that. Here is an example from corpus-voa.batch. Current GitHub (3a11762):
After the said change:
Because it now has a full linkage, "Shel[!]" no longer appears in the results. EDIT: Fixed the display of the original sentence. Fixed the truncated results. |
Linas, |
I just unearthed this in my email box. I'll try tomorrow; keep reminding me. |
Please see it on GitHub, because I edited it just after my posts, to fix errors. |
I read through this, and see no easy solution. I want to put this on the back burner for another few months or more.
Adding costs to alternatives will become very important, soon ... I expect that language learning will reveal that some splits are far more likely than others, and this should be indicated with costs. I also have a capitalization problem in the language learning; I don't yet know how to solve that. |
I would just like to comment that the current library usage gives "hidden" priority to linkages with no null words, by not showing the ones with null words at all if there is a linkage without them, thus removing the opportunity to give null words a relatively lower cost when desired. Sometimes (as here) a linkage with null words makes more sense.
Is there a statistical+logical way to infer that certain letters are likely "the same", e.g. A and a? (In Hebrew there is a similar problem, regarding certain letters at end of words.) |
Even with spell-guessing enabled, the sentence "Therefo, I declare I love Computer Science" is not fixed. However, the lower-case version of it works fine. I suggest that spell-guessing should down-case words before trying to correct them. (This is with aspell; I did not attempt Hunspell.)
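A minimal sketch of the suggested fix, calling the aspell C API directly on a down-cased copy of the token (ASCII-only tolower() for brevity; this bypasses the library's own spell layer and is only an illustration):

```c
#include <aspell.h>
#include <ctype.h>
#include <stdio.h>

/* Sketch: ask aspell for suggestions on the lower-cased token. */
static void suggest_downcased(AspellSpeller *speller, const char *word)
{
    char low[128];
    size_t i;
    for (i = 0; word[i] != '\0' && i < sizeof(low) - 1; i++)
        low[i] = (char)tolower((unsigned char)word[i]);
    low[i] = '\0';

    const AspellWordList *wl = aspell_speller_suggest(speller, low, -1);
    if (wl == NULL) return;
    AspellStringEnumeration *els = aspell_word_list_elements(wl);
    const char *guess;
    while ((guess = aspell_string_enumeration_next(els)) != NULL)
        printf("%s -> %s\n", word, guess);
    delete_aspell_string_enumeration(els);
}

int main(void)
{
    AspellConfig *config = new_aspell_config();
    aspell_config_replace(config, "lang", "en_US");
    AspellCanHaveError *err = new_aspell_speller(config);
    if (aspell_error_number(err) != 0) return 1;
    AspellSpeller *speller = to_aspell_speller(err);

    suggest_downcased(speller, "Therefo");   /* the example from this issue */

    delete_aspell_speller(speller);
    delete_aspell_config(config);
    return 0;
}
```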