Capitalization is a kind-of pseudo-morphology #690
I once posted a proposal to handle capitalization as a special kind of morpheme: Note that the current tokenizer already implements that (untested, as the dict part of this idea has never been implemented). This seems to me a simple and natural way to do it, still using a pure link-grammar technique.
The system finds these words are very similar, but with a feature that is different. I propose to encode this differing feature as a morpheme. All the capitalization rules can then (in principle) be inferred automatically as link-grammar rules (and until they are inferred automatically, they can be straightforwardly hand-coded).
What I essentially propose is to look at substrings of words as only a special case of morphemes. BTW, in Hebrew, morphemes are not just substrings of words. Morphemes that are "templates" are common (and they are really called "morphemes"). So I didn't invent something too new in my proposal to refer to capitalization (and to whether a word starts with a vowel, for a/an) as morphemes. |
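To make the proposal above concrete, here is a minimal sketch (my own illustration, not the actual LG tokenizer code) of treating capitalization as a pseudo-morpheme: a capitalized token gets an extra tokenization alternative consisting of a marker token plus the downcased word, and the dict then decides which alternative links up. The marker name `1stCAP` is just a placeholder.

```python
# Toy sketch: treat capitalization as a pseudo-morpheme by emitting an extra
# tokenization alternative in which a hypothetical marker token "1stCAP"
# precedes the downcased form of the word.

def capitalization_alternatives(token):
    """Return the tokenization alternatives for one input token."""
    alternatives = [[token]]                      # the word exactly as written
    if token[:1].isupper():
        alternatives.append(["1stCAP", token[0].lower() + token[1:]])
    return alternatives

for alt in capitalization_alternatives("This"):
    print(alt)
# prints ['This'] and then ['1stCAP', 'this']; the parser would pick
# whichever alternative actually has dictionary support.
```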
Letter case is a special case of a more general problem. |
I'm finally trying to implement handling of capitalized words by the dict. Possible solution: |
I implemented the main part of the pseudo-morphology capitalization.
I also made the needed changes. The easiest part was adding support for that in the dict.
Since the capitalized word is downcased if it is in the dict, it appears downcased even if it is a common-entity. This means that the code will need to handle restoring capitalization (like the old code). However, the real (very) hard work has still not been done:
Since these 3 tasks are really hard to implement, I first thought to combine the disjuncts of the feature token. Current linkage result examples:
|
I like the CAP-MARK token appearing in the parse tree. That is certainly an interesting idea! But is it superfluous? In almost all cases, if a word attaches with Wd, then it should have been capitalized... (or already was capitalized). Linking CAP-MARK to the left-wall seems wrong; it really should be linking to the word that is (should be) capitalized. Or better yet, link to both: to the wall, and also to the word that is being capitalized, with both linkages being mandatory, not optional, thus constraining the parses. That makes it similar to the PH link, which indicates a phonetic constraint. |
This seems wrong:
because it violates your earlier guess about French: the lower-case version of
with 'c' being a possible flag. The 'f' and the 'l' flags would not be needed. If the |
Allowing two links between words -- I'm pretty sure this would massively break the existing dicts. Issues 1 and 2 for null words -- I'm not sure they need to be solved. If it ends up being a null word, that just means that someone capitalized a word that they should not have capitalized. |
I tried to always use the same rule: Provide a CW+ connector on a token after which there is a capitalizable position. But my current implementation is indeed problematic.
Of course you are right. I just had a problem figuring out how to do it. One of the ways to start to solve that is having something like: Why I think such disjunct rewrites do not contradict the current LG theory: |
I don't understand how you can do it without an indication like 'f'.
I cannot see how. I wrote there:
So the algorithm is to use a TOLOWER definition if one exists, such as
I am not sure I understand what would be this
(And then there is no tolower() in the code, and an alternative would be generated on white space or end of replacement part.) |
It will be interesting to see where and how... I always thought multiple links should be allowed. |
I'm trying to look at this issue as a special case or demo for issue #700. Thus,
becomes viable for all sorts of tokenization problems, and not just for capitalization. So, yes, there would be a few dozen rules of the form
and no
which does nothing except insert whitespace between the number and the run-on unit. Then with the power to insert spaces (or insert other text), the need for the |
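For the run-on unit case mentioned above, here is a tiny sketch of what such an "insert whitespace" rule amounts to (the regex and the unit list are illustrative only, not the actual affix-file mechanism):

```python
import re

# Split a run-on number+unit token such as "10kg" into "10 kg".
RUNON_UNIT = re.compile(r"^(\d+(?:\.\d+)?)(kg|km|cm|mm|g|m|s|lb|oz)$")

def split_runon_unit(token):
    m = RUNON_UNIT.match(token)
    return [m.group(1), m.group(2)] if m else [token]

print(split_runon_unit("10kg"))   # ['10', 'kg']
print(split_runon_unit("dog"))    # ['dog']
```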
For splitting units and any other continuous morphology I proposed a better way: defining (in the dict) token boundaries (which side of a token must have whitespace), either by marks on the tokens (similar to the "infixmarks" today) or by link types. It is not hard to write a tokenizer that will use only this info. |
Yes. I'm not sure how. Ideally, one could say something like "if regex In essence, I'm thinking that there is some better way of doing what is done by This is a non-trivial problem: the current Russian dictionaries do seem to be fairly direct and well-structured; I don't want to lose the current simplicity, there. What we currently do is not bad. Yet somehow, the current tokenization mechanism seems inadequate for more complex morphology. In the back of my mind, I'm even thinking of speech-to-text and other discernment problems. There are other possibilities: for example, instead of a regex with programmatically-added connectors, imagine a neural net component, which took in sentences and spat out tokens and connectors. Viz, it would do exactly the same thing as a regex+program-connectors component, except that maybe it assigns likelihood values to each possible tokenization. The only reason I'm thinking "neural net" is that it would be trainable -- I do NOT want to hand-write big, complex regex+prog-connector dictionaries. Or perhaps there is a way to auto-convert the NN to a regex+connector dictionary.... |
Where is this? One of the other issues? I'm sorry, I'm having trouble reading and responding to everything. |
It can be done in this way or in several other ways.
INFIXMARK can be replaced by a link type (or can remain),
Because I got the impression that you would not like to introduce changes to the way LG is currently done, including not to the format of the dictionaries, all my latest proposals try to be compatible with the way things are currently done; they are only additions, in order to encode in the dictionary things that are now hard-coded (which has the benefit of better supporting English, and also of supporting more languages). For handling any arbitrary morphology, I proposed a different token-lookup mechanism (but still using a dictionary syntax like the current one). It is like input preprocessing, but doing it internally gives an easy ability to generate alternatives.
There is indeed much to discuss on automatic language learning, and I have several questions regarding that. I will try to find the correct place to raise them -- they are too technical/specific, and the opencog discussion group is not the proper place for them. |
In the LG group (the "zero knowledge tokenizer"), in other issues here, and even in the code itself (the "regex tokenizer" demo). I will try to summarize here. The algo starts with the first letter, and looks it up as The regex-tokenizer demo uses the PCRE engine to do this backtracking (however, I also wrote C code to do it directly). It is able to tokenize Russian, and is even able to tokenize sentences without white space between some or all of their words, if it is not directed to look for white space at word boundaries. BTW, for the current dicts it is possible to infer the infix marks for all the tokens in LPUNC/MPUNC/RPUNC, so the above-mentioned algo also does punctuation stripping. |
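A rough sketch of that backtracking lookup as I understand it (simplified to a plain recursion over a flat token set, rather than the PCRE-based regex-tokenizer demo or its C counterpart):

```python
def segmentations(s, dict_tokens):
    """Yield every way to split s into tokens that are all in dict_tokens,
    growing a candidate token letter by letter and backtracking on dead ends."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        if prefix in dict_tokens:
            for rest in segmentations(s[end:], dict_tokens):
                yield [prefix] + rest

toy_dict = {"the", "there", "re", "cat"}
for seg in segmentations("therecat", toy_dict):
    print(seg)
# ['the', 're', 'cat'] and ['there', 'cat'] -- both become word-graph
# alternatives; the parser decides which one actually links.
```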
Note this fix in my previous post. |
We can make changes in the dictionary format. I'm mostly being conservative, because I want to avoid ending up with a big confusing mess of non-orthogonal functions that have an incomplete or ugly theoretical underpinning. If something is a good idea, we should do it. I just don't want a horror show at the end of it.
I didn't quite understand the details of what you said there. I think I have the general flavor of it, just not the details.
Yes, that was also the intent of the regex rewrite; it was a mechanism flexible enough to do this.
Yes, this seems quite reasonable.
Sorry, there was a lot of confusion in my mind back then. Perhaps slightly less, now. There was no clear concept of the word-graph, back then; now that there is, the conceptual foundation is stronger; it allows us to design better pre-processors. The regex idea mentioned here is a kind-of warming up to the finite-state transducer morphology analyzers. I'm thinking of the tokenizer as some kind of "transducer" from flat strings, into word-graphs. The primary questions are "what is the best syntax for encoding what that transducer does?" and "what is the best way of describing the algorithm that the transducer uses?" Viz, currently, the tokenizer is a semi ad-hoc algorithm, driven by
So do I. |
|
I will try to construct a particular example. Regarding tokenizing, I didn't understand why an FST is any better than tokenizing by simple table lookup. |
I am using this term very loosely, as a synonym for a regex So, now, I am trying to think of the transformation of an input string into a token flow-graph (I think maybe that is what we should call it, instead of word-graph -- it shows how an input string can be transformed into several different, sequential flows of tokens). So, your word-graph tokenizer is actually a "transducer" from a character string to a flow-graph. Along the way, you insert whitespace, change spelling, downcase, and transform the input in other, perhaps more radical ways. What is the correct way to specify the operation of this transducer? Currently we use |
It will be interesting to actually test it. Indeed, it seems to me that tokenizing Hebrew only by token-end marks would generate extremely many total-nonsense alternatives, since using only word-end info the tokenizer doesn't know that some tokens can come only at the start of a word, etc., and only in a certain order. So this is certainly non-optimal. A tokenizer that follows the links implied by the dict can solve that. The idea is to make a trie-like structure, when reading the dict, according to connectors, then follow it during tokenizing. This way only sequences that potentially have dict support will get emitted. But this is good only for continuous morphology (Hebrew has a continuous-morphology-like approach that can be used, but other non-continuous-morphology languages may not have that).
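As a miniature of that connector-constrained idea (not the trie itself; the dict entries and connector names below are invented for illustration): a prefix is chained to a stem only when its right connector can match the stem's left connector, so nonsense sequences are never emitted in the first place.

```python
# Hypothetical simplified entries: token -> (left connector, right connector).
MORPHEMES = {
    "pre=":   (None, "LL"),   # a prefix that must link rightward via LL
    "walk.=": ("LL", None),   # a stem that must link leftward via LL
    "walk":   (None, None),   # the same stem as a stand-alone word
}

def strip_marks(tok):
    return tok.replace("=", "").replace(".", "")

def supported_splits(word):
    """Emit only splits whose adjacent connectors can actually link."""
    out = [[t] for t, (l, r) in MORPHEMES.items()
           if l is None and r is None and strip_marks(t) == word]
    for pre, (pl, pr) in MORPHEMES.items():
        for stem, (sl, sr) in MORPHEMES.items():
            if pr is not None and pr == sl and strip_marks(pre) + strip_marks(stem) == word:
                out.append([pre, stem])
    return out

print(supported_splits("prewalk"))  # [['pre=', 'walk.=']]
print(supported_splits("walk"))     # [['walk']]
```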
It (like the current tokenizer - using code for that) can emit a token indicating capitalization and then the lowercased word.
Same as above, it can emit special tokens (say for phonology also).
It can, but the current tokenizer can too.
Say you have a word
Then it emits these 2 alternatives
It can be just substrings of the word. But I originally thought to tokenize this way into tokens which represent grammatical features of the looked up word, like:
and build an LG dict based on that (there are more details about such an implementation, but it is not the point here). I only had some initial sketches on paper, so I don't really know if this is possible (e.g. no crossed links) and which changes are needed to make it possible/easier (I guess that an ability to edit connectors/disjuncts would help). I think you can find such tables for many languages. For complex languages they may be very big. |
I said:
I guess this is a direct response to the above:
The problem is that the Wd from the LEFT-WALL often does not go to the first word, e.g.:
So I cannot see how to use it to enforce capitalization. What I am missing are rules for "disjunct calculus", which would tell how to create new disjuncts and "edit" existing ones in various situations. Existing examples:
In our case here: So when a certain word has feature X, it should have a modified set of disjuncts. I will try to code something and learn more from it. |
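As a toy example of the kind of "disjunct calculus" rewrite being discussed (my own sketch; the connector name CAPw- is invented purely for illustration): when a token carries the capitalization feature, every disjunct containing a Wd- connector gets an extra mandatory connector that must link to the capitalization marker.

```python
def add_cap_constraint(disjuncts, feature_connector="CAPw-"):
    """Rewrite disjuncts so that Wd- attachments also require the cap marker."""
    rewritten = []
    for disjunct in disjuncts:                 # a disjunct = tuple of connectors
        if "Wd-" in disjunct:
            rewritten.append((feature_connector,) + disjunct)
        else:
            rewritten.append(disjunct)
    return rewritten

word_disjuncts = [("Wd-", "Ss+"), ("Os-",)]
print(add_cap_constraint(word_disjuncts))
# [('CAPw-', 'Wd-', 'Ss+'), ('Os-',)]
```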
Such an error will then mask possible single-null-word linkages of the sentence (there will be no way to get them), and it seems to me this reduces the application usefulness of LG. One way to solve that, and similar existing problems, is to count null links differently: |
I intend to test again the idea of tokenizing by LG (i.e. use the dict definitions as a tokenizer by automatically converting My previous test tried to tokenize whole sentences in a single parsing operation. This has the benefit of being able to overcome missing word boundaries (or allow for spaces within words). But it turned out to be extremely slow. I also didn't know what to do with multiple tokenization possibilities per sentence, as there was no wordgraph then (the only option was to parse each tokenization result separately, which would be extra-slow by itself). But now the library is faster by a large factor, there is an ability to specify LENGTH_LIMIT_1 connectors (which is the length of all the "letter-LG" connectors), and I know some more relevant things that I didn't know then; I would want to try to tokenize each word separately and not all the words at once. |
I said.
I suspect that for languages with rich morphology like Turkish, or with compound words like German, a naive table-driven morphology analysis would not work (the table would be too big). (Even for Hebrew, as I mentioned, there is a need to break words into at least two parts and make a lookup in more than one table.) But "letter-LG" may work for all. Stay tuned... |
Or is it a token-flow graph? |
"the token flow graph" or simply "the flow graph" with-or-without hyphens. So "flow-graph" - its not just some graph (cause there's tons of those in comp-sci/linguistics/machine learning) but indicates time-like ordering. A "token-flow graph" because its tokens that are flowing. |
So what should be done with that new term? |
Yeah, maybe. This risks turning into a full-featured parser, if you're not careful. And if you do it only shallowly, then it might not constrain things sufficiently. But it's worth a try, I suppose. I don't understand the comment about "continuous morphology". |
Hang on. I do not believe that some collection of scholars hand-created and hand-maintain a 50MB table. That table is being generated by some algorithm. The algorithm generating the table should then be the tokenizer. So, e.g. for French, there is some small number of irregular words, for which a table lookup is sufficient to tokenize, and then some 3 or 5 classes of very regular verbs which can be split algorithmically. |
Replace all occurrences of "word graph" by "flow graph". This is not urgent. And maybe we should sleep on it a bit longer. |
It's a plausible idea, but unlikely to work well. The French wiktionary contains this kind of data; I doubt it's enough. Also, it's not terribly hard to import data from other parsers for other languages; I suspect the quality will be mediocre. |
I want to draw a line here. This is an interesting idea, but just not here, not now. Some of this calculus has to happen "after the parse", as it were. Some before the parse. It's a potentially bottomless pit, and we risk over-engineering and trying to design something before the principles are clear. By contrast, the sentence-to-token-flow-graph conversion is becoming increasingly clear; we need to make it a bit more robust, a bit more general, but not go crazy on it. It seems like some amount of minor disjunct generation would be useful, but I want to keep it minor, until the dust settles a bit. Go ahead and think about disjunct calculus, if you wish, but don't create any complex designs, until we have more practical experience with it, and what would be needed from it. |
I don't care. If null-word linkages are required, then we know that either (a) it's a spoken sentence, where we know our grammar is bad, or (b) the dict is just handling the sentence incorrectly (viz, we know our grammar is bad). The correct long-term solution to null-word linkages is to have a much, much better, more accurate model of connector costs; if we had that, the null-word linkages become much clearer, and, in a sense, irrelevant. To be clear, to me, it seems like trying to solve this problem now is not only not worth the cost, but may also have a negative impact on the system: it adds significant complexity, and, worse, that extra complexity might be theoretically wrong. |
In that case you cannot extract unique substrings that represent morphemes. |
Of course it is generated by a program. It is a big and slow Java program (GPL'ed) that of course has its own big data table (all GPL'ed). I guess a big table is used when a slow program doesn't fit the task. BTW, regarding verb forms, in Hebrew there are 10+ families of verb roots, each of which has a potential 7 forms called "binyanim". And each of them is (quoting Wikipedia) "... conjugated to reflect their tense and mood, as well as to agree with their subjects in gender, number, and person." But not only verbs have inflections; regular words can have them too. For example (the ORs are different readings of the same written letters, with different pronunciations): Each word may have up to about 1000 prefix combinations (most of them are of course rare in actual use). Of course, in order to analyze such words you need to have the full list of base words in Hebrew (and for each noun, its gender and its plural type -- in general these cannot be derived from the words), the full list of rules (a big one by itself), and the full list of exceptions. The program |
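A rough illustration (my own toy slot model, not how the actual Hebrew analyzer works) of why the prefix-combination count blows up: stacking just a handful of one-letter prefixes in their allowed order already multiplies out quickly.

```python
from itertools import product

# Very simplified stacking slots: conjunction, relativizer, preposition, article.
SLOTS = [
    ["", "ו"],                 # ve- "and"
    ["", "ש"],                 # she- "that"
    ["", "ב", "ל", "מ", "כ"],  # be-/le-/mi-/ke- prepositions
    ["", "ה"],                 # ha- "the"
]

prefixes = {"".join(parts) for parts in product(*SLOTS)}
print(len(prefixes))           # 40 combinations even for this tiny slot model
```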
The quality of such data that is available for Hebrew is considered to be rather excellent (but as I said above, under GPL or AGPL, depending on its source). |
Do I understand correctly that bad linkages will then be emitted, but marked with a higher cost? An additional question: how can a per-sentence cost solve problems like the one in issue #404? |
The idea is indeed to start with something minor enough and get a practical experience with it. |
Well, I guess it is a complex program, and not something that would be easily re-written in C/C++. Presumably, it algorithmically encodes many special cases. I guess having a 50MB table is OK, but then we have a packaging/distribution problem.
Yes. Chinese has a fairly simple grammar, but is written without any whitespace at all. Each word is either 1, 2 or 3 hanzi long, and humans can just automatically see the word boundaries. There are word-boundary splitters, but they are not terribly accurate. I was thinking of just treating each separate hanzi glyph as a distinct pseudo-morpheme, and letting the parser figure out what the word segmentation should be, during parsing. Anyway, this is an example of a language where one might want to have not just one but maybe two tokenizers. It's also an example where two different languages, e.g. Mandarin and Cantonese, might want to use the same tokenizer. |
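A sketch of what "letting the parser figure out the segmentation" would mean on the tokenizer side (the hanzi string is just an example; a real dict would prune most of these): enumerate every split into chunks of 1-3 glyphs and hand them all over as alternatives.

```python
def hanzi_segmentations(s, max_len=3):
    """Yield every split of s into chunks of 1..max_len glyphs."""
    if not s:
        yield []
        return
    for n in range(1, min(max_len, len(s)) + 1):
        for rest in hanzi_segmentations(s[n:], max_len):
            yield [s[:n]] + rest

for seg in hanzi_segmentations("我爱你"):   # "I love you"
    print(seg)
# 4 alternatives for a 3-glyph string; the dict decides which chunks are words.
```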
(Shouldn't this discussion be in issue #700 instead of here?)
It is a text file, so its size in a
Even now, the tokenizer includes special code for issuing alternatives for the Hebrew word prefix.
Do you know the approximate number of words in which each glyph can participate (order of magnitude)? |
Probably, but it's too late for that.
We can't. In principle, there is no difference between "ungrammatical" and "less probable". In a sense, there is no such thing as "ungrammatical"; there is only "someone who does not speak a language very well" or "who doesn't understand a language very well". What we call "language" is actually a social construct; what LG tries to do is to capture some "typical" way in which some "typical" person might speak; however, much like a human, it's got quirks and failings, and this will continue to be the case "forever" (as LG will never become a social construct; it will always remain an "individual", just another user of language). |
Maybe; but currently, having costs on connectors seems to be sufficient to solve all ranking problems. |
No -- it is almost surely a Zipfian, with a slope of 1.01 -- viz the most common glyph will appear in 10K words, the 10th-most-common will be in 1K words, the 100th-most-common will be in 100 words, sliding downhill from there. |
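A back-of-the-envelope check of that Zipfian estimate (assuming frequency proportional to 1/rank^1.01 and the top glyph appearing in 10K words):

```python
def zipf_word_count(rank, top_count=10_000, slope=1.01):
    return top_count / rank ** slope

for rank in (1, 10, 100, 1000):
    print(rank, round(zipf_word_count(rank)))
# approximately 10000, 977, 95, 9 -- the roughly tenfold fall-off per decade
```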
At this time, we could distribute a file containing lists of words, which LG would read. However, I would rather not distribute a file containing word features; instead, we should create scripts to convert that into an ordinary LG dict. This is somewhat analogous to what is currently done for Russian: the |
Currently, instead of using a ranking system, the code just a-priori ignores possibilities (in a hard-coded way) that tend to give bad results, but sometimes might give the only good result. Issue #404 refers to such a case, in which spell-guessing is never done for words that match the capitalization regexes. I added there several other similar situations. I don't have any idea how a sentence-global ranking can solve such problems. |
This is a meta-issue, design-change request, to treat capitalization (and possibly other things) as a kind-of pseudo-morphology. See issue #42 for context. The general issue is about reformulating tokenization (and related issues) into a collection of rules (that could be encoded in a file). The example is capitalization.
For example: capitalization: we have a rule (coded in C) that if a word is at the beginning of a sentence, then we should search for a lower-case version of it. ... or if the word is after a semicolon, then we should search for a lower-case version of it. ... or if the word is after a quote, then we should search for a lower-case version of it. (
Lincoln said, "Four-score and seven..."
) There is an obvious solution: design a new file, and place semi-colons, quotes and LEFT-WORD as markers for capitalization. Maybe this is kind-of-like a "new kind of affix rule"?? All of the other affix rules state "if there is a certain sequence, then insert whitespace", while this new rule is "if there is a certain sequence, then look for downcase". In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF8 strings that happen to fall into the same grammatical class.
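A sketch of what moving that hard-coded C rule into data could look like (the trigger table and function below are purely illustrative, not the proposed file format): a small table of "capitalizable position" markers that the tokenizer consults to decide whether to also try the downcased form of the next word.

```python
CAPITALIZABLE_AFTER = {None, ";", ":", '"', "“"}   # None = start of sentence

def lookup_forms(word, prev_token):
    """Return the dictionary lookups to attempt for this word."""
    forms = [word]
    if prev_token in CAPITALIZABLE_AFTER and word[:1].isupper():
        forms.append(word[0].lower() + word[1:])
    return forms

print(lookup_forms("Four-score", '"'))   # ['Four-score', 'four-score']
print(lookup_forms("Lincoln", None))     # ['Lincoln', 'lincoln']
print(lookup_forms("Lincoln", "said"))   # ['Lincoln']
```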
To "solve" this problem, one can imagine three steps. First, a "morphological" analysis: given a certain grammatical class, compare pairs to strings to see if they have a common substring - for example, if the whole string matches, except for the first letter. This would imply that some words have a "morphology", where the first letter can be either one of two, while the rest of the word is the same.
The second step is to realize that there is a meta-morphology-rule, which states that there are many words, all of which have the property that they can begin with either one of two different initial letters. The correct choice of the initial letter depends on whether the preceding token was a semicolon, a quote, or the left-wall.
The third step is to realize that the meta-morphology-rule can be factored into approximately 26 different classes. That is, in principle, there are 52-squared/2=1352 possible sets containing two (initial) letters. Of these, only 26 are seen: {A, a}, {B, b}, {C, c} ....and one never ever sees {P, Q} or {Z, a}.
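A toy version of the first step above (the word class is made up for illustration): within one grammatical class, find string pairs that are identical except for the case of the first letter.

```python
from itertools import combinations

word_class = ["This", "this", "That", "cat", "Cat", "dog"]

def first_letter_case_pairs(words):
    pairs = []
    for a, b in combinations(words, 2):
        if a != b and a[1:] == b[1:] and a[0].lower() == b[0].lower():
            pairs.append((a, b))
    return pairs

print(first_letter_case_pairs(word_class))
# [('This', 'this'), ('cat', 'Cat')] -- evidence that {T, t} and {C, c}
# behave like a two-way choice of initial pseudo-morpheme
```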
As long as we write C code, and know in advance that we are dealing with capital letters, then we can use pre-defined POSIX locales for capitalization. I'm trying to take two or three steps backwards, here. One is to treat capitalization as a kind of morphology, just like any other kind of morphology. The second is to create morphology classes - the pseudo-morpheme A is only substitutable by the pseudo-morpheme a. The third is that all of this should be somehow rule-driven and "generic" in some way. The meta-meta-meta issue is that I want to expand the framework beyond just language written as UTF8 strings, but more generally, to language with associated intonation, affect, facial expressions, or "language" from other domains (biology, spatial relations, etc.)