anysplit issues. #482
Comments
In addition to fixing But note that 4 parts are getting split to: |
sorry, I meant "amy". |
I took a glance at |
I just tried Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed? |
The problem is that the current amy/4.0.affix in the repository is not the version that I included in PR #481. EDIT: My error. The file at master now also works fine with 3 parts. The bug is now with > 3 parts... |
In the common case it is up to 4 pieces at the start of a word. As a demonstration of what is possible, people have also constructed a 5-piece prefix. So 5 is the answer for prefixes. Only certain letters can be included in such a prefix. Each such piece consists of 1-3 characters. There are about 12 such strings (depending on how you count them). Of course it is very common that what looks like a prefix is actually an integral part of the word, and also an isolated word may commonly have several meanings, depending on how many pieces you consider a prefix and how many an integral part of the word (creating a vast ambiguity). These starting pieces have concatenative morphology (with a slight twist that I have not mentioned). The end of a regular word can also include some (totally different) morphemes (usually 1-2 letters), I think up to 2. Verb inflections have their own different prefixes/suffixes. Their morphology is not concatenative but, interestingly, there is a concatenative approximation for them (if you use a kind of an artificial "base"). (Note also that you will have a hard time concluding anything from derivational morphology - there are no definite rules in it.) But how are you going to handle a language which has different codepoints for the same letters, depending on their position in the word? At first glance, this seems to ruin morphology conclusions unless you take that fact into account (e.g. by preprocessing, the equivalent of lowercasing the first letter in English). |
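As a rough illustration of the concatenative prefix structure described above, here is a minimal C sketch that enumerates the ways of peeling up to 5 known prefix pieces off the start of a word. The prefix inventory and the sample word are made-up placeholders for illustration, not the actual Hebrew prefix strings.
```
#include <stdio.h>
#include <string.h>

#define MAX_PIECES 5   /* "5 is the answer for prefixes" */

/* Placeholder inventory of prefix strings, each 1-3 characters. */
static const char *prefixes[] = { "w", "h", "b", "k", "l", "m", "sh", "ksh" };
static const size_t nprefixes = sizeof(prefixes) / sizeof(prefixes[0]);

/* Print every way of splitting off up to MAX_PIECES prefix pieces. */
static void split(const char *rest, const char *pieces[], int npieces)
{
    for (int i = 0; i < npieces; i++) printf("%s= ", pieces[i]);
    printf("%s\n", rest);

    if (npieces == MAX_PIECES) return;
    for (size_t i = 0; i < nprefixes; i++) {
        size_t len = strlen(prefixes[i]);
        /* Peel off a known piece, as long as some of the word remains. */
        if (strncmp(rest, prefixes[i], len) == 0 && rest[len] != '\0') {
            pieces[npieces] = prefixes[i];
            split(rest + len, pieces, npieces + 1);
        }
    }
}

int main(void)
{
    const char *pieces[MAX_PIECES];
    split("kshlhbait", pieces, 0);   /* made-up transliteration */
    return 0;
}
```
Even with this toy inventory, the ambiguity described above shows up immediately: the same surface string yields several competing decompositions.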
I have just said:
This is true for 3 parts, as they appear in PR #481. When using the correct amy/4.0.affix, then indeed all seems to be fine. But when I change it to 4, I also get a problem. EDIT: The file at master now also works fine with 3 parts. The bug is now with > 3 parts... |
The actual problem is that 4 parts and more are currently translated to "multi suffix". It can be fixed in several ways, all of which need a dict modification:
I think option (2) is reasonable. |
On Wed, Jan 25, 2017 at 4:12 PM, Amir Plivatsky ***@***.***> wrote:
Side question: for Hebrew, if I had to split a word into all of its
morphological components, how many pieces might it have (in the common
cases)? I get the impression that almost every letter could be a morpheme
by itself; is 6 enough, or would more be needed?
In the common case it is up to 4 pieces a start of word. For possibility
demo, people constructed also a 5 pieces prefix. So 5 is the answer for
prefixes. Only *certain letters* can be included in such a prefix. Each
such piece consists of 1-3 characters. There are about 12 such strings
(depending on how you count them). Of course it is very common that what
can be looked as prefix is actually an integral part of a word, and also an
isolated word may commonly several meaning, depending on how many pieces
you consider as a prefix andhow many as integral part of the word (creating
a vast ambiguity). These start peices have concatenated morphology (with a
slight twist that I have not mentioned).
The end of a regular word can also include some (totally another)
morphemes (usually 1-2 letters). I think up to 2.
Grand total sounds like maybe 7-8, plus maybe 3 more for verbs. Whew.
Verb inflections have their own different prefixes/suffixes. Their
morphology is not concatenative but, interestingly, there is a
concatenative approximation for them (if you use a kind of an artificial
"base").
I understand it's not concatenative; I'm hoping that the more complex
syntactic structures are enough to get this done right.
But how are you going to handle a language which has different codepoints
for the same letters, depending on their position in the word? At first glance,
this seems to ruin morphology conclusions unless you take that fact into account
(e.g. by preprocessing, the equivalent of lowercasing the first letter in
English).
Don't know. I'm also planning on not downcasing, and just seeing what
happens. Ask again in a few months. I'm still in very early stages, and
just trying to get a map for what the software needs to support.
…--linas
|
It can be fixed in several ways, *all of which need a dict modification*:
1. Provide me with another scheme to mark more than 3 parts.
2. You can use a marking for middle morphemes, *done solely in the
dict*: =SUF.=
These middle morphemes can be linked to stems (if they have ***@***.***LL+}) or to previous middle morphemes, or to
both, as you like (the same is said for "real" suffixes, i.e. the last
token).
I think option (2) is reasonable.
Ah! Yes. Clearly, I did not try very hard. My current plan is to use only
two link types at the initial stages: one type between morphemes in the
same word, and another between words. Later stages will create more link
types, including appropriate ones for suffixes, prefixes, etc.
|
OK, I just fixed something, and now a new issue arises: So, at the moment, splitting words into three parts kind-of-ish works, on shorter sentences, but clearly, splitting into even more parts will not work, even on the shortest sentences. |
and there is another, new strange issue: a single word of 8 letters now gets split into 1 or 2 or 3 parts, more-or-less as it should. A single word of 11 letters is never split: 54 linkages are reported, and all but 3 of them are the same, and completely unsplit! This did work after pull req #481, but something later broke it. Bisecting now |
It is possible to add an alternatives-position-hierarchy comparison to expression_prune(), power_prune(), and even to the fast-matcher (in which it can even be cached), so that matching of mixed alternatives will be pruned ahead of time. Maybe even adding it only to expression_prune() will drastically increase the density of good results. Also note that there is a memory leak in partial_init_linkage(). |
Ah. sort-of-found it. |
With more than 3 parts you need the aforementioned dict change... With a >20-letter word it looks fine for me. To see the sampling, use the following: When I tried it with the word |
The whole word is issued only once, as you can see by: The fact that many linkages include only it seems to be an artifact of the classic parser. |
It is possible to add an alternatives-position-hierarchy comparison to
expression_prune(), power_prune(), and even to the fast-matcher (in which
it can even be cached), so that matching of mixed alternatives will be pruned
ahead of time. Maybe even adding it only to expression_prune() will drastically
increase the density of good results.
Should I ask you to do this? It's kind of low-priority, but it's a blocker
to more complex morphology work.
Also note that there is a memory leak in partial_init_linkage().
OK, thanks, I think I fixed it in #487
|
When I tried it with the word abcdefghijklmnopqrstuvwxyz the results
looked "reasonable".
I get this:
```
linkparser> abcdefghijklmnopqrstuvwxyz
Found 1766 linkages (162 of 162 random linkages had no P.P. violations)
Linkage 1, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)
+------------------ANY------------------+
| +------------LL-----------+
| | |
LEFT-WALL abc[!MOR-STEM].= =defghijklmnopqrstuvwxyz[!MOR-SUFF]
Press RETURN for the next linkage.
linkparser>
Linkage 2, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)
+----------ANY----------+
| |
LEFT-WALL abcdefghijklmnopqrstuvwxyz[!ANY-WORD]
```
and then linkages 3, 7, 12 are the same as 1,
linkages 4, 5, 6, 8, 9, 10, 11 are the same as 2,
and so on; the first one that's different is linkage 28
```
linkparser>
Linkage 28, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)
+---------------------------ANY--------------------------+
| +--------PL--------+----------LL---------+
| | | |
LEFT-WALL abcdefgh=[!MOR-PREF] ijk[!MOR-STEM].= =lmnopqrstuvwxyz[!MOR-SUFF]
```
and then it's back to case 1 or 2 until linkage 53 ...
|
ah, indeed, with SAT, that repeated-linkage issue goes away. Maybe with the classic algo, the random selector keeps hitting the same combination, over and over. I think I can kind-of guess why, it's a side-effect of the sparsity. |
Flip side: I tried the SAT parser on "Le taux de liaison du ciclésonide aux protéines plasmatiques humaines est en moyenne de 99 %." and after 8+ minutes of CPU, it's still thinking about it. Clearly, there's a combinatoric explosion, so even here, expression_prune() and power_prune() will be needed. Although I'm confused ... if I think about it naively, adding a sane-morphism check to expression-prune won't fix anything, will it? What would fix things would be to have a different, unique link type for each different splitting, although that then makes the counting algorithm a bit trickier. I'd have to think about that some more. |
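A minimal sketch of the "unique link type for each different splitting" idea above: fold the word index and the alternative number into the intra-word connector label, so connectors from different splittings simply never carry the same label. The label format and the function name are illustrative assumptions, not existing code; how such labels interact with LG's connector-matching rules is left aside here.
```
#include <stdio.h>

/* Hypothetical helper: build an LL-style label that is unique per
 * (word, alternative) pair, e.g. word 7, alternative 2 -> "LLW7A2". */
static void make_ll_label(char *buf, size_t buflen,
                          unsigned word_idx, unsigned alt_idx)
{
    snprintf(buf, buflen, "LLW%uA%u", word_idx, alt_idx);
}

int main(void)
{
    char label[32];
    make_ll_label(label, sizeof label, 7, 2);
    printf("%s\n", label);   /* prints LLW7A2 */
    return 0;
}
```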
If you test this sentence using: In the sat-parser this can be solved by using sane-morphism constraints (and thus making the use of the sane_linkage_morphism() function there unnecessary). Theoretically this is supposed to make it faster for linkages with potential mixing.
I tend to think so - this needs checking.
Is there any sparsity anymore after the bad sane-morphism deletion fix?
You are right. The related fix should be applied to power_prune().
Maybe a kind of "checksum" can be done for each linkage and get hashed, enabling rejection of identical linkages. |
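A minimal sketch of the "checksum" idea, assuming the hash is computed over the token strings chosen by a linkage; FNV-1a is used here only as a convenient 64-bit hash, and feeding it from an actual Linkage (plus keeping a table of already-seen hashes) is left out.
```
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hash the sequence of token strings of one linkage (FNV-1a, 64-bit).
 * Linkages that produce the same hash would be treated as identical,
 * and the duplicates rejected. */
static uint64_t linkage_checksum(const char *const *words, size_t nwords)
{
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    for (size_t w = 0; w < nwords; w++) {
        for (const char *p = words[w]; *p != '\0'; p++) {
            h ^= (uint8_t)*p;
            h *= 0x100000001b3ULL;           /* FNV prime */
        }
        h ^= 0x1f;                           /* separator between tokens */
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void)
{
    const char *split_linkage[] = { "abc", "=defghijklmnopqrstuvwxyz" };
    const char *whole_linkage[] = { "abcdefghijklmnopqrstuvwxyz" };
    printf("%llx\n%llx\n",
           (unsigned long long)linkage_checksum(split_linkage, 2),
           (unsigned long long)linkage_checksum(whole_linkage, 1));
    return 0;
}
```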
Here is my plan for mixed alternatives pruning:
In order not to increase the size of the Connector struct, I thought of sharing tableNext:
```
struct Connector_struct
{
    ...
    union
    {
        Connector * tableNext;
        const Gword **word;
    };
};
```
This way no changes are needed in the usage of tableNext. I have no idea how much overhead this may add to sentences with a few alternatives. For sentences with no alternatives at all this can be skipped, and for sentences with many alternatives I guess it may significantly reduce the linkage time. |
The sparsity is still there, in the classic algo. For the abcdefghijklmnopqrstuvwxyz test, it counts 1766 linkages, but then, with random sampling, finds only 162 out of 1000 random samples. If I increase the limit to 2000, then it counts as before, but later revises that count to 17, because it can exhaustively enumerate all of these. It's sort of a surprising behavior that the exhaustive attempt revises the count; it's kind-of a feature-bug I guess. Hashing the linkage sounds like a good idea. But fixing the sparsity at an earlier stage seems more important.

Playing with unions is kind-of like playing with fire. I know that performance is sensitive to the connector size, but I don't recall any specifics. At one point, I recall measuring that 2 or 3 or 5 connectors would fit onto one cache line, which at the time seemed like a good thing. Now I'm less clear on this.

There is a paper on link-grammar describing "multi-colored" LG: connectors would be sorted into different "colored" categories that work independently of each other. This allowed the authors to solve some not-uncommon linguistic problem, although I don't recall quite what. Because they're independent, there are no link-crossing constraints between different colors -- there are no constraints at all between different colors. Given how Bruce Can was describing Turkish, it seems like it might be a language that would need multi-colored connectors.

Of course, I'm talking about this because perhaps, instead of thinking "this gword/morpheme can only connect to this other gword/morpheme", and enforcing it in possible_connection() -- perhaps a better "mindset" would be to think: "this gword/morpheme has a blue-colored connector GM67485+ that can only connect to this other blue-colored connector GM67485-". The end-result is the same, but the change of viewpoint might make it more natural and naturally extensible... (clearly, it's islands_ok for these blue connectors) |
In that particular case there is no problem with that, as tableNext is not in use after expression_prune(). Regarding "this gword/morpheme has a blue-colored connector GM67485+ that can only connect to this other blue-colored connector GM67485-": the problem is that these color labels are scalars, while a token hierarchy position is a vector. |
Hmm. Well, but couldn't the vector be turned into a hash? Comparing hashes would in any case be faster than comparing vectors. You don't even need to use hashes -- just some unique way of turning that vector into an integer, e.g. just by enumerating all possibilities for that given word. |
Note that: say you have 3 tokens, A, B and C. A can connect to B and to C, but B cannot connect to C. To see how complex connectivity rules between tokens can arise, consider that every token can split again, and the result can split again, creating a hierarchy of splits (the word-graph). But even a sentence with one level of alternatives (like Russian, or Hebrew without spell-correction) has these kinds of relations - tokens of an alternative can connect to sentence words, but tokens of one alternative cannot connect to tokens of another alternative if both alternatives are of the same word, though they can connect to the alternatives of another word. (If this is not clear, try to look at it more deeply, and especially think of spell correction that separates words and also gives alternatives, and then all of these tokens get broken into morphemes, each in more than one way.)

To find out if tokens a and b are from the same alternative, the algo looks at their hierarchy position vectors Va and Vb. It compares their components one by one, until they are not equal. If an even number of components are equal, then the tokens can connect; otherwise they cannot. Usually the position hierarchy vectors have 0 to 4 elements (corresponding to hierarchy depth 0 to 2), so there is not much overhead in their "comparisons". For sentences without alternatives, all the position hierarchy vectors are of length 0. |
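A minimal sketch of the comparison just described, under the assumption that a hierarchy position is stored as a short array of integer components; the struct and function names are illustrative, not the actual library code.
```
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    size_t len;     /* number of components: usually 0 to 4 */
    int comp[4];    /* hierarchy-position components */
} hier_pos;

/* Tokens may connect iff an even number of leading components are equal. */
static bool may_connect(const hier_pos *a, const hier_pos *b)
{
    size_t n = (a->len < b->len) ? a->len : b->len;
    size_t common = 0;
    while (common < n && a->comp[common] == b->comp[common])
        common++;
    return (common % 2) == 0;
}

int main(void)
{
    /* Two tokens from different alternatives of the same word:
     * one leading component (the word) is shared, so they cannot connect. */
    hier_pos a = { 2, { 3, 0 } };
    hier_pos b = { 2, { 3, 1 } };
    printf("%s\n", may_connect(&a, &b) ? "can connect" : "cannot connect");
    return 0;
}
```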
It's really late at night and I'm about to go to bed, so my reply might be off-the-wall, because I'm not reading or thinking about your code... but... hey: the disjunct is a vector. Each connector is a single item in the vector. I mean, that's kind-of the whole point of all that blather about category theory -- the category of Hilbert spaces allows you to define tensors, where you can contract upper and lower (co- and contra-variant) indexes on vectors and tensors. The LG grammar, and most/all categorial grammars, are quite similar, just enriching the possibilities of how to contract (match) indexes (connectors): you can force contraction to left or right (Hilbert spaces make no left-right distinction) and you can mark some connectors as optional. Now, the classic LG connectors and connection rules were fairly specific, but I've enriched them with more stuff, and we can enrich or elaborate further. So, step back, and think in abstract terms: instead of calling them vectors, call them a new-weird-disjunct-thing; each vector-component can be called a new-weird-connector-thing, and then some ground rules: can we have more than one link between two morphemes? What else might be allowed or prohibited, generically, for these new things? My gut intuition is that developing this abstract understanding clarifies the problem, and the solution, even if the resulting C code ends up being almost the same... and the new abstract understanding might give an idea of a better C implementation. I'll try a more down-to-earth reply sometime later. |
I implemented a not-same-alternative prune in prune.c. In possible_connection(), before easy_match():
```
bool same_alternative = false;
for (Gword **lg = (Gword **)lc->word; NULL != (*lg); lg++) {
    for (Gword **rg = (Gword **)rc->word; NULL != (*rg); rg++) {
        if (in_same_alternative(*lg, *rg)) {
            same_alternative = true;
        }
    }
}
if (!same_alternative) return false;
```
To support that, the connectors are initialized with the word of their disjunct:
```
for (w = 0; w < sent->length; w++) {
    for (d = sent->word[w].d; d != NULL; d = d->next) {
        for (c = d->right; NULL != c; c = c->next)
            c->word = d->word;
        for (c = d->left; NULL != c; c = c->next)
            c->word = d->word;
    }
}
```
The same linkages are produced, so I guess this indeed doesn't remove anything that is needed.
After:
If this improvement seems worthwhile, I can add efficiency changes to it and send a PR. |
Next thing to try is maybe add such checks to the fast-matcher. |
In the previous post I said:
Ok, I tested this too. It doesn't do anything more than what has already been done in prune.c. There is another constraint on possible connections between alternatives that may be the reason for most of the insane-morphism connections: A word cannot connect to tokens from different alternatives of another token. Consider this situation:
We have 2 words - A and B. Word B has two alternatives - C and "D E".
A     B
alt1: C
alt2: D E
A cannot connect to C and E at the same time. However, I don't know how to implement an a priori check for that (in |
I wrote above:
This is also a special case... It turned out the complete rule covers the two cases and more. Here is the complete rule: The special case of a connection between tokens in different alternatives is easy to check (and it is what I forbade in |
I found a way to prune the following too (quoting from my post):
However, there appear to be no such cases in the current ady/amy dicts. The third special case, which complements the two special cases mentioned above, is when two tokens from different alternatives each have a connection to a different token. This is the hardest case to test, and I think it can only be tested (in the classic parser) during the counting stage. So for now I only implemented the alternatives compatibility test between two connectors (the quoted code). Here are its results for the sentence:
Before:
After:
EDIT: |
I was too pessimistic. More later. |
On Sat, Jan 28, 2017 at 5:18 PM, Amir Plivatsky ***@***.***> wrote:
I implemented not-same-alternative prune in prune.c.
I didn't think about efficiency when doing it, so the change is small.
The same linkages are produced, so I guess this indeed doesn't remove
anything that is needed.
However, *surprisingly*, batch run times are not reduced.
Should not be a surprise -- the English batch files generate very few
alternatives. The Russian ones generate more, but not overwhelmingly more.
However, debugging shows the added check returns false on mismatched
alternatives (only).
Also, the amy test of `abcdefghijklmnopqrstuvwxyz` got improved:
Before:
Found 1766 linkages (162 of 162 random linkages had no P.P. violations)
After:
Found 1496 linkages (289 of 289 random linkages had no P.P. violations)
If this improvement seems worthwhile, I can add efficiency changes to it
and send a PR.
Sounds good; I do expect that an 'amy' batch (run against, say, the English
batch) would run a lot faster.
|
On Sat, Jan 28, 2017 at 7:17 PM, Amir Plivatsky ***@***.***> wrote:
*A word cannot connect to tokens from different alternatives of another
token.*
Consider this situation:
We have 2 words - A and B. Word B has two alternatives - C and "D E".
A B
alt1: C
alt2: D E
A cannot connect to C and E at the same time.
I suspect that a common case is to have an 'ANY' link from A to C and an LL
from C to E .. (and then the rest of the sentence connecting to E...)
|
I just tried a quick performance test: with the patches mentioned above: 1m35.844s 1m36.712s So the same-alternative check makes it run more slowly. -- that is surprising. Sort-of. Maybe. It suggests that the check isn't actually pruning anything at all. -- and perhaps that is not a surprise, because the pruning stage is too early. However, doing the |
I said:
I had a slight bug in doing it... After fixing it, the first-45-sentence test then runs more than 4 times faster (with the fast-matcher patch, w/o the power-prune patch).
As we see, you are indeed right. I will implement efficiency changes and will send a PR soon. |
This could be helpful in any case. The current limitation seems to me like requiring that numbers always be written like "1+1+..." just because that is enough to express any number.
I'm still thinking on that.... |
I noted that many words are UNKNOWN_WORD, and when an affix is marked so, this causes null words. One class of such words are those with punctuation after them. |
From the point of view of the splitter, it would be more convenient if punctuation behaved like morphemes. But from linguistics, we know that punctuation really does not behave like that. I believe we could prove this statistically: the mutual information of observing a word-punctuation pair would be quite low; in other words, there is little correlation between punctuation and the word immediately before (or after) it. The correlation is between the punctuation and phrases as a whole. The problem is that ady and amy both insist on placing an LL link between morphemes: by definition, morphemes must have a link between each other, although they can also have links to other words. Treating punctuation like morphemes would make a link between the punctuation and the word mandatory, which is something that statistics will not support. So, for now, treating punctuation as distinct seems like the best ad-hoc solution.

Later, as the learning system becomes more capable, we might be able to do something different: an extreme example would be to ignore all spaces, completely... and discover words in some other way. Note that most ancient languages were written without spaces between words, without punctuation, and without upper-lower distinctions: these were typographical inventions, meant to make reading easier. Commas help denote pauses for breathing, or bracket idea-expressions; ? and ! are used to indicate rising/falling tones, surprise. We continue to innovate typographically :-) For example, emoticons are a markup used to convey emotional state, "out of band": like punctuation, the emoticons aren't really a part of the text: they are a side channel, telling you how the author feels, or how the author wants you to feel. Think musical notation: the musical notes run out-of-band, in parallel to the words that are sung. |
Currently words with punctuation after them are "lost" (mar. Dashes and apostrophes will need to be ignored as punctuation (i.e. considered as part of the word). BTW, there is a problem in the current definitions of affix regexes, when words get split into pref=, stem.= and =suf, but one (or more) of the parts is not recognized by the regexes as an affix and is thus classified as UNKNOWN_WORD, leading to null words. This is very common, and it increases the processing time by much, due to the repeated need to parse with nulls. |
Currently words with punctuation after them are "lost"
?
The question is what to do with "words" with inter-punctuation, like
*http://example.com*.
I think it may be better not to split them.
a) there aren't supposed to be any urls in the text I'm parsing.
Unfortunately, the scrubber scripts are imperfect.
b) If you let them split, then basic statistics should very quickly
discover that http:// is a "morpheme", i.e. these 7 bytes always co-occur
as a unit. Always. If you allow a 3-way split, then the discovery of .com
and .org should be straight-forward, as well. So, yes, URL's have a
morphological structure, and actually it is very regular, far more regular
than almost any natural language. The structure should be easy to find by
statistical analysis. So, split them. ...
For the next few months, I mostly don't care, because performing a
morphological analysis of URL's seems a bit pointless, right now.
This may include initials that use dots.
Beats me, random splitting should auto-discover these boundaries. We'll
find out in a few months.
Dashes and apostrophes will need to be ignored as punctuation (i.e.
considered as part of the word).
That depends on what the morpheme analysis discovers. Random
splits+statistics will tell us how.
BTW, there is a problem in the current definitions of affix regexes, when
words get split into pref=, stem.= and =suf, but one (or more) of the
parts is not recognized by the regexes as an affix and is thus classified as
UNKNOWN_WORD, leading to null words.
That sounds like a bug. Do you have an example? I'm not clear on what is
happening here.
A fix can be implemented to handle punctuation as proposed above,
No -- for now, it would be best to ignore most punctuation, and handle only
a small set of special cases: words that end with a period, comma,
semi-colon, colon, question-mark, exclamation point. I think that's it.
Everything else should be treated as if it was an ordinary letter, an
ordinary part of the word.
The only reason I want to treat these six terminal characters as special
is to simplify the discovery process; I know that, a-priori, these puncts
behave like words, not like morphemes. I'd rather not waste compute power
right now on discovering this.
By contrast, the $ in $50 really does behave like a morpheme: it is a
modifier of 50, and there would be a link that would be discovered that
connects $ to 50, and the automatic random splitting of the string $50
should allow this discovery to happen automatically.
classify morphemes in the regex file *only* by their marks (infixmark
and stem mark) disregarding their other letters.
I guess that sounds reasonable. (except for the fact that we talked about
getting rid of these marks?)
|
See in this example what happens to "time;" and "however,":
In the example above, =wever,[?] is not classified as a suffix, creating a null word ho.
But you need to accept the punctuation that you allow as part of words (such as: $ - ' etc.)
My proposal:
I can implement that, but you haven't commented on my proposal for that. |
See in this example what happens
.. ahh! Yeah, that's a bug. My apologies -- 'amy' was quickly thrown
together 3 years ago, and just barely revisited. So it's buggy/suboptimal.
FWIW, it's taken me the entire last month just to get the counting pipeline
running in a stable fashion: it's now running for 24 hours+ without
crashing, hanging, blowing up RAM, etc. It might still be generating poor
data, but at last it seems stable so that's forward progress.
My proposal:
1. Add RPUNC with comma, period, etc.
2. Since RPUNC will get separated and will not exist any more in
marked morphemes, accept every character in MOR-, as in (ady/4.0.regex):
SIMPLE-STEM: /=$/;
yes, sounds good!
(except for the fact that we talked about getting rid of these marks?)
I can implement that, but you haven't comment on my proposal for that.
Ah, I lost track. Which issue #? I guess it's in my email box somewhere..
|
Care to create an "aqy" which does 4-way splits (or 3 or 2...)? |
It was very recently, but I myself cannot locate it now. Tokenize words to just strings, and look them up without adding marks to them. Inter-word connections will use connectors with a special character to denote that. Also, more info about tokens can be added outside of their strings, and be used to denote the affix type. This will allow implementing Unification Link Grammar or context-sensitive parsing (which we have never discussed), which in any case needs some different dictionary somewhere. In order not to introduce too many changes, I once proposed a two-step lookup, in which you first look up the token in a special dictionary to find some info on it, and consult the link-grammar dictionary (possibly using another string or even several strings) only to find the disjuncts. |
OK. Here is a summary of my proposal for that (assuming the current morpheme marking): Is this fine for links?
|
Tokenize words to just strings, and look them up without adding marks to
them.
In the dictionary there is more than one option. One of them is to still
make a marking (for dictionary readability), where these markings, as said
above, will be ignored for lookup (but not for token rendering with
!morphology=1).
Sure, at this level, sounds reasonable.
Inter-word connections will use connectors with a special character to
denote that.
I think you mean "intra". Intra-word would be "within the same word";
inter-word would mean "between two different words."
So right now, in Russian, LLXXX is an intra-word connector.
There may be a need to have a way to specify a null affix, which will get
used if an inter-word connector may connect to a null-affix.
Yes, probably needed, based on experience with Russian. I have no clue yet
how that will be auto-learned.
Also, more info about tokens can be added outside of their strings, and be
used to denote the affix type. This will allow implementing Unification
Link Grammar <#280>
At the naive level, this seems reasonable, but a moment of reflection
suggests that there's a long and winding road in that direction.
or context sensitive parsing (which we have never discussed),
Heh.
which in any case needs some different dictionary somewhere. In order not to
introduce too many changes, I once proposed a two-step lookup, in which you
first look up the token in a special dictionary to find some info on it,
and consult the link-grammar dictionary (possibly using another string or
even several strings) only to find the disjuncts.
what sort of additional info is needed for a token? What do you envision?
|
On Fri, Feb 3, 2017 at 10:27 PM, Amir Plivatsky ***@***.***> wrote:
Care to create an "aqy" which does 4-way splits (or 3 or 2...)?
OK.
Here is a summary of my proposal for that (assuming the current morpheme
marking):
For 2 or 3: Same marking as now.
For 4 and up: pref= stem.= =midd1.= =midd2.= ... =suff
Is this fine for links?
+-----------------------------ANY------------------------------+
| |
| +-------PL------+-------LL------+-------LL-------+
| | | | |
LEFT-WALL abc=[!MOR-PREF] de[!MOR-STEM].= =ef[!MOR-MIDD].= =gh[MOR-SUFF]
Yes, I think so. It might be better to pick more neutral, less suggestive
names, like "MOR-1st" "MOR-2nd" "MOR-3rd" or maybe MORA MORB MORC or
something like that. The problem is that "STEM" has a distinct connotation,
and we don't yet know if the second component will actually be a stem, or
perhaps just a second prefix.
…--linas
|
Yes, I meant "intra".
First, I admit that everything (even whole dictionaries...) can be encoded into the subscript. In addition, everything needed for readability can be done using % comments. In my proposal not to use infix marks for dict lookups, I mentioned the option to leave them in the dict tokens for readability but ignore them in the dict read. Even this is not needed, as it can be mentioned in a comment such as:
But this, for example, prevents consistency checking at dict read time (e.g. such a token must have a right-going connector, or even (if implemented) an "intra-word" right-going connector). So this is a bad idea too. Examples:
A totally different alternative is not to use infix marking, but to encode it in the subscript (by anyfix.c): |
Sorry for the delay. I was busy fixing the incorrect null_count bug caused by my recent empty-word elimination. I hope to have a complete fix for that very soon (without using the empty-word again). |
Yeah, no hurry. I have yet to begin data analysis on the simpler cases. There's a long road ahead. |
consider this:
this attempts the case with one part (no splits) and with 3 parts (two splits); it never outputs any attempts to split into two parts.
Editing
data/amy/4.0.affix
and changing
3: REGPARTS+;
to
4: REGPARTS+;
never generates splits into 4 parts. I tried setting
"w|ts|pts|pms|pmms|ptts|ps|ts": SANEMORPHISM+;
but this seemed to have no effect.