splitting hyphenated, underlined words #560
Comments
Note that the current I propose to restore it to the previous one (before this last change) so
The current splitter cannot split at desired characters, since it doesn't have the concept of splitting on middle characters. Adding that ability has been proposed in #42:
(BTW, it seems blank will still need special handling, to collapse it to one character and not be considered a token, unless additional syntax is added.) I can implement this if desired. For
(The same for " For
However, the result is even stranger than for To sum up: I can implement
I tried this, and it almost works, but not quite: in 4.0.affix:
in 4.0.regex:
in 4.0.dict:
The hope was that SIMPLE-STEM aka REGPRE always ends with a dash. It does sometimes, but not always. Changing the dash to [_-], so that I can match a dash or an underbar, results in regexes that fail to compile, and I don't understand why. I was hoping that having the stem end in a dash would then allow regular affix processing to strip off the dash.
Regarding WORDSEP: with regular whitespace, it's OK to collapse multiple whitespace down to one. Also, whitespace is never a token in itself. But for dashes, that would not be the case: multiple dashes cannot be collapsed, and dashes are tokens. So I don't understand the concept.
The idea for WORDSEP is to be like LPUNC or RPUNC (maybe it can be called MPUNC instead).
Hence blanks cannot be handled by it.
Yes, could you implement MPUNC, if it's easy? To work like LPUNC, it would be a space-separated list, with some of the entries being multi-character, e.g. the double-dash.
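As a rough sketch of the concept (not the project's actual code), a splitter with an MPUNC-like list could scan each token for the listed strings, longest first, and cut the token around them. The punctuation list and the helper below are made up for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mid-word punctuation list, analogous to LPUNC/RPUNC:
 * a space-separated list in the affix file, multi-character entries allowed.
 * Longer strings are listed first so "--" wins over "-". */
static const char *mpunc[] = { "--", "-", "_", NULL };

/* Print the pieces of 'tok', splitting around any mpunc string found
 * strictly inside the token (never at its ends).  The punctuation
 * itself is emitted as a token of its own. */
static void split_mpunc(const char *tok)
{
    size_t len = strlen(tok);
    for (size_t i = 1; i + 1 < len; i++)
    {
        for (const char **p = mpunc; *p != NULL; p++)
        {
            size_t plen = strlen(*p);
            if (i + plen < len && strncmp(tok + i, *p, plen) == 0)
            {
                printf("%.*s\n", (int)i, tok);   /* left part */
                printf("%s\n", *p);              /* the punctuation token */
                split_mpunc(tok + i + plen);     /* split the remainder too */
                return;
            }
        }
    }
    printf("%s\n", tok);  /* nothing to split */
}

int main(void)
{
    split_mpunc("aaa-bbb");     /* aaa / - / bbb */
    split_mpunc("long--dash");  /* long / -- / dash */
    split_mpunc("-leading");    /* left intact: leading punctuation is LPUNC's job */
    return 0;
}
```

A real implementation would return tokens rather than print them, and would have to coordinate with LPUNC/RPUNC stripping; this only shows the matching logic, including why multi-character entries such as the double-dash must be tried before their single-character prefixes.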
The problem is that Maybe there is a way to right-strip
I will try to do that.
The reason is the idiom checking in dictionary.c:78. I didn't find a trivial fix for this "underbar in regex" problem. Possible solutions for the affix file "underbar in regex" problem:
I will implement number (3) for now, unless you have a better idea for that. (BTW, the MPUNC change is almost ready.)
Which of
Not sure. I am using `any` very heavily, and need it there, mostly because my input texts love to use crazy punctuation. For example
Beats me where the open-square-bracket went. The thing after X. is some unicode long-dash. The underscores are supposed to be some kind of quoting-like device, because the quotes are already in use for a different purpose. So I think I was seeing It's all very confusing. In the long term, maybe ady/amy could discover punctuation on their own, but this is still far off in the future.
I essentially finished adding MPUNC, but there are still small details that need handling, especially repeated tokenizations due to punctuation stripping (an old problem that becomes more severe).
I continue to work on MPUNC, and encountered a need to change a current behaviour. For example only, here is what may happen with
(Of course, we could minimize such cases, if needed, by defining more tokens as LPUNC and RPUNC so they will get split too.) [BTW, a general tokenizer can just split on any punctuation, and have a configurable punctuation list for which to emit a split alternative (if they can also be treated as letters). In regular languages the dict will decide which combination is valid. Currently the English dict cannot cope with that, as it treats some nonsense punctuation combinations as correct.]
Accepting Here's another example, for Chinese. Chinese is normally written without any spaces at all between words. There are end-of-sentence markers. Words may be one, two, or three (or rarely, more) hanzi (kanji) in length. A single sentence can be anywhere from half a dozen to several dozen hanzi in length. There are three strategies for dealing with this:
I don't like 1. I'm not clear on whether 2 or 3 is preferable. Option 2 pushes all the complexity into the dictionary, and depends on the parser for performance. Option 3 pushes some of the complexity into the splitter. It makes the splitter more complex and more CPU-hungry, while making the parser run faster. Option 2 has a dictionary with some 20K hanzi, but with lots and lots of morpheme-style disjuncts, so it requires the parser to work through many combinations. Option 3 has a dictionary with some 100K or 200K words, but each has far fewer disjuncts, making parsing faster (but splitting slower). I don't know which approach is better, either performance-wise or philosophically-theoretically speaking.
Back to English: So, the options for handling What's your opinion? I think you are starting to realize how complex splitting can be; is that a good thing, or a bad thing? How can we balance between 2 & 3? You can ponder what might happen if we did Hebrew with 2 vs. 3 (i.e. splitting on every letter in Hebrew), and what might happen if we did English/French/Russian using option 2.
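To make option 3 concrete, here is a minimal sketch (not the library's tokenizer) of the classic greedy longest-match idea: at every position, emit the longest word the lexicon knows and move past it. The tiny lexicon and the byte-level fallback are purely illustrative:

```c
#include <stdio.h>
#include <string.h>

/* A made-up miniature lexicon; a real one would hold 100K-200K entries. */
static const char *lexicon[] = { "the", "cat", "cats", "at", "sat", NULL };

/* Length of the longest lexicon word starting at 's', or 0 if none matches. */
static size_t longest_match(const char *s)
{
    size_t best = 0;
    for (const char **w = lexicon; *w != NULL; w++)
    {
        size_t len = strlen(*w);
        if (len > best && strncmp(s, *w, len) == 0)
            best = len;
    }
    return best;
}

/* Greedy ("maximal munch") segmentation of an unsegmented string. */
static void segment(const char *s)
{
    while (*s != '\0')
    {
        size_t len = longest_match(s);
        if (len == 0) len = 1;  /* unknown: emit one unit; real code would step
                                   over a whole UTF-8 character, not one byte */
        printf("%.*s ", (int)len, s);
        s += len;
    }
    printf("\n");
}

int main(void)
{
    segment("thecatsat");  /* prints: the cats at */
    return 0;
}
```

Note that greedy matching can pick the wrong split (here it yields "the cats at" rather than "the cat sat"), which is one reason for emitting split alternatives and letting the dictionary decide, as discussed above.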
A clarification regarding the For the rest, I will start at the end.
The second kind of dictionary can get a sentence without whitespace and infer the words when creating a linkage. BTW, the regex tokenizer (still in the code but with no test hook any more) can do the same using regular dictionaries, i.e. infer the word boundaries of sentences without whitespace. Of course, every ordinary dictionary (including the current Hebrew, Russian, etc.) can be translated to a single-letter dictionary. In addition, with slight extensions, even 4.0.regex can be translated to a single-letter dictionary!
I think an efficient tokenizer can be written easily enough for option 3. Depending on the internal (in-memory) representation of the dict, it can even be extremely efficient.
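For what it's worth, one standard way to get that efficiency is to hold the word list in a trie (prefix tree), so the longest match at a position falls out of a single left-to-right scan instead of probing every dictionary entry. A bare-bones, ASCII-only sketch with made-up names:

```c
#include <stdlib.h>

/* A bare-bones ASCII trie node. */
typedef struct TrieNode
{
    struct TrieNode *child[128];
    int is_word;  /* nonzero if the path to this node spells a word */
} TrieNode;

static TrieNode *node_new(void)
{
    return calloc(1, sizeof(TrieNode));
}

/* Insert one word, creating nodes along its path as needed. */
static void trie_insert(TrieNode *root, const char *word)
{
    for (const char *p = word; *p != '\0'; p++)
    {
        unsigned char c = (unsigned char)*p;
        if (root->child[c] == NULL) root->child[c] = node_new();
        root = root->child[c];
    }
    root->is_word = 1;
}

/* Length of the longest dictionary word that is a prefix of 's'.
 * One pass over 's'; no per-word probing of the lexicon. */
static size_t trie_longest_match(const TrieNode *root, const char *s)
{
    size_t best = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
    {
        root = root->child[(unsigned char)s[i]];
        if (root == NULL) break;
        if (root->is_word) best = i + 1;
    }
    return best;
}
```

Swapping something like this in for a linear scan keeps the per-position cost proportional to the matched word's length rather than to the dictionary size; real hanzi data would need a UTF-8-aware or hash-based variant.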
Here are the results of my current any/ady/amy code:
I also modified the definitions for ady/amy in the same way, and added MPUNC to
Handled in #575.
I'm having trouble configuring the `any` language to ... do what it does today, but also split up hyphenated words: e.g. split `this aaa-bbb ccc` into five words: `this aaa - bbb ccc`. I set REGPRE, REGALT, and so on in various ways, but nothing would quite match correctly ...