Latin letters #3
Comments
Full support in mvzr is unlikely, unfortunately. It's something I can add if and when allocation is possible again at comptime. But the scheme mvzr uses to recognize a character set is only suitable for ASCII characters. It uses two […] When you generalize this technique, you get runeset. RuneSet is specifically designed for arbitrary sets of UTF-8 codepoints, which would cover composed Latin characters; see Unicode equivalence for more about what that means. But creating a RuneSet requires allocation, and being a stack-based, no-allocation library is an important part of mvzr's design space.

I have some preliminary work done on a pattern-matching library which incorporates the RuneSet, but I don't expect it to be done any time soon. Maybe this year, but not next month. Meanwhile, I'm afraid that the workaround of saying […]

I'll leave this issue open to track the feature, because I certainly agree that it would be a good thing to support at least codepoint sets. Sets of arbitrary grapheme clusters are an extraordinarily difficult feature, and the regex libraries I know of which do support that do so by translating the input stream into 32 bits per codepoint, which will never be something […]

One more observation: modifiers will work correctly on a multibyte character, or indeed a grapheme cluster, if wrapped in parentheses: […]
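The observation about parentheses holds for any matcher that works on raw bytes, so it can be sketched outside of mvzr. The example below uses Python's `re` module in bytes mode purely as a stand-in: it is not mvzr, it does not use mvzr's API, and the variable names are invented for illustration. It shows why a modifier placed directly after `é` only binds to the last byte of its UTF-8 encoding, and why grouping fixes that.

```python
import re

# "é" is one codepoint but two bytes in UTF-8: 0xC3 0xA9.
E_ACUTE = "é".encode("utf-8")
haystack = "éé".encode("utf-8")  # bytes: C3 A9 C3 A9

# Unwrapped: "+" binds only to the final byte, so the pattern
# means "0xC3 followed by one or more 0xA9".
unwrapped = re.compile(E_ACUTE + b"+")

# Wrapped: the group makes "+" repeat the whole two-byte sequence.
wrapped = re.compile(b"(" + E_ACUTE + b")+")

unwrapped_matches = unwrapped.fullmatch(haystack) is not None
wrapped_matches = wrapped.fullmatch(haystack) is not None

print(unwrapped_matches)  # False: the second 0xC3 is not another 0xA9
print(wrapped_matches)    # True: (C3 A9) repeated twice
```

Whether mvzr resolves each of these patterns in exactly the same way is not something this sketch can confirm; it only demonstrates the byte-level reasoning.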
Perhaps this limitation and its workaround should be added to the documentation/README under "Quirks", because otherwise users may run into unpleasant surprises.
The first line of Limitations and Quirks in the README is "No Unicode support to speak of", so I feel like that part, at least, is covered. I suppose I could suggest the use of alternates as a poor man's character set. As for […]
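The idea of alternates as a poor man's character set can be sketched at the byte level. As a hedged illustration, the example below again uses Python's `re` in bytes mode as a stand-in for a byte-oriented matcher (it is not mvzr, and the helper name `full` is hypothetical). A bracketed set built from multibyte characters degenerates into a set of individual bytes, while an alternation keeps each character's byte sequence intact.

```python
import re

E_ACUTE = "é".encode("utf-8")  # b"\xc3\xa9"
A_GRAVE = "à".encode("utf-8")  # b"\xc3\xa0"

# Byte-wise set: matches exactly ONE byte out of 0xC3, 0xA9, 0xA0.
byte_class = re.compile(b"[" + E_ACUTE + A_GRAVE + b"]")

# Alternation: matches the full two-byte sequence of either character.
alternates = re.compile(b"(" + E_ACUTE + b"|" + A_GRAVE + b")")


def full(pattern, data):
    """Return True if the whole byte string matches the pattern."""
    return pattern.fullmatch(data) is not None


print(full(byte_class, E_ACUTE))   # False: "é" is two bytes, the set consumes one
print(full(byte_class, b"\xc3"))   # True: a lone lead byte is accepted
print(full(alternates, E_ACUTE))   # True
print(full(alternates, A_GRAVE))   # True
print(full(alternates, b"e"))      # False: plain ASCII "e" is not in the set
```

The cost of this workaround is verbosity: every character needs its own alternative, and negated sets have no direct equivalent.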
I disagree. There are plenty of character encodings other than Unicode that support those characters, such as ISO-8859-1 or Windows-1252. These are still very widely used (think tens of millions of people) in Europe and elsewhere by non-English-speaking users.
So the interesting thing about that is that mvzr will correctly handle any and all one-byte legacy encodings. Your only problem at that point is that Zig will encode it as UTF-8 and nothing else.
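The difference between a one-byte legacy encoding and UTF-8 is easy to see in a small experiment. The sketch below is an illustration in Python, not mvzr code: it uses Python's bytes-mode `re` as a stand-in for a byte-oriented matcher, and all names are invented. In ISO-8859-1 each accented letter is a single byte, so a byte-wise set works; the same text in UTF-8 takes two bytes per letter and the set no longer matches.

```python
import re

text = "éàç"
latin1 = text.encode("latin-1")  # one byte per letter: E9 E0 E7
utf8 = text.encode("utf-8")      # two bytes per letter: C3 A9 C3 A0 C3 A7

# A set of single bytes: é, à, ç, ô as encoded in ISO-8859-1.
accented = re.compile(b"[" + "éàçô".encode("latin-1") + b"]+")

latin1_ok = accented.fullmatch(latin1) is not None
utf8_ok = accented.fullmatch(utf8) is not None

print(len(latin1), len(utf8))  # 3 6
print(latin1_ok)               # True: every byte is in the set
print(utf8_ok)                 # False: 0xC3 is not in the set
```

In practice this means the input has to arrive already encoded in the legacy encoding, and the pattern has to contain the same raw bytes.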
hello:
MVZR pè1CéàaÇ test true UTF8
Will there ever be support for Latin letters such as é, à, ç, ô, etc.? Thank you.