Latin letters #3

Open
AS400JPLPC opened this issue Aug 25, 2024 · 6 comments

Comments

@AS400JPLPC

Will there ever be support for Latin letters such as é, à, ç, ô, etc.? Thank you.

@mnemnion
Owner

mnemnion commented Aug 25, 2024

Full support in mvzr is unlikely, unfortunately.

It's something I can add if and when allocation is possible again at comptime. But the scheme mvzr uses to recognize a character set is only suitable for ASCII characters. It uses two u64 bitmasks to cover the low and high halves of the ASCII range.
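As a rough illustration of the two-bitmask idea, here is a minimal sketch. The `AsciiSet` type and its `add` and `contains` functions are hypothetical names invented for this example and are not mvzr's actual internals; the sketch only assumes that bytes 0..63 map to one u64 and bytes 64..127 to the other.

```zig
const std = @import("std");

/// Hypothetical type for illustration only; not mvzr's real data structure.
const AsciiSet = struct {
    low: u64 = 0, // bit i set => byte i (0..63) is a member
    high: u64 = 0, // bit i set => byte 64 + i (64..127) is a member

    fn add(self: *AsciiSet, c: u8) void {
        std.debug.assert(c < 128); // only ASCII fits in 2 x 64 bits
        const bit = @as(u64, 1) << @as(u6, @truncate(c)); // c mod 64
        if (c < 64) {
            self.low |= bit;
        } else {
            self.high |= bit;
        }
    }

    fn contains(self: AsciiSet, c: u8) bool {
        if (c >= 128) return false; // non-ASCII bytes can never be members
        const bit = @as(u64, 1) << @as(u6, @truncate(c));
        const mask = if (c < 64) self.low else self.high;
        return (mask & bit) != 0;
    }
};

pub fn main() void {
    var set = AsciiSet{};
    for ("abcXYZ09") |c| set.add(c);
    // 0xC3 is the first byte of many two-byte UTF-8 sequences, such as é.
    std.debug.print("{} {} {}\n", .{
        set.contains('a'),
        set.contains('q'),
        set.contains(0xC3),
    });
}
```

The set occupies 16 bytes on the stack and needs no allocator, but it has no room for any byte above 127, which is why a multibyte character cannot be a member.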

When you generalize this technique, you get runeset. RuneSet is specifically designed for arbitrary sets of UTF-8 codepoints, which would cover composed Latin characters (see Unicode equivalence for more about what that means). But creating a RuneSet requires allocation, and being a stack-based no-allocation library is an important part of mvzr's design space.

I have some preliminary work done on a pattern-matching library which incorporates the RuneSet, but I don't expect it to be done any time soon. Maybe this year, but not next month.

Meanwhile, I'm afraid that the workaround of saying "(é|î)+", or what have you, is all that mvzr will support.

I'll leave this issue open to track the feature, because I certainly agree that it would be a good thing to support at least codepoint-sets. Sets of arbitrary grapheme clusters are an extraordinarily difficult feature, and the regex libraries I know of that support them do so by translating the input stream into 32 bits per codepoint, which mvzr will never require or do.

One more observation: modifiers will work correctly on a multibyte character, or indeed a grapheme cluster, if wrapped in parentheses: (ç)? will match all or nothing of a cedilla, not just the last byte. That is something which I might be able to do automatically in the compiler, but not, alas, full character sets.
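To see why the parentheses matter, it may help to look at the bytes involved. The sketch below uses only the Zig standard library and makes no calls into mvzr; it simply prints the UTF-8 encoding of ç.

```zig
const std = @import("std");

pub fn main() void {
    // Zig source files are UTF-8, so this literal holds the encoded
    // bytes of U+00E7 rather than a single one-byte character.
    const cedilla = "ç";

    std.debug.print("len = {d}\n", .{cedilla.len}); // len = 2
    for (cedilla) |byte| {
        std.debug.print("0x{X:0>2} ", .{byte}); // 0xC3 0xA7
    }
    std.debug.print("\n", .{});
}
```

Because the character is the two-byte sequence 0xC3 0xA7, a modifier that binds to the preceding byte would make only 0xA7 optional, whereas wrapping the character in a group makes the whole sequence optional.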

@NJdevPro

NJdevPro commented Oct 5, 2024

Perhaps this limitation and its workaround should be added to the documentation/README under "Quirks", because otherwise one may get unpleasant surprises.
(Otherwise, great work!)

@mnemnion
Owner

mnemnion commented Oct 5, 2024

The first line of Limitations and Quirks in the README is "No Unicode support to speak of", so I feel like that part, at least, is covered.

I suppose I could suggest the use of alternates as a poor man's character set. As for é? not working correctly, the better solution is to fix it, which I hope to find some time to do soon.

@NJdevPro

NJdevPro commented Oct 10, 2024

> The first line of Limitations and Quirks in the README is "No Unicode support to speak of", so I feel like that part, at least, is covered.

I disagree. There are plenty of encodings other than Unicode that support those characters, like ISO-8859-1 or Windows-1252. These are still very widely used (think tens of millions of people) in Europe and elsewhere by non-English-speaking users.

@mnemnion
Owner

So the interesting thing about that is that mvzr will correctly handle any and all one-byte legacy encodings.

Your only problem at that point is that Zig will encode string literals as UTF-8 and nothing else. mvzr does not consider other formats valid input, but let's say you know you'll be parsing Latin-1 and you want á in a character set: you can add it as the escape [\xe1], and this will work fine. If mvzr instead sees the literal byte 0xe1 in a character set, it will assume that it's seeing multi-byte UTF-8 and won't compile (in any other part of a regex, it's fine).

mvzr will never support other encodings besides UTF-8, to be clear. Non goal. But with a bit of care you can certainly write regexen which will parse one-byte character sets properly.
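The difference between the two encodings of á can be sketched with the standard library alone. Nothing below calls mvzr; the `pattern` and `haystack` constants are hypothetical and only show how the escape and the raw byte might be spelled in Zig source, on the assumption that the backslash has to be doubled for the escape to reach the regex compiler intact.

```zig
const std = @import("std");

pub fn main() !void {
    const codepoint: u21 = 0xE1; // U+00E1, á

    // Latin-1 (ISO-8859-1) stores code points 0x00..0xFF as one byte each.
    const latin1_byte: u8 = @intCast(codepoint);
    std.debug.print("Latin-1: 0x{X:0>2}\n", .{latin1_byte}); // 0xE1

    // UTF-8 needs two bytes for the same code point.
    var buf: [4]u8 = undefined;
    const n = try std.unicode.utf8Encode(codepoint, &buf);
    std.debug.print("UTF-8:  ", .{});
    for (buf[0..n]) |b| std.debug.print(" 0x{X:0>2}", .{b}); // 0xC3 0xA1
    std.debug.print("\n", .{});

    // Assumption: "\\xe1" leaves a backslash escape in the pattern text,
    // while "\xe1" asks Zig to emit the raw byte 0xE1.
    const pattern = "[\\xe1]"; // six ASCII characters
    const haystack = "\xe1"; // one Latin-1 byte, not valid UTF-8 on its own
    std.debug.print("{d} {d}\n", .{ pattern.len, haystack.len }); // 6 1
}
```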

@AS400JPLPC
Author

AS400JPLPC commented Oct 10, 2024

hello:

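const std = @import("std");
const freg = @import("mvzr"); // assumption: the module name depends on your build setup

/// Returns true if `pattern` compiles and matches `testval`.
pub fn isMatch(testval: []const u8, pattern: []const u8) bool {
    // compile() yields null for an invalid pattern; treat that as "no match".
    const regex = freg.compile(pattern) orelse return false;
    return regex.isMatch(testval);
}

pub fn main() void {
    // Each accented letter is listed as an alternate, standing in for a character set.
    std.debug.print("MVZR pè1CéàaÇ test {}\r\n", .{
        isMatch("pè1CéàaÇ", "^[a-zA-Z]{1}(\\é|\\à|\\è|\\Ç|[a-zA-Z0-9]){1,7}$"),
    });
}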

MVZR pè1CéàaÇ test true

With UTF-8 input, the result is correct.
