Latin letters #3

AS400JPLPC · 2024-08-25T06:17:33Z

Will there ever be support for Latin letters such as é, à, ç, ô, etc.? Thank you.

mnemnion · 2024-08-25T13:47:45Z

Full support in mvzr is unlikely, unfortunately.

It's something I can add if and when allocation is possible again at comptime. But the scheme mvzr uses to recognize a character set is suitable only for ASCII characters: it uses two u64 bitmasks to cover the low and high ranges of ASCII.
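For readers unfamiliar with the technique, here is a minimal sketch of a two-mask ASCII set. The `AsciiSet` type and its `add` and `contains` functions are hypothetical names made up for illustration; this is not mvzr's actual implementation, only the general idea of one bit per ASCII byte.

```zig
const std = @import("std");

/// Hypothetical ASCII-only character set: one bit per byte value 0..127,
/// split across two 64-bit masks. Illustrative only, not mvzr's real type.
const AsciiSet = struct {
    low: u64 = 0, // bits for bytes 0..63
    high: u64 = 0, // bits for bytes 64..127

    fn add(self: *AsciiSet, c: u8) void {
        std.debug.assert(c < 128); // bytes above ASCII have no bit to set
        const shift: u6 = @truncate(c); // c mod 64
        const bit = @as(u64, 1) << shift;
        if (c < 64) {
            self.low |= bit;
        } else {
            self.high |= bit;
        }
    }

    fn contains(self: AsciiSet, c: u8) bool {
        if (c >= 128) return false; // e.g. any byte of a multi-byte UTF-8 sequence
        const shift: u6 = @truncate(c);
        const bit = @as(u64, 1) << shift;
        const mask = if (c < 64) self.low else self.high;
        return (mask & bit) != 0;
    }
};

test "two masks cover ASCII and nothing else" {
    var set = AsciiSet{};
    set.add('a');
    set.add('0');
    try std.testing.expect(set.contains('a'));
    try std.testing.expect(set.contains('0'));
    try std.testing.expect(!set.contains('b'));
    try std.testing.expect(!set.contains(0xC3)); // first byte of é in UTF-8
}
```

The whole set is 16 bytes and needs no allocation, which is the appeal. The cost is that 128 bits cannot describe anything outside ASCII.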

When you generalize this technique, you get runeset. RuneSet is specifically designed for arbitrary sets of UTF-8 codepoints, which would cover composed Latin characters (see Unicode equivalence for more about what that means). But creating a RuneSet requires allocation, and being a stack-based, no-allocation library is an important part of mvzr's design space.

I have some preliminary work done on a pattern-matching library which incorporates the RuneSet, but I don't expect it to be done any time soon. Maybe this year, but not next month.

Meanwhile, I'm afraid that the workaround of saying "(é|î)+", or what have you, is all that mvzr will support.

I'll leave this issue open to track the feature, because I certainly agree that it would be a good thing to support at least codepoint-sets. Sets of arbitrary grapheme clusters are an extraordinarily difficult feature, and the regex libraries I know of which do support that, do so by translating the input stream into 32 bits per codepoint, which will never be something mvzr requires or does.

One more observation: modifiers will work correctly on a multibyte character, or indeed a grapheme cluster, if wrapped in parentheses: (ç)? will match all or nothing of a cedilla, not just the last byte. That is something which I might be able to do automatically in the compiler, but not, alas, full character sets.
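To make the byte-level picture concrete, here is a small sketch using only the Zig standard library (no mvzr calls). The helper name `leadSequenceLen` is made up for illustration. It shows that ç occupies two bytes in UTF-8, which is why an ungrouped modifier ends up attached to the final byte only.

```zig
const std = @import("std");

/// Hypothetical helper: how many bytes the first UTF-8 sequence in `s` spans.
fn leadSequenceLen(s: []const u8) !u3 {
    return std.unicode.utf8ByteSequenceLength(s[0]);
}

test "ç is two bytes, so a bare modifier binds to the second one" {
    const cedilla = "ç"; // Zig source is UTF-8, so this literal holds 2 bytes
    try std.testing.expectEqual(@as(usize, 2), cedilla.len);
    try std.testing.expectEqual(@as(u8, 0xC3), cedilla[0]); // lead byte
    try std.testing.expectEqual(@as(u8, 0xA7), cedilla[1]); // continuation byte
    try std.testing.expectEqual(@as(u3, 2), try leadSequenceLen(cedilla));

    // A byte-oriented matcher reads "ç?" as: 0xC3, then an optional 0xA7.
    // Grouping as "(ç)?" makes the whole two-byte sequence optional instead.
}
```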

NJdevPro · 2024-10-05T06:40:06Z

Perhaps this limitation and its workaround should be added to the documentation/README under "Quirks", because otherwise people may run into unpleasant surprises.
(Otherwise, great work!)

mnemnion · 2024-10-05T18:14:26Z

The first line of Limitations and Quirks in the README is "No Unicode support to speak of", so I feel like that part, at least, is covered.

I suppose I could suggest the use of alternates as a poor man's character set. As for é? not working correctly, the better solution is to fix it, which I hope to find some time to do soon.

NJdevPro · 2024-10-10T11:08:06Z

> The first line of Limitations and Quirks in the README is "No Unicode support to speak of", so I feel like that part, at least, is covered.

I disagree. There are plenty of encodings other than Unicode that support those characters, such as ISO-8859-1 or Windows-1252. These are still very widely used (think tens of millions of people) in Europe and elsewhere by non-English-speaking users.

mnemnion · 2024-10-10T16:40:07Z

So the interesting thing about that is that mvzr will correctly handle any and all one-byte legacy encodings.

Your only problem at that point is that Zig will encode it as UTF-8 and nothing else. mvzr does not consider other formats valid input, but say you know you'll be parsing Latin-1 and you want á in a character set: you can add it as [\xe1], and this will work fine. If, on the other hand, mvzr sees the literal byte 0xe1 in a character set, it will assume that it's seeing multi-byte UTF-8 and won't compile (in any other part of a regex it's fine).

mvzr will never support other encodings besides UTF-8, to be clear. Non goal. But with a bit of care you can certainly write regexen which will parse one-byte character sets properly.
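A sketch of the distinction, using only Zig string literals (nothing here calls mvzr, and the constant names are made up for illustration). As I read the point above, the pattern text should carry the escape itself, which in a Zig literal means doubling the backslash. A single backslash lets Zig resolve the escape and puts the raw byte into the pattern.

```zig
const std = @import("std");

// Pattern text with the escape intact: the regex compiler would receive
// the six bytes [ \ x e 1 ] and can interpret \xe1 itself.
const escaped_set = "[\\xe1]";

// Here Zig resolves the escape first, so the pattern holds the raw byte 0xE1.
const raw_set = "[\xe1]";

// "café" as Latin-1 bytes: é is the single byte 0xE9.
const latin1_cafe = "caf\xe9";

// The same word typed directly in (UTF-8) Zig source: é is 0xC3 0xA9.
const utf8_cafe = "café";

test "escaped pattern text versus raw bytes" {
    try std.testing.expectEqual(@as(usize, 6), escaped_set.len);
    try std.testing.expectEqual(@as(usize, 3), raw_set.len);
    try std.testing.expectEqual(@as(u8, 0xE1), raw_set[1]);

    try std.testing.expectEqual(@as(usize, 4), latin1_cafe.len);
    try std.testing.expectEqual(@as(usize, 5), utf8_cafe.len);
    try std.testing.expect(!std.unicode.utf8ValidateSlice(latin1_cafe));
    try std.testing.expect(std.unicode.utf8ValidateSlice(utf8_cafe));
}
```

The haystack side matters just as much: a set written for Latin-1 bytes only makes sense against input that really is Latin-1, such as `latin1_cafe` above, not the UTF-8 form.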

AS400JPLPC · 2024-10-10T19:10:18Z

hello:

// `freg` is assumed to be the imported mvzr module.
pub fn isMatch(testval: []const u8, pattern: []const u8) bool {
    const maybe_regex = freg.compile(pattern);
    if (maybe_regex) |regex| {
        return regex.isMatch(testval);
    } else return false;
}

    std.debug.print("MVZR pè1CéàaÇ test  {} \r\n", .{
        isMatch("pè1CéàaÇ", "^[a-zA-Z]{1}(\\é|\\à|\\è|\\Ç|[a-zA-Z0-9]){1,7}$"),
    });

MVZR pè1CéàaÇ test true

With UTF-8 input, the result is correct.
