Unicode lexing #14

LeventErkok · 2025-01-17T23:00:14Z

lists a few cases where Unicode characters cause lexing issues. Would it be possible to add proper support for this?

amesgen · 2025-01-18T11:18:47Z

Concrete examples: lexerPass0 fails on the following snippets (which GHC can parse just fine), with the respective output:

-- 猫

[
    ( Commentstart, ( Pos { char = 0, line = 1, column = 1 }, "--" ) ),
    ( ErrorToken, ( Pos { char = 2, line = 1, column = 3 }, " " ) ),
    ( TheRest, ( Pos { char = 3, line = 1, column = 4 }, "猫 " ) )
]

ﬧ :: Int

[
    ( ErrorToken, ( Pos { char = 0, line = 1, column = 1 }, "" ) ),
    ( TheRest, ( Pos { char = 0, line = 1, column = 1 }, "ﬧ :: Int " ) )
]

yav · 2025-01-22T00:19:38Z

The lexer does support Unicode, but these letters are not categorized as uppercase or lowercase by Unicode.
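
A minimal sketch of how to check this with `Data.Char` from `base`. The helper name `letterInfo` is made up for illustration and is not part of the lexer:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isLower, isUpper)

-- | Hypothetical helper: the Unicode general category of a character,
-- together with the results of the isUpper and isLower predicates.
letterInfo :: Char -> (GeneralCategory, Bool, Bool)
letterInfo c = (generalCategory c, isUpper c, isLower c)
```

In GHCi, `letterInfo '猫'` evaluates to `(OtherLetter,False,False)`: the character is a letter, but neither uppercase nor lowercase, so a lexer that only asks "upper or lower?" has no class for it.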

I think it'd be fairly easy to copy what GHC does for the various unusual character classes. For reference here's the full mapping:

                  UppercaseLetter       -> upper
                  LowercaseLetter       -> lower
                  TitlecaseLetter       -> upper
                  ModifierLetter        -> uniidchar -- see #10196
                  OtherLetter           -> lower -- see #1103
                  NonSpacingMark        -> uniidchar -- see #7650
                  SpacingCombiningMark  -> other_graphic
                  EnclosingMark         -> other_graphic
                  DecimalNumber         -> digit
                  LetterNumber          -> digit
                  OtherNumber           -> digit -- see #4373
                  ConnectorPunctuation  -> symbol
                  DashPunctuation       -> symbol
                  OpenPunctuation       -> other_graphic
                  ClosePunctuation      -> other_graphic
                  InitialQuote          -> other_graphic
                  FinalQuote            -> other_graphic
                  OtherPunctuation      -> symbol
                  MathSymbol            -> symbol
                  CurrencySymbol        -> symbol
                  ModifierSymbol        -> symbol
                  OtherSymbol           -> symbol
                  Space                 -> space
                  _other                -> non_graphic
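
The table above could be expressed as a plain Haskell function along these lines. This is only a sketch: the `CharClass` type, its constructor names, and `classify` are invented here for illustration and are not taken from GHC or from the lexer. It also applies the table to every `Char`, whereas a real lexer would likely handle ASCII separately (for example, `'_'` is `ConnectorPunctuation` and would come out as a symbol here).

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical lexer character classes, named after the right-hand
-- sides of the mapping quoted above.
data CharClass
  = ClsUpper
  | ClsLower
  | ClsUniIdChar
  | ClsOtherGraphic
  | ClsDigit
  | ClsSymbol
  | ClsSpace
  | ClsNonGraphic
  deriving (Eq, Show)

-- | Map a character to a lexer class via its Unicode general category,
-- following the quoted table line by line.
classify :: Char -> CharClass
classify c = case generalCategory c of
  UppercaseLetter      -> ClsUpper
  LowercaseLetter      -> ClsLower
  TitlecaseLetter      -> ClsUpper
  ModifierLetter       -> ClsUniIdChar
  OtherLetter          -> ClsLower
  NonSpacingMark       -> ClsUniIdChar
  SpacingCombiningMark -> ClsOtherGraphic
  EnclosingMark        -> ClsOtherGraphic
  DecimalNumber        -> ClsDigit
  LetterNumber         -> ClsDigit
  OtherNumber          -> ClsDigit
  ConnectorPunctuation -> ClsSymbol
  DashPunctuation      -> ClsSymbol
  OpenPunctuation      -> ClsOtherGraphic
  ClosePunctuation     -> ClsOtherGraphic
  InitialQuote         -> ClsOtherGraphic
  FinalQuote           -> ClsOtherGraphic
  OtherPunctuation     -> ClsSymbol
  MathSymbol           -> ClsSymbol
  CurrencySymbol       -> ClsSymbol
  ModifierSymbol       -> ClsSymbol
  OtherSymbol          -> ClsSymbol
  Space                -> ClsSpace
  _                    -> ClsNonGraphic
```

With this mapping, `classify '猫'` is `ClsLower`, so the character can start a variable-like identifier instead of producing an error token.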

yav · 2025-01-22T19:28:59Z

I have the basics of this working locally, but I want to do a bit of cleanup in how the code is emitted. I'll make a PR later today or in the next few days.

LeventErkok · 2025-01-22T19:30:37Z

Thanks Iavor!

amesgen · 2025-01-22T19:40:13Z

Ah, I just started working on this an hour ago and didn't see your comment here in between, a classic race condition 😅 I opened #15 with my approach; feel free to close it, of course 👍

I tested that it fixes phadej/cabal-extras#131 as expected.

yav · 2025-01-22T19:56:49Z

Ah, no worries. I think your changes look good. I'll make separate tickets for the other changes I was thinking of.

amesgen linked a pull request Jan 22, 2025 that will close this issue

Implement GHC's relaxed rules for unicode symbols #15

Open

amesgen mentioned this issue Jan 22, 2025

cabal-docspec: Workaround for lacking unicode support in haskell-lexer phadej/cabal-extras#132

Closed
