Unicode lexing #14

Open
LeventErkok opened this issue Jan 17, 2025 · 6 comments · May be fixed by #15
Comments

LeventErkok commented Jan 17, 2025

@amesgen

phadej/cabal-extras#131 lists a few cases where Unicode characters break lexing. Would it be possible to add proper support for them?

amesgen commented Jan 18, 2025

Concrete examples: lexerPass0 fails on the following snippets (which GHC can
parse just fine), with the respective output:

  1. --
    [
        ( Commentstart, ( Pos { char = 0, line = 1, column = 1 }, "--" ) ),
        ( ErrorToken, ( Pos { char = 2, line = 1, column = 3 }, " " ) ),
        ( TheRest, ( Pos { char = 3, line = 1, column = 4 }, "" ) )
    ]
  2. ﬧ :: Int
    [
        ( ErrorToken, ( Pos { char = 0, line = 1, column = 1 }, "" ) ),
        ( TheRest, ( Pos { char = 0, line = 1, column = 1 }, "ﬧ :: Int " ) )
    ]

Owner

yav commented Jan 22, 2025

The lexer does support Unicode, but Unicode does not categorize these letters as either upper or lower case.
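To make that concrete, here is a minimal sketch using only `Data.Char` from base. It assumes the letter in the second snippet is U+FB27 (or a similar Hebrew presentation-form letter); `sample` and `caseInfo` are made-up names for illustration.

```haskell
import Data.Char (generalCategory, isLower, isUpper)

-- Assumption: the failing letter is U+FB27 (HEBREW LETTER WIDE RESH).
sample :: Char
sample = '\xFB27'

-- Report (isUpper, isLower, general category name) for a character.
caseInfo :: Char -> (Bool, Bool, String)
caseInfo c = (isUpper c, isLower c, show (generalCategory c))
```

Under that assumption, `caseInfo sample` evaluates to `(False, False, "OtherLetter")`: the character is a letter, yet a lexer that only asks "upper or lower?" has no class to put it in.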

I think it'd be fairly easy to copy what GHC does for the various unusual character classes. For reference, here's the full mapping:

                  UppercaseLetter       -> upper
                  LowercaseLetter       -> lower
                  TitlecaseLetter       -> upper
                  ModifierLetter        -> uniidchar -- see #10196
                  OtherLetter           -> lower -- see #1103
                  NonSpacingMark        -> uniidchar -- see #7650
                  SpacingCombiningMark  -> other_graphic
                  EnclosingMark         -> other_graphic
                  DecimalNumber         -> digit
                  LetterNumber          -> digit
                  OtherNumber           -> digit -- see #4373
                  ConnectorPunctuation  -> symbol
                  DashPunctuation       -> symbol
                  OpenPunctuation       -> other_graphic
                  ClosePunctuation      -> other_graphic
                  InitialQuote          -> other_graphic
                  FinalQuote            -> other_graphic
                  OtherPunctuation      -> symbol
                  MathSymbol            -> symbol
                  CurrencySymbol        -> symbol
                  ModifierSymbol        -> symbol
                  OtherSymbol           -> symbol
                  Space                 -> space
                  _other                -> non_graphic
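
The mapping above can be sketched as a plain function over `Data.Char.generalCategory`. This is only an illustration, not the code in GHC or in this library: `CharClass` and `classify` are hypothetical names, and the sketch applies the table to every character, whereas a real lexer would likely treat ASCII separately.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical character classes, named after the right-hand sides
-- of the mapping quoted above.
data CharClass
  = Upper
  | Lower
  | UniIdChar
  | OtherGraphic
  | Digit
  | Symbol
  | Whitespace
  | NonGraphic
  deriving (Eq, Show)

-- | Classify a character by its Unicode general category, following
-- the table line by line.
classify :: Char -> CharClass
classify c = case generalCategory c of
  UppercaseLetter      -> Upper
  LowercaseLetter      -> Lower
  TitlecaseLetter      -> Upper
  ModifierLetter       -> UniIdChar
  OtherLetter          -> Lower
  NonSpacingMark       -> UniIdChar
  SpacingCombiningMark -> OtherGraphic
  EnclosingMark        -> OtherGraphic
  DecimalNumber        -> Digit
  LetterNumber         -> Digit
  OtherNumber          -> Digit
  ConnectorPunctuation -> Symbol
  DashPunctuation      -> Symbol
  OpenPunctuation      -> OtherGraphic
  ClosePunctuation     -> OtherGraphic
  InitialQuote         -> OtherGraphic
  FinalQuote           -> OtherGraphic
  OtherPunctuation     -> Symbol
  MathSymbol           -> Symbol
  CurrencySymbol       -> Symbol
  ModifierSymbol       -> Symbol
  OtherSymbol          -> Symbol
  Space                -> Whitespace
  _                    -> NonGraphic  -- separators, control, format, unassigned, ...
```

With this, `classify '\xFB27'` gives `Lower` (category OtherLetter), so such a letter could start a variable name, and `classify '\x00A0'` (no-break space, category Space) gives `Whitespace`.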

Owner

yav commented Jan 22, 2025

I have the basics of this working locally, but I want to do a bit of cleanup in how the code is emitted. I'll make a PR later today or in the next few days.

@LeventErkok
Author

Thanks Iavor!

amesgen linked a pull request Jan 22, 2025 that will close this issue

amesgen commented Jan 22, 2025

Ah, I just started working on this an hour ago and didn't see your comment here in between, a classic race condition 😅 I opened #15 with my approach; feel free to close it, of course 👍

I tested that it fixes phadej/cabal-extras#131 as expected.

Owner

yav commented Jan 22, 2025

Ah, no worries. I think your changes look good; I'll make separate tickets for the other changes I was thinking of.
