Provide macros corresponding to the Unicode general categories #126
What have you tried? I thought this was as simple as a rule meaning one uppercase letter and then zero or more mixed-case letters.
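Such a rule could be sketched in an Alex specification like this — a minimal, ASCII-only sketch assuming the usual example macro definitions (`$upper`, `$alpha`), the `basic` wrapper, and a hypothetical `ConId` token constructor:

```
-- ASCII-only macro definitions, as in the Alex manual's examples:
$upper = [A-Z]
$alpha = [a-zA-Z]

tokens :-
  $upper $alpha*   { \s -> ConId s }
```

The limitation discussed below is that `[A-Z]` only covers ASCII, not the full Unicode uppercase category.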
The Unicode uppercase letter category is quite a bit larger than ASCII [A-Z]. It might make sense to have macros both for the predicates Haskell programmers are used to from Data.Char and for the Unicode general categories.
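To illustrate the gap (an illustrative sketch, not from the thread): Data.Char's Unicode-aware predicates accept far more than the ASCII range, as the Greek capital gamma shows:

```haskell
import Data.Char (isUpper, isAsciiUpper, generalCategory)

main :: IO ()
main = do
  -- 'Γ' (U+0393) belongs to the Unicode general category Lu:
  print (generalCategory 'Γ')  -- UppercaseLetter
  print (isUpper 'Γ')          -- True
  -- ...but it is outside the ASCII range A-Z:
  print (isAsciiUpper 'Γ')     -- False
```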
Ah, OK. I think my response shows that your initial question was not specific enough. If you specify exactly what you want to do and why the current functionality is insufficient, you will get much more useful responses than my one above.
Right. :) I should have done that first. :) I want to detect Haskell identifiers. For that I need the following character sets:
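(The original list did not survive extraction. For orientation, the Haskell 2010 Report's identifier grammar needs sets along these lines, sketched here as ASCII-only Alex macros; the Report actually asks for the full Unicode classes, which is exactly what this issue requests:)

```
-- ASCII approximations of the Report's character classes:
$small  = [a-z \_]          -- Report: any Unicode lowercase letter, plus _
$large  = [A-Z]             -- Report: any Unicode upper- or titlecase letter
$digit  = [0-9]             -- Report: any Unicode digit
$idchar = [$small $large $digit \']

@varid  = $small $idchar*
@conid  = $large $idchar*
```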
Forgive me for being pedantic here, but that is not what you are asking for. You rejected my suggestion above. Furthermore, "symbols and punctuation" can mean different things in different programming languages, and even in different human languages, so there is no single solution. Maybe looking at the lexer for GHC itself will provide you some inspiration.
Thanks. Yeah, in the end I want a lexer that detects the same identifiers that GHC itself will lex. But I don't want to replicate GHC's strange Unicode workaround. If Alex could provide macros corresponding to the Unicode general categories, building the lexer would be quite easy.
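For concreteness, the requested feature might look something like this — purely hypothetical syntax: Alex does not provide these macros, and the names `$uppercaseLetter`/`$lowercaseLetter` are invented here for illustration:

```
-- HYPOTHETICAL: built-in macros for Unicode general categories
-- (these do not exist in Alex today).
@conid = $uppercaseLetter [$uppercaseLetter $lowercaseLetter]*
```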
I think my concern with lexing UTF-8 directly in Alex for Haskell source code was that the generated state machine might be huge. I didn't actually run that experiment, though; I'd be interested in the results.
I'm not sure what to look at in the generated Haskell files to see if that's a problem. |
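One rough way to gauge it (a sketch; the file names are illustrative, and `alex_table` is one of the transition-table definitions Alex emits into the generated module):

```shell
alex Lexer.x -o Lexer.hs        # generate the lexer (illustrative file names)
wc -c Lexer.hs                  # overall size of the generated module
grep -n '^alex_table' Lexer.hs  # locate the encoded transition table
```

Comparing the generated module's size with and without Unicode-wide character sets would give a first approximation of how much the state machine grows.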
Per #165, I would like to unfuse the UTF-8 and user-written automata to declutter the implementation, which we speculate is a bit confused because it might predate proper UTF-8 support. (Even better would be to then implement proper automaton composition, letting the user choose whether or not to fuse the automata when the underlying string is byte- rather than character-oriented, and to start exploring the proper categorical semantics of the language specs themselves! But I am getting starry-eyed and off-topic.) Back to the point: once things can work…
This will be very useful indeed! I just did something similar to #126 (comment) and wish such support existed.
Just want to add a few notes on this issue. (A bit of background: I'm following the Java SE 16 spec to write a parser for fun, so my knowledge below is based on my experience following that spec.) One workaround I tried is to let Alex accept a wider language. For example, Java forbids Unicode outside identifiers and literals, so I can take advantage of that fact: be specific only where Unicode cannot appear, and then deal with the remaining characters in the Haskell action code.
My key takeaways:
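The workaround above can be sketched as follows — an illustrative Haskell helper (names invented here) that re-checks a broadly matched lexeme with Data.Char, whose predicates cover the full Unicode categories; the rules below only roughly approximate Java's identifier definition:

```haskell
import Data.Char (isAlphaNum, isLetter)

-- Alex matches a deliberately wide set of characters; the action then
-- re-validates the lexeme in plain Haskell, where Unicode-aware
-- predicates are available.
validIdentifier :: String -> Bool
validIdentifier []     = False
validIdentifier (c:cs) = isStart c && all isPart cs
  where
    isStart x = isLetter x || x == '_' || x == '$'
    isPart  x = isAlphaNum x || x == '_' || x == '$'

main :: IO ()
main = do
  print (validIdentifier "変数")  -- True: non-ASCII letters are accepted
  print (validIdentifier "9abc")  -- False: cannot start with a digit
```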
Right now it seems very difficult to write a rule e.g. for words starting with an uppercase letter.