add custom JULIA normalization? #11

stevengj · 2014-07-18T15:41:04Z

For JuliaLang/julia#5903. If utf8proc can have LUMP, then libmojibake can have JULIA, unless we want to keep this separate from libmojibake.

tonyhffong · 2014-11-05T08:07:57Z

+1

StefanKarpinski · 2014-11-05T11:35:55Z

Yeah, it seems like we really need this. Unfortunately, the standardized normalizations just don't cut it.

tonyhffong · 2014-11-05T12:08:43Z

A JULIA mode in base/utf8proc.jl or in libmojibake? It seems utf8proc is better, though we would probably need to apply it before feeding code to julia-parser.scm.

StefanKarpinski · 2014-11-05T12:13:09Z

How realistic is it to actually upstream all the changes we've made to utf8proc? I would guess that a new normalization mode would be fairly easy to keep separate from other changes.

tonyhffong · 2014-11-05T14:21:47Z

What about this example (referring to the proposed new brackets in Julia):

a1 = ⟨c,d⟩ # canonical \langle and \rangle
a2 = ⟪c,d⟫ # using \lAngle and \rAngle (legibility preference)
a3 = ⟪c,d⟩ # unmatched brackets should throw parse error
b1 = "⟪c,d⟫"
b2 = "❰c,d❱" # dingbat angular brackets
b3 = "〈c,d〉" # full-width angular brackets U+3008, U+3009

I'd prefer normalizing the angular brackets for a1 and a2 so they parse, and leaving the characters in the string literals untouched.

This means the lexer/parser needs to control the normalization, at least for syntactically important symbols. Or is there a hook for that already?

StefanKarpinski · 2014-11-05T14:29:32Z

This could be handled without the parser needing to know about it by having a mapping from each bracket to its pair and just raising an error if the parser finds a pair that isn't really a pair. If the Unicode code points are always near each other, the check could just be for that. Of course, this still implies that normalization has to happen after that check, and thus after lexing at least. So the sequence would be: lex, check, normalize, parse. That seems like a lot of trouble to prevent people from using unpaired Unicode brackets that happen to look similar. Maybe not worth it.
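The "check, then normalize" ordering described above can be sketched in a few lines of Python. This is only an illustration, not the Julia parser or anything from the PR: the `PAIRS` and `CANONICAL` tables and both function names are hypothetical, and the sketch ignores string literals and comments entirely.

```python
# Hypothetical table: each opening bracket maps to its one valid closer.
PAIRS = {"(": ")", "[": "]", "{": "}", "⟨": "⟩", "⟪": "⟫"}
CLOSERS = set(PAIRS.values())

# Hypothetical normalization: fold the double angle brackets onto the single ones.
CANONICAL = str.maketrans({"⟪": "⟨", "⟫": "⟩"})


def check_brackets(source):
    """Raise SyntaxError if a bracket is closed by a non-matching bracket."""
    stack = []
    for pos, ch in enumerate(source):
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack:
                raise SyntaxError(f"unexpected {ch!r} at offset {pos}")
            opener = stack.pop()
            if PAIRS[opener] != ch:
                raise SyntaxError(f"{opener!r} closed by {ch!r} at offset {pos}")
    if stack:
        raise SyntaxError(f"unclosed {stack[-1]!r}")


def check_then_normalize(source):
    """Check pairing on the raw text first, and only then fold look-alikes."""
    check_brackets(source)
    return source.translate(CANONICAL)
```

Running the check before the fold is what lets a mixed pair such as `⟪c,d⟩` be rejected; after normalization both halves would already look like a valid `⟨…⟩` pair.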

tonyhffong · 2014-11-05T14:50:22Z

It isn't so bad, actually. The (lex, check) part of that is already in place in my PR, albeit manually and most likely non-exhaustively. I was brought here wondering if some of that work could be off-loaded to utf8proc, but it probably requires way too much finessing. So perhaps just an incremental change like so would work:

1. lex, check (with hand-rolled UTF-8 normalization for brackets and perhaps some critical symbols, like '=' and ':')
2. normalize (using utf8proc) any identifier tokens to impose our view of confusable symbols
3. parse, as usual

stevengj · 2014-11-05T14:52:51Z

@StefanKarpinski, the changes so far aren't too radical. The first obstacle is that we need to get copyright assignments from all of the contributors in order for upstream to consider a patch. After that, I don't know what their patch-review process will be like, but I'm guessing it will be a bit on the slow side based on past interactions.

nalimilan · 2014-11-05T14:55:08Z

@stevengj Have you been able to get a reply from them? I didn't get any. I can help with asking for copyright assignments if that is useful; I'd rather not have to package libmojibake in addition to utf8proc in Fedora. :-)

StefanKarpinski · 2014-11-05T15:15:28Z

I think that copyright assignment is not a good idea; hopefully a contributor license agreement is all they actually require. Copyright assignment isn't even legally valid in many countries, e.g. Germany.

stevengj · 2014-11-05T15:16:50Z

You're right, Stefan, it actually seems to be just a contributor license.

stevengj · 2014-11-25T18:57:12Z

(Update: the current changes in libmojibake, mainly Unicode 7 support, have been submitted upstream with CLAs.)

jiahao · 2015-07-01T16:23:50Z

A quick note that Unicode provides a list of confusable characters as part of UTS #39 (Unicode Security Mechanisms), which also gives recommendations for the characters allowed in identifier names, motivated by security concerns.

stevengj · 2015-07-01T18:11:03Z

@jiahao, I think we explicitly decided to reject these recommendations, along with NFKC normalization, in JuliaLang/julia#5434, in order to distinguish a wider array of mathematical symbols (e.g. 𝐇 vs. H) and to allow things like x⁽²⁾ as identifiers. So, we are on our own in deciding whether to normalize e.g. fullwidth Latin letters.
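The trade-off is easy to see with Python's standard `unicodedata` module, used here only as a stand-in for utf8proc. The helper names `nfc` and `nfkc` are made up for this sketch. NFKC applies compatibility mappings, so it collapses exactly the distinctions mentioned above, while NFC leaves them alone.

```python
import unicodedata


def nfc(text):
    """Canonical composition only: compatibility variants stay distinct."""
    return unicodedata.normalize("NFC", text)


def nfkc(text):
    """Compatibility composition: font, superscript, and width variants fold."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "mathematical bold H": "\U0001D407",
    "x with superscript (2)": "x\u207D\u00B2\u207E",
    "fullwidth Latin A": "\uFF21",
}

for label, text in samples.items():
    print(f"{label}: NFC unchanged = {nfc(text) == text}, NFKC = {nfkc(text)!r}")
```

Under NFKC the bold 𝐇 becomes a plain `H` and `x⁽²⁾` becomes `x(2)`, which is no longer a single identifier. Fullwidth letters fold under NFKC as well, so rejecting NFKC leaves that case to be handled, or not, by a custom rule.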

stevengj · 2016-11-29T18:01:33Z

Upon reflection, I think the best thing would be to make this pluggable, by allowing the caller to supply a custom mapping function that is applied to the codepoints after normalization.
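As a sketch of what "pluggable" could look like, the following Python mimics the shape of the idea as described in this comment: normalize first, then pass every code point through a caller-supplied function. It is not utf8proc's actual API. `normalize_custom`, `fold_confusables`, and the two-entry `CONFUSABLES` table are hypothetical names and data chosen for illustration.

```python
import unicodedata


def normalize_custom(text, custom_map=None, form="NFC"):
    """Normalize `text`, then apply an optional code point -> code point hook."""
    normalized = unicodedata.normalize(form, text)
    if custom_map is None:
        return normalized
    return "".join(chr(custom_map(ord(ch))) for ch in normalized)


# Illustrative confusables only; a real table would be a project decision.
CONFUSABLES = {
    0x025B: 0x03B5,  # LATIN SMALL LETTER OPEN E -> GREEK SMALL LETTER EPSILON
    0x00B5: 0x03BC,  # MICRO SIGN -> GREEK SMALL LETTER MU
}


def fold_confusables(codepoint):
    """Return the preferred code point, or the input if it has no mapping."""
    return CONFUSABLES.get(codepoint, codepoint)
```

A caller would then write something like `normalize_custom(identifier, fold_confusables)`. Keeping the table outside the normalizer means the library does not need to carry any language-specific policy.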

StefanKarpinski · 2016-11-29T18:04:13Z

Providing a "reasonable" set of confusable mathematical characters won't be too crazy though.

stevengj · 2016-11-30T15:40:52Z

Closed by #89.

stevengj added the enhancement label Jul 18, 2014

This was referenced Nov 2, 2014

infix notation for more functions JuliaLang/julia#4498

Closed

[WIP] new brackets: angle, Brack, Brace JuliaLang/julia#8892

Closed

tonyhffong mentioned this issue Nov 24, 2014

Better identification of invalid Unicode characters JuliaLang/julia#9127

Closed

jiahao mentioned this issue Jul 1, 2015

Add "unusual Julia features" section in the manual noteworthy diffs. JuliaLang/julia#11966

Closed

stevengj mentioned this issue Nov 29, 2016

Wrong LaTeX-Unicode mapping of \varepsilon JuliaLang/julia#14751

Closed

stevengj mentioned this issue Nov 29, 2016

new utf8proc_map_custom for hooking in user-defined custom mappings #89

Merged

stevengj closed this as completed in #89 Nov 30, 2016

stevengj mentioned this issue Nov 30, 2016

WIP: custom Unicode normalization for Julia identifiers JuliaLang/julia#19464

Merged

stevengj mentioned this issue Jan 29, 2021

Should U+22A5 and U+27C2 be equivalent? #218

Closed

mrluc mentioned this issue Apr 10, 2022

Compiler regression on 1.14 when using unicode as variables names elixir-lang/elixir#11750

Closed
