Unicode Properties #90

Open
srl295 opened this issue May 19, 2016 · 61 comments
Labels
c: text Component: case mapping, collation, properties Proposal Larger change requiring a proposal s: comment Status: more info is needed to move forward

Comments

@srl295
Member

srl295 commented May 19, 2016

https://github.com/srl295/es-unicode-properties

@srl295
Member Author

srl295 commented May 19, 2016

cc @mathiasbynens

@littledan
Member

Something that's been discussed is exposing these to RegExps. V8 does this currently behind a special flag, thanks to @hashseed's work. I don't know if a spec is written but I heard @goyakin may work on exposing properties through RegExps.

@hashseed

hashseed commented May 20, 2016

As @littledan mentioned, this is an experimental feature in V8. The comment in the regexp parser describes the current syntax we are using:

  // Parse the property class as follows:
  // - \pN with a single-character N is equivalent to \p{N}
  // - In \p{name}, 'name' is interpreted
  //   - either as a general category property value name.
  //   - or as a binary property name.
  // - In \p{name=value}, 'name' is interpreted as an enumerated property name,
  //   and 'value' is interpreted as one of the available property value names.
  // - Aliases in PropertyAlias.txt and PropertyValueAlias.txt can be used.
  // - Loose matching is not applied.

For example
/\p{East_Asian_Width=H}/u.test("\u20a9") // true

\P is the inverse of \p, so binary properties with "False" as property value can be expressed via \P.
For example
/\p{ASCII_Hex_Digit}/u.test("A") // true
/\P{ASCII_Hex_Digit}/u.test("A") // false
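
For readers trying this today: the sketch below uses the `\p{…}` / `\P{…}` forms that, as far as I know, later shipped in ES2018 and work without any flag in current engines. It is an illustration, not the experimental V8 behavior described above; to my understanding the single-character `\pN` shorthand and `East_Asian_Width` did not make it into the standard, so they are left out here.

```javascript
// Binary property: \p matches characters that have it, \P is the complement.
const hex = /\p{ASCII_Hex_Digit}/u;
const notHex = /\P{ASCII_Hex_Digit}/u;
console.log(hex.test("A"), notHex.test("A")); // true false

// General category, by short alias or by explicit name=value.
const upper = /^\p{Lu}$/u;
const upperLong = /^\p{General_Category=Uppercase_Letter}$/u;
console.log(upper.test("A"), upperLong.test("A")); // true true

// Enumerated property in name=value form.
const greek = /^\p{Script=Greek}+$/u;
console.log(greek.test("αβγ")); // true
```

Note that the `u` flag is required; without it, `\p` is just an escaped letter `p`.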

@mathiasbynens
Member

mathiasbynens commented May 20, 2016

For the record, the V8 flag @littledan mentioned is --harmony_regexp_property. Tests that show how the current implementation works: https://chromium.googlesource.com/v8/v8/+/master/test/mjsunit/harmony/regexp-property-exact-match.js


Is full compatibility with existing \p implementations a hard requirement? If I were implementing \p{…} in ES I explicitly wouldn’t support Is/In prefixes, shorthands, loose matching, property aliases, property value aliases, or whitespace around = / :. E.g. throw on /\p{In_Cyrillic_Sup}/u, /\p{Block=Cyrillic_Sup}/u and /\p{Block=Cyrillic_Supplementary}/u and only accept /\p{Block=Cyrillic_Supplement}/u which is the canonical block name. We have the opportunity to be strict here and encourage readable code; let’s do it.


Some related info: mathiasbynens/es-regexp-unicode-character-class-escapes#2

@goyakin Can we track your spec work somewhere (GitHub)?

@hashseed

hashseed commented May 20, 2016

We considered doing loose matching and having an "In" prefix for blocks, but having thought about it, we decided against both. Looking at Perl, it seems to be a good idea to be strict rather than overly ambiguous.
Your example would be /\p{Block=Cyrillic_Supplement}/u or /\p{blk=Cyrillic_Sup}/u. The reason to make the property name explicit is that there is ambiguity between Script and Block property value names. And honestly, stating it explicitly really should not hurt anyone.

@mathiasbynens
Member

mathiasbynens commented May 20, 2016

@hashseed Agreed; that is what the discussion I referenced concluded as well. I’ve now updated my example (originally intended to explain how aliases should throw only) to avoid confusion.

Note that in your example you’re still doing a form of loose matching, i.e. ignoring _. (The canonical block name is Cyrillic Supplement and not Cyrillic_Supplement.)

@hashseed

I thought the underscore is actually part of the name. That's what PropertyAliases.txt and PropertyValueAliases.txt as well as ICU suggest.

@mathiasbynens
Member

mathiasbynens commented May 21, 2016

As far as I can see, only PropertyValueAliases.txt suggests it. Blocks.txt has the block name with spaces instead of underscores. I’ve asked for clarification here: http://www.unicode.org/mail-arch/unicode-ml/y2016-m05/thread.html#79

@mathiasbynens
Member

I hope to present this as a stage 0 strawman at a future TC39 meeting.

After implementing support for \p{…} and \P{…} in my regular expression transpiler https://github.com/mathiasbynens/regexpu-core (online demo), I’ve started to work on a concrete spec proposal. Here’s an early draft: mathiasbynens/ecma262#1 Feedback welcome.

@hashseed

hashseed commented Jun 7, 2016

Thanks for following up on this, Mathias!

Having followed the unicode mail thread, I think I can get behind the idea of considering whitespace, hyphens and underscores as equivalent, when looking up property names and property value names including their aliases.

E.g. \p{Lowercase Letter} would be allowed just as well as \p{Lowercase-Letter} and \p{Ll}, but not \p{Lower case Letter}.

This would solve the conflict between Blocks.txt and PropertyValueAliases.txt.

@mathiasbynens
Member

mathiasbynens commented Jun 7, 2016

@hashseed There is another issue though: e.g. Blocks.txt has Superscripts and Subscripts, whereas PropertyValueAliases.txt has Superscripts_And_Subscripts, which is the canonical property value. Note the difference in casing of the letter a. To support \p{Block=Superscripts and Subscripts} in addition to \p{Block=Superscripts_And_Subscripts} we need case-insensitivity as well.

Would you be open to that, or would you rather stick to strict matching in that case?
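
To make the two options concrete, here is a minimal sketch of the lookup being discussed: a single space, hyphen, or underscore is treated as the same separator, and case-insensitivity is an opt-in switch. `KNOWN_NAMES`, `canonicalizeSeparators`, and `lookup` are hypothetical names for illustration; the set is a tiny stand-in for PropertyValueAliases.txt, not real data loading.

```javascript
// Hypothetical, tiny stand-in for the names in PropertyValueAliases.txt.
const KNOWN_NAMES = new Set([
  "Lowercase_Letter",
  "Ll",
  "Superscripts_And_Subscripts",
]);

// Map each single space, hyphen, or underscore to an underscore.
// Nothing is removed, so "Lower case Letter" stays distinct.
function canonicalizeSeparators(name) {
  return name.replace(/[ \-_]/g, "_");
}

// Returns the canonical name, or null when nothing matches.
function lookup(name, { ignoreCase = false } = {}) {
  const wanted = canonicalizeSeparators(name);
  for (const known of KNOWN_NAMES) {
    const same = ignoreCase
      ? known.toLowerCase() === wanted.toLowerCase()
      : known === wanted;
    if (same) return known;
  }
  return null;
}

console.log(lookup("Lowercase Letter"));  // "Lowercase_Letter"
console.log(lookup("Lower case Letter")); // null
console.log(lookup("Superscripts and Subscripts")); // null (casing of "and")
console.log(lookup("Superscripts and Subscripts", { ignoreCase: true }));
// "Superscripts_And_Subscripts"
```

With separator equivalence alone, the Blocks.txt spelling of this block still fails; it only resolves once case is ignored too.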

@srl295
Member Author

srl295 commented Jun 7, 2016

@mathiasbynens — thanks for your work on this. What's puzzling to me is why Blocks.txt is even being looked at here. It's for display names, not programmatic use. PropertyValueAliases.txt is the right place to find property value aliases — just as the response on the mailing list said.

I'm a definite -1 on leniency to match Blocks.txt — that's not what it's for. We should just match PropertyValueAliases.txt

@mathiasbynens
Member

mathiasbynens commented Jun 7, 2016

@srl295 It may not be what it’s for, but it would be a direct consequence of following http://unicode.org/reports/tr18/#RL1.2 which specifies that “matching of […] values must follow the Matching Rules from UAX44”, specifically http://unicode.org/reports/tr44/#Matching_Symbolic. (As stated, I’d be fine with not following that, and implementing strict matching instead — just explaining the reasoning here.)

What's puzzling to me is why Blocks.txt is even being looked at here. It's for display names, not programmatic use.

Yeah, that’s what I didn’t know when I started the thread. I’d be willing to bet that there are other developers wishing to use \p{…} in regexps that don’t know about this. Blocks.txt doesn’t seem like an illogical place to go looking for the proper block names, IMHO. Those devs would be surprised to find that \p{Block=Superscripts and Subscripts} doesn’t work. It doesn’t help that Blocks.txt also includes this:

# Note:   When comparing block names, casing, whitespace, hyphens,
#         and underbars are ignored.
#         For example, "Latin Extended-A" and "latin extended a" are equivalent.
#         For more information on the comparison of property values, 
#            see UAX #44: http://www.unicode.org/reports/tr44/

This is a problem that can be solved through proper developer documentation, of course. But taking all of it into consideration, I’m leaning towards supporting @hashseed’s suggestion + case-insensitivity.

@srl295
Member Author

srl295 commented Jun 7, 2016

@mathiasbynens UAX44-LM2 is of course a great reason to, quote, "Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E", unquote. So I'm +1 on that.

\p{Block=Superscripts and Subscripts} doesn't work

But it should work (and does in ICU), because of UAX44-LM2. Are there any names in Blocks.txt that wouldn't match PropertyValueAliases.txt given the leniency?

It does seem that both the Blocks.txt comment and UAX44 could be improved for some more clarity — discussing PropertyValueAliases.txt
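
For comparison, a simplified sketch of the kind of loose comparison UAX44-LM2 describes. This is an approximation under stated assumptions: it drops every hyphen rather than only medial ones, and it does not implement the U+1180 HANGUL JUNGSEONG O-E exception. `looseKey` and `looseEquals` are hypothetical helper names.

```javascript
// Simplified loose key: lowercase, then drop whitespace, underscores, hyphens.
// Assumption: all hyphens are dropped, and the Hangul exception is ignored.
function looseKey(name) {
  return name.toLowerCase().replace(/[\s_-]/g, "");
}

function looseEquals(a, b) {
  return looseKey(a) === looseKey(b);
}

// The example from the Blocks.txt header comment:
console.log(looseEquals("Latin Extended-A", "latin extended a")); // true
// Blocks.txt spelling vs. PropertyValueAliases.txt spelling:
console.log(
  looseEquals("Superscripts and Subscripts", "Superscripts_And_Subscripts")
); // true
// The flip side: very irregular spellings are accepted as well.
console.log(looseEquals("___lower C-A-S-E___", "Lowercase")); // true
```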

@mathiasbynens
Member

mathiasbynens commented Jun 7, 2016

@srl295 Have you seen mathiasbynens/ecma262#1 (comment)? It was the context for the above discussion.

But, it should work — because of UAX44-LM2.

Sure — if we decide to follow that. My initial spec draft included a variant of loose matching per UAX44-LM2 (minus non-ASCII hyphens and the is prefix) but we later decided to use strict matching instead.

Are there any names in Blocks.txt that wouldn't match PropertyValueAliases.txt given the leniency?

No. But note that this is also true for @hashseed’s suggestion combined with case-insensitivity (which is what I was proposing here), which would be a more strict solution than UAX44-LM2. I’d strongly prefer that over UAX44-LM2, at least for the initial spec text + implementations. We can always loosen up the matching algorithm later, but if we do it right from the start, there’s no going back.

@hashseed

hashseed commented Jun 8, 2016

If matching Blocks.txt is not really that important, I'm actually hesitant to follow UAX44-LM2 at all, including whitespace and underscore. If we simply follow UAX44-LM2, we end up with loose matching, which I thought we agreed on being a bad idea. The reason I mentioned for this is that we do not want to end up with regexps that read /\p{___lower C-A-S-E___}/ui. I don't see why we should carve out a subset of UAX44-LM2 instead of ignoring it altogether.

If we consider following UAX44-LM2 a bad idea, and there is no reason to care about matching Blocks.txt, then I'm in favor of being super strict and only matching what's listed in PropertyValueAliases.txt. We can explicitly state that in the spec text, and add a note about Blocks.txt. I think standardizing on underscore as separator is nicer than having this exception for Blocks.txt. Scripts.txt, for example, uses names with underscores. You could argue that either way could surprise users.

With that in place, we can still gather feedback from developers. If not having loose matching is an actual developer pain point, we can still address that in a future PR.
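
A quick way to see what strict matching means in practice is to probe which patterns an engine accepts. The sketch below reflects my understanding of what eventually shipped in ES2018 (exact, case-sensitive names; aliases allowed; no `Block` property); `isValidPattern` is a hypothetical helper, and results in an engine with different property support could differ.

```javascript
// Hypothetical helper: does this source compile as a Unicode-mode RegExp?
function isValidPattern(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (e) {
    return false;
  }
}

// Exact names and aliases are accepted.
console.log(isValidPattern("\\p{Script=Greek}"));     // true
console.log(isValidPattern("\\p{sc=Grek}"));          // true
console.log(isValidPattern("\\p{Lowercase_Letter}")); // true

// Loose spellings are rejected with a SyntaxError.
console.log(isValidPattern("\\p{script=greek}"));     // false
console.log(isValidPattern("\\p{Script = Greek}"));   // false
console.log(isValidPattern("\\p{Lowercase Letter}")); // false
console.log(isValidPattern("\\p{IsGreek}"));          // false
```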

@mathiasbynens
Member

Updated https://github.com/mathiasbynens/ecma262/pull/1/files to explicitly mention PropertyAliases.txt & PropertyValueAliases.txt.

@mathiasbynens
Member

There is now a standalone repo for this proposal: https://github.com/mathiasbynens/es-regex-unicode-property-escapes Let’s move the discussion over there.

@srl295
Member Author

srl295 commented Jun 10, 2016

@mathiasbynens OK. I think there's a lot more needed than just support within regex (as important as that is), especially getting the general property and other properties given a codepoint.

In ICU there's u_getIntPropertyValue, so

 … =  u_getIntPropertyValue( 'A', UCHAR_GENERAL_CATEGORY); // == U_UPPERCASE_LETTER (Lu)
 … =  u_getIntPropertyValue( 'A', UCHAR_SCRIPT); // == USCRIPT_LATIN (Latn)

etc. Not proposing this specific API, just trying to get the concept rolling.

@littledan
Member

@srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API.

@srl295
Member Author

srl295 commented Jun 10, 2016

@littledan sure, anything that's not just a single boolean:

  • Getting the decimal value of a character: to be able to implement parsing/analysis
  • Getting the script of a character, combining class, bidi properties to be able to do advanced layout
  • getting the general category of a character to determine how it should be processed ( equivalent of isprint etc )

sure, you could do

// The u flag is required for \p{…}; "gc" is the alias of General_Category.
if ( /\p{gc=Lo}/u.test('A') ) { 
  …
} else if ( /\p{gc=Lm}/u.test('A') ) { 
  …
} else if ( /\p{gc=Mc}/u.test('A') ) { 
  …
}

… but why?

Actually I would prefer making the property available over extending regex. Because if you have the properties, you can implement regex in JS. But without the properties enumerated, it's a lot harder to do the reverse.
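
To illustrate the asymmetry, here is a rough sketch of what recovering a property value from regexes alone looks like: one compiled pattern per General_Category value, tried in turn. `generalCategory` is a hypothetical helper, not a proposed API, and it assumes an engine with ES2018 property escapes.

```javascript
// The 30 two-letter General_Category values.
const GENERAL_CATEGORIES = [
  "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
  "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
  "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
];

// One precompiled matcher per value, e.g. /^\p{gc=Lu}$/u.
const MATCHERS = GENERAL_CATEGORIES.map((gc) => [
  gc,
  new RegExp(`^\\p{gc=${gc}}$`, "u"),
]);

// Hypothetical helper: linear search for the category of one code point.
function generalCategory(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  for (const [gc, re] of MATCHERS) {
    if (re.test(ch)) return gc;
  }
  return undefined;
}

console.log(generalCategory(0x41)); // "Lu"
console.log(generalCategory(0x35)); // "Nd"
```

This works, but it costs up to 30 regex tests per character, and the same trick cannot recover properties that regexes do not expose at all, such as numeric values.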

@hashseed

@srl295 I don't think exposing a way to test for property value for a particular character should affect this proposal.

@jungshik

@srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API.

Let's suppose that there's such a use case. Even then, wouldn't it be better to make that API a part of Ecma 262 instead of Ecma 402 (given what has been added to Regex)?

@littledan
Member

It's a somewhat esoteric question which place this lands in; the 262/402 split doesn't correspond to the split in some implementations. For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one.

A rough argument for putting it in 402 is, this is where the library functions for things that aren't methods on existing objects go. And it seems reasonable to make this a property of the Intl object.

@jungshik

For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one.

Well, IMHO the current 'V8_INTL_SUPPORT' needs to be split into two eventually, or its 'boundary' has to be changed, once https://bugs.chromium.org/p/v8/issues/detail?id=5500#c9 (replace unibrow with ICU) is resolved. One should be about Ecma402 support (Intl.* API support) and the other should be about whether ICU is used or not (ICU vs unibrow). Depending on how the above V8 bug is resolved, the latter might not be necessary at all (i.e. ICU is always used), in which case Unicode RegExp properties (a part of Ecma262) would always be supported regardless of Intl.* API (Ecma 402) support.

@littledan
Member

For anyone who's looking to contribute to ECMA 402, this is a "shovel ready" project, just in need of a writeup for a concrete API, and presentation to the committee.

@srl295
Member Author

srl295 commented Oct 15, 2019

https://github.com/tc39/template-for-proposals (that’s for 262, not sure if 402 has something different)

this may end up being 262…

@mathiasbynens
Member

In practice, does [..."e𞤫𞤫"].length get optimized somehow, so that it doesn't actually need to iterate and such? Just curious.

https://v8.dev/blog/spread-elements

@leobalter
Member

I believe this is much more convenient in 262 even w/ the fact it involves a good amount of work from the delegates working w/ ECMA-402.

As long as we have someone to champion it in a TC39 meeting, this should be well clarified.

@sffc sffc added s: in progress Status: the issue has an active proposal and removed s: help wanted Status: help wanted; needs proposal champion labels Oct 15, 2019
@sffc
Contributor

sffc commented Oct 15, 2019

Please move further discussion to the proposal repo.

https://github.com/srl295/es-unicode-properties/issues

EDIT (May 2021): This proposal is currently stalled, pending more concrete use cases.

@tc39 tc39 locked and limited conversation to collaborators Oct 15, 2019
@sffc sffc added s: comment Status: more info is needed to move forward Proposal Larger change requiring a proposal and removed s: in progress Status: the issue has an active proposal labels Jun 5, 2020
@tc39 tc39 unlocked this conversation May 8, 2021
@my2iu

my2iu commented May 9, 2021

Can you make the Unicode properties needed for internationalization and low-level text rendering available? It’s becoming increasingly common to do low-level text rendering in JavaScript because certain APIs like WebGL require it or because people are making more complex web apps like word processors, paint programs, or graphic design tools. Implementing internationalization support for this low-level rendering like the bidi algorithm, vertical orientation, and text shaping requires a lot of these Unicode properties, so it would be great if there were an API making them available. Right now, libraries like Harfbuzzjs simply include a compressed version of the Unicode database in their code, and it’s not too big, but since web browsers already know this information, it would be great if web browsers made it available to JavaScript. Preferably these properties would be fast to access too.
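
As a small illustration of the kind of work described here, the sketch below splits text into script runs using only regex property escapes. It is a workaround under stated assumptions, not a proposed API: `SCRIPTS` is a hypothetical short list, anything outside it is lumped into "Other", and real itemization would also have to handle Common and Inherited characters plus properties that regexes do not expose (for example bidi classes).

```javascript
// Hypothetical subset of scripts to detect; everything else is "Other".
const SCRIPTS = ["Latin", "Greek", "Cyrillic", "Han"];
const SCRIPT_MATCHERS = SCRIPTS.map((name) => [
  name,
  new RegExp(`^\\p{Script=${name}}$`, "u"),
]);

function scriptOf(ch) {
  for (const [name, re] of SCRIPT_MATCHERS) {
    if (re.test(ch)) return name;
  }
  return "Other";
}

// Group consecutive code points that share a script into runs.
function scriptRuns(text) {
  const runs = [];
  for (const ch of text) { // for...of iterates by code point
    const script = scriptOf(ch);
    const last = runs[runs.length - 1];
    if (last && last.script === script) {
      last.text += ch;
    } else {
      runs.push({ script, text: ch });
    }
  }
  return runs;
}

console.log(scriptRuns("abαβ"));
// [ { script: "Latin", text: "ab" }, { script: "Greek", text: "αβ" } ]
```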

@ryzokuken
Member

@my2iu thanks for your comment. Looks like the proposal has not seen a lot of activity lately, but hopefully that would change soon...

@srl295
Member Author

srl295 commented May 10, 2021

@my2iu @ryzokuken that's why I proposed this, but there was a lot of pushback that there weren't real use cases for anything that couldn't be covered by regex. See https://github.com/srl295/es-unicode-properties

@my2iu

my2iu commented May 10, 2021

Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast.

@srl295
Member Author

srl295 commented May 10, 2021

Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast.

Not sure why that would be slow, it's mostly the same strings.

Can you make a concrete list of regex properties that are currently missing? I.e. specific properties from the Unicode spec you need?

Also see https://github.com/srl295/es-unicode-properties#why-not-just-use-regex

@my2iu

my2iu commented May 10, 2021

Sure. Now that I think about it, something that operates on code points might be the fastest, plus with an API that either has a lot of methods or with a “matcher” object that can be optimized like with regexes. Actually, does JS even have proper support for code points and surrogates yet? The last time I checked, people were still arguing whether JS strings were UCS2, UTF16, or UTF32.

@srl295
Member Author

srl295 commented May 10, 2021

operates on code points might be the fastest

It could be an overload. The getter could take either a string or an integer codepoint. This is discussed in srl295/es-unicode-properties#5

does JS even have proper support for code points and surrogates yet?

yes.
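
For context, a short sketch of the code point support that exists in the language (ES2015 and later), followed by one way the string-or-integer overload mentioned above could normalize its argument. `toCodePoint` is a hypothetical helper for illustration only; it is not part of the proposal.

```javascript
// Strings are sequences of UTF-16 code units, but several built-ins
// work on code points.
const text = "e\u{1E92B}\u{1E92B}"; // "e" plus two supplementary characters
console.log(text.length);           // 5 (code units)
console.log([...text].length);      // 3 (code points)
console.log(text.codePointAt(1).toString(16)); // "1e92b"

// Hypothetical helper: accept either a string or an integer code point.
function toCodePoint(input) {
  if (typeof input === "number") {
    if (!Number.isInteger(input) || input < 0 || input > 0x10ffff) {
      throw new RangeError("not a valid code point: " + input);
    }
    return input;
  }
  const cp = String(input).codePointAt(0);
  if (cp === undefined) {
    throw new RangeError("empty string has no code point");
  }
  return cp;
}

console.log(toCodePoint("A"));  // 65
console.log(toCodePoint(0x41)); // 65
```

Taking an integer avoids allocating a string per lookup, which may matter for the rendering use cases raised earlier in the thread.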

@sffc
Contributor

sffc commented Jun 11, 2021

Strawman from @reed-at-google about what would be necessary for Skia's needs:

https://github.com/google/skia/blob/main/site/docs/dev/design/uni_characterize.md

@my2iu

my2iu commented Jun 11, 2021

That strawman API seems a little wonky. I’m not a huge WASM expert, but I don’t think strings are directly transferable from WASM to JS. You call from WASM to JS, and then from JS, you can reach into the WASM memory space to copy the raw bytes into JS and convert things into JS strings. Since WASM code is normally C++, things would normally be UTF-8, but UTF-16 might be possible as well, though WASM may prefer UTF-8. As such, I’m not sure whether minimizing the number of JS to WASM transitions needs to be necessarily reflected in the API design, and having an API that operates on JS strings (as opposed to typed arrays) isn’t necessarily the fastest thing for WASM either. It does bring up some good points about how batching might improve performance, depending on the overhead of JS to browser calls on VMs.

@sffc
Contributor

sffc commented Jun 12, 2021

Request for the "decimal" property: #579
