Editorial: Improve case sensitivity and conversion #625

Merged (8 commits) on Mar 28, 2022
74 changes: 50 additions & 24 deletions spec/locale-sensitive-functions.html
@@ -24,7 +24,7 @@ <h1>String.prototype.localeCompare ( _that_ [ , _locales_ [ , _options_ ] ] )</h
</p>

<emu-alg>
1. Let _O_ be RequireObjectCoercible(*this* value).
1. Let _O_ be ? RequireObjectCoercible(*this* value).
1. Let _S_ be ? ToString(_O_).
1. Let _thatValue_ be ? ToString(_that_).
1. Let _collator_ be ? Construct(%Collator%, &laquo; _locales_, _options_ &raquo;).
@@ -56,34 +56,54 @@ <h1>String.prototype.toLocaleLowerCase ( [ _locales_ ] )</h1>
</p>

<emu-alg>
1. Let _O_ be RequireObjectCoercible(*this* value).
1. Let _O_ be ? RequireObjectCoercible(*this* value).
1. Let _S_ be ? ToString(_O_).
1. Let _requestedLocales_ be ? CanonicalizeLocaleList(_locales_).
1. If _requestedLocales_ is not an empty List, then
1. Let _requestedLocale_ be _requestedLocales_[0].
1. Else,
1. Let _requestedLocale_ be DefaultLocale().
1. Let _noExtensionsLocale_ be the String value that is _requestedLocale_ with any Unicode locale extension sequences (<emu-xref href="#sec-unicode-locale-extension-sequences"></emu-xref>) removed.
1. Let _availableLocales_ be a List with language tags that includes the languages for which the Unicode Character Database contains language sensitive case mappings. Implementations may add additional language tags if they support case mapping for additional locales.
1. Let _locale_ be BestAvailableLocale(_availableLocales_, _noExtensionsLocale_).
1. If _locale_ is *undefined*, let _locale_ be *"und"*.
1. Let _cpList_ be a List containing in order the code points of _S_ as defined in es2022, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, starting at the first element of _S_.
1. Let _cuList_ be a List where the elements are the result of a lower case transformation of the ordered code points in _cpList_ according to the Unicode Default Case Conversion algorithm or an implementation-defined conversion algorithm. A conforming implementation's lower case transformation algorithm must always yield the same _cpList_ given the same _cuList_ and locale.
1. Let _L_ be a String whose elements are the UTF-16 Encoding (defined in es2022, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>) of the code points of _cuList_.
1. Return _L_.
1. Return ? TransformCase(_S_, _locales_, ~lower~).
</emu-alg>

<p>
Lower case code point mappings may be derived according to a tailored version of the Default Case Conversion Algorithms of the Unicode Standard. Implementations may use locale specific tailoring defined in SpecialCasings.txt and/or CLDR and/or any other custom tailoring.
</p>

<emu-note>
The case mapping of some code points may produce multiple code points. In this case the result String may not be the same length as the source String. Because both `toLocaleUpperCase` and `toLocaleLowerCase` have context-sensitive behaviour, the functions are not symmetrical. In other words, `s.toLocaleUpperCase().toLocaleLowerCase()` is not necessarily equal to `s.toLocaleLowerCase()`.
</emu-note>

<emu-note>
The `toLocaleLowerCase` function is intentionally generic; it does not require that its *this* value be a String object. Therefore, it can be transferred to other kinds of objects for use as a method.
</emu-note>

<emu-clause id="sec-transform-case" type="abstract operation">
<h1>
TransformCase (
_S_: a String,
_locales_: an ECMAScript language value,
_targetCase_: ~lower~ or ~upper~,
)
</h1>
<dl class="header">
<dt>description</dt>
<dd>It interprets _S_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and returns the result of implementation- and locale-dependent (ILD) transformation into _targetCase_ as a new String value.</dd>
</dl>
<emu-alg>
1. Let _requestedLocales_ be ? CanonicalizeLocaleList(_locales_).
1. If _requestedLocales_ is not an empty List, then
1. Let _requestedLocale_ be _requestedLocales_[0].
1. Else,
1. Let _requestedLocale_ be ! DefaultLocale().
1. Let _noExtensionsLocale_ be the String value that is _requestedLocale_ with any Unicode locale extension sequences (<emu-xref href="#sec-unicode-locale-extension-sequences"></emu-xref>) removed.
1. Let _availableLocales_ be a List with language tags that includes the languages for which the Unicode Character Database contains language sensitive case mappings. Implementations may add additional language tags if they support case mapping for additional locales.
1. Let _locale_ be ! BestAvailableLocale(_availableLocales_, _noExtensionsLocale_).
1. If _locale_ is *undefined*, set _locale_ to *"und"*.
1. Let _codePoints_ be ! StringToCodePoints(_S_).
1. If _targetCase_ is ~lower~, then
1. Let _newCodePoints_ be a List whose elements are the result of a lower case transformation of _codePoints_ according to an implementation-derived algorithm using _locale_ or the Unicode Default Case Conversion algorithm.
1. Else,
1. Assert: _targetCase_ is ~upper~.
1. Let _newCodePoints_ be a List whose elements are the result of an upper case transformation of _codePoints_ according to an implementation-derived algorithm using _locale_ or the Unicode Default Case Conversion algorithm.
1. Return ! CodePointsToString(_newCodePoints_).
@gibson042 (Contributor Author), Jan 13, 2022

This algorithm is technically omitting a possible error, but I'm not fully decided about how to remedy that. It occurs because ECMAScript String values are bounded to a maximum of 2^53 - 1 code units and some case transformations can increase length, meaning there are inputs for which the output is well-defined but not representable (e.g., `"ß".repeat(2**53 - 1).toLocaleUpperCase("en")`). I think the right approach is updating CodePointsToString (which is defined in ECMA-262) to throw an error when its input corresponds to out-of-bounds output and then updating this invocation to be explicitly fallible with `?`, although I'm willing to wait on that because AFAICT every current implementation is incapable of dealing with strings near the limit.
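
For illustration only (not part of the PR), a minimal sketch of the length growth described above, using a single `ß` rather than a string near the limit. The variable names are hypothetical; the only API used is the built-in `String.prototype.toLocaleUpperCase`.

```javascript
// A single code unit whose upper case form is two code units.
const source = "ß"; // U+00DF LATIN SMALL LETTER SHARP S

// Full case mapping turns "ß" into "SS", doubling the length.
const upper = source.toLocaleUpperCase("en");

// Ratio of output length to input length for this input.
const growthFactor = upper.length / source.length;

console.log(upper, upper.length, growthFactor); // "SS" 2 2
```

Because the output can be twice as long as the input, an input already at the maximum String length would have a well-defined but unrepresentable result, which is the case the comment is concerned with.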

</emu-alg>

<p>
Code point mappings may be derived according to a tailored version of the Default Case Conversion Algorithms of the Unicode Standard. Implementations may use locale-sensitive tailoring defined in the file <a href="https://unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt"><code>SpecialCasing.txt</code></a> of the Unicode Character Database and/or CLDR and/or any other custom tailoring. Regardless of tailoring, a conforming implementation's case transformation algorithm must always yield the same result given the same input code points, locale, and target case.
</p>

<emu-note>
The case mapping of some code points may produce multiple code points, and therefore the result may not be the same length as the input. Because both `toLocaleUpperCase` and `toLocaleLowerCase` have context-sensitive behaviour, the functions are not symmetrical. In other words, `s.toLocaleUpperCase().toLocaleLowerCase()` is not necessarily equal to `s.toLocaleLowerCase()` and `s.toLocaleLowerCase().toLocaleUpperCase()` is not necessarily equal to `s.toLocaleUpperCase()`.
</emu-note>
</emu-clause>
</emu-clause>

<emu-clause id="sup-string.prototype.tolocaleuppercase">
@@ -94,9 +114,15 @@ <h1>String.prototype.toLocaleUpperCase ( [ _locales_ ] )</h1>
</p>

<p>
This function interprets a String value as a sequence of code points, as described in es2022, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>. This function behaves in exactly the same way as `String.prototype.toLocaleLowerCase`, except that characters are mapped to their _uppercase_ equivalents. A conforming implementation's upper case transformation algorithm must always yield the same result given the same sequence of code points and locale.
This function interprets a String value as a sequence of code points, as described in es2022, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>. The following steps are taken:
</p>

<emu-alg>
1. Let _O_ be ? RequireObjectCoercible(*this* value).
1. Let _S_ be ? ToString(_O_).
1. Return ? TransformCase(_S_, _locales_, ~upper~).
</emu-alg>

<emu-note>
The `toLocaleUpperCase` function is intentionally generic; it does not require that its *this* value be a String object. Therefore, it can be transferred to other kinds of objects for use as a method.
</emu-note>
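
As a reader's aid (not part of the spec diff), the following sketch shows what the three algorithm steps and the "intentionally generic" note imply in practice. The variable names are hypothetical; only the built-in method is used.

```javascript
const toUpper = String.prototype.toLocaleUpperCase;

// Step 2 (ToString) lets the method work on non-String receivers.
const fromNumber = toUpper.call(255); // digits have no case, so "255"
const fromObject = toUpper.call({ toString() { return "abc"; } }, "en"); // "ABC"

// Step 1 (RequireObjectCoercible) rejects null and undefined.
let threwTypeError = false;
try {
  toUpper.call(null);
} catch (error) {
  threwTypeError = error instanceof TypeError;
}

console.log(fromNumber, fromObject, threwTypeError); // "255" "ABC" true
```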
6 changes: 3 additions & 3 deletions spec/locales-currencies-tz.html
@@ -153,8 +153,8 @@ <h1>IsWellFormedCurrencyCode ( _currency_ )</h1>

<emu-alg>
1. Let _normalized_ be the result of mapping _currency_ to upper case as described in <emu-xref href="#sec-case-sensitivity-and-case-mapping"></emu-xref>.
1. If the number of elements in _normalized_ is not 3, return *false*.
1. If _normalized_ contains any character that is not in the range *"A"* to *"Z"* (U+0041 to U+005A), return *false*.
1. If the length of _normalized_ is not 3, return *false*.
1. If _normalized_ contains any code unit outside of 0x0041 through 0x005A (corresponding to Unicode characters LATIN CAPITAL LETTER A through LATIN CAPITAL LETTER Z), return *false*.
1. Return *true*.
</emu-alg>
</emu-clause>
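
A minimal sketch of the revised steps, for illustration only. It assumes that the referenced case mapping section maps only the ASCII letters a through z to A through Z; the function names `asciiUpperCase` and `isWellFormedCurrencyCode` are hypothetical and not taken from any implementation.

```javascript
// Assumed mapping: only code units 0x0061-0x007A are changed.
function asciiUpperCase(value) {
  return value.replace(/[a-z]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x20)
  );
}

function isWellFormedCurrencyCode(currency) {
  const normalized = asciiUpperCase(currency);
  // The length of a String is its number of code units.
  if (normalized.length !== 3) return false;
  for (let i = 0; i < normalized.length; i++) {
    const codeUnit = normalized.charCodeAt(i);
    // Reject any code unit outside 0x0041 (A) through 0x005A (Z).
    if (codeUnit < 0x0041 || codeUnit > 0x005a) return false;
  }
  return true;
}

console.log(isWellFormedCurrencyCode("usd")); // true
console.log(isWellFormedCurrencyCode("US")); // false
console.log(isWellFormedCurrencyCode("U$D")); // false
```

Under that assumption, an input such as `"ıSD"` (with U+0131 LATIN SMALL LETTER DOTLESS I) is rejected, whereas a check built on `String.prototype.toUpperCase` would first map `ı` to `I` and accept it.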
@@ -220,7 +220,7 @@ <h1>DefaultTimeZone ( )</h1>
<h1>Measurement Unit Identifiers</h1>

<p>
The ECMAScript 2022 Internationalization API Specification identifies measurement units using a <em>core unit identifier</em> as defined by <a href="https://unicode.org/reports/tr35/tr35-general.html#Unit_Elements">Unicode Technical Standard #35, Part 2, Section 6</a>. Their canonical form is a string containing all lowercase letters with zero or more hyphens.
The ECMAScript 2022 Internationalization API Specification identifies measurement units using a <em>core unit identifier</em> as defined by <a href="https://unicode.org/reports/tr35/tr35-general.html#Unit_Elements">Unicode Technical Standard #35, Part 2, Section 6</a>. Their canonical form is a string containing only Unicode Basic Latin lower case letters (U+0061 LATIN SMALL LETTER A through U+007A LATIN SMALL LETTER Z) with zero or more medial hyphens (U+002D HYPHEN-MINUS).
</p>

<p>
2 changes: 1 addition & 1 deletion spec/numberformat.html
@@ -176,7 +176,7 @@ <h1>SetNumberFormatUnitOptions ( _intlObj_, _options_ )</h1>
1. If the result of IsWellFormedUnitIdentifier(_unit_) is *false*, throw a *RangeError* exception.
1. Let _unitDisplay_ be ? GetOption(_options_, *"unitDisplay"*, *"string"*, &laquo; *"short"*, *"narrow"*, *"long"* &raquo;, *"short"*).
1. If _style_ is *"currency"*, then
1. Let _currency_ be the result of converting _currency_ to upper case as specified in <emu-xref href="#sec-case-sensitivity-and-case-mapping"></emu-xref>.
1. Let _currency_ be the result of mapping _currency_ to upper case as specified in <emu-xref href="#sec-case-sensitivity-and-case-mapping"></emu-xref>.
1. Set _intlObj_.[[Currency]] to _currency_.
1. Set _intlObj_.[[CurrencyDisplay]] to _currencyDisplay_.
1. Set _intlObj_.[[CurrencySign]] to _currencySign_.