From 97c1679bde2bf7a55170c29300386b011c19df4e Mon Sep 17 00:00:00 2001
From: Ron Buckton <ron.buckton@microsoft.com>
Date: Tue, 1 Oct 2019 09:28:17 -0700
Subject: [PATCH] Normative: add RegExp match indices, including RegExp d flag

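With the d flag set, the array returned by RegExpBuiltinExec gains an
"indices" property. Each element is a [start, end) pair of code unit
offsets into the input string, or undefined for a capture that did not
participate. Named captures are mirrored on indices.groups.

The following is an illustrative sketch only, not part of the normative
text. It assumes a runtime that already implements the d flag. The
helper getStringIndex is a hypothetical JavaScript rendering of the
GetStringIndex abstract operation added below.

```javascript
// Named captures: indices are [start, end) offsets in code units.
const re = /(?<year>\d{4})-(?<month>\d{2})/d;
const m = re.exec("on 2019-10");
const hasIndices = re.hasIndices;      // true
const flags = re.flags;                // "d"
const whole = m.indices[0];            // [3, 10]
const year = m.indices.groups.year;    // [3, 7]
const month = m.indices[2];            // [8, 10]

// A capture that did not participate yields undefined, and
// indices.groups is undefined when the pattern has no GroupName.
const alt = /(a)|(b)/d.exec("b");
const altIndices = alt.indices;        // [[0, 1], undefined, [0, 1]]

// With the u flag, matching runs over code points, but the reported
// indices are still code unit offsets into the original string.
const uni = /\u{1F4A9}(x)/du.exec("\u{1F4A9}x");
const uniCapture = uni.indices[1];     // [2, 3]

// Hypothetical sketch of GetStringIndex: map a code point index e to
// the smallest code unit index in s for that code point, or s.length
// when e is at or past the end of the code point list.
function getStringIndex(s, e) {
  let codeUnitIndex = 0;
  let codePointIndex = 0;
  for (const ch of s) {               // iterates by code point
    if (codePointIndex === e) return codeUnitIndex;
    codeUnitIndex += ch.length;       // 1 or 2 code units
    codePointIndex += 1;
  }
  return s.length;
}
```

For example, getStringIndex("\u{1F4A9}x", 1) is 2, because the first
code point occupies two code units.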
---
 spec.html | 200 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 172 insertions(+), 28 deletions(-)
diff --git a/spec.html b/spec.html
index c90c3b8ff5b..b896d8f9c1f 100644
--- a/spec.html
+++ b/spec.html
@@ -35030,7 +35030,10 @@ <h1>Notation</h1>
             A <em>CharSet</em> is a mathematical set of characters. When the _Unicode_ flag is *true*, &ldquo;all characters&rdquo; means the CharSet containing all code point values; otherwise &ldquo;all characters&rdquo; means the CharSet containing all code unit values.
           </li>
           <li>
-            A <em>State</em> is an ordered pair (_endIndex_, _captures_) where _endIndex_ is an integer and _captures_ is a List of _NcapturingParens_ values. States are used to represent partial match states in the regular expression matching algorithms. The _endIndex_ is one plus the index of the last input character matched so far by the pattern, while _captures_ holds the results of capturing parentheses. The _n_<sup>th</sup> element of _captures_ is either a List of characters that represents the value obtained by the _n_<sup>th</sup> set of capturing parentheses or *undefined* if the _n_<sup>th</sup> set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
+            A <em>Range</em> is an ordered pair (_startIndex_, _endIndex_) that represents the range of characters included in a capture, where _startIndex_ is an integer representing the start index (inclusive) of the range within _Input_, and _endIndex_ is an integer representing the end index (exclusive) of the range within _Input_. For any <em>Range</em>, these indices must satisfy the invariant that _startIndex_ &le; _endIndex_.
+          </li>
+          <li>
+            A <em>State</em> is an ordered pair (_endIndex_, _captures_) where _endIndex_ is an integer and _captures_ is a List of _NcapturingParens_ values. States are used to represent partial match states in the regular expression matching algorithms. The _endIndex_ is one plus the index of the last input character matched so far by the pattern, while _captures_ holds the results of capturing parentheses. The _n_<sup>th</sup> element of _captures_ is either a <em>Range</em> representing the range of characters captured by the _n_<sup>th</sup> set of capturing parentheses, or *undefined* if the _n_<sup>th</sup> set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
           </li>
           <li>
             A <em>MatchResult</em> is either a State or the special token ~failure~ that indicates that the match failed.
@@ -35051,21 +35054,20 @@ <h1>Runtime Semantics: CompilePattern ( ): an Abstract Closure that takes a Stri
         <emu-grammar>Pattern :: Disjunction</emu-grammar>
         <emu-alg>
           1. Let _m_ be CompileSubpattern of |Disjunction| with argument ~forward~.
-          1. Return a new Abstract Closure with parameters (_str_, _index_) that captures _m_ and performs the following steps when called:
-            1. Assert: Type(_str_) is String.
-            1. Assert: _index_ is a non-negative integer which is &le; the length of _str_.
-            1. If _Unicode_ is *true*, let _Input_ be StringToCodePoints(_str_). Otherwise, let _Input_ be a List whose elements are the code units that are the elements of _str_. _Input_ will be used throughout the algorithms in <emu-xref href="#sec-pattern-semantics"></emu-xref>. Each element of _Input_ is considered to be a character.
+          1. Return a new Abstract Closure with parameters (_input_, _index_) that captures _m_ and performs the following steps when called:
+            1. Assert: _input_ is a List of characters.
+            1. Assert: _index_ is a non-negative integer which is &le; the number of characters in _input_.
+            1. Let _Input_ be _input_. This alias will be used throughout the algorithms in <emu-xref href="#sec-pattern-semantics"></emu-xref>.
             1. Let _InputLength_ be the number of characters contained in _Input_. This alias will be used throughout the algorithms in <emu-xref href="#sec-pattern-semantics"></emu-xref>.
-            1. Let _listIndex_ be the index into _Input_ of the character that was obtained from element _index_ of _str_.
             1. Let _c_ be a new Continuation with parameters (_y_) that captures nothing and performs the following steps when called:
               1. Assert: _y_ is a State.
               1. Return _y_.
             1. Let _cap_ be a List of _NcapturingParens_ *undefined* values, indexed 1 through _NcapturingParens_.
-            1. Let _x_ be the State (_listIndex_, _cap_).
+            1. Let _x_ be the State (_index_, _cap_).
             1. Return _m_(_x_, _c_).
         </emu-alg>
         <emu-note>
-          <p>A Pattern compiles to an Abstract Closure value. RegExpBuiltinExec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String, and, if it does match, what the values of the capturing parentheses would be. The algorithms in <emu-xref href="#sec-pattern-semantics"></emu-xref> are designed so that compiling a pattern may throw a *SyntaxError* exception; on the other hand, once the pattern is successfully compiled, applying the resulting Abstract Closure to find a match in a String cannot throw an exception (except for any implementation-defined exceptions that can occur anywhere such as out-of-memory).</p>
+          <p>A Pattern compiles to an Abstract Closure value. RegExpBuiltinExec can then apply this procedure to a List of characters and an offset within that List to determine whether the pattern would match starting at exactly that offset within the List, and, if it does match, what the values of the capturing parentheses would be. The algorithms in <emu-xref href="#sec-pattern-semantics"></emu-xref> are designed so that compiling a pattern may throw a *SyntaxError* exception; on the other hand, once the pattern is successfully compiled, applying the resulting Abstract Closure to find a match in a List of characters cannot throw an exception (except for any implementation-defined exceptions that can occur anywhere such as out-of-memory).</p>
         </emu-note>
       </emu-clause>
 
@@ -35468,12 +35470,12 @@ <h1>
               1. Let _ye_ be _y_'s _endIndex_.
               1. If _direction_ is ~forward~, then
                 1. Assert: _xe_ &le; _ye_.
-                1. Let _s_ be a List whose elements are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+                1. Let _r_ be the Range (_xe_, _ye_).
               1. Else,
                 1. Assert: _direction_ is ~backward~.
                 1. Assert: _ye_ &le; _xe_.
-                1. Let _s_ be a List whose elements are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).
-              1. Set _cap_[_parenIndex_ + 1] to _s_.
+                1. Let _r_ be the Range (_ye_, _xe_).
+              1. Set _cap_[_parenIndex_ + 1] to _r_.
               1. Let _z_ be the State (_ye_, _cap_).
               1. Return _c_(_z_).
             1. Return _m_(_x_, _d_).
@@ -35558,15 +35560,17 @@ <h1>
               1. Assert: _x_ is a State.
               1. Assert: _c_ is a Continuation.
               1. Let _cap_ be _x_'s _captures_ List.
-              1. Let _s_ be _cap_[_n_].
-              1. If _s_ is *undefined*, return _c_(_x_).
+              1. Let _r_ be _cap_[_n_].
+              1. If _r_ is *undefined*, return _c_(_x_).
               1. Let _e_ be _x_'s _endIndex_.
-              1. Let _len_ be the number of elements in _s_.
+              1. Let _rs_ be _r_'s _startIndex_.
+              1. Let _re_ be _r_'s _endIndex_.
+              1. Let _len_ be _re_ - _rs_.
               1. If _direction_ is ~forward~, let _f_ be _e_ + _len_.
               1. Else, let _f_ be _e_ - _len_.
               1. If _f_ &lt; 0 or _f_ &gt; _InputLength_, return ~failure~.
               1. Let _g_ be min(_e_, _f_).
-              1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_s_[_i_]) is not the same character value as Canonicalize(_Input_[_g_ + _i_]), return ~failure~.
+              1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_Input_[_rs_ + _i_]) is not the same character value as Canonicalize(_Input_[_g_ + _i_]), return ~failure~.
               1. Let _y_ be the State (_f_, _cap_).
               1. Return _c_(_y_).
           </emu-alg>
@@ -35922,7 +35926,7 @@ <h1>
             1. Else, let _P_ be ? ToString(_pattern_).
             1. If _flags_ is *undefined*, let _F_ be the empty String.
             1. Else, let _F_ be ? ToString(_flags_).
-            1. If _F_ contains any code unit other than *"g"*, *"i"*, *"m"*, *"s"*, *"u"*, or *"y"* or if it contains the same code unit more than once, throw a *SyntaxError* exception.
+            1. If _F_ contains any code unit other than *"d"*, *"g"*, *"i"*, *"m"*, *"s"*, *"u"*, or *"y"* or if it contains the same code unit more than once, throw a *SyntaxError* exception.
             1. If _F_ contains *"u"*, let _u_ be *true*; else let _u_ be *false*.
             1. If _u_ is *true*, then
               1. Let _patternText_ be StringToCodePoints(_P_).
@@ -36087,16 +36091,20 @@ <h1>
             1. Let _flags_ be _R_.[[OriginalFlags]].
             1. If _flags_ contains *"g"*, let _global_ be *true*; else let _global_ be *false*.
             1. If _flags_ contains *"y"*, let _sticky_ be *true*; else let _sticky_ be *false*.
+            1. If _flags_ contains *"d"*, let _hasIndices_ be *true*; else let _hasIndices_ be *false*.
             1. If _global_ is *false* and _sticky_ is *false*, set _lastIndex_ to 0.
             1. Let _matcher_ be _R_.[[RegExpMatcher]].
             1. If _flags_ contains *"u"*, let _fullUnicode_ be *true*; else let _fullUnicode_ be *false*.
             1. Let _matchSucceeded_ be *false*.
+            1. If _fullUnicode_ is *true*, let _input_ be StringToCodePoints(_S_). Otherwise, let _input_ be a List whose elements are the code units that are the elements of _S_.
+            1. NOTE: Each element of _input_ is considered to be a character.
             1. Repeat, while _matchSucceeded_ is *false*,
               1. If _lastIndex_ &gt; _length_, then
                 1. If _global_ is *true* or _sticky_ is *true*, then
                   1. Perform ? Set(_R_, *"lastIndex"*, *+0*<sub>𝔽</sub>, *true*).
                 1. Return *null*.
-              1. Let _r_ be _matcher_(_S_, _lastIndex_).
+              1. Let _inputIndex_ be the index into _input_ of the character that was obtained from element _lastIndex_ of _S_.
+              1. Let _r_ be _matcher_(_input_, _inputIndex_).
               1. If _r_ is ~failure~, then
                 1. If _sticky_ is *true*, then
                   1. Perform ? Set(_R_, *"lastIndex"*, *+0*<sub>𝔽</sub>, *true*).
@@ -36106,9 +36114,7 @@ <h1>
                 1. Assert: _r_ is a State.
                 1. Set _matchSucceeded_ to *true*.
             1. Let _e_ be _r_'s _endIndex_ value.
-            1. If _fullUnicode_ is *true*, then
-              1. _e_ is an index into the _Input_ character list, derived from _S_, matched by _matcher_. Let _eUTF_ be the smallest index into _S_ that corresponds to the character at element _e_ of _Input_. If _e_ is greater than or equal to the number of elements in _Input_, then _eUTF_ is the number of code units in _S_.
-              1. Set _e_ to _eUTF_.
+            1. If _fullUnicode_ is *true*, set _e_ to ! GetStringIndex(_S_, _e_).
             1. If _global_ is *true* or _sticky_ is *true*, then
               1. Perform ? Set(_R_, *"lastIndex"*, 𝔽(_e_), *true*).
             1. Let _n_ be the number of elements in _r_'s _captures_ List. (This is the same value as <emu-xref href="#sec-notation"></emu-xref>'s _NcapturingParens_.)
@@ -36117,27 +36123,43 @@ <h1>
             1. Assert: The mathematical value of _A_'s *"length"* property is _n_ + 1.
             1. Perform ! CreateDataPropertyOrThrow(_A_, *"index"*, 𝔽(_lastIndex_)).
             1. Perform ! CreateDataPropertyOrThrow(_A_, *"input"*, _S_).
-            1. Let _matchedSubstr_ be the substring of _S_ from _lastIndex_ to _e_.
+            1. Let _match_ be the Match Record { [[StartIndex]]: _lastIndex_, [[EndIndex]]: _e_ }.
+            1. Let _indices_ be a new empty List.
+            1. Let _groupNames_ be a new empty List.
+            1. Append _match_ to _indices_.
+            1. Let _matchedSubstr_ be ! GetMatchString(_S_, _match_).
             1. Perform ! CreateDataPropertyOrThrow(_A_, *"0"*, _matchedSubstr_).
             1. If _R_ contains any |GroupName|, then
               1. Let _groups_ be OrdinaryObjectCreate(*null*).
+              1. Let _hasGroups_ be *true*.
             1. Else,
               1. Let _groups_ be *undefined*.
+              1. Let _hasGroups_ be *false*.
             1. Perform ! CreateDataPropertyOrThrow(_A_, *"groups"*, _groups_).
             1. For each integer _i_ such that _i_ &ge; 1 and _i_ &le; _n_, in ascending order, do
               1. Let _captureI_ be _i_<sup>th</sup> element of _r_'s _captures_ List.
-              1. If _captureI_ is *undefined*, let _capturedValue_ be *undefined*.
-              1. Else if _fullUnicode_ is *true*, then
-                1. Assert: _captureI_ is a List of code points.
-                1. Let _capturedValue_ be CodePointsToString(_captureI_).
+              1. If _captureI_ is *undefined*, then
+                1. Let _capturedValue_ be *undefined*.
+                1. Append *undefined* to _indices_.
               1. Else,
-                1. Assert: _fullUnicode_ is *false*.
-                1. Assert: _captureI_ is a List of code units.
-                1. Let _capturedValue_ be the String value consisting of the code units of _captureI_.
+                1. Let _captureStart_ be _captureI_'s _startIndex_.
+                1. Let _captureEnd_ be _captureI_'s _endIndex_.
+                1. If _fullUnicode_ is *true*, then
+                  1. Set _captureStart_ to ! GetStringIndex(_S_, _captureStart_).
+                  1. Set _captureEnd_ to ! GetStringIndex(_S_, _captureEnd_).
+                1. Let _capture_ be the Match Record { [[StartIndex]]: _captureStart_, [[EndIndex]]: _captureEnd_ }.
+                1. Let _capturedValue_ be ! GetMatchString(_S_, _capture_).
+                1. Append _capture_ to _indices_.
               1. Perform ! CreateDataPropertyOrThrow(_A_, ! ToString(𝔽(_i_)), _capturedValue_).
               1. If the _i_<sup>th</sup> capture of _R_ was defined with a |GroupName|, then
                 1. Let _s_ be the CapturingGroupName of the corresponding |RegExpIdentifierName|.
                 1. Perform ! CreateDataPropertyOrThrow(_groups_, _s_, _capturedValue_).
+                1. Append _s_ to _groupNames_.
+              1. Else,
+                1. Append *undefined* to _groupNames_.
+            1. If _hasIndices_ is *true*, then
+              1. Let _indicesArray_ be ! MakeMatchIndicesIndexPairArray(_S_, _indices_, _groupNames_, _hasGroups_).
+              1. Perform ! CreateDataPropertyOrThrow(_A_, *"indices"*, _indicesArray_).
             1. Return _A_.
           </emu-alg>
         </emu-clause>
@@ -36161,6 +36183,116 @@ <h1>
             1. Return _index_ + _cp_.[[CodeUnitCount]].
           </emu-alg>
         </emu-clause>
+
+        <emu-clause id="sec-getstringindex" type="abstract operation">
+          <h1>
+            GetStringIndex (
+              _S_: a String,
+              _e_: a non-negative integer,
+            )
+          </h1>
+          <dl class="header">
+          </dl>
+          <emu-alg>
+            1. If _S_ is the empty String, return 0.
+            1. Let _codepoints_ be StringToCodePoints(_S_).
+            1. Let _eUTF_ be the smallest index into _S_ that corresponds to the character at element _e_ of _codepoints_. If _e_ is greater than or equal to the number of elements in _codepoints_, then _eUTF_ is the number of code units in _S_.
+            1. Return _eUTF_.
+          </emu-alg>
+        </emu-clause>
+
+        <emu-clause id="sec-match-records">
+          <h1>Match Records</h1>
+          <p>A <dfn variants="Match Records">Match Record</dfn> is a Record value used to encapsulate the start and end indices of a regular expression match or capture.</p>
+          <p>Match Records have the fields listed in <emu-xref href="#table-match-record"></emu-xref>.</p>
+          <emu-table id="table-match-record" caption="Match Record Fields">
+            <table>
+              <tr>
+                <th>Field Name</th>
+                <th>Value</th>
+                <th>Meaning</th>
+              </tr>
+              <tr>
+                <td>[[StartIndex]]</td>
+                <td>a non-negative integer</td>
+                <td>The number of code units from the start of a string at which the match begins (inclusive).</td>
+              </tr>
+              <tr>
+                <td>[[EndIndex]]</td>
+                <td>an integer &ge; [[StartIndex]]</td>
+                <td>The number of code units from the start of a string at which the match ends (exclusive).</td>
+              </tr>
+            </table>
+          </emu-table>
+        </emu-clause>
+
+        <emu-clause id="sec-getmatchstring" type="abstract operation">
+          <h1>
+            GetMatchString (
+              _S_: a String,
+              _match_: a Match Record,
+            )
+          </h1>
+          <dl class="header">
+          </dl>
+          <emu-alg>
+            1. Assert: _match_.[[StartIndex]] is an integer value &ge; 0 and &le; the length of _S_.
+            1. Assert: _match_.[[EndIndex]] is an integer value &ge; _match_.[[StartIndex]] and &le; the length of _S_.
+            1. Return the substring of _S_ from _match_.[[StartIndex]] to _match_.[[EndIndex]].
+          </emu-alg>
+        </emu-clause>
+
+        <emu-clause id="sec-getmatchindexpair" type="abstract operation">
+          <h1>
+            GetMatchIndexPair (
+              _S_: a String,
+              _match_: a Match Record,
+            )
+          </h1>
+          <dl class="header">
+          </dl>
+          <emu-alg>
+            1. Assert: _match_.[[StartIndex]] is an integer value &ge; 0 and &le; the length of _S_.
+            1. Assert: _match_.[[EndIndex]] is an integer value &ge; _match_.[[StartIndex]] and &le; the length of _S_.
+            1. Return ! CreateArrayFromList(&laquo; 𝔽(_match_.[[StartIndex]]), 𝔽(_match_.[[EndIndex]]) &raquo;).
+          </emu-alg>
+        </emu-clause>
+
+        <emu-clause id="sec-makematchindicesindexpairarray" type="abstract operation">
+          <h1>
+            MakeMatchIndicesIndexPairArray (
+              _S_: a String,
+              _indices_: a List, each of whose elements is a Match Record or *undefined*,
+              _groupNames_: a List, each of whose elements is a String or *undefined*,
+              _hasGroups_: a Boolean,
+            )
+          </h1>
+          <dl class="header">
+          </dl>
+          <emu-alg>
+            1. Let _n_ be the number of elements in _indices_.
+            1. Assert: _n_ &lt; 2<sup>32</sup> - 1.
+            1. Assert: _groupNames_ has _n_ - 1 elements.
+            1. NOTE: The _groupNames_ List contains elements aligned with the _indices_ List starting at _indices_[1].
+            1. Let _A_ be ! ArrayCreate(_n_).
+            1. Assert: The value of _A_'s *"length"* property is 𝔽(_n_).
+            1. If _hasGroups_ is *true*, then
+              1. Let _groups_ be OrdinaryObjectCreate(*null*).
+            1. Else,
+              1. Let _groups_ be *undefined*.
+            1. Perform ! CreateDataPropertyOrThrow(_A_, *"groups"*, _groups_).
+            1. For each integer _i_ starting with 0 such that _i_ &lt; _n_, in ascending order, do
+              1. Let _matchIndices_ be _indices_[_i_].
+              1. If _matchIndices_ is not *undefined*, then
+                1. Let _matchIndexPair_ be ! GetMatchIndexPair(_S_, _matchIndices_).
+              1. Else,
+                1. Let _matchIndexPair_ be *undefined*.
+              1. Perform ! CreateDataPropertyOrThrow(_A_, ! ToString(𝔽(_i_)), _matchIndexPair_).
+              1. If _i_ &gt; 0 and _groupNames_[_i_ - 1] is not *undefined*, then
+                1. Perform ! CreateDataPropertyOrThrow(_groups_, _groupNames_[_i_ - 1], _matchIndexPair_).
+            1. Return _A_.
+          </emu-alg>
+        </emu-clause>
       </emu-clause>
 
       <emu-clause id="sec-get-regexp.prototype.dotAll">
@@ -36200,6 +36332,8 @@ <h1>get RegExp.prototype.flags</h1>
           1. Let _R_ be the *this* value.
           1. If Type(_R_) is not Object, throw a *TypeError* exception.
           1. Let _result_ be the empty String.
+          1. Let _hasIndices_ be ToBoolean(? Get(_R_, *"hasIndices"*)).
+          1. If _hasIndices_ is *true*, append the code unit 0x0064 (LATIN SMALL LETTER D) as the last code unit of _result_.
           1. Let _global_ be ToBoolean(? Get(_R_, *"global"*)).
           1. If _global_ is *true*, append the code unit 0x0067 (LATIN SMALL LETTER G) as the last code unit of _result_.
           1. Let _ignoreCase_ be ToBoolean(? Get(_R_, *"ignoreCase"*)).
@@ -36226,6 +36360,16 @@ <h1>get RegExp.prototype.global</h1>
         </emu-alg>
       </emu-clause>
 
+      <emu-clause id="sec-get-regexp.prototype.hasIndices">
+        <h1>get RegExp.prototype.hasIndices</h1>
+        <p>`RegExp.prototype.hasIndices` is an accessor property whose set accessor function is *undefined*. Its get accessor function performs the following steps:</p>
+        <emu-alg>
+          1. Let _R_ be the *this* value.
+          1. Let _cu_ be the code unit 0x0064 (LATIN SMALL LETTER D).
+          1. Return ? RegExpHasFlag(_R_, _cu_).
+        </emu-alg>
+      </emu-clause>
+
       <emu-clause id="sec-get-regexp.prototype.ignorecase">
         <h1>get RegExp.prototype.ignoreCase</h1>
         <p>`RegExp.prototype.ignoreCase` is an accessor property whose set accessor function is *undefined*. Its get accessor function performs the following steps:</p>
