revamp sub/3 to resolve most issues with gsub (and sub with "g"); add uniq(stream)

The primary purpose of this commit (which supersedes PR jqlang#2624) is to rectify most problems with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425 ('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no longer loops infinitely; however, because the new `gsub` depends critically on match(_;"g"), the behavior when regex == "" is sometimes non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well. [*2]

The documentation has been updated to reflect the fact that `sub` and `gsub` are intended to be regular in the second argument. [*3]

Also, _nwise/1 has been tweaked to take advantage of TCO.

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than "aaa" as would be standard. This is nevertheless better than the infinite loop behavior of jq 1.6 in this case. With one exception (as explained in [*2]), the new gsub is implemented as though match/2 behavior is correct. That is, bugs in `gsub` behavior will most likely have their origin in `match/2`.

[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:

    gsub("(?=u)"; "u")
    "qux"
    "quux"

Without this functionality:

    Test #23: 'gsub("(?=u)"; "u")' at line number 100
    *** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here is `match`: if `match` is fixed, then gsub would not need `uniq`.

The addition of `uniq` as a top-level function should be a non-issue relative to general concern about builtin.jq bloat: the line count of the new builtin.jq is significantly reduced overall, and the number of defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).

[*3] See e.g. jqlang#513 (comment)
pkoppstein · Jun 29, 2023 · 0d6a400
1 parent 03db550
commit 0d6a400
Showing 4 changed files with 139 additions and 48 deletions.
diff --git a/docs/content/manual/manual.yml b/docs/content/manual/manual.yml
@@ -1428,6 +1428,30 @@ sections:
             input: '[{"foo":1, "bar":14}, {"foo":2, "bar":3}]'
             output: ['{"foo":2, "bar":3}']
 
+      - title: "`uniq(stream)`"
+        body: |
+
+          The `uniq` function produces a substream of the given stream
+          by emitting in turn the first item from each run within it.
+          No sorting takes place.
+          
+        examples:
+          - program: '[uniq(1,1,2,null,null,1)]'
+            input: 'null'
+            output: ['[1,2,null,1]']
+
+          - program: '[uniq(.[])]'
+            input: '[1,1,2,null,null,1]'
+            output: ['[1,2,null,1]']
+
+          - program: '[uniq(empty)]'
+            input: 'null'
+            output: ['[]']
+
+          - program: '[true, false | [uniq(1,1,2)]]'
+            input: 'null'
+            output: ['[[1,2],[1,2]]']
+
       - title: "`unique`, `unique_by(path_exp)`"
         body: |
 
@@ -2471,27 +2495,33 @@ sections:
             input: '("ab,cd", "ef, gh")'
             output: ['"ab"', '"cd"', '"ef"', '"gh"']
 
-      - title: "`sub(regex; tostring)`, `sub(regex; string; flags)`"
+      - title: "`sub(regex; tostring)`, `sub(regex; tostring; flags)`"
         body: |
 
-          Emit the string obtained by replacing the first match of regex in the
-          input string with `tostring`, after interpolation.  `tostring` should
-          be a jq string, and may contain references to named captures. The
-          named captures are, in effect, presented as a JSON object (as
-          constructed by `capture`) to `tostring`, so a reference to a captured
-          variable named "x" would take the form: `"\(.x)"`.
+          Emit the string obtained by replacing the first match of
+          regex in the input string with `tostring`, after
+          interpolation.  `tostring` should be a jq string or a stream
+          of such strings, each of which may contain references to
+          named captures. The named captures are, in effect, presented
+          as a JSON object (as constructed by `capture`) to
+          `tostring`, so a reference to a captured variable named "x"
+          would take the form: `"\(.x)"`.
 
         example:
           - program: 'sub("^[^a-z]*(?<x>[a-z]*).*")'
             input: '"123abc456"'
             output: '"ZabcZabc"'
 
+          - program: '[sub("(?<a>.)"; "\(.a|ascii_upcase)", "\(.a|ascii_downcase)")]'
+            input: '"aB"'
+            output: ['["AB","aB"]']
 
-      - title: "`gsub(regex; string)`, `gsub(regex; string; flags)`"
+      - title: "`gsub(regex; tostring)`, `gsub(regex; tostring; flags)`"
         body: |
 
           `gsub` is like `sub` but all the non-overlapping occurrences of the regex are
-          replaced by the string, after interpolation.
+          replaced by `tostring`, after interpolation. If the second argument is a stream
+          of jq strings, then `gsub` will produce a corresponding stream of JSON strings.
 
         example:
           - program: 'gsub("(?<x>.)[^a]*"; "+\(.x)-")'

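The `uniq(stream)` entry documented above can be modeled outside jq. The following is a minimal Python sketch of the documented behavior (emit the first item of each run, no sorting), not the jq implementation itself. The sentinel `_NO_PREVIOUS` is a name introduced here for illustration, and Python's `==` is only an approximation of jq equality (for instance, Python treats `1 == True` as true, which jq does not).

```python
from typing import Any, Iterable, Iterator

# Hypothetical sentinel: lets a leading None (jq's null) still be emitted.
_NO_PREVIOUS = object()


def uniq(stream: Iterable[Any]) -> Iterator[Any]:
    """Yield the first item of each run of equal adjacent items."""
    previous = _NO_PREVIOUS
    for item in stream:
        # Assumption: Python equality stands in for jq equality here.
        if previous is _NO_PREVIOUS or item != previous:
            yield item
        previous = item


print(list(uniq([1, 1, 2, None, None, 1])))  # [1, 2, None, 1]
print(list(uniq([])))                        # []
```

As in the manual examples, the trailing `1` is emitted again because only adjacent duplicates are collapsed.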
diff --git a/src/builtin.jq b/src/builtin.jq
@@ -99,8 +99,10 @@ def scan(re):
 #
 # If input is an array, then emit a stream of successive subarrays of length n (or less),
 # and similarly for strings.
-def _nwise(a; $n): if a|length <= $n then a else a[0:$n] , _nwise(a[$n:]; $n) end;
-def _nwise($n): _nwise(.; $n);
+def _nwise($n):
+  def n: if length <= $n then . else .[0:$n] , (.[$n:] | n) end;
+  n;
+def _nwise(a; $n): a | _nwise($n);
 #
 # splits/1 produces a stream; split/1 is retained for backward compatibility.
 def splits($re; flags): . as $s
@@ -114,47 +116,34 @@ def splits($re): splits($re; null);
 # split emits an array for backward compatibility
 def split($re; flags): [ splits($re; flags) ];
 #
-# If s contains capture variables, then create a capture object and pipe it to s
-def sub($re; s):
-  . as $in
-  | [match($re)]
-  | if length == 0 then $in
-    else .[0]
-    | . as $r
-#  # create the "capture" object:
-    | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
-        ({}; . + $pair)
-    | $in[0:$r.offset] + s + $in[$r.offset+$r.length:]
-    end ;
+# stream-oriented
+def uniq(s):
+  foreach s as $x (null;
+    if . and $x == .[0] then .[1] = false
+    else [$x, true]
+    end;
+    if .[1] then .[0] else empty end);
 #
-# If s contains capture variables, then create a capture object and pipe it to s
-def sub($re; s; flags):
-  def subg: [explode[] | select(. != 103)] | implode;
-  # "fla" should be flags with all occurrences of g removed; gs should be non-nil if flags has a g
-  def sub1(fla; gs):
-    def mysub:
-      . as $in
-      | [match($re; fla)]
-      | if length == 0 then $in
-        else .[0] as $edit
-        | ($edit | .offset + .length) as $len
-        # create the "capture" object:
-        | reduce ( $edit | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
-            ({}; . + $pair)
-        | $in[0:$edit.offset]
-          + s
-          + ($in[$len:] | if length > 0 and gs then mysub else . end)
-        end ;
-    mysub ;
-    (flags | index("g")) as $gs
-    | (flags | if $gs then subg else . end) as $fla
-    | sub1($fla; $gs);
+# If s contains capture variables, then create a capture object and pipe it to s, bearing
+# in mind that s could be a stream
+def sub($re; s; $flags):
+   . as $in
+   | (reduce uniq(match($re; $flags)) as $edit
+        ({result: [], previous: 0};
+            $in[ .previous: ($edit | .offset) ] as $gap
+            # create the "capture" objects (one per item in s)
+            | [reduce ( $edit | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
+                 ({}; . + $pair) | s ] as $inserts
+            | reduce range(0; $inserts|length) as $ix (.; .result[$ix] += $gap + $inserts[$ix])
+            | .previous = ($edit | .offset + .length ) )
+          | .result[] + $in[.previous:] )
+      // $in;
 #
 def sub($re; s): sub($re; s; "");
-# repeated substitution of re (which may contain named captures)
+#
 def gsub($re; s; flags): sub($re; s; flags + "g");
 def gsub($re; s): sub($re; s; "g");
-
+#
 ########################################################################
 # generic iterator/generator
 def while(cond; update):
@@ -237,7 +226,6 @@ def tostream:
   getpath($p) |
   reduce path(.[]?) as $q ([$p, .]; [$p+$q]);
 
-
 # Assuming the input array is sorted, bsearch/1 returns
 # the index of the target if the target is in the input array; and otherwise
 #  (-1 - ix), where ix is the insertion point that would leave the array sorted.

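The new `sub/3` above walks the matches once, keeping one accumulated result per item produced by `s`. A rough Python model of that accumulation is sketched below, under several assumptions: it uses Python's `re` module rather than Oniguruma (so named groups are written `(?P<a>...)` and empty-regex behavior differs from jq's), `sub_stream` and `Template` are hypothetical names, the replacement stream is represented as a list of callables over the capture object, and every match is assumed to yield the same number of replacements. The `uniq` step is omitted because `re.finditer` does not report the duplicate matches it exists to filter.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for one item of jq's `s`: capture object -> replacement.
Template = Callable[[Dict[str, Optional[str]]], str]


def sub_stream(text: str, pattern: str, templates: List[Template],
               global_: bool = False) -> List[str]:
    """Return one substituted string per template, mirroring the reduce in sub/3."""
    matches = list(re.finditer(pattern, text))
    if not global_:
        matches = matches[:1]  # without "g", only the first match is replaced
    results: List[str] = []
    previous = 0  # end offset of the previous match
    for index, m in enumerate(matches):
        gap = text[previous:m.start()]  # unmatched text before this match
        inserts = [template(m.groupdict()) for template in templates]
        if index == 0:
            results = [""] * len(inserts)
        results = [acc + gap + ins for acc, ins in zip(results, inserts)]
        previous = m.end()
    if not results:
        return [text]  # no match (or no replacement): input is returned unchanged
    return [acc + text[previous:] for acc in results]


templates = [lambda c: c["a"].upper(), lambda c: c["a"].lower(), lambda c: "c"]
print(sub_stream("aB", r"(?P<a>.)", templates))                # ['AB', 'aB', 'cB']
print(sub_stream("aB", r"(?P<a>.)", templates, global_=True))  # ['AB', 'ab', 'cc']
```

The two printed lists correspond to the `sub` and `gsub` stream cases added to tests/onig.test in this commit.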
diff --git a/tests/jq.test b/tests/jq.test
@@ -1731,3 +1731,11 @@ false
 . |= try . catch .
 1
 1
+
+[uniq(1,1,2,3,3,4)]
+null
+[1,2,3,4]
+
+[uniq(empty)]
+null
+[]
diff --git a/tests/onig.test b/tests/onig.test
@@ -75,6 +75,45 @@ gsub( "(.*)"; "";  "x")
 ""
 ""
 
+gsub( ""; "a";  "g")
+""
+"a"
+
+gsub( "^"; "";  "g")
+"a"
+"a"
+
+
+# The following is a regression test and should not be construed as a requirement other than that execution should terminate:
+gsub( ""; "a";  "g")
+"a"
+"aa"
+
+gsub( "$"; "a";  "g")
+"a"
+"aa"
+
+gsub( "^"; "a")
+""
+"a"
+
+gsub("(?=u)"; "u")
+"qux"
+"quux"
+
+gsub("^.*a"; "b")
+"aaa"
+"b"
+
+gsub("^.*?a"; "b")
+"aaa"
+"baa"
+
+# The following is for regression testing and should not be construed as a requirement:
+[gsub("a"; "b", "c")]
+"a"
+["b","c"]
+
 [.[] | scan(", ")]
 ["a,b, c, d, e,f",", a,b, c, d, e,f, "]
 [", ",", ",", ",", ",", ",", ",", ",", "]
@@ -92,7 +131,33 @@ gsub("(?<x>.)[^a]*"; "+\(.x)-")
 "Abcabc"
 "+A-+a-"
 
+gsub("(?<x>.)(?<y>[0-9])"; "\(.x|ascii_downcase)\(.y)")
+"A1 B2 CD"
+"a1 b2 CD"
+
+gsub("\\b(?<x>.)"; "\(.x|ascii_downcase)")
+"ABC DEF"
+"aBC dEF"
+
 # utf-8
 sub("(?<x>.)"; "\(.x)!")
 "’"
 "’!"
+
+[sub("a"; "b", "c")]
+"a"
+["b","c"]
+
+[sub("(?<a>.)"; "\(.a|ascii_upcase)", "\(.a|ascii_downcase)", "c")]
+"aB"
+["AB","aB","cB"]
+
+[gsub("(?<a>.)"; "\(.a|ascii_upcase)", "\(.a|ascii_downcase)", "c")]
+"aB"
+["AB","ab","cc"]
+
+# splits and _nwise
+[splits("")]
+"ab"
+["","a","b"]
+
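For readers unfamiliar with `_nwise`, the chunking performed by the rewritten definition in src/builtin.jq can be approximated in Python as follows. This is a sketch only: `nwise` is a hypothetical name, the loop replaces jq's tail recursion, and the `ValueError` guard for non-positive `n` is an addition of this sketch rather than something the jq definition does.

```python
from typing import Iterator, Sequence, TypeVar

S = TypeVar("S", bound=Sequence)


def nwise(seq: S, n: int) -> Iterator[S]:
    """Yield successive slices of length n; the final slice may be shorter."""
    if n <= 0:
        # Assumption of this sketch: reject sizes that would never terminate.
        raise ValueError("n must be positive")
    while len(seq) > n:
        yield seq[:n]
        seq = seq[n:]
    # Mirrors the `length <= $n` branch: the remainder is emitted as-is,
    # so an empty input yields a single empty slice.
    yield seq


print(list(nwise("abcde", 2)))       # ['ab', 'cd', 'e']
print(list(nwise([1, 2, 3, 4], 2)))  # [[1, 2], [3, 4]]
```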