revamp sub/3 to resolve most issues with gsub (and sub with "g"); add uniq(stream)

The primary purpose of this commit (which supersedes PR
jqlang#2624) is to rectify most problems
with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425
('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no
longer loops infinitely; however, because the new `gsub` depends
critically on match(_;"g"), the behavior when regex == "" is sometimes
non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well [*2].

The documentation has been updated to reflect the fact that `sub` and
`gsub` are intended to be regular in the second argument. [*3]
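
To make "regular in the second argument" concrete, here is a minimal Python sketch. It is a model only, not the jq implementation: Python's `re` stands in for Oniguruma, and the names `sub_stream`, `replacements` and `every` are hypothetical. Each replacement is modeled as a function from the named-capture object to a string, and one result is produced per replacement.

```python
import re


def sub_stream(pattern, replacements, text, every=False):
    """Model of sub/3 with a stream as second argument: one result per replacement.

    `replacements` is a list of functions mapping the named-capture dict to a string.
    `every=True` plays the role of the "g" flag.
    """
    matches = list(re.finditer(pattern, text))
    if not every:
        matches = matches[:1]
    if not matches:
        return [text]  # no match: the input is emitted unchanged, once
    results = [""] * len(replacements)
    previous = 0  # end of the previous match
    for m in matches:
        gap = text[previous:m.start()]  # unmatched text before this match
        captures = m.groupdict()        # analogue of jq's capture object
        for ix, make in enumerate(replacements):
            results[ix] += gap + make(captures)
        previous = m.end()
    return [r + text[previous:] for r in results]


# Python spells a named group (?P<a>...) where Oniguruma uses (?<a>...).
up = lambda c: c["a"].upper()
down = lambda c: c["a"].lower()
print(sub_stream(r"(?P<a>.)", [up, down], "aB"))              # ['AB', 'aB']
print(sub_stream(r"(?P<a>.)", [up, down], "aB", every=True))  # ['AB', 'ab']
```

The first printed list agrees with the `sub` example added to the manual in this commit; the second is the same call with every match replaced.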

Also, _nwise/1 has been tweaked to take advantage of TCO.
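
For readers unfamiliar with the shape of the rewritten `_nwise/1`, the following Python sketch models it as a generator; the loop plays the role of the recursive call in tail position. The name `nwise` is hypothetical, and the sketch assumes n > 0.

```python
def nwise(seq, n):
    """Yield successive slices of length n; the last one may be shorter.

    Assumes n > 0 (with n == 0 the loop would never make progress).
    """
    while len(seq) > n:
        yield seq[:n]   # emit the next chunk
        seq = seq[n:]   # continue with the remainder, like the tail call
    yield seq           # final chunk, of length n or less


print(list(nwise("abcde", 2)))       # ['ab', 'cd', 'e']
print(list(nwise([1, 2, 3, 4], 2)))  # [[1, 2], [3, 4]]
```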

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than
"aaa" as would be standard.  This is nevertheless better than the
infinite loop behavior of jq 1.6 in this case.
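
As one reference point for what "standard" means here, Python's `re.sub` can be used for comparison. It is chosen only as an illustration; the commit does not name a reference implementation. It replaces the empty match on both sides of the character:

```python
import re

# An empty pattern matches before "a" and again at the end of the string,
# so the replacement is inserted twice around the original character.
print(re.sub("", "a", "a"))  # aaa

# On an empty input there is exactly one empty match.
print(re.sub("", "a", ""))   # a
```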

With one exception (as explained in [*2]), the new gsub is implemented
as though match/2 behavior is correct.  That is, bugs in `gsub`
behavior will most likely have their origin in `match/2`.

[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:

gsub("(?=u)"; "u")
"qux"
"quux"

Without this functionality:

Test jqlang#23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here is `match`: if `match` were fixed, then `gsub` would not need `uniq`.

The addition of `uniq` as a top-level function should be a non-issue
relative to general concern about builtin.jq bloat: the line count of
the new builtin.jq is significantly reduced overall, and the number of
defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).
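
The effect described in [*2] can be sketched in Python. This is a model under a stated assumption: the duplicated zero-width match at offset 1 is a guess at what `match` reports for this input (the commit shows only the resulting "quuux"), and `apply_edits` is a hypothetical helper, not part of jq.

```python
def uniq(stream):
    """Yield the first item of each run of equal adjacent items (no sorting)."""
    missing = object()  # sentinel meaning "nothing seen yet"
    previous = missing
    for item in stream:
        if previous is missing or item != previous:
            yield item
        previous = item


def apply_edits(text, edits, replacement):
    """Splice `replacement` over each (offset, length) edit, left to right."""
    result, previous = "", 0
    for offset, length in edits:
        result += text[previous:offset] + replacement
        previous = offset + length
    return result + text[previous:]


# ASSUMPTION: the same zero-width match at offset 1 is reported twice.
edits = [(1, 0), (1, 0)]

print(apply_edits("qux", edits, "u"))        # quuux
print(apply_edits("qux", uniq(edits), "u"))  # quux
print(list(uniq([1, 1, 2, None, None, 1])))  # [1, 2, None, 1]
```

Collapsing adjacent duplicates before applying the edits is enough to recover the expected "quux" in this model.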

[*3] See e.g. jqlang#513 (comment)
pkoppstein committed Jun 29, 2023
1 parent 03db550 commit 0d6a400
Showing 4 changed files with 139 additions and 48 deletions.
48 changes: 39 additions & 9 deletions docs/content/manual/manual.yml
@@ -1428,6 +1428,30 @@ sections:
  input: '[{"foo":1, "bar":14}, {"foo":2, "bar":3}]'
  output: ['{"foo":2, "bar":3}']

+ - title: "`uniq(stream)`"
+ body: |
+ The `uniq` function produces a substream of the given stream
+ by emitting in turn the first item from each run within it.
+ No sorting takes place.
+ examples:
+ - program: '[uniq(1,1,2,null,null,1)]'
+ input: 'null'
+ output: ['[1,2,null,1]']
+
+ - program: '[uniq(.[])]'
+ input: '[1,1,2,null,null,1]'
+ output: ['[1,2,null,1]']
+
+ - program: '[uniq(empty)]'
+ input: 'null'
+ output: ['[]']
+
+ - program: '[true, false | [uniq(1,1,2)]]'
+ input: null
+ output: ['[[1,2],[1,2]]']
+
  - title: "`unique`, `unique_by(path_exp)`"
  body: |
@@ -2471,27 +2495,33 @@ sections:
  input: '("ab,cd", "ef, gh")'
  output: ['"ab"', '"cd"', '"ef"', '"gh"']

- - title: "`sub(regex; tostring)`, `sub(regex; string; flags)`"
+ - title: "`sub(regex; tostring)`, `sub(regex; tostring; flags)`"
  body: |
- Emit the string obtained by replacing the first match of regex in the
- input string with `tostring`, after interpolation. `tostring` should
- be a jq string, and may contain references to named captures. The
- named captures are, in effect, presented as a JSON object (as
- constructed by `capture`) to `tostring`, so a reference to a captured
- variable named "x" would take the form: `"\(.x)"`.
+ Emit the string obtained by replacing the first match of
+ regex in the input string with `tostring`, after
+ interpolation. `tostring` should be a jq string or a stream
+ of such strings, each of which may contain references to
+ named captures. The named captures are, in effect, presented
+ as a JSON object (as constructed by `capture`) to
+ `tostring`, so a reference to a captured variable named "x"
+ would take the form: `"\(.x)"`.
  example:
  - program: 'sub("^[^a-z]*(?<x>[a-z]*).*")'
  input: '"123abc456"'
  output: '"ZabcZabc"'

+ - program: '[sub("(?<a>.)"; "\(.a|ascii_upcase)", "\(.a|ascii_downcase)")]'
+ input: '"aB"'
+ output: ['["AB","aB"]']

- - title: "`gsub(regex; string)`, `gsub(regex; string; flags)`"
+ - title: "`gsub(regex; tostring)`, `gsub(regex; tostring; flags)`"
  body: |
  `gsub` is like `sub` but all the non-overlapping occurrences of the regex are
- replaced by the string, after interpolation.
+ replaced by `tostring`, after interpolation. If the second argument is a stream
+ of jq strings, then `gsub` will produce a corresponding stream of JSON strings.
  example:
  - program: 'gsub("(?<x>.)[^a]*"; "+\(.x)-")'
66 changes: 27 additions & 39 deletions src/builtin.jq
@@ -99,8 +99,10 @@ def scan(re):
  #
  # If input is an array, then emit a stream of successive subarrays of length n (or less),
  # and similarly for strings.
- def _nwise(a; $n): if a|length <= $n then a else a[0:$n] , _nwise(a[$n:]; $n) end;
- def _nwise($n): _nwise(.; $n);
+ def _nwise($n):
+ def n: if length <= $n then . else .[0:$n] , (.[$n:] | n) end;
+ n;
+ def _nwise(a; $n): a | _nwise($n);
  #
  # splits/1 produces a stream; split/1 is retained for backward compatibility.
  def splits($re; flags): . as $s
@@ -114,47 +116,34 @@ def splits($re): splits($re; null);
  # split emits an array for backward compatibility
  def split($re; flags): [ splits($re; flags) ];
  #
- # If s contains capture variables, then create a capture object and pipe it to s
- def sub($re; s):
- . as $in
- | [match($re)]
- | if length == 0 then $in
- else .[0]
- | . as $r
- # # create the "capture" object:
- | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
- ({}; . + $pair)
- | $in[0:$r.offset] + s + $in[$r.offset+$r.length:]
- end ;
+ # stream-oriented
+ def uniq(s):
+ foreach s as $x (null;
+ if . and $x == .[0] then .[1] = false
+ else [$x, true]
+ end;
+ if .[1] then .[0] else empty end);
  #
- # If s contains capture variables, then create a capture object and pipe it to s
- def sub($re; s; flags):
- def subg: [explode[] | select(. != 103)] | implode;
- # "fla" should be flags with all occurrences of g removed; gs should be non-nil if flags has a g
- def sub1(fla; gs):
- def mysub:
- . as $in
- | [match($re; fla)]
- | if length == 0 then $in
- else .[0] as $edit
- | ($edit | .offset + .length) as $len
- # create the "capture" object:
- | reduce ( $edit | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
- ({}; . + $pair)
- | $in[0:$edit.offset]
- + s
- + ($in[$len:] | if length > 0 and gs then mysub else . end)
- end ;
- mysub ;
- (flags | index("g")) as $gs
- | (flags | if $gs then subg else . end) as $fla
- | sub1($fla; $gs);
+ # If s contains capture variables, then create a capture object and pipe it to s, bearing
+ # in mind that s could be a stream
+ def sub($re; s; $flags):
+ . as $in
+ | (reduce uniq(match($re; $flags)) as $edit
+ ({result: [], previous: 0};
+ $in[ .previous: ($edit | .offset) ] as $gap
+ # create the "capture" objects (one per item in s)
+ | [reduce ( $edit | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
+ ({}; . + $pair) | s ] as $inserts
+ | reduce range(0; $inserts|length) as $ix (.; .result[$ix] += $gap + $inserts[$ix])
+ | .previous = ($edit | .offset + .length ) )
+ | .result[] + $in[.previous:] )
+ // $in;
#
def sub($re; s): sub($re; s; "");
# repeated substitution of re (which may contain named captures)
#
def gsub($re; s; flags): sub($re; s; flags + "g");
def gsub($re; s): sub($re; s; "g");

#
########################################################################
# generic iterator/generator
def while(cond; update):
@@ -237,7 +226,6 @@ def tostream:
  getpath($p) |
  reduce path(.[]?) as $q ([$p, .]; [$p+$q]);

-
  # Assuming the input array is sorted, bsearch/1 returns
  # the index of the target if the target is in the input array; and otherwise
  # (-1 - ix), where ix is the insertion point that would leave the array sorted.
8 changes: 8 additions & 0 deletions tests/jq.test
@@ -1731,3 +1731,11 @@ false
  . |= try . catch .
  1
  1
+
+ [uniq(1,1,2,3,3,4)]
+ null
+ [1,2,3,4]
+
+ [uniq(empty)]
+ null
+ []
65 changes: 65 additions & 0 deletions tests/onig.test
@@ -75,6 +75,45 @@ gsub( "(.*)"; ""; "x")
  ""
  ""

+ gsub( ""; "a"; "g")
+ ""
+ "a"
+
+ gsub( "^"; ""; "g")
+ "a"
+ "a"
+
+
+ # The following is a regression test and should not be construed as a requirement other than that execution should terminate:
+ gsub( ""; "a"; "g")
+ "a"
+ "aa"
+
+ gsub( "$"; "a"; "g")
+ "a"
+ "aa"
+
+ gsub( "^"; "a")
+ ""
+ "a"
+
+ gsub("(?=u)"; "u")
+ "qux"
+ "quux"
+
+ gsub("^.*a"; "b")
+ "aaa"
+ "b"
+
+ gsub("^.*?a"; "b")
+ "aaa"
+ "baa"
+
+ # The following is for regression testing and should not be construed as a requirement:
+ [gsub("a"; "b", "c")]
+ "a"
+ ["b","c"]
+
  [.[] | scan(", ")]
  ["a,b, c, d, e,f",", a,b, c, d, e,f, "]
  [", ",", ",", ",", ",", ",", ",", ",", "]
@@ -92,7 +131,33 @@ gsub("(?<x>.)[^a]*"; "+\(.x)-")
  "Abcabc"
  "+A-+a-"

+ gsub("(?<x>.)(?<y>[0-9])"; "\(.x|ascii_downcase)\(.y)")
+ "A1 B2 CD"
+ "a1 b2 CD"
+
+ gsub("\\b(?<x>.)"; "\(.x|ascii_downcase)")
+ "ABC DEF"
+ "aBC dEF"
+
+ # utf-8
+ sub("(?<x>.)"; "\(.x)!")
+ "’"
+ "’!"
+
+ [sub("a"; "b", "c")]
+ "a"
+ ["b","c"]
+
+ [sub("(?<a>.)"; "\(.a|ascii_upcase)", "\(.a|ascii_downcase)", "c")]
+ "aB"
+ ["AB","aB","cB"]
+
+ [gsub("(?<a>.)"; "\(.a|ascii_upcase)", "\(.a|ascii_downcase)", "c")]
+ "aB"
+ ["AB","ab","cc"]
+
# splits and _nwise
[splits("")]
"ab"
["","a","b"]
