Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gsub breaks \b for checking word boundaries #1425

Closed
thedward opened this issue Jun 6, 2017 · 8 comments · Fixed by #2641
Closed

gsub breaks \b for checking word boundaries #1425

thedward opened this issue Jun 6, 2017 · 8 comments · Fixed by #2641
Labels
Milestone

Comments

@thedward
Copy link

thedward commented Jun 6, 2017

When I run the following program with jq -n:

"hello world"|"\\b(?<i>[[:lower:]])" as $rx|gsub($rx;.i|ascii_upcase)

Expected Output:

"Hello World"

Actual Output:

"HELLO WORLD"

I've got a workaround for my particular use case:

"hello world"|"\\b(?<fst>[[:lower:]])(?<rst>[[:lower:]]*)\\b" as $rx|gsub($rx;"\(.fst|ascii_upcase)\(.rst)")
@pkoppstein
Copy link
Contributor

@thedward - As you've discovered, jq's interpretation of the "g" in gsub is currently different from the one you have in mind. Here's one way to achieve the effect you want:

def capitalize:
  def c: sub("\\b(?<i>[[:lower:]])";.i|ascii_upcase);
  c as $c | if . == $c then $c else ($c|c) end;

Example:

"hello world" | capitalize

@thedward
Copy link
Author

thedward commented Jun 7, 2017

It's also currently different than documented. :-)

I'm trying to write a correct implementation — if I succeed, I'll make a pull request.

@itchyny
Copy link
Contributor

itchyny commented Jun 3, 2023

I'm uncertain but this looks the same instance of gsub bugs, like #1206, #2148.

@pkoppstein
Copy link
Contributor

pkoppstein commented Jun 13, 2023

@itchyny wrote:

I'm uncertain but this looks the same instance of gsub bugs, like #1206, #2148.

With respect to the example given at the top of this thread, please note that gsub is working as intended, that is, in accordance with the idea that gsub(re; _) should repeatedly apply the transformation specified by sub(re;_) to non-overlapping substrings of the original input from left to right, until quiescence. That's not to say that this "specification" is desirable, but it is defensible.

@thedward
Copy link
Author

@pkoppstein where is this behavior specified?

@pkoppstein
Copy link
Contributor

@thedward - I wrote this "specification" to make it clear that I was referring to the previous sentence.
If you think that the existing documentation of gsub precludes the interpretation as iterative application, please just say so.

In any case, my hope is that whatever issues there may regarding the "iterative application" interpretation of gsub should not get in the way of fixing the infinite looping behavior described in #1206 and #2148.

@thedward
Copy link
Author

thedward commented Jun 14, 2023

@pkoppstein, I was just hoping there was a source other than the manual to shed more light on this.

Regardless, I don't agree that the existing behavior is consistent with the manual or with your informal specification because the regular expression \b(?<i>[[:lower:]] does not match the 'e' in "hello world". That it matches the 'e' in "ello world" should be irrelevant based on either the manual's description —"gsub is like sub but all the non-overlapping occurrences of the regex are replaced by the string, after interpolation."— or your informal specification.

I also have a hard time accepting the idea that someone intended the existing behavior both because it violates the principal of least surprise and because it makes the function less useful without gaining anything in return.

I really don't understand why you'd want to reframe this as anything other than a bug.

Since changing the behavior could conceivably break existing programs, it's certainly possible to argue it's a documentation bug at this point, but it's definitely a bug.

p.s. I'm really excited to see jq development picking up some speed!

@pkoppstein
Copy link
Contributor

pkoppstein commented Jun 19, 2023

@thedward @itchyny - The proposed new version of gsub (#2624) addresses this and many other issues.
This would be a good time to try to ensure that known gsub bugs are resolved if possible.
With that end in mind, here is an extract from my current tests/onig.test showing the new test cases.
Please feel free to suggest others.

The new version uses match(_; "g"), which unfortunately seems to have a bug in the case of match("";"g").
For example, "a" | match("";"g") only returns one match rather than two. That's why
the last test case (involving gsub(""; "a")) has the comment about its being a regression test case.
I propose retaining it, as at least it verifies termination (whereas with e.g. jq 1.6 it does not).

(The tests are presented as triples, each of which can be read as: asserteq(LINE1 | LINE2; LINE3).)

gsub( ""; "a";  "g")
""
"a"

gsub( "^"; "";  "g")
"a"
"a"

gsub( "$"; "a";  "g")
"a"
"aa"

gsub( "^"; "a")
""
"a"

gsub("(?=u)"; "u")
"qux"
"quux"

gsub("^.*a"; "b")
"aaa"
"b"

gsub("^.*?a"; "b")
"aaa"
"baa"

gsub("(?<x>.)(?<y>[0-9])"; "\(.x|ascii_downcase)\(.y)")
"A1 B2 CD"
"a1 b2 CD"

gsub("\\b(?<x>.)"; "\(.x|ascii_downcase)")
"ABC DEF"
"aBC dEF"

# Note: the following is just a regression test case.  The result should be "aaa".
gsub( ""; "a";  "g")
"a"
"aa"

pkoppstein added a commit to pkoppstein/jq that referenced this issue Jun 29, 2023
… uniq(stream)

The primary purpose of this commit (which supercedes PR
jqlang#2624) is to rectify most problems
with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425
('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no
longer loops infinitely; however, because the new `gsub` depends
critically on match(_;"g"), the behavior when regex == "" is sometimes
non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well [*2].

The documentation has been updated to reflect the fact that `sub` and
`gsub` are intended to be regular in the second argument. [*3]

Also, _nwise/1 has been tweaked to take advantage of TCO.

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than
"aaa" as would be standard.  This is nevertheless better than the
infinite loop behavior of jq 1.6 in this case.

With one exception (as explained in [*2]), the new gsub is implemented
as though match/2 behavior is correct.  That is, bugs in `gsub`
behavior will most likely have their origin in `match/2`.

[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:

gsub("(?=u)"; "u")
"qux"
"quux"

Without this functionality:

Test jqlang#23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here is `match`: if `match` is fixed, then gsub would not need `untie`.

The addition of `uniq` as a top-level function should be a non-issue
relative to general concern about builtins.jq bloat: the line count of
the new builtin.jq is significantly reduced overall, and the number of
defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).

[*3] See e.g. jqlang#513 (comment)
@itchyny itchyny added this to the 1.7 release milestone Jul 2, 2023
itchyny pushed a commit that referenced this issue Jul 3, 2023
The primary purpose of this commit is to rectify most problems with
`gsub` (and also `sub` with the `g` option), in particular fix #1425 ('\b'),
fix #2354 (lookahead), and fix #2532 (regex == `"^(?!cd ).*$|^cd "`).

This commit also partly resolves #2148 and resolves #1206 in that
`gsub` no longer loops infinitely; however, because the new `gsub`
depends critically on `match/2`, the behavior when regex == `""` is
sometimes non-standard.

The documentation has been updated to reflect the fact that `sub`
and `gsub` are intended to be regular in the second argument.

Also, `_nwise/1` has been tweaked to take advantage of TCO.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants