Unexpected back reference scope #1057

Closed
timhodson opened this issue Jul 12, 2022 · 2 comments

@timhodson

Unexpected back reference scope

Given a CSV file like this.

original,expected
abc123,abc-456

This miller command:

mlr --csv --from test.csv put '
    if ($original =~ "([a-z]{3})([0-9]{3})") {
        $br1="\1"; 
        $br2="\2"; 
        $new_value=$br1."456";
        $substitution = sub($new_value, "([a-z]{3})([0-9]{3})", "\1-\2")
        }
'

Gives:

original,expected,br1,br2,new_value,substitution
abc123,abc-456,abc,123,abc456,abc-123

The back references set by the original =~ comparison remain set, which is very likely to confuse anyone who is doing a number of transformations of a string.

This is a noddy example for demo purposes. My use case was splitting a URL apart, doing various regex pattern match substitutions on the parts of the URL before putting it back together again.

My solution was to split out the second substitution into a separate put statement, but this is less intuitive than having all transformations within a block as it meant repeating some tests to see if I needed to do the transformation and then pick up where I left off.

mlr --csv --from test.csv put '
    if ($original =~ "([a-z]{3})([0-9]{3})") {
        $br1="\1"; 
        $br2="\2"; 
        $new_value=$br1."456";
        }
' \
then put '
    $substitution = sub($new_value, "([a-z]{3})([0-9]{3})", "\1-\2")
'

The above gives the expected:

original,expected,br1,br2,new_value,substitution
abc123,abc-456,abc,123,abc456,abc-456

I don't know if this is a bug or intentional, but I wasn't expecting it and it took me several days of hair pulling to spot what was happening!

So if nothing else this may help someone else resolve unexpected side effects of using back references with the =~ operator.

@johnkerl
Owner

johnkerl commented Dec 19, 2023

@timhodson sorry for the very long delay.

After some head-scratching I realized the issue was "obvious in hindsight" 😬

Some things:

  • When you use =~ with (...) in it then \1,\2,... are supposed to be set for the rest of the function scope
    • I.e. after the =~, any \1 within a string literal should be replaced with an abc, and likewise any \2 should be replaced with a 123. For example, when you assign $br1 = "\1" and $br2 = "\2".
    • This is working as intended.
    • (Note that as of the recent PR #1447, "Preserve regex captures across stack frames", each user-defined top-level block and user-defined function has its own scope for these -- just FYI, not that that helps here)
  • When you use sub or gsub with (...) in it then \1,\2,... are supposed to be set only within the sub or gsub -- once the sub function has returned, the next line, for example, shouldn't know what captures the sub/gsub found
    • This too is working as intended

The problem is the "any" part when I said "after the =~, any \1 within a string literal should be replaced with an abc". The issue is that the third argument to sub is itself such a string. So the string "\1-\2" which you're intending to pass directly to sub is getting changed to "abc-123" before the sub function ever receives it :( -- the \1 and \2 that appear in the third argument to sub are interpolated for the same reason they're interpolated in $br1 = "\1" and $br2 = "\2".

So, alas. I don't know a way to automatically change this without breaking the semantics of =~ and the capture strings \1,\2,... which are supposed to generate interpolations through the rest of the scope.

One thing I can offer is issue #1401 and PR #1451. The good news is you can do

"anything" =~ null

in between the =~ and the sub -- with null on the right-hand side of =~ -- and that'll reset the \1,\2,... so that your third argument to sub arrives intact all the way to sub. The bad news is you have to remember to do that. And it's counter-intuitive. The reason for my head-scratching above was I kept seeing the third argument to sub as the exact string "\1-\2" and failing to realize it was subject to interpolation like anything else in the scope. So, this is under user control, but error-prone. (I do now regret having ever implemented =~ with captures \1,\2,... and I wish I'd had strmatchx from the get-go. Alas, time has gone by and Miller will need to support captures.)
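
For readers who want to see the order of operations spelled out, here is a minimal Python sketch that models the behaviour described in this comment. It is not Miller's implementation: the names interpolate and transform are hypothetical, and setting captures to None stands in for the "anything" =~ null reset. The sketch also leaves an out-of-range \N untouched, which is a choice made for the model only and not a claim about Miller.

```python
import re

PATTERN = r"([a-z]{3})([0-9]{3})"


def interpolate(literal, captures):
    """Fill \\1..\\9 in a string literal from the ambient captures, if any are set."""
    if captures is None:
        # No captures in scope: the literal passes through unchanged.
        return literal

    def fill(m):
        i = int(m.group(1))
        # Model-only choice: leave out-of-range references as they are.
        return captures[i] if i < len(captures) else m.group(0)

    return re.sub(r"\\([0-9])", fill, literal)


def transform(original, reset_captures):
    # Models: if ($original =~ "([a-z]{3})([0-9]{3})") { ... }
    m = re.search(PATTERN, original)
    if m is None:
        return None
    captures = [m.group(0), *m.groups()]

    # Models: $br1 = "\1"; $new_value = $br1 . "456"
    new_value = interpolate(r"\1", captures) + "456"

    if reset_captures:
        # Models: "anything" =~ null
        captures = None

    # The replacement string is an ordinary string literal, so it is
    # interpolated BEFORE the substitution function ever sees it.
    replacement = interpolate(r"\1-\2", captures)
    return re.sub(PATTERN, replacement, new_value, count=1)


without_reset = transform("abc123", reset_captures=False)
with_reset = transform("abc123", reset_captures=True)
print(without_reset)  # abc-123
print(with_reset)     # abc-456
```

Without the reset, the replacement string has already become "abc-123" by the time the substitution runs, which reproduces the surprising output in the original report. With the reset it arrives as "\1-\2" and the substitution uses its own captures.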

The other thing I have to offer is the new strmatchx function from the recent PR #1448. The fundamental problem with =~ setting \1,\2,... is the one you encountered -- they persist where you don't want or expect them. Using strmatchx you get to apply captures only when you want, intentionally and not by accident.
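
The idea of getting captures back as an explicit value, instead of having them set as ambient state, can be illustrated with a short Python sketch. This only models the approach: match_with_captures and the keys of the dictionary it returns are hypothetical names chosen for this example, not a description of what strmatchx returns.

```python
import re

PATTERN = r"([a-z]{3})([0-9]{3})"


def match_with_captures(value, pattern):
    """Return the match result as data; nothing is stored outside the return value."""
    m = re.search(pattern, value)
    if m is None:
        return {"matched": False, "captures": []}
    return {"matched": True, "captures": list(m.groups())}


result = match_with_captures("abc123", PATTERN)
if result["matched"]:
    # Captures are used only where they are asked for by name.
    br1, br2 = result["captures"]
    new_value = br1 + "456"
    # No ambient captures exist, so the replacement string reaches the
    # substitution untouched and refers to the substitution's own groups.
    substitution = re.sub(PATTERN, r"\1-\2", new_value, count=1)
    print(br1, br2, new_value, substitution)  # abc 123 abc456 abc-456
```

Because nothing persists after the match call, a later substitution in the same block cannot pick up stale captures by accident.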

@johnkerl
Owner

@timhodson I will close this issue but please let me know if we need any more action taken & I'll happily re-open it.

@johnkerl johnkerl removed the active label Jan 21, 2024