Registers in repetitions are not returned #25

Open
phmarek opened this issue Sep 25, 2015 · 4 comments

Comments

@phmarek
Contributor

phmarek commented Sep 25, 2015

Using 35c5266, I can see that registers within repetitions are not correctly returned:

(cl-ppcre:scan-to-strings "(?:\\s+(\\w+)=(\\w+))*" " a1=A13 a2=A2 ")
" a1=A13 a2=A2"
#("a2" "A2")

I would have expected either an array with 4 strings, or an array with two arrays (or lists) in it.
SCAN, REGEX-REPLACE, REGEX-REPLACE-ALL are all messed up by that, too, of course.

BTW, how is REGEX-REPLACE-ALL supposed to be used with repetitions? With a pattern like ^A(B)*$, the replacement lambda gets called only once (and has to return the complete new string), so it doesn't really provide any advantage over using SCAN.
So, in order to be able to return new strings for each of the B matches (individually) I have to match the A first, and then call again with :START for the Bs; there's no easier way, right?
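
The two-step approach described above (match the A part first, then process the Bs from that offset) can be sketched in Python, the language used for comparison later in this thread. This is only an illustration under assumptions: the `cfg` head, the helper name `rewrite_bindings`, and the callback `swap_case` are hypothetical and not part of cl-ppcre.

```python
import re

# Hypothetical stand-ins for the A and B parts of ^A(B)*$.
HEAD = re.compile(r"cfg")
BINDING = re.compile(r"\s+(\w+)=(\w+)")
TAIL = re.compile(r"(?:\s+\w+=\w+)*")  # (B)* without captures, used for validation only


def rewrite_bindings(text, fn):
    """Call fn once per binding that follows the head and return the new string."""
    head = HEAD.match(text)
    if head is None:
        return text  # the A part did not match; leave the input alone
    rest = text[head.end():]
    if TAIL.fullmatch(rest) is None:
        return text  # the remainder is not a pure run of bindings
    # The callback sees every binding individually, unlike a single ^A(B)*$ match.
    return text[:head.end()] + BINDING.sub(fn, rest)


def swap_case(m):
    # Example callback: upper-case the key, lower-case the value.
    return f" {m.group(1).upper()}={m.group(2).lower()}"


result = rewrite_bindings("cfg a1=A13 a2=A2", swap_case)
print(result)  # cfg A1=a13 A2=a2
```

The validation step keeps the semantics of the anchored pattern: if anything other than bindings follows the head, the input is returned unchanged.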

@ghollisjr

ghollisjr commented Aug 3, 2022

It looks like Python has the same behavior:

import re
re.findall(r"(?:\s+(\w+)=(\w+))*", " a1=A13 a2=A2 ")[0]

==> ('a2', 'A2')

Is there an example where you get different & unexpected behavior using Python or some other regex as a co-witness?

@ghollisjr

ghollisjr commented Aug 3, 2022

Also using ppcre:all-scans-as-strings from a recent PR to match re.findall:

(defun all-scans-as-strings (regex string &key start end)
  "Returns two values: 1. A list of all matches, 2. A list of lists for
all of the group matches."
  (let ((start (or start 0))
        (end (or end (length string)))
        (mresult NIL)
        (mlast NIL)
        (sresult NIL)
        (slast NIL))
    (labels ((macc (v)
               (if (null mresult)
                   (setf mresult (cons v nil)
                         mlast mresult)
                   (setf (cdr mlast) (cons v nil)
                         mlast (cdr mlast))))
             (sacc (v)
               (if (null sresult)
                   (setf sresult (cons v nil)
                         slast sresult)
                   (setf (cdr slast) (cons v nil)
                         slast (cdr slast)))))
      (apply #'values
             (ppcre:do-scans (ms me
                              rs re
                              regex string
                              (list mresult sresult)
                              :start start
                              :end end)
               (macc (subseq string ms me))
               (sacc (loop for s across rs
                           for e across re
                           collecting (if (and s e)
                                          (subseq string s e)
                                          ""))))))))

(nth-value 1 (ppcre:all-scans-as-strings "(?:\\s+(\\w+)=(\\w+))*" " a1=A13 a2=A2 "))
==> (("a2" "A2") ("" "") ("" ""))

which matches

>>> import re
>>> re.findall(r"(?:\s+(\w+)=(\w+))*", " a1=A13 a2=A2 ")

==> [('a2', 'A2'), ('', ''), ('', '')]
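
The two trailing `('', '')` entries come from zero-length matches that the `*` quantifier permits. A small sketch that prints the span of each match makes this visible; it assumes Python 3.7 or later, where an empty match may directly follow a non-empty one.

```python
import re

PATTERN = re.compile(r"(?:\s+(\w+)=(\w+))*")
TEXT = " a1=A13 a2=A2 "

# Assumes Python 3.7+: an empty match adjacent to a previous non-empty match is reported.
# groups(default="") mirrors findall, which shows unset groups as empty strings.
spans = [(m.span(), m.groups(default="")) for m in PATTERN.finditer(TEXT)]

for span, groups in spans:
    print(span, groups)
# (0, 13) ('a2', 'A2')
# (13, 13) ('', '')
# (14, 14) ('', '')
```

The first match consumes both bindings and the registers keep only the last iteration; the other two matches are empty, one before the trailing space and one at the end of the string.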

@ghollisjr

ghollisjr commented Aug 3, 2022

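As an aside, if you're trying to match the variable bindings, what's hurting your regex is greediness: the greedy repetition consumes both bindings in a single match, and each register keeps only its last iteration. If you use non-greedy +, each match covers a single binding: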

;; assuming the all-scans-as-strings definition from previous comment
(nth-value 1
  (ppcre:all-scans-as-strings "(?:\\s+(\\w+)=(\\w+))+?" " a1=A13 a2=A2 "))

==> (("a1" "A13") ("a2" "A2"))

@phmarek
Contributor Author

phmarek commented Aug 3, 2022

Well, my reference point was perl.

The canonical way in perl would be:

$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(\w+)=(\w+)/g),"\n";'
a1 A13 a2 A2

But there's no /g modifier in cl-ppcre, so I tried with a repetition around the RE.
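
For comparison with the Python examples earlier in the thread, the closest counterpart to a list-context /g match is probably `re.findall` on the un-repeated pattern. This is a minimal sketch; flattening the pairs is only there to mimic the Perl output above.

```python
import re

TEXT = " a1=A13 a2=A2 "

# One tuple per match, one element per capture group.
pairs = re.findall(r"(\w+)=(\w+)", TEXT)

# Flatten to a single list, as Perl's list-context /g match returns it.
flat = [s for pair in pairs for s in pair]

print(" ".join(flat))  # a1 A13 a2 A2
```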

Anyway -- perl has strangely exactly the same behaviour:

$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(?:\s+(\w+)=(\w+))+/g),"\n";'
a2 A2
$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(?:\s+(\w+)=(\w+))*/g),"\n";'
a2 A2    
$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(?:\s+(\w+)=(\w+))+?/g),"\n";'
a1 A13 a2 A2

Without the /g flag, similar results are returned:

$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(?:\s+(\w+)=(\w+))*/),"\n";'
a2 A2
$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(?:\s+(\w+)=(\w+))+/),"\n";'
a2 A2
$ perl -e '$_ = " a1=A13 a2=A2 "; print join(" ", /(?:\s+(\w+)=(\w+))+?/),"\n";'
a1 A13

I think perl's behaviour was different back in 2015 - I would've remembered if I had encountered this unexpected greedy vs. non-greedy difference.
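
The greedy vs. non-greedy difference in the single-match case can be reproduced in Python as well. A register inside a repetition keeps only its last iteration, so the greedy form reports the second binding, while the lazy form stops after one iteration and reports the first. A minimal sketch:

```python
import re

TEXT = " a1=A13 a2=A2 "

greedy = re.search(r"(?:\s+(\w+)=(\w+))+", TEXT)
lazy = re.search(r"(?:\s+(\w+)=(\w+))+?", TEXT)

# Greedy: both bindings are consumed, the groups hold the last iteration.
print(repr(greedy.group(0)), greedy.groups())  # ' a1=A13 a2=A2' ('a2', 'A2')

# Lazy: the match ends after the first iteration, so the groups hold the first binding.
print(repr(lazy.group(0)), lazy.groups())  # ' a1=A13' ('a1', 'A13')
```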

A looping construct works:

(cl-ppcre:do-register-groups (k v) ("(\\w+)=(\\w+)" " a1=A13 a2=A2 ")
  (format t "~s = ~s~%" k v))
; "a1" = "A13"
; "a2" = "A2"
; NIL
