add *(::Union{Regex, AbstractString, AbstractChar}...) #23422

rfourquet · 2017-08-24T08:34:11Z

I needed it recently and got frustrated that it doesn't work:

julia> const width = 7
julia> replace(line, r"^" * ' '^width, "") # ok, doesn't work, have to look up how to do repetition in PCRE...
julia> replace(line, r"^ {$width}", "") # ah crap, cannot do string interpolation...
julia> replace(line, r"^ {7}", "") # NOTE: this 7 corresponds to `width`, don't forget to update it when needed

So yes there is a way to achieve what is needed with the Regex constructor, but allowing * is more regex-noobie friendly.
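For reference, the `Regex` constructor workaround mentioned above might look like the sketch below. It assumes a Julia 1.x session, where `replace` takes a `pattern => replacement` pair rather than the three-argument form used in the snippet above; the value of `line` is made up for illustration.

```julia
# Build the pattern from an ordinary string so that `$width` is interpolated
# by Julia before the regex is compiled.
width = 7
line = " "^width * "some indented text"   # hypothetical input line

pattern = Regex("^ {$width}")             # same pattern as r"^ {7}"
stripped = replace(line, pattern => "")

println(stripped)
```

The interpolation happens in the plain string literal, so PCRE never sees the `$`.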

I don't know if it's a good idea, but don't foresee a problem with it; note that I didn't go so far as making Regex <: AbstractString, as I didn't expect to get support for that ;-)

waldyrious · 2017-08-24T09:36:21Z

allowing * is more noobie friendly.

I assume you mean "regex noobie", because r"^" * ' '^7 looks almost like obfuscated code to me. IMHO the second option (r"^ {$width}") is the most readable, if that's a possibility.

rfourquet · 2017-08-24T09:40:28Z

Yes, you are right, I edited it to "regex-noobie friendly"! The option you find more readable would require more work, and as $ has a special meaning in regexes it may be subtle to implement.

StefanKarpinski · 2017-08-24T14:13:42Z

Related: regular languages with * as concatenation of patterns and + as alternation of patterns (i.e. | in regex syntax) form a semiring. In fact this is the original Kleene algebra – i.e. a semiring with the additional requirement that + is idempotent (a + a = a for all a). This naturally extends to classes of equivalent regular expressions and strings are embedded in this space as wildcard-free patterns (the string as a literal pattern).

In other words, extending * to work on regular expressions and on mixtures of regexes and strings is a completely natural extension of how * works on strings already. It would also make sense to define + on strings and regexes to compute the alternation of its arguments. That said, I'm a bit reluctant to start adding this kind of functionality at this point since once we start doing this, I feel like we should do it thoroughly – i.e. implement the entire Kleene algebra. There would definitely be some people who would not be thrilled about "foo" + "bar" producing re"(foo|bar)" 😬.

rfourquet · 2017-09-03T12:58:17Z

Someone else had a similar observation:

In linear algebra, there is a unary operator, the adjoint operator, often denoted by *. In Julia as well as Matlab, the adjoint operator is given by a single quote ('), since * is generally used for multiplication. So I would propose string operators in Julia to be (*,+,') for concatenation, alternation, and Kleene star respectively.

In this PR I clearly don't intend to go beyond multiplication: is it still too early to decide to do that, or shall I add tests and documentation?

StefanKarpinski · 2017-09-05T17:31:12Z

Go for it!

rfourquet · 2019-04-02T13:44:15Z

Freshly rebased and documented, ready for reviews :)

StefanKarpinski

My immediate instinct is to worry about composing expressions via string operations. And indeed, I think there are some quite broken cases here. E.g. r"this|that" * r"cat|hat" would result in r"this|thatcat|hat" which is not the correct juxtaposition at all. Similarly, for repetition r"this|that"^2 results in r"this|thatthis|that" which is also wrong. I think that if you unconditionally wrap regex patterns in non-capturing subpatterns, this might be correct. So you'd produce r"(?:this|that)(?:cat|hat)" for the first example. For repetition, I think it's wasteful to actually textually repeat the regex pattern since regular expressions already support repetition as a language feature. So for the second case, I would instead generate r"(?:this|that){2}".
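The precedence problem described here can be checked directly. The following is a small sketch, not the PR's code: `whole` is a hypothetical helper that anchors a pattern at both ends so that only full-string matches count.

```julia
a, b = r"this|that", r"cat|hat"

# Naive textual concatenation: alternation binds more loosely than
# juxtaposition, so this means "this" OR "thatcat" OR "hat".
naive = Regex(a.pattern * b.pattern)

# Wrapping each operand in a non-capturing group keeps it a unit.
grouped = Regex("(?:$(a.pattern))(?:$(b.pattern))")

# Repetition via the {n} quantifier instead of repeating the text.
twice = Regex("(?:$(a.pattern)){2}")

# Hypothetical helper: anchor a regex so it must match the whole string.
whole(r::Regex) = Regex("^(?:$(r.pattern))\$")

for s in ("this", "thiscat", "thathat")
    println(s, ": naive=", occursin(whole(naive), s),
               " grouped=", occursin(whole(grouped), s))
end
println("thisthat: twice=", occursin(whole(twice), "thisthat"))
```

Regex options are ignored in this sketch; only the grouping is illustrated.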

Other features that we may want to consider while thinking about such an API expansion are:

- alternation of regexes and even strings, producing regexes
- repetition by a range of indices to get between min and max number of repetitions
- an operator/function for Kleene star?

For example, we could use + for alternation since that makes * and + a semiring on regexes with strings embedded in them as "simple" patterns. We could also use something like p^(a:b) to generate r"(?:$p){$a,$b}" if you follow my pseudo-syntax here.
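As a rough illustration of these two ideas, here is a sketch using plain functions rather than `+` and `^`, so that no Base operators are overloaded. The names `alternate` and `repeat_between` are hypothetical and are not part of Base or of this PR; regex options are ignored for simplicity.

```julia
# Hypothetical helper: alternation of several regexes, each wrapped in a
# non-capturing group so that it stays a unit.
alternate(ps::Regex...) = Regex(join(("(?:$(p.pattern))" for p in ps), "|"))

# Hypothetical helper: between `lo` and `hi` repetitions, i.e. (?:p){lo,hi}.
repeat_between(p::Regex, lo::Integer, hi::Integer) =
    Regex("(?:$(p.pattern)){$lo,$hi}")

alt = alternate(r"foo", r"bar")
rep = repeat_between(r"ab", 2, 3)

println(alt.pattern)
println(rep.pattern)
println(occursin(alt, "bar"))
println(match(rep, "abababab").match)
```

The `{lo,hi}` quantifier is greedy, so `rep` consumes at most three copies of `ab` even when more are available.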

StefanKarpinski · 2019-04-02T15:15:21Z

base/regex.jl

+function *(r1::Union{Regex,AbstractString,AbstractChar}, rs::Union{Regex,AbstractString,AbstractChar}...)
+    opts = unique((r.compile_options, r.match_options) for r in (r1, rs...) if r isa Regex)
+    length(opts) == 1 ||
+        throw(ArgumentError("cannot multiply regexes with incompatible options"))


This could be supported with internal option setting, probably best combined with a non-capturing subpattern. So something like this:

"(?sm:$(r1.pattern))(?i:$(r2.pattern))"

This could work uniformly but it would be nice to avoid the extra line noise in the presumably common case where the options match. You could get even fancier and keep options that are shared by all of the regexes in the result regex's global options while only putting the options that are specific to each subpattern in its option set.

StefanKarpinski · 2019-04-02T15:31:06Z

So overall todos here are:

- wrap every pattern in non-capturing groups (?$opts:$pattern) before doing operations
- for concatenation, collect shared options in the resulting regex's global options
- put non-shared options in the opts part of each component's non-capturing group
- use (?:$pat){$rep} notation for repetition of regexes
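To make the per-group option idea concrete, here is a small sketch. The first pattern is written by hand using PCRE's inline option syntax; the second uses the `*` method this PR adds, so it assumes Julia 1.3 or later. Only matching behavior is shown, since the exact pattern text that `*` produces is not reproduced here.

```julia
# (?i:...) makes only the enclosed subpattern case-insensitive.
manual = Regex("(?i:a)(?:b)")
println(occursin(manual, "Ab"))   # "A" is accepted by the (?i:a) group
println(occursin(manual, "AB"))   # "B" is not accepted by the plain group

# Assumes Julia >= 1.3: concatenating regexes whose options differ.
combined = r"a"i * r"b"
println(occursin(combined, "Ab"))
println(occursin(combined, "AB"))
```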

rfourquet · 2019-04-03T12:27:08Z

Thanks so much for taking the time to set out the logic, and for linking to documentation! I hadn't even started to think about this level of sophistication, and didn't know the PCRE features you pointed to for implementing the logic.
(I may have invented a new way to learn PCRE: make a PR touching regexes and wait for feedback 😉)

So I implemented this, but with the limitation that only the four options corresponding to "imsx" can be different between the multiplied regexes (IIUC, these are the only ones which can be indicated "internally").

Concerning the extension of the API, I unfortunately have nothing to contribute to your ideas, but it makes sense to me and I would be happy to implement them. For a Kleene star operator, it is unfortunate that * is already taken (and similarly for the "one or more" operation and +).

rfourquet · 2019-04-03T14:56:42Z

base/regex.jl

+# Examples
+```jldoctest
+julia> r"Hello " * "world"
+r"Hello world"


Oops, forgot to update the docstrings, will do next time I push. Does it need more explanation? ("concatenate regexes" is quite minimal).

StefanKarpinski

The regex composition stuff is looking good now but I think the string embedding into regular expressions is still wrong: strings and characters need to be embedded as patterns which match only that literal string or character.

StefanKarpinski · 2019-04-03T15:22:10Z

test/regex.jl

+        end
+
+        @test r"a"i * r"b"i == r"(?:a)(?:b)"i
+        @test r"a"i * "b"   == r"(?:a)(?:b)"i


Should this really work like this? I would expect the string "b" to only match itself?

StefanKarpinski · 2019-04-03T15:30:42Z

base/regex.jl

+end
+
+unwrap_string(r::Regex, unshared::UInt32) = string("(?", regex_opts_str(r.compile_options & unshared), ':', r.pattern, ')')
+unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = string("(?:", s, ')')


So I think this is not how we want this to work: I think that a string or character should be converted to a regex that only matches that exact string or character, not to a regex with that string or character as its content. Also, I think there's a case to be made for making this a method of convert. For a clearer example where it matters, should r"a|b" * "c|d" produce r"(?:a|b)(?:c|d)" which would match ac, ad, bc and bd or should it produce something like r"(?:a|b)(?:c\|d)" which would match ac|d and bc|d? I think it has to be the latter.

Ok, actually while you were reviewing I pushed a new version unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = s, but it has the same problem. I didn't think about this problem, and I don't see how to address it generally. It's easy enough to quote |, but I don't know if this would be enough to quote a list of special characters ('{', '+', etc.). Plus, it would be time consuming.

OK, there seems to be a solution after all: enclose the string between "\Q" and "\E". Remains to decide which behavior is the best, I don't have a strong opinion. But it's true that if one wants the regex interpretation of characters, it's easy enough to wrap the string in a call to Regex.
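The difference between the two interpretations of `r"a|b" * "c|d"` can be seen with hand-written patterns. This is only a sketch of the `\Q...\E` quoting idea, anchored with `^` and `$` so that whole strings are compared.

```julia
# String treated as regex source: "c|d" becomes an alternation.
as_regex = Regex("^(?:a|b)(?:c|d)\$")

# String treated as a literal: \Q...\E turns off metacharacters in between.
as_literal = Regex("^(?:a|b)\\Qc|d\\E\$")

for s in ("ac", "bd", "ac|d", "bc|d")
    println(rpad(s, 5), " regex=", occursin(as_regex, s),
                        " literal=", occursin(as_literal, s))
end
```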

StefanKarpinski

Meant to submit that review as "request changes" but picked the wrong radio button.

StefanKarpinski · 2019-04-03T15:36:00Z

In full generality, this requires being able to correctly regex-quote strings, which is definitely doable, but not entirely straightforward, I suspect. It may be easier to just skip that part of this change and keep it focused on the pure regex bit.

rfourquet · 2019-04-03T15:50:47Z

skip that part of this change and keep it focused on the pure regex bit.

So allow only multiplication of regexes, not regexes + strings + chars?
EDIT: I'm fine with that, although my use case involved precisely this mixed multiplication.

StefanKarpinski · 2019-04-03T20:09:11Z

You can certainly include that too, but does it make sense to you that it should treat strings as a regex that only matches that string? If you’re cool with implementing that, I have no objection.

rfourquet · 2019-04-04T14:43:51Z

I don't know if you saw my comment above, but there seems to be a trivial solution, using quoting between "\Q" and "\E". So I updated the PR with this.

does it make sense to you that it should treat strings as a regex that only matches that string?

It sure makes sense, but as I said I don't have a strong opinion and anyway it (my opinion) shouldn't matter as I didn't do much regexing in my life. Your call!

StefanKarpinski

This is getting there. I think if you just handle \E in strings we should be good.

StefanKarpinski · 2019-04-04T15:36:32Z

base/regex.jl

@@ -559,7 +571,7 @@ end
 *(r::Regex) = r # avoids wrapping r in a useless subpattern

 unwrap_string(r::Regex, unshared::UInt32) = string("(?", regex_opts_str(r.compile_options & unshared), ':', r.pattern, ')')
-unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = s # no need to wrap in subpattern
+unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = string("\\Q", s, "\\E")


Ok, you're going to want to kill me but what if s contains \E?

Hahaha! actually, I should have known better with my background in cybersecurity (cf. injection attacks) ;-)
EDIT: I'm very curious whether you'll find a flaw in my solution!
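To see why an embedded `\E` is a problem, and one way it could be handled, consider the sketch below. `quote_literal` is a hypothetical name, and the replacement strategy (close the quote just before the `E`, then reopen it) is an assumption about how such escaping can be done; it is not presented as the exact code merged in this PR.

```julia
# Naive quoting: a "\E" inside `s` ends the quoted section early.
naive_quote(s::AbstractString) = string("\\Q", s, "\\E")

# Hypothetical fix: emit a literal backslash, close the quote with \E,
# then reopen it with \Q and continue with a literal "E".
quote_literal(s::AbstractString) =
    string("\\Q", replace(s, raw"\E" => raw"\\E\QE"), "\\E")

s = raw"a\E+"                          # the four characters a, \, E, +

naive = Regex("^" * naive_quote(s) * "\$")    # behaves like ^a+$
safe  = Regex("^" * quote_literal(s) * "\$")

println(occursin(naive, "aaa"))   # the "+" leaked out as a quantifier
println(occursin(naive, s))
println(occursin(safe, "aaa"))
println(occursin(safe, s))
```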

StefanKarpinski · 2019-04-06T02:55:47Z

This looks good to me now. We’re halfway to having a full algebra on regular expressions!

rfourquet · 2019-04-27T10:04:44Z

So the intersection of the failing tests between this CI run and the previous one is empty (previous failures: travis, tester_macos64, tester_linux32), so I will assume they are not related.

I will merge soon if no objection.

StefanKarpinski · 2019-04-29T20:43:11Z

Go for it! Squash or not as appropriate. Edit: I decided to just go for it 😁

fredrikekre · 2019-04-29T20:44:32Z

NEWS.md

@@ -24,7 +24,7 @@ New library functions
 Standard library changes
 ------------------------

-* Cmd interpolation (``` `$(x::Cmd) a b c` ``` where) now propagates `x`'s process flags (environment, flags, working directory, etc) if `x` is the first interpolant and errors otherwise ([#24353]).


Bad rebase?

Good catch. Fixed in #31874.

Ooops, sorry about that!

rfourquet added needs docs Documentation for this change is required needs tests Unit tests are required for this change strings "Strings!" labels Aug 24, 2017

rfourquet force-pushed the rf/regexp-concat branch from b3a8a81 to 87e9ca2 Compare April 2, 2019 13:42

rfourquet changed the title ~~RFC: add *(::Union{Regex, AbstractString, Char}...)~~ add *(::Union{Regex, AbstractString, Char}...) Apr 2, 2019

rfourquet removed needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Apr 2, 2019

rfourquet force-pushed the rf/regexp-concat branch from 87e9ca2 to 4306a4b Compare April 2, 2019 14:13

StefanKarpinski requested changes Apr 2, 2019

rfourquet commented Apr 3, 2019

StefanKarpinski approved these changes Apr 3, 2019

StefanKarpinski requested changes Apr 3, 2019

rfourquet force-pushed the rf/regexp-concat branch from 01b8253 to 04340e4 Compare April 4, 2019 14:52

StefanKarpinski requested changes Apr 4, 2019

StefanKarpinski approved these changes Apr 6, 2019

rfourquet force-pushed the rf/regexp-concat branch from dba1b10 to 8c5af35 Compare April 25, 2019 14:03

rfourquet changed the title ~~add *(::Union{Regex, AbstractString, Char}...)~~ add *(::Union{Regex, AbstractString, AbstractChar}...) Apr 25, 2019

rfourquet closed this Apr 26, 2019

rfourquet reopened this Apr 26, 2019

rfourquet added 10 commits April 26, 2019 17:52

add *(::Union{Regex, AbstractString, Char}...)

a9b756e

use internal option setting and non-capturing subpatterns

8883776

do nothing for one argument

1d90dbc

don't wrap strings and chars

c8182c9

use quotation for strings and chars

e6eda32

remove useless operation

abbc367

update wrong example in docstring

dfb6b2e

add example and rename unwrap_string -> wrap_string

a17e926

fix case where a string contains "\E"

03b11ca

update to 1.3

8382116

rfourquet force-pushed the rf/regexp-concat branch from 8c5af35 to 8382116 Compare April 26, 2019 15:52

StefanKarpinski merged commit 0140ce8 into master Apr 29, 2019

StefanKarpinski deleted the rf/regexp-concat branch April 29, 2019 20:43

fredrikekre reviewed Apr 29, 2019

StefanKarpinski added a commit that referenced this pull request Apr 29, 2019

fix bad rebase in #23422

9eb3f27

rfourquet pushed a commit that referenced this pull request Apr 30, 2019

fix bad rebase in #23422 (#31874)

3d7e0d9

rfourquet mentioned this pull request May 10, 2019

create a Regex from a string matching this string literally (with escaping) #31989

Open

StefanKarpinski mentioned this pull request Aug 12, 2019

Add a function to escape strings for use in regular expressions #29643

Open

singularitti mentioned this pull request Oct 29, 2019

Can we have *(::Union{Regex, AbstractString, AbstractChar}...) into Compat.jl? JuliaLang/Compat.jl#672

Closed

jyjemily mentioned this pull request May 29, 2023

NEWS 수정 juliakorea/translate-doc#32

Open
