add *(::Union{Regex, AbstractString, AbstractChar}...) #23422

Merged
merged 10 commits into from
Apr 29, 2019

Conversation

rfourquet
Member

@rfourquet rfourquet commented Aug 24, 2017

I needed it recently and got frustrated that it doesn't work:

julia> const width = 7
julia> replace(line, r"^" * ' '^width, "") # ok, doesn't work, have to look up how to do repetition in PCRE...
julia> replace(line, r"^ {$width}", "") # ah crap, cannot do string interpolation...
julia> replace(line, r"^ {7}", "") # NOTE: this 7 corresponds to `width`, don't forget to update it when needed

So yes there is a way to achieve what is needed with the Regex constructor, but allowing * is more regex-noobie friendly.

I don't know if it's a good idea, but don't foresee a problem with it; note that I didn't go so far as making Regex <: AbstractString, as I didn't expect to get support for that ;-)
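
For reference, the `Regex` constructor route mentioned above can be sketched as follows. This is only an illustration: `line` is a made-up sample value, and the pair form of `replace` assumes Julia 1.0 or later rather than the three-argument form shown in the snippets above.

```julia
# Build the pattern from an ordinary string, where `$width` interpolation works.
width = 7
line = " "^width * "indented text"   # hypothetical sample input

pat = Regex("^ {$width}")            # same pattern text as r"^ {7}"
stripped = replace(line, pat => "")  # pair syntax, Julia >= 1.0
```

The constructor works, but the count still has to be spelled in PCRE's `{n}` syntax, which is the friction this PR is about.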

@rfourquet rfourquet added needs docs Documentation for this change is required needs tests Unit tests are required for this change strings "Strings!" labels Aug 24, 2017
@waldyrious
Contributor

allowing * is more noobie friendly.

I assume you mean "regex noobie", because r"^" * ' '^7 looks almost like obfuscated code to me. IMHO the second option (r"^ {$width}") is the most readable, if that's a possibility.

@rfourquet
Member Author

Yes, you are right, I edited it to "regex-noobie friendly"! The option you find more readable would require more work, and since $ has a special meaning in regexes it may be subtle to implement.

@StefanKarpinski
Member

StefanKarpinski commented Aug 24, 2017

Related: regular languages with * as concatenation of patterns and + as alternation of patterns (i.e. | in regex syntax) form a semiring. In fact this is the original Kleene algebra – i.e. a semiring with the additional requirement that + is idempotent (a + a = a for all a). This naturally extends to classes of equivalent regular expressions and strings are embedded in this space as wildcard-free patterns (the string as a literal pattern).

In other words, extending * to work on regular expressions, and on mixtures of regexes and strings, is a completely natural extension of how * works on strings already. It would also make sense to define + on strings and regexes to compute the alternation of its arguments. That said, I'm a bit reluctant to start adding this kind of functionality at this point since once we start doing this, I feel like we should do it thoroughly – i.e. implement the entire Kleene algebra. There would definitely be some people who would not be thrilled about "foo" + "bar" producing re"(foo|bar)" 😬.

@rfourquet
Member Author

Someone else had a similar observation:

In linear algebra, there is a unary operator, the adjoint operator, often denoted by *. In Julia as well as Matlab, the adjoint operator is given by a single quote ('), since * is generally used for multiplication. So I would propose string operators in Julia to be (*,+,') for concatenation, alternation, and Kleene star respectively.

In this PR I clearly don't intend to go beyond multiplication: is it still too early to decide to do that, or shall I add tests and documentation?

@StefanKarpinski
Member

Go for it!

@rfourquet rfourquet changed the title RFC: add *(::Union{Regex, AbstractString, Char}...) add *(::Union{Regex, AbstractString, Char}...) Apr 2, 2019
@rfourquet
Member Author

rfourquet commented Apr 2, 2019

Freshly rebased and documented, ready for reviews :)

@rfourquet rfourquet removed needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Apr 2, 2019
Member

@StefanKarpinski StefanKarpinski left a comment

My immediate instinct is to worry about composing expressions via string operations. And indeed, I think there are some quite broken cases here. E.g. r"this|that" * r"cat|hat" would result in r"this|thatcat|hat", which is not the correct juxtaposition at all. Similarly, for repetition, r"this|that"^2 results in r"this|thatthis|that", which is also wrong. I think that if you unconditionally wrap regex patterns in non-capturing subpatterns, this might be correct. So you'd produce r"(?:this|that)(?:cat|hat)" for the first example. For repetition, I think it's wasteful to textually repeat the regex pattern since regular expressions already support repetition as a language feature. So for the second case, I would instead generate r"(?:this|that){2}".
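
The precedence problem described here can be reproduced with plain string operations on the patterns. The following is a small sketch built with the `Regex` constructor only; the variable names are illustrative and it does not use the `*` method under review.

```julia
a, b = r"this|that", r"cat|hat"

# Naive textual juxtaposition: `|` has the lowest precedence, so the
# alternatives become "this", "thatcat" and "hat".
naive = Regex(a.pattern * b.pattern)

# Wrapping each operand in a non-capturing group keeps it atomic.
grouped = Regex("(?:$(a.pattern))(?:$(b.pattern))")

# Repetition as a quantifier on the group instead of textual duplication.
twice = Regex("(?:$(a.pattern)){2}")

naive_hit   = match(naive, "thiscat").match    # only the first alternative matches
grouped_hit = match(grouped, "thiscat").match  # the whole input matches
twice_hit   = match(twice, "thisthat").match
```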

Other features that we may want to consider while thinking about such an API expansion are:

  • alternation of regexes and even strings, producing regexes
  • repetition by a range of indices to get between min and max number of repetitions
  • an operator/function for Kleene star?

For example, we could use + for alternation since that makes * and + a semiring on regexes with strings embedded in them as "simple" patterns. We could also use something like p^(a:b) to generate r"(?:$p){$a,$b}" if you follow my pseudo-syntax here.
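
As a rough illustration of the alternation and range-repetition ideas, here is a sketch using two hypothetical helper functions, `alternate` and `repeat_range`. Neither exists in Base, no operators are overloaded, and regex options are ignored for simplicity.

```julia
# Hypothetical helpers, not part of Base; options of the inputs are dropped.

# Alternation: each operand is grouped, then the groups are joined with `|`.
alternate(rs::Regex...) =
    Regex("(?:" * join(("(?:$(r.pattern))" for r in rs), "|") * ")")

# Repetition between first(n) and last(n) times, as in p^(a:b).
repeat_range(p::Regex, n::UnitRange{Int}) =
    Regex("(?:$(p.pattern)){$(first(n)),$(last(n))}")

foo_or_bar = alternate(r"foo", r"bar")
ab_2_to_3  = repeat_range(r"ab", 2:3)
```

Keeping these as named functions avoids the question raised above of whether `+` on strings should produce a regex.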

base/regex.jl Outdated
function *(r1::Union{Regex,AbstractString,AbstractChar}, rs::Union{Regex,AbstractString,AbstractChar}...)
    opts = unique((r.compile_options, r.match_options) for r in (r1, rs...) if r isa Regex)
    length(opts) == 1 ||
        throw(ArgumentError("cannot multiply regexes with incompatible options"))
Member

This could be supported with internal option setting, probably best combined with a non-capturing subpattern. So something like this:

"(?sm:$(r1.pattern))(?i:$(r2.pattern))"

This could work uniformly, but it would be nice to avoid the extra line noise in the presumably common case where the options match. You could get even fancier and keep options that are shared by all of the regexes in the result regex's global options while only putting the options that are specific to each subpattern in its option set.

@StefanKarpinski
Member

StefanKarpinski commented Apr 2, 2019

So overall todos here are:

  • wrap every pattern in non-capturing groups (?$opts:$pattern) before doing operations
  • for concatenation, collect shared options in the resulting regex's global options
  • put non-shared options in the opts part of each component's non-capturing group
  • use (?:$pat){$rep} notation for repetition of regexes
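
The effect of these todos can be sketched with hand-written patterns. This only shows what the target patterns would look like under PCRE's inline option syntax; it is not the PR's implementation, and the variable names are illustrative.

```julia
# r"a"i followed by r"b": only the first operand is case-insensitive,
# so the flag goes inside its group rather than on the whole regex.
mixed = Regex("(?i:a)(?:b)")

# Both operands share `i`: the flag can be hoisted to the result's options.
shared = Regex("(?:a)(?:b)", "i")

# Repetition expressed as a quantifier on the wrapped pattern.
rep = Regex("(?i:a){3}")
```

With `mixed`, the input "Ab" matches but "AB" does not, because the `i` flag is scoped to the first group only.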

@rfourquet
Member Author

Thanks so much for taking the time to set out the logic, and for linking to documentation! I hadn't even started to think about this sophistication, and didn't know the PCRE features you indicated to implement the logic.
(I may have invented a new way to learn PCRE: make a PR touching regexes and wait for feedback 😉)

So I implemented this, but with the limitation that only the four options corresponding to "imsx" can be different between the multiplied regexes (IIUC, these are the only ones which can be indicated "internally").

Concerning the extension of the API, I unfortunately have nothing to contribute to your ideas, but it makes sense to me and I would be happy to implement them. For a Kleene star operator, it is unfortunate that * is already taken (and similarly for the "one or more" operation and +).

base/regex.jl Outdated
# Examples
```jldoctest
julia> r"Hello " * "world"
r"Hello world"
```
Member Author

Oops, forgot to update the docstrings, will do next time I push. Does it need more explanation? ("concatenate regexes" is quite minimal).

Member

@StefanKarpinski StefanKarpinski left a comment

The regex composition stuff is looking good now but I think the string embedding into regular expressions is still wrong: strings and characters need to be embedded as patterns which match only that literal string or character.

test/regex.jl Outdated
end

@test r"a"i * r"b"i == r"(?:a)(?:b)"i
@test r"a"i * "b" == r"(?:a)(?:b)"i
Member

Should this really work like this? I would expect the string "b" to only match itself?

base/regex.jl Outdated
end

unwrap_string(r::Regex, unshared::UInt32) = string("(?", regex_opts_str(r.compile_options & unshared), ':', r.pattern, ')')
unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = string("(?:", s, ')')
Member

So I think this is not how we want this to work: I think that a string or character should be converted to a regex that only matches that exact string or character, not to a regex with that string or character as its content. Also, I think there's a case to be made for making this a method of convert. For a clearer example where it matters, should r"a|b" * "c|d" produce r"(?:a|b)(?:c|d)" which would match ac, ad, bc and bd or should it produce something like r"(?:a|b)(?:c\|d)" which would match ac|d and bc|d? I think it has to be the latter.

Member Author

OK, actually while you were reviewing I pushed a new version, unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = s, but it has the same problem. I hadn't thought about this problem, and I don't see how to address it generally. It's easy enough to quote |, but I don't know whether it would be enough to quote a list of special characters ('{', '+', etc.). Plus, it would be time-consuming.

Member Author

OK, there seems to be a solution after all: enclose the string between "\Q" and "\E". Remains to decide which behavior is the best, I don't have a strong opinion. But it's true that if one wants the regex interpretation of characters, it's easy enough to wrap the string in a call to Regex.
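
The two candidate behaviors for the r"a|b" * "c|d" example can be compared side by side. This is a sketch using the `Regex` constructor directly; `as_regex` and `as_literal` are illustrative names, and the quoting shown assumes the string contains no `\E`.

```julia
lhs = r"a|b"
rhs = "c|d"   # meant literally, not as an alternation

# rhs spliced in as pattern text: behaves like (?:a|b)(?:c|d)
as_regex = Regex("(?:$(lhs.pattern))(?:$rhs)")

# rhs quoted with \Q...\E: matches only the exact characters "c|d"
as_literal = Regex("(?:$(lhs.pattern))\\Q$(rhs)\\E")
```

With `as_literal`, "bc|d" matches but "ad" does not, which is the behavior argued for in the review.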

Member

@StefanKarpinski StefanKarpinski left a comment

Meant to submit that review as "request changes" but picked the wrong radio button.

@StefanKarpinski
Member

In full generality, this requires being able to correctly regex-quote strings, which is definitely doable, but not entirely straightforward, I suspect. It may be easier to just skip that part of this change and keep it focused on the pure regex bit.

@rfourquet
Member Author

rfourquet commented Apr 3, 2019

skip that part of this change and keep it focused on the pure regex bit.

So allow only multiplication of regexes, not regexes + strings + chars?
EDIT: I'm fine with that, although my use case involved precisely this mixed multiplication.

@StefanKarpinski
Member

You can certainly include that too, but does it make sense to you that it should treat strings as a regex that only matches that string? If you’re cool with implementing that, I have no objection.

@rfourquet
Member Author

I don't know if you saw my comment above, but there seems to be a trivial solution, using quoting between "\Q" and "\E". So I updated the PR with this.

does it make sense to you that it should treat strings as a regex that only matches that string?

It sure makes sense, but as I said I don't have a strong opinion and anyway it (my opinion) shouldn't matter as I didn't do much regexing in my life. Your call!

Member

@StefanKarpinski StefanKarpinski left a comment

This is getting there. I think if you just handle \E in strings we should be good.

base/regex.jl Outdated
@@ -559,7 +571,7 @@ end
 *(r::Regex) = r # avoids wrapping r in a useless subpattern

 unwrap_string(r::Regex, unshared::UInt32) = string("(?", regex_opts_str(r.compile_options & unshared), ':', r.pattern, ')')
-unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = s # no need to wrap in subpattern
+unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = string("\\Q", s, "\\E")
Member

Ok, you're going to want to kill me but what if s contains \E?

Member Author

@rfourquet rfourquet Apr 5, 2019

Hahaha! actually, I should have known better with my background in cybersecurity (cf. injection attacks) ;-)
EDIT: I'm very curious whether you'll find a flaw in my solution!
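
The thread does not show the fix itself, so the following is only one possible way to make `\Q...\E` quoting robust against an embedded `\E`, not necessarily what was pushed. `quote_literal` is a hypothetical helper name.

```julia
# Hypothetical helper: wrap `s` in \Q...\E so PCRE treats it literally.
# Each "\E" inside `s` is split so that its backslash ends one quoted run
# and its 'E' begins the next one.
quote_literal(s::AbstractString) =
    string("\\Q", replace(s, raw"\E" => raw"\\E\QE"), "\\E")

tricky = raw"a\Eb|c"   # contains both \E and a metacharacter

# Anchored so that only the exact literal can match.
safe = Regex("^" * quote_literal(tricky) * "\$")
```

For `tricky`, the quoted pattern is `\Qa\\E\QEb|c\E`: the first run matches the characters `a\`, and the second run matches `Eb|c`.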

@StefanKarpinski
Member

This looks good to me now. We’re halfway to having a full algebra on regular expressions!
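
With the behavior agreed on in this thread, the motivating example from the PR description can be written directly. This sketch assumes a Julia release that includes this change; `line` is a made-up sample value.

```julia
width = 7
line = " "^width * "indented text"   # hypothetical sample input

# Regex * String: the string operand is matched literally.
pat = r"^" * ' '^width
stripped = replace(line, pat => "")
```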

@rfourquet rfourquet changed the title add *(::Union{Regex, AbstractString, Char}...) add *(::Union{Regex, AbstractString, AbstractChar}...) Apr 25, 2019
@rfourquet rfourquet closed this Apr 26, 2019
@rfourquet rfourquet reopened this Apr 26, 2019
@rfourquet
Member Author

So the intersection of the failing tests between this CI run and the previous one is empty (previous failures: travis, tester_macos64, tester_linux32), so I will assume they are not related.

I will merge soon if no objection.

@StefanKarpinski
Member

StefanKarpinski commented Apr 29, 2019

Go for it! Squash or not as appropriate. Edit: I decided to just go for it 😁

@StefanKarpinski StefanKarpinski merged commit 0140ce8 into master Apr 29, 2019
@StefanKarpinski StefanKarpinski deleted the rf/regexp-concat branch April 29, 2019 20:43
@@ -24,7 +24,7 @@ New library functions
Standard library changes
------------------------

* Cmd interpolation (``` `$(x::Cmd) a b c` ``` where) now propagates `x`'s process flags (environment, flags, working directory, etc) if `x` is the first interpolant and errors otherwise ([#24353]).
Member

Bad rebase?

Member

@StefanKarpinski StefanKarpinski Apr 29, 2019

Good catch. Fixed in #31874.

Member Author

Ooops, sorry about that!

Labels
strings "Strings!"
4 participants