Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make RegexMatch iterable #34355

Merged
merged 1 commit into from
Jan 22, 2021
Merged

Make RegexMatch iterable #34355

merged 1 commit into from
Jan 22, 2021

Conversation

oxinabox
Copy link
Contributor

Look at the tests.
Isn't this a nice feature?

In general its weird to define getindex without matching iterate.

@oxinabox
Copy link
Contributor Author

Needs News.md.

May warrent mentioning in Docs?

@DilumAluthge
Copy link
Member

I think you'll also need to define:

length(m::RegexMatch) = length(m.captures)

@tkf
Copy link
Member

tkf commented Jan 13, 2020

Also eltype(::Type{RegexMatch})?

Reading the test code, it looks like the plan is to treat RegexMatch like a Dict. If so, maybe defining keys, values, keytype, valtype and pairs makes sense?

@DilumAluthge
Copy link
Member

I thought the idea was to treat RegexMatch like an AbstractVector. After all, we are basically just forwarding the relevant methods (iterate, length, etc.) onto the captures field of RegexMatch, and captures is a Vector.

@tkf
Copy link
Member

tkf commented Jan 13, 2020

I was just looking at this new test

julia/test/regex.jl

Lines 139 to 147 in e6794e1

@testset "iterate" begin
m = match(r"(.) test (.+)", "a test 123")
@test first(m) == "a"
@test collect(m) == ["a", "123"]
@test for (i, capture) in m
i == 1 && @test capture == "a"
i == 2 && @test capture == "123"
end
end

I think Dict like interface makes sense since we already have getindex(::RegexMatch, ::Symbol) for named capture.

@oxinabox
Copy link
Contributor Author

This PR makes it iterable.
Making it Dict-like feels like a possible follow up PR.
Seems more contentious potentially.
And getting a smaller PR reviewed and in is much easier.
Otherwise get bogged down in arguing about one feature.

The bit this adds is much less debatable.

@fredrikekre fredrikekre added needs compat annotation Add !!! compat "Julia x.y" to the docstring strings "Strings!" needs news A NEWS entry is required for this change and removed needs compat annotation Add !!! compat "Julia x.y" to the docstring labels Jan 13, 2020
base/regex.jl Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jan 13, 2020

This makes sense to me although I can't say I've encountered the motivation to iterate over the captures of a single regex match. Can you give a sense of what kind of code needs this? Allowing indexing by index also seems like a fine addition to me. Why not, after all?

@oxinabox
Copy link
Contributor Author

I can't say I've encountered the motivation to iterate over the captures of a single regex match. Can you give a sense of what kind of code needs this?

I often want to write code that looks like this test:
https://github.com/JuliaLang/julia/pull/34355/files#diff-f172897babd48052d6f286028f613cb2R149-R155
where I dispatch to f(::Nothing) or f((capture1, capture2)::RegexMatch)
It basically just means I can get rid of the RegexMatch object and get down to the capture using destructuring.

Allowing indexing by index also seems like a fine addition to me.

Indexing by index? We've always allowed that.

test/regex.jl Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Member

Indexing by index? We've always allowed that.

Then what is the proposed indexing by?

@oxinabox
Copy link
Contributor Author

Indexing by index? We've always allowed that.
Then what is the proposed indexing by?

We are not adding indexing at all?
Only iterate

@StefanKarpinski
Copy link
Member

I think Dict like interface makes sense

I'm asking about that.

@StefanKarpinski
Copy link
Member

There's a bit of thinking to be done here: if a RegexMatch object is map from indices or symbols to captures, then shouldn't it iterate like that dictionary? If you want a vector of captures, you can already get that by accessing m.captures which is just that.

@oxinabox
Copy link
Contributor Author

I think Dict like interface makes sense
I'm asking about that.

Ok, so you are talking to @tkf fair enough.

From my point of view:
I don't think iterating as (capturename, capturevalue) makes sense.
It makes my motivating example of destructuring in a function call ugly.

I would have do to
f((_, capture1), (_, capture2))::RegexMatch)
Rather than f((capture1, capture2)::RegexMatch)

While I very often want the capture values to be passed to a function, I have never wanted the capture names. I always know those since i named it.

@StefanKarpinski
Copy link
Member

Why not just pass the .captures field to the function?

@oxinabox
Copy link
Contributor Author

Why not just pass the .captures field to the function?

Issue with that is can't access (nothing).captures on the nonmatch.

Dispatching into f(::Nothing), and f((capture1, capture2)::RegexMatch) gives an elegant way to solve that, while also giving them into variables.

I have a bunch of code that does;

function f(m::RegexMatch)
    capture1, capture2 = m.captures
    ...
end
f(::Nothing)=...

and I think the destructuring is much cleaner

@StefanKarpinski
Copy link
Member

Issue with that is can't access (nothing).captures on the nonmatch.

I see. That makes sense. I guess I always just write out my m !== nothing check explicitly. I think we need to figure out what the data model of a RegexMatch is, lest we end up in the situation that DataFrames of yore did, where it was fundamentally confused about whether a DataFrame was: (a) a two-dimensional collection, (b) an integer-indexed collection of rows, or (c) a name-indexed collection of columns. So what is RegexMatch? (a) a struct that contains a collection of captures, (b) a collection of captures itself, or (b) a map from names to captures?

@tkf
Copy link
Member

tkf commented Jan 13, 2020

I think a challenge for the Dict-like route is that it's not super clear what keys part should be for mixed named and unnamed captures like m = match(r"(?P<first>.) (.) (.) (?P<last>.)", "1 2 3 4"). Should keys(m) be [:first, 2, 3, :last]? [1, 2, 3, 4]? [1, 2, 3, 4, :first, :last]? [1, 2, 3, 4, :first, :last, "first", "last"]? There is also type-stability problem for output like [:first, 2, 3, :last]. So maybe Array-like route (a collection of captures itself) is the way to go?

@oxinabox
Copy link
Contributor Author

I think it is a collection of captures that is indexed by position (int) and optionally by name.
It's a bit like NamedTuple in that way.
NamedTuple is more Tuple-like than Dict-like.

@StefanKarpinski
Copy link
Member

Ok, that makes sense. I just wanted to clarify the data model. Does this complete the implementation of that data model?

@oxinabox
Copy link
Contributor Author

Does this complete the implementation of that data model?

Tuple, and NamedTuple, and Vector all overload keys.
But keys is contentious.

@oxinabox oxinabox force-pushed the ox/iterreg branch 2 times, most recently from 5cb04bd to f39528f Compare August 31, 2020 14:12
@oxinabox
Copy link
Contributor Author

To return to this PR.
I think we should do this.
I am going to make a follow up PR to add keys because that is a different question.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Sep 11, 2020
NEWS.md Outdated Show resolved Hide resolved
@omus
Copy link
Member

omus commented Oct 2, 2020

Something that isn't great about
foo((cap1, cap2)::RegexMatch) is that it's quite fragile. You need to only call foo with the results of a specific regular expression match or encounter an exception when the number of captures differs.
For example having conditional captures will break this pattern.

@oxinabox
Copy link
Contributor Author

oxinabox commented Oct 2, 2020

I don't see that this needs to solve every problem to be a good idea.
I think 90% or regexes i have ever written have exactly 1 captures and probably another 5% exactly 2.
and probably 99% of functions processing regex results have been intended to processing exactly one regex (not some collection of regex i check)
(often i use at little local closure for this).

I don't know that i have ever used conditional captures.
I think this pattern captures (pun not intended) a pretty sunbstantial number of use-cases.
And its not like we are talking away the current way of doing things.

Further for conditional capture won't you want to write multiple different methods?
I might not be understanding though,

foo((cap1,)::RegexMatch) =...
foo((cap1, cap2)::RegexMatch) = ...
foo((cap1, cap2, cap3)::RegexMatch) = ...

This will also play nice with: #2626 (comment) then we will be able to have:

foo((cap1, caps...)::RegexMatch)) = ...

@KristofferC
Copy link
Member

KristofferC commented Oct 2, 2020

I must say the use-case shown in the NEWS looks pretty forced / unnatural to me. The code in #34355 (comment) looks perfectly fine and is easy to understand while the NEWS one is not.

@omus
Copy link
Member

omus commented Oct 2, 2020

For example having conditional captures will break this pattern.

I'll need to make a correction on my part. I thought that alternative capture groups (?|...) (also known as branch reset groups) could result in a variable amount of capture groups being returned. This is not the case and I have not found an example where a single regular expression can return a variable amount of captured groups.

I think it (a RegexMatch) is a collection of captures that is indexed by position (int) and optionally by name.

This statement does seem to be correct to me. When mixing named and positional capture groups the groups are always in the order in which they were defined:

julia> match(r"((?<b>b)|(?<a>a)(\n)?)", "a")
RegexMatch("a", 1="a", b=nothing, a="a", 4=nothing)

foo((cap1,)::RegexMatch) =...
foo((cap1, cap2)::RegexMatch) = ...
foo((cap1, cap2, cap3)::RegexMatch) = ...

These all currently define the same method so foo is currently only compatible with a specific regular expression. That said, if this is a pattern we want to embrace we could parameterize RegexMatch by the capture groups (e.g. RegexMatch{(1,:b,:a,4)}) so that you could write separate methods to handle results from several regular expressions.

@oxinabox
Copy link
Contributor Author

oxinabox commented Oct 2, 2020

These all currently define the same method so foo is currently only compatible with a specific regular expression

Oh, good point. That does limit the utility of this.
I still think this is catching the majority of cases where it considering 1 regex that is either matched or not matched.


As a bit of an aside: AFAIK RegexMatch is the only object in Base that as its API recommends accessing a field of the result (Excluding NamedTuples).
and it is still weird that we have getindex defined on RegexMatch but not iterate.

@tkf
Copy link
Member

tkf commented Nov 24, 2020

As a bit of an aside: AFAIK RegexMatch is the only object in Base that as its API recommends accessing a field of the result

FYI I think another example is

Process: Wait for a process or process chain to exit. The exitcode field of a process can be used to determine success or failure.

--- https://docs.julialang.org/en/v1.6-dev/base/parallel/#Base.wait

I don't think we need to "fix" these cases, though.

@tkf tkf removed the needs news A NEWS entry is required for this change label Nov 24, 2020
@oxinabox
Copy link
Contributor Author

oxinabox commented Nov 27, 2020

I continue to think we should do this.
What is blocking making a decision on this?
Reiterating the three reasons we should do this:

  1. Matching other things that have getindex defined, this should have iterate defined.
  2. Access to the collection as an iterable without needing to access a field (which we rarely do)
  3. Destructuing dispatch on if matched or not.

@StefanKarpinski asked:

Does this complete the implementation of that data model?

Yes, combined with #37300 (merged) this implements the complete data model to act like a (partially) named-tuple.

@KristofferC said:

I must say the use-case shown in the NEWS looks pretty forced / unnatural to me.

Should I remove that example from the NEWS.md, and just simply say that regex matches now iterate to give the captures?
It is the thing I perstonally would do and what initially motivated this, but it is just one of 3 motivating reasons for this change.
It's also a bit long for the NEWS.md.

@oxinabox
Copy link
Contributor Author

oxinabox commented Jan 2, 2021

Should I remove that example from the NEWS.md, and just simply say that regex matches now iterate to give the captures?

I have now done this

test/regex.jl Outdated Show resolved Hide resolved
@oxinabox
Copy link
Contributor Author

bump

don't forget to pass the state
add length and eltype to RegexMatch
Fix missing enumerate
=update news for RegexMatch iterating

fix typo in news

Co-authored-by: Curtis Vogt <[email protected]>

remove example of possible use from NEWS.md

Remove stray at-test
@JeffBezanson JeffBezanson added merge me PR is reviewed. Merge when all tests are passing and removed triage This should be discussed on a triage call labels Jan 21, 2021
@vtjnash vtjnash merged commit 770d0d5 into JuliaLang:master Jan 22, 2021
@vtjnash vtjnash removed the merge me PR is reviewed. Merge when all tests are passing label Jan 22, 2021
@oxinabox oxinabox deleted the ox/iterreg branch January 22, 2021 09:15
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
add iterate, length, and eltype to RegexMatch

Co-authored-by: Curtis Vogt <[email protected]>
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021
add iterate, length, and eltype to RegexMatch

Co-authored-by: Curtis Vogt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants