Make RegexMatch iterable #34355

oxinabox · 2020-01-12T22:09:30Z

Look at the tests.
Isn't this a nice feature?

In general its weird to define getindex without matching iterate.

oxinabox · 2020-01-13T02:08:47Z

Needs News.md.

May warrent mentioning in Docs?

DilumAluthge · 2020-01-13T02:15:24Z

I think you'll also need to define:

length(m::RegexMatch) = length(m.captures)

tkf · 2020-01-13T04:56:31Z

Also eltype(::Type{RegexMatch})?

Reading the test code, it looks like the plan is to treat RegexMatch like a Dict. If so, maybe defining keys, values, keytype, valtype and pairs makes sense?

DilumAluthge · 2020-01-13T05:34:56Z

I thought the idea was to treat RegexMatch like an AbstractVector. After all, we are basically just forwarding the relevant methods (iterate, length, etc.) onto the captures field of RegexMatch, and captures is a Vector.

tkf · 2020-01-13T05:45:29Z

I was just looking at this new test

julia/test/regex.jl

Lines 139 to 147 in e6794e1

    
           @testset "iterate" begin 
        
               m = match(r"(.) test (.+)", "a test 123") 
        
               @test first(m) == "a" 
        
               @test collect(m) == ["a", "123"] 
        
               @test for (i, capture) in m 
        
                   i == 1 && @test capture == "a" 
        
                   i == 2 && @test capture == "123" 
        
               end 
        
           end

I think Dict like interface makes sense since we already have getindex(::RegexMatch, ::Symbol) for named capture.

oxinabox · 2020-01-13T09:40:42Z

This PR makes it iterable.
Making it Dict-like feels like a possible follow up PR.
Seems more contentious potentially.
And getting a smaller PR reviewed and in is much easier.
Otherwise get bogged down in arguing about one feature.

The bit this adds is much less debatable.

base/regex.jl

StefanKarpinski · 2020-01-13T17:23:33Z

This makes sense to me although I can't say I've encountered the motivation to iterate over the captures of a single regex match. Can you give a sense of what kind of code needs this? Allowing indexing by index also seems like a fine addition to me. Why not, after all?

oxinabox · 2020-01-13T17:44:27Z

I can't say I've encountered the motivation to iterate over the captures of a single regex match. Can you give a sense of what kind of code needs this?

I often want to write code that looks like this test:
https://github.com/JuliaLang/julia/pull/34355/files#diff-f172897babd48052d6f286028f613cb2R149-R155
where I dispatch to f(::Nothing) or f((capture1, capture2)::RegexMatch)
It basically just means I can get rid of the RegexMatch object and get down to the capture using destructuring.

Allowing indexing by index also seems like a fine addition to me.

Indexing by index? We've always allowed that.

test/regex.jl

StefanKarpinski · 2020-01-13T18:10:39Z

Indexing by index? We've always allowed that.

Then what is the proposed indexing by?

oxinabox · 2020-01-13T18:40:20Z

Indexing by index? We've always allowed that.
Then what is the proposed indexing by?

We are not adding indexing at all?
Only iterate

StefanKarpinski · 2020-01-13T18:48:16Z

I think Dict like interface makes sense

I'm asking about that.

StefanKarpinski · 2020-01-13T18:51:59Z

There's a bit of thinking to be done here: if a RegexMatch object is map from indices or symbols to captures, then shouldn't it iterate like that dictionary? If you want a vector of captures, you can already get that by accessing m.captures which is just that.

oxinabox · 2020-01-13T18:54:23Z

I think Dict like interface makes sense
I'm asking about that.

Ok, so you are talking to @tkf fair enough.

From my point of view:
I don't think iterating as (capturename, capturevalue) makes sense.
It makes my motivating example of destructuring in a function call ugly.

I would have do to
f((_, capture1), (_, capture2))::RegexMatch)
Rather than f((capture1, capture2)::RegexMatch)

While I very often want the capture values to be passed to a function, I have never wanted the capture names. I always know those since i named it.

StefanKarpinski · 2020-01-13T18:55:20Z

Why not just pass the .captures field to the function?

oxinabox · 2020-01-13T18:58:38Z

Why not just pass the .captures field to the function?

Issue with that is can't access (nothing).captures on the nonmatch.

Dispatching into f(::Nothing), and f((capture1, capture2)::RegexMatch) gives an elegant way to solve that, while also giving them into variables.

I have a bunch of code that does;

function f(m::RegexMatch)
    capture1, capture2 = m.captures
    ...
end
f(::Nothing)=...

and I think the destructuring is much cleaner

StefanKarpinski · 2020-01-13T20:09:29Z

Issue with that is can't access (nothing).captures on the nonmatch.

I see. That makes sense. I guess I always just write out my m !== nothing check explicitly. I think we need to figure out what the data model of a RegexMatch is, lest we end up in the situation that DataFrames of yore did, where it was fundamentally confused about whether a DataFrame was: (a) a two-dimensional collection, (b) an integer-indexed collection of rows, or (c) a name-indexed collection of columns. So what is RegexMatch? (a) a struct that contains a collection of captures, (b) a collection of captures itself, or (b) a map from names to captures?

tkf · 2020-01-13T22:41:50Z

I think a challenge for the Dict-like route is that it's not super clear what keys part should be for mixed named and unnamed captures like m = match(r"(?P<first>.) (.) (.) (?P<last>.)", "1 2 3 4"). Should keys(m) be [:first, 2, 3, :last]? [1, 2, 3, 4]? [1, 2, 3, 4, :first, :last]? [1, 2, 3, 4, :first, :last, "first", "last"]? There is also type-stability problem for output like [:first, 2, 3, :last]. So maybe Array-like route (a collection of captures itself) is the way to go?

oxinabox · 2020-01-13T22:53:06Z

I think it is a collection of captures that is indexed by position (int) and optionally by name.
It's a bit like NamedTuple in that way.
NamedTuple is more Tuple-like than Dict-like.

StefanKarpinski · 2020-01-14T18:28:17Z

Ok, that makes sense. I just wanted to clarify the data model. Does this complete the implementation of that data model?

oxinabox · 2020-01-14T18:37:58Z

Does this complete the implementation of that data model?

Tuple, and NamedTuple, and Vector all overload keys.
But keys is contentious.

oxinabox · 2020-08-31T14:13:28Z

To return to this PR.
I think we should do this.
I am going to make a follow up PR to add keys because that is a different question.

NEWS.md

omus · 2020-10-02T03:20:48Z

Something that isn't great about
foo((cap1, cap2)::RegexMatch) is that it's quite fragile. You need to only call foo with the results of a specific regular expression match or encounter an exception when the number of captures differs.
For example having conditional captures will break this pattern.

oxinabox · 2020-10-02T10:41:26Z

I don't see that this needs to solve every problem to be a good idea.
I think 90% or regexes i have ever written have exactly 1 captures and probably another 5% exactly 2.
and probably 99% of functions processing regex results have been intended to processing exactly one regex (not some collection of regex i check)
(often i use at little local closure for this).

I don't know that i have ever used conditional captures.
I think this pattern captures (pun not intended) a pretty sunbstantial number of use-cases.
And its not like we are talking away the current way of doing things.

Further for conditional capture won't you want to write multiple different methods?
I might not be understanding though,

foo((cap1,)::RegexMatch) =...
foo((cap1, cap2)::RegexMatch) = ...
foo((cap1, cap2, cap3)::RegexMatch) = ...

This will also play nice with: #2626 (comment) then we will be able to have:

foo((cap1, caps...)::RegexMatch)) = ...

KristofferC · 2020-10-02T10:55:58Z

I must say the use-case shown in the NEWS looks pretty forced / unnatural to me. The code in #34355 (comment) looks perfectly fine and is easy to understand while the NEWS one is not.

omus · 2020-10-02T13:57:16Z

For example having conditional captures will break this pattern.

I'll need to make a correction on my part. I thought that alternative capture groups (?|...) (also known as branch reset groups) could result in a variable amount of capture groups being returned. This is not the case and I have not found an example where a single regular expression can return a variable amount of captured groups.

I think it (a RegexMatch) is a collection of captures that is indexed by position (int) and optionally by name.

This statement does seem to be correct to me. When mixing named and positional capture groups the groups are always in the order in which they were defined:

julia> match(r"((?<b>b)|(?<a>a)(\n)?)", "a")
RegexMatch("a", 1="a", b=nothing, a="a", 4=nothing)

foo((cap1,)::RegexMatch) =...
foo((cap1, cap2)::RegexMatch) = ...
foo((cap1, cap2, cap3)::RegexMatch) = ...

These all currently define the same method so foo is currently only compatible with a specific regular expression. That said, if this is a pattern we want to embrace we could parameterize RegexMatch by the capture groups (e.g. RegexMatch{(1,:b,:a,4)}) so that you could write separate methods to handle results from several regular expressions.

oxinabox · 2020-10-02T14:36:41Z

These all currently define the same method so foo is currently only compatible with a specific regular expression

Oh, good point. That does limit the utility of this.
I still think this is catching the majority of cases where it considering 1 regex that is either matched or not matched.

As a bit of an aside: AFAIK RegexMatch is the only object in Base that as its API recommends accessing a field of the result (Excluding NamedTuples).
and it is still weird that we have getindex defined on RegexMatch but not iterate.

tkf · 2020-11-24T22:38:26Z

As a bit of an aside: AFAIK RegexMatch is the only object in Base that as its API recommends accessing a field of the result

FYI I think another example is

Process: Wait for a process or process chain to exit. The exitcode field of a process can be used to determine success or failure.

--- https://docs.julialang.org/en/v1.6-dev/base/parallel/#Base.wait

I don't think we need to "fix" these cases, though.

oxinabox · 2020-11-27T17:03:29Z

I continue to think we should do this.
What is blocking making a decision on this?
Reiterating the three reasons we should do this:

Matching other things that have getindex defined, this should have iterate defined.
Access to the collection as an iterable without needing to access a field (which we rarely do)
Destructuing dispatch on if matched or not.

@StefanKarpinski asked:

Does this complete the implementation of that data model?

Yes, combined with #37300 (merged) this implements the complete data model to act like a (partially) named-tuple.

@KristofferC said:

I must say the use-case shown in the NEWS looks pretty forced / unnatural to me.

Should I remove that example from the NEWS.md, and just simply say that regex matches now iterate to give the captures?
It is the thing I perstonally would do and what initially motivated this, but it is just one of 3 motivating reasons for this change.
It's also a bit long for the NEWS.md.

oxinabox · 2021-01-02T21:12:27Z

Should I remove that example from the NEWS.md, and just simply say that regex matches now iterate to give the captures?

I have now done this

test/regex.jl

oxinabox · 2021-01-19T11:52:14Z

bump

don't forget to pass the state add length and eltype to RegexMatch Fix missing enumerate =update news for RegexMatch iterating fix typo in news Co-authored-by: Curtis Vogt <[email protected]> remove example of possible use from NEWS.md Remove stray at-test

add iterate, length, and eltype to RegexMatch Co-authored-by: Curtis Vogt <[email protected]>

fredrikekre added needs compat annotation Add !!! compat "Julia x.y" to the docstring strings "Strings!" needs news A NEWS entry is required for this change and removed needs compat annotation Add !!! compat "Julia x.y" to the docstring labels Jan 13, 2020

oxinabox commented Jan 13, 2020

View reviewed changes

base/regex.jl Outdated Show resolved Hide resolved

StefanKarpinski reviewed Jan 13, 2020

View reviewed changes

test/regex.jl Outdated Show resolved Hide resolved

oxinabox force-pushed the ox/iterreg branch 2 times, most recently from 5cb04bd to f39528f Compare August 31, 2020 14:12

oxinabox mentioned this pull request Aug 31, 2020

define keys on RegexMatch #37299

Merged

oxinabox mentioned this pull request Aug 31, 2020

make haskey with Integers work with RegexMatch #37300

Merged

StefanKarpinski added the triage This should be discussed on a triage call label Sep 11, 2020

omus reviewed Oct 1, 2020

View reviewed changes

NEWS.md Outdated Show resolved Hide resolved

oxinabox force-pushed the ox/iterreg branch from 6ef067e to 9f2e060 Compare October 2, 2020 10:44

oxinabox force-pushed the ox/iterreg branch from 060bf3e to 5862f93 Compare November 13, 2020 12:33

tkf removed the needs news A NEWS entry is required for this change label Nov 24, 2020

oxinabox force-pushed the ox/iterreg branch from 5862f93 to ebf2d1a Compare November 27, 2020 17:13

oxinabox force-pushed the ox/iterreg branch from ebf2d1a to 955c9ba Compare January 2, 2021 21:08

oxinabox commented Jan 3, 2021

View reviewed changes

test/regex.jl Outdated Show resolved Hide resolved

oxinabox force-pushed the ox/iterreg branch from d8770c5 to c8f18a6 Compare January 19, 2021 11:51

JeffBezanson approved these changes Jan 21, 2021

View reviewed changes

Make RegexMatch iterable

951a389

don't forget to pass the state add length and eltype to RegexMatch Fix missing enumerate =update news for RegexMatch iterating fix typo in news Co-authored-by: Curtis Vogt <[email protected]> remove example of possible use from NEWS.md Remove stray at-test

oxinabox force-pushed the ox/iterreg branch from c8f18a6 to 951a389 Compare January 21, 2021 17:51

JeffBezanson added merge me PR is reviewed. Merge when all tests are passing and removed triage This should be discussed on a triage call labels Jan 21, 2021

vtjnash merged commit 770d0d5 into JuliaLang:master Jan 22, 2021

vtjnash removed the merge me PR is reviewed. Merge when all tests are passing label Jan 22, 2021

oxinabox deleted the ox/iterreg branch January 22, 2021 09:15

ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021

Make RegexMatch iterable (JuliaLang#34355)

ddcf03b

add iterate, length, and eltype to RegexMatch Co-authored-by: Curtis Vogt <[email protected]>

antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021

Make RegexMatch iterable (JuliaLang#34355)

82dbb10

add iterate, length, and eltype to RegexMatch Co-authored-by: Curtis Vogt <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make RegexMatch iterable #34355

Make RegexMatch iterable #34355

oxinabox commented Jan 12, 2020

oxinabox commented Jan 13, 2020

DilumAluthge commented Jan 13, 2020

tkf commented Jan 13, 2020

DilumAluthge commented Jan 13, 2020

tkf commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020 •

edited

Loading

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

tkf commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 14, 2020

oxinabox commented Jan 14, 2020

oxinabox commented Aug 31, 2020

omus commented Oct 2, 2020

oxinabox commented Oct 2, 2020 •

edited

Loading

KristofferC commented Oct 2, 2020 •

edited

Loading

omus commented Oct 2, 2020

oxinabox commented Oct 2, 2020 •

edited

Loading

tkf commented Nov 24, 2020

oxinabox commented Nov 27, 2020 •

edited

Loading

oxinabox commented Jan 2, 2021

oxinabox commented Jan 19, 2021

Make RegexMatch iterable #34355

Make RegexMatch iterable #34355

Conversation

oxinabox commented Jan 12, 2020

oxinabox commented Jan 13, 2020

DilumAluthge commented Jan 13, 2020

tkf commented Jan 13, 2020

DilumAluthge commented Jan 13, 2020

tkf commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020 • edited Loading

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 13, 2020

tkf commented Jan 13, 2020

oxinabox commented Jan 13, 2020

StefanKarpinski commented Jan 14, 2020

oxinabox commented Jan 14, 2020

oxinabox commented Aug 31, 2020

omus commented Oct 2, 2020

oxinabox commented Oct 2, 2020 • edited Loading

KristofferC commented Oct 2, 2020 • edited Loading

omus commented Oct 2, 2020

oxinabox commented Oct 2, 2020 • edited Loading

tkf commented Nov 24, 2020

oxinabox commented Nov 27, 2020 • edited Loading

oxinabox commented Jan 2, 2021

oxinabox commented Jan 19, 2021

StefanKarpinski commented Jan 13, 2020 •

edited

Loading

oxinabox commented Oct 2, 2020 •

edited

Loading

KristofferC commented Oct 2, 2020 •

edited

Loading

oxinabox commented Oct 2, 2020 •

edited

Loading

oxinabox commented Nov 27, 2020 •

edited

Loading