Feature: allow the user to capture groups on regex #148

Open
german1608 opened this issue Oct 9, 2019 · 4 comments
@german1608
Contributor

german1608 commented Oct 9, 2019

Currently, when Alex executes a token action, the action receives the whole match. For example:

%wrapper "basic" -- the behaviour is similar on the other wrappers
tokens :-
  \"a\"  { \s -> Token s }


{
data Token = Token String
}
-- After reading, alex would return [Token "\"a\""] for input = "\"a\""

It would be great if Alex's pattern syntax allowed capture groups, as other regex engines do:

%wrapper "basic" -- the behaviour is similar on the other wrappers
tokens :-
  \"(a)\"  { \s -> Token s }
-- After reading, alex would return [Token "a"] for the same input

Alex could also support several groups in the same token action, passing the captured groups to the action as a list of strings or something similar.
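For comparison, one workaround that is possible today is to post-process the lexeme inside the action. A minimal sketch, assuming a hypothetical helper `stripDelimiters` (not part of Alex) and a `Token` type like the one in the example above:

```haskell
data Token = Token String
  deriving (Eq, Show)

-- Hypothetical helper, not provided by Alex: drop the first and last
-- character of a lexeme, e.g. the surrounding double quotes.
stripDelimiters :: String -> String
stripDelimiters s
  | length s >= 2 = drop 1 (take (length s - 1) s)
  | otherwise     = s

-- The body of a "basic" wrapper action using the helper, i.e. the rule
--   \"a\"  { \s -> Token (stripDelimiters s) }
quotedAction :: String -> Token
quotedAction s = Token (stripDelimiters s)
```

This only works when the parts to discard have a fixed, known length, which is exactly the limitation that capture groups would remove.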

@simonmar
Member

Yes, I would really like to have this functionality. As I recall it was non-trivial to implement it (especially if we want to do it without a performance overhead if you don't use the functionality), but it would be very useful to have it.

@german1608
Contributor Author

We could try to have separate wrappers for that, something like:

%wrapper "basic-group-captor"
%wrapper "posn-group-captor"
%wrapper "monad-group-captor"
%wrapper "monadUserState-group-captor"

Or adding another directive, like:

%capturegroups

I think the last approach is better than the first one.

@andreasabel
Member

@german1608: is the automata-theoretic implementation of groups worked out somewhere?
I found only this question: https://stackoverflow.com/questions/28941425/capture-groups-using-dfa-based-linear-time-regular-expressions-possible. The answer there points out that groups introduce nondeterminism, e.g. matching

a(b)|(a)b

against ab would non-deterministically give you a or b. This may be a can of worms we do not want to open.

Without opening up the automata generation of Alex, one could imagine the following lightweight approach:

%wrapper "basic-regex-groups" 
tokens :-
  \"(a)\"  { \ r s -> Token (extractGroup r s) }
-- After reading, alex would return [Token "a"] for the same input

What Alex would do here is:

  • Ignore the groups for the sake of DFA generation and token recognition.
  • Provide you with the regex r that accepted the token s, such that you can extract the group content from s yourself. (A user-defined extractGroup could utilize a third-party library after converting r into the format required by that library.)
  • The regex r would be provided as an element of a type RExp defined in the basic-regex-groups wrapper, basically as an abstract syntax tree, see https://github.com/simonmar/alex/blob/6c4db72c3bb3419c84740e2e9ae112ed0abd572e/src/AbsSyn.hs#L198-L205.
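To make the idea concrete, here is a rough sketch of what a user-defined group extractor over such a syntax tree could look like. Everything in it is an assumption for illustration: the `RExp` type below is a simplified stand-in (single characters instead of character sets, plus an invented `Group` constructor), not Alex's actual type, and `match` and `extractGroups` are hypothetical names. It uses naive backtracking and returns every possible capture assignment, so the ambiguity discussed above shows up as more than one result.

```haskell
-- Simplified, hypothetical regex syntax tree (NOT Alex's real RExp).
data RExp
  = Eps             -- matches the empty string
  | Ch Char         -- matches one specific character
  | Seq RExp RExp   -- r1 followed by r2
  | Alt RExp RExp   -- r1 or r2
  | Star RExp       -- zero or more repetitions
  | Group RExp      -- capture group (assumed addition)

-- All ways to match a prefix of the input: the captures of the groups
-- that took part (in order of their opening parenthesis) and the
-- remaining input.
match :: RExp -> String -> [([String], String)]
match Eps s = [([], s)]
match (Ch c) (x:xs) | c == x = [([], xs)]
match (Ch _) _ = []
match (Seq r1 r2) s =
  [ (g1 ++ g2, rest2)
  | (g1, rest1) <- match r1 s
  , (g2, rest2) <- match r2 rest1 ]
match (Alt r1 r2) s = match r1 s ++ match r2 s
match (Star r) s =
  ([], s) :
  [ (g1 ++ g2, rest2)
  | (g1, rest1) <- match r s
  , length rest1 < length s          -- require progress so we terminate
  , (g2, rest2) <- match (Star r) rest1 ]
match (Group r) s =
  [ (take (length s - length rest) s : gs, rest)
  | (gs, rest) <- match r s ]

-- Capture assignments for a token that is known to match as a whole.
extractGroups :: RExp -> String -> [[String]]
extractGroups r s = [ gs | (gs, "") <- match r s ]

-- \"(a)\"
quotedA :: RExp
quotedA = Seq (Ch '"') (Seq (Group (Ch 'a')) (Ch '"'))

-- a(b)|(a)b
ambiguous :: RExp
ambiguous = Alt (Seq (Ch 'a') (Group (Ch 'b')))
                (Seq (Group (Ch 'a')) (Ch 'b'))
```

With these definitions, `extractGroups quotedA "\"a\""` yields `[["a"]]`, while `extractGroups ambiguous "ab"` yields `[["b"],["a"]]`, one result per alternative. A real implementation would have to pick a disambiguation policy (for example, first alternative wins) and would want something cheaper than exhaustive backtracking.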

This might not be as comfortable as it could be, but it is a rather lightweight and generic extension of Alex. In terms of efficiency, there is of course some duplication of work, but just extracting groups from a regexp that is known to match should be cheaper than figuring out which token to extract from the input in the first place.

@andreasabel
Member

> @german1608: is the automata-theoretic implementation of groups worked out somewhere?

There are TDFAs (Tagged Deterministic Finite Automata), e.g. https://github.com/haskell-hvr/regex-tdfa.
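As an illustration of what such a library already offers, here is a small sketch of extracting groups from an already-lexed token with regex-tdfa. It assumes the `regex-tdfa` package is installed; `extractGroupsTDFA` is a hypothetical helper name, and the pattern is written in POSIX regex syntax as a plain string, not in Alex syntax.

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: match a POSIX regex against a lexeme and return
-- the text of its capture groups (an empty list if nothing matches).
extractGroupsTDFA :: String -> String -> [String]
extractGroupsTDFA pat lexeme = groups
  where
    -- The 4-tuple result is (text before, matched text, text after, groups).
    (_, _, _, groups) =
      (lexeme =~ pat) :: (String, String, String, [String])

-- The quoted-"a" example from this issue, in POSIX syntax.
quotedGroups :: [String]
quotedGroups = extractGroupsTDFA "\"(a)\"" "\"a\""
```

Here `quotedGroups` evaluates to `["a"]`. Using this from an Alex action would still require writing the pattern a second time, or converting Alex's regex representation into the library's format.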
