TDFA-based submatch extraction #1024

skvadrik · 2023-07-05T21:03:10Z

Hi @BurntSushi! I read your interesting blog post https://blog.burntsushi.net/regex-internals, and I thought I'd mention another possible future optimization here: you can have fast submatch extraction on a DFA, not only on an NFA. The algorithm is called TDFA (Tagged DFA, originally invented by Laurikari); it is described on Wikipedia and, more rigorously, in the paper. The good thing about TDFA is that, as the number of submatch groups approaches zero, it degenerates into a simple DFA without any added overhead (so you can use it in place of the DFA algorithm rather than as a new one alongside it). Apologies if you are already familiar with this.
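To illustrate the idea, here is a minimal hand-built sketch in Python. It is not taken from the paper or from re2c: the transition table, the tag names `t1`/`t2` and the function `tdfa_match` are hypothetical, and the regex `a*(b*)c` was chosen because it is unambiguous, so one register per tag is enough. A general TDFA construction also needs several registers per tag and copy operations between them, which this sketch leaves out.

```python
from typing import Dict, Optional, Tuple

# Hand-written tagged DFA for the anchored regex a*(b*)c.
# t1 marks the start of the group, t2 marks its end.
# Each entry maps (state, char) to (next state, tags to set to the
# current position before the char is consumed).
TDFA: Dict[Tuple[int, str], Tuple[int, Tuple[str, ...]]] = {
    (0, "a"): (0, ()),
    (0, "b"): (1, ("t1",)),
    (0, "c"): (2, ("t1", "t2")),  # empty group: both tags at the same position
    (1, "b"): (1, ()),
    (1, "c"): (2, ("t2",)),
}
FINAL_STATES = {2}


def tdfa_match(text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) offsets of the group, or None if no match."""
    state = 0
    registers = {"t1": -1, "t2": -1}
    for pos, ch in enumerate(text):
        step = TDFA.get((state, ch))
        if step is None:
            return None  # no transition: the input is rejected
        state, tags = step
        for tag in tags:
            registers[tag] = pos  # the only work beyond a plain DFA step
    if state not in FINAL_STATES:
        return None
    return registers["t1"], registers["t2"]


print(tdfa_match("aabbc"))  # (2, 4), i.e. the group matched "bb"
```

If every tag tuple in the table is empty, the inner `for tag in tags` loop never runs and the matcher is an ordinary DFA loop, which is the degenerate case described above.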

BurntSushi · 2023-07-05T21:17:30Z

Maintainer

Hiya! Big fan of your re2c work. :-) I've been familiar with it for a while and I read your original paper on TDFAs a few years ago. I actually mentioned tagged DFAs in the blog, but you can be forgiven for missing it given the length of the blog haha.

I don't have the TDFA inner workings paged into cache, but the main issue is that I can't really use fully compiled DFAs (never mind TDFAs) in this crate. The blog outlines some very limited cases where they're used, but for example, they aren't even used by default. You have to opt into them. So it's possible TDFAs could slot into there somehow, but it would be very limited in its applicability given that I try to keep a reasonable limit on regex compile times and heap usage.

Now, if TDFAs can be lazily built just like the lazy DFA in this crate, then that could be a different story. It almost looks like your 2022 paper describes exactly this (under the "JIT" vernacular), but I haven't read that paper yet.

Popping up a level, I just spent 3 years working on regex-automata, so I'm not likely to dig into TDFAs any time soon. I have other projects I want to work on, and as far as regex engines go, I'm most interested in exploring Glushkov automata and bit-parallel NFAs. Specifically targeting the multi-pattern use case. TDFAs probably aren't going to help, because the multi-pattern problem is one of scale and DFAs don't scale.

skvadrik · 2023-07-05T21:43:02Z

Author

> I actually mentioned tagged DFAs in the blog

Oh sorry, I should learn to read. :)

> So it's possible TDFAs could slot into there somehow, but it would be very limited in its applicability given that I try to keep a reasonable limit on regex compile times and heap usage.

Sure, using TDFA would only make sense if DFAs were used much.

> Now, if TDFAs can be lazily built just like the lazy DFA in this crate, then that could be a different story. It almost looks like your 2022 paper describes exactly this (under the "JIT" vernacular), but I haven't read that paper yet.

It's not really lazy TDFA construction (so don't waste your time deciphering the paper). I think it would be possible to adapt multi-pass TDFA to lazy construction, but what the paper means by "JIT" is doing determinization at run-time (which is always the case in a regexp library, as opposed to a lexer generator like re2c that does determinization at compile time).

> Popping up a level, I just spent 3 years working on regex-automata, so I'm not likely to dig into TDFAs any time soon.

Totally understand! Exploring Glushkov automata sounds interesting. My personal favorite read on this subject is https://cs.nyu.edu/~mohri/postscript/glush.pdf. Thanks for your reply!

1 reply

BurntSushi Jul 5, 2023
Maintainer

> but what the paper means by "JIT" is doing determinization at run-time

So I guess this is what I think the lazy DFA is doing. Is there some difference here that is eluding me? I think your reference to adapting multi-pass TDFA to lazy construction is zipping over my head.

> Totally understand! Exploring Glushkov automata sounds interesting. My personal favorite read on this subject is https://cs.nyu.edu/~mohri/postscript/glush.pdf. Thanks for your reply!

Ooo thank you!

skvadrik · 2023-07-05T22:38:40Z

Author

> So I guess this is what I think the lazy DFA is doing. Is there some difference here that is eluding me? I think your reference to multi-pass TDFA to lazy construction is zipping over my head.

The algorithms I tested in the paper for TDFA and multipass-TDFA work like your "full DFA", in that they construct the full TDFA at regcomp time and then execute it on a string at regexec time. Now that I think about it, it seems trivial to change them both to construct only the necessary part of the TDFA in the presence of the input string (at regexec time). I just don't have a ready implementation or benchmarks for it.

skvadrik · 2023-07-06T14:56:46Z

Author

To clarify, when I wrote:

> but what the paper means by "JIT" is doing determinization at run-time

I meant run-time as in during regcomp, not run-time as in during regexec (and of course these are POSIX names and your library has some other names for them). I tend to think of both regcomp and regexec as run-time because they both happen after the program starts executing, not when the program is compiled.

1 reply

BurntSushi Jul 6, 2023
Maintainer

> I meant run-time as in during regcomp

Ooooooohhhh, gotcha. Yup, this was the missing piece of the puzzle for me. And yeah, your conception of what runtime is makes sense given re2c's domain. :-)

Indeed, the lazy DFA builds itself during regexec. Only the NFA (and a couple of trivial sentinel DFA states) is built during regcomp. If one can do that sort of lazy construction with a TDFA, then that would be a potentially promising future approach for this crate.
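To make the regcomp/regexec distinction concrete, here is a minimal sketch of lazy determinization in Python. It is not code from this crate: the hand-written NFA table, the `LazyDFA` class and its method names are all hypothetical, and the sketch ignores practical concerns such as bounding the size of the transition cache. The "regcomp" step only stores the NFA and the start state; DFA states are computed by subset construction the first time the input reaches them.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical epsilon-free NFA for the anchored regex (a|b)*a(a|b)(a|b),
# i.e. "the third character from the end is an a". The fully determinized
# DFA for this regex has 8 states.
NFA: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
    (2, "a"): {3},
    (2, "b"): {3},
}
NFA_START = 0
NFA_ACCEPT = 3

DfaState = FrozenSet[int]  # a DFA state is a set of NFA states


class LazyDFA:
    def __init__(self) -> None:
        # "regcomp": nothing is determinized yet apart from the start state.
        self.start: DfaState = frozenset({NFA_START})
        self.cache: Dict[Tuple[DfaState, str], DfaState] = {}

    def step(self, state: DfaState, ch: str) -> DfaState:
        key = (state, ch)
        nxt = self.cache.get(key)
        if nxt is None:
            # Cache miss: do one step of subset construction and remember it.
            nxt = frozenset(t for s in state for t in NFA.get((s, ch), ()))
            self.cache[key] = nxt
        return nxt

    def is_match(self, text: str) -> bool:
        # "regexec": walk the input, building DFA states only as needed.
        state = self.start
        for ch in text:
            state = self.step(state, ch)
            if not state:
                return False  # dead state: no NFA state is alive
        return NFA_ACCEPT in state

    def states_built(self) -> int:
        return len({self.start} | set(self.cache.values()))


dfa = LazyDFA()
print(dfa.is_match("bbabb"), dfa.states_built())  # True 4
```

After matching `"bbabb"` only 4 of the 8 possible DFA states exist, and a second search over similar input reuses the cached transitions instead of recomputing them. A lazy TDFA would additionally have to attach tag operations to each transition as it is created, which this sketch does not attempt.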

skvadrik · 2023-07-06T16:48:37Z

Author

> If one can do that sort of lazy construction with a TDFA, then that would be a potentially promising future approach for this crate.

For the record, yes I think it's easy to adapt either TDFA or multipass-TDFA to lazy determinization (multipass-TDFA being much more suited to this setting in my opinion). Should anyone experiment with this, feel free to ping me for details.
