Replies: 5 comments 2 replies
-
Hiya! Big fan of your re2c work. :-) I've been familiar with it for a while and I read your original paper on TDFAs a few years ago. I actually mentioned tagged DFAs in the blog, but you can be forgiven for missing it given the length of the blog haha. I don't have the TDFA inner workings paged into cache, but the main issue is that I can't really use fully compiled DFAs (nevermind TDFAs) in this crate. The blog outlines some very limited cases where they're used, but for example, they aren't even used by default. You have to opt into it. So it's possible TDFAs could slot into there somehow, but it would be very limited in its applicability given that I try to keep a reasonable limit on regex compile times and heap usage. Now, if a TDFAs can be lazily built just like the lazy DFA in this crate, then that could be a different story. It almost looks like your 2022 paper describes exactly this (under the "JIT" vernacular), but I haven't read that paper yet. Popping up a level, I just spent 3 years working on regex-automata, so I'm not likely to dig into TDFAs any time soon. I have other projects I want to work on, and as far as regex engines go, I'm most interested in exploring Glushkov automata and bit-parallel NFAs. Specifically targeting the multi-pattern use case. TDFAs probably aren't going to help, because the multi-pattern problem is one of scale and DFAs don't scale. |
Beta Was this translation helpful? Give feedback.
-
Oh sorry, I should learn to read. :)
Sure, using TDFA would only make sense if DFAs were used much.
It's not really lazy TDFA construction (so don't waste your time decrypting the paper). I think it would be possible to adapt multi-pass TDFA to lazy construction, but what the paper means by "JIT" is doing determinization at run-time (which is always the case in a regexp library, as opposed to a lexer generator like RE2C that does determinization at compile time).
Totally understand! Exploring Glushkov automata sounds interesting. My personal favorite read on this subject is https://cs.nyu.edu/~mohri/postscript/glush.pdf. Thanks for your reply! |
Beta Was this translation helpful? Give feedback.
-
The algorithms I tested in the paper for TDFA and multipass-TDFA work as your "full DFA", in that they construct the full TDFA at |
Beta Was this translation helpful? Give feedback.
-
To clarify, when I wrote:
I meant run-time as in during |
Beta Was this translation helpful? Give feedback.
-
For the record, yes I think it's easy to adapt either TDFA or multipass-TDFA to lazy determinization (multipass-TDFA being much more suited to this setting in my opinion). Should anyone experiment with this, feel free to ping me for details. |
Beta Was this translation helpful? Give feedback.
-
Hi @BurntSushi ! I read your interesting blog post https://blog.burntsushi.net/regex-internals, and I thought I'd mention another possible future optimization here: you can have fast submatch extraction on DFA not only on NFA. The algorithm is called TDFA (Tagged DFA, originally invented by Laurikari) and it is described on wikipedia and more rigorously in the paper. The good thing about TDFA is that, as the number of submatch groups approaches zero, TDFA degenerates into a simple DFA without any added overhead (so you can use it in place of the DFA algoritm rather than a new one). Apologies if you are already familiar with this.
Beta Was this translation helpful? Give feedback.
All reactions