RFC: Reduce lexer to a regular language #2755

Closed
graydon opened this issue Jun 29, 2012 · 5 comments
Labels
A-frontend Area: Compiler frontend (errors, parsing and HIR) A-grammar Area: The grammar of Rust A-syntaxext Area: Syntax extensions C-cleanup Category: PRs that clean code up or issues documenting cleanup. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
Comments

@graydon
Contributor

graydon commented Jun 29, 2012

The lexer presently affords one real and one planned form of recursive token. These mean that our "tokens" are not actually describable by a regular language. We discussed this at some length on IRC today and came up with solutions for both cases, so I would like to reduce the lexer back down to "just regular".

The cases are:

  • Balanced block-comment delimiters. That is, the lexer switches mode when it sees an opening /* and then consumes a balanced set of possibly-nested /* and */ pairs. These exist for only one reason: to be able to comment out a region of a file that already contains a comment. The solution we arrived at is to separate the problems of "commenting in order to write some non-Rust text, such as docs" and "commenting in order to disable code". For the former case, we'll keep non-balanced block comments (described by a shortest-match regexp), and for the latter case we'll introduce a syntax extension called #ignore(...) that just discards its token tree (including any block comments, which are just single tokens). The corner case is that you won't be able to comment out blocks that contain a mixture of other block comments and random non-token lexemes, but that's far less common and (imo) worth sacrificing.
  • Lexeme-balanced syntax extensions. This is a touchier topic, as I have long maintained that I want Rust to support custom (marked) lexemes via automatic balancing of bracket-shaped delimiters, much the way Perl's q{...} brackets do. Thinking about this in the cold light of the question "is it enough of a feature to require the lexer to be non-regular?", though, I have to say no. Python-like raw strings are probably adequate -- or possibly q{...} quotes without automatic balancing -- and there's nothing really stopping a syntax extension from picking apart a string-literal token provided this way. I no longer think it's worth the complexity cost.
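To make the first case concrete, here is a minimal sketch contrasting the two ways of scanning a block comment. The function names `scan_flat_comment` and `scan_nested_comment` are hypothetical and are not taken from the Rust lexer. The flat scanner stops at the first `*/` and needs only a fixed number of states, so a shortest-match regexp can describe it. The nested scanner has to carry a depth counter that can grow without bound, which is what takes it outside the regular languages.

```rust
/// Shortest-match scan of a non-nested block comment starting at byte 0.
/// Returns the byte offset just past the first "*/", or None if `src`
/// does not start with "/*" or the comment is unterminated.
/// (Hypothetical helper, for illustration only.)
fn scan_flat_comment(src: &str) -> Option<usize> {
    let bytes = src.as_bytes();
    if !src.starts_with("/*") {
        return None;
    }
    let mut i = 2;
    while i + 1 < bytes.len() {
        if bytes[i] == b'*' && bytes[i + 1] == b'/' {
            return Some(i + 2); // end at the first closer, ignoring nesting
        }
        i += 1;
    }
    None
}

/// Balanced scan of a possibly-nested block comment starting at byte 0.
/// The `depth` counter is the non-regular part: no fixed set of states
/// can track arbitrarily deep nesting.
/// (Hypothetical helper, for illustration only.)
fn scan_nested_comment(src: &str) -> Option<usize> {
    let bytes = src.as_bytes();
    if !src.starts_with("/*") {
        return None;
    }
    let mut depth = 1usize;
    let mut i = 2;
    while i + 1 < bytes.len() {
        if bytes[i] == b'/' && bytes[i + 1] == b'*' {
            depth += 1;
            i += 2;
        } else if bytes[i] == b'*' && bytes[i + 1] == b'/' {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return Some(i); // matching closer for the outermost opener
            }
        } else {
            i += 1;
        }
    }
    None
}
```

On the input `/* a /* b */ c */ x`, the flat scanner ends the comment after the inner `*/`, leaving ` c */ x` to be lexed as ordinary tokens, while the nested scanner consumes everything up to the outer `*/`. That difference is the corner case given up in exchange for a regular lexer.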

So given that, it should take only a couple of patches to the lexer to get it back under the "regular" threshold, and at that point we could possibly drop in an actual regexp definition of our tokens, either binding to an existing regexp engine or writing our own; I don't care which. It should be a linear-time one in any case, something like http://code.google.com/p/re2/ or a clone if you feel like doing the exercise in Rust.
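As a rough illustration of what "just regular" buys, the sketch below is a hand-written single-pass lexer for a tiny, made-up token set. The `Kind` enum, the `lex` function, and the token classes are assumptions for the example, not Rust's actual token definitions, and a real implementation would more likely generate this from regexp definitions. Each arm corresponds to a regular expression, no arm keeps a counter or stack, and every byte is examined a bounded number of times, so the scan is linear in the input length.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Ident,   // [A-Za-z_][A-Za-z0-9_]*
    Int,     // [0-9]+
    Comment, // "/*" up to the first "*/" (shortest match, no nesting)
    Punct,   // any other single character
}

/// Tokenize `src` in one left-to-right pass, skipping ASCII whitespace.
/// (Hypothetical token set, for illustration only.)
fn lex(src: &str) -> Vec<(Kind, &str)> {
    let b = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < b.len() {
        let start = i;
        let kind = match b[i] {
            c if c.is_ascii_whitespace() => {
                i += 1;
                continue;
            }
            c if c.is_ascii_alphabetic() || c == b'_' => {
                while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') {
                    i += 1;
                }
                Kind::Ident
            }
            c if c.is_ascii_digit() => {
                while i < b.len() && b[i].is_ascii_digit() {
                    i += 1;
                }
                Kind::Int
            }
            b'/' if b.get(i + 1) == Some(&b'*') => {
                i += 2;
                // Stop at the first "*/"; an unterminated comment runs to end of input.
                while i < b.len() {
                    if b[i] == b'*' && b.get(i + 1) == Some(&b'/') {
                        i += 2;
                        break;
                    }
                    i += 1;
                }
                Kind::Comment
            }
            _ => {
                // Advance by one whole character so slicing stays on a UTF-8 boundary.
                i += src[i..].chars().next().map_or(1, |c| c.len_utf8());
                Kind::Punct
            }
        };
        out.push((kind, &src[start..i]));
    }
    out
}
```

For example, `lex("x = 42 /* a /* b */ c")` yields an identifier, a punctuation token, an integer, one comment token ending at the first `*/`, and then the identifier `c`.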

@ghost ghost assigned paulstansifer Jun 29, 2012
@marijnh
Contributor

marijnh commented Jun 29, 2012

Could someone point to an actual problem with non-regular lexers? Reducing syntax user-friendliness in order to fit (1970-era) formalization approaches is a bit of a pet peeve of mine.

@graydon
Contributor Author

graydon commented Jun 29, 2012

Syntax modes in editors. Often written in the 70s.

Also any time someone wants to write a one-off tool that processes it as a token stream.

@catamorphism
Contributor

I know block comments have been removed -- can this be closed?

@paulstansifer
Contributor

@catamorphism I don't think there are any other obstacles; lexeme-balanced syntax extensions are not implemented and I believe the consensus was that there wasn't enough of a justification to desire them as a feature.

@catamorphism
Contributor

@paulstansifer Great, sounds like we can close this.

@graydon graydon removed their assignment Jun 16, 2014
RalfJung pushed a commit to RalfJung/rust that referenced this issue Jan 9, 2023
…lfJung

add dtors_in_dtors_in_dtors

That's a pretty neat test from the standard library. Sadly not enough to check for rust-lang/miri#2754, but still worth having here.
celinval pushed a commit to celinval/rust-dev that referenced this issue Jun 4, 2024
4 participants