Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of (negative) lookahead (somehow) #14

Closed
eddyb opened this issue Aug 21, 2018 · 2 comments · Fixed by #113
Closed

Get rid of (negative) lookahead (somehow) #14

eddyb opened this issue Aug 21, 2018 · 2 comments · Fixed by #113

Comments

@eddyb
Copy link
Member

eddyb commented Aug 21, 2018

  • used right now for lexing, e.g. ensuring that an identifier ([a-zA-Z][a-zA-Z0-9]* in regex) is not followed by more identifier characters ((?![a-zA-Z0-9]) in some regex dialects), forcing the rule to match the longest valid identifier (just like most regex semantics out there)
  • it makes the grammar context-sensitive (when it maybe doesn't need to)
  • incremental reparsing (Incremental reparsing #12) would need to track larger input ranges than the SPPF uses
  • bidirectional parsing (Bidirectional parsing #13) would prefer symmetrical lookbehind, which worsens the problem
    • either both direction checks both lookbehind and lookahead, or we do an analysis to determine that one or both is unnecessary given its contexts
  • we could transform what the user writes as look{ahead,behind} into "static context-sensitivity", propagating it up and specializing callers, generating a (much?) larger but context-free grammar, which only accesses input within the success range of each rule invocation
@eddyb eddyb added the help wanted Extra attention is needed label Aug 21, 2018
@eddyb
Copy link
Member Author

eddyb commented Aug 25, 2018

Found a solution while talking to @eternaleye, and it involves #27:

// Existing Rust grammar:
ForLoop = "for" Pat "in" Expr Block;
// Scannerless CFG:
ForLoop = "for" {
    Pat(/*allow_ident_left=*/"0", /*allow_ident_right=*/"0") |
    // NB: WS is *mandatory* whitespace here:
    WS Pat(/*allow_ident_left=*/"1", /*allow_ident_right=*/"0") |
    Pat(/*allow_ident_left=*/"0", /*allow_ident_right=*/"1") WS |
    WS Pat(/*allow_ident_left=*/"1", /*allow_ident_right=*/"1") WS
} "in" {
    Expr(/*allow_ident_left=*/"0", /*allow_ident_right=*/"1") |
    WS Pat(/*allow_ident_left=*/"1", /*allow_ident_right=*/"1")
} WS? Block;

Note that this is a first approximation and we could find some sugar for it, perhaps, especially to avoid having to write WS? everywhere else in the grammar.

One possibility could be "for" Pat(WS?, WS?) "in" Expr(WS?, "1") Block. That is, writing WS? explicitly removes the implicit WS? and expands the parametric/macro rule invocation to pass "0" or "1 depending on whether there is whitespace to the left (or right, for the second WS?).

EDIT: We probably don't need that WS? trick (it doesn't work for "as" anyway), and instead we could use e.g. 4 rules, named ExprI{L,}{R,} (or, alternatively, Expr{{After,Before,Between}KW,}). Not sure how to control whitespace insertion though.
Maybe any rule that can start/end with explicit whitespace doesn't get whitespace also inserted before/after it at its use sites?

@eddyb eddyb removed the help wanted Extra attention is needed label Aug 25, 2018
@eddyb
Copy link
Member Author

eddyb commented Sep 11, 2018

While we might be able to implement the machinery to allow writing the explicit version sooner, it's becoming increasingly clear how unergonomic it would be to write such a grammar.

A good middle-ground could be taking lookaround and propagating it up the grammar, and require that it cleanly "intersects" with existing terminals and "dissolves" away, leaving a CFG behind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant