
Allow lexers to provide more user-friendly errors #104

Closed
brendanzab opened this issue Apr 4, 2020 · 19 comments · Fixed by #273
Labels
enhancement New feature or request
Milestone

Comments

@brendanzab

At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning Token::Error? It would be really nice to have an example of this too!

@maciejhirsz
Owner

What sort of interface would you like to see?

You can use range (maybe we should rename it to span?) to get the position of the error, and slice to get the exact subslice that produced the error, as with every other token. Errors are guaranteed to be at least 1 byte long so that the lexer never produces an infinite loop (advance always, well, advances).

Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user. Having error be just another token has the nice benefit in that you can treat errors just as any other unexpected token when writing a parser to AST, which makes it much easier to write, and it tends to have a nice performance profile since you don't have to do special case branching for error handling.
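A minimal sketch of that point in plain Rust (no logos involved; the `Token` and `Event` enums are purely illustrative): when the error is just another token variant, the parser's match handles it like any other unexpected token, with no separate error-handling branch.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Plus,
    Error,
}

#[derive(Debug, PartialEq)]
enum Event {
    Value(i64),
    Op,
    Unexpected, // recorded and skipped; parsing continues
}

fn parse(tokens: &[Token]) -> Vec<Event> {
    tokens
        .iter()
        .map(|tok| match tok {
            Token::Number(n) => Event::Value(*n),
            Token::Plus => Event::Op,
            // the error token flows through the same match,
            // no special-case branching needed
            Token::Error => Event::Unexpected,
        })
        .collect()
}

fn main() {
    let events = parse(&[Token::Number(1), Token::Error, Token::Number(2)]);
    assert_eq!(events, vec![Event::Value(1), Event::Unexpected, Event::Value(2)]);
}
```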

@brendanzab
Copy link
Author

brendanzab commented Apr 7, 2020

Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user.

Yeah, this is why I mentioned 'with error recovery' - I didn't see this mentioned in the docs. If Logos supports this then that's great!

I more want better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library, codespan-reporting.

At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69

In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174

@maciejhirsz
Owner

maciejhirsz commented Apr 7, 2020

Ah, gotcha: so what you're missing is which token Logos was trying to produce when it failed.

I'll have to think how to do that best, especially now that stateful enums have landed.

Speaking of, with 0.11 you can implement Logos directly on your Token<'a>:

#[derive(Logos, Debug, PartialEq)]
enum Token<'a> {
    #[error]
    Error,

    #[regex(r"[a-zA-Z]+", |lex| lex.slice())]
    Text(&'a str),
}

And there is also a spanned iterator now, which should also play nicer with what you are doing there:

logos/tests/tests/simple.rs

Lines 228 to 235 in 4005d70

let tokens: Vec<_> = Token::lexer("pri priv priva private").spanned().collect();

assert_eq!(tokens, &[
    (Token::Identifier, 0..3),
    (Token::Priv, 4..8),
    (Token::Identifier, 9..14),
    (Token::Private, 15..22),
]);

@brendanzab
Author

Oh super nice, thanks! 😍

@bee-bee-eeb

I'd also love to see this. Just letting the error variant be able to store something like an enum would be great.

We could use the Err variant of a Result from a parsing function and just store that in the error.

Example:

enum Parts<'a> {
    #[regex(".+", handle_text)]
    Text(&'a str),

    #[error]
    Error(ParsingError<'a>),
}

enum ParsingError<'a> {
    InvalidCharacter(&'a str)
}

fn handle_text<'a>(lex: &mut Lexer<Parts<'a>>) -> Result<&'a str, ParsingError<'a>> {
    let text = lex.slice();
    if text.contains("§") {
        return Err(ParsingError::InvalidCharacter("The character § can't be in the text!"));
    }

    Ok(text)
}

The lifetimes may be messed up a bit, but I think you can understand what it is supposed to do.

@maciejhirsz
Owner

maciejhirsz commented Apr 20, 2020

@nnt0 yeah, I think I can make that work. There needs to be some way to create ParsingError in this example for lexing errors (as opposed to callback errors).

Two options off the top of my head:

  1. Require a trait bound From<&'source str>, so the default error can be built from the slice of the invalid token.
  2. Require a trait bound FromLexer<'source, str> (might need a different signature) that takes &mut Lexer<'source, Parts<'source>>, so you can do whatever.

Option 1 would be easier to handle, while option 2 would be more flexible.
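A hypothetical sketch of option 1 in plain Rust (the names are illustrative, not the actual logos API): the error type implements `From<&str>`, so the derive could build a default error from the offending slice.

```rust
#[derive(Debug, PartialEq)]
enum LexError {
    UnexpectedInput(String),
}

impl From<&str> for LexError {
    fn from(slice: &str) -> Self {
        // keep the exact text that failed to lex
        LexError::UnexpectedInput(slice.to_owned())
    }
}

fn main() {
    // what generated code could do on a lexing (not callback) error
    let err = LexError::from("§");
    assert_eq!(err, LexError::UnexpectedInput("§".to_owned()));
}
```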

@maciejhirsz maciejhirsz added this to the 0.11.1 milestone Apr 20, 2020
@maciejhirsz maciejhirsz added the enhancement New feature or request label Apr 20, 2020
@bee-bee-eeb

bee-bee-eeb commented Apr 20, 2020

@maciejhirsz I think I'd prefer option 2 here because you'd have access to Span. That way it's possible to report exactly where the error occurred. With this you could just jump to the index instead of searching through the text.

Of course there are more benefits, but that's just what I thought about first.

I also thought about making this opt-in for the folks that just want Error and not Error(ParsingError). That would also prevent people from having to change their code when updating. But I'm not sure how to implement that.

@maciejhirsz
Owner

Oh yeah, this will only do anything if you actually specify a field in the error variant.

I reckon having #[error] Error(&'a str) might be a common enough case, but that can be done by requiring the error field type to implement From<&'source str>, which &'source str satisfies trivially (as do String, Box<str>, Cow<'source, str>, etc.).

Callbacks that return a Result<_, E> would have to satisfy a bound E: Into<T> where T is the field of the error variant (again, only if one was provided).
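A sketch of that E: Into&lt;T&gt; bound in plain Rust (all names here are hypothetical, not logos API): a callback's error type converts into the field type of the error variant.

```rust
#[derive(Debug, PartialEq)]
struct LexError(String);

#[derive(Debug, PartialEq)]
struct CallbackError(&'static str);

impl From<CallbackError> for LexError {
    fn from(e: CallbackError) -> Self {
        LexError(e.0.to_owned())
    }
}

// What generated code could do with a failing callback result:
// convert E into the error variant's field type T via the bound.
fn into_error_field<T, E: Into<T>>(res: Result<(), E>) -> Option<T> {
    res.err().map(Into::into)
}

fn main() {
    let got: Option<LexError> = into_error_field(Err(CallbackError("bad digit")));
    assert_eq!(got, Some(LexError("bad digit".to_owned())));
}
```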

@ratmice
Contributor

ratmice commented Apr 21, 2020

One question which this brings up is whether there is a thought to expand #[error] to accept a callback like #[error(|lex|...)] and if that callback could be accessed from lex within handle_text above?

That would at least seem somewhat consistent with how regex Tokens manage construction with parameters.

@maciejhirsz
Owner

@ratmice ye, that would actually be a good way to supply the default constructor for the error type, might be better than using a trait.

@maciejhirsz
Owner

maciejhirsz commented Apr 26, 2020

Following up the discussion from #135, and expanding on the suggestion above here, it might be useful to have an error callback for regexes that returns error type:

#[derive(Logos)]
enum Token {
    #[regex("some_complex_regex", error = |_| MyErrorType::InvalidComplexMatch)]
    ComplexMatch,

    #[error(|_| MyErrorType::default())]
    Error(MyErrorType),
}

@Alxandr

Alxandr commented Apr 26, 2020

I suggest being able to add some special token to the regexes that triggers an error branch. Something like #[regex(r"foobar(bat|\!)")]. This is entirely off the top of my head, but the point here is that there are cases where you can say "if you've reached this point, I know you were trying to write a number, but you did it wrong".

@maciejhirsz
Owner

maciejhirsz commented Apr 26, 2020

I suggest being able to add some special token to the regexes that triggers an error branch.

You only need that in places where lexer can backtrack, which can only happen inside a (...)? or (...)* group. I just did some more reading on regex, and there is actually already syntax that we can use for this - atomic groups:

#[regex(r"[0-9]+(?>\.[0-9])?")]

This would make it so that for input 123.foo once 123. is read, not matching a digit on the next byte (f) would produce an error instead of backtracking and returning a match on 123.

I believe regex-syntax already supports the atomic group flag, I just ignore it due to, well, ignorance :)

@Alxandr

Alxandr commented Apr 26, 2020

An important thing to note, though, is that if you just turn this into the #[error] variant, I would have to parse it anew myself to figure out which (of all my known error cases) actually happened. I.e., you're throwing away information I need.

@maciejhirsz
Owner

This is assuming stateful error variant, so you can use callbacks to put the required state inside it.

That said, it also might make sense to get rid of #[error] variant altogether, similarly to how #[end] got removed once we changed things into an iterator.

With a C-style enum Token, and 0-size LexerError default, Option<Result<Token, LexerError>> is still one byte only. Potential syntax for custom errors could then look as follows:

#[derive(Logos)]
#[logos(error = MyError)]
enum Token {
    #[regex(r"[0-9]+(?>\.[0-9]+)?", error = |_| MyError::JunkAfterPeriod)]
    Number,
}

This would be way more idiomatic, and would allow collecting the iterator into a Result like so:

let tokens: Vec<Token> = Token::lexer("3.14").collect()?;
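The proposed collection behaviour can be sketched with plain std types: an iterator of Result items collects into a single Result, so one `?` surfaces the first error. The item types below stand in for whatever the derive would generate.

```rust
fn main() {
    // all tokens lex fine: collect yields Ok with the full vector
    let ok: Vec<Result<i32, String>> = vec![Ok(3), Ok(14)];
    let collected: Result<Vec<i32>, String> = ok.into_iter().collect();
    assert_eq!(collected, Ok(vec![3, 14]));

    // one token fails: collect short-circuits to the first Err
    let bad: Vec<Result<i32, String>> = vec![Ok(3), Err("junk after period".to_owned())];
    let collected: Result<Vec<i32>, String> = bad.into_iter().collect();
    assert_eq!(collected, Err("junk after period".to_owned()));
}
```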

@Alxandr

Alxandr commented Apr 26, 2020

How would you deal with two errors in the same case? For instance, the number rule I was converting dealt with both junk after a period and junk after an e (in 1e!, for instance, the ! is the error). Just split them up into multiple regexes on the same variant, but with different error statements?

@maciejhirsz
Owner

Just split them up into multiple regexes on the same variant, but with different error statements?

That, or read the lex.slice() in the error callback and emit a different error accordingly. You might have to do some manual parsing there, but that should be fine in the error path.

@Alxandr

Alxandr commented Apr 26, 2020

I don't really agree on the "but that should be fine in error path", because logos has already determined that we arrived at this error path. You're essentially asking users of the library to redo the work that the library just did. You should be able to get some information out from the Lexer of why you arrived in this error callback. Maybe some index of which atomic group failed for instance (and at what index in the source)?

@maciejhirsz
Owner

maciejhirsz commented Apr 26, 2020

I don't really agree on the "but that should be fine in error path", because logos has already determined that we arrived at this error path.

Let me rephrase that: it's fine to sacrifice error path performance for happy path performance, because error path is by definition rare.

You should be able to get some information out from the Lexer of why you arrived in this error callback.

Except Lexer has, by design, absolutely no context of why it's doing anything it's doing. All it knows is at what index it started lexing a token, and at what index it currently is. Everything else is just munching through CPU instructions compiled from a state machine graph compiled from token definitions.

The options are:

  • Track extra information at runtime, which means sacrificing happy path performance for error path performance, which is almost certainly the wrong approach.
  • Inject some extra context into the callback statically at compile time (index of the atomic group would be possible), which is better, but it means that every callback has to accept two parameters now (LLVM will optimize it away if it's not used, but it might be awkward on the API).
  • One match expression:
    match lex.slice().bytes().next_back() {
        Some(b'.') => Error::JunkAfterPeriod,
        Some(b'e') | Some(b'E') => Error::JunkAfterExponent,
        _ => Error::default(),
    }
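A runnable version of that slice-inspection idea in plain Rust (the `NumberError` enum is illustrative, not part of logos): the last byte of the matched slice tells us which branch of the regex failed.

```rust
#[derive(Debug, PartialEq)]
enum NumberError {
    Unknown,
    JunkAfterPeriod,
    JunkAfterExponent,
}

impl Default for NumberError {
    fn default() -> Self {
        NumberError::Unknown
    }
}

// Classify a failed number match by the byte that ended the slice.
fn classify(slice: &str) -> NumberError {
    match slice.bytes().next_back() {
        Some(b'.') => NumberError::JunkAfterPeriod,
        Some(b'e') | Some(b'E') => NumberError::JunkAfterExponent,
        _ => NumberError::default(),
    }
}

fn main() {
    assert_eq!(classify("123."), NumberError::JunkAfterPeriod);
    assert_eq!(classify("1e"), NumberError::JunkAfterExponent);
    assert_eq!(classify("abc"), NumberError::Unknown);
}
```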
