
Allow lexers to provide more user-friendly errors #104

Closed
brendanzab opened this issue Apr 4, 2020 · 19 comments · Fixed by #273
Labels
enhancement New feature or request
Milestone

Comments

@brendanzab

At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning Token::Error? It would be really nice to have an example of this too!

@maciejhirsz
Owner

What sort of interface would you like to see?

You can use range (maybe we should rename it to span?) to get the position of the error, and slice to get the exact subslice that produced the error, as with every other token. Errors are guaranteed to be at least 1 byte long so that the lexer never produces an infinite loop (advance always, well, advances).

Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user. Having error be just another token has the nice benefit in that you can treat errors just as any other unexpected token when writing a parser to AST, which makes it much easier to write, and it tends to have a nice performance profile since you don't have to do special case branching for error handling.
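A minimal sketch of that point in plain Rust (no logos involved; the `Token` and `Event` enums are purely illustrative): when the error is just another token variant, the parser's match handles it like any other unexpected token, with no separate error-handling branch.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Plus,
    Error,
}

#[derive(Debug, PartialEq)]
enum Event {
    Value(i64),
    Op,
    Unexpected, // recorded and skipped; parsing continues
}

fn parse(tokens: &[Token]) -> Vec<Event> {
    tokens
        .iter()
        .map(|tok| match tok {
            Token::Number(n) => Event::Value(*n),
            Token::Plus => Event::Op,
            // the error token flows through the same match,
            // no special-case branching needed
            Token::Error => Event::Unexpected,
        })
        .collect()
}

fn main() {
    let events = parse(&[Token::Number(1), Token::Error, Token::Number(2)]);
    assert_eq!(events, vec![Event::Value(1), Event::Unexpected, Event::Value(2)]);
}
```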

@brendanzab
Copy link
Author

brendanzab commented Apr 7, 2020

Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user.

Yeah, this is why I mentioned 'with error recovery' - I didn't see this mentioned in the docs. If Logos supports this then that's great!

I more want better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library, codespan-reporting.

At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69

In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174

@maciejhirsz
Owner

maciejhirsz commented Apr 7, 2020

Ah, gotcha: so what you're missing is which token Logos was trying to produce when it failed.

I'll have to think how to do that best, especially now that stateful enums have landed.

Speaking of, with 0.11 you can implement Logos directly on your Token<'a>:

#[derive(Logos, Debug, PartialEq)]
enum Token<'a> {
    #[error]
    Error,

    #[regex(r"[a-zA-Z]+", |lex| lex.slice())]
    Text(&'a str),
}

And there is also a spanned iterator now, which should also play nicer with what you are doing there:

logos/tests/tests/simple.rs

Lines 228 to 235 in 4005d70

let tokens: Vec<_> = Token::lexer("pri priv priva private").spanned().collect();

assert_eq!(tokens, &[
    (Token::Identifier, 0..3),
    (Token::Priv, 4..8),
    (Token::Identifier, 9..14),
    (Token::Private, 15..22),
]);

@brendanzab
Author

Oh super nice, thanks! 😍

@bee-bee-eeb

I'd also love to see this. Just letting the error variant be able to store something like an enum would be great.

We could use the Err variant of a Result from a parsing function and just store that in the error.

Example:

enum Parts<'a> {
    #[regex(".+", handle_text)]
    Text(&'a str),

    #[error]
    Error(ParsingError<'a>),
}

enum ParsingError<'a> {
    InvalidCharacter(&'a str)
}

fn handle_text<'a>(lex: &mut Lexer<Parts<'a>>) -> Result<&'a str, ParsingError<'a>> {
    let text = lex.slice();
    if text.contains("§") {
        return Err(ParsingError::InvalidCharacter("The character § can't be in the text!"));
    }

    Ok(text)
}

The lifetimes may be messed up a bit, but I think you can understand what it is supposed to do.

@maciejhirsz
Owner

maciejhirsz commented Apr 20, 2020

@nnt0 yeah, I think I can make that work. There needs to be some way to create ParsingError in this example for lexing errors (as opposed to callback errors).

Two options off the top of my head:

  1. Require a trait bound From<&'source str>, so the default error can be built from the slice of the invalid token.
  2. Require a trait bound FromLexer<'source, str> (might need a different signature) that takes &mut Lexer<'source, Parts<'source>>, so you can do whatever.

Option 1 would be easier to handle, while option 2 would be more flexible.
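A hypothetical sketch of option 1 in plain Rust (the names are illustrative, not the actual logos API): the error type implements `From<&str>`, so the derive could build a default error from the offending slice.

```rust
#[derive(Debug, PartialEq)]
enum LexError {
    UnexpectedInput(String),
}

impl From<&str> for LexError {
    fn from(slice: &str) -> Self {
        // keep the exact text that failed to lex
        LexError::UnexpectedInput(slice.to_owned())
    }
}

fn main() {
    // what generated code could do on a lexing (not callback) error
    let err = LexError::from("§");
    assert_eq!(err, LexError::UnexpectedInput("§".to_owned()));
}
```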

@maciejhirsz maciejhirsz added this to the 0.11.1 milestone Apr 20, 2020
@maciejhirsz maciejhirsz added the enhancement New feature or request label Apr 20, 2020
@bee-bee-eeb

bee-bee-eeb commented Apr 20, 2020

@maciejhirsz I think I'd prefer option 2 here because you'd have access to Span. That way it's possible to report exactly where the error occurred. With this you could just jump to the index instead of searching through the text.

Of course there are more benefits, but that's just what I thought about first.

I also thought about making this opt-in for the folks that just want Error and not Error(ParsingError). That would also prevent people from having to change their code when updating. But I'm not sure how to implement that.

@maciejhirsz
Owner

Oh yeah, this will only do anything if you actually specify a field in the error variant.

I reckon having #[error] Error(&'a str) might be a common enough case, but that can be done by requiring the error field type to implement From<&'source str>, which &'source str satisfies trivially (as do String, Box<str>, Cow<'source, str>, etc.).

Callbacks that return a Result<_, E> would have to satisfy a bound E: Into<T> where T is the field of the error variant (again, only if one was provided).
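A sketch of that E: Into&lt;T&gt; bound in plain Rust (all names here are hypothetical, not logos API): a callback's error type converts into the field type of the error variant.

```rust
#[derive(Debug, PartialEq)]
struct LexError(String);

#[derive(Debug, PartialEq)]
struct CallbackError(&'static str);

impl From<CallbackError> for LexError {
    fn from(e: CallbackError) -> Self {
        LexError(e.0.to_owned())
    }
}

// What generated code could do with a failing callback result:
// convert E into the error variant's field type T via the bound.
fn into_error_field<T, E: Into<T>>(res: Result<(), E>) -> Option<T> {
    res.err().map(Into::into)
}

fn main() {
    let got: Option<LexError> = into_error_field(Err(CallbackError("bad digit")));
    assert_eq!(got, Some(LexError("bad digit".to_owned())));
}
```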

@ratmice
Contributor

ratmice commented Apr 21, 2020

One question which this brings up is whether there is a thought to expand #[error] to accept a callback like #[error(|lex|...)] and if that callback could be accessed from lex within handle_text above?

That would at least seem somewhat consistent with how regex Tokens manage construction with parameters.

@maciejhirsz
Owner

@ratmice ye, that would actually be a good way to supply the default constructor for the error type, might be better than using a trait.

@maciejhirsz
Owner

maciejhirsz commented Apr 26, 2020

Following up the discussion from #135, and expanding on the suggestion above here, it might be useful to have an error callback for regexes that returns error type:

#[derive(Logos)]
enum Token {
    #[regex("some_complex_regex", error = |_| MyErrorType::InvalidComplexMatch)]
    ComplexMatch,

    #[error(|_| MyErrorType::default())]
    Error(MyErrorType),
}

@Alxandr

Alxandr commented Apr 26, 2020

I suggest being able to add some special token to the regexes that triggers an error branch. Something like #[regex(r"foobar(bat|\!)")]. This is entirely off the top of my head, but the point here is that there are cases where you can say "if you've reached this point, I know you were trying to write a number, but you did it wrong".

@maciejhirsz
Owner

maciejhirsz commented Apr 26, 2020

I suggest being able to add some special token to the regexes that triggers an error branch.

You only need that in places where lexer can backtrack, which can only happen inside a (...)? or (...)* group. I just did some more reading on regex, and there is actually already syntax that we can use for this - atomic groups:

#[regex(r"[0-9]+(?>\.[0-9])?")]

This would make it so that for input 123.foo once 123. is read, not matching a digit on the next byte (f) would produce an error instead of backtracking and returning a match on 123.

I believe regex-syntax already supports the atomic group flag, I just ignore it due to, well, ignorance :)

@Alxandr

Alxandr commented Apr 26, 2020

An important thing to note, though, is that if you just turn this into the #[error] variant, I would have to parse it anew myself to figure out which (of all my known error cases) actually happened. I.e., you're throwing away information I need.

@maciejhirsz
Owner

This is assuming stateful error variant, so you can use callbacks to put the required state inside it.

That said, it also might make sense to get rid of #[error] variant altogether, similarly to how #[end] got removed once we changed things into an iterator.

With a C-style enum Token, and 0-size LexerError default, Option<Result<Token, LexerError>> is still one byte only. Potential syntax for custom errors could then look as follows:

#[derive(Logos)]
#[logos(error = MyError)]
enum Token {
    #[regex(r"[0-9]+(?>\.[0-9]+)?", error = |_| MyError::JunkAfterPeriod)]
    Number,
}

This would be way more idiomatic, and would allow collecting the iterator into a Result like so:

let tokens: Vec<Token> = Token::lexer("3.14").collect()?;
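The proposed collection behaviour can be sketched with plain std types: an iterator of Result items collects into a single Result, so one `?` surfaces the first error. The item types below stand in for whatever the derive would generate.

```rust
fn main() {
    // all tokens lex fine: collect yields Ok with the full vector
    let ok: Vec<Result<i32, String>> = vec![Ok(3), Ok(14)];
    let collected: Result<Vec<i32>, String> = ok.into_iter().collect();
    assert_eq!(collected, Ok(vec![3, 14]));

    // one token fails: collect short-circuits to the first Err
    let bad: Vec<Result<i32, String>> = vec![Ok(3), Err("junk after period".to_owned())];
    let collected: Result<Vec<i32>, String> = bad.into_iter().collect();
    assert_eq!(collected, Err("junk after period".to_owned()));
}
```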

@Alxandr

Alxandr commented Apr 26, 2020

How would you deal with two errors in the same case? For instance, the number rule I was converting dealt with both junk after a period and junk after an e (in 1e!, for instance, the ! is the error). Just split them up into multiple regexes on the same variant, but with different error statements?

@maciejhirsz
Owner

Just split them up into multiple regexes on the same variant, but with different error statements?

That, or read the lex.slice() in the error callback and emit a different error accordingly. You might have to do some manual parsing there, but that should be fine in the error path.

@Alxandr

Alxandr commented Apr 26, 2020

I don't really agree on the "but that should be fine in error path", because logos has already determined that we arrived at this error path. You're essentially asking users of the library to redo the work that the library just did. You should be able to get some information out from the Lexer of why you arrived in this error callback. Maybe some index of which atomic group failed for instance (and at what index in the source)?

@maciejhirsz
Owner

maciejhirsz commented Apr 26, 2020

I don't really agree on the "but that should be fine in error path", because logos has already determined that we arrived at this error path.

Let me rephrase that: it's fine to sacrifice error path performance for happy path performance, because error path is by definition rare.

You should be able to get some information out from the Lexer of why you arrived in this error callback.

Except Lexer has, by design, absolutely no context of why it's doing anything it's doing. All it knows is at what index it started lexing a token, and at what index it currently is. Everything else is just munching through CPU instructions compiled from a state machine graph compiled from token definitions.

The options are:

  • Track extra information at runtime, which means sacrificing happy path performance for error path performance, which is almost certainly the wrong approach.
  • Inject some extra context into the callback statically at compile time (index of the atomic group would be possible), which is better, but it means that every callback has to accept two parameters now (LLVM will optimize it away if it's not used, but it might be awkward on the API).
  • One match expression:
    match lex.slice().bytes().next_back() {
        Some(b'.') => Error::JunkAfterPeriod,
        Some(b'e') | Some(b'E') => Error::JunkAfterExponent,
        _ => Error::default(),
    }
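A runnable version of that slice-inspection idea in plain Rust (the `NumberError` enum is illustrative, not part of logos): the last byte of the matched slice tells us which branch of the regex failed.

```rust
#[derive(Debug, PartialEq)]
enum NumberError {
    Unknown,
    JunkAfterPeriod,
    JunkAfterExponent,
}

impl Default for NumberError {
    fn default() -> Self {
        NumberError::Unknown
    }
}

// Classify a failed number match by the byte that ended the slice.
fn classify(slice: &str) -> NumberError {
    match slice.bytes().next_back() {
        Some(b'.') => NumberError::JunkAfterPeriod,
        Some(b'e') | Some(b'E') => NumberError::JunkAfterExponent,
        _ => NumberError::default(),
    }
}

fn main() {
    assert_eq!(classify("123."), NumberError::JunkAfterPeriod);
    assert_eq!(classify("1e"), NumberError::JunkAfterExponent);
    assert_eq!(classify("abc"), NumberError::Unknown);
}
```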
