Allow lexers to provide more user-friendly errors #104
What sort of interface would you like to see? You can use Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user. Having error be just another token has the nice benefit in that you can treat errors just as any other unexpected token when writing a parser to AST, which makes it much easier to write, and it tends to have a nice performance profile since you don't have to do special case branching for error handling. |
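A minimal sketch of the "error is just another token" idea from this comment. It uses a small hand-written lexer rather than Logos itself, and the `Token` enum, `lex`, and `error_spans` are hypothetical names chosen for illustration. The point is only that lexing never stops at the first bad character, so all errors can be reported in one batch.

```rust
use std::ops::Range;

/// Hypothetical token type; `Error` is an ordinary variant like any other.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number,
    Plus,
    Error,
}

/// Lex the whole input without stopping at the first bad character.
fn lex(source: &str) -> Vec<(Token, Range<usize>)> {
    let mut tokens = Vec::new();
    let mut chars = source.char_indices().peekable();
    while let Some((start, c)) = chars.next() {
        if c.is_whitespace() {
            continue;
        }
        let token = if c.is_ascii_digit() {
            // Consume the rest of the digit run.
            while chars.peek().map_or(false, |&(_, d)| d.is_ascii_digit()) {
                chars.next();
            }
            Token::Number
        } else if c == '+' {
            Token::Plus
        } else {
            // Unknown input becomes a token instead of aborting the lexer.
            Token::Error
        };
        // The token ends where the next character starts (or at end of input).
        let end = chars.peek().map_or(source.len(), |&(i, _)| i);
        tokens.push((token, start..end));
    }
    tokens
}

/// Gather every error span so they can be reported together.
fn error_spans(tokens: &[(Token, Range<usize>)]) -> Vec<Range<usize>> {
    tokens
        .iter()
        .filter(|(t, _)| *t == Token::Error)
        .map(|(_, span)| span.clone())
        .collect()
}
```

A parser consuming this stream can treat `Token::Error` like any other unexpected token, which is the benefit described above.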
Yeah, this is why I mentioned 'with error recovery' - I didn't see this mentioned in the docs. If Logos supports this then that's great! I more want better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library. At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69 In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174 |
Ah, gotcha, so what you are missing is what was the token Logos was expecting to produce but failed. I'll have to think how to do that best, especially now that stateful enums have landed. Speaking of, with 0.11 you can implement logos/tests/tests/callbacks.rs Lines 8 to 14 in 4005d70
And there is also a spanned iterator now, which should also play nicer with what you are doing there: Lines 228 to 235 in 4005d70
|
Oh super nice, thanks! 😍 |
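As a rough illustration of why spans help with user-friendly errors: assuming each token comes paired with a byte range into the source (the helper names `line_col` and `describe_error` are hypothetical, not part of Logos), an opaque error token can at least be turned into a located message.

```rust
use std::ops::Range;

/// Convert a byte offset into a 1-based (line, column) pair.
/// Columns count characters, not bytes.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let before = &source[..offset];
    let line = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = before[line_start..].chars().count() + 1;
    (line, col)
}

/// Render a message for an error token found at `span`.
fn describe_error(source: &str, span: Range<usize>) -> String {
    let (line, col) = line_col(source, span.start);
    format!("{}:{}: unexpected input `{}`", line, col, &source[span])
}
```

This still does not say why the lexer failed, which is the gap the issue is about; it only shows what can be recovered from the span alone.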
I'd also love to see this. Just letting the error variant be able to store something like an enum would be great. We could use the Example:

    enum Parts<'a> {
        #[regex(".+", handle_text)]
        Text(&'a str),
        #[error]
        Error(ParsingError<'a>),
    }

    enum ParsingError<'a> {
        InvalidCharacter(&'a str),
    }

    fn handle_text<'a>(lex: &mut Lexer<Parts<'a>>) -> Result<&'a str, ParsingError<'a>> {
        let text = lex.slice();
        if text.contains("§") {
            return Err(ParsingError::InvalidCharacter("The character § can't be in the text!"));
        }
        Ok(text)
    }

The lifetimes may be messed up a bit but I think you can understand what it is supposed to do. |
@nnt0 yeah, I think I can make that work. There needs to be some way to create Two options just off the top of my head:
Option 1. would be easier to handle, while 2. would be more flexible. |
@maciejhirsz I think I'd prefer option 2 here because you'd have access to Of course there are more benefits but that's just what I thought about first. I also thought about making this opt-in for the folks that just want |
Oh yeah, this will only do anything if you actually specify a field in the error variant. I reckon having Callbacks that return a |
One question which this brings up is whether there is a thought to expand That would at least seem somewhat consistent with how |
@ratmice ye, that would actually be a good way to supply the default constructor for the error type, might be better than using a trait. |
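To make the trade-off between a trait and a callback concrete, here is a small sketch in plain Rust. It is not Logos's derive machinery; `MyError`, `error_from_trait`, and `error_from_callback` are hypothetical names, and the sketch assumes the default error is needed when no more specific callback applies.

```rust
/// Hypothetical structured error type.
#[derive(Debug, Default, PartialEq)]
enum MyError {
    #[default]
    Unknown,
    InvalidCharacter(char),
}

/// Trait-based: the error type supplies its own default value.
fn error_from_trait<E: Default>() -> E {
    E::default()
}

/// Callback-based: the caller builds the error and may inspect the
/// offending slice while doing so.
fn error_from_callback<E>(slice: &str, make: impl Fn(&str) -> E) -> E {
    make(slice)
}
```

The trait version needs no extra annotation on the token definition, while the callback version can carry context such as the first offending character.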
Following up the discussion from #135, and expanding on the suggestion above here, it might be useful to have an error callback for regexes that returns the error type:

    #[derive(Logos)]
    enum Token {
        #[regex("some_complex_regex", error = |_| MyErrorType::InvalidComplexMatch)]
        ComplexMatch,
        #[error(|_| MyErrorType::default())]
        Error(MyErrorType),
    }
|
I suggest being able to add some special token to the regexes that triggers an error branch. Something like |
You only need that in places where lexer can backtrack, which can only happen inside a #[regex(r"[0-9]+(?>\.[0-9])?")] This would make it so that for input I believe |
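The `(?>...)` group in the regex above is a proposal from this thread, not confirmed syntax. A hand-written equivalent of the intended behaviour might look like the sketch below, assuming that once the period has been consumed the lexer must not backtrack to the bare integer. `NumberError` and `lex_number` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum NumberError {
    JunkAfterPeriod,
}

/// Try to lex a number at the start of `input`.
/// Returns `None` if the input does not start with a digit, otherwise the
/// number of bytes consumed or a specific error.
fn lex_number(input: &str) -> Option<Result<usize, NumberError>> {
    let bytes = input.as_bytes();
    let int_len = bytes.iter().take_while(|b| b.is_ascii_digit()).count();
    if int_len == 0 {
        return None; // not a number at all
    }
    if bytes.get(int_len) != Some(&b'.') {
        return Some(Ok(int_len)); // plain integer
    }
    // The period has been consumed: from here on there is no backtracking.
    let frac_len = bytes[int_len + 1..]
        .iter()
        .take_while(|b| b.is_ascii_digit())
        .count();
    if frac_len == 0 {
        return Some(Err(NumberError::JunkAfterPeriod));
    }
    Some(Ok(int_len + 1 + frac_len))
}
```

A backtracking lexer would instead accept the integer part and fail on the period as a separate, uninformative error.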
An important thing to note though, is that if you just turn this into the |
This is assuming a stateful error variant, so you can use callbacks to put the required state inside it. That said, it also might make sense to get rid of With a C-style enum

    #[derive(Logos)]
    #[logos(error = MyError)]
    enum Token {
        #[regex(r"[0-9]+(?>\.[0-9]+)?", error = |_| MyError::JunkAfterPeriod)]
        Number,
    }

This would be way more idiomatic, and would allow collecting iterators like so:

    let tokens: Vec<Token> = Token::lexer("3.14").collect::<Result<_, _>>()?;
|
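The collecting behaviour mentioned here relies only on the standard library: any iterator of `Result<T, E>` items can be collected into `Result<Vec<T>, E>`. A small sketch with stand-in types (the `Token` and `MyError` enums and `collect_tokens` below are hypothetical, not generated by Logos):

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Number,
    Plus,
}

#[derive(Debug, PartialEq)]
enum MyError {
    JunkAfterPeriod,
}

/// Collect a stream of lexing results, stopping at the first error.
fn collect_tokens<I>(results: I) -> Result<Vec<Token>, MyError>
where
    I: IntoIterator<Item = Result<Token, MyError>>,
{
    // `Result` implements `FromIterator`, so `collect` short-circuits
    // on the first `Err` and otherwise yields all `Ok` values.
    results.into_iter().collect()
}
```

Note that collecting this way is fail-fast: only the first error is returned. Callers who want to accumulate several errors would iterate manually instead.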
How would you deal with 2 errors in the same case? For instance, in the number case I was converting, it dealt with junk after period, and after |
That, or read the |
I don't really agree on the "but that should be fine in error path", because logos has already determined that we arrived at this error path. You're essentially asking users of the library to redo the work that the library just did. You should be able to get some information out from the |
Let me rephrase that: it's fine to sacrifice error path performance for happy path performance, because error path is by definition rare.
Except Lexer has, by design, absolutely no context of why it's doing anything it's doing. All it knows is at what index it started lexing a token, and at what index it currently is. Everything else is just munching through CPU instructions compiled from a state machine graph compiled from token definitions. The options are:
|
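Since the lexer only knows where the failed token started and where it stopped, one way to recover a reason is to re-inspect that slice on the (rare) error path. The sketch below assumes the error span is available as a byte range; `LexError` and `classify` are hypothetical names, and the classification rules are purely illustrative.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum LexError {
    UnterminatedString,
    InvalidCharacter(char),
    UnexpectedEof,
}

/// Work out why lexing failed by looking at the slice the lexer gave up on.
/// This repeats a little work, but only when an error has already occurred.
fn classify(source: &str, span: Range<usize>) -> LexError {
    let slice = &source[span];
    match slice.chars().next() {
        None => LexError::UnexpectedEof,
        // A token that opens with a quote but failed to match was never closed.
        Some('"') => LexError::UnterminatedString,
        Some(c) => LexError::InvalidCharacter(c),
    }
}
```

This keeps the happy path untouched, at the cost of duplicating some of the lexer's knowledge in the classifier.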
At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning Token::Error? It would be really nice to have an example of this too!