How to use peg for streams of data - or how to implement streaming in peg #326

torfmaster · 2022-11-12T23:20:37Z

Hi,

first of all let me thank you for your wonderful library. It saved me days of implementation (and lots of pain) for implementing a parser for a proprietary binary protocol.

I was wondering what it would take to use peg for streams of data. I am receiving a stream of potentially incomplete or corrupt data, and I am only interested in the valid parts of that stream. Currently, I have implemented a very rudimentary (and rather error-prone) parser for my protocol that detects the beginning and end of a message. My questions are:

1. How can I use peg to filter out only the interesting parts of the stream? I can only speculate, but it would probably make sense to look at the location where an error occurs to decide what to do (e.g. move on because the error is at the end, skip these bytes because the error is at the beginning, read more data because the error is in the middle).
2. What would it take to extend the API of peg to support streaming?

I am very interested in your thoughts. By the way, the project I am working on is located here: https://github.com/torfmaster/hackdose-sml-parser.

Thank you!

kevinmehall · 2022-11-13T00:49:52Z

Maintainer

Parsing Expression Grammars aren't particularly compatible with streaming. Backtracking allows returning all the way to the beginning of the input (preventing discarding early input), while the parse function as a whole returns a single result (preventing returning an early result before reaching the end).

It looks like you are parsing a sequence of messages. What you can do is wrap the PEG in a loop that parses a single message at a time. If your messages have a delimiter you can split on with BufRead::read_until or other simple framing, this is straightforward to split and then parse. If you can't find the end of a message without parsing it, take a look at the #[no_eof] rule attribute. This skips the check that the parser matches the full input, allowing you to parse a message from the beginning of the input, and return a parse result for the first message only. Use the position!() at the end of the rule to find where that message ends and the next one starts, then have the loop use that position to advance the input buffer and parse the next message.
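For the delimiter case, a minimal sketch of the framing step using only the standard library might look like the following. The function name `read_frames` is an assumption made for illustration, and dropping an unterminated tail is a simplification; a real reader would keep that tail and wait for more input. Each returned frame would then be handed to the peg-generated parser.

```rust
use std::io::{self, BufRead};

/// Splits a byte stream into frames terminated by `delim` and returns only
/// the complete frames. (Hypothetical helper, not part of rust-peg.)
fn read_frames<R: BufRead>(mut reader: R, delim: u8) -> io::Result<Vec<Vec<u8>>> {
    let mut frames = Vec::new();
    loop {
        let mut frame = Vec::new();
        // read_until appends bytes up to and including `delim`, or up to EOF.
        let n = reader.read_until(delim, &mut frame)?;
        if n == 0 {
            break; // EOF with nothing left to read
        }
        if frame.last() == Some(&delim) {
            frame.pop(); // strip the delimiter
            frames.push(frame); // a complete frame, ready to be parsed
        } else {
            break; // EOF in the middle of a message: incomplete, dropped here
        }
    }
    Ok(frames)
}
```

Because `&[u8]` implements `BufRead`, the helper can be tried on an in-memory buffer before wiring it to a socket or file.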

Assuming the protocol marks the end of messages, you can detect an incomplete message when the error position returned is equal to the length of the input because matches will fail at that position. Or you could even maybe use a custom ParseElem implementation that blocks waiting for further input if the position is higher than the buffer length -- not sure how well that would work.
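The loop described above could be sketched roughly as follows. To keep the example free of external crates, `parse_message` is a hand-written stand-in for a `#[no_eof]` peg rule that ends with `position!()`: it returns the payload and the end position on success, or the error location on failure. The toy framing (start byte `0x02`, one length byte, payload, end byte `0x03`) and the name `drain_messages` are assumptions, not the protocol from the question.

```rust
/// Stand-in for a `#[no_eof]` peg rule ending in `position!()`.
/// Ok((payload, end_position)) on success, Err(error_location) on failure.
/// Assumed toy framing: 0x02, length byte, payload bytes, 0x03.
fn parse_message(input: &[u8]) -> Result<(Vec<u8>, usize), usize> {
    match input.first() {
        Some(&0x02) => {}
        _ => return Err(0), // missing or wrong start byte
    }
    let len = match input.get(1) {
        Some(&n) => n as usize,
        None => return Err(1), // input ended before the length byte
    };
    let end = 2 + len;
    if input.len() < end {
        return Err(input.len()); // input ended inside the payload
    }
    match input.get(end) {
        Some(&0x03) => Ok((input[2..end].to_vec(), end + 1)),
        _ => Err(end), // wrong end byte, or input ended before it
    }
}

/// Removes every complete message from the front of `buf` and returns the
/// payloads. Bytes belonging to a possibly incomplete message stay in `buf`.
fn drain_messages(buf: &mut Vec<u8>) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        match parse_message(&buf[pos..]) {
            Ok((payload, end)) => {
                out.push(payload);
                pos += end; // advance past the parsed message
            }
            // Error at the very end of the input: probably incomplete,
            // so stop and wait for more data.
            Err(loc) if pos + loc == buf.len() => break,
            // Error before the end: treat as corrupt, skip one byte, resync.
            Err(_) => pos += 1,
        }
    }
    buf.drain(..pos);
    out
}
```

The caller appends newly received bytes to `buf` and calls `drain_messages` again. Note that this heuristic cannot tell a corrupt length byte from a long message that is still arriving, so a real implementation would probably want a maximum message size as well.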

kevinboulain · 2023-03-08T17:30:15Z

Same here, thanks for the library.

> If your messages have a delimiter you can split on with BufRead::read_until or other simple framing.

So, in my case, this particular thing I'm parsing doesn't really have any framing (though I did work around that in the end) and I was hoping I'd be able to cheat by checking if the location reported by peg::error::ParseError increased, i.e.: the parser actually consumed more data so maybe I can feed it some more and get a successful parse. Sadly, it looks like this isn't the case:

peg::parser! {
  grammar parser() for [u8] {
    #[no_eof]
    pub rule test1() = "a" "b" "c"
    #[no_eof]
    pub rule test2() = "a" "bc"
  }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test1() {
        assert!(matches!(
            parser::test1(b"ab"),
            Err(peg::error::ParseError { location: 2, .. })
        ))
    }

    #[test]
    fn test2() {
        assert!(matches!(
            parser::test2(b"ab"),
            Err(peg::error::ParseError { location: 1, .. }) // I hoped this would be 2.
        ))
    }
}

I guess the difference is to be expected given how the rules are declared but has the parser really no way to report it could have partially matched the token/rule?

kevinmehall Mar 9, 2023
Maintainer

The "" syntax is handled via the ParseLiteral trait, and its implementation explains the behavior you're seeing:

rust-peg/peg-runtime/str.rs, lines 58 to 67 in df9b0ca:

impl ParseLiteral for str {
    fn parse_string_literal(&self, pos: usize, literal: &str) -> RuleResult<()> {
        let l = literal.len();
        if self.len() >= pos + l && &self.as_bytes()[pos..pos + l] == literal.as_bytes() {
            RuleResult::Matched(pos + l, ())
        } else {
            RuleResult::Failed
        }
    }
}

The problem is that there's no way to distinguish which side of the && failed, and no way to even represent that in the return type.
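To make the distinction concrete, here is a sketch of a literal matcher that does separate the two failure causes. The `LiteralMatch` enum and `match_literal` function are hypothetical and are not part of rust-peg; `RuleResult` has no equivalent of the `Incomplete` state.

```rust
/// Hypothetical three-state result (rust-peg only has Matched and Failed).
#[derive(Debug, PartialEq)]
enum LiteralMatch {
    /// The literal matched; carries the position just after it.
    Matched(usize),
    /// At least one available byte differs from the literal.
    Mismatch,
    /// The input ended while every available byte still matched.
    Incomplete,
}

fn match_literal(input: &[u8], pos: usize, literal: &[u8]) -> LiteralMatch {
    // Bytes available from `pos` onward (empty if `pos` is past the end).
    let rest = input.get(pos..).unwrap_or(&[]);
    if rest.len() >= literal.len() {
        // Length check passed, so only the comparison can fail.
        if &rest[..literal.len()] == literal {
            LiteralMatch::Matched(pos + literal.len())
        } else {
            LiteralMatch::Mismatch
        }
    } else if literal.starts_with(rest) {
        // Too short, but what is there is a prefix of the literal.
        LiteralMatch::Incomplete
    } else {
        LiteralMatch::Mismatch
    }
}
```

With the input `b"ab"`, matching `b"bc"` at position 1 yields `Incomplete` rather than a plain failure, which is the signal a streaming caller would need.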

I've seen this technique used in a REPL to accept multiline input and respond with a parse error if the error position is the middle, or prompt for another line if the error is at the end. This issue doesn't come up there because to have a partial match at the end of the line that could succeed with the addition of a continuation line would only be possible if the literal has a \n in it, and that would be very uncommon in a programming language grammar. But in your case I assume the input is coming from reading a socket or something where the chunk boundary could happen anywhere.

- You could perhaps use interior mutability on a custom input type to track whether the parser tried to read past the end of input.
- Instead of using "" literals, you could define your own helper rule that uses the *<{len}> repeat on a [_] match for any single byte to test each position for EOF first, before comparing with the literal. Note that if the input is the right length but the compare fails, the error position is still at the beginning of the string because that's how {? } works. You'd also want to avoid the quiet!{} construct as it prevents advancing the error position within it.

peg::parser! {
  grammar parser() for [u8] {
    rule literal(s: &'static [u8])
      = t:$([_]*<{s.len()}>) {?
          if t == s { Ok(()) } else {
            Err(std::str::from_utf8(s).unwrap_or("<literal>"))
          }
        }
    #[no_eof]
    pub rule test2() = literal(b"a") literal(b"bc")
  }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test2() {
        assert!(matches!(
            parser::test2(b"ab"),
            Err(peg::error::ParseError { location: 2, .. })
        ));

        assert!(matches!(
            parser::test2(b"abc"),
            Ok(..)
        ));
    }
}
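The interior mutability idea could be sketched along these lines. `TrackedInput` and `match_tracked` are hypothetical names; to use such a type as a peg input it would additionally have to implement the crate's input traits (`Parse`, `ParseElem`, and so on), and that wiring is omitted here, so this only shows the tracking itself.

```rust
use std::cell::Cell;

/// Hypothetical input wrapper (not part of rust-peg) that remembers whether
/// anything tried to read past the end of the buffer.
struct TrackedInput<'a> {
    bytes: &'a [u8],
    hit_end: Cell<bool>,
}

impl<'a> TrackedInput<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        TrackedInput { bytes, hit_end: Cell::new(false) }
    }

    /// Returns the byte at `pos`. Takes `&self`, so the flag is updated
    /// through the `Cell` when the read falls outside the buffer.
    fn get(&self, pos: usize) -> Option<u8> {
        let byte = self.bytes.get(pos).copied();
        if byte.is_none() {
            self.hit_end.set(true);
        }
        byte
    }

    /// True if some read went past the end of the buffer.
    fn hit_end(&self) -> bool {
        self.hit_end.get()
    }
}

/// Byte-by-byte literal match that goes through `TrackedInput::get`, so an
/// early end of input is recorded instead of looking like a plain mismatch.
fn match_tracked(input: &TrackedInput, pos: usize, literal: &[u8]) -> Option<usize> {
    for (i, &expected) in literal.iter().enumerate() {
        if input.get(pos + i)? != expected {
            return None;
        }
    }
    Some(pos + literal.len())
}
```

After a failed parse, the caller would check `hit_end()`: if it is set, reading more data might help; if not, the failure was a genuine mismatch.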

kevinboulain Mar 9, 2023

> But in your case I assume the input is coming from reading a socket or something where the chunk boundary could happen anywhere.

Yes, sadly that's the root of the issue here (and I don't really expect rust-peg to solve that for me, I was merely raising the issue to see if this had already been thought through).

> You could perhaps use interior mutability on a custom input type to track whether the parser tried to read past the end of input.

Hm, I might try that, I've seen it's quite simple to implement a parser over any type (as opposed to a few other crates that are restricted to str for example).

> you could define your own helper rule that uses the *<{len}> repeat on a [_]

Wouldn't that be terrible performance-wise? I believe the parser would allocate unnecessarily: I encountered #292 when using repeats and I had to rewrite something like t:$(CHAR()*<{n}>) as suggested in #283 (comment). That's when I realized peg::RuleResult::Failed doesn't have any indication of where the parse failed so I'm not sure if the parser would ever be able to surface this information.

kevinmehall Mar 12, 2023
Maintainer

> Wouldn't that be terrible performance-wise? I believe the parser would allocate unnecessarily:

It won't allocate since $ marks its inner expression's return value as not used, so all the return types get replaced with (). The repeat then uses Vec<()> which just keeps a length counter but doesn't allocate. (#292 was closed as unnecessary since the existing code in rust-peg already did that)
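The claim about `Vec<()>` can be checked with the standard library alone. This is a small sketch, with `collect_units` as a made-up name that mimics a repeat whose element type is `()`; it is not the code rust-peg generates.

```rust
/// Pushes `n` unit values into a Vec, the way a repeat over an expression
/// returning `()` would collect its results.
fn collect_units(n: usize) -> Vec<()> {
    let mut v = Vec::new();
    for _ in 0..n {
        // `()` is zero-sized, so this only bumps the length counter.
        v.push(());
    }
    v
}
```

Since `()` has size 0, the vector never needs backing storage: `std::mem::size_of::<()>()` is 0 and `capacity()` reports `usize::MAX` for zero-sized element types.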

I haven't looked at the asm or benchmarked this one, but Rust/LLVM are surprisingly good at optimizing the code that rust-peg produces sometimes. It's possible it will inline it enough to figure out it's just a length check.

Not saying that hack is a great idea to use though...you're quite possibly better off with a parser generator that supports streaming natively for what you're trying to do.

kevinboulain Mar 14, 2023

> It won't allocate since $ marks its inner expression's return value as not used, so all the return types get replaced with ().

Apologies, I might have profiled under a non-optimized test build. I experimented a bit again and the Vec is clearly visible in the profile only if I don't set opt-level. Oops.

> Not saying that hack is a great idea to use though...

Worst case there's always #283 (comment).

> you're quite possibly better off with a parser generator that supports streaming natively

Is there one besides nom? From a quick test, I think it sometimes just asks for one more byte, and the resulting parsers are much more verbose, so it's not perfect out of the box either. I quite like this crate (one of the very few that support parsing over any type) and I have an okayish workaround for now.

Thanks!
