Merge pull request #15 from wirelyre/syntax

Add basic syntax highlighting for PEGs
pest-parser · Oct 25, 2018 · 2a5c7a2
2 parents 26bcc38 + 0395a5f
commit 2a5c7a2
Showing 9 changed files with 91 additions and 39 deletions.
diff --git a/book.toml b/book.toml
@@ -2,3 +2,6 @@
 title = "A thoughtful introduction to the pest parser"
 description = "An introduction to the pest parser by implementing a Rust grammar subset"
 author = "Dragoș Tiselice"
+
+[output.html]
+additional-js = ["highlight-pest.js"]
diff --git a/highlight-pest.js b/highlight-pest.js
@@ -0,0 +1,41 @@
+// Syntax highlighting for pest PEGs.
+
+// mdBook exposes a minified version of highlight.js, so the language
+// definition objects below have abbreviated property names:
+//     "b"  => begin
+//     "c"  => contains
+//     "cN" => className
+//     "e"  => end
+
+hljs.registerLanguage("pest", function(hljs) {
+
+    // Basic syntax.
+    var comment = {cN: "comment", b: "//", e: /$/};
+    var ident = {cN: "title", b: /[_a-zA-Z][_a-z0-9A-Z]*/};
+    var special = {b: /COMMENT|WHITESPACE/, cN: "keyword"};
+
+    // Escape sequences within a string or character literal.
+    var escape = {b: /\\./};
+
+    // Per highlight.js style, only built-in rules should be highlighted inside
+    // a definition.
+    var rule = {
+        b: /[@_$!]?\{/, e: "}",
+        k: {built_in: "ANY SOI EOI PUSH POP PEEK " +
+                      "ASCII_ALPHANUMERIC ASCII_DIGIT ASCII_HEX_DIGIT " +
+                      "ASCII_NONZERO_DIGIT NEWLINE"},
+        c: [comment,
+            {cN: "string", b: '"', e: '"', c: [escape]},
+            {cN: "string", b: "'", e: "'", c: [escape]}]
+    };
+
+    return {
+        c: [special, rule, ident, comment]
+    };
+
+});
+
+// This file is inserted after the default highlight.js invocation, which tags
+// unknown-language blocks with CSS classes but doesn't highlight them.
+Array.from(document.querySelectorAll("code.language-pest"))
+    .forEach(hljs.highlightBlock);
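
The abbreviated property names in `highlight-pest.js` can be hard to read next to the highlight.js documentation, which uses the full names. Below is a minimal sketch of a helper that expands them. `expandDefinition` and `FULL_NAMES` are hypothetical names, not part of this commit or of highlight.js. The mapping `k` => `keywords` is not listed in the file's own comment; it is assumed here from how `k` is used in the `rule` definition.

```javascript
// Hypothetical helper, not part of the commit: expands the minified
// highlight.js property names used in highlight-pest.js.
// "k" => "keywords" is an assumption; the other four come from the
// comment at the top of the file.
var FULL_NAMES = {b: "begin", c: "contains", cN: "className", e: "end", k: "keywords"};

function expandDefinition(def) {
    var out = {};
    Object.keys(def).forEach(function (key) {
        var value = def[key];
        // In this file, only "c" (contains) holds nested definitions.
        if (key === "c") {
            value = value.map(expandDefinition);
        }
        var known = Object.prototype.hasOwnProperty.call(FULL_NAMES, key);
        out[known ? FULL_NAMES[key] : key] = value;
    });
    return out;
}

// The double-quoted string definition from the `rule` object.
var escapeSeq = {b: /\\./};
var stringDef = {cN: "string", b: '"', e: '"', c: [escapeSeq]};
var expanded = expandDefinition(stringDef);
console.log(expanded);
```

The helper only renames keys, so regular expressions and keyword strings pass through unchanged.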
diff --git a/src/examples/csv.md b/src/examples/csv.md
@@ -52,7 +52,7 @@ code! This is a very important attribute.
 into Rust code. Let's write a grammar for a CSV file that contains numbers.
 Create a new file named `src/csv.pest` with a single line:
 
-```
+```pest
 field = { (ASCII_DIGIT | "." | "-")+ }
 ```
 
@@ -98,7 +98,7 @@ Yikes! That's a complicated type! But you can see that the successful parse was
 
 For now, let's complete the grammar:
 
-```
+```pest
 field = { (ASCII_DIGIT | "." | "-")+ }
 record = { field ~ ("," ~ field)* }
 file = { SOI ~ (record ~ ("\r\n" | "\n"))* ~ EOI }
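
As a rough cross-check of what the three rules visible in this hunk accept, here is a sketch that restates them as a JavaScript regular expression. This is an approximation for illustration, not how `pest` works internally, and the names `FIELD`, `RECORD`, `FILE` and `matchesCsv` are hypothetical. It relies on the grammar defining no `WHITESPACE` rule, so nothing is skipped implicitly.

```javascript
// Regex restatement of the three pest rules shown in the hunk (an
// approximation for illustration only).
var FIELD = "[0-9.\\-]+";                      // field  = { (ASCII_DIGIT | "." | "-")+ }
var RECORD = FIELD + "(?:," + FIELD + ")*";    // record = { field ~ ("," ~ field)* }
// file = { SOI ~ (record ~ ("\r\n" | "\n"))* ~ EOI }
var FILE = new RegExp("^(?:" + RECORD + "(?:\\r\\n|\\n))*$");

function matchesCsv(input) {
    return FILE.test(input);
}

console.log(matchesCsv("65279,1179403647\n3.1415927,2.7182817\n")); // true
console.log(matchesCsv("1,2"));     // false: every record must end with a newline
console.log(matchesCsv("1, 2\n"));  // false: spaces are not part of `field`
```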

diff --git a/src/examples/ini.md b/src/examples/ini.md
@@ -42,7 +42,7 @@ recognize a single character in that set. The built-in rule
 `ASCII_ALPHANUMERIC` is a shortcut to represent any uppercase or lowercase
 ASCII letter, or any digit.
 
-```
+```pest
 char = { ASCII_ALPHANUMERIC | "." | "_" | "/" }
 ```
 
@@ -51,14 +51,14 @@ be empty (as in the line `ip=` above). That is, the former consist of one or
 more characters, `char+`; and the latter consist of zero or more characters,
 `char*`. We separate the meaning into two rules:
 
-```
+```pest
 name = { char+ }
 value = { char* }
 ```
 
 Now it's easy to express the two kinds of input lines.
 
-```
+```pest
 section = { "[" ~ name ~ "]" }
 property = { name ~ "=" ~ value }
 ```
@@ -67,7 +67,7 @@ Finally, we need a rule to represent an entire input file. The expression
 `(section | property)?` matches `section`, `property`, or else nothing. Using
 the built-in rule `NEWLINE` to match line endings:
 
-```
+```pest
 file = {
     SOI ~
     ((section | property)? ~ NEWLINE)* ~
@@ -193,7 +193,7 @@ If defined, it will be implicitly run, as many times as possible, at every
 tilde `~` and between every repetition (for example, `*` and `+`). For our INI
 parser, only spaces are legal whitespace.
 
-```
+```pest
 WHITESPACE = _{ " " }
 ```
 
@@ -209,7 +209,7 @@ char+ }`. Rules that *are* whitespace-sensitive need to be marked [*atomic*]
 with a leading at sign `@{ ... }`. In atomic rules, automatic whitespace
 handling is disabled, and interior rules are silent.
 
-```
+```pest
 name = @{ char+ }
 value = @{ char* }
 ```

diff --git a/src/examples/json.md b/src/examples/json.md
@@ -81,15 +81,15 @@ strings (where it must be parsed separately) and between digits in numbers
 (where it is not allowed). This makes it a good fit for `pest`'s [implicit
 whitespace]. In `src/json.pest`:
 
-```
+```pest
 WHITESPACE = _{ " " | "\t" | "\r" | "\n" }
 ```
 
 [The JSON specification] includes diagrams for parsing JSON strings. We can
 write the grammar directly from that page. Let's write `object` as a sequence
 of `pair`s separated by commas `,`.
 
-```
+```pest
 object = {
     "{" ~ "}" |
     "{" ~ pair ~ ("," ~ pair)* ~ "}"
@@ -110,7 +110,7 @@ such as in `[0, 1,]`, is illegal in JSON.
 Now we can write `value`, which represents any single data type. We'll mimic
 our AST by writing `boolean` and `null` as separate rules.
 
-```
+```pest
 value = _{ object | array | string | number | boolean | null }
 
 boolean = { "true" | "false" }
@@ -129,7 +129,7 @@ except the ones given in parentheses. In this case, any character is legal
 inside a string, except for double quote `"` and backslash <code>\\</code>,
 which require separate parsing logic.
 
-```
+```pest
 string = ${ "\"" ~ inner ~ "\"" }
 inner = @{ char* }
 char = {
@@ -148,7 +148,7 @@ Numbers have four logical parts: an optional sign, an integer part, an optional
 fractional part, and an optional exponent. We'll mark `number` atomic so that
 whitespace cannot appear between its parts.
 
-```
+```pest
 number = @{
     "-"?
     ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
@@ -162,7 +162,7 @@ of a JSON file is a single object or array. We'll mark this rule [silent], so
 that a parsed JSON file contains only two token pairs: the parsed value itself,
 and [the `EOI` rule].
 
-```
+```pest
 json = _{ SOI ~ (object | array) ~ EOI }
 ```
 

diff --git a/src/grammars/peg.md b/src/grammars/peg.md
@@ -4,7 +4,7 @@ Parsing expression grammars (PEGs) are simply a strict representation of the
 simple imperative code that you would write if you were writing a parser by
 hand.
 
-```
+```pest
 number = {            // To recognize a number...
     ASCII_DIGIT+      //   take as many ASCII digits as possible (at least one).
 }
@@ -21,7 +21,7 @@ comments above.
 
 When a [repetition] PEG expression is run on an input string,
 
-```
+```pest
 ASCII_DIGIT+      // one or more characters from '0' to '9'
 ```
 
@@ -87,7 +87,7 @@ The engine will not back up and try again.
 
 Consider this grammar, matching on the string `"frumious"`:
 
-```
+```pest
 word = {     // to recognize a word...
     ANY*     //   take any character, zero or more times...
     ~ ANY    //   followed by any character
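
The "will not back up and try again" behaviour in this hunk can be illustrated with a few hand-written parser functions. This is a minimal sketch, not `pest`'s implementation: `any`, `star` and `seq` are hypothetical helpers, and `any` advances by one UTF-16 code unit, which is enough for the ASCII input used here.

```javascript
// Each parser takes (input, pos) and returns the new position,
// or null if it fails.

// ANY: consume exactly one character, if there is one.
function any(input, pos) {
    return pos < input.length ? pos + 1 : null;
}

// Greedy repetition (`*`): repeat until the inner parser fails.
// It never gives input back to whatever follows it.
function star(parser) {
    return function (input, pos) {
        var next;
        while ((next = parser(input, pos)) !== null) {
            pos = next;
        }
        return pos;
    };
}

// Sequence (`~`): run `first`, then `second` from where `first` stopped.
function seq(first, second) {
    return function (input, pos) {
        var mid = first(input, pos);
        return mid === null ? null : second(input, mid);
    };
}

var word = seq(star(any), any);       // ANY* ~ ANY
var reordered = seq(any, star(any));  // ANY ~ ANY*

console.log(word("frumious", 0));      // null: ANY* consumed everything first
console.log(reordered("frumious", 0)); // 8: one character, then the rest
```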

diff --git a/src/grammars/syntax.md b/src/grammars/syntax.md
@@ -2,7 +2,7 @@
 
 `pest` grammars are lists of rules. Rules are defined like this:
 
-```
+```pest
 my_rule = { ... }
 
 another_rule = {        // comments are preceded by two slashes
@@ -16,7 +16,7 @@ to be Rust keywords.
 The left curly bracket `{` defining a rule can be preceded by [symbols that
 affect its operation]:
 
-```
+```pest
 silent_rule = _{ ... }
 atomic_rule = @{ ... }
 ```
@@ -64,7 +64,7 @@ ANY
 Finally, you can **refer to other rules** by writing their names directly, and
 even **use rules recursively**:
 
-```
+```pest
 my_rule = { "slithy " ~ other_rule }
 other_rule = { "toves" }
 recursive_rule = { "mimsy " ~ recursive_rule }
@@ -106,13 +106,15 @@ if `first` matched some input before it failed. When encountering a parse
 failure, the engine will try the next ordered choice as though no input had
 been matched. Failed parses never consume any input.
 
-```
+```pest
 start = { "Beware " ~ creature }
 creature = {
     ("the " ~ "Jabberwock")
     | ("the " ~ "Jubjub bird")
 }
+```
 
+```
 "Beware the Jubjub bird"
  ^ (start) Parses via the second choice of `creature`,
            even though the first choice matched "the " successfully.
@@ -178,7 +180,7 @@ a kind of "NOT" statement: "the input string must match `bar` but NOT `foo`".
 
 This leads to the common idiom meaning "any character but":
 
-```
+```pest
 not_space_or_tab = {
     !(                // if the following text is not
         " "           //     a space
@@ -221,12 +223,14 @@ Larger expressions can be repeated by surrounding them with parentheses.
 Repetition operators have the highest precedence, followed by predicate
 operators, the sequence operator, and finally ordered choice.
 
-```
+```pest
 my_rule = {
     "a"* ~ "b"?
     | &"b"+ ~ "a"
 }
-    equivalent to
+
+// equivalent to
+
 my_rule = {
       ( ("a"*) ~ ("b"?) )
     | ( (&("b"+)) ~ "a" )
@@ -243,7 +247,7 @@ For example, to ensure that a rule matches the entire input, where any syntax
 error results in a failed parse (rather than a successful but incomplete
 parse):
 
-```
+```pest
 main = {
     SOI
     ~ (...)
@@ -261,11 +265,13 @@ The **optional rules `WHITESPACE` and `COMMENT`** implement this behaviour. If
 either (or both) are defined, they will be implicitly inserted at every
 [sequence] and between every [repetition] (except in [atomic rules]).
 
-```
+```pest
 expression = { "4" ~ "+" ~ "5" }
 WHITESPACE = _{ " " }
 COMMENT = _{ "/*" ~ (!"*/" ~ ANY)* ~ "*/" }
-    matches
+```
+
+```
 "4+5"
 "4 + 5"
 "4  +     5"
@@ -276,7 +282,7 @@ As you can see, `WHITESPACE` and `COMMENT` are run repeatedly, so they need
 only match a single whitespace character or a single comment. The grammar above
 is equivalent to:
 
-```
+```pest
 expression = {
     "4"   ~ (ws | com)*
     ~ "+" ~ (ws | com)*
@@ -291,11 +297,13 @@ Note that implicit whitespace is *not* inserted at the beginning or end of rules
 include implicit whitespace at the beginning and end of a rule, you will need to
 sandwich it between two empty rules (often `SOI` and `EOI` [as above]):
 
-```
+```pest
 WHITESPACE = _{ " " }
 expression = { "4" ~ "+" ~ "5" }
 main = { SOI ~ expression ~ EOI }
-    matches
+```
+
+```
 "4+5"
 "  4 + 5   "
 ```
@@ -318,7 +326,7 @@ silent, it will never appear in a parse result.
 To make a silent rule, precede the left curly bracket `{` with a low line
 (underscore) `_`.
 
-```
+```pest
 silent = _{ ... }
 ```
 
@@ -330,7 +338,7 @@ silent = _{ ... }
 `pest` has two kinds of atomic rules: **atomic** and **compound atomic**. To
 make one, write the sigil before the left curly bracket `{`.
 
-```
+```pest
 atomic = @{ ... }
 compound_atomic = ${ ... }
 ```
@@ -367,7 +375,7 @@ This is where you use a **non-atomic** rule. Write an exclamation mark `!` in
 front of the defining curly bracket. The rule will run as non-atomic, whether
 it is called from an atomic rule or not.
 
-```
+```pest
 fstring = @{ "\"" ~ ... }
 expr = !{ ... }
 ```
@@ -383,7 +391,7 @@ rather than *the same pattern*.
 
 For example,
 
-```
+```pest
 same_text = {
     PUSH( "a" | "b" | "c" )
     ~ POP
@@ -411,7 +419,7 @@ const raw_str: &str = r###"
 When parsing a raw string, we have to keep track of how many number signs `#`
 occurred before the quotation mark. We can do this using the stack:
 
-```
+```pest
 raw_string = {
     "r" ~ PUSH("#"*) ~ "\""    // push the number signs onto the stack
     ~ raw_string_interior
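
The idea of pushing the number signs and reusing them later can be sketched with an explicit stack. The rest of `raw_string` is not shown in this hunk, so the closing step below is an assumption based on how Rust raw strings are delimited: a quotation mark followed by the same number signs that were pushed. `matchRawString` is a hypothetical function, not `pest` API.

```javascript
// Hand-written sketch of stack-based raw string matching (not pest itself).
// Returns the interior text and the matched length, or null on failure.
function matchRawString(input) {
    var stack = [];
    var pos = 0;

    if (input[pos] !== "r") return null;             // "r"
    pos += 1;

    var hashesStart = pos;
    while (input[pos] === "#") pos += 1;
    stack.push(input.slice(hashesStart, pos));       // PUSH("#"*)

    if (input[pos] !== '"') return null;             // opening quotation mark
    pos += 1;

    // Assumed closing delimiter: a quotation mark plus the pushed number signs.
    var closing = '"' + stack[stack.length - 1];
    var end = input.indexOf(closing, pos);
    if (end === -1) return null;

    stack.pop();                                     // the pushed text is used up
    return {interior: input.slice(pos, end), length: end + closing.length};
}

console.log(matchRawString('r##"a "# b"##'));
console.log(matchRawString('r#"unterminated"'));
```

A quotation mark followed by too few number signs, as in the first call, stays part of the interior because it does not equal the pushed text.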

diff --git a/src/intro.md b/src/intro.md
@@ -13,7 +13,7 @@ abstractions, `pest` parsers can be **very fast**.
 Here is the complete grammar for a simple calculator [developed in a (currently
 unwritten) later chapter](examples/calculator.html):
 
-```
+```pest
 num = @{ int ~ ("." ~ ASCII_DIGIT*)? ~ (^"e" ~ int)? }
     int = { ("+" | "-")? ~ ASCII_DIGIT+ }