Merge pull request #15 from wirelyre/syntax
Add basic syntax highlighting for PEGs
wirelyre authored Oct 25, 2018
2 parents 26bcc38 + 0395a5f commit 2a5c7a2
Showing 9 changed files with 91 additions and 39 deletions.
3 changes: 3 additions & 0 deletions book.toml
@@ -2,3 +2,6 @@
title = "A thoughtful introduction to the pest parser"
description = "An introduction to the pest parser by implementing a Rust grammar subset"
author = "Dragoș Tiselice"

[output.html]
additional-js = ["highlight-pest.js"]
41 changes: 41 additions & 0 deletions highlight-pest.js
@@ -0,0 +1,41 @@
// Syntax highlighting for pest PEGs.

// mdBook exposes a minified version of highlight.js, so the language
// definition objects below have abbreviated property names:
// "b" => begin
// "c" => contains
// "cN" => className
// "e" => end

hljs.registerLanguage("pest", function(hljs) {

    // Basic syntax.
    var comment = {cN: "comment", b: "//", e: /$/};
    var ident = {cN: "title", b: /[_a-zA-Z][_a-z0-9A-Z]*/};
    var special = {b: /COMMENT|WHITESPACE/, cN: "keyword"};

    // Escape sequences within a string or character literal.
    var escape = {b: /\\./};

    // Per highlight.js style, only built-in rules should be highlighted inside
    // a definition.
    var rule = {
        b: /[@_$!]?\{/, e: "}",
        k: {built_in: "ANY SOI EOI PUSH POP PEEK " +
                      "ASCII_ALPHANUMERIC ASCII_DIGIT ASCII_HEX_DIGIT " +
                      "ASCII_NONZERO_DIGIT NEWLINE"},
        c: [comment,
            {cN: "string", b: '"', e: '"', c: [escape]},
            {cN: "string", b: "'", e: "'", c: [escape]}]
    };

    return {
        c: [special, rule, ident, comment]
    };

});

// This file is inserted after the default highlight.js invocation, which tags
// unknown-language blocks with CSS classes but doesn't highlight them.
Array.from(document.querySelectorAll("code.language-pest"))
    .forEach(hljs.highlightBlock);
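
Reviewer note: for readers unfamiliar with the minified property names, the sketch below expands them back to their long forms and checks which rule openers the `rule` begin pattern accepts. `LONG_NAMES`, `expandMode`, and `ruleOpener` are hypothetical helpers written for illustration only; they are not part of mdBook, highlight.js, or this commit. The `k` to `keywords` mapping is not listed in the file's header comment and is assumed here from how `k` is used in the definition.

```javascript
// Hypothetical lookup table: abbreviated key -> long property name.
// "k" => "keywords" is an assumption; the other four come from the
// comment at the top of highlight-pest.js.
const LONG_NAMES = {
  b: "begin",
  c: "contains",
  cN: "className",
  e: "end",
  k: "keywords",
};

// Return a copy of a mode object with long property names.
// Nested modes under "c" (contains) are expanded recursively;
// every other value is copied through unchanged.
function expandMode(mode) {
  const expanded = {};
  for (const [key, value] of Object.entries(mode)) {
    const name = LONG_NAMES[key] || key;
    expanded[name] = key === "c" ? value.map((m) => expandMode(m)) : value;
  }
  return expanded;
}

// The same begin pattern as the `rule` mode: an optional modifier
// (@ atomic, _ silent, $ compound atomic, ! non-atomic) and then "{".
const ruleStart = /[@_$!]?\{/;

// Return the text that opens a rule body, or null if there is none.
function ruleOpener(line) {
  const match = ruleStart.exec(line);
  return match ? match[0] : null;
}

// Usage: expand a string mode like the ones inside `rule`.
const stringMode = { cN: "string", b: '"', e: '"', c: [{ b: /\\./ }] };
console.log(expandMode(stringMode));
console.log(ruleOpener("name = @{ char+ }")); // prints "@{"
console.log(ruleOpener("number = { ASCII_DIGIT+ }")); // prints "{"
```

Keeping the expansion in a separate helper leaves the definition itself in the abbreviated form that the file's header comment says mdBook's bundled highlight.js expects.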
4 changes: 2 additions & 2 deletions src/examples/csv.md
@@ -52,7 +52,7 @@ code! This is a very important attribute.
into Rust code. Let's write a grammar for a CSV file that contains numbers.
Create a new file named `src/csv.pest` with a single line:

```
```pest
field = { (ASCII_DIGIT | "." | "-")+ }
```

@@ -98,7 +98,7 @@ Yikes! That's a complicated type! But you can see that the successful parse was
For now, let's complete the grammar:
```
```pest
field = { (ASCII_DIGIT | "." | "-")+ }
record = { field ~ ("," ~ field)* }
file = { SOI ~ (record ~ ("\r\n" | "\n"))* ~ EOI }
12 changes: 6 additions & 6 deletions src/examples/ini.md
@@ -42,7 +42,7 @@ recognize a single character in that set. The built-in rule
`ASCII_ALPHANUMERIC` is a shortcut to represent any uppercase or lowercase
ASCII letter, or any digit.

```
```pest
char = { ASCII_ALPHANUMERIC | "." | "_" | "/" }
```

@@ -51,14 +51,14 @@ be empty (as in the line `ip=` above). That is, the former consist of one or
more characters, `char+`; and the latter consist of zero or more characters,
`char*`. We separate the meaning into two rules:

```
```pest
name = { char+ }
value = { char* }
```

Now it's easy to express the two kinds of input lines.

```
```pest
section = { "[" ~ name ~ "]" }
property = { name ~ "=" ~ value }
```
@@ -67,7 +67,7 @@ Finally, we need a rule to represent an entire input file. The expression
`(section | property)?` matches `section`, `property`, or else nothing. Using
the built-in rule `NEWLINE` to match line endings:

```
```pest
file = {
SOI ~
((section | property)? ~ NEWLINE)* ~
@@ -193,7 +193,7 @@ If defined, it will be implicitly run, as many times as possible, at every
tilde `~` and between every repetition (for example, `*` and `+`). For our INI
parser, only spaces are legal whitespace.

```
```pest
WHITESPACE = _{ " " }
```

@@ -209,7 +209,7 @@ char+ }`. Rules that *are* whitespace-sensitive need to be marked [*atomic*]
with a leading at sign `@{ ... }`. In atomic rules, automatic whitespace
handling is disabled, and interior rules are silent.

```
```pest
name = @{ char+ }
value = @{ char* }
```
12 changes: 6 additions & 6 deletions src/examples/json.md
@@ -81,15 +81,15 @@ strings (where it must be parsed separately) and between digits in numbers
(where it is not allowed). This makes it a good fit for `pest`'s [implicit
whitespace]. In `src/json.pest`:

```
```pest
WHITESPACE = _{ " " | "\t" | "\r" | "\n" }
```

[The JSON specification] includes diagrams for parsing JSON strings. We can
write the grammar directly from that page. Let's write `object` as a sequence
of `pair`s separated by commas `,`.

```
```pest
object = {
"{" ~ "}" |
"{" ~ pair ~ ("," ~ pair)* ~ "}"
@@ -110,7 +110,7 @@ such as in `[0, 1,]`, is illegal in JSON.
Now we can write `value`, which represents any single data type. We'll mimic
our AST by writing `boolean` and `null` as separate rules.

```
```pest
value = _{ object | array | string | number | boolean | null }
boolean = { "true" | "false" }
@@ -129,7 +129,7 @@ except the ones given in parentheses. In this case, any character is legal
inside a string, except for double quote `"` and backslash <code>\\</code>,
which require separate parsing logic.

```
```pest
string = ${ "\"" ~ inner ~ "\"" }
inner = @{ char* }
char = {
@@ -148,7 +148,7 @@ Numbers have four logical parts: an optional sign, an integer part, an optional
fractional part, and an optional exponent. We'll mark `number` atomic so that
whitespace cannot appear between its parts.

```
```pest
number = @{
"-"?
~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
@@ -162,7 +162,7 @@ of a JSON file is a single object or array. We'll mark this rule [silent], so
that a parsed JSON file contains only two token pairs: the parsed value itself,
and [the `EOI` rule].

```
```pest
json = _{ SOI ~ (object | array) ~ EOI }
```

6 changes: 3 additions & 3 deletions src/grammars/peg.md
@@ -4,7 +4,7 @@ Parsing expression grammars (PEGs) are simply a strict representation of the
simple imperative code that you would write if you were writing a parser by
hand.

```
```pest
number = { // To recognize a number...
ASCII_DIGIT+ // take as many ASCII digits as possible (at least one).
}
@@ -21,7 +21,7 @@ comments above.

When a [repetition] PEG expression is run on an input string,

```
```pest
ASCII_DIGIT+ // one or more characters from '0' to '9'
```

@@ -87,7 +87,7 @@ The engine will not back up and try again.

Consider this grammar, matching on the string `"frumious"`:

```
```pest
word = { // to recognize a word...
ANY* // take any character, zero or more times...
~ ANY // followed by any character
44 changes: 26 additions & 18 deletions src/grammars/syntax.md
@@ -2,7 +2,7 @@

`pest` grammars are lists of rules. Rules are defined like this:

```
```pest
my_rule = { ... }
another_rule = { // comments are preceded by two slashes
@@ -16,7 +16,7 @@ to be Rust keywords.
The left curly bracket `{` defining a rule can be preceded by [symbols that
affect its operation]:

```
```pest
silent_rule = _{ ... }
atomic_rule = @{ ... }
```
@@ -64,7 +64,7 @@ ANY
Finally, you can **refer to other rules** by writing their names directly, and
even **use rules recursively**:

```
```pest
my_rule = { "slithy " ~ other_rule }
other_rule = { "toves" }
recursive_rule = { "mimsy " ~ recursive_rule }
@@ -106,13 +106,15 @@ if `first` matched some input before it failed. When encountering a parse
failure, the engine will try the next ordered choice as though no input had
been matched. Failed parses never consume any input.

```
```pest
start = { "Beware " ~ creature }
creature = {
("the " ~ "Jabberwock")
| ("the " ~ "Jubjub bird")
}
```

```
"Beware the Jubjub bird"
^ (start) Parses via the second choice of `creature`,
even though the first choice matched "the " successfully.
@@ -178,7 +180,7 @@ a kind of "NOT" statement: "the input string must match `bar` but NOT `foo`".

This leads to the common idiom meaning "any character but":

```
```pest
not_space_or_tab = {
!( // if the following text is not
" " // a space
@@ -221,12 +223,14 @@ Larger expressions can be repeated by surrounding them with parentheses.
Repetition operators have the highest precedence, followed by predicate
operators, the sequence operator, and finally ordered choice.

```
```pest
my_rule = {
"a"* ~ "b"?
| &"b"+ ~ "a"
}
equivalent to
// equivalent to
my_rule = {
( ("a"*) ~ ("b"?) )
| ( (&("b"+)) ~ "a" )
@@ -243,7 +247,7 @@ For example, to ensure that a rule matches the entire input, where any syntax
error results in a failed parse (rather than a successful but incomplete
parse):

```
```pest
main = {
SOI
~ (...)
@@ -261,11 +265,13 @@ The **optional rules `WHITESPACE` and `COMMENT`** implement this behaviour. If
either (or both) are defined, they will be implicitly inserted at every
[sequence] and between every [repetition] (except in [atomic rules]).

```
```pest
expression = { "4" ~ "+" ~ "5" }
WHITESPACE = _{ " " }
COMMENT = _{ "/*" ~ (!"*/" ~ ANY)* ~ "*/" }
matches
```

```
"4+5"
"4 + 5"
"4 + 5"
@@ -276,7 +282,7 @@ As you can see, `WHITESPACE` and `COMMENT` are run repeatedly, so they need
only match a single whitespace character or a single comment. The grammar above
is equivalent to:

```
```pest
expression = {
"4" ~ (ws | com)*
~ "+" ~ (ws | com)*
@@ -291,11 +297,13 @@ Note that implicit whitespace is *not* inserted at the beginning or end of rules
include implicit whitespace at the beginning and end of a rule, you will need to
sandwich it between two empty rules (often `SOI` and `EOI` [as above]):

```
```pest
WHITESPACE = _{ " " }
expression = { "4" ~ "+" ~ "5" }
main = { SOI ~ expression ~ EOI }
matches
```

```
"4+5"
" 4 + 5 "
```
@@ -318,7 +326,7 @@ silent, it will never appear in a parse result.
To make a silent rule, precede the left curly bracket `{` with a low line
(underscore) `_`.

```
```pest
silent = _{ ... }
```

Expand All @@ -330,7 +338,7 @@ silent = _{ ... }
`pest` has two kinds of atomic rules: **atomic** and **compound atomic**. To
make one, write the sigil before the left curly bracket `{`.

```
```pest
atomic = @{ ... }
compound_atomic = ${ ... }
```
Expand Down Expand Up @@ -367,7 +375,7 @@ This is where you use a **non-atomic** rule. Write an exclamation mark `!` in
front of the defining curly bracket. The rule will run as non-atomic, whether
it is called from an atomic rule or not.

```
```pest
fstring = @{ "\"" ~ ... }
expr = !{ ... }
```
@@ -383,7 +391,7 @@ rather than *the same pattern*.

For example,

```
```pest
same_text = {
PUSH( "a" | "b" | "c" )
~ POP
@@ -411,7 +419,7 @@ const raw_str: &str = r###"
When parsing a raw string, we have to keep track of how many number signs `#`
occurred before the quotation mark. We can do this using the stack:

```
```pest
raw_string = {
"r" ~ PUSH("#"*) ~ "\"" // push the number signs onto the stack
~ raw_string_interior
2 changes: 1 addition & 1 deletion src/intro.md
@@ -13,7 +13,7 @@ abstractions, `pest` parsers can be **very fast**.
Here is the complete grammar for a simple calculator [developed in a (currently
unwritten) later chapter](examples/calculator.html):

```
```pest
num = @{ int ~ ("." ~ ASCII_DIGIT*)? ~ (^"e" ~ int)? }
int = { ("+" | "-")? ~ ASCII_DIGIT+ }