appendix_a_hand_rolled_parser.html

<h1>Hand Rolled Parser<small>&bull; Appendix A</small></h1>

<h2 id='hand_rolling'>Hand Rolling</h2> <hr/>

<div class='pull-right alert alert-warning' style="margin: 15px; text-align: center;">
  <img src="/static/img/steamroller.png" alt="steamroller" class="img-responsive" width="278px" height="308px"/>
  <p><small>Steam Roller &bull; We're gonna be rolling <em>machines</em>.</small></p>
</div>

<p>In this book we used <code>mpc</code> to do all the parsing, but in this appendix we're going to replace <code>mpc</code> with our own hand rolled parser. The choice to use <code>mpc</code> has by far been the main source of complaints by readers of this book, and before we dive into the details of hand-rolling I want to talk about some of the reasons I decided on using a parsing library in the first place.</p>

<p>The main reason was that when I was learning about programming languages, I found the theory of formal languages fascinating. I really enjoyed getting into the mind set of thinking about languages more abstractly. I think this is a fun thing to teach, as it opens up a lot of new avenues of exploration.</p>

<p>Another reason is that it gives new programmers a chance to learn how to use a library. It gets them comfortable early on with weird interfaces, and other peoples' code.</p>

<p>And finally, perhaps most importantly, using a library for the parsing allowed me to delay the topics of <em>memory management</em> and <em>pointers</em> for as long as possible - by which point readers should be much more comfortable with the C constructs they'd encountered so far.</p>

<p>But that doesn't really matter - writing a parser by hand is a great thing to do, and although one with all the bells and whistles can be a complicated thing to get right, because our Lisp is so simple, for us it wont be too difficult.</p>

<div class="alert alert-warning">
  <p><strong>Can I do this if I've not finished the book?</strong></p>
  
  <p>If you've not completed all the chapters of this book it is probably not a good idea to attempt this appendix. It may be possible to complete this appendix if you're already past <a href="/chapter9_s_expressions">Chapter 9 &bull; S-Expressions</a>, but if you're not completed this chapter already it might not make much sense. Sorry!</p>
</div>


<h2 id='replacing_mpc'>Replacing mpc</h2>

<p>The output of <code>mpc</code> was an <em>Abstract Syntax Tree</em>, but because our language is so simple, when we replace it we're going to go directly from the input string to an S-Expressions (which is pretty much like abstract syntax tree anyway). So let us declare a function which takes some string <code>s</code>, some pointer to a position in that string <code>i</code> and some terminal character <code>end</code> and which outputs an <code>lval</code>. We'll call it <code>lval_read_expr</code>.</p>

<pre><code data-language='c'>lval* lval_read_expr(char* s, int* i, char end);
</code></pre>

<p>We'll define the implementation later. For now lets go ahead and replace the places we call <code>mpc</code> with this new function. For error handling we'll get <code>lval_read_expr</code> to return an error <code>lval</code> if something goes wrong. The terminal character when we're reading from a C string is just the null character <code>\0</code>.</p>

<pre><code data-language='c'>char* input = readline("lispy&gt; ");
/* Read from input to create an S-Expr */
int pos = 0;
lval* expr = lval_read_expr(input, &pos, '\0');

/* Evaluate and print input */
lval* x = lval_eval(e, expr);
</code></pre>

<p>We also need to replace <code>mpc</code> in our function <code>builtin_load</code>. Here, because we only support strings in our read function, we first need to load in the contents of the file. To do this we use <code>fseek</code> and <code>ftell</code> to find out how many characters are in the file before allocating a string of that size (plus one) and reading the file into it.</p>

<pre><code data-language='c'>lval* builtin_load(lenv* e, lval* a) {
  LASSERT_NUM("load", a, 1);
  LASSERT_TYPE("load", a, 0, LVAL_STR);
  
  /* Open file and check it exists */
  FILE* f = fopen(a-&gt;cell[0]-&gt;str, "rb");
  if (f == NULL) {
    lval* err = lval_err("Could not load Library %s", a-&gt;cell[0]-&gt;str);
    lval_del(a);
    return err;
  }
  
  /* Read File Contents */
  fseek(f, 0, SEEK_END);
  long length = ftell(f);
  fseek(f, 0, SEEK_SET);
  char* input = calloc(length+1, 1);
  fread(input, 1, length, f);
  fclose(f);
</code></pre>

<p>We can then easily read from this string as we did in <code>main</code>.</p>

<pre><code data-language='c'>  /* Read from input to create an S-Expr */
  int pos = 0;
  lval* expr = lval_read_expr(input, &pos, '\0');
  free(input);
  
  /* Evaluate all expressions contained in S-Expr */
  if (expr-&gt;type != LVAL_ERR) {
    while (expr-&gt;count) {
      lval* x = lval_eval(e, lval_pop(expr, 0));
      if (x-&gt;type == LVAL_ERR) { lval_println(x); }
      lval_del(x);
    }
  } else {
    lval_println(expr);
  }
  
  lval_del(expr);    
  lval_del(a);
  
  return lval_sexpr();
}
</code></pre>

<p>And with those calls replaced we can start defining <code>lval_read_expr</code>.</p>


<h2 id='a_character_at_a_time'>A Character at a Time</h2> <hr/>

<div class='pull-left alert alert-warning' style="margin: 15px; text-align: center;">
  <img src="/static/img/reading_with_pipe.png" alt="reading with pipe" class="img-responsive" width="278px" height="308px"/>
  <p><small>Reading &bull; A Jolly Good Time (tm).</small></p>
</div>

<p>The way we think about implementing a parser is quite different to the high level abstract view we were given with <code>mpc</code>. Instead of thinking about the <em>language</em> we instead need to think about the <em>process</em>.</p>

<p>Usually this <em>process</em> takes a very simple form - a parser is almost always just a loop, which repeatedly reads a character at a time from the input, and each time decides what to do with it. The challenge is in making this process elegant. It all starts to get a little messy when we think about whitespace, and comments, and everything else.</p>

<p>To give an idea of how it might work - in our Lisp, if we encounter the character <code>d</code> in the input, we can store it in some string, and also we know we must be reading in a symbol, so can enter a state where we look for more letters, each time adding them to the string. Once we're found no more letters in the input we can return the whole thing as a symbol (for example <code>def</code>) and start again.</p>

<p>The function <code>lval_read_expr</code> is basically going to work like this. We're going to take as input some string, some position in that string, and decide what to do next. When the next character isn't the one specified by the argument <code>end</code> we will try to read in whatever thing appears next, create an <code>lval</code> object from it, and append it to the first argument <code>v</code>.<p>

<p>If instead we reach the character specified by <code>end</code> we're going to return the next position in the string and return to the caller. This return value will help whoever calls <code>lval_read_expr</code> to see how much of the string it has consumed and how much is left.</p>

<p>For now let us assume the next character isn't the <code>end</code> character. The first thing we need to check is that we've not reached the end of the input. If we've reached the end of the input without encountering the <code>end</code> character then we can throw a syntax error and jump to the end of the input to ensure no more is consumed.</p>

<pre><code data-language='c'>int lval_read_expr(lval* v, char* s, int i, char end) {
  
  while (s[i] != end) {
    
    /* If we reach end of input then there is some syntax error */
    if (s[i] == '\0') {
      lval_add(v, lval_err("Missing %c at end of input", end));
      return strlen(s)+1;
    }
</code></pre>

<p>After this we can check if the next character is whitespace. Any whitespace characters we can just skip over as our language is not whitespace sensitive.</p>

<pre><code data-language='c'>    /* Skip all whitespace */
    if (strchr(" \t\v\r\n", s[i])) {
      i++;
      continue;
    }
</code></pre>

<p>Another easy case is if the next character is a semi-colon <code>;</code>. If it is a semi-colon we are starting a comment and we can ignore the rest of the characters until we reach a new line.</p>

<pre><code data-language='c'>    /* If next char is ; then read comment */
    if (s[i] == ';') {
      while (s[i] != '\n' &amp;&amp; s[i] != '\0') { i++; }
      i++;
      continue;
    }
</code></pre>

<p>If the next character is an open parenthesis <code>(</code> or a curly bracket <code>{</code> we need to parse either an S-Expression or a Q-Expression. For these we can use <code>lval_read_expr</code> again and just supply it with a different character to end on and a different expression to write the results to.</p>

<pre><code data-language='c'>    /* If next character is ( then read S-Expr */
    if (s[i] == '(') {
      lval* x = lval_sexpr();
      lval_add(v, x);
      i = lval_read_expr(x, s, i+1, ')');
      continue;
    }
    
    /* If next character is { then read Q-Expr */
    if (s[i] == '{') {
      lval* x = lval_qexpr();
      lval_add(v, x);
      i = lval_read_expr(x, s, i+1, '}');
      continue;
    }
</code></pre>

<p>Those are all the easy cases done. Now we need to decide what to do if we encounter some letter or number. In this case we need to parse all of the numbers or letters we can until we reach something that isn't. For simplicity we're going to treat numbers like a special case of symbols and if we encounter any of these we're going to call the function <code>lval_read_sym</code>, which we'll define later.</p>

<pre><code data-language='c'>    /* If next character is part of a symbol then read symbol */
    if (strchr(
      "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "0123456789_+-*\\/=&lt;&gt;!&amp;", s[i])) {
      i = lval_read_sym(v, s, i);
      continue;
    }
</code></pre>

<p>We also have to deal with strings. If we reach a <code>"</code> character we're going to have to consume everything we encounter up until the next unescaped <code>"</code>. For this we can call a function <code>lval_read_str</code>, which we'll define later.</p>

<pre><code data-language='c'>    /* If next character is " then read string */
    if (strchr("\"", s[i])) {
      i = lval_read_str(v, s, i+1);
      continue;
    }
</code></pre>

<p>Finally if we somehow encounter something else we better throw an error and skip to the end of the input, and as mentioned before, if we do actually match our <code>end</code> character and the while loop ends, we just need to return the updated position in the input.</p>

<pre><code data-language='c'>    /* Encountered some unknown character */
    lval_add(v, lval_err("Unknown Character %c", s[i]));
    return strlen(s)+1;
  }
  
  return i+1;
}
</code></pre>

<p>That completes the body of our function <code>lval_read_expr</code>. Now we just need to fill in the gaps.</p>

<h2 id="reading_symbols">Reading Symbols</h2> <hr/>

<p>Reading in symbols is fairly straight forward. We start by allocating some 
empty string, and then, for each character in the input which is valid as part 
of a symbol we append this character to the string.</p>

<pre><code data-language="c">int lval_read_sym(lval* v, char* s, int i) {
  
  /* Allocate Empty String */
  char* part = calloc(1,1);
  
  /* While valid identifier characters */
  while (strchr(
      "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "0123456789_+-*\\/=&lt;&gt;!&amp;", s[i]) &amp;&amp; s[i] != '\0') {
    
    /* Append character to end of string */
    part = realloc(part, strlen(part)+2);
    part[strlen(part)+1] = '\0';
    part[strlen(part)+0] = s[i];
    i++;
  }
</code></pre>

<p>You'll notice that we're also reading in numbers with this code. So the next 
step is to check if this symbol is actually a number. To do this we just check 
if all of the characters are digits.</p>

<pre><code data-language='c'>  /* Check if Identifier looks like number */
  int is_num = strchr("-0123456789", part[0]);
  for (int i = 1; i &lt; strlen(part); i++) {
    if (!strchr("0123456789", part[i])) { is_num = 0; break; }
  }
</code></pre>

<p>Finally, if the symbol is a number we convert it with similar code to the
previous code we used when we were reading from the <code>mpc</code> AST, or otherwise we just use it as a normal symbol. Then we <code>free</code> the 
temporary string we allocated and return the new position in the input.<p>

<pre><code data-language='c'>  /* Add Symbol or Number as lval */
  if (is_num) {
    errno = 0;
    long x = strtol(part, NULL, 10);
    lval_add(v, errno != ERANGE ? lval_num(x) : lval_err("Invalid Number %s", part));
  } else {
    lval_add(v, lval_sym(part));
  }
  
  /* Free temp string */
  free(part);
  
  /* Return updated position in input */
  return i;
}
</code></pre>

<p>And we are done with reading symbols. Onto strings!</p>

<h2 id="reading_strings">Reading Strings</h2> <hr/>

<p>When we read in strings things get a little more complicated. This is because of <em>escaping</em>. We've covered this a little in earlier chapters but now we're going to have to deal with it head on.</p>

<p>Escaping is the name we give to the way we let users enter special characters by using special combinations of symbols starting with a backslash <code>\</code>. For example the most famous of these is the newline character which we encode using <code>\n</code>. When converting from the multi-character representation to the real representation we call this <em>unescaping</em> and when converting from the real representation to the two character encoding we call this <em>escaping</em>.</p>

<p>When we read in user strings we'll need to convert these pairs of characters into the single special character they represent. If we ignore the leading backslash we can make a function in C that tells us this mapping.</p>

<pre><code data-language='c'>/* Function to unescape characters */
char lval_str_unescape(char x) {
  switch (x) {
    case 'a':  return '\a';
    case 'b':  return '\b';
    case 'f':  return '\f';
    case 'n':  return '\n';
    case 'r':  return '\r';
    case 't':  return '\t';
    case 'v':  return '\v';
    case '\\': return '\\';
    case '\'': return '\'';
    case '\"': return '\"';
  }
  return '\0';
}
</code></pre>

<p>It is also going to be useful to list all the possible unescapable characters so we can check if we have one.</p>

<pre><code data-language='c'>/* Possible unescapable characters */
char* lval_str_unescapable = "abfnrtv\\\'\"";
</code></pre>

<p>We can write similar functions to do the conversion in the other direction.</p>

<pre><code data-language='c'>/* List of possible escapable characters */
char* lval_str_escapable = "\a\b\f\n\r\t\v\\\'\"";
</code></pre>

<pre><code data-language='c'>/* Function to escape characters */
char* lval_str_escape(char x) {
  switch (x) {
    case '\a': return "\\a";
    case '\b': return "\\b";
    case '\f': return "\\f";
    case '\n': return "\\n";
    case '\r': return "\\r";
    case '\t': return "\\t";
    case '\v': return "\\v";
    case '\\': return "\\\\";
    case '\'': return "\\\'";
    case '\"': return "\\\"";
  }
  return "";
}
</code></pre>

<p>With these we can begin to write our functions for reading strings. First we allocate a temporary string and while we're not reading the terminal <code>"</code> character we're going to process the incoming characters.</p>

<pre><code data-language='c'>int lval_read_str(lval* v, char* s, int i) {
  
  /* Allocate empty string */
  char* part = calloc(1,1);
  
  while (s[i] != '"') {
    
    char c = s[i];
</code></pre>

<p>First we need to check for the end of the input - if we're reached this then there must be some string input which doesn't terminate. In this case we <code>free</code> the temporary string we allocated, and return some error.</p>

<pre><code data-language='c'>    /* If end of input then there is an unterminated string literal */
    if (c == '\0') {
      lval_add(v, lval_err("Unexpected end of input at string literal"));
      free(part);
      return strlen(s);
    }
</code></pre>

<p>We then check if the next character is a backslash. If we have a backslash then we need to escape the next character after it. Given the previous functions we're already defined this is easy. If it is unescapable then we unescape it - otherwise we throw some error.</p> 

<pre><code data-language='c'>    /* If backslash then unescape character after it */
    if (c == '\\') {
      i++;
      /* Check next character is escapable */
      if (strchr(lval_str_unescapable, s[i])) {        
        c = lval_str_unescape(s[i]);
      } else {
        lval_add(v, lval_err("Invalid escape character %c", c));
        free(part);
        return strlen(s);
      }
    }
</pre></code>

<p>Given either the escaped character, or the normal character which is part of the string we simply add it to our temporary string, and once we are done consuming characters we convert this into an <code>lval</code> and add it to the function argument <code>v</code>, <code>free</code> the temporary string we allocated, and return.</p>

<pre><code data-language='c'>
    /* Append character to string */
    part = realloc(part, strlen(part)+2);
    part[strlen(part)+1] = '\0';
    part[strlen(part)+0] = c;
    i++;    
  }
  
  /* Add lval and free temp string */
  lval_add(v, lval_str(part));
  
  free(part);
  
  return i+1;
}
</code></pre>

<h2 id="printing_strings">Printing Strings</h2> <hr/>

<p>As well as using <code>mpc</code> to <em>unescape</em> strings we also used <code>mpc</code> to <em>escape</em> strings when we printed them out.</p>

<p>So if we're going to replace all uses of <code>mpc</code> we better do it here too. Given our functions we already defined for escaping and unescaping characters this wont be too difficult. We'll just loop over each character in the string and if it is esapable then we'll escape it, otherwise we'll print it out as normal.<p>

<pre><code data-language='c'>void lval_print_str(lval* v) {
  putchar('"');
  /* Loop over the characters in the string */
  for (int i = 0; i &lt; strlen(v-&gt;str); i++) {
    if (strchr(lval_str_escapable, v-&gt;str[i])) {
      /* If the character is escapable then escape it */
      printf("%s", lval_str_escape(v-&gt;str[i]));
    } else {
      /* Otherwise print character as it is */
      putchar(v-&gt;str[i]);
    }
  }
  putchar('"');
}
</code></pre>

<h2 id="cleaning_up">Cleaning Up</h2> <hr/>

<p>With that done we can finally remove our include for <code>mpc</code>. We'll need to replace it with includes for some of the libraries <code>mpc</code> included for us. The top of our file should now look like this.</p>

<pre><code data-language='c'>#include &lt;string.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdarg.h&gt;
#include &lt;errno.h&gt;</code></pre>

<p>And we'll also have to remove the forward declarations of our parsers, so please delete the following too.</p>

<pre><code data-language='c'>/* Parser Declariations */

mpc_parser_t* Number; 
mpc_parser_t* Symbol; 
mpc_parser_t* String; 
mpc_parser_t* Comment;
mpc_parser_t* Sexpr;  
mpc_parser_t* Qexpr;  
mpc_parser_t* Expr; 
mpc_parser_t* Lispy;
</code></pre>

<p>And with that we're really done! Enjoy your new hand rolled parser.</p>


<h2>Reference</h2> <hr/>

<references />

<h2>Bonus Marks</h2> <hr/>
  
<div class="alert alert-warning">
  <ul class="list-group">
    <li class="list-group-item">&rsaquo; Make syntax errors give the line and column where the error occured.</li>
  </ul>
</div>


<h2>Navigation</h2>

<table class="table" style='table-layout: fixed;'>
  <tr>
    <td class="text-center"><a href="contents"><h4>&bull; Contents &bull;</h4></a></td>
  </tr>
</table>