diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index b72e5e420352a..ca0f051e8fc75 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -54,6 +54,16 @@ repos: - '' - '// ' - '' + - --custom_format + - '\.l$' + - '/*' + - '' + - '*/' + - --custom_format + - '\.y$' + - '/*' + - '' + - '*/' exclude: | (?x)^( .bazelversion| @@ -61,6 +71,7 @@ repos: third_party/examples/.*/compile_flags.carbon.txt| website/(firebase/.firebaserc|jekyll/(Gemfile.lock|theme/.*))| .*\.def| + .*\.svg| .*/testdata/.*\.golden )$ - id: check-google-doc-style diff --git a/proposals/README.md b/proposals/README.md index 3c6fd6583968d..b8a348b6147e6 100644 --- a/proposals/README.md +++ b/proposals/README.md @@ -56,5 +56,6 @@ request: - [0444 - GitHub Discussions](p0444.md) - [0447 - Generics terminology](p0447.md) - [0538 - `return` with no argument](p0538.md) +- [0555 - Operator precedence](p0555.md) diff --git a/proposals/p0555.md b/proposals/p0555.md new file mode 100644 index 0000000000000..d94b85e094464 --- /dev/null +++ b/proposals/p0555.md @@ -0,0 +1,308 @@ +# Operator precedence + + + +[Pull request](https://github.com/carbon-language/carbon-lang/pull/555) + + + +## Table of contents + +- [Problem](#problem) +- [Background](#background) +- [Proposal](#proposal) +- [Details](#details) + - [Notational convention](#notational-convention) + - [When to add precedence edges](#when-to-add-precedence-edges) + - [Parsing with a partial precedence order](#parsing-with-a-partial-precedence-order) +- [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals) +- [Alternatives considered](#alternatives-considered) + - [Total order](#total-order) + - [Different precedence for different operands](#different-precedence-for-different-operands) + - [Require less than a partial order](#require-less-than-a-partial-order) + + + +## Problem + +Most expression-oriented languages use a strict hierarchy of precedence levels. 
+That approach is error-prone, as it assigns meaning to programs that developers +may either not understand or may misunderstand. + +## Background + +Given an expression, we need to be able to infer its structure: what are the +operands of each of the operators? This may be ambiguous in the absence of rules +that determine which operator is preferred, such as in the expression +`a $ b ^ c`: is this `(a $ b) ^ c` or `a $ (b ^ c)`? + +Starting with a sequence of operators and non-operator terms, we can completely +determine the structure of an expression by determining which operator in our +sequence will be the root of the parse tree, splitting the expression at that +point, and recursively determining the structure of each subexpression. The +operator that forms the root of the parse tree is said to have the lowest +precedence in the expression. + +Traditionally, this is accomplished by assigning a precedence level to each +operator and devising a total ordering over precedence levels. For example, we +could assign a higher precedence level to an infix `*` operator than an infix +`+` operator. With that choice of precedence levels, an infix `*` operator would +bind tighter than an infix `+` operator, regardless of the order in which they +appear. + +This approach is well-understood, but is problematic. For example, in C++, +expressions such as `a & b << c * 3` are valid, but the meaning of such an +expression is unlikely to be readily apparent to many developers. Worse, for +cases such as `a & 3 == 3`, there is a clear intended meaning, namely +`(a & 3) == 3`, but the actual meaning is something else -- in this case, +`a & (3 == 3)`. + +Because the precedence rules are not widely known and are sometimes quite +surprising, parentheses are used as a matter of course for certain kinds of C++ +expressions. However, the absence of such parentheses is not diagnosed in all +cases, even by many linting tools, and forgetting those parentheses can lead to +subtle bugs. 
+ +## Proposal + +Do not have a total ordering of precedence levels. Instead, define a partial +ordering of precedence levels. Expressions using operators that lack relative +orderings must be disambiguated by the developer, for example by adding +parentheses; when a program's meaning depends on an undefined relative ordering +of two operators, it will be rejected due to ambiguity. + +The default behavior for any new operator is for it to be unordered with respect +to all other operators, thereby requiring parentheses when combining that +operator with any other operator. Precedence rules should be added only if it is +reasonable to expect most or all developers who regularly use Carbon to reliably +remember the precedence rule. + +## Details + +### Notational convention + +For pedagogical purposes, our documentation will use +[Hasse diagrams](https://en.wikipedia.org/wiki/Hasse_diagram) to represent +operator precedence partial orders, where operators with lower precedence are +considered less than (and therefore depicted lower than and connected to) +operators with higher precedence. In our diagrams, an enclosing arrow will be +used to show associativity within precedence groups, if there is any, with a +left-to-right arrow meaning a left-associative operator. + +For example: + +
+![Example operator precedence diagram](p0555/example.svg)
+ +... would depict a higher-precedence `*` operator and a lower-precedence `+` +operator, both of which are left-associative, and a non-associative `<<` +operator. The `==` operator is lower precedence than all of those operators, and +parentheses are higher precedence than all of those operators. + +With those precedence rules: + +- `a + b * c` would be parsed as `a + (b * c)`, because `+` has lower + precedence than `*`. +- `a + b << c` would be an error, requiring parentheses, because the + precedence levels of `+` and `<<` are unordered. + +A [python script](p0555/figures.py) to generate these diagrams is included with +this proposal. + +### When to add precedence edges + +Given a program whose meaning is ambiguous to a reader, it is preferable to +reject the program rather than to arbitrarily pick a meaning. For Carbon's +operators, we should only add an ordering between two operators if there is a +logical reason for that ordering, not merely to provide _some_ answer. **Goal: +for every combination of operators, either it should be reasonable to expect +most or all developers who regularly use Carbon to reliably remember the +precedence, or there should not be a precedence rule.** + +As an example, consider the expression `a * b ^ c`, where `*` is assumed to be a +multiplication operator and `^` is assumed to be a bitwise XOR operation. We +should reject this expression because there is no logical reason to perform +either operator first and it would be unreasonable to expect Carbon developers +to remember an arbitrary tie-breaker between the two options. + +This still leaves open the question of how high a bar of knowledge we put on our +developers (what is reasonable for us to expect?). 
We can use experience from +C++ to direct this decision: just as many developers who regularly use C++ do +not remember the relative precedence of `&&` vs `||`, and `&` vs `|`, and `&` vs +`<<`, and so on, we shouldn't expect them to remember similar precedence rules +in Carbon. If we are in doubt, omitting a precedence rule and waiting for +real-world experience should be preferred. + +### Parsing with a partial precedence order + +A traditional, totally-ordered precedence scheme can be implemented by an +[operator precedence parser](https://en.wikipedia.org/wiki/Operator-precedence_parser): + +- Keep track of the current left-hand-side operand and an ambient precedence + level. The ambient precedence level is the precedence of the operator whose + operand is being parsed, or a placeholder "lowest" precedence level when + parsing an expression that is not the operand of an operator. +- When a new operator is encountered, its precedence is compared to the + ambient precedence level: + - If its precedence is higher than the ambient precedence level, then + recurse ("shift") with that as the new ambient precedence level to form + the right-hand side of the new operator. After forming the right-hand + side, build an operator expression from the current left-hand side + operand and the right-hand side operand; that is the new current + left-hand side. + - If its precedence is equal to the ambient precedence level, then use the + associativity of that precedence level to determine what to do: + - If the operator is left-associative, build an operator expression. + - If the operator is right-associative, recurse. + - If the operator is non-associative, produce an error. + - If its precedence is lower than the ambient precedence level, return the + expression formed so far; it's the complete operand to an earlier + operator. 
+
+This is, for example, the strategy
+[currently used in Clang](https://github.com/llvm/llvm-project/blob/5f0903e9bec97e67bf34d887bcbe9d05790de934/clang/lib/Parse/ParseExpr.cpp#L396).
+
+The above algorithm is only suited to parsing in the case where precedence
+levels are totally ordered, because it does not say what to do if the new
+precedence is not comparable with the ambient precedence. However, the algorithm
+can easily be adapted to also parse with a partial precedence order by adding
+one more case:
+
+- If the precedence level of the new operator is not comparable with the
+  ambient precedence level, produce an ambiguity error.
+
+The key observation here is that, if we ever see `... a * b ^ c ...`, where `*`
+and `^` have incomparable precedence, no later tokens can ever resolve the
+ambiguity, so we can diagnose it immediately. Sketch proof: If there were a
+valid parse tree for this expression, one of `*` and `^` must end up as an
+ancestor of the other. But in a valid parse tree, along the path from one
+operator to the other, precedences monotonically increase, so by transitivity of
+the precedence partial ordering, the ancestor operator has lower precedence than
+the descendant operator, which contradicts the assumption that `*` and `^` are
+incomparable.
+
+An operator precedence parser with a partial ordering of precedence levels
+[has been implemented](https://github.com/carbon-language/carbon-lang/commit/b8afadb3c6af5e68192d585232fee759180ea1e3)
+as a proof-of-concept in the Carbon toolchain.
+
+Operator precedence partial ordering can also be implemented in yacc / bison
+parser generators by using a variant of the
+[precedence climbing method](https://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method).
+For example, here is a yacc grammar for the Hasse diagram shown above: + +``` +expression: compare_expression | compare_operand; + +compare_expression: compare_lhs EQEQ compare_operand { $$ = ($1 == $3); }; +compare_lhs: compare_expression | compare_operand; +compare_operand: add_expression | multiply_expression | shift_expression | primary_expression; + +add_expression: add_lhs '+' add_operand { $$ = ($1 + $3); }; +add_lhs: add_expression | add_operand; +add_operand: multiply_expression | multiply_operand; + +multiply_expression: multiply_lhs '*' multiply_operand { $$ = ($1 * $3); }; +multiply_lhs: multiply_expression | multiply_operand; +multiply_operand: primary_expression; + +shift_expression: shift_lhs LSH shift_operand { $$ = ($1 << $3); }; +shift_lhs: shift_expression | shift_operand; +shift_operand: primary_expression; + +primary_expression: INT | '(' expression ')' { $$ = $2; }; +``` + +Note that some care must be taken to avoid grammar ambiguities. Under the +precedence climbing method, a `primary_expression` would be a +`shift_expression`, a `multiply_expression`, and an `add_expression`, and +therefore interpreting a `primary_expression` as an `expression` would be +ambiguous: we could take either the `shift_expression` path or the +`multiply_expression` path through the grammar. The above formulation avoids +this ambiguity by excluding `primary_expression` from `add_expression` and +`shift_expression`, and instead listing it as a distinct production for +`compare_operand`. A yacc grammar such as the above can be produced +systematically for any precedence partial ordering. + +A complete example of a yacc parser with operator precedence partial ordering is +available [alongside this proposal](p0555/yacc-parser). 
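As a rough illustration of the adapted operator precedence algorithm described earlier in this section, here is a minimal Python sketch for the operators in the example diagram. It is not the Carbon toolchain's proof-of-concept and not the yacc approach. The names `EDGES`, `ASSOC`, `higher`, `Parser`, and `parse` are hypothetical, invented for this sketch, and each precedence group is assumed to contain exactly one operator.

```python
import re

# Direct edges of the example diagram as (higher, lower) precedence pairs.
# Hypothetical encoding, chosen only for this sketch.
EDGES = {("*", "+"), ("+", "=="), ("<<", "==")}
# Associativity of each operator; None marks a non-associative operator.
ASSOC = {"*": "left", "+": "left", "<<": None, "==": None}


def higher(a, b):
    """Return True if `a` binds tighter than `b`, following edges transitively."""
    return any(x == a and (y == b or higher(y, b)) for x, y in EDGES)


class Parser:
    def __init__(self, text):
        # Characters the pattern does not match are silently skipped.
        self.tokens = re.findall(r"<<|==|[*+()]|\w+", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def primary(self):
        tok = self.advance()
        if tok == "(":
            inner = self.expression(None)  # parentheses reset the ambient level
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok is None or tok == ")" or tok in ASSOC:
            raise SyntaxError("expected an operand")
        return tok

    def expression(self, ambient):
        lhs = self.primary()
        while self.peek() in ASSOC:
            op = self.peek()
            if ambient is not None:
                if higher(ambient, op):
                    break  # lower precedence: operand of an earlier operator
                if op == ambient:
                    if ASSOC[op] is None:
                        raise SyntaxError("%r is non-associative" % op)
                    if ASSOC[op] == "left":
                        break  # the caller builds the left-nested expression
                elif not higher(op, ambient):
                    # The extra case: incomparable precedence levels.
                    raise SyntaxError(
                        "ambiguous: %r and %r are unordered" % (ambient, op)
                    )
            self.advance()
            lhs = (op, lhs, self.expression(op))
        return lhs


def parse(text):
    parser = Parser(text)
    tree = parser.expression(None)
    if parser.peek() is not None:
        raise SyntaxError("unexpected token %r" % parser.peek())
    return tree
```

Under these assumptions, `parse("a + b * c")` returns `("+", "a", ("*", "b", "c"))`, while `parse("a + b << c")` raises `SyntaxError`. The ambiguity is reported as soon as `<<` is seen, because `+` and `<<` are unordered.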
+ +## Rationale based on Carbon's goals + +- Software and language evolution + + - The advice to not supply an operator precedence relationship if in doubt + is based on the idea that it's easier to add a precedence rule as an + evolutionary step than to remove one. + +- Code that is easy to read, understand, and write + + - This proposal aims to support this goal by ensuring that the operator + expressions that are used in programs are readily understood by + practitioners, by making unreadable constructs invalid. + +## Alternatives considered + +### Total order + +We could provide a total order over operator precedence. This proposal is not +strictly in conflict with doing so, if every ordering relationship is justified, +but in practice we expect there to be pairs of operators for which there is no +obvious precedence relationship. + +For: + +- This is established practice across most languages. + +Against: + +- This practice is a common source of bugs in the case where an arbitrary or + bad choice is made. + +### Different precedence for different operands + +We could provide different precedence relationships for the left and right sides +of infix operators. For example, we could allow multiplication on the left of a +`<<` operator but not on the right. This is precedented in C++: the `?` in a +`?:` allows a comma operator on its right but not on its left. + +For: + +- This may allow some additional cases that would be clear and unsurprising. + +Against: + +- The resulting rules would be more challenging to learn, and it seems likely + that they would fail the test that most developers who regularly use Carbon + know the rules. + +This proposal is not incompatible with adopting such a direction in future if we +find motivation to do so. + +### Require less than a partial order + +We could require something weaker than a partial ordering of precedence levels. 
+This proposal assumes the following three points are useful for human
+comprehension of operator precedence:
+
+- The lowest-precedence operator does not depend on the relative order of
+  operators in the expression (except as a tie-breaker when there are multiple
+  operators with the same precedence, where the associativity of the operator
+  is considered).
+- If an `^` expression can appear indirectly (but unparenthesized) within an
+  `$` expression, then an `^` expression can appear directly within an `$`
+  expression.
+- If the lowest-precedence operator in `a $ b ^ c` is `$`, and the
+  lowest-precedence operator in `b ^ c # d` is `^`, then the lowest-precedence
+  operator in `a $ b ^ c # d` is `$`.
+
+These assumptions lead to the conclusion that operator precedence should form a
+partial order over equivalence classes of operators. However, these assumptions
+could be wrong.
+
+If we find motivation to select rules that violate the above assumptions, we
+should reconsider the approach of using a partial precedence ordering, but no
+motivating case is currently known.
diff --git a/proposals/p0555/example.svg b/proposals/p0555/example.svg
new file mode 100644
index 0000000000000..0ac4abeaa6228
--- /dev/null
+++ b/proposals/p0555/example.svg
@@ -0,0 +1,118 @@
+[SVG markup not recoverable; nodes: op1 `(...)`, op2 `a * b`, op3 `a + b`, op4 `a << b`, op5 `a == b`; edges: op1->op2, op1->op4, op2->op3, op3->op5, op4->op5]
diff --git a/proposals/p0555/figures.py b/proposals/p0555/figures.py
new file mode 100755
index 0000000000000..7366029e75ce8
--- /dev/null
+++ b/proposals/p0555/figures.py
@@ -0,0 +1,138 @@
+#! /usr/bin/env python
+
+__copyright__ = """
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+"""
+
+fmt = "svg"
+
+
+def escape(s):
+    return (
+        s.replace("&", "&amp;")
+        .replace("<", "&lt;")
+        .replace(">", "&gt;")
+        .replace("[", "&#91;")
+        .replace("]", "&#93;")
+    )
+
+
+def tablejoin(items, separator):
+    data = ("%s" % separator).join(
+        "%s" % item for item in items
+    )
+    return '%s
' % data
+
+
+def code(s):
+    # FIXME: GraphViz's handling of font metrics appears to be pretty broken.
+    # Add a little extra width to each character with a non-code-font space to
+    # compensate.
+    codefont = "".join(
+        (
+            '%s '
+        )
+        % escape(part)
+        for part in s
+    )
+    return (
+        '
%s
'
+        % codefont
+    )
+
+
+def math(s):
+    # Render math in italics but otherwise unchanged.
+    return "<i>%s</i>" % s
+
+
+def raw(s):
+    return s
+
+
+LtR = ' shape="rarrow"'
+RtL = ' shape="larrow"'
+NonAssoc = ""
+
+out = None
+num = 0
+
+
+def group(ops, assoc=NonAssoc, style=code):
+    global num
+    num = num + 1
+    name = "op%d" % num
+    print(
+        " %s [label=<%s>%s]"
+        % (
+            name,
+            tablejoin((style(op) for op in ops), ", "),
+            assoc,
+        ),
+        file=out,
+    )
+    return name
+
+
+def edge(a, b):
+    print(" %s -> %s" % (a, b), file=out)
+
+
+def combine(name, items):
+    if len(items) <= 1:
+        return items
+    print(" %s [label=<%s> shape=ellipse]" % (name, name), file=out)
+    res = name
+    for i in items:
+        edge(i, name)
+    return [res]
+
+
+def graph(f):
+    import subprocess
+
+    outfile = open(f.__name__ + "." + fmt, "w")
+    process = subprocess.Popen(
+        ["dot", "-T" + fmt],
+        stdin=subprocess.PIPE,
+        stdout=outfile,
+        encoding="utf8"
+        # ["cat"], stdin=subprocess.PIPE, stdout=outfile, encoding='utf8'
+    )
+    global out
+    out = process.stdin
+    # print >>out, ' node [shape="rectangle" style="rounded" fontname="Arial"]'
+    print(
+        """
+digraph {
+  layout = dot
+  rankdir = TB
+  rank = "min"
+  node [shape="none" fontsize="12" height="0"
+        fontname="BlinkMacSystemFont,Segoe UI,Helvetica,Arial,sans-serif"]
+  edge [dir="none"]
+  """.strip(),
+        file=out,
+    )
+    f()
+    print("}", file=out)
+    process.communicate()
+    return f
+
+
+@graph
+def example():
+    term = group(["(...)"], NonAssoc)
+    mul = group(["a * b"], LtR)
+    add = group(["a + b"], LtR)
+    shl = group(["a << b"], NonAssoc)
+    compare = group(["a == b"], NonAssoc)
+
+    edge(term, mul)
+    edge(mul, add)
+    edge(term, shl)
+    edge(add, compare)
+    edge(shl, compare)
diff --git a/proposals/p0555/yacc-parser/Makefile b/proposals/p0555/yacc-parser/Makefile
new file mode 100644
index 0000000000000..ee58095ba0704
--- /dev/null
+++ b/proposals/p0555/yacc-parser/Makefile
@@ -0,0 +1,11 @@
+# Part of the Carbon Language project, under the Apache License v2.0 with
LLVM
+# Exceptions. See /LICENSE for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+example: example.l example.y
+	flex example.l
+	bison example.y --defines
+	clang example.tab.c lex.yy.c -o example
+
+clean:
+	rm -f example.tab.c example.tab.h lex.yy.c example
diff --git a/proposals/p0555/yacc-parser/example.l b/proposals/p0555/yacc-parser/example.l
new file mode 100644
index 0000000000000..92062a862db7f
--- /dev/null
+++ b/proposals/p0555/yacc-parser/example.l
@@ -0,0 +1,25 @@
+/*
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+*/
+
+%option noyywrap
+%{
+#include <stdlib.h>
+#define YY_DECL int yylex()
+#include "example.tab.h"
+%}
+
+%%
+
+[0-9]+ { yylval = atoi(yytext); return INT; }
+"*" { return '*'; }
+"+" { return '+'; }
+"<<" { return LSH; }
+"==" { return EQEQ; }
+"(" { return '('; }
+")" { return ')'; }
+";" { return ';'; }
+
+%%
diff --git a/proposals/p0555/yacc-parser/example.y b/proposals/p0555/yacc-parser/example.y
new file mode 100644
index 0000000000000..1344890c54e1a
--- /dev/null
+++ b/proposals/p0555/yacc-parser/example.y
@@ -0,0 +1,47 @@
+/*
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+*/
+
+%{
+#include <stdio.h>
+
+extern int yylex();
+extern int yyparse();
+extern FILE* yyin;
+
+void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
+int main() { while (yyparse()) {} }
+%}
+
+%define api.value.type {int}
+%token INT LSH EQEQ
+
+%%
+
+interpreter:
+  %empty
+| interpreter expression ';' { printf("%d\n", $2); }
+
+expression: compare_expression | compare_operand;
+
+compare_expression: compare_lhs EQEQ compare_operand { $$ = ($1 == $3); };
+compare_lhs: compare_expression | compare_operand;
+compare_operand: add_expression | multiply_expression | shift_expression | primary_expression;
+
+add_expression: add_lhs '+' add_operand { $$ = ($1 + $3); };
+add_lhs: add_expression | add_operand;
+add_operand: multiply_expression | multiply_operand;
+
+multiply_expression: multiply_lhs '*' multiply_operand { $$ = ($1 * $3); };
+multiply_lhs: multiply_expression | multiply_operand;
+multiply_operand: primary_expression;
+
+shift_expression: shift_lhs LSH shift_operand { $$ = ($1 << $3); };
+shift_lhs: shift_expression | shift_operand;
+shift_operand: primary_expression;
+
+primary_expression: INT | '(' expression ')' { $$ = $2; };
+
+%%