Principle: information accumulation #875

Merged
merged 13 commits into from
Mar 16, 2022
394 changes: 394 additions & 0 deletions proposals/p0875.md
@@ -0,0 +1,394 @@
# Principle: information accumulation

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/875)

<!-- toc -->

## Table of contents

- [Problem](#problem)
- [Background](#background)
    - [Single-pass "splat" compilation](#single-pass-splat-compilation)
    - [Global consistency](#global-consistency)
    - [The C++ compromise](#the-c-compromise)
    - [Separate declarations and definitions](#separate-declarations-and-definitions)
- [Proposal](#proposal)
- [Details](#details)
    - [Goals](#goals)
- [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
- [Alternatives considered](#alternatives-considered)
    - [Strict top-down information flow](#strict-top-down-information-flow)
    - [Strict global consistency](#strict-global-consistency)
    - [Top-down with minimally deferred type checking](#top-down-with-minimally-deferred-type-checking)
    - [Consistent classes, top-down for everything else](#consistent-classes-top-down-for-everything-else)
    - [Context-sensitive local consistency](#context-sensitive-local-consistency)

<!-- tocstop -->

## Problem

We should have consistent rules describing what information about a program is
visible where.

## Background

Information in a source file is provided incrementally, with each source
utterance providing a small piece of the overall picture. Different languages
have different rules for which information is available where.

### Single-pass "splat" compilation

In C and other languages of a similar age, single-pass compilation was highly
desirable, due to resource limits and performance concerns. In these languages:

- Information is accumulated top-down, and can only be used lexically after it
appears.
- Most information can be discarded soon after it is provided: function bodies
don't need to be kept around once they've been converted to the output
format, and no information on local variable or parameter names needs to
persist past the end of the variable's scope. However, the types of globals
and the contents of type definitions must be retained.
- The behavior of an entity can be different at different places in the same
source file. An early use may fail if it depends on information that's
provided later, and in some cases a later use may fail when an earlier use
succeeded because the use is invalid in a way that was not visible at the
point of an earlier use.
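
The last point can be seen in C++, which inherits this top-down rule outside of class bodies. The following is a minimal sketch with illustrative function names; the same call expression resolves differently depending on where it appears in the file:

```cpp
// Overload resolution only sees declarations that appear before the call.
int Pick(long) { return 1; }

// Only Pick(long) is visible here, so the int argument is converted to long.
int EarlyUse() { return Pick(0); }

// A better match for an int argument, but declared after EarlyUse.
int Pick(int) { return 2; }

// Both overloads are visible here, and Pick(int) is an exact match.
int LateUse() { return Pick(0); }
```

Here `EarlyUse` calls `Pick(long)` while `LateUse` calls `Pick(int)`, even though the two bodies are textually identical.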

### Global consistency

In more modern languages such as C#, Rust, Java, and Swift, there is no lexical
information ordering. In these languages:

- Information is effectively accumulated and processed in separate passes.
- The language design and implementation ensure that the behavior of an entity
is the same everywhere: before its definition, after its definition,
within its definition, and in any other source file in which it was made
visible.
- Dependency cycles between program properties are carefully avoided by the
language designers.

### The C++ compromise

In C++, a hybrid approach is taken. There is a C-like lexical information
ordering rule, but this rule is subverted within classes by -- effectively --
reordering certain parts of a class that appear within the class definition so
that they are processed after the class definition. This primarily applies to
the bodies of member functions. Here:

- Information is mostly accumulated top-down, and is accumulated fully
top-down after the reordering step.
- The behavior of a class is the same within member function bodies that are
defined inside the class as it is within member function bodies defined
lexically after the class.
- The language designers need to ensure that the bounds of the member function
bodies and similar constructs can be determined without parsing them, so
that the late-parsed portions can be separated from the early-parsed
portions. In C++, this was not done successfully, and there are constructs
for which this determination is very hard or impossible.
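
The reordering can be illustrated with a small C++ sketch (the type and member names are illustrative). The member function body uses a field that is declared later in the class, which is accepted because the body is processed as if it appeared after the class definition:

```cpp
struct Counter {
  // The body refers to `count`, which is declared further down. This is valid
  // because member function bodies are processed after the whole class.
  // The signature itself is still processed top-down.
  int Next() { return ++count; }

  int count = 0;
};

// A use after the class behaves the same way as the in-class body.
int FirstValue() {
  Counter c;
  return c.Next();
}
```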

### Separate declarations and definitions

Somewhat separate from the direction of information flow is the ability to
separate the information about an entity into multiple distinct regions of
source files. In C and C++, entities can be separately declared and defined. As
a consequence, these languages need rules to determine whether two declarations
declare the same entity.

In C++, especially for templated declarations, these rules can be incredibly
complex, and even now, more than 30 years after the introduction of templates in
C++, [basic questions are not fully answered](https://wg21.link/cwg2), and
implementations disagree about which declarations declare the same entity in
fairly simple examples.
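
Even without templates, the matching rules have subtleties. The following C++ sketch (with illustrative function names) shows two pairs of declarations that look similar, where one pair declares a single function and the other declares two overloads:

```cpp
// These two declarations declare the same function: a top-level `const` on a
// parameter is not part of the function's type.
int Increment(int x);
int Increment(const int x);

// These two declare different functions (overloads): here `const` applies to
// the pointee, so the parameter types differ.
int Read(int* p);
int Read(const int* p);

// Defines the single Increment function declared twice above.
int Increment(const int x) { return x + 1; }

// Each Read overload needs its own definition.
int Read(int* p) { return *p; }
int Read(const int* p) { return -*p; }
```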

One key benefit of this separation is in reduction of _physical dependencies_:
in order to validate a usage of an entity, we need only see a source file
containing a declaration of that entity, and need never consider the source file
containing its definition. This both reduces the number of steps required for an
incremental rebuild and reduces the input information and processing required
for each individual step.
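
As a sketch of this in C++, the client code below is checked using only declarations. The `Widget` type and its functions are hypothetical, and the three regions are shown in one block only for brevity; in practice they would typically live in a header, a client source file, and an implementation source file:

```cpp
// --- What a client needs to see (for example, from a header) ---
struct Widget;  // Declared, but not defined here.
Widget* MakeWidget(int value);
int GetValue(const Widget* w);
void DestroyWidget(Widget* w);

// --- Client code: validated using only the declarations above ---
int RoundTrip(int value) {
  Widget* w = MakeWidget(value);
  int result = GetValue(w);
  DestroyWidget(w);
  return result;
}

// --- Implementation: could live in a separate source file ---
struct Widget {
  int value;
};
Widget* MakeWidget(int value) { return new Widget{value}; }
int GetValue(const Widget* w) { return w->value; }
void DestroyWidget(Widget* w) { delete w; }
```

Changing the layout of `Widget` would not require rechecking `RoundTrip`, because the client never sees the definition.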

The ability to break physical dependencies is limited to the cases where
information can actually be hidden from the users of the entity. For example, if
the user actually needs a function body, either because they will evaluate a
call to the function during compilation or because they will inline it prior to
linking, it cannot be physically isolated from the user of that information. As
a consequence, in C++, a programmer must carefully manage which information they
put in the source files that are exposed to client code and which information is
kept separate.

Another key benefit is that the exported interface of a source file can become
more readable, by presenting an interface that contains only the facts that are
salient to a user and not the implementation details.

## Proposal

TODO: Decide between the alternatives listed below.

## Details

TODO: Fully explain the details of the proposed solution.

### Goals

For this proposal, we have the following goals as refinements of the overall
Carbon goals:

- _Comprehensibility._ Our rules should be understandable, and should minimize
surprise and gotchas. Our behavior should be self-consistent, and
explainable in only a few sentences.
- _Ergonomics._ It should be easy to express common developer desires, without
a lot of boilerplate or repetitive code.
- _Readability._ Code written using our rules should be as straightforward as
possible for Carbon developers to read and reason about.
- _Efficient and simple compilation._ It should be relatively straightforward
to implement our semantic rules. Implementation heroics shouldn't be
required, and the number of special cases required should be minimized.
- _Diagnosability._ An implementation should be able to explain coding errors
in ways that are easy to understand and are well-correlated with the error
and its remedy. Diagnostics should appear in an order and style that guides
the developer through logical steps to fix their mistakes.
- _Toolability._ Relatively simple tools should be able to understand simpler
properties of Carbon code. It should ideally be possible to identify which
names can be used in a particular context and what those names mean without
full processing. It should ideally be possible to gather useful and mostly
complete information about a potentially-invalid source file that is
currently being edited, for which it may be desirable to assume there is a
"hole" in the source file at the cursor position that will be filled by
unknown code.

## Rationale based on Carbon's goals

- [Language tools and ecosystem](/docs/project/goals.md#language-tools-and-ecosystem)
    - See "Toolability" goal.
- [Software and language evolution](/docs/project/goals.md#software-and-language-evolution)
    - TODO: Order-independence improves the ability to evolve code on a small
      scale.
- [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
    - See "Readability", "Ergonomics", and "Comprehensibility" goals.
- [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development)
    - See "Efficient and simple compilation" and "Diagnosability" goals.

## Alternatives considered

Below, various alternatives are presented and rated according to the
[goals](#goals) for this proposal.

### Strict top-down information flow

Carbon could accumulate information top-down. We could require that each program
utterance is type-checked and fully validated before any later code is
considered.

In order to support this and still permit cyclic references between entities, we
would need to permit separate declaration and definition.
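
C++ shows what this looks like in practice. In this minimal sketch (function names are illustrative), two mutually recursive functions can only be written under a top-down rule because one of them is declared before it is defined:

```cpp
// Forward declaration: breaks the cycle between IsEven and IsOdd.
bool IsOdd(unsigned n);

// Uses IsOdd, which has been declared but not yet defined.
bool IsEven(unsigned n) { return n == 0 ? true : IsOdd(n - 1); }

// Definition matching the forward declaration above.
bool IsOdd(unsigned n) { return n == 0 ? false : IsEven(n - 1); }
```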

_Comprehensibility:_ This rule is simple to explain, and has no special cases.

Review comment: This rule is only simple to explain if declaration versus definition is simple to explain, which, as noted above for C++, is not true.

Contributor Author reply: I think this is really a problem with separate declaration and definition rather than with the top-down rule. I've tried to separate the two out and added a discussion of this there.

However, the inability to look at information from later in the source file is
likely to result in gotchas:

```
class Base {
  var n: i32;
}
class Derived extends Base {
  // Returns me.(Base.n), not me.(Derived.n), because the latter has not
  // been declared yet.
  fn Get[me: Self]() -> i32 { return me.n; }
  var n: i32;
}
```

It might be possible to require a diagnostic in such cases, when we find a
declaration that would change the meaning of prior code if it had appeared
earlier, but that would result in implementation complexity, and the fact that
such cases are rejected would still be a surprise.
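
For comparison, a C++ rendering of the same classes behaves the other way, because C++ defers member function bodies until the end of the class. This is a sketch of existing C++ behavior, not of any proposed Carbon semantics, and the initial values are added only to make the result observable:

```cpp
struct Base {
  int n = 1;
};

struct Derived : Base {
  // In C++ the body is processed after the whole class, so `n` finds
  // Derived::n even though it is declared below this function.
  int Get() const { return n; }
  int n = 2;
};

int GetFromDerived() {
  Derived d;
  return d.Get();
}
```

A developer used to the C++ result would likely expect `Derived.n` in the Carbon example above, which is the source of the gotcha.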

_Ergonomics:_ The developer is required to topologically sort their source files
in dependency order, manually breaking cycles with forward declarations. Common
refactoring tasks such as reorganizing code may require effort or tooling
assistance in order to preserve a topological order.

In practice, we would expect developers to react to this ruleset by beginning
each source file with a collection of forward declarations. This mitigates the
need to produce a topological ordering, except within those forward declarations
themselves, and other declarations required to provide those forward
declarations. For example, a forward declaration of a class member will likely
only be possible within a class definition, and the order in which that class
definition is given can be relevant to the validity of other class definitions.

_Readability:_ Developers wishing to understand code have the advantage that
they need only consider prior code, and there is no possibility that a later
source utterance could change the meaning of the code they're reading. However,
it is rare to read code top-down, so the effect of this advantage may be modest.

This advantage leads to a significant disadvantage: the behavior of an entity
can be different at different places within a source file. For example, a type
can be incomplete in one place and complete in another, or can fail to implement
an interface when inspected early and then found to implement that interface
later. This can lead to very subtle incoherent behavior.

In practice, the topological ordering constraint tends to lead to good locality
of information: helpers for particular functionality are often located near to
the functionality. However, this is not a universal advantage, and the
topological constraint sometimes leads to internal helpers being ordered
immediately before their first use instead of in a more logical position near
correlated functionality.

The ability and tacit encouragement to start a source file with a list of
forward declarations of entities in that file -- or, for an API file, in its
corresponding implementation file -- will improve readability compared to an
approach in which that style is not possible or would not be used in practice.

_Efficient and simple compilation:_ This rule is mostly simple and efficient to
implement, and even allows single-pass processing of source files. It supports
and is likely to encourage physical separation of implementation from interface,
potentially leading to build time wins through reduced recompilation.

However, the requirement to support separate declaration and definition has the
potential to lead to substantial implementation complexity, as it does in C++,
as it imposes the requirement to determine whether two declarations declare the
same entity or different entities -- especially in the context of overloaded
function templates.

_Diagnosability:_ Because information is provided top-down, diagnostics can also
be provided top-down and in every case the diagnostic will be caused by an error
at the given location or earlier. Fixing errors should require little
backtracking by the developer.

However, an implementation that strictly confines its processing to top-down
order and produces diagnostics eagerly cannot deliver diagnostics that react
intelligently to contextual cues that appear after the point of the diagnostic.
This approach diminishes the ability for an implementation to pinpoint the cause
of the error and describe it in a developer-oriented fashion.

_Toolability:_ Limiting information flow to top-down means that tools such as
code completion tools need only consider context prior to the cursor, and they
can be confident that if all the code prior to the cursor is valid, it can
be type-checked and suitable completions offered.

However, in the case where the user wants to refer to a later-declared entity,
such tools would not be able to use this strategy. They would need to parse as
if there were not a top-down rule in order to find such later-declared entities,
and would likely additionally need the ability to add forward declarations or to
reorder declarations in order to satisfy the ordering requirement.

### Strict global consistency

Carbon could follow an approach of requiring the behavior of every entity to be
globally consistent. In this approach, the behavior of every entity would be as
if the entire program could be consulted to determine facts about that entity.

In practice, to make this work, we would need to limit where those facts can be
declared. For example, we limit implementations of interfaces to appear only in
source files that must already be in use wherever the question "does this type
implement that interface?" can be asked.

In addition, we need to reject at least the case where some property of the
program recursively depends upon itself:

```
struct X {
  var a: Array(sizeof(X), i8);
}
```

In order to give globally consistent semantics to, for example, a package name,
we would likely need to process all source files comprising a package at the
same time.

This alternative can be considered either with or without the ability to
separate declarations from definitions.

_Comprehensibility:_ This rule is simple to explain, and has no special cases.
The disallowance of semantic cycles is likely to be unsurprising as it is a
logical necessity in any rule.

Applying this rule to local name lookup in block scope does result in some
surprises. For example, C# uses this approach, and combined with its
disallowance of shadowing of local variables, this
[confuses some developers](https://stackoverflow.com/questions/1196941/variable-scope-confusion-in-c-sharp).

_Ergonomics:_ The developer can organize or arrange their code in any way they
desire. There is never a need to forward-declare or repeat an interface
declaration. Refactoring and code reorganization do not require any non-obvious
changes, because the same code means the same thing regardless of how it is
located relative to other code.

_Readability:_ Reasoning about code is simple in this model, as such reasoning
is largely not context-sensitive. Instead of asking "what does this do
here?" we can instead consider "what does this do?". Some context sensitivity
may remain, for example due to access and name bindings differing in different
contexts.

However, to developers accustomed to a top-down semantic model, the ability to
defer giving key information about an entity -- or even declaring it at all --
until long after it is first used may hinder readability in some circumstances,
particularly when reading code top-down.

_Efficient and simple compilation:_ This model forces the compilation process to
operate in multiple stages rather than as a single pass.

Some form of cycle detection is necessary if cycles are possible. However, such
a mechanism is likely to be necessary for template instantiation too, so this is
likely not a novel requirement for Carbon.
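
Such a check can be sketched as a depth-first search over a dependency graph. The following C++ sketch is illustrative only: the `Graph` representation, the property names used as keys, and the function names are all assumptions, not part of any Carbon implementation.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical dependency graph: each program property maps to the list of
// properties that must be computed before it.
using Graph = std::map<std::string, std::vector<std::string>>;

enum class State { Unvisited, InProgress, Done };

// Returns true if a cycle is reachable from `node`.
bool HasCycleFrom(const Graph& graph, const std::string& node,
                  std::map<std::string, State>& state) {
  // operator[] value-initializes missing entries to State::Unvisited.
  // References into a std::map stay valid across later insertions.
  State& s = state[node];
  if (s == State::InProgress) return true;  // Reached a property we are still computing.
  if (s == State::Done) return false;       // Already known to be cycle-free.
  s = State::InProgress;
  auto it = graph.find(node);
  if (it != graph.end()) {
    for (const std::string& dep : it->second) {
      if (HasCycleFrom(graph, dep, state)) return true;
    }
  }
  s = State::Done;
  return false;
}

// Returns true if any property in the graph depends on itself.
bool HasCycle(const Graph& graph) {
  std::map<std::string, State> state;
  for (const auto& entry : graph) {
    if (HasCycleFrom(graph, entry.first, state)) return true;
  }
  return false;
}
```

Under this model, the `struct X` example above corresponds to a graph in which the size of `X` depends on its layout, the layout depends on the type of `a`, and the type of `a` depends on the size of `X`.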

Forcing all files within a package to be compiled together in order to provide
consistent semantics for the package name may place an undesirable scalability
barrier on the build system.

_Diagnosability:_ The implementation is likely to have more contextual
information when providing diagnostics, improving their quality. However, the
diagnostics may appear in a confusing order: if an early declaration needs
information from a later declaration in order to type-check, diagnostics
associated with that later declaration may be produced first, or may be
interleaved with diagnostics for the earlier declaration, leading to the
programmer potentially revisiting the same code multiple times during a
compile-edit cycle.

_Toolability:_ This model requires tools to consider the whole file as context,
because code may refer to entities that are only introduced later. For an IDE
scenario, where the cursor represents a location where an arbitrary chunk of
code may be missing, this presents a challenge of determining how to
resynchronize the input in order to determine how to interpret the portion of
the source file following the cursor.

Sophisticated tooling for a top-down model may wish to inspect the trailing
portion of the file anyway, in order to provide a better developer experience,
but this complexity would be forced upon tools with this model.

### Top-down with minimally deferred type checking

We could follow a top-down approach generally, but defer type-checking each
top-level entity until we reach the end of that entity. For example, we would
check an entire class as a single unit, following the same principles as in the
globally-consistent rule, but using only information provided prior to the end
of the class definition. This would allow class members to use other members
that have not yet been declared, while not permitting a function definition
preceding the class definition to use such members.

### Consistent classes, top-down for everything else

We could provide a globally-consistent rule for some entities and a top-down
rule for others. Following C++'s lead, we could provide a top-down rule for
packages, namespaces, and within functions, but provide a globally-consistent
rule for classes.

### Context-sensitive local consistency

We could use different behaviors in different contexts, as follows:

- For contexts that are fundamentally ordered, such as function bodies, a
top-down rule is used.
- For contexts that are defined across multiple source files, such as packages
and namespaces, we guarantee consistent behavior within each source file,
but the behavior may be inconsistent across source files: different source
files may see different sets of names within a package or namespace,
depending on what they have imported.
- For contexts that are defined within a single source file, such as a class
or an interface, we guarantee globally consistent behavior.