carbon-language · zygoloid · Mar 16, 2022 · Oct 8, 2021 · Oct 9, 2021 · Oct 16, 2021
diff --git a/proposals/p0875.md b/proposals/p0875.md
@@ -0,0 +1,394 @@
+# Principle: information accumulation
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/875)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Problem](#problem)
+-   [Background](#background)
+    -   [Single-pass "splat" compilation](#single-pass-splat-compilation)
+    -   [Global consistency](#global-consistency)
+    -   [The C++ compromise](#the-c-compromise)
+    -   [Separate declarations and definitions](#separate-declarations-and-definitions)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Goals](#goals)
+-   [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
+-   [Alternatives considered](#alternatives-considered)
+    -   [Strict top-down information flow](#strict-top-down-information-flow)
+    -   [Strict global consistency](#strict-global-consistency)
+    -   [Top-down with minimally deferred type checking](#top-down-with-minimally-deferred-type-checking)
+    -   [Consistent classes, top-down for everything else](#consistent-classes-top-down-for-everything-else)
+    -   [Context-sensitive local consistency](#context-sensitive-local-consistency)
+
+<!-- tocstop -->
+
+## Problem
+
+We should have consistent rules describing what information about a program is
+visible where.
+
+## Background
+
+Information in a source file is provided incrementally, with each source
+utterance providing a small piece of the overall picture. Different languages
+have different rules for which information is available where.
+
+### Single-pass "splat" compilation
+
+In C and other languages of a similar age, single-pass compilation was highly
+desirable, due to resource limits and performance concerns. In these languages:
+
+-   Information is accumulated top-down, and can only be used lexically after it
+    appears.
+-   Most information can be discarded soon after it is provided: function bodies
+    don't need to be kept around once they've been converted to the output
+    format, and no information on local variable or parameter names needs to
+    persist past the end of the variable's scope. However, the types of globals
+    and the contents of type definitions must be retained.
+-   The behavior of an entity can be different at different places in the same
+    source file. An early use may fail if it depends on information that's
+    provided later, and in some cases a later use may fail when an earlier use
+    succeeded because the use is invalid in a way that was not visible at the
+    point of the earlier use.
+
+### Global consistency
+
+In more modern languages such as C#, Rust, Java, and Swift, there is no lexical
+information ordering. In these languages:
+
+-   Information is effectively accumulated and processed in separate passes.
+-   The language design and implementation ensure that the behavior of an entity
+    is the same everywhere: before its definition, after its definition, within
+    its definition, and in any other source file in which it was made visible.
+-   Dependency cycles between program properties are carefully avoided by the
+    language designers.
+
+### The C++ compromise
+
+In C++, a hybrid approach is taken. There is a C-like lexical information
+ordering rule, but this rule is subverted within classes by -- effectively --
+reordering certain parts of a class that appear within the class definition so
+that they are processed after the class definition. This primarily applies to
+the bodies of member functions. Here:
+
+-   Information is mostly accumulated top-down, and is accumulated fully
+    top-down after the reordering step.
+-   The behavior of a class is the same within member function bodies that are
+    defined inside the class as it is within member function bodies defined
+    lexically after the class.
+-   The language designers need to ensure that the bounds of the member function
+    bodies and similar constructs can be determined without parsing them, so
+    that the late-parsed portions can be separated from the early-parsed
+    portions. In C++, this was not done successfully, and there are constructs
+    for which this determination is very hard or impossible.
+
+### Separate declarations and definitions
+
+Somewhat separate from the direction of information flow is the ability to
+separate the information about an entity into multiple distinct regions of
+source files. In C and C++, entities can be separately declared and defined. As
+a consequence, these languages need rules to determine whether two declarations
+declare the same entity.
+
+In C++, especially for templated declarations, these rules can be incredibly
+complex, and even now, more than 30 years after the introduction of templates in
+C++, [basic questions are not fully answered](https://wg21.link/cwg2), and
+implementations disagree about which declarations declare the same entity in
+fairly simple examples.
+
+One key benefit of this separation is in reduction of _physical dependencies_:
+in order to validate a usage of an entity, we need only see a source file
+containing a declaration of that entity, and need never consider the source file
+containing its definition. This both reduces the number of steps required for an
+incremental rebuild and reduces the input information and processing required
+for each individual step.
+
+The ability to break physical dependencies is limited to the cases where
+information can actually be hidden from the users of the entity. For example, if
+the user actually needs a function body, either because they will evaluate a
+call to the function during compilation or because they will inline it prior to
+linking, it cannot be physically isolated from the user of that information. As
+a consequence, in C++, a programmer must carefully manage which information they
+put in the source files that are exposed to client code and which information is
+kept separate.
+
+Another key benefit is that the exported interface of a source file can become
+more readable, by presenting an interface that contains only the facts that are
+salient to a user and not the implementation details.
+
+## Proposal
+
+TODO: Decide between the alternatives listed below.
+
+## Details
+
+TODO: Fully explain the details of the proposed solution.
+
+### Goals
+
+For this proposal, we have the following goals as refinements of the overall
+Carbon goals:
+
+-   _Comprehensibility._ Our rules should be understandable, and should minimize
+    surprise and gotchas. Our behavior should be self-consistent, and
+    explainable in only a few sentences.
+-   _Ergonomics._ It should be easy to express common developer desires, without
+    a lot of boilerplate or repetitive code.
+-   _Readability._ Code written using our rules should be as straightforward as
+    possible for Carbon developers to read and reason about.
+-   _Efficient and simple compilation._ It should be relatively straightforward
+    to implement our semantic rules. Implementation heroics shouldn't be
+    required, and the number of special cases required should be minimized.
+-   _Diagnosability._ An implementation should be able to explain coding errors
+    in ways that are easy to understand and are well-correlated with the error
+    and its remedy. Diagnostics should appear in an order and style that guides
+    the developer through logical steps to fix their mistakes.
+-   _Toolability._ Relatively simple tools should be able to understand simpler
+    properties of Carbon code. It should ideally be possible to identify which
+    names can be used in a particular context and what those names mean without
+    full processing. It should ideally be possible to gather useful and mostly
+    complete information about a potentially-invalid source file that is
+    currently being edited, for which it may be desirable to assume there is a
+    "hole" in the source file at the cursor position that will be filled by
+    unknown code.
+
+## Rationale based on Carbon's goals
+
+-   [Language tools and ecosystem](/docs/project/goals.md#language-tools-and-ecosystem)
+    -   See "Toolability" goal.
+-   [Software and language evolution](/docs/project/goals.md#software-and-language-evolution)
+    -   TODO: Order-independence improves the ability to evolve code on a small
+        scale.
+-   [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
+    -   See "Readability", "Ergonomics", and "Comprehensibility" goals.
+-   [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development)
+    -   See "Efficient and simple compilation" and "Diagnosability" goals.
+
+## Alternatives considered
+
+Below, various alternatives are presented and rated according to the
+[goals](#goals) for this proposal.
+
+### Strict top-down information flow
+
+Carbon could accumulate information top-down. We could require that each program
+utterance is type-checked and fully validated before any later code is
+considered.
+
+In order to support this and still permit cyclic references between entities, we
+would need to permit separate declaration and definition.
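+
+As a minimal sketch of what that could look like, assuming a forward
+declaration is written as a function signature terminated by `;` (an assumption
+here, not something this proposal specifies), two mutually recursive functions
+with the hypothetical names `IsEven` and `IsOdd` would need one forward
+declaration to break the cycle:
+
+```
+// Hypothetical example. The forward declaration syntax is an assumption.
+fn IsOdd(n: i32) -> bool;
+
+fn IsEven(n: i32) -> bool {
+  if (n == 0) { return true; }
+  // Accepted: IsOdd was declared above, even though its body comes later.
+  return IsOdd(n - 1);
+}
+
+fn IsOdd(n: i32) -> bool {
+  if (n == 0) { return false; }
+  return IsEven(n - 1);
+}
+```
+
+Without the first line, the call to `IsOdd` in `IsEven` would be rejected under
+this rule, because no information about `IsOdd` would have been provided yet.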
+
+_Comprehensibility:_ This rule is simple to explain, and has no special cases.
+However, the inability to look at information from later in the source file is
+likely to result in gotchas:
+
+```
+class Base {
+  var n: i32;
+}
+class Derived extends Base {
+  // Returns me.(Base.n), not me.(Derived.n), because the latter has not
+  // been declared yet.
+  fn Get[me: Self]() -> i32 { return me.n; }
+  var n: i32;
+}
+```
+
+It might be possible to require a diagnostic in such cases, when we find a
+declaration that would change the meaning of prior code if it had appeared
+earlier, but that would result in implementation complexity, and the fact that
+such cases are rejected would still be a surprise.
+
+_Ergonomics:_ The developer is required to topologically sort their source files
+in dependency order, manually breaking cycles with forward declarations. Common
+refactoring tasks such as reorganizing code may require effort or tooling
+assistance in order to preserve a topological order.
+
+In practice, we would expect developers to react to this ruleset by beginning
+each source file with a collection of forward declarations. This mitigates the
+need to produce a topological ordering, except within those forward declarations
+themselves, and other declarations required to provide those forward
+declarations. For example, a forward declaration of a class member will likely
+only be possible within a class definition, and the order in which that class
+definition is given can be relevant to the validity of other class definitions.
+
+_Readability:_ Developers wishing to understand code have the advantage that
+they need only consider prior code, and there is no possibility that a later
+source utterance could change the meaning of the code they're reading. However,
+it is rare to read code top-down, so the effect of this advantage may be modest.
+
+This advantage leads to a significant disadvantage: the behavior of an entity
+can be different at different places within a source file. For example, a type
+can be incomplete in one place and complete in another, or can fail to implement
+an interface when inspected early and then found to implement that interface
+later. This can lead to very subtle incoherent behavior.
+
+In practice, the topological ordering constraint tends to lead to good locality
+of information: helpers for particular functionality are often located near to
+the functionality. However, this is not a universal advantage, and the
+topological constraint sometimes leads to internal helpers being ordered
+immediately before their first use instead of in a more logical position near
+correlated functionality.
+
+The ability and tacit encouragement to start a source file with a list of
+forward declarations of entities in that file -- or, for an API file, in its
+corresponding implementation file -- will improve readability compared to an
+approach in which that style is not possible or would not be used in practice.
+
+_Efficient and simple compilation:_ This rule is mostly simple and efficient to
+implement, and even allows single-pass processing of source files. It supports
+and is likely to encourage physical separation of implementation from interface,
+potentially leading to build time wins through reduced recompilation.
+
+However, the requirement to support separate declaration and definition has the
+potential to lead to substantial implementation complexity, as it does in C++,
+as it imposes the requirement to determine whether two declarations declare the
+same entity or different entities -- especially in the context of overloaded
+function templates.
+
+_Diagnosability:_ Because information is provided top-down, diagnostics can also
+be provided top-down and in every case the diagnostic will be caused by an error
+at the given location or earlier. Fixing errors should require little
+backtracking by the developer.
+
+However, an implementation that strictly confines its processing to top-down
+order and produces diagnostics eagerly cannot deliver diagnostics that react
+intelligently to contextual cues that appear after the point of the diagnostic.
+This approach diminishes the ability for an implementation to pinpoint the cause
+of the error and describe it in a developer-oriented fashion.
+
+_Toolability:_ Limiting information flow to top-down means that tools such as
+code completion tools need only consider context prior to the cursor, and they
+can be confident that if all the code prior to the cursor is valid, it can be
+type-checked and suitable completions offered.
+
+However, in the case where the user wants to refer to a later-declared entity,
+such tools would not be able to use this strategy. They would need to parse as
+if there were not a top-down rule in order to find such later-declared entities,
+and would likely additionally need the ability to add forward declarations or to
+reorder declarations in order to satisfy the ordering requirement.
+
+### Strict global consistency
+
+Carbon could follow an approach of requiring the behavior of every entity to be
+globally consistent. In this approach, the behavior of every entity would be as
+if the entire program could be consulted to determine facts about that entity.
+
+In practice, to make this work, we would need to limit where those facts can be
+declared. For example, we could limit implementations of interfaces to appear
+only in source files that must already be in use wherever the question "does
+this type implement that interface?" can be asked.
+
+In addition, we would need to reject at least the case where some property of
+the program recursively depends upon itself:
+
+```
+struct X {
+  var a: Array(sizeof(X), i8);
+}
+```
+
+In order to give globally consistent semantics to, for example, a package name,
+we would likely need to process all source files comprising a package at the
+same time.
+
+This alternative can be considered either with or without the ability to
+separate declarations from definitions.
+
+_Comprehensibility:_ This rule is simple to explain, and has no special cases.
+The disallowance of semantic cycles is likely to be unsurprising as it is a
+logical necessity in any rule.
+
+Applying this rule to local name lookup in block scope does result in some
+surprises. For example, C# uses this approach, and combined with its
+disallowance of shadowing of local variables, this
+[confuses some developers](https://stackoverflow.com/questions/1196941/variable-scope-confusion-in-c-sharp).
+
+_Ergonomics:_ The developer can organize or arrange their code in any way they
+desire. There is never a need to forward-declare or repeat an interface
+declaration. Refactoring and code reorganization do not require any non-obvious
+changes, because the same code means the same thing regardless of how it is
+located relative to other code.
+
+_Readability:_ Reasoning about code is simple in this model, as such reasoning
+is largely not context-sensitive. Instead of asking "what does this do here?",
+we can ask "what does this do?". Some context sensitivity may remain, for
+example due to access and name bindings differing in different contexts.
+
+However, to developers accustomed to a top-down semantic model, the ability to
+defer giving key information about an entity -- or even declaring it at all --
+until long after it is first used may hinder readability in some circumstances,
+particularly when reading code top-down.
+
+_Efficient and simple compilation:_ This model forces the compilation process to
+operate in multiple stages rather than as a single pass.
+
+Some form of cycle detection is necessary if cycles are possible. However, such
+a mechanism is likely to be necessary for template instantiation too, so this is
+likely not a novel requirement for Carbon.
+
+Forcing all files within a package to be compiled together in order to provide
+consistent semantics for the package name may place an undesirable scalability
+barrier on the build system.
+
+_Diagnosability:_ The implementation is likely to have more contextual
+information when providing diagnostics, improving their quality. However, the
+diagnostics may appear in a confusing order: if an early declaration needs
+information from a later declaration in order to type-check, diagnostics
+associated with that later declaration may be produced first, or may be
+interleaved with diagnostics for the earlier declaration, leading to the
+programmer potentially revisiting the same code multiple times during a
+compile-edit cycle.
+
+_Toolability:_ This model requires tools to consider the whole file as context,
+because code may refer to entities that are only introduced later. For an IDE
+scenario, where the cursor represents a location where an arbitrary chunk of
+code may be missing, this presents a challenge of determining how to
+resynchronize the input in order to determine how to interpret the portion of
+the source file following the cursor.
+
+Sophisticated tooling for a top-down model may wish to inspect the trailing
+portion of the file anyway, in order to provide a better developer experience,
+but this complexity would be forced upon tools with this model.
+
+### Top-down with minimally deferred type checking
+
+We could follow a top-down approach generally, but defer type-checking each
+top-level entity until we reach the end of that entity. For example, we would
+check an entire class as a single unit, following the same principles as in the
+globally-consistent rule, but using only information provided prior to the end
+of the class definition. This would allow class members to use other members
+that have not yet been declared, while not permitting a function definition
+preceding the class definition to use such members.
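+
+A rough sketch of how this could play out, reusing the syntax of the examples
+above. The names `Early`, `Counter`, and `Late` are hypothetical, and the
+accept/reject comments reflect one reading of this alternative rather than a
+settled rule:
+
+```
+fn Early(c: Counter) -> i32 {
+  // Rejected: neither Counter nor its member count has been provided by the
+  // time the end of this function is reached.
+  return c.count;
+}
+class Counter {
+  // Accepted: count is declared later in the same class, and the class is
+  // checked as a single unit once its end is reached.
+  fn Get[me: Self]() -> i32 { return me.count; }
+  var count: i32;
+}
+fn Late(c: Counter) -> i32 {
+  // Accepted: the whole class definition precedes this function.
+  return c.count;
+}
+```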
+
+### Consistent classes, top-down for everything else
+
+We could provide a globally-consistent rule for some entities and a top-down
+rule for others. Following C++'s lead, we could provide a top-down rule for
+packages, namespaces, and within functions, but provide a globally-consistent
+rule for classes.
+
+### Context-sensitive local consistency
+
+We could use different behaviors in different contexts, as follows:
+
+-   For contexts that are fundamentally ordered, such as function bodies, a
+    top-down rule is used.
+-   For contexts that are defined across multiple source files, such as packages
+    and namespaces, we guarantee consistent behavior within each source file,
+    but the behavior may be inconsistent across source files: different source
+    files may see different sets of names within a package or namespace,
+    depending on what they have imported.
+-   For contexts that are defined within a single source file, such as a class
+    or an interface, we guarantee globally consistent behavior.
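+
+The three contexts might combine as in the following sketch. It assumes that
+all of the code is in a single source file, that consistency within a source
+file implies order independence at file scope, and it uses the hypothetical
+names `F`, `G`, and `C`:
+
+```
+// File scope: consistent within this source file, so F can call G even
+// though G is declared later in the file.
+fn F() -> i32 { return G(); }
+fn G() -> i32 {
+  // Function body: top-down, so x can only be used after this declaration.
+  var x: i32 = 1;
+  return x;
+}
+class C {
+  // Class scope: globally consistent, so n is usable before its declaration.
+  fn Get[me: Self]() -> i32 { return me.n; }
+  var n: i32;
+}
+```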