
Commit

Incorporate proposals #142 and #143 into the design
Add design text based on the contents of two proposals:

#142 Unicode source files
#143 Numeric literals
zygoloid authored Nov 4, 2020
1 parent 8e3b675 commit ee7a108
Showing 8 changed files with 889 additions and 44 deletions.
24 changes: 16 additions & 8 deletions docs/design/README.md
@@ -105,7 +105,8 @@ cleaned up during evolution.

### Code and comments

> References: [Lexical conventions](lexical_conventions.md)
> References: [Source files](code_and_name_organization/source_files.md) and
> [lexical conventions](lexical_conventions)
>
> **TODO:** References need to be evolved.
@@ -127,9 +128,16 @@ cleaned up during evolution.
live code
```

- Decimal, hexadecimal, and binary integer literals and decimal and
hexadecimal floating-point literals are supported, with `_` as a digit
separator. For example, `42`, `0b1011_1101`, and `0x1.EEFp+5`. Numeric
literals are case-sensitive: `0x`, `0b`, `e+`, and `p+` must be lowercase,
whereas hexadecimal digits must be uppercase. A digit is required on both
sides of a period.
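The casing and separator rules above can be sketched as a small validator. This is a hypothetical Python illustration of the described grammar, not part of any Carbon toolchain, and the optional exponent sign is an assumption:

```python
import re

# Illustrative check of the literal rules described above: lowercase
# `0x`/`0b`/`e`/`p`, uppercase hex digits, `_` as a digit separator,
# and a digit required on both sides of a period.
DEC = r"[0-9](?:[0-9_]*[0-9])?"
HEX = r"[0-9A-F](?:[0-9A-F_]*[0-9A-F])?"
BIN = r"[01](?:[01_]*[01])?"
PATTERNS = [
    re.compile(rf"{DEC}"),                            # decimal integer
    re.compile(rf"0x{HEX}"),                          # hexadecimal integer
    re.compile(rf"0b{BIN}"),                          # binary integer
    re.compile(rf"{DEC}\.{DEC}(?:e[+-]?[0-9]+)?"),    # decimal float
    re.compile(rf"0x{HEX}\.{HEX}(?:p[+-]?[0-9]+)?"),  # hexadecimal float
]

def is_numeric_literal(text: str) -> bool:
    return any(p.fullmatch(text) for p in PATTERNS)
```

Under this sketch, `42`, `0b1011_1101`, and `0x1.EEFp+5` are accepted, while `0X1F` (uppercase prefix), `0x1.ee` (lowercase hex digits), and `1.` (no digit after the period) are rejected.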

### Packages, libraries, and namespaces

> References: [Code and name organization](code_and_name_organization.md)
> References: [Code and name organization](code_and_name_organization)
- **Files** are grouped into libraries, which are in turn grouped into
packages.
@@ -161,16 +169,16 @@ fn Foo(var Geometry.Shapes.Flat.Circle: circle) { ... }

### Names and scopes

> References: [Lexical conventions](lexical_conventions.md)
> References: [Lexical conventions](lexical_conventions)
>
> **TODO:** References need to be evolved.
Various constructs introduce a named entity in Carbon. These can be functions,
types, variables, or other kinds of entities that we'll cover. A name in Carbon
is always formed out of an "identifier", or a sequence of letters, numbers, and
underscores which starts with a letter. As a regular expression, this would be
`/[a-zA-Z][a-zA-Z0-9_]*/`. Eventually we may add support for more unicode
characters as well.
is formed from a word, which is a sequence of letters, numbers, and underscores,
and which starts with a letter. We intend to follow Unicode's Annex 31 in
selecting valid identifier characters, but a concrete set of valid characters
has not been selected yet.
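Until the Annex 31 character set is settled, the provisional ASCII rule (the regular expression from the earlier wording) can be sketched as follows; `is_word` is a hypothetical helper, not a Carbon API:

```python
import re

# Provisional ASCII word rule from the design text: a letter followed
# by letters, digits, or underscores. The eventual Unicode identifier
# set is still undecided; this covers only the ASCII subset.
WORD_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def is_word(text: str) -> bool:
    return WORD_RE.fullmatch(text) is not None
```

For example, `Circle` and `x_1` qualify, while `1x` and `_tmp` do not, since a word must start with a letter.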

#### Naming conventions

@@ -240,7 +248,7 @@ file, including `Int` and `Bool`. These will likely be defined in a special

### Expressions

> References: [Lexical conventions](lexical_conventions.md) and
> References: [Lexical conventions](lexical_conventions) and
> [operators](operators.md)
>
> **TODO:** References need to be evolved.
@@ -112,8 +112,8 @@ Important Carbon goals for code and name organization are:

## Overview

Carbon files have a `.carbon` extension, such as `geometry.carbon`. These files
are the basic unit of compilation.
Carbon [source files](source_files.md) have a `.carbon` extension, such as
`geometry.carbon`. These files are the basic unit of compilation.

Each file begins with a declaration of which
_package_<sup><small>[[define](/docs/guides/glossary.md#package)]</small></sup>
@@ -228,9 +228,9 @@ Every source file will consist of, in order:
3. Source file body, with other code.

Comments and blank lines may be intermingled with these sections.
[Metaprogramming](metaprogramming.md) code may also be intermingled, so long as
the outputted code is consistent with the enforced ordering. Other types of code
must be in the source file body.
[Metaprogramming](/docs/design/metaprogramming.md) code may also be
intermingled, so long as the outputted code is consistent with the enforced
ordering. Other types of code must be in the source file body.

### Name paths

@@ -241,7 +241,7 @@ separated by dots. This syntax may be loosely expressed as a regular expression:
IDENTIFIER(\.IDENTIFIER)*
```
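The loose grammar above can be exercised with a short sketch, using the provisional ASCII identifier rule as a stand-in for IDENTIFIER (both the helper name and the identifier set are assumptions for illustration):

```python
import re

# Name paths: identifiers separated by dots, per the sketch above.
# IDENTIFIER here uses the provisional ASCII rule; the final identifier
# character set has not been decided.
IDENT = r"[a-zA-Z][a-zA-Z0-9_]*"
NAME_PATH_RE = re.compile(rf"{IDENT}(?:\.{IDENT})*")

def is_name_path(text: str) -> bool:
    return NAME_PATH_RE.fullmatch(text) is not None
```

This accepts `Geometry.Shapes.Flat.Circle` and a bare `Circle`, but rejects forms with empty components such as `Geometry..Shapes` or a leading or trailing dot.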

Name conflicts are addressed by [name lookup](name_lookup.md).
Name conflicts are addressed by [name lookup](/docs/design/name_lookup.md).

#### `package` syntax

@@ -467,7 +467,7 @@ An import declares a package entity named after the imported package, and makes
`api`-tagged entities from the imported library through it. The full name path
is a concatenation of the names of the package entity, any namespace entities
applied, and the final entity addressed. Child namespaces or entities may be
[aliased](aliases.md) if desired.
[aliased](/docs/design/aliases.md) if desired.

For example, given a library:

@@ -574,8 +574,8 @@ struct Shapes.Square { ... };

#### Aliasing

Carbon's [alias keyword](aliases.md) will support aliasing namespaces. For
example, this would be valid code:
Carbon's [alias keyword](/docs/design/aliases.md) will support aliasing
namespaces. For example, this would be valid code:

```carbon
namespace Timezones.Internal;
@@ -606,7 +606,7 @@ import, and that the `api` is infeasible to rename due to existing callers.
Alternately, the `api` entity may be using an idiomatic name that it would
contradict naming conventions to rename. In either case, this conflict may exist
in a single file without otherwise affecting users of the API. This will be
addressed by [name lookup](name_lookup.md).
addressed by [name lookup](/docs/design/name_lookup.md).

### Potential refactorings

@@ -904,7 +904,7 @@ Advantages:
Disadvantages:

- We are likely to want a more fine-grained, file-level approach proposed by
[name lookup](name_lookup.md).
[name lookup](/docs/design/name_lookup.md).
- Allows package owners to name their packages things that they rarely type,
but that importers end up typing frequently.
- The existence of a short `package` keyword shifts the balance for long
244 changes: 244 additions & 0 deletions docs/design/code_and_name_organization/source_files.md
@@ -0,0 +1,244 @@
# Source files

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

## Table of contents

<!-- toc -->

- [Overview](#overview)
- [Encoding](#encoding)
- [References](#references)
- [Alternatives](#alternatives)
- [Character encoding](#character-encoding)
- [Byte order marks](#byte-order-marks)
- [Normalization forms](#normalization-forms)

<!-- tocstop -->

## Overview

A Carbon _source file_ is a sequence of Unicode code points in Unicode
Normalization Form C ("NFC"), and represents a portion of the complete text of a
program.

Program text can come from a variety of sources, such as an interactive
programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
database, a memory buffer of an IDE, or a command-line argument.

The canonical representation for Carbon programs is in files stored as a
sequence of bytes in a file system on disk. Such files have a `.carbon`
extension.

## Encoding

The on-disk representation of a Carbon source file is encoded in UTF-8. Such
files may begin with an optional UTF-8 BOM, that is, the byte sequence
EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if present, is
ignored.

No Unicode normalization is performed when reading an on-disk representation of
a Carbon source file, so the byte representation is required to be normalized in
Normalization Form C. The Carbon source formatting tool will convert source
files to NFC as necessary.
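These decoding rules can be sketched in a few lines of Python; `read_carbon_source` is a hypothetical helper used only to illustrate the behavior described above, not part of any Carbon implementation:

```python
import codecs
import unicodedata

def read_carbon_source(raw: bytes) -> str:
    """Decode an on-disk source file per the rules above (sketch only).

    UTF-8 with an optional, ignored BOM; the decoded text must already
    be in NFC, since the reader performs no normalization itself.
    """
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]  # strip and ignore the BOM
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on bad UTF-8
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("source file is not in Normalization Form C")
    return text
```

For instance, a file containing a decomposed `o` + U+0304 (COMBINING MACRON) would be rejected, while the same text stored as the precomposed U+014D passes.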

## References

- [Unicode](https://www.unicode.org/versions/latest/) is a universal character
encoding, maintained by the
[Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is
the canonical encoding used for textual information interchange across all
modern technology.

Carbon is based on Unicode 13.0, which is currently the latest version of
the Unicode standard. Newer versions will be considered for adoption as they
are released.

- [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)

- [Wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)

## Alternatives

The choice to require NFC is really four choices:

1. Equivalence classes: we use a canonical normalization form rather than a
compatibility normalization form or no normalization form at all.

- If we use no normalization, invisibly-different ways of representing the
same glyph, such as with pre-combined diacritics versus with diacritics
expressed as separate combining characters, or with combining characters
in a different order, would be considered different characters.
- If we use a canonical normalization form, all ways of encoding diacritics
are considered to form the same character, but ligatures such as `ﬃ` are
considered distinct from the character sequence that they decompose into.
- If we use a compatibility normalization form, ligatures are considered
equivalent to the character sequence that they decompose into.

For a fixed-width font, a canonical normalization form is most likely to
consider characters to be the same if they look the same. Unicode annexes
[UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
and
[UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
both recommend the use of Normalization Form C for case-sensitive
identifiers in programming languages.

2. Composition: we use a composed normalization form rather than a decomposed
normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
in smaller representations whenever the two differ, but the decomposed form
is a little easier for algorithmic processing (for example, typo correction
and homoglyph detection).

3. We require source files to be in our chosen form, rather than converting to
that form as necessary.

4. We require that the entire contents of the file be normalized, rather than
restricting our attention to only identifiers, or only identifiers and string
literals.
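Choices 1 and 2 can be observed directly with Python's `unicodedata` module; this is a demonstration of the Unicode behavior discussed above, not of any Carbon tooling:

```python
import unicodedata

# Choice 1: canonical (NFC/NFD) vs compatibility (NFKC) equivalence.
# A ligature survives canonical normalization but decomposes under a
# compatibility form.
ligature = "\uFB03"  # LATIN SMALL LIGATURE FFI
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "ffi"

# Choice 2: composed vs decomposed canonical forms. U+014D and
# U+006F U+0304 are canonically equivalent; NFC picks the shorter,
# composed spelling.
composed = "\u014D"          # precomposed o with macron
decomposed = "\u006F\u0304"  # o + combining macron
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
assert len(composed) < len(decomposed)
```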

### Character encoding

**We could restrict programs to ASCII.**

Advantages:

- Reduced implementation complexity.
- Avoids all problems relating to normalization, homoglyphs, text
directionality, and so on.
- We have no intention of using non-ASCII characters in the language syntax or
in any library name.
- Provides assurance that all names in libraries can reliably be typed by all
developers -- we already require that keywords, and thus all ASCII letters,
can be typed.

Disadvantages:

- An overarching goal of the Carbon project is to provide a language that is
inclusive and welcoming. A language that does not permit names and comments
in programs to be expressed in the developer's native language will not meet
that goal for at least some of our developers.
- Quoted strings will be substantially less readable if non-ASCII printable
characters are required to be written as escape sequences.

### Byte order marks

**We could disallow byte order marks.**

Advantages:

- Marginal implementation simplicity.

Disadvantages:

- Several major editors, particularly on the Windows platform, insert UTF-8
BOMs and use them to identify file encoding.

### Normalization forms

**We could require a different normalization form.**

Advantages:

- Some environments might more naturally produce a different normalization
form.
- Normalization Form D is more uniform, in that characters are always
maximally decomposed into combining characters; in NFC, characters may or
may not be decomposed depending on whether a composed form is available.
- NFD may be more suitable for certain uses such as typo correction,
homoglyph detection, or code completion.

Disadvantages:

- The C++ standard and community is moving towards using NFC:

- WG21 is in the process of adopting an NFC requirement for C++
identifiers.
- GCC warns on C++ identifiers that aren't in NFC.

As a consequence, we should expect that the tooling and development
environments that C++ developers are using will provide good support for
authoring NFC-encoded source files.

- The W3C recommends using NFC for all content, so code samples distributed on
webpages may be canonicalized into NFC by some web authoring tools.

- NFC produces smaller encodings than NFD in all cases where they differ.

**We could require no normalization form and compare identifiers by code point
sequence.**

Advantages:

- This is the rule in use in C++20 and before.

Disadvantages:

- This is not the rule planned for the near future of C++.
- Different representations of the same character may result in different
identifiers, in a way that is likely to be invisible in most programming
environments.

**We could require no normalization form, and normalize the source code
ourselves.**

Advantages:

- We would treat source text identically regardless of the normalization form.
- Developers would not be responsible for ensuring that their editing
environment produces and preserves the proper normalization form.

Disadvantages:

- There is substantially more implementation cost involved in normalizing
identifiers than in detecting whether they are in normal form. While this
proposal would require the implementation complexity of converting into NFC
in the formatting tool, it would not require the conversion cost to be paid
during compilation.

A high-quality implementation may choose to accept this cost anyway, in
order to better recover from errors. Moreover, it is possible to
[detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
and do the conversion only when necessary. However, if non-canonical source
is formally valid, there are more stringent performance constraints on such
conversion than if it is only done for error recovery.

- Tools such as `grep` do not perform normalization themselves, and so would
be unreliable when applied to a codebase with inconsistent normalization.
- GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
process of adopting an
[NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
development environments should be expected to increasingly accommodate
production of text in NFC.
- The byte representation of a source file may be unstable if different
editing environments make different normalization choices, creating problems
for revision control systems, patch files, and the like.
- Normalizing the contents of string literals, rather than using their
contents unaltered, will introduce a risk of user surprise.

**We could require only identifiers, or only identifiers and comments, to be
normalized, rather than the entire input file.**

Advantages:

- This would provide more freedom in comments to use arbitrary text.
- String literals could contain intentionally non-normalized text in order to
represent non-normalized strings.

Disadvantages:

- Within string literals, this would result in invisible semantic differences:
strings that render identically can have different meanings.
- The semantics of the program could vary if its sources are normalized, which
an editing environment might do invisibly and automatically.
- If an editing environment were to automatically normalize text, it would
introduce spurious diffs into changes.
- We would need to be careful to ensure that no string or comment delimiter
ends with a code point sequence that is a prefix of a decomposition of
another code point; otherwise, different normalizations of the same source
file could tokenize differently.
