ircv3 · DanielOaks · Sep 15, 2016 · Sep 15, 2016 · Sep 15, 2016 · Sep 15, 2016
diff --git a/documentation/rfc8265.md b/documentation/rfc8265.md
@@ -0,0 +1,181 @@
+---
+title: rfc8265 IRC Unicode-aware Casemapping
+layout: spec
+copyrights:
+  -
+    name: "Daniel Oaks"
+    period: "2016-2017"
+    email: "[email protected]"
+---
+This document describes a Unicode-aware casemapping for IRC, based on the recommendations in [RFC 8265](https://tools.ietf.org/html/rfc8265).
+
+Client and channel names in languages other than English is a much-desired feature. This casemapping allows for an extended set of characters while minimising the risks around allowing these characters, and avoiding the technical limitations of prior solutions to this problem.
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](http://tools.ietf.org/html/rfc2119).
+
+
+## Casemapping
+
+The server indicates that this Unicode-aware casemapping is in use by advertising the `"UTF8MAPPING"` `RPL_ISUPPORT` parameter with the value `"rfc8265"`. The `"CASEMAPPING"` parameter (with a value such as `"ascii"` or `"rfc1459"`) MUST also be sent, and indicates additional casemapping information. An example of this is shown below:
+
+    :irc.example.com 005 dan CASEMAPPING=ascii UTF8MAPPING=rfc8265 :are supported by this server"
+
+When the `rfc8265` casemapping is in use, names (and other messages from the IRCd where possible) MUST be sent using the [UTF-8](https://tools.ietf.org/html/rfc3629) encoding.
+
+The `rfc8265` casemapping uses the PRECIS [UsernameCaseMapped profile][precis] as defined in [Section 3.3 of RFC 8265][precis].
+
+[precis]: https://tools.ietf.org/html/rfc8265#section-3.3
+
+
+### CASEMAPPING vs UTF8MAPPING
+
+The original `"CASEMAPPING"` `RPL_ISUPPORT` token is preserved, and a new token called `"UTF8MAPPING"` is added to denote this Unicode-compatible casemapping. This is done for two reasons - to allow servers to continue using `rfc1459`-specific casemapping rules if they need to, and ensuring clients who aren't aware of this Unicode casemapping can fall back to a default casemapping.
+
+We recommend using these values:
+
+- `CASEMAPPING`: `ascii`
+- `UTF8MAPPING`: `rfc8265`
+
+If the above values are in use, software can skip the ASCII-specific casemapping rules and continue with the [UsernameCaseMapped profile][precis] preperation rules.
+
+If `CASEMAPPING` is `"rfc1459"`, software MUST apply these special casemapping rules before continuing with the [UsernameCaseMapped profile][precis] preperation rules:
+
+- `('{', 0x7B)` is the lower-case equivalent of the character `('[', 0x5B)`
+- `('|', 0x7C)` is the lower-case equivalent of the character `('\', 0x5C)`
+- `('}', 0x7D)` is the lower-case equivalent of the character `(']', 0x5D)`
+
+
+### Order of Operations
+
+Names being prepared MUST apply the following rules in the order shown:
+
+1. If casefolding a channel name, strip all channel prefixes (in the ISUPPORT token `CHANTYPES`) right at the start of the name.
+2. Preperation using additional casemapping rules specified by `CASEMAPPING` (such as `rfc1459`).
+3. Preparation using the PRECIS [UsernameCaseMapped profile][precis].
+4. Check for restricted characters.
+5. If casefolding a channel name, re-apply the original stripped channel prefixes.
+
+These steps MUST happen in the order shown, or else the restricted characters check may miss characters that should be legitimately restricted.
+
+If a name contains a restricted character (found in either step 1 or 2), that name MUST be rejected by the server and MUST NOT be propogated to other clients. This is done through the appropriate numeric for the command which tried to set or use the invalid name such as `ERR_ERRONEUSNICKNAME`, `ERR_NOSUCHCHANNEL`, or whichever numeric is most appropriate.
+
+The first and last step are specific to casefolding channel names. This is done because by default, the Bidi (bi-directional) rule restrictions mean that a name like `#愛でる` will be rejected (as `#` is considered a 'left-to-right' character, and `愛でる` are considered right-to-left). A channel name like `#愛でる` should have the `#` character stripped, the rest of the string prepared, and then the prefix `#` re-applied. If there is more than one channel prefix character at the start, then all should be stripped (e.g. `##愛でる` would have the `##` stripped and then re-applied).
+
+
+### Comparisons
+
+When comparing names for equivalency, both strings are casemapped with the above process. After this, they are compared with a standard bit-for-bit comparison.
+
+
+## Visually Similar Characters
+
+As noted in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification, the PRECIS framework itself does not address the issue of eliminating all possible visually similar characters.
+
+With the new allowed Unicode characters comes the ability to use characters that look the same. For example, `E (0x45)`, `Ε (U+0395)` (Greek Capital Letter Epsilon), and `Е (U+0415)` (Cyrillic Capital Letter IE) look the same in most fonts, but are treated as separate characters by this casemapping. More examples of these can be found in Unicode's [Confusables document](https://www.unicode.org/Public/security/latest/confusables.txt).
+
+Unicode skeletonisation is the method we recommend to combat this. For each identifier (nick/channel name) on the server, a 'skeleton' is generated by taking the **casefolded** name, and then applying to it the transformations described in the [Unicode Security Mechanisms document](http://unicode.org/reports/tr39/#Confusable_Detection). These skeletons, if used, MUST ONLY be used for comparison, and not as any user-visible identifier (as they intentionally contain complicated mixes of scripts and characters). When users change nicknames or create new channels, the casefolded names should be compared and the skeletons should also be compared to ensure that both are globally-unique (with any non-unique names rejected outright). This seems to be the most reliable method as of right now, but does require storing the skeletons of all in-use names for comparison purposes.
+
+Otherwise, server authors may wish to only allow characters from specific character sets of locales to be used, only allow characters from a list of known, non-confusable characters. Other recommendations are available in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification. Names that have the opportunity to be confusing SHOULD be disallowed by servers.
+
+
+## Restricted Characters
+
+This section contains a list of characters which (currently) cause protocol issues or are otherwise undesirable, and are recommended to be disallowed from use. The justifications for disallowing these characters is listed beside them.
+
+Servers and bouncers may disallow any characters they deem necessary.
+
+
+### All Names
+
+Nicknames, usernames, hostnames and channel names should not contain the following characters:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(',', 0x2C)` - Used as a separator.
+
+In addition, the first character of nicknames, usernames, hostnames and channel names should not be:
+
+* `(':', 0x3A)` - Separates trailing parameter.
+
+To be clear, these characters will also be repeated below.
+
+#### Protocol Framing
+
+These characters should be disallowed either by the protocol itself or because of how they are used in message framing:
+
+* `('\0', 0x00)` - Disallowed by the protocol.
+* `('\n', 0x0A)` - Used in message framing.
+* `('\r', 0x0D)` - Used in message framing.
+
+
+### Nicknames
+
+Nicknames should not contain the following characters:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(',', 0x2C)` - Used as a separator.
+* `('*', 0x2A)` - Used in mask matching.
+* `('?', 0x3F)` - Used in mask matching.
+* `('.', 0x2E)` - Denotes a server name (some clients recognise a client vs server by whether the name contains a period).
+* `('!', 0x21)` - Separates nickname from username.
+* `('@', 0x40)` - Separates username from hostname.
+
+In addition to the above restrictions, the first character of a nickname should not be:
+
+* `(':', 0x3A)` - Separates trailing parameter.
+* Channel membership prefixes as defined in the `PREFIX` parameter of `RPL_ISUPPORT` - Protocol conflicts.
+* Channel prefixes as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` - Protocol conflicts.
+
+
+### Usernames
+
+Usernames should not contain the following characters:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(',', 0x2C)` - Used as a separator.
+* `('*', 0x2A)` - Used in mask matching.
+* `('?', 0x3F)` - Used in mask matching.
+* `('!', 0x21)` - Separates nickname from username.
+* `('@', 0x40)` - Separates username from hostname.
+
+In addition, the first character of a username should not be:
+
+* `(':', 0x3A)` - Separates trailing parameter.
+
+
+### Hostnames
+
+Hostnames should not contain the following charactes:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(',', 0x2C)` - Used as a separator.
+* `('*', 0x2A)` - Used in mask matching.
+* `('?', 0x3F)` - Used in mask matching.
+* `('!', 0x21)` - Separates nickname from username.
+* `('@', 0x40)` - Separates username from hostname.
+
+In addition, the first character of a hostname should not be:
+
+* `(':', 0x3A)` - Separates trailing parameter.
+
+Hostnames that are looked up on client connection should also be checked with the traditional hostname rules to ensure it is valid.
+
+
+### Channel Names
+
+Channel names must start with a channel prefix as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` (005).
+
+Channel names should not contain the following characters:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(',', 0x2C)` - Used as a separator.
+
+
+## Security Considerations
+
+New Unicode-capable casemappings have a considerable security impact.
+
+Introducing so many new characters in names brings the possibility for those new characters to break the protocol. We aim to prevent this from being an issue by outlining these problematic characters in the Restricted Characters section.
+
+With Unicode characters comes the possibility of having visually similar characters (sometimes called 'confusables'). We acknowledge this and outline several ways to combat this possibility in the Visually Similar Characters section.
+
+Legacy clients may have issues working correctly with Unicode names. It is acknowledged that issues may exist around this, and in particular it is very difficult to understand how clients may react to Unicode names given the very vast base of clients out there today. I do not attempt to address this concern within this document.