-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add document for Unicode casemapping #272
Closed
Closed
Changes from all commits
Commits
Show all changes
13 commits
Select commit
Hold shift + click to select a range
49c702b
Add document for rfc7700 casemapping
DanielOaks 8f48b1c
rfc7700: Improve links, fix incorrect description
DanielOaks 357cde6
rfc7700: Fix incorrect description, only disallow : as the first char…
DanielOaks 0a605cd
rfc7700: Fix mistake, remove legacy restrictions
DanielOaks 36cfa0c
rfc7700: Make Disallowed Characters section recommended, various edits
DanielOaks f1a3082
rfc7700: Add * and ? to hostnames as well
DanielOaks 382c522
rfc7700: Incorportate fixes/suggestions from @SaberUK
DanielOaks 0d6435d
unicode_casemapping: 7700 -> 7613. Now using UsernameCaseMapped.
DanielOaks fb9dc99
rfc7613: Update copyright years
DanielOaks 807e084
unicode_casemapping: 7613 -> 8265. UTF8MAPPING to allow better compat…
DanielOaks fa23d59
rfc8265: Fix text
DanielOaks 634fb6d
Add additional advice for casefolding channel names, to avoid hitting…
DanielOaks 840aa78
Add Unicode skeletonisation recommendation to avoid confusables
DanielOaks File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,181 @@ | ||
--- | ||
title: rfc8265 IRC Unicode-aware Casemapping | ||
layout: spec | ||
copyrights: | ||
- | ||
name: "Daniel Oaks" | ||
period: "2016-2017" | ||
email: "[email protected]" | ||
--- | ||
This document describes a Unicode-aware casemapping for IRC, based on the recommendations in [RFC 8265](https://tools.ietf.org/html/rfc8265). | ||
|
||
Client and channel names in languages other than English is a much-desired feature. This casemapping allows for an extended set of characters while minimising the risks around allowing these characters, and avoiding the technical limitations of prior solutions to this problem. | ||
|
||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](http://tools.ietf.org/html/rfc2119). | ||
|
||
|
||
## Casemapping | ||
|
||
The server indicates that this Unicode-aware casemapping is in use by advertising the `"UTF8MAPPING"` `RPL_ISUPPORT` parameter with the value `"rfc8265"`. The `"CASEMAPPING"` parameter (with a value such as `"ascii"` or `"rfc1459"`) MUST also be sent, and indicates additional casemapping information. An example of this is shown below: | ||
|
||
:irc.example.com 005 dan CASEMAPPING=ascii UTF8MAPPING=rfc8265 :are supported by this server" | ||
|
||
When the `rfc8265` casemapping is in use, names (and other messages from the IRCd where possible) MUST be sent using the [UTF-8](https://tools.ietf.org/html/rfc3629) encoding. | ||
|
||
The `rfc8265` casemapping uses the PRECIS [UsernameCaseMapped profile][precis] as defined in [Section 3.3 of RFC 8265][precis]. | ||
|
||
[precis]: https://tools.ietf.org/html/rfc8265#section-3.3 | ||
|
||
|
||
### CASEMAPPING vs UTF8MAPPING | ||
|
||
The original `"CASEMAPPING"` `RPL_ISUPPORT` token is preserved, and a new token called `"UTF8MAPPING"` is added to denote this Unicode-compatible casemapping. This is done for two reasons - to allow servers to continue using `rfc1459`-specific casemapping rules if they need to, and ensuring clients who aren't aware of this Unicode casemapping can fall back to a default casemapping. | ||
|
||
We recommend using these values: | ||
|
||
- `CASEMAPPING`: `ascii` | ||
- `UTF8MAPPING`: `rfc8265` | ||
|
||
If the above values are in use, software can skip the ASCII-specific casemapping rules and continue with the [UsernameCaseMapped profile][precis] preperation rules. | ||
|
||
If `CASEMAPPING` is `"rfc1459"`, software MUST apply these special casemapping rules before continuing with the [UsernameCaseMapped profile][precis] preperation rules: | ||
|
||
- `('{', 0x7B)` is the lower-case equivalent of the character `('[', 0x5B)` | ||
- `('|', 0x7C)` is the lower-case equivalent of the character `('\', 0x5C)` | ||
- `('}', 0x7D)` is the lower-case equivalent of the character `(']', 0x5D)` | ||
|
||
|
||
### Order of Operations | ||
|
||
Names being prepared MUST apply the following rules in the order shown: | ||
|
||
1. If casefolding a channel name, strip all channel prefixes (in the ISUPPORT token `CHANTYPES`) right at the start of the name. | ||
2. Preperation using additional casemapping rules specified by `CASEMAPPING` (such as `rfc1459`). | ||
3. Preparation using the PRECIS [UsernameCaseMapped profile][precis]. | ||
4. Check for restricted characters. | ||
5. If casefolding a channel name, re-apply the original stripped channel prefixes. | ||
|
||
These steps MUST happen in the order shown, or else the restricted characters check may miss characters that should be legitimately restricted. | ||
|
||
If a name contains a restricted character (found in either step 1 or 2), that name MUST be rejected by the server and MUST NOT be propogated to other clients. This is done through the appropriate numeric for the command which tried to set or use the invalid name such as `ERR_ERRONEUSNICKNAME`, `ERR_NOSUCHCHANNEL`, or whichever numeric is most appropriate. | ||
|
||
The first and last step are specific to casefolding channel names. This is done because by default, the Bidi (bi-directional) rule restrictions mean that a name like `#愛でる` will be rejected (as `#` is considered a 'left-to-right' character, and `愛でる` are considered right-to-left). A channel name like `#愛でる` should have the `#` character stripped, the rest of the string prepared, and then the prefix `#` re-applied. If there is more than one channel prefix character at the start, then all should be stripped (e.g. `##愛でる` would have the `##` stripped and then re-applied). | ||
|
||
|
||
### Comparisons | ||
|
||
When comparing names for equivalency, both strings are casemapped with the above process. After this, they are compared with a standard bit-for-bit comparison. | ||
|
||
|
||
## Visually Similar Characters | ||
|
||
As noted in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification, the PRECIS framework itself does not address the issue of eliminating all possible visually similar characters. | ||
|
||
With the new allowed Unicode characters comes the ability to use characters that look the same. For example, `E (0x45)`, `Ε (U+0395)` (Greek Capital Letter Epsilon), and `Е (U+0415)` (Cyrillic Capital Letter IE) look the same in most fonts, but are treated as separate characters by this casemapping. More examples of these can be found in Unicode's [Confusables document](https://www.unicode.org/Public/security/latest/confusables.txt). | ||
|
||
Unicode skeletonisation is the method we recommend to combat this. For each identifier (nick/channel name) on the server, a 'skeleton' is generated by taking the **casefolded** name, and then applying to it the transformations described in the [Unicode Security Mechanisms document](http://unicode.org/reports/tr39/#Confusable_Detection). These skeletons, if used, MUST ONLY be used for comparison, and not as any user-visible identifier (as they intentionally contain complicated mixes of scripts and characters). When users change nicknames or create new channels, the casefolded names should be compared and the skeletons should also be compared to ensure that both are globally-unique (with any non-unique names rejected outright). This seems to be the most reliable method as of right now, but does require storing the skeletons of all in-use names for comparison purposes. | ||
|
||
Otherwise, server authors may wish to only allow characters from specific character sets of locales to be used, only allow characters from a list of known, non-confusable characters. Other recommendations are available in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification. Names that have the opportunity to be confusing SHOULD be disallowed by servers. | ||
|
||
|
||
## Restricted Characters | ||
|
||
This section contains a list of characters which (currently) cause protocol issues or are otherwise undesirable, and are recommended to be disallowed from use. The justifications for disallowing these characters is listed beside them. | ||
|
||
Servers and bouncers may disallow any characters they deem necessary. | ||
|
||
|
||
### All Names | ||
|
||
Nicknames, usernames, hostnames and channel names should not contain the following characters: | ||
|
||
* `(' ', 0x20)` - Separates parameters. | ||
* `(',', 0x2C)` - Used as a separator. | ||
|
||
In addition, the first character of nicknames, usernames, hostnames and channel names should not be: | ||
|
||
* `(':', 0x3A)` - Separates trailing parameter. | ||
|
||
To be clear, these characters will also be repeated below. | ||
|
||
#### Protocol Framing | ||
|
||
These characters should be disallowed either by the protocol itself or because of how they are used in message framing: | ||
|
||
* `('\0', 0x00)` - Disallowed by the protocol. | ||
* `('\n', 0x0A)` - Used in message framing. | ||
* `('\r', 0x0D)` - Used in message framing. | ||
|
||
|
||
### Nicknames | ||
|
||
Nicknames should not contain the following characters: | ||
|
||
* `(' ', 0x20)` - Separates parameters. | ||
* `(',', 0x2C)` - Used as a separator. | ||
* `('*', 0x2A)` - Used in mask matching. | ||
* `('?', 0x3F)` - Used in mask matching. | ||
* `('.', 0x2E)` - Denotes a server name (some clients recognise a client vs server by whether the name contains a period). | ||
* `('!', 0x21)` - Separates nickname from username. | ||
* `('@', 0x40)` - Separates username from hostname. | ||
|
||
In addition to the above restrictions, the first character of a nickname should not be: | ||
|
||
* `(':', 0x3A)` - Separates trailing parameter. | ||
* Channel membership prefixes as defined in the `PREFIX` parameter of `RPL_ISUPPORT` - Protocol conflicts. | ||
* Channel prefixes as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` - Protocol conflicts. | ||
|
||
|
||
### Usernames | ||
|
||
Usernames should not contain the following characters: | ||
|
||
* `(' ', 0x20)` - Separates parameters. | ||
* `(',', 0x2C)` - Used as a separator. | ||
* `('*', 0x2A)` - Used in mask matching. | ||
* `('?', 0x3F)` - Used in mask matching. | ||
* `('!', 0x21)` - Separates nickname from username. | ||
* `('@', 0x40)` - Separates username from hostname. | ||
|
||
In addition, the first character of a username should not be: | ||
|
||
* `(':', 0x3A)` - Separates trailing parameter. | ||
|
||
|
||
### Hostnames | ||
|
||
Hostnames should not contain the following charactes: | ||
|
||
* `(' ', 0x20)` - Separates parameters. | ||
* `(',', 0x2C)` - Used as a separator. | ||
* `('*', 0x2A)` - Used in mask matching. | ||
* `('?', 0x3F)` - Used in mask matching. | ||
* `('!', 0x21)` - Separates nickname from username. | ||
* `('@', 0x40)` - Separates username from hostname. | ||
|
||
In addition, the first character of a hostname should not be: | ||
|
||
* `(':', 0x3A)` - Separates trailing parameter. | ||
|
||
Hostnames that are looked up on client connection should also be checked with the traditional hostname rules to ensure it is valid. | ||
|
||
|
||
### Channel Names | ||
|
||
Channel names must start with a channel prefix as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` (005). | ||
|
||
Channel names should not contain the following characters: | ||
|
||
* `(' ', 0x20)` - Separates parameters. | ||
* `(',', 0x2C)` - Used as a separator. | ||
|
||
|
||
## Security Considerations | ||
|
||
New Unicode-capable casemappings have a considerable security impact. | ||
|
||
Introducing so many new characters in names brings the possibility for those new characters to break the protocol. We aim to prevent this from being an issue by outlining these problematic characters in the Restricted Characters section. | ||
|
||
With Unicode characters comes the possibility of having visually similar characters (sometimes called 'confusables'). We acknowledge this and outline several ways to combat this possibility in the Visually Similar Characters section. | ||
|
||
Legacy clients may have issues working correctly with Unicode names. It is acknowledged that issues may exist around this, and in particular it is very difficult to understand how clients may react to Unicode names given the very vast base of clients out there today. I do not attempt to address this concern within this document. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't what we implemented in oragono --- we skeletonize the unfolded, original identifier (the one that is displayed to the users), and only then apply a round of width and case normalization. The rationale is that an initial round of casefolding may lose information about visual confusability. Hypothetically, you could have a non-Latin character with both uppercase and lowercase forms, such that its uppercase form is visually confusable with a Latin character but its lowercase form is visually distinct. Casefolding first would allow an impersonation attack using the uppercase character.
To be honest, I think it would help to see how this plays out in the wild --- try to get real-world user stories from people using non-Latin scripts, also see if we can get some Unicode experts to play with the implementation and try to break it.