Address confusable AIs for 16.0 #841

40 changes: 31 additions & 9 deletions docs/security.md
@@ -9,8 +9,8 @@ machine-generated, then tweaked. They have names like
source/confusables-winFonts.txt. The main file is confusables-source.txt.

***There is fairly complex processing for the confusables, so carefully diff the
results. Sometimes you may get an unexpected union of two equivalence sets.
Look at Testing below for help.***

Look at the following spreadsheets / bugs to see if there are any additional
suggestions.
@@ -27,9 +27,28 @@ suggestions.
and deleted it without saving data. Check with Mark.

If so, assess and add to unicodetools/data/security/{version}/data/source/confusables-source.txt — *if needed.*

Then in the spreadsheets, move the "new stuff" line to the end.

### File Format
There is a brief description of the file format at the top.
Each line represents a mapping from a code point or set of code points to a sequence of one or more code points.

For example:
```
0021 ; 01C3 # ( ! → ǃ) EXCLAMATION MARK → LATIN LETTER RETROFLEX CLICK
```

The ordering of characters doesn't matter.
So it doesn't matter whether you have the above line, or
```
01C3 ; 0021 # ( ǃ → !) LATIN LETTER RETROFLEX CLICK → EXCLAMATION MARK
```
It also doesn't matter if you have identical lines; the second one will be a no-op.

The mappings are used to generate equivalence classes.
From each equivalence class, one representative member will be chosen,
and in the resulting data file, all the other characters will map to that representative.
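
As a rough illustration of how mappings turn into equivalence classes, here is a minimal union-find sketch in Python. It is not the GenerateConfusables implementation. The names `parse_mapping`, `build_classes`, and `representative` are hypothetical, the parser only handles well-formed lines whose fields are hex code points (not literal characters, ranges, or UnicodeSets), and the rule for picking a representative is a placeholder rather than the tool's actual rule.

```python
from collections import defaultdict


def parse_mapping(line):
    """Parse 'SOURCE ; TARGET # comment' with hex code point fields.

    Returns (source_text, target_text), or None for blank/comment lines.
    Assumption: literal characters, ranges, and UnicodeSets are not handled.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    source, target = (field.strip() for field in data.split(";")[:2])

    def to_text(field):
        return "".join(chr(int(cp, 16)) for cp in field.split())

    return to_text(source), to_text(target)


def build_classes(lines):
    """Group everything connected by a mapping into one equivalence class."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for line in lines:
        pair = parse_mapping(line)
        if pair is None:
            continue
        left, right = find(pair[0]), find(pair[1])
        if left != right:  # a repeated or reversed line changes nothing
            parent[left] = right

    classes = defaultdict(set)
    for item in list(parent):
        classes[find(item)].add(item)
    return [frozenset(members) for members in classes.values()]


def representative(members):
    """Placeholder rule (an assumption): shortest, then lowest in code point order."""
    return min(members, key=lambda text: (len(text), text))
```

Because the union is symmetric, `0021 ; 01C3` and `01C3 ; 0021` produce the same class, and repeating a line has no further effect.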

## Before generating

First, in CLDR, update the script metadata:
@@ -51,13 +70,10 @@ Run GenerateConfusables -c -b to generate the files. They will appear in two pla
* reformatted source, log
* $UNICODETOOLS_DIR/data/security/11.0.0/* *including log.txt*

**Run TestSecurity to verify that the confusable mappings are idempotent!**
The TestSecurity.java test is part of the unit test suite, which GitHub CI runs.
It verifies that the confusable mappings are idempotent.
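
To make the idempotence property concrete, here is a small Python sketch. It does not reproduce TestSecurity.java; `skeleton` and `non_idempotent` are hypothetical helpers, and the mapping is assumed to be a plain dict from a single character to its target string. A mapping is idempotent when applying it to any target changes nothing.

```python
def skeleton(text, mapping):
    """Apply the mapping once to every character of text."""
    return "".join(mapping.get(ch, ch) for ch in text)


def non_idempotent(mapping):
    """Return the entries whose target would change if mapped again."""
    return {
        source: target
        for source, target in mapping.items()
        if skeleton(target, mapping) != target
    }
```

For example, `{"a": "b", "b": "c"}` is flagged because the target of `a` is itself remapped, whereas `{"a": "c", "b": "c"}` passes.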

With the same VM arguments as the generator.
Starting in 2021q3, TestSecurity needs to be run as a JUnit test.
It is also now part of the unit test suite and run on GitHub CI.

Copy the following from the output directory to the top level of the revision directory, and check in.

* confusables.txt
* confusablesSummary.txt
@@ -66,6 +82,12 @@ Copy the following from the output directory to the top level of the revision di
* ReadMe.txt
* xidmodifications.txt

### Review

Review the mappings to make sure that there are no surprises.
The biggest issue is if two equivalence classes are mistakenly joined.
For example, if you map b to d, then that will join the equivalence class for b with that of d.
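
One way to spot such a join during review is to compare the equivalence classes before and after a change. The following Python sketch is only an illustration under stated assumptions: `joined_classes` is a hypothetical helper, and both inputs are assumed to be lists of sets of strings extracted from the old and new data by some other means.

```python
def joined_classes(old_classes, new_classes):
    """Report new classes that contain members of two or more old classes."""
    owner = {}
    for index, members in enumerate(old_classes):
        for member in members:
            owner[member] = index

    joins = []
    for members in new_classes:
        touched = {owner[m] for m in members if m in owner}
        if len(touched) > 1:
            joins.append((members, [old_classes[i] for i in sorted(touched)]))
    return joins
```

Each reported entry pairs the merged class with the old classes it absorbed, so a reviewer can decide whether the join was intended.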

### IdentifierStatus.txt & IdentifierType.txt

Markus 2020-feb-07 for Unicode 13.0:
@@ -1,4 +1,17 @@
# See https://github.com/unicode-org/unicodetools/blob/main/docs/security.md for how to use this file.
# The format is
# Source ; Target ; comments # comments
# Source is:
# - a hex code point
# - a literal character
# - a range of the above with .. (need to check this)
# - a UnicodeSet
# Target is:
# - a hex code point
# - a literal character
# - a sequence of hex code points and or literal characters (they can be mixed)
#######
0021 ; 01C3 # ( ! → ǃ) EXCLAMATION MARK → LATIN LETTER RETROFLEX CLICK
0022 ; 02BA # ( " → ʺ) QUOTATION MARK → MODIFIER LETTER DOUBLE PRIME
0022 ; 0027 0027
0022 ; 05F4 # ( " → ״) QUOTATION MARK → HEBREW PUNCTUATION GERSHAYIM
@@ -5437,4 +5450,26 @@ ABBB; 0473; V8_0; ꮻ => ѳ; CHEROKEE SMALL LETTER WI => CYRILLIC SMALL LETTER F
1F16E ; C 20E0 ; V11_0 ; CIRCLED C WITH OVERLAID BACKSLASH
# 1F16F ; 🚹 ; V11_0 ; CIRCLED HUMAN FIGURE

# 178-A76 — Section 21 of document L2/24-012

16FF2 ; 儿 # V16.0 ; U+513F ➡ U+16FF2
16FF3 ; 兒 # V16.0 ; U+5152 ➡ U+16FF3
ㄦ ; 儿 # V16.0 ; U+3126 ㄦ BOPOMOFO LETTER ER ➡ 儿

# 176-A116 — Section 2a of L2/23-164

A7DA ; Λ # V16.0 ; U+A7DA LATIN CAPITAL LETTER LAMBDA ➡ greek equiv
A7DB ; λ # V16.0 ; U+A7DB LATIN SMALL LETTER LAMBDA ➡ greek equiv
A7DC ; Λ̷̷̷ # V16.0 ; U+A7DC LATIN CAPITAL LETTER LAMBDA WITH STROKE ➡ greek equiv
ƛ ; λ̷̷̷ # V16.0 ; existing Latin variant

# 165-A37 — L2/20-272

1715 ; 1734 # V16.0 ; U+1715 TAGALOG SIGN PAMUDPOD ➡ 1734, Hanunoo Sign Pamudpod

# 166-A55 —

ß ; β # sharp S with beta
ẞ ~ ß # sharp S upper/lower
A7D6 ; β # Middle Scots S, uppercase
A7D7 ; β # Middle Scots S, lowercase