Ensure we have comprehensive coverage for the ICU4X bug #669

Merged 38 commits on Jan 23, 2024
Changes from 37 commits
Commits
cc76245
A test which I expected to fail, but not in this way
eggrobin Dec 1, 2023
e23d1c1
Pre-16 and NFKCQC
eggrobin Dec 1, 2023
24fe8e1
🤪
eggrobin Dec 2, 2023
328c761
Canonical closure tests
eggrobin Dec 29, 2023
2d0ceaf
Generate canonical closures
eggrobin Dec 29, 2023
3880c4f
Some interesting sequences
eggrobin Dec 29, 2023
b3b53c0
Some very crappy code
eggrobin Dec 29, 2023
22dfd8c
Drop Hangul and make sure we have all overlaps
eggrobin Dec 29, 2023
a742327
Split it into its own part and look at chaining compositions, not dec…
eggrobin Jan 3, 2024
182cc3a
despam
eggrobin Jan 3, 2024
53459b0
spots
eggrobin Jan 3, 2024
5f16271
Regenerate UCD
eggrobin Jan 3, 2024
747f982
Some comments.
eggrobin Jan 3, 2024
695c95e
Allow a single non-decomposable starter at either end of the chain
eggrobin Jan 4, 2024
9fea9ea
Deduplicate parts 4 and 5
eggrobin Jan 4, 2024
7362f2d
Remove redundant test cases in NFC (covered by the NFC column of othe…
eggrobin Jan 5, 2024
cdd391a
Clean things up
eggrobin Jan 5, 2024
7bcb9b4
more cleanup
eggrobin Jan 5, 2024
cf4275c
more cleanup
eggrobin Jan 5, 2024
3cb23ac
More testing
eggrobin Jan 7, 2024
e41b3ea
Fix the QC properties
eggrobin Jan 7, 2024
0c312ce
stray import
eggrobin Jan 7, 2024
361a977
factor
eggrobin Jan 7, 2024
0380b27
report all failures
eggrobin Jan 7, 2024
7a6220b
Markus’s suggestions
eggrobin Jan 20, 2024
89cdf7a
Merge remote-tracking branch 'la-vache/main' into normalization-woes
eggrobin Jan 20, 2024
e1a01ed
More honest primaryCompositesByMeowNFDCodePoint maps
eggrobin Jan 20, 2024
b0b4cf6
Regenerate UCD
eggrobin Jan 20, 2024
910039c
Merge branch 'normalization-woes' of https://github.com/eggrobin/unic…
eggrobin Jan 20, 2024
c21622e
spotless
eggrobin Jan 20, 2024
6308912
Merge branch 'normalization-woes' into HEAD
eggrobin Jan 20, 2024
af8b00b
Make sure we have the ICU4X test cases if this expands to other planes
eggrobin Jan 21, 2024
5196123
Regenerate UCD
eggrobin Jan 21, 2024
21d5167
ic
eggrobin Jan 21, 2024
f47e9c8
Regenerate UCD
eggrobin Jan 21, 2024
66e7296
Merge remote-tracking branch 'la-vache/main' into normalization-woes
eggrobin Jan 22, 2024
b392d39
Merge branch 'normalization-woes' into icu4x-belt-and-suspenders
eggrobin Jan 22, 2024
98d8820
After Markus’s review
eggrobin Jan 23, 2024
20 changes: 19 additions & 1 deletion unicodetools/data/ucd/dev/NormalizationTest.txt
@@ -1,5 +1,5 @@
# NormalizationTest-16.0.0.txt
# Date: 2024-01-20, 01:49:31 GMT
# Date: 2024-01-21, 18:36:20 GMT
# © 2023 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see https://www.unicode.org/terms_of_use.html
@@ -67,7 +67,25 @@
1100 AC00 11A8;1100 AC01;1100 1100 1161 11A8;1100 AC01;1100 1100 1161 11A8; # (ᄀ각; ᄀ각; ᄀ각; ᄀ각; ᄀ각; ) HANGUL CHOSEONG KIYEOK, HANGUL SYLLABLE GA, HANGUL JONGSEONG KIYEOK
1100 AC00 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8; # (ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ) HANGUL CHOSEONG KIYEOK, HANGUL SYLLABLE GA, HANGUL JONGSEONG KIYEOK, HANGUL JONGSEONG KIYEOK
01C4 0323;01C4 0323;01C4 0323;0044 1E92 030C;0044 005A 0323 030C; # (DŽ◌̣; DŽ◌̣; DŽ◌̣; DẒ◌̌; DZ◌̣◌̌; ) LATIN CAPITAL LETTER DZ WITH CARON, COMBINING DOT BELOW
01C5 0323;01C5 0323;01C5 0323;0044 1E93 030C;0044 007A 0323 030C; # (Dž◌̣; Dž◌̣; Dž◌̣; Dẓ◌̌; Dz◌̣◌̌; ) LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON, COMBINING DOT BELOW
01C6 0323;01C6 0323;01C6 0323;0064 1E93 030C;0064 007A 0323 030C; # (dž◌̣; dž◌̣; dž◌̣; dẓ◌̌; dz◌̣◌̌; ) LATIN SMALL LETTER DZ WITH CARON, COMBINING DOT BELOW
0DDD 0334;0DDD 0334;0DD9 0DCF 0334 0DCA;0DDD 0334;0DD9 0DCF 0334 0DCA; # (ෝ◌̴; ෝ◌̴; ො◌̴◌්; ෝ◌̴; ො◌̴◌්; ) SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA, COMBINING TILDE OVERLAY
3304 0334;3304 0334;3304 0334;30A4 30CB 30F3 30B0 0334;30A4 30CB 30F3 30AF 0334 3099; # (㌄◌̴; ㌄◌̴; ㌄◌̴; イニング◌̴; イニンク◌̴◌゙; ) SQUARE ININGU, COMBINING TILDE OVERLAY
3307 0334;3307 0334;3307 0334;30A8 30B9 30AF 30FC 30C9 0334;30A8 30B9 30AF 30FC 30C8 0334 3099; # (㌇◌̴; ㌇◌̴; ㌇◌̴; エスクード◌̴; エスクート◌̴◌゙; ) SQUARE ESUKUUDO, COMBINING TILDE OVERLAY
3310 0334;3310 0334;3310 0334;30AE 30AC 0334;30AD 3099 30AB 0334 3099; # (㌐◌̴; ㌐◌̴; ㌐◌̴; ギガ◌̴; キ◌゙カ◌̴◌゙; ) SQUARE GIGA, COMBINING TILDE OVERLAY
331E 0334;331E 0334;331E 0334;30B3 30FC 30DD 0334;30B3 30FC 30DB 0334 309A; # (㌞◌̴; ㌞◌̴; ㌞◌̴; コーポ◌̴; コーホ◌̴◌゚; ) SQUARE KOOPO, COMBINING TILDE OVERLAY
3321 0334;3321 0334;3321 0334;30B7 30EA 30F3 30B0 0334;30B7 30EA 30F3 30AF 0334 3099; # (㌡◌̴; ㌡◌̴; ㌡◌̴; シリング◌̴; シリンク◌̴◌゙; ) SQUARE SIRINGU, COMBINING TILDE OVERLAY
3332 0334;3332 0334;3332 0334;30D5 30A1 30E9 30C3 30C9 0334;30D5 30A1 30E9 30C3 30C8 0334 3099; # (㌲◌̴; ㌲◌̴; ㌲◌̴; ファラッド◌̴; ファラット◌̴◌゙; ) SQUARE HUARADDO, COMBINING TILDE OVERLAY
333B 0334;333B 0334;333B 0334;30DA 30FC 30B8 0334;30D8 309A 30FC 30B7 0334 3099; # (㌻◌̴; ㌻◌̴; ㌻◌̴; ページ◌̴; ヘ◌゚ーシ◌̴◌゙; ) SQUARE PEEZI, COMBINING TILDE OVERLAY
3340 0334;3340 0334;3340 0334;30DD 30F3 30C9 0334;30DB 309A 30F3 30C8 0334 3099; # (㍀◌̴; ㍀◌̴; ㍀◌̴; ポンド◌̴; ホ◌゚ント◌̴◌゙; ) SQUARE PONDO, COMBINING TILDE OVERLAY
334B 0334;334B 0334;334B 0334;30E1 30AC 0334;30E1 30AB 0334 3099; # (㍋◌̴; ㍋◌̴; ㍋◌̴; メガ◌̴; メカ◌̴◌゙; ) SQUARE MEGA, COMBINING TILDE OVERLAY
334E 0334;334E 0334;334E 0334;30E4 30FC 30C9 0334;30E4 30FC 30C8 0334 3099; # (㍎◌̴; ㍎◌̴; ㍎◌̴; ヤード◌̴; ヤート◌̴◌゙; ) SQUARE YAADO, COMBINING TILDE OVERLAY
FEF5 0656;FEF5 0656;FEF5 0656;0644 0622 0656;0644 0627 0656 0653; # (ﻵ◌ٖ; ﻵ◌ٖ; ﻵ◌ٖ; لآ◌ٖ; لا◌ٖ◌ٓ; ) ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE ISOLATED FORM, ARABIC SUBSCRIPT ALEF
FEF6 0656;FEF6 0656;FEF6 0656;0644 0622 0656;0644 0627 0656 0653; # (ﻶ◌ٖ; ﻶ◌ٖ; ﻶ◌ٖ; لآ◌ٖ; لا◌ٖ◌ٓ; ) ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE FINAL FORM, ARABIC SUBSCRIPT ALEF
FEF7 0656;FEF7 0656;FEF7 0656;0644 0623 0656;0644 0627 0656 0654; # (ﻷ◌ٖ; ﻷ◌ٖ; ﻷ◌ٖ; لأ◌ٖ; لا◌ٖ◌ٔ; ) ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE ISOLATED FORM, ARABIC SUBSCRIPT ALEF
FEF8 0656;FEF8 0656;FEF8 0656;0644 0623 0656;0644 0627 0656 0654; # (ﻸ◌ٖ; ﻸ◌ٖ; ﻸ◌ٖ; لأ◌ٖ; لا◌ٖ◌ٔ; ) ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE FINAL FORM, ARABIC SUBSCRIPT ALEF
FEF9 0334;FEF9 0334;FEF9 0334;0644 0625 0334;0644 0627 0334 0655; # (ﻹ◌̴; ﻹ◌̴; ﻹ◌̴; لإ◌̴; لا◌̴◌ٕ; ) ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW ISOLATED FORM, COMBINING TILDE OVERLAY
FEFA 0334;FEFA 0334;FEFA 0334;0644 0625 0334;0644 0627 0334 0655; # (ﻺ◌̴; ﻺ◌̴; ﻺ◌̴; لإ◌̴; لا◌̴◌ٕ; ) ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW FINAL FORM, COMBINING TILDE OVERLAY
#
@Part1 # Character by character test
# All characters not explicitly occurring in c1 of Part 1 have identical NFC, D, KC, KD forms.
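
For readers who want to check the new lines against another implementation: each data line carries five fields, c1 through c5. Per the conformance rules stated in the NormalizationTest.txt header, c2 is the NFC form and c3 the NFD form of c1, c2 and c3, while c4 and c5 are the NFKC and NFKD forms of every field (and the NFC and NFD forms of c4 and c5). Below is a minimal sketch of that check. It assumes the JDK's `java.text.Normalizer` rather than the unicodetools normalizer, and the class and method names are illustrative and not part of this PR.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

/** Illustrative only: checks one NormalizationTest.txt data line against the JDK normalizer. */
class NormalizationLineCheck {
    /** Converts space-separated hex code points, e.g. "01C4 0323", into a string. */
    static String parse(String hex) {
        StringBuilder sb = new StringBuilder();
        for (String cp : hex.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(cp, 16));
        }
        return sb.toString();
    }

    /** Returns true if the first five fields of the line satisfy the conformance invariants. */
    static boolean check(String line) {
        String[] fields = line.split(";");
        String[] c = new String[5];
        for (int i = 0; i < 5; i++) {
            c[i] = parse(fields[i]);
        }
        boolean ok = true;
        for (int i = 0; i < 5; i++) {
            // c1..c3 share the canonical forms c2 (NFC) and c3 (NFD);
            // c4..c5 share the canonical forms c4 (NFC) and c5 (NFD).
            String expectedNfc = i < 3 ? c[1] : c[3];
            String expectedNfd = i < 3 ? c[2] : c[4];
            ok &= expectedNfc.equals(Normalizer.normalize(c[i], Form.NFC));
            ok &= expectedNfd.equals(Normalizer.normalize(c[i], Form.NFD));
            // Every field has the compatibility forms c4 (NFKC) and c5 (NFKD).
            ok &= c[3].equals(Normalizer.normalize(c[i], Form.NFKC));
            ok &= c[4].equals(Normalizer.normalize(c[i], Form.NFKD));
        }
        return ok;
    }

    public static void main(String[] args) {
        // The U+01C4 U+0323 line from Part 0 above, without its trailing comment.
        String line = "01C4 0323;01C4 0323;01C4 0323;0044 1E92 030C;0044 005A 0323 030C;";
        System.out.println(check(line));
    }
}
```

The trailing `# (...)` comment of a data line ends up in a sixth field after the split and is ignored, so full lines from the file can be passed in unchanged.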
61 changes: 54 additions & 7 deletions unicodetools/src/main/java/org/unicode/text/UCD/GenerateData.java
@@ -13,6 +13,7 @@
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSet;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
@@ -23,6 +24,8 @@
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Consumer;
import org.unicode.props.IndexUnicodeProperties;
import org.unicode.props.UcdProperty;
import org.unicode.text.utility.Settings;
import org.unicode.text.utility.UTF32;
import org.unicode.text.utility.UnicodeDataFile;
@@ -811,6 +814,57 @@ public static void writeNormalizerTestSuite(String directory, String fileName)
        for (final String testSuiteCase : testSuiteCases) {
            writeLine(testSuiteCase, log, false);
        }
        // At least one implementation (ICU4X) has an edge case when a character
        // whose decomposition contains multiple starters and ends with a
        // non-starter is followed by a non-starter of lower CCC.
        // See https://github.com/unicode-org/unicodetools/issues/656
        // and https://github.com/unicode-org/icu4x/pull/4530.
        // That implementation also has separate code paths for the BMP and
        // higher planes. No such decompositions currently exist outside the
        // BMP, but by generating these test cases we ensure that this would be
        // covered.
        // We stick them in Part 0, which is in principle for handcrafted test
        // cases, because there are not many of them, and the edge case feels a
        // tad too weird to describe in the title of a new part.
        final org.unicode.props.UnicodeProperty sc =
                IndexUnicodeProperties.make().getProperty(UcdProperty.Script);
        final org.unicode.props.UnicodeProperty ccc =
                IndexUnicodeProperties.make().getProperty(UcdProperty.Canonical_Combining_Class);
        loopOverCodePoints:
        for (final String cp : UnicodeSet.ALL_CODE_POINTS) {
            final String[] decompositions =
                    new String[] {Default.nfd().normalize(cp), Default.nfkd().normalize(cp)};
            for (final String decomposition : decompositions) {
                final int lastCCC =
                        Default.ucd()
                                .getCombiningClass(
                                        decomposition.codePointBefore(decomposition.length()));
                // Code points with CCC=0 are starters, so this counts starters.
                final long starterCount =
                        decomposition
                                .codePoints()
                                .filter(c -> (Default.ucd().getCombiningClass(c) == 0))
                                .count();
                final String script = sc.getValue(cp.codePointAt(0));
                if (lastCCC > 1 && starterCount > 1) {
                    // Try to pick a trailing nonstarter that might have a
                    // chance of combining with the character if possible,
                    // both for æsthetic reasons and to reproduce the example
                    // ICU4X came across. If all else fails, use a character
                    // with CCC=1, as low as it gets.
                    if (script.equals("Arabic") && lastCCC > 220) {
                        // ARABIC SUBSCRIPT ALEF.
                        writeLine(cp + "\u0656", log, false);
                    } else if (lastCCC > 220) {
                        // COMBINING DOT BELOW.
                        writeLine(cp + "\u0323", log, false);
                    } else {
                        // COMBINING TILDE OVERLAY.
                        writeLine(cp + "\u0334", log, false);
                    }
                    continue loopOverCodePoints;
                }
            }
        }

System.out.println("Writing Part 2");
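
The selection logic in the loop above can be restated in isolation. The sketch below takes the combining classes of a decomposition as plain integers, so it makes no claim about the unicodetools API. `EdgeCaseSketch`, `isCandidate`, and `trailingNonStarter` are hypothetical names, and the CCC values in `main` are supplied by hand instead of being looked up in the UCD.

```java
/** Illustrative restatement of the test-case selection; not the code in GenerateData.java. */
class EdgeCaseSketch {
    /**
     * A decomposition qualifies when it contains more than one starter (CCC=0)
     * and ends with a non-starter whose CCC is greater than 1, so that a
     * non-starter of lower CCC can be placed after it.
     */
    static boolean isCandidate(int[] cccs) {
        if (cccs.length == 0) {
            return false;
        }
        int starters = 0;
        for (int ccc : cccs) {
            if (ccc == 0) {
                starters++;
            }
        }
        return starters > 1 && cccs[cccs.length - 1] > 1;
    }

    /** Picks the trailing mark, mirroring the three branches of the generator. */
    static int trailingNonStarter(boolean isArabic, int lastCcc) {
        if (isArabic && lastCcc > 220) {
            return 0x0656; // ARABIC SUBSCRIPT ALEF, CCC=220
        } else if (lastCcc > 220) {
            return 0x0323; // COMBINING DOT BELOW, CCC=220
        } else {
            return 0x0334; // COMBINING TILDE OVERLAY, CCC=1
        }
    }

    public static void main(String[] args) {
        // Hand-supplied CCCs (assumptions of this sketch):
        // NFKD(U+01C4) = 0044 005A 030C, NFD(U+0DDD) = 0DD9 0DCF 0DCA,
        // NFKD(U+FEF5) = 0644 0627 0653.
        int[][] decompositions = {{0, 0, 230}, {0, 0, 9}, {0, 0, 230}};
        boolean[] arabic = {false, false, true};
        for (int i = 0; i < decompositions.length; i++) {
            int[] cccs = decompositions[i];
            if (isCandidate(cccs)) {
                int mark = trailingNonStarter(arabic[i], cccs[cccs.length - 1]);
                System.out.println(String.format("%04X", mark));
            }
        }
    }
}
```

With these inputs the chosen marks are U+0323, U+0334, and U+0656, which are the trailing code points on the 01C4, 0DDD, and FEF5 lines in the NormalizationTest.txt diff.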

@@ -1318,13 +1372,6 @@ static final String comma(String s) {
"\u0592\u05B7\u05BC\u05A5\u05B0\u05C0\u05C4\u05AD",
"\u1100\uAC00\u11A8",
"\u1100\uAC00\u11A8\u11A8",
// Some implementations have an edge case when a character whose
// decomposition contains multiple starters and ends with a non-starter
// is followed by a non-starter of lower CCC.
// See https://github.com/unicode-org/unicodetools/issues/656
// and https://github.com/unicode-org/icu4x/pull/4530.
"\u01C4\u0323",
"\u0DDD\u0334",
};
/*
static final void backwardsCompat(String directory, String filename, int[] list) throws IOException {