diff --git a/index.html b/index.html index d164786..fb20d9d 100644 --- a/index.html +++ b/index.html @@ -5283,7 +5283,6 @@

Operator Dictionary

Operator Dictionary (Compact)

-
Remove fence/separator?

The following dictionary provides a compact form for the operator dictionary, suitable for @@ -5316,23 +5315,58 @@

Operator Dictionary (Compact)

fence, separator to false.
  • If Content is a single character in the - BMP Private Use Area (range U+E000–U+F8FF) + range U+0320–U+03FF then exit with NotFound status.
  • If Content is a UTF-16 string of length greater than 1 (including the case of surrogate pairs) and is listed in Operators_multichar, then replace Content with the Unicode character - "U+E000 plus the index of Content in - Operators_multichar". Otherwise, exit with - NotFound status. + "U+0320 plus the index of Content in + Operators_multichar". If it is not listed, then + exit with NotFound status.
  • -
  • If (Content, Form) - corresponds to one category of - then - set the properties according to - . - Otherwise, exit with NotFound status. +
  • + During this step, the algorithm will try to find a category + corresponding to (Content, Form) from + and + either exit with NotFound status or move to + the next point. More precisely, this can be done as follows: +
  • If Content is in Operators_fence then set property fence to true.
  • @@ -5348,46 +5382,34 @@

    Operator Dictionary (Compact)

    +
    + The fence and separator properties do not + have any visible effect on the layout described in this + specification. So steps 5 and 6, as well as the corresponding tables, + may be ignored. +
    +
    Remove fence/separator?
    +

    - After conversion to a single UTF-16 character, determining the - category of ('Content', 'Form') can be done by binary searches - on the tables corresponding to the 'Form' value - of . - For tables of ranges, the binary search can be performed on the - range start code point. Note that small tables only have a few - ranges or code points to check and so can be handled by direct - comparaisons. + When tables are encoded as ranges, one can perform a binary search on + the range start, followed by an extra check on the range length. + Since log is concave, + performing one binary search on each large subtable + of is more costly than performing one + binary search on the whole table of + . + Because the tables contain several contiguous Unicode blocks, + encoding them as ranges brings the cost close to 8 bits per entry.
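
The range lookup described above can be sketched as follows. This is a minimal illustration and not the specification's algorithm: the names `RANGES`, `STARTS` and `in_ranges` are hypothetical, and the sample data is limited to the first four infix ranges of category A.

```python
from bisect import bisect_right

# Hypothetical sorted table of (range_start, range_length) pairs.
# Sample data: [U+2190–U+2199], [U+219C–U+21AD], [U+21AF–U+21B5], {U+21B9}.
RANGES = [(0x2190, 10), (0x219C, 18), (0x21AF, 7), (0x21B9, 1)]
STARTS = [start for start, _ in RANGES]


def in_ranges(code_point):
    """Return True if code_point lies inside one of RANGES."""
    # Binary search for the last range whose start is <= code_point.
    index = bisect_right(STARTS, code_point) - 1
    if index < 0:
        return False
    start, length = RANGES[index]
    # Extra check on the range length.
    return code_point - start < length
```

A single sorted list of range starts keeps the search at one `bisect_right` call, and the stored length turns the final membership test into one subtraction and one comparison.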

    - The possible characters 'Content' values after conversion - characters are located into the three small ranges - U+0000–U+03FF, U+2000–U+2BFF and - U+E000–U+E04F and after simple offset shift can be encoded on - 12 bits. Note that all Unicode ranges from - and - contain between 1 and 32 characters. By splitting ranges into - at most two parts, each range can be encoded on 16 bits. - Due to several contiguous Unicode blocks, the tables would still be - encoded in significantly less than 16bits/entry but all the - tables are now encoded and treated the same way. -

    -

    - Alternatively, discarding the smallest tables as explained above, - one can consider only those having a 4bits encoding in - . - Using the 12-bit encoding of the 'Content' described - above this means that these tables can be encoded with - 16bits/entry but binary search would now be performed on a single - table. -

    -

    - Continuing on the previous approach, it is possible to + Alternatively, it is possible to use a perfect hash function to implement table lookup in constant - time [[?gperf]] [[?CMPH]]. This would add 16 bits per empty entry + time [[?gperf]] [[?CMPH]]. This would instead take + 16 bits per entry, plus 16 bits per extra empty entry (for non-minimal perfect hash functions) as well as extra data to store the hash function parameters. For minimal perfect hash functions, the theoretical lower bound for storing these parameters is @@ -5395,11 +5417,6 @@

    Operator Dictionary (Compact)

    limit up to 4bits/entry.

    -
    - TODO give more compact form in two tables combining the two ideas above for encoding categories 0-9 + 11: - Table 1: 12bits (start code point) + 4bit (form+category) - Table 2: 1, 2 or 4bits? (number of code points in contiguous block) -

    Stretchy Operator Axis

    diff --git a/tables/operator-dictionary-compact.html b/tables/operator-dictionary-compact.html index 6c62049..605f2ce 100644 --- a/tables/operator-dictionary-compact.html +++ b/tables/operator-dictionary-compact.html @@ -1,2 +1,2 @@ -
    Special TableEntries
    Operators_multichar41 entries (null-terminated UTF-16 strings): {U+0021,U+0021,U+0000}, {U+0021,U+003D,U+0000}, {U+0026,U+0026,U+0000}, {U+002A,U+003D,U+0000}, {U+002B,U+002B,U+0000}, {U+002B,U+003D,U+0000}, {U+002D,U+002D,U+0000}, {U+002D,U+003D,U+0000}, {U+002D,U+003E,U+0000}, {U+002E,U+002E,U+0000}, {U+002E,U+002E,U+002E,U+0000}, {U+002F,U+003D,U+0000}, {U+003A,U+003D,U+0000}, {U+003C,U+003D,U+0000}, {U+003D,U+003D,U+0000}, {U+003E,U+003D,U+0000}, {U+007C,U+007C,U+0000}, {U+007C,U+007C,U+007C,U+0000}, {U+223D,U+0331,U+0000}, {U+2242,U+0338,U+0000}, {U+224E,U+0338,U+0000}, {U+224F,U+0338,U+0000}, {U+2266,U+0338,U+0000}, {U+226A,U+0338,U+0000}, {U+226B,U+0338,U+0000}, {U+227F,U+0338,U+0000}, {U+2282,U+20D2,U+0000}, {U+2283,U+20D2,U+0000}, {U+228F,U+0338,U+0000}, {U+2290,U+0338,U+0000}, {U+29CF,U+0338,U+0000}, {U+29D0,U+0338,U+0000}, {U+2A7D,U+0338,U+0000}, {U+2A7E,U+0338,U+0000}, {U+2AA1,U+0338,U+0000}, {U+2AA2,U+0338,U+0000}, {U+2AAF,U+0338,U+0000}, {U+2AB0,U+0338,U+0000}, {U+2ADD,U+0338,U+0000}, {U+D83B,U+DEF0,U+0000}, {U+D83B,U+DEF1,U+0000},
    Operators_fence57 entries (15 Unicode ranges): [U+0028–U+0029], {U+005B}, {U+005D}, [U+007B–U+007D], {U+2016}, [U+2018–U+2019], [U+201C–U+201D], [U+2308–U+230B], [U+2329–U+232A], [U+2772–U+2773], [U+27E6–U+27EF], {U+2980}, [U+2983–U+2998], [U+29FC–U+29FD], [U+E010–U+E011],
    Operators_separator3 entries: U+002C, U+003B, U+2063,
    Special tables for the operator dictionary.
    Total size: 101 entries, 301 bytes.
    (assuming characters are UTF-16 and 1-byte range lengths)
    (Content, Form) keysCategory
    138 entries (18 Unicode ranges) in infix form: [U+2190–U+2199], [U+219C–U+21AD], [U+21AF–U+21B5], {U+21B9}, [U+21BC–U+21CC], [U+21D0–U+21DD], [U+21E0–U+21F0], {U+21F3}, [U+21F5–U+21F6], [U+21FD–U+21FF], [U+27F0–U+27F1], [U+27F5–U+27FF], [U+290A–U+2910], [U+2912–U+2913], [U+2921–U+2922], [U+294E–U+2961], [U+296E–U+296F], [U+2B45–U+2B46], A
    103 entries (36 Unicode ranges) in infix form: {U+002B}, {U+002D}, {U+002F}, {U+00B1}, {U+00F7}, [U+2212–U+2214], {U+2216}, {U+2218}, {U+2224}, [U+2227–U+222A], {U+2236}, {U+2238}, [U+228C–U+228F], [U+2293–U+2296], {U+2298}, [U+229D–U+229F], [U+22BB–U+22BD], {U+22C4}, {U+22C6}, [U+22CE–U+22CF], [U+22D2–U+22D3], [U+2795–U+2797], {U+27F4}, {U+29BC}, {U+29F6}, [U+2A22–U+2A2E], [U+2A38–U+2A3A], [U+2A40–U+2A4F], [U+2A51–U+2A63], [U+2ADA–U+2ADB], {U+2AFB}, {U+2AFD}, {U+2B32}, {U+E002}, {U+E005}, {U+E007}, B
    89 entries (42 Unicode ranges) in infix form: {U+0025}, {U+002A}, {U+002E}, {U+0040}, {U+00B7}, {U+00D7}, {U+2022}, {U+2043}, {U+2206}, {U+220E}, {U+2217}, [U+223F–U+2240], {U+2297}, {U+2299}, [U+22A0–U+22A1], {U+22C5}, {U+22C7}, [U+22C9–U+22CC], [U+2305–U+2306], [U+25A0–U+25A1], [U+25AA–U+25AB], [U+25AD–U+25B1], [U+2981–U+2982], [U+2999–U+299A], {U+29B5}, [U+29C2–U+29C3], [U+29C9–U+29CD], [U+29D8–U+29D9], {U+29DB}, [U+29DF–U+29E0], {U+29E2}, [U+29E7–U+29ED], [U+29F8–U+29FB], [U+2A1D–U+2A21], [U+2A2F–U+2A37], [U+2A3B–U+2A3D], {U+2A3F}, {U+2A50}, [U+2ADC–U+2ADD], {U+2AFE}, [U+E010–U+E012], {U+E026}, C
    53 entries (22 Unicode ranges) in prefix form: {U+0021}, {U+002B}, {U+002D}, {U+00AC}, {U+00B1}, {U+2018}, {U+201C}, [U+2200–U+2201], [U+2203–U+2204], {U+2207}, [U+2212–U+2213], [U+221B–U+221C], [U+221F–U+2222], {U+223C}, [U+22BE–U+22BF], {U+2310}, {U+2319}, [U+2795–U+2796], {U+27C0}, [U+299B–U+29AF], [U+2AEC–U+2AED], [U+E010–U+E011], D
    42 entries (22 Unicode ranges) in postfix form: [U+0021–U+0022], [U+0026–U+0027], {U+0060}, {U+00A8}, {U+00B0}, [U+00B2–U+00B4], [U+00B8–U+00B9], [U+02CA–U+02CB], [U+02D8–U+02DA], {U+02DD}, {U+0311}, [U+2019–U+201B], [U+201D–U+201F], [U+2032–U+2037], {U+2057}, [U+20DB–U+20DC], {U+23CD}, {U+E000}, {U+E004}, {U+E006}, [U+E009–U+E00A], [U+E010–U+E011], E
    26 entries (16 Unicode ranges) in postfix form: [U+005E–U+005F], {U+007E}, {U+00AF}, [U+02C6–U+02C7], {U+02C9}, {U+02CD}, {U+02DC}, {U+02F7}, {U+0302}, {U+2016}, {U+203E}, [U+2322–U+2323], [U+23B4–U+23B5], [U+23DC–U+23E1], {U+2980}, [U+E027–U+E028], F
    25 entries in prefix form: U+0028, U+005B, U+007B, U+007C, U+2308, U+230A, U+2329, U+2772, U+27E6, U+27E8, U+27EA, U+27EC, U+27EE, U+2983, U+2985, U+2987, U+2989, U+298B, U+298D, U+298F, U+2991, U+2993, U+2995, U+2997, U+29FC, G
    25 entries in postfix form: U+0029, U+005D, U+007C, U+007D, U+2309, U+230B, U+232A, U+2773, U+27E7, U+27E9, U+27EB, U+27ED, U+27EF, U+2984, U+2986, U+2988, U+298A, U+298C, U+298E, U+2990, U+2992, U+2994, U+2996, U+2998, U+29FD, H
    22 entries (3 Unicode ranges) in prefix form: [U+222B–U+2233], [U+2A0B–U+2A0F], [U+2A15–U+2A1C], I
    18 entries (5 Unicode ranges) in prefix form: [U+220F–U+2210], [U+22C0–U+22C3], [U+2A00–U+2A09], {U+2AFC}, {U+2AFF}, J
    7 entries (3 Unicode ranges) in prefix form: {U+2211}, {U+2A0A}, [U+2A10–U+2A14], K
    6 entries (3 Unicode ranges) in infix form: {U+005C}, [U+2061–U+2064], {U+2396}, L
    3 entries in infix form: U+002C, U+003A, U+003B, M
    3 entries in prefix form: U+2145, U+2146, U+2202, N
    Mapping from operator (Content, Form) to a category.
    Total size: 560 entries, 622 bytes.
    (assuming characters are UTF-16 and 1-byte range lengths)
    Category  encoding  rspace                 lspace                 properties
    A         0x0       0.2777777777777778em   0.2777777777777778em   stretchy
    B         0x4       0.2222222222222222em   0.2222222222222222em   N/A
    C         0x8       0.16666666666666666em  0.16666666666666666em  N/A
    D         0x1       0                      0                      N/A
    E         0x2       0                      0                      N/A
    F         0x6       0                      0                      stretchy
    G         0x5       0                      0                      stretchy symmetric
    H         0xA       0                      0                      stretchy symmetric
    I         0x9       0.16666666666666666em  0.16666666666666666em  symmetric largeop
    J         0xD       0.05555555555555555em  0.1111111111111111em   symmetric largeop movablelimits
    K         N/A       0.16666666666666666em  0.16666666666666666em  symmetric largeop movablelimits
    L         0xC       0                      0                      N/A
    M         N/A       0                      0.16666666666666666em  N/A
    N         N/A       0.16666666666666666em  0                      N/A
    Operators values for each category.
    The second column provides a 4-bit encoding of the categories
    where the 2 least significant bits encode the form: infix (0), prefix (1) and postfix (2).
    \ No newline at end of file +
    Special TableEntries
    Operators_multichar41 entries (null-terminated UTF-16 strings): {U+0021,U+0021,U+0000}, {U+0021,U+003D,U+0000}, {U+0026,U+0026,U+0000}, {U+002A,U+003D,U+0000}, {U+002B,U+002B,U+0000}, {U+002B,U+003D,U+0000}, {U+002D,U+002D,U+0000}, {U+002D,U+003D,U+0000}, {U+002D,U+003E,U+0000}, {U+002E,U+002E,U+0000}, {U+002E,U+002E,U+002E,U+0000}, {U+002F,U+003D,U+0000}, {U+003A,U+003D,U+0000}, {U+003C,U+003D,U+0000}, {U+003D,U+003D,U+0000}, {U+003E,U+003D,U+0000}, {U+007C,U+007C,U+0000}, {U+007C,U+007C,U+007C,U+0000}, {U+223D,U+0331,U+0000}, {U+2242,U+0338,U+0000}, {U+224E,U+0338,U+0000}, {U+224F,U+0338,U+0000}, {U+2266,U+0338,U+0000}, {U+226A,U+0338,U+0000}, {U+226B,U+0338,U+0000}, {U+227F,U+0338,U+0000}, {U+2282,U+20D2,U+0000}, {U+2283,U+20D2,U+0000}, {U+228F,U+0338,U+0000}, {U+2290,U+0338,U+0000}, {U+29CF,U+0338,U+0000}, {U+29D0,U+0338,U+0000}, {U+2A7D,U+0338,U+0000}, {U+2A7E,U+0338,U+0000}, {U+2AA1,U+0338,U+0000}, {U+2AA2,U+0338,U+0000}, {U+2AAF,U+0338,U+0000}, {U+2AB0,U+0338,U+0000}, {U+2ADD,U+0338,U+0000}, {U+D83B,U+DEF0,U+0000}, {U+D83B,U+DEF1,U+0000},
    Operators_fence57 entries (15 Unicode ranges): [U+0028–U+0029], {U+005B}, {U+005D}, [U+007B–U+007D], [U+0330–U+0331], {U+2016}, [U+2018–U+2019], [U+201C–U+201D], [U+2308–U+230B], [U+2329–U+232A], [U+2772–U+2773], [U+27E6–U+27EF], {U+2980}, [U+2983–U+2998], [U+29FC–U+29FD],
    Operators_separator3 entries: U+002C, U+003B, U+2063,
    Special tables for the operator dictionary.
    Total size: 101 entries, 301 bytes.
    (assuming characters are UTF-16 and 1-byte range lengths)
    (Content, Form) keysCategory
    138 entries (18 Unicode ranges) in infix form: [U+2190–U+2199], [U+219C–U+21AD], [U+21AF–U+21B5], {U+21B9}, [U+21BC–U+21CC], [U+21D0–U+21DD], [U+21E0–U+21F0], {U+21F3}, [U+21F5–U+21F6], [U+21FD–U+21FF], [U+27F0–U+27F1], [U+27F5–U+27FF], [U+290A–U+2910], [U+2912–U+2913], [U+2921–U+2922], [U+294E–U+2961], [U+296E–U+296F], [U+2B45–U+2B46], A
    103 entries (36 Unicode ranges) in infix form: {U+002B}, {U+002D}, {U+002F}, {U+00B1}, {U+00F7}, {U+0322}, {U+0325}, {U+0327}, [U+2212–U+2214], {U+2216}, {U+2218}, {U+2224}, [U+2227–U+222A], {U+2236}, {U+2238}, [U+228C–U+228F], [U+2293–U+2296], {U+2298}, [U+229D–U+229F], [U+22BB–U+22BD], {U+22C4}, {U+22C6}, [U+22CE–U+22CF], [U+22D2–U+22D3], [U+2795–U+2797], {U+27F4}, {U+29BC}, {U+29F6}, [U+2A22–U+2A2E], [U+2A38–U+2A3A], [U+2A40–U+2A4F], [U+2A51–U+2A63], [U+2ADA–U+2ADB], {U+2AFB}, {U+2AFD}, {U+2B32}, B
    89 entries (42 Unicode ranges) in infix form: {U+0025}, {U+002A}, {U+002E}, {U+0040}, {U+00B7}, {U+00D7}, [U+0330–U+0332], {U+0346}, {U+2022}, {U+2043}, {U+2206}, {U+220E}, {U+2217}, [U+223F–U+2240], {U+2297}, {U+2299}, [U+22A0–U+22A1], {U+22C5}, {U+22C7}, [U+22C9–U+22CC], [U+2305–U+2306], [U+25A0–U+25A1], [U+25AA–U+25AB], [U+25AD–U+25B1], [U+2981–U+2982], [U+2999–U+299A], {U+29B5}, [U+29C2–U+29C3], [U+29C9–U+29CD], [U+29D8–U+29D9], {U+29DB}, [U+29DF–U+29E0], {U+29E2}, [U+29E7–U+29ED], [U+29F8–U+29FB], [U+2A1D–U+2A21], [U+2A2F–U+2A37], [U+2A3B–U+2A3D], {U+2A3F}, {U+2A50}, [U+2ADC–U+2ADD], {U+2AFE}, C
    53 entries (22 Unicode ranges) in prefix form: {U+0021}, {U+002B}, {U+002D}, {U+00AC}, {U+00B1}, [U+0330–U+0331], {U+2018}, {U+201C}, [U+2200–U+2201], [U+2203–U+2204], {U+2207}, [U+2212–U+2213], [U+221B–U+221C], [U+221F–U+2222], {U+223C}, [U+22BE–U+22BF], {U+2310}, {U+2319}, [U+2795–U+2796], {U+27C0}, [U+299B–U+29AF], [U+2AEC–U+2AED], D
    42 entries (22 Unicode ranges) in postfix form: [U+0021–U+0022], [U+0026–U+0027], {U+0060}, {U+00A8}, {U+00B0}, [U+00B2–U+00B4], [U+00B8–U+00B9], [U+02CA–U+02CB], [U+02D8–U+02DA], {U+02DD}, {U+0311}, {U+0320}, {U+0324}, {U+0326}, [U+0329–U+032A], [U+0330–U+0331], [U+2019–U+201B], [U+201D–U+201F], [U+2032–U+2037], {U+2057}, [U+20DB–U+20DC], {U+23CD}, E
    26 entries (16 Unicode ranges) in postfix form: [U+005E–U+005F], {U+007E}, {U+00AF}, [U+02C6–U+02C7], {U+02C9}, {U+02CD}, {U+02DC}, {U+02F7}, {U+0302}, [U+0347–U+0348], {U+2016}, {U+203E}, [U+2322–U+2323], [U+23B4–U+23B5], [U+23DC–U+23E1], {U+2980}, F
    25 entries in prefix form: U+0028, U+005B, U+007B, U+007C, U+2308, U+230A, U+2329, U+2772, U+27E6, U+27E8, U+27EA, U+27EC, U+27EE, U+2983, U+2985, U+2987, U+2989, U+298B, U+298D, U+298F, U+2991, U+2993, U+2995, U+2997, U+29FC, G
    25 entries in postfix form: U+0029, U+005D, U+007C, U+007D, U+2309, U+230B, U+232A, U+2773, U+27E7, U+27E9, U+27EB, U+27ED, U+27EF, U+2984, U+2986, U+2988, U+298A, U+298C, U+298E, U+2990, U+2992, U+2994, U+2996, U+2998, U+29FD, H
    22 entries (3 Unicode ranges) in prefix form: [U+222B–U+2233], [U+2A0B–U+2A0F], [U+2A15–U+2A1C], I
    18 entries (5 Unicode ranges) in prefix form: [U+220F–U+2210], [U+22C0–U+22C3], [U+2A00–U+2A09], {U+2AFC}, {U+2AFF}, J
    7 entries (3 Unicode ranges) in prefix form: {U+2211}, {U+2A0A}, [U+2A10–U+2A14], K
    6 entries (3 Unicode ranges) in infix form: {U+005C}, [U+2061–U+2064], {U+2396}, L
    3 entries in infix form: U+002C, U+003A, U+003B, M
    3 entries in prefix form: U+2145, U+2146, U+2202, N
    Mapping from operator (Content, Form) to a category.
    Total size: 560 entries, 622 bytes.
    (assuming characters are UTF-16 and 1-byte range lengths)
    Category  encoding  rspace                 lspace                 properties
    A         0x0       0.2777777777777778em   0.2777777777777778em   stretchy
    B         0x4       0.2222222222222222em   0.2222222222222222em   N/A
    C         0x8       0.16666666666666666em  0.16666666666666666em  N/A
    D         0x1       0                      0                      N/A
    E         0x2       0                      0                      N/A
    F         0x6       0                      0                      stretchy
    G         0x5       0                      0                      stretchy symmetric
    H         0xA       0                      0                      stretchy symmetric
    I         0x9       0.16666666666666666em  0.16666666666666666em  symmetric largeop
    J         0xD       0.05555555555555555em  0.1111111111111111em   symmetric largeop movablelimits
    K         N/A       0.16666666666666666em  0.16666666666666666em  symmetric largeop movablelimits
    L         0xC       0                      0                      N/A
    M         N/A       0                      0.16666666666666666em  N/A
    N         N/A       0.16666666666666666em  0                      N/A
    Operators values for each category.
    The second column provides a 4-bit encoding of the categories
    where the 2 least significant bits encode the form: infix (0), prefix (1) and postfix (2).
    547 entries (221 ranges of length at most 16): [0x0590–0x0599], [0x059C–0x05AB], [0x05AC–0x05AD], [0x05AF–0x05B5], {0x05B9}, [0x05BC–0x05CB], {0x05CC}, [0x05D0–0x05DD], [0x05E0–0x05EF], {0x05F0}, {0x05F3}, [0x05F5–0x05F6], [0x05FD–0x05FF], [0x0BF0–0x0BF1], [0x0BF5–0x0BFF], [0x0D0A–0x0D10], [0x0D12–0x0D13], [0x0D21–0x0D22], [0x0D4E–0x0D5D], [0x0D5E–0x0D61], [0x0D6E–0x0D6F], [0x0F45–0x0F46], {0x402B}, {0x402D}, {0x402F}, {0x40B1}, {0x40F7}, {0x4322}, {0x4325}, {0x4327}, [0x4612–0x4614], {0x4616}, {0x4618}, {0x4624}, [0x4627–0x462A], {0x4636}, {0x4638}, [0x468C–0x468F], [0x4693–0x4696], {0x4698}, [0x469D–0x469F], [0x46BB–0x46BD], {0x46C4}, {0x46C6}, [0x46CE–0x46CF], [0x46D2–0x46D3], [0x4B95–0x4B97], {0x4BF4}, {0x4DBC}, {0x4DF6}, [0x4E22–0x4E2E], [0x4E38–0x4E3A], [0x4E40–0x4E4F], [0x4E51–0x4E60], [0x4E61–0x4E63], [0x4EDA–0x4EDB], {0x4EFB}, {0x4EFD}, {0x4F32}, {0x8025}, {0x802A}, {0x802E}, {0x8040}, {0x80B7}, {0x80D7}, [0x8330–0x8332], {0x8346}, {0x8422}, {0x8443}, {0x8606}, {0x860E}, {0x8617}, [0x863F–0x8640], {0x8697}, {0x8699}, [0x86A0–0x86A1], {0x86C5}, {0x86C7}, [0x86C9–0x86CC], [0x8705–0x8706], [0x89A0–0x89A1], [0x89AA–0x89AB], [0x89AD–0x89B1], [0x8D81–0x8D82], [0x8D99–0x8D9A], {0x8DB5}, [0x8DC2–0x8DC3], [0x8DC9–0x8DCD], [0x8DD8–0x8DD9], {0x8DDB}, [0x8DDF–0x8DE0], {0x8DE2}, [0x8DE7–0x8DED], [0x8DF8–0x8DFB], [0x8E1D–0x8E21], [0x8E2F–0x8E37], [0x8E3B–0x8E3D], {0x8E3F}, {0x8E50}, [0x8EDC–0x8EDD], {0x8EFE}, {0x1021}, {0x102B}, {0x102D}, {0x10AC}, {0x10B1}, [0x1330–0x1331], {0x1418}, {0x141C}, [0x1600–0x1601], [0x1603–0x1604], {0x1607}, [0x1612–0x1613], [0x161B–0x161C], [0x161F–0x1622], {0x163C}, [0x16BE–0x16BF], {0x1710}, {0x1719}, [0x1B95–0x1B96], {0x1BC0}, [0x1D9B–0x1DAA], [0x1DAB–0x1DAF], [0x1EEC–0x1EED], [0x2021–0x2022], [0x2026–0x2027], {0x2060}, {0x20A8}, {0x20B0}, [0x20B2–0x20B4], [0x20B8–0x20B9], [0x22CA–0x22CB], [0x22D8–0x22DA], {0x22DD}, {0x2311}, {0x2320}, {0x2324}, {0x2326}, [0x2329–0x232A], [0x2330–0x2331], [0x2419–0x241B], [0x241D–0x241F], 
[0x2432–0x2437], {0x2457}, [0x24DB–0x24DC], {0x27CD}, [0x605E–0x605F], {0x607E}, {0x60AF}, [0x62C6–0x62C7], {0x62C9}, {0x62CD}, {0x62DC}, {0x62F7}, {0x6302}, [0x6347–0x6348], {0x6416}, {0x643E}, [0x6722–0x6723], [0x67B4–0x67B5], [0x67DC–0x67E1], {0x6D80}, {0x5028}, {0x505B}, [0x507B–0x507C], {0x5708}, {0x570A}, {0x5729}, {0x5B72}, {0x5BE6}, {0x5BE8}, {0x5BEA}, {0x5BEC}, {0x5BEE}, {0x5D83}, {0x5D85}, {0x5D87}, {0x5D89}, {0x5D8B}, {0x5D8D}, {0x5D8F}, {0x5D91}, {0x5D93}, {0x5D95}, {0x5D97}, {0x5DFC}, {0xA029}, {0xA05D}, [0xA07C–0xA07D], {0xA709}, {0xA70B}, {0xA72A}, {0xAB73}, {0xABE7}, {0xABE9}, {0xABEB}, {0xABED}, {0xABEF}, {0xAD84}, {0xAD86}, {0xAD88}, {0xAD8A}, {0xAD8C}, {0xAD8E}, {0xAD90}, {0xAD92}, {0xAD94}, {0xAD96}, {0xAD98}, {0xADFD}, [0x962B–0x9633], [0x9E0B–0x9E0F], [0x9E15–0x9E1C], [0xD60F–0xD610], [0xD6C0–0xD6C3], [0xDE00–0xDE09], {0xDEFC}, {0xDEFF}, {0xC05C}, [0xC461–0xC464], {0xC796},
    List of entries for the largest categories.
    Key is Entry % 0x1000, category encoding is Entry / 0x1000.
    Total size: 547 entries, 553 bytes
    (assuming 4 bits for range lengths).
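
As an illustration, an entry of this list can be decoded as sketched below. The sketch follows the generating script, which shifts code points in U+2000–U+2BFF down by 0x1C00 to obtain a 12-bit key and then adds the 4-bit category encoding shifted left by 12 bits. The function name `decode_entry` and the returned tuple layout are hypothetical.

```python
FORMS = ("infix", "prefix", "postfix")


def decode_entry(entry):
    """Split a 16-bit compact entry into (code point, form, category encoding)."""
    key = entry & 0xFFF   # low 12 bits: shifted code point
    encoding = entry >> 12  # high 4 bits: category encoding
    # Keys above 0x3FF were shifted down from the U+2000–U+2BFF block.
    code_point = key if key <= 0x3FF else key + 0x1C00
    # The 2 least significant bits of the encoding give the form.
    form = FORMS[encoding & 0x3]
    return code_point, form, encoding
```

For instance, `decode_entry(0x0590)` yields U+2190 in infix form with encoding 0x0, which matches the first range of category A.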
    \ No newline at end of file diff --git a/tables/operator-dictionary.py b/tables/operator-dictionary.py index cc9c13f..aebdd32 100755 --- a/tables/operator-dictionary.py +++ b/tables/operator-dictionary.py @@ -2,6 +2,7 @@ from lxml import etree from download import downloadUnicodeXML +from math import ceil import operator import json @@ -77,7 +78,7 @@ def dumpKnownTables(fenceAndSeparators): table = item["singleChar"] print(" singleChar (%d): " % len(table), end="") - for unicodeRange in toUnicodeRanges(table): + for unicodeRange in toRanges(table): print("%s, " % stringifyRange(unicodeRange), end="") if "multipleChar" in item: @@ -90,33 +91,33 @@ def dumpKnownTables(fenceAndSeparators): print("") def stringifyRange(unicodeRange): - assert unicodeRange[1] - unicodeRange[0] < 256 if unicodeRange[0] == unicodeRange[1]: return "{%s}" % toHexa(unicodeRange[0]) else: return "[%s–%s]" % (toHexa(unicodeRange[0]), toHexa(unicodeRange[1])) -def toUnicodeRanges(operators): - unicodeRange = None +def toRanges(operators, max_range_length = 256): + current_range = None ranges = [] for character in operators: - if not unicodeRange: - unicodeRange = character, character + if not current_range: + current_range = character, character else: - if unicodeRange[1] + 1 == character: - unicodeRange = unicodeRange[0], character + if (current_range[1] + 1 - current_range[0] < max_range_length and + current_range[1] + 1 == character): + current_range = current_range[0], character else: - ranges.append(unicodeRange) - unicodeRange = character, character + ranges.append(current_range) + current_range = character, character - if unicodeRange: - ranges.append(unicodeRange) + if current_range: + ranges.append(current_range) return ranges def printCodePointStats(): - ranges=[(0x0000, 0x1FFF), (0x2000, 0x2FFF), (0x3000, 0xFFFF)] + ranges=[(0x0000, 0x1FFF), (0x2000, 0x2FFF)] minmax=[] for r in ranges: minmax.append([r[1], r[0]]) @@ -134,14 +135,14 @@ def printCodePointStats(): print(" [%s–%s] 
(length 0x%04X)" % (toHexa(r[0]), toHexa(r[1]), r[1] - r[0] + 1)) s += r[1] - r[0] + 1 - print("Total: %04X different code points\n" % s) + print("Total: 0x%04X different code points\n" % s) def printRangeStats(): print("The max of codePointEnd - codePointStart for ranges are ") maxDeltaTotal = 0 for name in knownTables: maxDelta = 0 - for unicodeRange in toUnicodeRanges(knownTables[name]["singleChar"]): + for unicodeRange in toRanges(knownTables[name]["singleChar"]): maxDelta = max(maxDelta, unicodeRange[1] - unicodeRange[0]) print(maxDelta, end=" ") maxDeltaTotal = max(maxDeltaTotal, maxDelta) @@ -476,7 +477,7 @@ def serializeValue(value, fence, separator): md.write(""); md.write("\n"); -md.write('
    Mapping from operator (Content, Form) to properties.
    Total size: %d entries, ≥ %d bytes
    (assuming \'Content\' uses at least one UTF-16 character, \'Form\' 2 bits, spacing 3 bits and properties 3 bits).
    ' % (totalEntryCount, totalEntryCount * (16 + 2 + 3 + 3)/8)) +md.write('
    Mapping from operator (Content, Form) to properties.
    Total size: %d entries, ≥ %d bytes
    (assuming \'Content\' uses at least one UTF-16 character, \'Form\' 2 bits, spacing 3 bits and properties 3 bits).
    ' % (totalEntryCount, ceil(totalEntryCount * (16 + 2 + 3 + 3)/8.))) md.write('') print("done."); ################################################################################ @@ -507,7 +508,7 @@ def serializeValue(value, fence, separator): knownTables[name]["singleChar"].sort() # Convert multiChar to singleChar -reservedBlock = (0xE000, 0xF8FF) +reservedBlock = (0x0320, 0x03FF) for name in knownTables: if "multipleChar" in knownTables[name]: for entry in knownTables[name]["multipleChar"]: @@ -519,6 +520,11 @@ def serializeValue(value, fence, separator): for name in knownTables: knownTables[name]["singleChar"].sort() +# Print more statistics +print() +printCodePointStats() +printRangeStats() + # Print the compact dictionary print("Generate operator-dictionary-compact.html...", end=" "); md = open("operator-dictionary-compact.html", "w") @@ -553,7 +559,7 @@ def serializeValue(value, fence, separator): count = len(knownTables[name]["singleChar"]) md.write("") md.write("Operators_%s" % name); - ranges = toUnicodeRanges(knownTables[name]["singleChar"]) + ranges = toRanges(knownTables[name]["singleChar"]) if (3 * len(ranges) < 2 * count): md.write("%d entries (%d Unicode ranges): " % (count, len(ranges))) for entry in ranges: @@ -587,7 +593,7 @@ def serializeValue(value, fence, separator): md.write("") totalEntryCount += count - ranges = toUnicodeRanges(knownTables[name]["singleChar"]) + ranges = toRanges(knownTables[name]["singleChar"]) if (3 * len(ranges) < 2 * count): md.write("%d entries (%d Unicode ranges) in %s form: " % (count, len(ranges), knownTables[name]["value"]["form"])) for entry in ranges: @@ -606,6 +612,14 @@ def serializeValue(value, fence, separator): md.write('
    Mapping from operator (Content, Form) to a category.
    Total size: %d entries, %d bytes.
    (assuming characters are UTF-16 and 1-byte range lengths)
    ' % (totalEntryCount, totalBytes)) md.write('') +def formValueFromString(form): + # Map a form name to its 2-bit encoding. + if form == "infix": + return 0 + if form == "prefix": + return 1 + assert form == "postfix" + return 2 category_for_form = [0, 0, 0] value_index = 0 @@ -620,13 +634,7 @@ def serializeValue(value, fence, separator): for entry in knownTables[name]["singleChar"]: md.write(""); md.write("%s" % chr(ord('A') + value_index)) - form = knownTables[name]["value"]["form"] - if form == "infix": - form = 0 - elif form == "prefix": - form = 1 - elif form == "postfix": - form = 2 + form = formValueFromString(knownTables[name]["value"]["form"]) if category_for_form[form] >= 4: md.write("N/A") else: @@ -644,7 +652,43 @@ def serializeValue(value, fence, separator): print("done."); -# Print more statistics -print() -printCodePointStats() -printRangeStats() +# Calculate compact form for the largest categories. +compact_table = [] +category_for_form = [0, 0, 0] +totalEntryCount = 0 +for name, item in sorted(knownTables.items(), + key=(lambda v: len(v[1]["singleChar"])), + reverse=True): + if name in ["fence", "separator"]: + continue + count = len(knownTables[name]["singleChar"]) + form = formValueFromString(knownTables[name]["value"]["form"]) + if category_for_form[form] >= 4: + continue + totalEntryCount += count + hexa = form + (category_for_form[form] << 2) + category_for_form[form] += 1 + + for entry in knownTables[name]["singleChar"]: + assert entry <= 0x3FF or (0x2000 <= entry and entry <= 0x2BFF) + if 0x2000 <= entry and entry <= 0x2BFF: + entry = entry - 0x1C00 + entry = entry + (hexa << 12) + compact_table.append(entry) + +bits_per_range = 4 +compact_table = toRanges(compact_table, 1 << bits_per_range) +rangeCount = 0 + +md.write('
    ') +md.write('%d entries (%d ranges of length at most %d): ' % (totalEntryCount, len(compact_table), 1 << bits_per_range)); +for r in compact_table: + if r[0] == r[1]: + md.write('{0x%04X}, ' % r[0]) + else: + md.write('[0x%04X–0x%04X], ' % (r[0], r[1])) + rangeCount += 1 + +md.write(''); +md.write('
    List of entries for the largest categories.
    Key is Entry %% 0x1000, category encoding is Entry / 0x1000.
    Total size: %d entries, %d bytes
    (assuming %d bits for range lengths).
    ' % (totalEntryCount, ceil((16+bits_per_range) * rangeCount / 8.), bits_per_range)) +md.write('
    ')