Operator dictionary: Provide a compact form in MathML Core #176
Comments
I'm not sure why you opened this as a separate issue from #161. Is #161 about what should go in the spec (which I think is a pointer to some program-friendly file format) and this issue is about what is in that external file/how it is organized? Just in case you weren't aware, gperf (GNU) will generate a perfect hash for you. Perfect hashes can sometimes use a fair amount of space, so an alternative is "quasi-perfect" hashing. That allows for at most two probes into the hash table and can often significantly reduce the size of the table. There's probably an implementation that generates a table/hash function for doing quasi-perfect hashing, but I didn't see it on the first page of a Google search... |
#161 was only about removing the priority property, but as I said there, I would open a separate, more general issue later, which is what I'm doing now. The other issue has already gone off-topic. |
I think @bfgeek's proposal was actually a minimal perfect hash table https://en.wikipedia.org/wiki/Perfect_hash_function#Minimal_perfect_hash_function (?) |
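For readers unfamiliar with the term, here is a minimal sketch of a minimal perfect hash: n keys map onto exactly n slots with no collisions. The hash family `(a * k % PRIME) % n`, the brute-force search, the five sample operators and the helper names are all assumptions made for illustration; this is not what gperf generates and not what was proposed for the spec.

```python
PRIME = 65537  # a prime larger than any BMP code point

def find_multiplier(keys):
    """Brute-force search for `a` such that k -> (a * k % PRIME) % n has no collisions."""
    n = len(keys)
    for a in range(1, PRIME):
        slots = {(a * k % PRIME) % n for k in keys}
        if len(slots) == n:  # every key landed in its own slot
            return a
    return None  # no suitable multiplier in this family

def slot(key, a, n):
    """Slot index of `key` for multiplier `a` and table size `n`."""
    return (a * key % PRIME) % n

# Five sample operators (illustrative only, not a real category of the dictionary).
KEYS = [ord(c) for c in "+-=>|"]
A = find_multiplier(KEYS)

# Minimal table: exactly one slot per key, no empty slots.
TABLE = [None] * len(KEYS)
for k in KEYS:
    TABLE[slot(k, A, len(KEYS))] = k

def is_known(ch):
    """Single probe; the stored key must be compared to reject non-members."""
    return TABLE[slot(ord(ch), A, len(KEYS))] == ord(ch)
```

The lookup costs one multiplication, two modulo operations and one comparison, at the price of storing the key in each slot so that characters outside the set can be rejected.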
Consensus from yesterday's meeting: @davidcarlisle will try to check the values to make them more consistent and reduce special cases. |
@davidcarlisle How many categories remain after your changes? |
@fred-wang The changes are mainly from @NSoiffer; I've just been pushing through the resulting updated files, and I believe Neil is hoping to do at least one more round on this. I also updated to Unicode 13, but I'm not expecting that to affect MathML. However, as things stand now, if you ignore priority= (which isn't really a mathml-core thing) there are 17 different combinations of https://mathml-refresh.github.io/xml-entities/opdict.html The report including priority and showing differences from Unicode TR25 is below. The first part shows that the priority values still need a bit of rationalisation, but that's on Neil's radar (and doesn't affect core). The second part, showing differences from the Mathclass-15 file, is probably OK, but we should (perhaps) coordinate with Murray and Barbara to get the two back in sync at some point.
|
@davidcarlisle Do you plan to merge more? In particular how important is it to keep specific categories for some isolated values with lspace != rspace? |
I would probably suggest merging some of them, but as I say, @NSoiffer has more changes planned, so I was planning on waiting for that to end before really reviewing this. Certainly TeX gets away with fewer space categories, with only three nonzero spaces ever automatically added: thin, medium and thick, which are theoretically user-settable but are nearly always the LaTeX and plain TeX defaults
where 1mu = 1/18 em |
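As a worked illustration of the "1mu = 1/18 em" conversion, the sketch below converts the natural widths of the plain TeX defaults (`\thinmuskip=3mu`, `\medmuskip=4mu`, `\thickmuskip=5mu`) to em. The stretch and shrink components are ignored, and the dictionary and function names are made up for this example.

```python
from fractions import Fraction

MU_PER_EM = 18  # 1mu = 1/18 em

# Natural widths of the plain TeX defaults, in mu (glue stretch/shrink ignored).
TEX_DEFAULT_MU = {"thin": 3, "medium": 4, "thick": 5}

def mu_to_em(mu):
    """Convert a width in math units (mu) to em, as an exact fraction."""
    return Fraction(mu, MU_PER_EM)

for name, mu in TEX_DEFAULT_MU.items():
    em = mu_to_em(mu)
    print(f"{name}: {mu}mu = {em} em (about {float(em):.3f}em)")
```

So the three automatic spaces are 1/6 em, 2/9 em and 5/18 em respectively.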
OK, let's wait for @NSoiffer. Here is a quick analysis on my side:
I think complement "∁" has prefix form and should be moved into an existing prefix category. Then I'm not sure how important it is to keep a single category for ":". It seems fine to me to use a default symmetric spacing for this one; it can be used as a separator or as a binary operator (note that in text, some languages use a space before ":"). So I would merge it into "form:infix lspace:2 rspace:2", for example, or "form:infix lspace:1 rspace:1".
I guess unbalanced spacing for separators is still important, so we can't remove the category "form:infix lspace:0 rspace:3", right?
How important is the exact spacing for postfix "♭", "♮", "♯", "!" and "!!"? They don't seem to have a clear default spacing to me. Can't we merge them into a single category with zero lspace and nonzero rspace? Or even just into "form:postfix lspace:0 rspace:0"?
How important is the category "form:prefix lspace:1 rspace:1"? I don't think people use this square root operator as a single mo; they would instead use the msqrt or mroot element. So I would just drop them from the operator dictionary or otherwise merge them into another arbitrary existing prefix category.
I still don't quite understand the distinction between "form:prefix lspace:1 rspace:2" and "form:prefix lspace:3 rspace:3". Maybe it's integral vs. non-integral, but treating ∑ and ∏ differently seems dubious to me. Can we merge them into a single category?
I guess unbalanced spacing for differential operators is still important, so we can't remove "form:prefix lspace:3 rspace:0", right?
How important is "form:prefix lspace:2 rspace:1"? Can't we merge it with another existing category with balanced spacing or with lspace > rspace? |
Some characters will likely go away including the musical notation signs
(and hence their spacing character), but I'm spending time for each
character trying to find whether they have a mathematical usage and if so,
what it is. I'm currently sifting through the priority 265 symbols and
either removing them or moving them to a more appropriate place. That
sometimes involves changing their form and also their spacing. Once I'm
done with that, I'm going to review spacing for what remains.
|
"a:b" in TeX seems to give wide symmetric spacing, probably colon should be in lspace:4 rspace:4 or lspace:3 rspace:3? What is the logic behind the two categories "form:prefix lspace:1 rspace:2" and "form:prefix lspace:3 rspace:3" for largeop/integral ? It seems some "sums" are in the former. |
This is the current output from the compact form script ( https://mathml-refresh.github.io/xml-entities/opdict.html#compressed only gives spacing, not properties):
|
@fred-wang : why break out multichar chars in the table into their own category? I thought the goal was to minimize the size of the operator dictionary in the core spec. Most of the entries would belong to existing groupings. |
I've been trying to decide what to do about colon, which is why I haven't changed its values yet. I've written down what I found in #87 (comment), which is where this discussion properly belongs. |
The final script still depends on what the possible values will be. The general rule of thumb is still to reduce the possible values as much as possible, independently of how the keys will be handled. Regarding keys, strings in browsers are heavy objects, see [1] [2]. So to minimize space it seems optimal to use single UTF-16 characters (only 2 bytes, less than any generic 16-bit string), which cover most of the operators except the non-BMP ones (only two of them, so they can easily be handled separately) and the multi-character ones (for which we can maybe find a clever handling, e.g. the non-ASCII strings always have 'lspace': 5, 'rspace': 5). [1] https://source.chromium.org/chromium/chromium/src/+/master:third_party/blink/renderer/platform/wtf/text/README.md (webkit is similar) |
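The three kinds of keys described above (single BMP character, non-BMP character, multiple characters) can be told apart as in this sketch. The function names and return labels are hypothetical; U+1EEF0 is used as an example of a non-BMP operator.

```python
def utf16_units(op):
    """Number of 16-bit code units needed to store `op` in UTF-16."""
    return len(op.encode("utf-16-le")) // 2

def classify_key(op):
    """Classify an operator string by how compactly its key can be stored."""
    if len(op) == 0:
        raise ValueError("empty operator")
    if len(op) > 1:            # Python's len() counts code points, not code units
        return "multi-char"    # e.g. "&&", "<=": handled separately
    if ord(op) > 0xFFFF:
        return "non-bmp"       # needs a surrogate pair: handled separately
    return "bmp"               # fits in a single 2-byte UTF-16 code unit

for op in ["+", "\u2211", "\U0001EEF0", "&&"]:
    print(repr(op), classify_key(op), utf16_units(op))
```

Only the "bmp" keys fit the 2-byte representation; the other two classes are the ones the comment suggests handling through separate, smaller tables.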
@NSoiffer There are still inconsistencies with largeop. Some of them are symmetric+largeop+movablelimits (e.g. anticlockwise integration) others are just symmetric+largeop (e.g. integral). Why can't we just make all of them symmetric+largeop+movablelimits? |
I'm not sure people should use radical as mo, can we please remove them? Or at least make square root not stretchy so we don't have a special case: √ (square root): form prefix, priority 845, lspace 1, rspace 1, stretchy |
Can we move U+007C to any other existing category? |
Script updated a bit (fences/separators are handled in a separate table now). This is the current output. I feel like largeop could still be made more consistent and that we could reduce special cases (cf. the otherEntries table):
|
Trying to put multichars into existing categories:
|
Can these three be moved into existing categories? otherEntriesWithMultipleCharacters 3
Also, do you plan to review the multi char operators too? I thought we agreed some of them probably don't make sense... |
This removes entries that just use the default values. w3c/mathml#176
First attempt to make the dictionary more compact: https://mathml-refresh.github.io/mathml-core/#operator-dictionary
|
These are already covered by math radical constructions. w3c/mathml#176
This makes things consistent with other stretchy fences. w3c/mathml#176
…space="4" This is consistent with other similar operators /=, *= or >= w3c/mathml#176
This makes it consistent with other operators like '@'. w3c/mathml#176
These are slashes and bars that make sense as non-stretchy too, with no clear default. However, non-stretchy is more consistent with similar symbols: U+2223 U+2225 U+2758 U+29F5 U+2AFB U+2AFD U+29F8 U+29F9 U+005C w3c/mathml#176
From Wikipedia this does not seem to be an operator: "This symbol is from ISO/IEC 9995 and is intended for use on a keyboard to indicate a key that performs decimal separation." (https://en.wikipedia.org/wiki/Decimal_separator#Other_numeral_systems) Even if it is a decimal separator, such a separator is generally used directly inside the number together with the digits (e.g. <mn> in MathML), not as an operator. The dictionary does not contain other similar separators like the Arabic ones. w3c/mathml#176
@NSoiffer @davidcarlisle I submitted a couple of PRs to make things more consistent: mathml-refresh/xml-entities#24 After these changes, I believe the remaining entries could be classified as: infix lspace=0 rspace=0 (invisible op), which indeed seems to deserve its own category. (I haven't tried to run the script with all the changes merged, but I'm willing to do it and check again after this is done) |
* "|||" does not seem to be used as a programming language operator. * For (stretchy) fences, U+2980 is more appropriate than "|||" w3c/mathml#143 w3c/mathml#176
It seems to be used as a punctuation sign rather than an operator. The ellipsis character … U+2026 seems more appropriate for that purpose. w3c/mathml#143 w3c/mathml#176
Fortunately, only 2 of them (arabic ones) need to be treated specially. w3c/mathml#176
This is done: https://mathml-refresh.github.io/mathml-core/#operator-dictionary-compact We still need special handling for a few edge cases, but the main subset (553 entries) is now treated uniformly. That subset can be encoded as a 560-byte table and looked up with a binary search on 224 elements (8 comparisons). Alternatively, this main subset can be encoded as a perfect hash function with a table using 16 bits per entry, but I'm not sure whether the extra overhead (memory and complexity) is worth it. A note gives suggestions to implementers. |
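To make the binary-search variant concrete, here is a small sketch of a lookup over a sorted table of contiguous code point ranges. The three ranges and the category labels are invented for illustration and are not the table from the spec; only the 224-element count and the 8-comparison bound come from the comment above.

```python
import math
from bisect import bisect_right

# Illustrative data only: (first code point, number of code points, category label).
RANGES = [
    (0x002B, 1, "B"),  # '+'
    (0x2190, 6, "A"),  # U+2190..U+2195
    (0x2208, 6, "R"),  # U+2208..U+220D
]
STARTS = [start for start, _, _ in RANGES]  # sorted range starts, the search keys

def lookup(ch):
    """Return the category of `ch`, or None if it is in no range."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i < 0:
        return None
    start, count, category = RANGES[i]
    return category if cp < start + count else None

# A binary search over 224 sorted elements needs at most ceil(log2(224)) halvings.
MAX_COMPARISONS = math.ceil(math.log2(224))
```

Storing one entry per contiguous range rather than one per operator is what lets 553 operators shrink to a few hundred search elements.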
As discussed in #161 ; we can remove the priority property, ignore the infix entries that just use the default values and try to make entries a bit more consistent. Then we should be able to describe the operator dictionary in a more compact way. This would help implementers to calculate default values without relying on a huge table (more than 1100 entries).
As a reminder, we agreed in #143 to keep entries with multiple characters, so they would need to be handled separately. @davidcarlisle Can you please check the operators from the otherEntries table and indicate whether we could actually integrate them in one of the existing larger tables? Fixing #6 and #151 would probably help here.
I was also discussing this with @bfgeek during BlinkOn 11, and he suggested we could even try to describe it as a perfect hash table, for example by relying on the fact that many entries are in contiguous Unicode ranges. If the hash is simple enough, that could make lookup faster than binary search. I modified my script to dump these Unicode ranges. Below is what I obtain with the current state of the operator dictionary.
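The script output is not reproduced here. As a rough idea of how contiguous Unicode ranges can be extracted from a set of operator code points, here is a hypothetical sketch; it is not the author's script, and the sample code points are chosen only for illustration.

```python
def contiguous_ranges(code_points):
    """Group code points into sorted (first, last) runs of consecutive values."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [(first, last) for first, last in ranges]

# Sample input (illustrative only).
sample = [0x2192, 0x2190, 0x2191, 0x002B, 0x2208]
for first, last in contiguous_ranges(sample):
    print(f"U+{first:04X}..U+{last:04X} ({last - first + 1} entries)")
```

In practice the grouping would be done per category (same form, lspace, rspace and properties), so that each run can share a single set of values.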