Add &nnbsp; entity for U+202F #5121

Open
ygoe opened this issue Dec 3, 2019 · 22 comments · Fixed by #7071
Labels
addition/proposal New features or enhancements i18n-alreq Notifies Arabic script experts of relevant issues i18n-amlreq Notifies experts in languages of the Americas of relevant issues i18n-mlreq Notifies traditional Mongolian script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. needs implementer interest Moving the issue forward requires implementers to express interest topic: parser

Comments

@ygoe

ygoe commented Dec 3, 2019

There's &nbsp; for U+00A0. It's a full-width no-break space. It can be used between numbers and their short unit names, or in other places.

Typography and regional norms require (or at least recommend) using a thin no-break space (or narrow no-break space) in several places:

  • As thousands separator, Source or DIN 5008 (to avoid ambiguous presentation of point or comma)
  • Between abbreviated words like “z. B.” (German: zum Beispiel), Source
  • As fine space before certain punctuation in French, Source

(These are the first and best sources I could find now. There may be better or more authoritative sources available, but they're usually hard to find.)

While it is technically possible to create a keyboard layout that produces this character, not many users have this installed and even then it's hard to distinguish it from other space characters when reading and revising text. Most editors don't even show a replacement symbol for this space character.

AFAIK Wikipedia suggests writing &nbsp; in these places. And that's probably a good idea in team projects as well. But this is actually the wrong character in these places.

To use the correct narrow no-break space, one has to use a different HTML entity representation, like &#8239; or &#x202F;, which are frankly hard to remember or recognise.
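For reference, the two numeric character references for U+202F are &#8239; (decimal) and &#x202F; (hexadecimal). A minimal sketch in Python, assuming the standard library's `html.unescape` as a stand-in for an HTML5-style decoder, shows that both decode to the same code point while &nbsp; decodes to U+00A0:

```python
import html

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE
NBSP = "\u00a0"   # NO-BREAK SPACE

# Both numeric forms decode to the same character.
decimal_form = html.unescape("10&#8239;000")
hex_form = html.unescape("10&#x202F;000")

# The existing named reference decodes to U+00A0, which is a different character.
named_nbsp = html.unescape("10&nbsp;000")

print(decimal_form == hex_form == f"10{NNBSP}000")  # True
print(named_nbsp == f"10{NBSP}000")                 # True

# Deriving both numeric spellings from the code point itself.
print(f"&#{ord(NNBSP)}; / &#x{ord(NNBSP):X};")      # &#8239; / &#x202F;
```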

As a solution, the new entity &nnbsp; should be added to HTML to make it easy to write readable text following the correct typographic rules and recommendations.

@annevk annevk added addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest topic: parser labels Dec 3, 2019
@kosek

kosek commented Dec 18, 2019

If a new entity is added, the effort should be coordinated with MathML to keep the entity definitions synchronized -- https://w3c.github.io/xml-entities/

@ygoe
Author

ygoe commented Jun 7, 2020

Mozilla is not interested in this. I guess that's a bad starting point already? I don't have the best experiences with the Chrome developers, maybe I'll try it there anyway.

Unfortunately, entities are not extensible in HTML, so I can't even run my own little happy solution.

@Celdron

Celdron commented Aug 10, 2020

If the HTML standard evolves, Mozilla and others must follow the new specification; that much is evident.

I'm currently interested in having an &nnbsp; entity, or an equivalent, for a French wiki project, as the narrow no-break space is recommended in some cases, as explained by ygoe.

Furthermore, HTML entities exist for numerous characters that are, in my opinion, almost never used, like &prec; (≺) and such.

@r12a r12a added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-alreq Notifies Arabic script experts of relevant issues labels Aug 11, 2020
@r12a

r12a commented Aug 11, 2020

In my opinion this would be extremely useful for French authors, but also for other languages. The NNBSP character was initially added to Unicode for Mongolian suffix handling, where it is important to visually distinguish between spaces separating suffixes and those separating words. It is also being proposed as an ideal fit for a morphological separator in the numerous languages written in the Canadian Aboriginal script (see w3c/amlreq#4). An entity would significantly help authors produce correct (and better machine-readable) text in all these languages.

[@annevk could you add i18n-mlreq and i18n-amlreq labels to the repo, so i can alert those folks to the discussion? Thanks.]

Here is an extension of this issue, which i can raise in a new issue if preferred.

There are other invisible characters for which a named character reference would be very useful for producing correctly authored Unicode text, for the same reasons as mentioned in the first comment. Here, for example, is a list of formatting characters used for Arabic, but most are essential characters for all RTL script-based languages.

Characters with entities:

&zwj; (U+200D)
&zwnj; (U+200C)
&rlm; (U+200F)
&lrm; (U+200E)

Characters without entities:
RLI
LRI
FSI
PDI
RLE
LRE
PDF
RLM
LRM
CGJ
ALM

Keyboards generally don't address the problem of inputting the characters, but it's also a problem that the characters themselves are invisible. It would really help to have Named character references. As someone who works with people who use these languages, and works with them myself, it seems to me that from a user's perspective it would be well worth the effort to add them. I don't remember why that hasn't happened before now.

@annevk
Member

annevk commented Aug 14, 2020

(New labels are to be introduced through https://github.com/whatwg/meta.)

@xfq
Contributor

xfq commented Aug 15, 2020

> (New labels are to be introduced through https://github.com/whatwg/meta.)

I just filed whatwg/meta#182

@hsivonen hsivonen added i18n-amlreq Notifies experts in languages of the Americas of relevant issues i18n-mlreq Notifies traditional Mongolian script experts of relevant issues labels Oct 30, 2020
@hsivonen
Member

I believe I've commented previously along the following lines when this has come up:

  1. For wiki projects, it's irrelevant whether this is in HTML. The wiki software processes the wiki syntax before generating HTML output, so wiki software can introduce whatever macro expansions its developers see fit and users find useful.
  2. In the case of HTML itself, I think the backward-compatibility characteristics of this feature request are bad. The requested feature doesn't expand the expressiveness of HTML in any way: You can already express U+202F unescaped in UTF-8 or escaped as a numeric reference. However, if a named entity were added, it would break in the currently-existing HTML parsers (not only in the currently-existing browsers). This could lead either to unwanted breakage or to non-usage of the feature (i.e. using the numeric form or unescaped UTF-8 anyway for better compat).
  3. Making this change would set a precedent for others to request named entities for characters they find important, causing a repeat of the previous point over and over again.
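The compatibility concern in point 2 can be seen with any existing decoder. The sketch below uses Python's `html.unescape` and the `html.entities.html5` table purely as an example of a currently-existing parser that implements the HTML5 named reference list; it makes no claim about how any particular browser behaves:

```python
import html
from html.entities import html5

# The table shipped with the parser knows &nbsp; but not the requested name.
print("nbsp;" in html5)    # True
print("nnbsp;" in html5)   # False

# A known reference is decoded to U+00A0.
known = html.unescape("5&nbsp;km")
print(known == "5\u00a0km")  # True

# The unknown reference is passed through as literal text.
unknown = html.unescape("5&nnbsp;km")
print(unknown)             # 5&nnbsp;km
```

Any parser carrying a fixed copy of the table would keep doing this until its copy was updated.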

@annevk
Member

annevk commented Oct 30, 2020

Curious to hear what others think, but I tend to agree. Perhaps the best course of action here would be to update https://github.com/whatwg/html/blob/master/FAQ.md and close these types of feature requests.

@fantasai
Contributor

fantasai commented Oct 30, 2020

@hsivonen I think what makes this request a bit different from others is that it's for invisible characters. As @r12a points out, it's hard to work with invisible characters. And letting wiki markup handle it isn't helpful at all: this is something that needs to work across all input modes into HTML, because it has to be reliable and consistent to be useful to the people who need them.

So while I understand your general premise about the update cycle being, potentially, 5 years or so, I think it's worth it in this case. If we want to take the time to batch up all the invisible characters we need to care about so we can do it at once, let's do that and make a coordinated update to the parser that makes languages that need invisible characters easier to typeset in HTML.

@ygoe
Author

ygoe commented Nov 1, 2020

What wikis or any other applications do is entirely irrelevant here. And following @hsivonen's argument, any progress is bad. So why care at all? Just leave it forever as it was defined some 30 years ago. Never change a running system (which is generally bad advice).

I'm fully aware that not all existing HTML parsers and renderers will properly handle this overnight when it's added. It'll take time. But we're in the fortunate (and also unfortunate) situation that the number of relevant HTML parsers in use is very limited, and these are actively maintained and automatically updated most of the time. So changes like this will eventually trickle through to all users and in a few years we can benefit from it without worrying too much. If you're not willing to wait such a long time, you shouldn't work in such projects. Web projects already have a large number of dependencies on browsers and this could be just one of them. As soon as you discover that all browsers that support everything else you already need also support this entity, you can safely use it.

Also, of course I can use any Unicode character directly. But this one hasn't made it onto physical or software-defined keyboards. Neither has NBSP. Or SHY. Or MINUS. So this argument is moot. Also, of course I can escape any Unicode character by its code point value. But nobody will remember those numbers, which means that 1. nobody will be able to fluently write these characters and 2. nobody will be able to fluently read and understand them. This is about as big a usability failure as it can get. Then, we already have similar entities, like NBSP. Why do they exist? I imagine they exist because they cannot be written with keyboards, their code points cannot be remembered, NBSP is even visually indistinguishable from a more common character (SP), and its use is sometimes required.

While not being strictly "required" and not used as often, NNBSP falls exactly in the same category. So I definitely see reason for its existence as an entity. On the other hand, it doesn't hurt anybody. Any undefined HTML entity is invalid markup, and the "nnbsp" entity is undefined, so it can safely be assigned. As could other invisible Unicode whitespace, like some zero-width characters that affect wrapping and/or hyphenation.

@hsivonen
Member

hsivonen commented Nov 2, 2020

> But this one hasn't made it onto physical or software-defined keyboards.

Why is that?

@Crissov

Crissov commented Mar 14, 2021

In addition to what @fantasai said: for some characters, when no named entity reference is available, the real-world choice is not between direct UTF encoding and a numeric character reference, but between the proper character and some inferior replacement character. For invisible characters in particular, that replacement is either a plain space or nothing.

@domenic domenic added the agenda+ To be discussed at a triage meeting label Jul 22, 2021
@past past removed the agenda+ To be discussed at a triage meeting label Sep 2, 2021
domenic added a commit that referenced this issue Sep 14, 2021
domenic added a commit that referenced this issue Sep 20, 2021
@aphillips
Contributor

I was actioned by I18N to reopen this issue.

We are well aware of #7071 which notes that HTML will not add new named character references. The argument in favor of that policy is that newly added named entities would be broken in all parsers (not just browsers) until such time as the parsers adopted the change and that this would be a barrier to use (users would not adopt the new entities because they do not work).

The sense of I18N is that we want to reopen the discussion anyway. We have a particular interest in the new isolating bidi controls, although other invisible characters are also in this request. Invisible characters are hard to use and harder to manage when authoring a page. When using NCRs, the user must memorize the code point number, which is more prone to error. Most of these characters have memorable short names that lend themselves to entities, such as RLI for U+2067 RIGHT TO LEFT ISOLATE.
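As an illustration of how short names could map onto numeric character references today, here is a minimal authoring-side sketch. The mnemonic names (`&rli;`, `&nnbsp;`, and so on) and the function `expand_mnemonics` are hypothetical and are not defined by HTML; only the code points are taken from Unicode. The idea is to expand the names before the markup ever reaches an HTML parser, so existing parsers only see references they already understand:

```python
import re

# Hypothetical mnemonics; none of these names are defined by the HTML Standard.
MNEMONICS = {
    "nnbsp": 0x202F,  # NARROW NO-BREAK SPACE
    "lri": 0x2066,    # LEFT-TO-RIGHT ISOLATE
    "rli": 0x2067,    # RIGHT-TO-LEFT ISOLATE
    "fsi": 0x2068,    # FIRST STRONG ISOLATE
    "pdi": 0x2069,    # POP DIRECTIONAL ISOLATE
    "lre": 0x202A,    # LEFT-TO-RIGHT EMBEDDING
    "rle": 0x202B,    # RIGHT-TO-LEFT EMBEDDING
    "pdf": 0x202C,    # POP DIRECTIONAL FORMATTING
    "alm": 0x061C,    # ARABIC LETTER MARK
    "cgj": 0x034F,    # COMBINING GRAPHEME JOINER
}

# Match only the exact hypothetical names, terminated by a semicolon.
_PATTERN = re.compile(r"&(" + "|".join(MNEMONICS) + r");")


def expand_mnemonics(source: str) -> str:
    """Rewrite hypothetical mnemonics as hexadecimal numeric references."""
    return _PATTERN.sub(lambda m: f"&#x{MNEMONICS[m.group(1)]:04X};", source)


print(expand_mnemonics("<p>&rli;שלום&pdi; costs 10&nnbsp;000&nbsp;EUR</p>"))
# <p>&#x2067;שלום&#x2069; costs 10&#x202F;000&nbsp;EUR</p>
```

Existing references such as &nbsp; are left untouched, since they are not in the hypothetical table.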

Adding the invisible characters to the named entity list would not help users immediately, but the new entities could become commonly supported in just a few years.

Please advise how best to prosecute this issue and whether you would like to discuss it in our teleconference or some other venue.

@aphillips aphillips reopened this Mar 4, 2023
@ras52

ras52 commented Dec 12, 2023

I hope it's okay for an outsider to post to this thread. It seems to me that one of the bigger barriers to adding entities is not merely that existing parsers will not recognise them, but more specifically the manner in which they fail. §13.1.2 of the current HTML 5 spec says ambiguous ampersands are invalid in most contexts. That means all bets are off, but in the various browsers I've tested the entity is displayed literally in the text, which is pretty bad in this particular case. The argument is probably to cope with HTML like <p>I ordered fish&chips; John had a pie.</p>, though I wonder how common this really is. (Are there languages where ampersands are commonly used without surrounding space?) If HTML5 starts adding new entities, this is probably no longer the best behaviour. Would it be better to display U+FFFD in place of the full entity-like-thing when an ambiguous ampersand is encountered? At least that makes it clear to a reader that something is off, which the raw entity name may not. If so, might it be sensible to change the spec to mandate this behaviour in advance of actually adding new entities?

@xfq
Contributor

xfq commented Dec 14, 2023

> Are there languages where ampersands are commonly used without surrounding space?

I'm not sure, but note that in English, there are words like P&G, R&D, and AT&T that don't have the surrounding space.

@annevk
Member

annevk commented Dec 14, 2023

Apologies for the lack of reply here. I just noticed @aphillips's request to discuss this in person. I'll mark it agenda+ and suggest we discuss it somewhere in January at a time suitable for the US and Europe given the locations of the relevant experts. January 11 looks to be the first available such slot at 9AM PST.

@annevk annevk added the agenda+ To be discussed at a triage meeting label Dec 14, 2023
@Crissov

Crissov commented Dec 14, 2023

What exactly is the I18N proposal to be discussed?

Introduce named character references for …

  1. some specific non-spacing (control) characters?
  2. all existing non-spacing characters?
  3. all existing and future non-spacing characters?
  4. some specific non-spacing and whitespace characters?
  5. all existing non-spacing and whitespace characters?
  6. all existing and future non-spacing and whitespace characters?

@aphillips
Contributor

@annevk

Thanks! Let's look for a suitable time slot. I'm not familiar with HTML's call schedule. Would it be possible to do a week later (assuming you have calls weekly??) such as the 18th? That way we could include @r12a, who has previously contributed on this thread. We can also host you in our regular call (Thursdays at 7 AM Pacific)

@Crissov

We would like to discuss the possibility of additions of this type in general. We have specific existing non-spacing characters and, it appears, perhaps a few specific whitespace characters in mind. Obviously, if we "broke the dam" on additions, there is also the question of establishing criteria for any future additions. We do not propose to add named entities in a broad or general sense.

@ygoe
Author

ygoe commented Dec 14, 2023

Maybe what sets this one apart from others is that it's invisible. You could potentially use smart input methods to generate just about any visible character and anybody else reading the document would see it. Of course you can also use smart input methods to generate special white-space characters (like I do with my modified keyboard layout), but the problem is that other people editing the document likely won't see them if they're not familiar with the various spaces and don't have the tools to see them. So to be safe, it could be a good solution to use a representation that makes the character visible. &nbsp; is already widely used in Wikipedia content, for example.

So if you're looking for criteria, this might be one. 🙂
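One way to get such a visible representation with today's HTML is to escape invisible characters mechanically before text goes into shared source. The following is a rough sketch, assuming Unicode general categories Zs (space separators other than the plain space) and Cf (format characters) are a reasonable approximation of "invisible"; the function name `reveal_invisibles` is made up for illustration.

```python
import unicodedata


def reveal_invisibles(text: str) -> str:
    """Replace invisible characters with hexadecimal numeric references."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        # Keep the ordinary space; escape other spaces and format characters.
        if ch != " " and category in ("Zs", "Cf"):
            out.append(f"&#x{ord(ch):X};")
        else:
            out.append(ch)
    return "".join(out)


print(reveal_invisibles("10\u202f000 km"))
# 10&#x202F;000 km
print(reveal_invisibles("\u2067abc\u2069"))
# &#x2067;abc&#x2069;
```

The output is readable only to someone who knows the code points, which is the gap a named reference would close.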

@annevk
Member

annevk commented Dec 15, 2023

@aphillips for that time slot the next one is Feb 22. There are two other meetings, but one is not useful for Europe and one is not useful for the US. Getting WHATNOT participants to join another meeting could maybe work, but it probably requires explicitly pinging some people and making sure they can all make it which is not work I can sign up for right now. Maybe next year.

@aphillips
Contributor

@annevk Thanks. This isn't urgent, so let's go for February? Thinking aloud, perhaps we (meaning me) should make a list of I18N issues that could use attention ahead of time and we can have a section of the call for I18N?

@annevk
Member

annevk commented Dec 15, 2023

Sounds good to me!
