Add &nnbsp; entity for U+202F #5121
Comments
If a new entity is added, the effort should be coordinated with MathML to keep the entity definitions synchronized: https://w3c.github.io/xml-entities/ |
Mozilla is not interested in this. I guess that's a bad starting point already? I don't have the best experiences with the Chrome developers, maybe I'll try it there anyway. Unfortunately, entities is something that's not extensible in HTML, so I can't even run my own little happy solution. |
If the HTML standard evolves, Mozilla and others must follow the new specification; that much is evident. I'm currently interested in having Furthermore, HTML entities exist for numerous characters that are, in my opinion, almost never used, like |
In my opinion this would be extremely useful for French authors, but also for other languages. The NNBSP character was initially added to Unicode for Mongolian suffix handling, where it is important to visually distinguish between spaces separating suffixes and those separating words. It is also being proposed as an ideal fit for a morphological separator in the numerous languages written in the Canadian Aboriginal script (see w3c/amlreq#4). An entity would significantly help authors produce correct (and better machine-readable) text in all these languages.
[@annevk could you add i18n-mlreq and i18n-amlreq labels to the repo, so I can alert those folks to the discussion? Thanks.]
Here is an extension of this issue, which I can raise in a new issue if preferred. There are other invisible characters for which a named character reference would be very useful for producing correctly authored Unicode text, for the same reasons as mentioned in the first comment. Here, for example, is a list of formatting characters used for Arabic, but most are essential characters for all RTL script-based languages.
Characters with entities:
Characters without entities:
Keyboards generally don't address the problem of inputting the characters, but it's also a problem that the characters themselves are invisible. It would really help to have named character references. As someone who works with people who use these languages, and works with them myself, it seems to me that from a user's perspective it would be well worth the effort to add them. I don't remember why that hasn't happened before now. |
(New labels are to be introduced through https://github.com/whatwg/meta.) |
I just filed whatwg/meta#182 |
I believe I've commented previously along the following lines when this has come up:
|
Curious to hear what others think, but I tend to agree. Perhaps the best course of action here would be to update https://github.com/whatwg/html/blob/master/FAQ.md and close these types of feature requests. |
@hsivonen I think what makes this request a bit different from others is that it's for invisible characters. As @r12a points out, it's hard to work with invisible characters. And letting wiki markup handle it isn't helpful at all: this is something that needs to work across all input modes into HTML, because it has to be reliable and consistent to be useful to the people who need them. So while I understand your general premise about the update cycle being, potentially, 5 years or so, I think it's worth it in this case. If we want to take the time to batch up all the invisible characters we need to care about so we can do it at once, let's do that and make a coordinated update to the parser that makes languages that need invisible characters easier to typeset in HTML. |
What wikis or any other applications do is entirely irrelevant here. And following @hsivonen's argument, any progress is bad. So why care at all? Just leave it forever as it was defined some 30 years ago. Never change a running system (which is generally bad advice). I'm fully aware that not all existing HTML parsers and renderers will properly handle this overnight when it's added. It'll take time. But we're in the fortunate (and also unfortunate) situation that the number of relevant HTML parsers in use is very limited, and these are actively maintained and automatically updated most of the time. So changes like this will eventually trickle through to all users and in a few years we can benefit from it without worrying too much. If you're not willing to wait such a long time, you shouldn't work in such projects. Web projects already have a large number of dependencies on browsers and this could be just one of them. As soon as you discover that all browsers that support everything else you already need also support this entity, you can safely use it. Also, of course I can use any Unicode character directly. But this one hasn't made it onto physical or software-defined keyboards. The same is true of NBSP, SHY, and MINUS. So this argument is moot. Also, of course I can escape any Unicode character by its codepoint value. But nobody will remember those numbers, which means that 1. nobody will be able to fluently write these characters and 2. nobody will be able to fluently read and understand them. This is about as big a usability failure as it can get. Then, we already have similar entities, like NBSP. Why do they exist? I imagine they exist because they cannot be written with keyboards, their codepoint cannot be remembered, this one is even visually indistinguishable from a more common character (SP) and its use is required sometimes. While not being strictly "required" and not used as often, NNBSP falls exactly in the same category. 
So I definitely see reason for its existence as an entity. On the other hand, it doesn't hurt anybody. Any undefined HTML entity is invalid markup, and the "nnbsp" entity is undefined, so it can safely be assigned. As could other invisible Unicode whitespace, like some zero-width characters that affect wrapping and/or hyphenation. |
Why is that? |
In addition to what @fantasai said, for some characters it’s not about the decision of direct UTF encoding vs. numeric character reference, if there is no named entity reference available, but between the proper character and some inferior replacement character. For invisible characters in particular, that’s either a space or nothing. |
Closes whatwg#3655. Closes whatwg#5121. Closes whatwg#6049.
I was actioned by I18N to reopen this issue. We are well aware of #7071 which notes that HTML will not add new named character references. The argument in favor of that policy is that newly added named entities would be broken in all parsers (not just browsers) until such time as the parsers adopted the change and that this would be a barrier to use (users would not adopt the new entities because they do not work). The sense of I18N is that we want to reopen the discussion anyway. We have a particular interest in the new isolating bidi controls, although other invisible characters are also in this request. Invisible characters are hard to use and harder to manage when authoring a page. When using NCRs, the user must memorize the code point number, which is more prone to error. Most of these characters have memorable short names that lend themselves to entities, such as Adding the invisible characters to the named entity list would not enable users soon, but could become commonly supported in just a few years. Please advise how best to prosecute this issue and whether you would like to discuss it in our teleconference or some other venue. |
I hope it's okay for an outsider to post to this thread. It seems to me that one of the bigger barriers to adding entities is not merely that existing parsers will not recognise them, but more specifically the manner in which they fail. §13.1.2 of the current HTML 5 spec says ambiguous ampersands are invalid in most contexts. That means all bets are off, but in the various browsers I've tested the entity is displayed literally in the text, which is pretty bad in this particular case. The argument is probably to cope with HTML like |
I'm not sure, but note that in English, there are words like P&G, R&D, and AT&T that don't have the surrounding space. |
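To make the failure mode discussed above concrete, here is a minimal Python sketch. It is an illustration only: it assumes that Python's `html.unescape`, which decodes against the standard library's fixed HTML5 entity table, is a reasonable stand-in for a parser that does not know a new entity name. It is not a browser, and the sample strings are made up for the example.

```python
import html

# html.unescape decodes against Python's built-in HTML5 entity table,
# which defines "nbsp" but has no entry for "nnbsp".
known = html.unescape("10&nbsp;km")     # defined entity: decoded to U+00A0
unknown = html.unescape("10&nnbsp;km")  # undefined entity: left as literal text
brand = html.unescape("AT&T and R&D")   # bare ampersands in words: left untouched

print(repr(known))
print(repr(unknown))
print(repr(brand))
```

With this decoder the undefined reference survives as the literal text `&nnbsp;`, which matches the behaviour reported for browsers in the comment above, and the same leniency is what keeps words such as AT&T intact.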
Apologies for the lack of reply here. I just noticed @aphillips's request to discuss this in person. I'll mark it agenda+ and suggest we discuss it somewhere in January at a time suitable for the US and Europe given the locations of the relevant experts. January 11 looks to be the first available such slot at 9AM PST. |
What exactly is the I18N proposal to be discussed? Introduce named character references for …
|
Thanks! Let's look for a suitable time slot. I'm not familiar with HTML's call schedule. Would it be possible to do a week later (assuming you have calls weekly??) such as the 18th? That way we could include @r12a, who has previously contributed on this thread. We can also host you in our regular call (Thursdays at 7 AM Pacific) We would like to discuss the possibility of additions of this type in general. We have specific existing non-spacing characters and, it appears, perhaps a few specific whitespace characters in mind. Obviously, if we "broke the dam" on additions, there is also the question of establishing criteria for any future additions. We do not propose to add named entities in a broad or general sense. |
Maybe, what sets this one apart from others is that it's invisible. You could potentially use smart input methods to generate just about any visible character and anybody else reading the document would see it. Of course you can also use smart input methods to generate special white-space characters (like I do with my modified keyboard layout), but the problem is that other people editing the document likely won't see it if they're not familiar with the various spaces and have the tools to see them. So to be safe, it could be a good solution to use a presentation that makes it visible. So if you're looking for criteria, this might be one. 🙂 |
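The idea of using a presentation that makes invisible characters visible can be sketched as a small helper. This is a hypothetical function (`reveal_invisibles` is not part of any tool mentioned in the thread), and the choice of Unicode general categories Zs and Cf as the definition of "invisible" is an assumption made for the example.

```python
import unicodedata


def reveal_invisibles(text: str) -> str:
    """Replace special spaces and format characters with hex character references."""
    out = []
    for ch in text:
        # Zs covers space separators (NBSP, NNBSP, ...); Cf covers format
        # characters (zero-width space, bidi controls, ...). The ordinary
        # ASCII space is kept as-is.
        if ch != " " and unicodedata.category(ch) in ("Zs", "Cf"):
            out.append(f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)


# Narrow no-break space between a number and its unit.
print(reveal_invisibles("10\u202fkm"))
```

Running such a pass before review would let other editors see which special spaces a document contains, at the cost of the hard-to-read numeric form that the thread is complaining about.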
@aphillips for that time slot the next one is Feb 22. There are two other meetings, but one is not useful for Europe and one is not useful for the US. Getting WHATNOT participants to join another meeting could maybe work, but it probably requires explicitly pinging some people and making sure they can all make it which is not work I can sign up for right now. Maybe next year. |
@annevk Thanks. This isn't urgent, so let's go for February? Thinking aloud, perhaps we (meaning me) should make a list of I18N issues that could use attention ahead of time and we can have a section of the call for I18N? |
Sounds good to me! |
There's &nbsp; for U+00A0. It's a full-width no-break space. It can be used between numbers and their short unit names, or in other places.
Typography and regional norms require (or at least recommend) using a thin no-break space (or narrow no-break space) in several places:
(These are the first and best sources I could find now. There may be better or more authoritative sources available, but they're usually hard to find.)
While it is technically possible to create a keyboard layout that produces this character, not many users have this installed and even then it's hard to distinguish it from other space characters when reading and revising text. Most editors don't even show a replacement symbol for this space character.
AFAIK Wikipedia suggests writing &nbsp; in these places. And that's probably a good idea in team projects as well. But this is actually the wrong character in these places.
To use the correct narrow no-break space, one has to use a different HTML entity representation, like &#x202F; or &#8239; which are frankly hard to remember or recognise.
As a solution, the new entity &nnbsp; should be added to HTML to make it easy to write readable text following the correct typographic rules and recommendations.
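For reference, the numeric character references for the narrow no-break space follow directly from its code point, U+202F. The short Python sketch below derives both forms and checks that they decode back to the same character; the variable names are chosen for illustration only.

```python
import html
import unicodedata

NNBSP = "\u202f"  # the narrow no-break space discussed in this issue

decimal_ref = f"&#{ord(NNBSP)};"   # decimal numeric character reference
hex_ref = f"&#x{ord(NNBSP):X};"    # hexadecimal numeric character reference

print(unicodedata.name(NNBSP))
print(decimal_ref, hex_ref)

# Both references decode back to the same single character.
assert html.unescape(decimal_ref) == NNBSP
assert html.unescape(hex_ref) == NNBSP
```

Neither form says anything about what the character is, which is the readability problem a named reference would address.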