Document parsing of (X)HTML entities, or drop it even? #4

Closed
syranide opened this issue Aug 29, 2014 · 43 comments · Fixed by #136
Labels
Impl Reality Reality that the spec does not capture

Comments

@syranide
Contributor

We should probably document how (X)HTML entities are parsed.

However, I can imagine dropping HTML entities instead and adopting the escaping used by JS strings, i.e. bla \< \{ \u1234 bla. To me it would make sense in many ways:

  1. JSX is the JavaScript-equivalent of HTML (it's not HTML), using JavaScript syntax seems preferable.
  2. JSX explicitly disallows inline HTML in favor of just JSXElements and JSXText, so HTML entities seem a bit misplaced in that context.
  3. It's currently <a href="&amp;\" /> vs <a href={'&\\'} /> which is kind of awkward.
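
Point 3 can be made concrete with a small sketch. It assumes the behavior described in this thread: a quoted JSX attribute value has its HTML entities decoded and its backslashes kept literally. The helper `decodeJsxAttributeString` and its entity table are hypothetical and cover only a tiny illustrative subset; they are not taken from any real transform.

```javascript
// Hypothetical helper: mimics how a JSX transform might turn the raw text of a
// quoted attribute value into a JS string. The table is an illustrative subset.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', nbsp: '\u00A0', copy: '\u00A9' };

function decodeJsxAttributeString(raw) {
  return raw.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g, (match, body) => {
    if (body[0] === '#') {
      // Numeric reference: decimal (&#123;) or hexadecimal (&#x7B;).
      const isHex = body[1] === 'x';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    }
    // Named reference: unknown names are left untouched.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body) ? NAMED_ENTITIES[body] : match;
  });
}

// The raw attribute text in <a href="&amp;\" /> is the six characters  &amp;\
const fromAttribute = decodeJsxAttributeString('&amp;\\');
// The same value written as a JS string inside braces: <a href={'&\\'} />
const fromExpression = '&\\';

console.log(fromAttribute === fromExpression);
```

Both spellings end up as the same two-character string, but one relies on entity decoding with literal backslashes and the other on JS escaping, which is the awkwardness being pointed out.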

The downside of dropping HTML entities is obviously that you wouldn't be able to copy-paste HTML and it could be a mental disconnect for a lot of users. But I think it makes a lot of sense from a technical perspective.

I think it makes even more sense if you look beyond HTML. Why would you be using HTML entities for non-HTML frontends? Like iOS, QT, etc.

@ghost

ghost commented Aug 4, 2015

Thank you for reporting this issue and appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days or the issue may be closed.

@gajus

gajus commented Sep 2, 2015

The downside of dropping HTML entities is obviously that you wouldn't be able to copy-paste HTML and it could be a mental disconnect for a lot of users. But I think it makes a lot of sense from a technical perspective.

React has already chosen to deviate from HTML. facebook/react#2781

and

Third, our thinking is that JSX's primary advantage is the symmetry of matching closing tags which make code easier to read, not the direct resemblance to HTML or XML. It's convenient to copy/paste HTML directly, but other minor differences (in self-closing tags, for example) make this a losing battle and we have a HTML to JSX converter to help you anyway. Finally, to translate HTML to idiomatic React code, a fair amount of work is usually involved in breaking up the markup into components that make sense, so changing class to className is only a small part of that anyway.

from @spicyj's answer: https://www.quora.com/Why-do-I-have-to-use-className-instead-of-class-in-ReactJs-components-done-in-JSX/answer/Ben-Alpert

Therefore, I am in favour of dropping HTML entity support.

@RReverser
Contributor

React has already chosen to deviate from HTML.

@gajus From HTML - yes, from XML - not so much (apart from JS injections).

@gajus

gajus commented Sep 2, 2015

Well, I am biased. I want JSX to allow template strings in JSXAttributeValue. The fate of that issue depends on whether HTML entity support is dropped or not. This is another consideration to have when deciding on this.

@RReverser
Contributor

Do those two braces around really mean that much to you to change two behaviors? 😄

@gajus

gajus commented Sep 2, 2015

One is HTML entities. What's the second?

@RReverser
Contributor

Template strings without braces on their own.

@gajus

gajus commented Sep 2, 2015

I think that since JSX is present in JS and that it is in essence a syntactic sugar for createElement, then it should behave in the same way, i.e.,

React.createElement(`div`, {className: `foo-${foo}`}, `bar-${bar}`);

should not be different from

<div className=`foo-${foo}`>`bar-${bar}`</div>
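
For comparison, the form JSX accepts today wraps each template string in braces. The sketch below uses a hypothetical `createElement` stand-in that only records its arguments (it is not React's implementation), and the values of `foo` and `bar` are made up, just to show that the braced JSX and the direct call carry the same strings.

```javascript
// Hypothetical stand-in for React.createElement, used only to inspect arguments.
function createElement(type, props, ...children) {
  return { type, props, children };
}

const foo = 'a';
const bar = 'b';

// Valid JSX today: <div className={`foo-${foo}`}>{`bar-${bar}`}</div>
// A classic JSX transform would emit a call of roughly this shape:
const element = createElement('div', { className: `foo-${foo}` }, `bar-${bar}`);

console.log(element.props.className, element.children);
```

The proposal in this comment would only remove the braces; the resulting call would stay the same.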

@RReverser
Contributor

Then we return to questions like numeric literals, object and array literals, and so on.

@gajus

gajus commented Sep 2, 2015

@RReverser Explain?

If I understand correctly, then yes, objects, strings, null and numbers (that's all there is) should be valid attribute values.

<div foo=null />
<div foo=123 />
<div foo=() => {} />
<div foo=({}) />

Does this clash with anything in the spec?

@RReverser
Contributor

It doesn't clash, but it increases complexity for purely aesthetic reasons.

@gajus

gajus commented Sep 2, 2015

That is true. But consistency and conventions lower the bug count (sorry, no reference for this stat). Assuming that is true, then if the rest of the code base is using convention X (template strings in this case), it would make sense if JSX supported that too.

@RReverser
Contributor

That argument has two sides: on one hand, you're increasing consistency for those who work with JS for developing logic, and on the other you at the same time decrease consistency and familiarity for those who develop views (HTML/XML coders).

@sebmarkbage
Contributor

I think that it probably only makes sense to do this if we also drop it from JSXText or drop JSXText completely, as described in #8 and #35 .

@syranide
Contributor Author

syranide commented Sep 3, 2015

@sebmarkbage I'd say #28 is a candidate for otherwise keeping JSX as it is and being able to drop XHTML entities.

That argument has two sides: on one hand, you're increasing consistency for those who work with JS for developing logic, and on the other you at the same time decrease consistency and familiarity for those who develop views (HTML/XML coders).

IMHO the problem is that it is inconsistent. It would be fine if <a href="&nbsp;" /> was the same as <a href={"&nbsp;"} />, which it isn't... to be honest I'm quite sure that many don't even realize this difference exists.

@RReverser
Contributor

to be honest I'm quite sure that many don't even realize this difference exists

Dunno, maybe, but I haven't met such people yet. Right now it's pretty balanced in the sense that most realize that {...} marks the boundaries of JavaScript: outside of them everything works pretty much as XML, inside as JS.

The biggest benefit of entities is that they're properly named and easy to remember. Most people know perfectly well how to write &nbsp; or &mdash; or &copy; to get what they want, while very few people know the corresponding hexadecimal codes, and googling them every time you want a special character, or using some external library that would just provide a list of characters, is not a really pleasant experience.

@syranide
Contributor Author

syranide commented Sep 3, 2015

The biggest benefit of entities is that they're properly named and easy to remember. Most people know perfectly well how to write &nbsp; or &mdash; or &copy; to get what they want, while very few people know the corresponding hexadecimal codes, and googling them every time you want a special character, or using some external library that would just provide a list of characters, is not a really pleasant experience.

\< \> \& \" seems easier to me than &lt; &gt; &amp; &quot;? Hexadecimal codes are last resort.

PS. If you want © then just write it, there's no reason to use the hexcode or HTML entity.
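
As a rough illustration of the JS-string side of this argument, each character mentioned above already has an escape or literal form in a JS string. The variable names below are illustrative only, and this shows existing JS string escapes, not the `\<` style proposed for JSX text.

```javascript
// JS string escapes for the characters discussed above.
const copyright = '\u00A9';        // same character as the literal '©'
const nonBreakingSpace = '\u00A0'; // what &nbsp; stands for
const emDash = '\u2014';           // what &mdash; stands for
const quote = '\"';                // escaped double quote
const lessThan = '<';              // needs no escape inside a JS string

console.log(copyright === '©', quote === '"', lessThan === '\x3C');
```

The hexadecimal forms are only needed when the character cannot be typed or pasted directly.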

@gajus

gajus commented Sep 3, 2015

\< \> \& \" seems easier to me than &lt; &gt; &amp; &quot;? Hexadecimal codes are last resort.

Was just typing that. Why bother with HTML entities at all.

@RReverser
Contributor

then just write it

You mean use a specific keyboard layout that allows them, or a character map application? Not all platforms and localizations have that ability out of the box.

@gajus

gajus commented Sep 3, 2015

@gajus

gajus commented Sep 3, 2015

That's genuinely what I do when my keyboard does not have a character that I need. Since it is very rare that I would need a character that's not on my keyboard, it does not bother me. I cannot imagine anyone being bothered by that either.

@RReverser
Contributor

Well, I do that as well, but it's not pleasant at all, and it's not as rare as it seems, especially for the examples above such as non-breaking spaces, em dashes and copyright characters. They are in fact much more common than < and > in regular text, and the two others mentioned (" and &) are already perfectly supported without any kind of escaping in JSX.

@gajus

gajus commented Sep 3, 2015

While not all platforms support character maps, I imagine that every IDE/text editor has a plugin for that (vim, Sublime, WebStorm, to name a few).

@gajus

gajus commented Sep 3, 2015

Not to mention that "regular text" is rarely typed in React code. It is something you load from a database of some sort.

@syranide
Contributor Author

syranide commented Sep 3, 2015

You mean use a specific keyboard layout that allows them, or a character map application? Not all platforms and localizations have that ability out of the box.

http://fsymbols.com/computer/copyright/

I'm pretty sure entities aren't meant to be human-friendly first and foremost, but simply a mechanism for escaping that is charset and implementation independent.

Regardless, I don't see how this is a problem JSX should try to solve (and intentionally deviate from JS), JS makes no effort.

@RReverser
Contributor

While not all platforms support character maps, I imagine that every IDE/text editor has a plugin for that (vim, Sublime, WebStorm, to name a few).

So in any case: remove the built-in human-friendly way of escaping, and instead force the dev to google/use charmap/plugin/whatever. Degrading DX is not something nice.

Not to mention that "regular text" is rarely typed in React code. It is something you load from a database of some sort.

Often it is: text is exactly the thing that is rather rarely generated dynamically compared to the static parts of the page (user names, blog contents and numbers are, but those are rather a minority and have little to do with our issue and special characters). And if we take your assumption, then this issue doesn't make sense to discuss at all.

I'm pretty sure entities aren't meant to be human-friendly first and foremost, but simply a mechanism for escaping that is charset and implementation independent.

In that case, they would be left as &#123;. I believe names were designed specifically to be human-friendly and compatible with any locale and they serve this purpose far better than escapes in JS.

Regardless, I don't see how this is a problem JSX should try to solve (and intentionally deviate from JS), JS makes no effort.

I see, this issue becomes yet another discussion of whether JSX should be sugar that is as compatible as possible with XML/HTML syntax, or whether we should reduce its coverage, slowly moving towards JS. I don't buy the second way because it's no better than just using some kind of Hyperscript. If you want JS, you can write JS, but JSX is beautiful exactly because you can escape some of JS's painful points when dealing with structures and contents, such as unobvious nestings and foreign-locale escapes.

@syranide
Contributor Author

syranide commented Sep 3, 2015

In that case, they would be left as &#123;. I believe names were designed specifically to be human-friendly and compatible with any locale and they serve this purpose far better than escapes in JS.

No, because &#123; is inherently meaningless without a specified charset, HTML entities are independent of charset and later translated.

I see, this issue becomes yet another discussion of whether JSX should be sugar that is as compatible as possible with XML/HTML syntax, or whether we should reduce its coverage, slowly moving towards JS. I don't buy the second way because it's no better than just using some kind of Hyperscript. If you want JS, you can write JS, but JSX is beautiful exactly because you can escape some of JS's painful points when dealing with structures and contents, such as unobvious nestings and foreign-locale escapes.

If you ask me, JSX should not expand to do more than is absolutely necessary, that is to introduce the concept of elements in a meaningful way. If we want to solve anything else then it should be considered independently and where possible proposed to ECMA instead so that everyone benefits and not just a partial subset of JSX content. "Foreign-locale escapes" sounds far more useful at the level of JS.

@matthewwithanm

@gajus From HTML - yes, from XML - not so much (apart from JS injections).

Or namespaces or CDATA sections or comments…IMO there are a bunch of ways that it deviates.

I'm sympathetic to the DX argument, but IMO the best thing for DX is to keep the transformation as simple as possible. Also, the more similar JSX and XML are, the more confusing any deviation becomes.

If you ask me, JSX should not expand to do more than is absolutely necessary, that is to introduce the concept of elements in a meaningful way. If we want to solve anything else then it should be considered independently and where possible proposed to ECMA instead so that everyone benefits and not just a partial subset of JSX content. "Foreign-locale escapes" sounds far more useful at the level of JS.

👍

@sebmck

sebmck commented Sep 25, 2015

If the purpose of JSX is to be agnostic to a certain target (that's not always HTML) then does it really make sense to allow HTML entities?

@sebmarkbage
Contributor

If we get buy in, will we have any problems making the switch? I.e. will we risk a long lived fork? The codemod should be safe.

@sebmck

sebmck commented Sep 25, 2015

Do we have any stats (or anecdotal evidence) on how widely used HTML entities in JSX are?

@sebmarkbage
Contributor

Or backslashes...

@sebmck

sebmck commented Sep 25, 2015

Oh right. I've actually broken backslashes in JSX attributes before in Babel and it took over 7 days for someone to notice and file an issue: babel/babel#2114.

@NekR

NekR commented Sep 25, 2015

I believe that entities (or other specific things) should be handled by the renderer which transforms JSX-output to HTML DOM/HTML string, but not by the transformer which transforms JSX to JSX-output.

@syranide
Contributor Author

@NekR It would then apply to all strings equally so even user input would be subject to HTML entity decoding (aside from it being a runtime cost too), you definitely do not want that.

@NekR

NekR commented Sep 25, 2015

@syranide what is user input in JSX? I did not say everything in runtime should be parsed with entities.

class EntitiesString {
  constructor(str) {
    this.str = myLibraryDoesHTMLEntytiesParsingHere(str);
  }

  toString() {
    // Return the decoded string stored on the instance (a bare `str` is undefined here).
    return this.str;
  }
}

<div>{ new EntitiesString('&nbsp;') }</div>

@syranide
Contributor Author

...by the renderer which transforms JSX-output to HTML DOM/HTML string...

@NekR I interpreted that differently. IMHO what you are proposing is runtime decoding (which is for everyone to decide on their own) and outside this discussion about entities/escape codes in JSX source code. EDIT: That is to say, JSX needs to support escaping to some extent (like { and <), regardless of whether or not JSX will drop support for HTML entities.

@NekR

NekR commented Sep 25, 2015

@NekR I interpreted that differently.

Yes, I meant that renderers are responsible for parsing entities. One could support EntitiesString, others might not.

IMHO what you are proposing is runtime decoding (which is for everyone to decide on their own) and outside this discussion about entities/escape codes in JSX source code.

Of course I do not propose such decoding method here for JSX, it's implementation detail of JSX consumers. What I am saying is that entities parsing on a transpilation stage is not needed (because of runtime possibilities) and hence it's in scope of this discussion, right?

EDIT: That is to say, JSX needs to support escaping to some extent (like { and <), regardless of whether or not JSX will drop support for HTML entities.

Hmm.. <div>{ '{test}' } { '<div>' }</div> seems like it's escaped?

@syranide
Contributor Author

What I am saying is that entities parsing on a transpilation stage is not needed (because of runtime possibilities) and hence it's in scope of this discussion, right?

IMHO no, entity parsing during transpilation and runtime decoding of entities are "complementary". Runtime decoding of static source code strings in this context is inefficient and cumbersome.

Hmm.. <div>{ '{test}' } { '<div>' }</div> seems like it's escaped?

Produces React.createElement('div', null, '{test}', '<div>') and yeah it will visually render the same as it would if you had {'{test}<div>'}, but it's not the same. So yes, you can work around the problem that way (but you're inserting a JS string, not escaping in JSX). However, this all-or-nothing if you don't want to affect runtime behavior is really inconvenient, especially considering <div>{'&nbsp;'}</div> is very different from <div>&nbsp;</div> at current.
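
The difference described here can be sketched by writing out the calls a transform would emit in each case. `createElement` below is a hypothetical stand-in that only records its arguments, and the two calls are written by hand under the assumption stated in this thread: entities in JSXText are decoded at compile time, while strings inside braces are passed through untouched.

```javascript
// Hypothetical stand-in for React.createElement, used only to inspect arguments.
function createElement(type, props, ...children) {
  return { type, props, children };
}

// <div>&nbsp;</div>      -> the entity is decoded by the transform
const decoded = createElement('div', null, '\u00A0');
// <div>{'&nbsp;'}</div>  -> a plain JS string, passed through character by character
const literal = createElement('div', null, '&nbsp;');

console.log(decoded.children[0].length); // 1
console.log(literal.children[0].length); // 6
```

The first element renders a single non-breaking space, while the second renders the six visible characters of the entity text.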

@NekR

NekR commented Sep 25, 2015

IMHO no, entity parsing during transpilation and runtime decoding of entities are "complementary". Runtime decoding of static source code strings in this context is inefficient and cumbersome.

Sorry, but the topic is "Document parsing of (X)HTML entities, or drop it even?" and I am saying: Drop it. How is it not related? Runtime parsing was suggested as a solution. Someone who does not want a runtime solution could write a plugin which will pre-parse entities to JS escapes or something like that. But you are not even listening to me. What I am saying is that it makes sense to have JSXText be equal to a simple JS string (sugar). Like these two should be equivalent: <div>&nbsp;</div> and <div>{'&nbsp;'}</div>.

Runtime decoding of static source code strings in this context is inefficient and cumbersome.

This is only a problem for React, since it re-renders on every move. I use JSX in a different way and it's perfectly fine for me.

So yes, you can work around the problem that way (but you're inserting a JS string, not escaping in JSX). However, this all-or-nothing if you don't want to affect runtime behavior is really inconvenient, especially considering <div>{'&nbsp;'}</div> is very different from <div>&nbsp;</div> at current.

Why do we need to do workarounds or escape JSX? Just have JS strings everywhere. I do not see any difference here, except that parsing entities during transpilation is a benefit for React.

P.S. Interesting that you made this repository public and asked for feedback from non-React implementations and when people came here with their opinions, you say: "This is not related". Just make it private repository and then no problem with "not related".

@syranide
Contributor Author

P.S. Interesting that you made this repository public and asked for feedback from non-React implementations and when people came here with their opinions, you say: "This is not related". Just make it private repository and then no problem with "not related".

@NekR I'm only one collaborator of many, these are my opinions. Feel free to refute them, but there are many things to consider. If I didn't care about your opinion I wouldn't have responded.

Sorry, but the topic is "Document parsing of (X)HTML entities, or drop it even?" and I am saying: Drop it. How is it not related? Runtime parsing was suggested as a solution.

Decoding at compile-time (source code and static strings) and run-time (dynamic strings) can both co-exist and make sense. In the context of language design, run-time decoding being possible is not an argument against a syntax feature, nor vice versa. They are solutions to different problems.

Yes, we both agree that HTML entities should be dropped, that's not what I objected to. I undoubtedly think that is the way forward, but the holes left behind by dropping HTML entities still need to be considered; runtime decoding is not it.

@NekR

NekR commented Sep 25, 2015

In the context of language design, run-time decoding being possible is not an argument against a syntax feature, nor vice versa.

I saw many such arguments and decisions in TC39, but okay, if you do not accept this as an argument then never mind.

runtime decoding is not it.

Why? Where is the big performance problem with it, except for React's constant re-rendering?

@dantman

dantman commented Jan 5, 2016

Personally I don't think the DX argument is valid. And that is not through an expectation of everyone using character maps, etc...

JSX is JavaScript and it doesn't really make sense that the solution when writing JS+JSX to "I can't type © with my keyboard" is "You can use &copy; in JSX strings but you're SOL in every other part of the JS". Which of course leads to a mess like:

<Foo
    label="I can &copy; here"
    legal={__('This site \u00A9 2016 Acme Media Inc.')} />

Same code. But you can use &copy; in one part of the JSX and you can't in the other because you have something – which doesn't have to be i18n, it can be collection processing or anything else – that requires that one of the strings be part of JS space and not JSX space.

If this is a problem, it is a problem universal to JS and not one that should have a JSX-only fix.

Rather I think the solution is to embrace the fact we're writing JS and fix this with JS. Specifically, given #25 I think the solution to "I can't type © with my keyboard and don't want to use a character map, C&P, or use some other tooling" is this.

var ent = require('character-entities');

<Foo
    label=`I can ${ent.copy} here`
    legal={_(`This site ${ent.copy} 2016 Acme Media Inc.`)} />
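
With the syntax JSX accepts today, the same idea needs braces around the template strings. The sketch below builds the equivalent props in plain JS. The hand-written `ent` map is a stand-in for whatever entity-name module is used (its shape, name to character, is an assumption), and `_` is a hypothetical translation helper that here just returns its input.

```javascript
// Stand-in for an entity-name-to-character map (assumed shape: name -> string).
const ent = { copy: '\u00A9', nbsp: '\u00A0', mdash: '\u2014' };
// Hypothetical translation helper; in this sketch it returns its input unchanged.
const _ = (message) => message;

// Equivalent of:
//   <Foo label={`I can ${ent.copy} here`}
//        legal={_(`This site ${ent.copy} 2016 Acme Media Inc.`)} />
const props = {
  label: `I can ${ent.copy} here`,
  legal: _(`This site ${ent.copy} 2016 Acme Media Inc.`),
};

console.log(props.label);
console.log(props.legal);
```

Because the lookup is ordinary JS, it works the same way inside and outside JSX, which is the point being argued above.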

@sebmarkbage sebmarkbage mentioned this issue Oct 12, 2016
@Huxpro Huxpro added the Impl Reality Reality that the spec does not capture label Feb 25, 2022
Huxpro added a commit that referenced this issue Mar 1, 2022
## Summary

Let's be faithful to the de-facto and document the HTML entity behaviors to the spec. Note that this is not about whether we should "drop this semantics or not", but about documenting the current behaviors that everyone has been living with for years.

### The Proposed Normative Change

I'm not aware of any practices specifying such transpiler/transform semantics in ECMA-262 so this is a really interesting attempt 🙂 So I ended up extending `Static Semantics: SV` which is the smartest way I can find to hack the semantics into the ECMA-262 spec. I think this should work and should be accurate enough. I'm curious on how implementors think about it though.

<del>I also intentionally left the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set. This may be seen as a breaking change in some regard and **this is open to discuss here**. </del> We've reached consensus that only HTML4 entities are allowed.

This commit also closes #133 by using `::` for characters which are supposed to be lexical grammars.

Close #126
Close #4

## Test Plan

open `index.html` and proof-read the spec ;)
9 participants