Document parsing of (X)HTML entities, or drop it even? #4

syranide · 2014-08-29T15:40:13Z

We should probably document how (X)HTML entities are parsed.

However, I can imagine dropping HTML entities instead and adopt the escaping used by JS-strings, i.e. bla \< \{ \u1234 bla. To me it would make sense in many ways:

JSX is the JavaScript-equivalent of HTML (it's not HTML), using JavaScript syntax seems preferable.
JSX explicitly disallows inline HTML in-favor of just JSXElements and JSXText, HTML entities seem a bit malplaced in that context.
It's currently <a href="&amp;\" /> vs <a href={'&\\'} /> which is kind of awkward.
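To make the mismatch concrete, here is a minimal sketch of the two escaping schemes side by side. It assumes a JSX string attribute expands entity references and keeps backslashes literal, while a JS string literal does the reverse. `NAMED_ENTITIES` and `decodeJsxString` are hypothetical names for illustration only, not part of any real transform, and the entity table is a tiny subset.

```javascript
// Hypothetical helper table: a tiny subset of the HTML4 named entities.
const NAMED_ENTITIES = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  nbsp: "\u00A0",
  copy: "\u00A9",
};

// Sketch of how a JSX transform is assumed to treat a string attribute:
// entity references are expanded, backslashes are left untouched.
function decodeJsxString(raw) {
  return raw.replace(
    /&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g,
    (match, body) => {
      if (body[0] === "#") {
        // Numeric reference, decimal (&#123;) or hexadecimal (&#x7B;).
        const isHex = body[1] === "x";
        const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        return code <= 0x10ffff ? String.fromCodePoint(code) : match;
      }
      // Named reference; unknown names are left as written.
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
        ? NAMED_ENTITIES[body]
        : match;
    }
  );
}

// The raw source text between the quotes of a JSX attribute: &amp;\
// (written here as a JS string, hence the doubled backslash).
const fromJsxAttribute = decodeJsxString("&amp;\\");

// The same value written as a JS string expression: '&\\'
const fromJsExpression = "&\\";

console.log(fromJsxAttribute === fromJsExpression); // true
```

Both forms end up as the two-character string `&\`, but the author has to escape a different character in each one.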

The downside of dropping HTML entities is obviously that you wouldn't be able to copy-paste HTML and it could be a mental disconnect for a lot of users. But I think it makes a lot of sense from a technical perspective.

I think it makes even more sense if you look beyond HTML. Why would you be using HTML entities for non-HTML frontends? Like iOS, QT, etc.

ghost · 2015-08-04T18:04:35Z

Thank you for reporting this issue and appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days or the issue may be closed.

gajus · 2015-09-02T19:53:55Z

The downside of dropping HTML entities is obviously that you wouldn't be able to copy-paste HTML and it could be a mental disconnect for a lot of users. But I think it makes a lot of sense from a technical perspective.

React has already chosen to deviate from HTML. facebook/react#2781

and

Third, our thinking is that JSX's primary advantage is the symmetry of matching closing tags which make code easier to read, not the direct resemblance to HTML or XML. It's convenient to copy/paste HTML directly, but other minor differences (in self-closing tags, for example) make this a losing battle and we have a HTML to JSX converter to help you anyway. Finally, to translate HTML to idiomatic React code, a fair amount of work is usually involved in breaking up the markup into components that make sense, so changing class to className is only a small part of that anyway.

from @spicyj answer https://www.quora.com/Why-do-I-have-to-use-className-instead-of-class-in-ReactJs-components-done-in-JSX/answer/Ben-Alpert

Therefore, I am in favour of dropping HTML entity support.

RReverser · 2015-09-02T20:18:40Z

React has already chosen to deviate from HTML.

@gajus From HTML - yes, from XML - not so much (apart from JS injections).

gajus · 2015-09-02T20:25:03Z

Well, I am biased. I want JSX to allow template strings in JSXAttributeValue. The fate of that issue depends on whether HTML entity support is dropped or not. This is another consideration to have when deciding on this.

RReverser · 2015-09-02T20:46:16Z

Do those two braces around really mean that much to you to change two behaviors? 😄

gajus · 2015-09-02T20:55:13Z

One is HTML entities. What's the second?

RReverser · 2015-09-02T21:11:31Z

Template strings without braces on their own.

gajus · 2015-09-02T21:21:06Z

I think that since JSX is present in JS and that it is in essence a syntactic sugar for createElement, then it should behave in the same way, i.e.,

React.createElement(`div`, {className: `foo-${foo}`}, `bar-${bar}`);

should not be different from

<div className=`foo-${foo}`>`bar-${bar}`</div>

RReverser · 2015-09-02T21:35:33Z

Then we return to questions like numeric literals, object and array literals and so on.

gajus · 2015-09-02T21:40:23Z

@RReverser Explain?

If I understand correctly, then yes, objects, strings, null and numbers (that's all there is) should be valid attribute values.

<div foo=null />
<div foo=123 />
<div foo=() => {} />
<div foo=({}) />

Does this clash with anything in the spec?

RReverser · 2015-09-02T21:42:18Z

It doesn't clash, but increases complexity for purely aesthetic reason.

gajus · 2015-09-02T21:47:01Z

That is true. But consistency/conventions lower bug count (sorry, no reference for this stats). Assuming that is true, then if the rest of the code base is using convention X (template string in this case), it would make sense if JSX supported that too.

RReverser · 2015-09-03T01:21:12Z

That argument has two sides: on one hand, you're increasing consistency for those who work with JS for developing logic, and on the other you at the same time decrease consistency and familiarity for those who develop views (HTML/XML coders).

sebmarkbage · 2015-09-03T01:50:23Z

I think that it probably only makes sense to do this if we also drop it from JSXText or drop JSXText completely, as described in #8 and #35 .

syranide · 2015-09-03T07:11:58Z

@sebmarkbage I'd say #28 is a candidate for otherwise keeping JSX as it is and being able to drop XHTML entities.

That argument has two sides: on one hand, you're increasing consistency for those who work with JS for developing logic, and on the other you at the same time decrease consistency and familiarity for those who develop views (HTML/XML coders).

IMHO the problem is that it is inconsistent. It would be fine if <a href="&nbsp;" /> was the same as <a href={"&nbsp;"} />, which it isn't... to be honest I'm quite sure that many don't even realize this difference exists.

RReverser · 2015-09-03T10:17:54Z

to be honest I'm quite sure that many don't even realize this difference exists

Dunno, maybe, but I haven't met such people yet. Right now it's pretty balanced in the sense that most realize that {...} marks the boundaries of JavaScript: outside of them everything works pretty much as XML, inside as JS.

The biggest benefit of entities is that they're properly named and easy to remember. Most people know perfectly how to write &nbsp; or &mdash; or &copy; to get what they want, while very few people know the corresponding hexadecimal codes, and googling them every time you want a special character, or using some external library that just provides a list of characters, is not a really pleasant experience.

syranide · 2015-09-03T10:20:28Z

The biggest benefit of entities is that they're properly named and easy to remember. Most people know perfectly how to write &nbsp; or &mdash; or &copy; to get what they want, while very few people know the corresponding hexadecimal codes, and googling them every time you want a special character, or using some external library that just provides a list of characters, is not a really pleasant experience.

\< \> \& \" seems easier to me than &lt; &gt; &amp; &quot;? Hexadecimal codes are a last resort.

PS. If you want © then just write it, there's no reason to use the hexcode or HTML entity.

gajus · 2015-09-03T10:21:45Z

\< \> \& \" seems easier to me than &lt; &gt; &amp; &quot;? Hexadecimal codes are last resort.

Was just typing that. Why bother with HTML entities at all.

RReverser · 2015-09-03T10:52:57Z

then just write it

You mean use specific keyboard layout that allows them or table character application? Not all platforms & localizations have that ability out of the box.

gajus · 2015-09-03T10:57:10Z

Copy paste from https://en.wikipedia.org/wiki/List_of_Unicode_characters.

gajus · 2015-09-03T10:58:27Z

That's genuinely what I do when my keyboard does not have a character that I need. Since it is very rare that I would need a character that's not on my keyboard, it does not bother me. I cannot imagine anyone being bothered by that either.

RReverser · 2015-09-03T11:10:46Z

Well, I do that as well, but it's not pleasant at all, and it's not as rare as it seems, especially for the examples above such as non-breaking spaces, medium dashes and copyright characters. They are in fact much more frequent than < and > in regular text, and the two others mentioned (" and &) are already perfectly supported without any kind of escaping in JSX.

gajus · 2015-09-03T11:14:27Z

While not all platforms support character maps, I imagine that every IDE/text editor has a plugin for that (vim, Sublime, WebStorm, to name a few).

gajus · 2015-09-03T11:15:16Z

Not to mention that "regular text" is rarely typed in React code. It is something you load from a database of some sort.

syranide · 2015-09-03T11:28:07Z

You mean use specific keyboard layout that allows them or table character application? Not all platforms & localizations have that ability out of the box.

http://fsymbols.com/computer/copyright/

I'm pretty sure entities aren't meant to be human-friendly first and foremost, but simply a mechanism for escaping that is charset and implementation independent.

Regardless, I don't see how this is a problem JSX should try to solve (and intentionally deviate from JS), JS makes no effort.

RReverser · 2015-09-03T11:47:25Z

While not all platforms support character maps, I imagine that every IDE/text editor has a plugin for that (vim, Sublime, WebStorm, to name a few).

So in any case - remove built-in human-friendly way for escaping, and instead force dev to google/use charmap/plugin/whatever. Degradation of DX is not something nice.

Not to mention that "regular text" is rarely typed in React code. It is something you load from a database of some sort.

Often it is. Text is exactly the thing that is rarely generated dynamically compared to the static parts of the page (user names, blog contents and numbers are, but those are a minority and have little to do with our issue and special characters). And if we take your assumption, then this issue doesn't make sense to discuss at all.

I'm pretty sure entities aren't meant to be human-friendly first and foremost, but simply a mechanism for escaping that is charset and implementation independent.

In that case, they would be left as &#123;. I believe names were designed specifically to be human-friendly and compatible with any locale and they serve this purpose far better than escapes in JS.

Regardless, I don't see how this is a problem JSX should try to solve (and intentionally deviate from JS), JS makes no effort.

I see, this issue becomes yet another discussion of whether JSX should be sugar that is as compatible as possible with XML/HTML syntax, or whether we should reduce its coverage, slowly moving towards JS. I don't buy the second way because it's no better than just using some kind of Hyperscript. If you want JS, you can write JS, but JSX is beautiful exactly because you can escape some of JS's painful points when dealing with structures and contents, such as unobvious nestings and foreign-locale escapes.

syranide · 2015-09-03T12:10:59Z

In that case, they would be left as &#123;. I believe names were designed specifically to be human-friendly and compatible with any locale and they serve this purpose far better than escapes in JS.

No, because &#123; is inherently meaningless without a specified charset, HTML entities are independent of charset and later translated.

I see, this issue becomes yet another discussion of whether JSX should be sugar that is as compatible as possible with XML/HTML syntax, or whether we should reduce its coverage, slowly moving towards JS. I don't buy the second way because it's no better than just using some kind of Hyperscript. If you want JS, you can write JS, but JSX is beautiful exactly because you can escape some of JS's painful points when dealing with structures and contents, such as unobvious nestings and foreign-locale escapes.

If you ask me, JSX should not expand to do more than is absolutely necessary, that is to introduce the concept of elements in a meaningful way. If we want to solve anything else then it should be considered independently and where possible proposed to ECMA instead so that everyone benefits and not just a partial subset of JSX content. "Foreign-locale escapes" sounds far more useful at the level of JS.

matthewwithanm · 2015-09-25T01:59:39Z

@gajus From HTML - yes, from XML - not so much (apart from JS injections).

Or namespaces or CDATA sections or comments…IMO there are a bunch of ways that it deviates.

I'm sympathetic to the DX argument, but IMO the best thing for DX is to keep the transformation as simple as possible. Also, the more similar JSX and XML are, the more confusing any deviation becomes.

If you ask me, JSX should not expand to do more than is absolutely necessary, that is to introduce the concept of elements in a meaningful way. If we want to solve anything else then it should be considered independently and where possible proposed to ECMA instead so that everyone benefits and not just a partial subset of JSX content. "Foreign-locale escapes" sounds far more useful at the level of JS.

👍

sebmck · 2015-09-25T02:07:12Z

If the purpose of JSX is to be agnostic to a certain target (that's not always HTML) then does it really make sense to allow HTML entities?

sebmarkbage · 2015-09-25T03:06:18Z

If we get buy in, will we have any problems making the switch? I.e. will we risk a long lived fork? The codemod should be safe.

sebmck · 2015-09-25T03:26:29Z

Do we have any stats (or anecdotal evidence) on how widely used HTML entities in JSX are?

sebmarkbage · 2015-09-25T04:10:56Z

Or backslashes...

sebmck · 2015-09-25T04:13:03Z

Oh right. I've actually broken backslashes in JSX attributes before in Babel and it took over 7 days for someone to notice and file an issue: babel/babel#2114.

NekR · 2015-09-25T08:35:49Z

I believe that entities (or other specific things) should be handled by the renderer which transforms JSX-output to HTML DOM/HTML string, but not by the transformer which transforms JSX to JSX-output.

syranide · 2015-09-25T09:09:52Z

@NekR It would then apply to all strings equally so even user input would be subject to HTML entity decoding (aside from it being a runtime cost too), you definitely do not want that.

NekR · 2015-09-25T10:28:07Z

@syranide what is user input in JSX? I did not say everything in runtime should be parsed with entities.

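class EntitiesString {
  constructor(str) {
    // Decode entities once, when the wrapper is created.
    this.str = myLibraryDoesHTMLEntytiesParsingHere(str);
  }

  toString() {
    // Return the decoded value stored on the instance.
    return this.str;
  }
}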

<div>{ new EntitiesString('&nbsp;') }</div>

syranide · 2015-09-25T10:42:03Z

...by the renderer which transforms JSX-output to HTML DOM/HTML string...

@NekR I interpreted that differently. IMHO what you are proposing is runtime decoding (which is for everyone to decide on their own) and outside this discussion about entities/escape codes in JSX source code. EDIT: That is to say, JSX needs to support escaping to some extent (like { and <), regardless of whether or not JSX will drop support for HTML entities.

NekR · 2015-09-25T10:56:48Z

@NekR I interpreted that differently.

Yes, I meant that renderers are responsible for parsing entities. One could support EntitiesString, another might not.

. IMHO what you are proposing is runtime decoding (which is for everyone to decide on their own) and outside this discussion about entities/escape codes in JSX source code.

Of course I do not propose such decoding method here for JSX, it's implementation detail of JSX consumers. What I am saying is that entities parsing on a transpilation stage is not needed (because of runtime possibilities) and hence it's in scope of this discussion, right?

EDIT: That is to say, JSX needs to support escaping to some extent (like { and <), regardless of whether or not JSX will drop support for HTML entities.

Hmm.. <div>{ '{test}' } { '<div>' }</div> seems like it's escaped?

syranide · 2015-09-25T11:14:36Z

What I am saying is that entities parsing on a transpilation stage is not needed (because of runtime possibilities) and hence it's in scope of this discussion, right?

IMHO no, entity parsing during transpilation and runtime decoding of entities are "complementary". Runtime decoding of static source code strings in this context is inefficient and cumbersome.

Hmm.. <div>{ '{test}' } { '<div>' }</div> seems like it's escaped?

Produces React.createElement('div', null, '{test}', '<div>') and yeah it will visually render the same as it would if you had {'{test}<div>'}, but it's not the same. So yes, you can work around the problem that way (but you're inserting a JS string, not escaping in JSX). However, this all-or-nothing if you don't want to affect runtime behavior is really inconvenient, especially considering <div>{'&nbsp;'}</div> is very different from <div>&nbsp;</div> at present.
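The structural difference described here can be sketched without React. The `h` function below is a hypothetical stand-in for `React.createElement` that only records its arguments. It is not the real implementation. It is fed the argument lists as quoted in the comment above.

```javascript
// Hypothetical stand-in for React.createElement: it only records its arguments.
function h(type, props, ...children) {
  return { type, props, children };
}

// The call quoted above for the two separate expression containers:
// two string children.
const split = h("div", null, "{test}", "<div>");

// The single expression container {'{test}<div>'}: one string child.
const joined = h("div", null, "{test}<div>");

console.log(split.children.length); // 2
console.log(joined.children.length); // 1

// The concatenated text is identical, which is why both render the same.
console.log(split.children.join("") === joined.children[0]); // true
```

The rendered text matches, but the element receives a different list of children, which is the sense in which wrapping pieces in JS strings is a workaround and not an escape inside JSX text.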

NekR · 2015-09-25T11:53:26Z

IMHO no, entity parsing during transpilation and runtime decoding of entities are "complementary". Runtime decoding of static source code strings in this context is inefficient and cumbersome.

Sorry, but the topic is "Document parsing of (X)HTML entities, or drop it even?" and I am saying: drop it. How is it not related? Runtime parsing was suggested as a solution. Someone who does not want a runtime solution could write a plugin which pre-parses entities to JS escapes or something like that. But you are not even listening to me. What I am saying is that it makes sense for JSXText to equal a simple JS string (sugar). These two should be equivalent: <div>&nbsp;</div> and <div>{'&nbsp;'}</div>.

Runtime decoding of static source code strings in this context is inefficient and cumbersome.

This is only problem of React since it's doing re-render on every move. I use JSX in a different way and it's perfectly fine for me.

So yes, you can work around the problem that way (but you're inserting a JS string, not escaping in JSX). However, this all-or-nothing if you don't want to affect runtime behavior is really inconvenient, especially considering <div>{'&nbsp;'}</div> is very different from <div>&nbsp;</div> at present.

Why do we need workarounds or escapes in JSX? Just have JS strings everywhere. I do not see any difference here except that entity parsing at transpilation time is a benefit for React.

P.S. Interesting that you made this repository public and asked for feedback from non-React implementations and when people came here with their opinions, you say: "This is not related". Just make it private repository and then no problem with "not related".

syranide · 2015-09-25T12:39:41Z

P.S. Interesting that you made this repository public and asked for feedback from non-React implementations and when people came here with their opinions, you say: "This is not related". Just make it private repository and then no problem with "not related".

@NekR I'm only one collaborator of many, these are my opinions. Feel free to refute them, but there are many things to consider. If I didn't care about your opinion I wouldn't have responded.

Sorry, but topic is "Document parsing of (X)HTML entities, or drop it even?" and I am saying: Drop it. How it's not related? Runtime parsing was suggested as a solution.

Decoding at compile-time (source code and static strings) and run-time (dynamic strings) can both co-exist and make sense. In the context of language design, run-time decoding being possible is not an argument against a syntax feature, nor vice versa. They are solutions to different problems.

Yes, we both agree that HTML entities should be dropped; that's not what I objected to. I undoubtedly think that is the way forward, but the holes left behind by dropping HTML entities still need to be considered, and runtime decoding does not fill them.

NekR · 2015-09-25T17:11:21Z

In the context of language design, run-time decoding being possible is not an argument against a syntax feature, nor vice versa.

I saw many such arguments and decisions in TC-39, but okay, if you do not accept this as an argument then never mind.

runtime decoding is not it.

Why? Where is the big performance problem with it, apart from React's constant re-rendering?

dantman · 2016-01-05T05:30:11Z

Personally I don't think the DX argument is valid. And that is not through an expectation of everyone using character maps, etc...

JSX is JavaScript and it doesn't really make sense that the solution when writing JS+JSX to "I can't type © with my keyboard" is "You can use &copy; in JSX strings but you're SOL in every other part of the JS". Which of course leads to a mess like:

<Foo
    label="I can &copy; here"
    legal={__('This site \u00A9 2016 Acme Media Inc.')} />

Same code. But you can use &copy; in one part of the JSX and you can't in the other because you have something – which doesn't have to be i18n, it can be collection processing or anything else – that requires that one of the strings be part of JS space and not JSX space.

If this is a problem, it is a problem universal to JS and not one that should have a JSX-only fix.

Rather I think the solution is to embrace the fact we're writing JS and fix this with JS. Specifically, given #25 I think the solution to "I can't type © with my keyboard and don't want to use a character map, C&P, or use some other tooling" is this.

var ent = require('character-entities');

<Foo
    label=`I can ${ent.copy} here`
    legal={_(`This site ${ent.copy} 2016 Acme Media Inc.`)} />
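The same idea can be tried in plain JavaScript today, without template-literal attributes. In the sketch below, `ent` is a small hypothetical lookup table standing in for whatever an entity package would export, and `__` is a hypothetical pass-through translation function. Neither reflects a real library's API.

```javascript
// Hypothetical entity table; a real package would presumably cover far more names.
const ent = {
  copy: "\u00A9",
  nbsp: "\u00A0",
  mdash: "\u2014",
};

// Hypothetical i18n function that returns its input unchanged.
const __ = (message) => message;

// The same lookup works in any JS string, inside or outside an i18n call.
const label = `I can ${ent.copy} here`;
const legal = __(`This site ${ent.copy} 2016 Acme Media Inc.`);

console.log(label); // I can © here
console.log(legal); // This site © 2016 Acme Media Inc.
```

Because the lookup is ordinary JS, it behaves identically wherever a string is built, which is the consistency the comment above is asking for.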

## Summary

Let's be faithful to the de-facto and document the HTML entity behaviors to the spec. Note that this is not about whether we should "drop this semantics or not", but about documenting the current behaviors that everyone has been living with for years.

### The Proposed Normative Change

I'm not aware of any practices specifying such transpiler/transform semantics in ECMA-262 so this is a really interesting attempt 🙂 So I ended up extending `Static Semantics: SV` which is the smartest way I can find to hack the semantics into the ECMA-262 spec. I think this should work and should be accurate enough. I'm curious on how implementors think about it though.

<del>I also intentionally left the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set. This may be seen as a breaking change in some regard and **this is open to discuss here**. </del>

We've reached consensus that only HTML4 entities are allowed.

This commit also close #133 by using `::` for characters which are supposed to be lexical grammars.

Close #126
Close #4

## Test Plan

open `index.html` and proof-read the spec ;)

syranide mentioned this issue Aug 29, 2014

Extend JSXText with Comment? #7

Open

sebmarkbage mentioned this issue Sep 2, 2015

Allow template literal in JSXAttributeValue #25

Open

lo1tuma mentioned this issue Jan 13, 2016

Make jsx-quotes fixable eslint/eslint#4932

Merged

sebmarkbage mentioned this issue Oct 12, 2016

JSX 2.0 #65

Open

Haroenv mentioned this issue Mar 20, 2017

Highlighter doesn't parse html entities algolia/instantsearch#2043

Closed

lydell mentioned this issue Sep 27, 2017

jsx-curly-brace-presence removes braces that are required jsx-eslint/eslint-plugin-react#1449

Closed

nojvek mentioned this issue Jun 19, 2020

RFC: An evolved JSX 2.0 proposal #124

Open

tolmasky mentioned this issue Feb 10, 2022

Add TemplateLiteral support to JSXAttribute. #132

Closed

This was referenced Feb 24, 2022

[Normative] Capture the HTML entity behaviors Huxpro/jsx#1

Closed

[Normative] Capture the HTML entity behaviors #136

Merged

Huxpro added the Impl Reality Reality that the spec does not capture label Feb 25, 2022

Huxpro closed this as completed in #136 Mar 1, 2022
