Should the 'empty cell' definition ignore Default_ignoreable code points? #4854

dd8 · 2019-08-20T11:11:49Z

The definition used to discard empty table header cells doesn't discard cells containing only Default_ignorable characters:

> A cell is said to be an empty cell if it contains no elements and its text content, if any, consists only of White_Space characters.

https://html.spec.whatwg.org/multipage/tables.html#empty-cell

... but the CSS spec says that:

> As required by [UNICODE], unsupported Default_ignorable characters must be ignored for rendering.
https://drafts.csswg.org/css-text-3/#white-space-processing
http://unicode.org/faq/unsup_char.html

This leads to the following table having:

no rendered header content in any TH
the first two TH match the empty cells definition
the second two TH don't match the empty cells definition

<table>
<tr>
  <th>&#x0020;</th> <!-- ASCII space -->
  <th>&#x00A0;</th> <!-- no-break space (Unicode White_Space, but not ASCII whitespace) -->
  <th>&#x200B;</th> <!-- zero width space (not classified as Unicode White_Space) -->
  <th>&#xFEFF;</th> <!-- zero width no-break space (not classified as Unicode White_Space) -->
</tr>
</table>
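
A minimal Python sketch of the definition quoted above, applied to the four cells in the example. The White_Space set is written out by hand from the Unicode property list, and `is_empty_cell_text` is a hypothetical helper that only looks at text content (it assumes the cell has no child elements):

```python
# Unicode White_Space code points, listed explicitly (25 in total).
UNICODE_WHITE_SPACE = frozenset(
    [chr(c) for c in range(0x0009, 0x000E)]      # TAB, LF, VT, FF, CR
    + [" ", "\u0085", "\u00A0", "\u1680"]
    + [chr(c) for c in range(0x2000, 0x200B)]    # EN QUAD .. HAIR SPACE
    + ["\u2028", "\u2029", "\u202F", "\u205F", "\u3000"]
)


def is_empty_cell_text(text, whitespace=UNICODE_WHITE_SPACE):
    """True if every character of the cell's text is in `whitespace`.

    Hypothetical helper: assumes the cell contains no child elements.
    """
    return all(ch in whitespace for ch in text)


# The four header cells from the example table, keyed by code point.
CELLS = {"U+0020": " ", "U+00A0": "\u00A0", "U+200B": "\u200B", "U+FEFF": "\uFEFF"}
results = {name: is_empty_cell_text(text) for name, text in CELLS.items()}
print(results)
```

Under these assumptions the first two cells count as empty and the last two do not, even though none of the four renders any visible content.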

domenic · 2019-08-22T21:57:35Z

Hmm. It's not clear to me what the right thing to do here is. These definitions serve rather different purposes. But, having them be aligned might be sort of nice?

In particular, the CSS definition is about what is to be rendered. It's saying that certain characters should not be rendered.

The HTML definition is about the data model. In particular the only usage of that definition is in the line "Remove all the empty cells from the header list." So it's trying to say that if you have an only-whitespace table header, then it's not a real table header; it's just kind of hanging out. For example the one in the corner of the following table:

|         | Good        | Evil      |
|---------|-------------|-----------|
| Lawful  | Superman    | Darkseid  |
| Chaotic | Green Arrow | The Joker |

The rules for what should be rendered on the screen, and the rules for interpreting the data model, are distinct. Generally we separate style and content so there's no a-priori reason why they should be the same.

The more relevant question is, if someone puts a zero width space or zero width no-break space in that upper-left corner, should it still count as "not a real table header"? Probably, but it's not 100% obvious, since the author went and did something very specifically outside the beaten path of denoting not-real-headers using whitespace.

Additionally, if we're going to increase the set of characters that allow you to denote something as "not a real table header", is default-ignorables really the right cutoff? I feel like there's a whole bunch of Unicode characters which, if they were the sole occupants of a <th>, would make me think it was not really a header cell. For example, arguably control characters would fit in there.

So I'm not opposed to this change, but I'm also not sure it's worth tweaking this, and I find the reasoning so far a bit weaker than I'd like.

/cc @tabatkins @fantasai if they have thoughts from the CSS side.

fantasai · 2019-08-23T00:06:56Z

Wrt Default_Ignorable: that line isn't saying CSS shouldn't be paying attention to those characters ever, only that they are ignored when rendering the text. I'll clarify that point in css-text-3. (Also notice that this rule only applies to unsupported Default_ignorable codepoints.)

Wrt what HTML should say: imho the only characters that HTML should ignore as not being significant content are the document white space characters. This definition should be aligned with the Selectors definition of :empty. In particular, the Unicode definition of White_Space is not appropriate for this purpose.

dd8 · 2019-08-23T05:42:47Z

> Wrt what HTML should say: imho the only characters that HTML should ignore as not being significant content are the document white space characters. This definition should be aligned with the Selectors definition of :empty. In particular, the Unicode definition of White_Space is not appropriate for this purpose.

+1 for making them consistent. My reason for raising this is that there are inconsistencies in white space definitions between various specs, and this can lead to weird situations where one spec considers something empty, but another doesn't.

Accessibility APIs are among the main consumers of the headers algorithm that uses the 'empty cell' definition. They also use the accname algorithm (https://www.w3.org/TR/accname-1.1/) to compute the accessible name of various elements, including table headers.

Problems arise if various specs disagree on what empty or whitespace means in document content (e.g. :empty in CSS vs 'empty cell' in HTML vs empty accessible name in accname).

> The more relevant question is, if someone puts a zero width space or zero width no-break space in that upper-left corner, should it still count as "not a real table header"? Probably, but it's not 100% obvious, since the author went and did something very specifically outside the beaten path of denoting not-real-headers using whitespace.

U+FEFF Zero Width No-Break Space can appear where the author doesn't intend it: because it doubles as a UTF-16 BOM, it can sneak into the middle of content. For example, saving a blank Windows Notepad document produces a file containing only U+FEFF, which can then end up in the middle of content through careless use of cat or by doing something like <th></th>.

Edit: It's particularly hard to know this has happened because you usually need a hexdump to see the zero width character.
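
As a rough alternative to a hexdump, a few lines of Python can list invisible format characters in a string. This is only a sketch: `find_format_chars` is a hypothetical helper, and it flags general category Cf (which includes U+FEFF and U+200B) rather than the full Default_Ignorable set.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, 'U+XXXX', name) for each character in general category Cf.

    Hypothetical helper; Cf is used as an approximation of "invisible"
    characters and is not the same set as Default_Ignorable_Code_Point.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Two header cells that look empty in most editors and browsers.
hits = find_format_chars("<th>\ufeff</th><th>\u200b</th>")
for hit in hits:
    print(hit)
```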

annevk · 2019-08-23T07:46:58Z

I agree with @fantasai that HTML using White_Space rather than ASCII whitespace is suspect. Anything that is White_Space but not ASCII whitespace would already be outside the "beaten path".

Created #4860.

For semantics of something being empty (table cells in this case) we should only consider ASCII whitespace, as we do elsewhere. Fixes #4854.
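
A sketch of what restricting the check to ASCII whitespace would mean in practice, assuming the Infra Standard's set (TAB, LF, FF, CR, SPACE). `is_empty_cell_text` is a hypothetical helper that only considers text content, not child elements:

```python
# ASCII whitespace: U+0009, U+000A, U+000C, U+000D, U+0020.
ASCII_WHITESPACE = frozenset("\t\n\f\r ")


def is_empty_cell_text(text):
    """True if the text is empty or consists only of ASCII whitespace.

    Hypothetical helper: assumes the cell contains no child elements.
    """
    return all(ch in ASCII_WHITESPACE for ch in text)


checks = {
    "empty string": is_empty_cell_text(""),
    "space": is_empty_cell_text(" "),
    "tab/newline/space": is_empty_cell_text("\t\n "),
    "no-break space": is_empty_cell_text("\u00A0"),
    "zero width space": is_empty_cell_text("\u200B"),
    "zero width no-break space": is_empty_cell_text("\uFEFF"),
}
print(checks)
```

With this set, a cell containing only U+00A0 is no longer treated as empty, and the zero-width characters stay non-empty as before.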

dd8 · 2019-08-23T08:39:42Z

There's also the empty-cells: property to consider. It would seem odd if:

a) the 'empty cells' definition in HTML and the empty-cells: property don't match the same cells
b) cells matching the 'empty cells' definition in HTML don't match td:empty, th:empty

This might be difficult to resolve. Looking at the definitions it seems that empty-cells: and td:empty, th:empty don't always apply to the same set of cells (but I could be mis-reading the spec).

<table>
<tr><td> </td></tr>
<tr><td style="white-space:normal"> <!--collapsed white space --> </td></tr>
<tr><td style="white-space:pre">  <!--non-collapsed white space -->  </td></tr>
</table>

My reading of the spec is ~~the both TDs are~~ all the TDs are matched by td:empty
https://drafts.csswg.org/selectors-4/#the-empty-pseudo

.. but ~~the only the first TD has~~ only the first and second TDs have the empty-cells: property applied:
https://www.w3.org/TR/CSS2/tables.html#propdef-empty-cells

Edit: I had mis-read the spec, but I think there's still a difference, and edited the example to show this

dd8 · 2019-08-23T12:48:21Z

Some of these problems seem to be layering issues. The original intent of the header cells algorithm in HTML 4 was rendering by non-visual user agents (i.e. screen readers)
https://www.w3.org/TR/html401/struct/tables.html#h-11.4
https://www.w3.org/TR/html401/struct/tables.html#h-11.4.3

So the question is - is this algorithm in the correct layer? Does anything other than a screen reader need to know about the association of cells and headers? If not, is the HTML spec the right place for a screen reader rendering algorithm?

Screen readers do what they say on the tin and read out the screen, so they need to take screen rendering into account (e.g. CSS display:none content isn't usually read). They also pull in information from ARIA when deciding what to read for an element.

As currently specified the table headers algorithm discards empty cells that have been given a screen reader name by other means (e.g. HTML title attribute or aria-label attribute). It also includes headers that may have been hidden by display:none. As currently specified this doesn't match up with how other content presented via a screen reader is handled.

For example:

<table>
<tr>
<th aria-label="Year"></th>
<th>Sales</th>
<th style="display:none">Costs</th>
</tr>
</table>
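
A rough sketch of how the 'empty cell' test treats this example, using Python's standard `html.parser` and assuming the ASCII-whitespace definition proposed in #4860. `HeaderCellScanner`, `scan`, and `is_empty_cell` are hypothetical names; the sketch ignores nested tables and the rest of the header-association algorithm, and it deliberately looks at neither `aria-label` nor CSS.

```python
from html.parser import HTMLParser

ASCII_WHITESPACE = frozenset("\t\n\f\r ")

EXAMPLE = """
<table>
<tr>
<th aria-label="Year"></th>
<th>Sales</th>
<th style="display:none">Costs</th>
</tr>
</table>
"""


class HeaderCellScanner(HTMLParser):
    """Collects each <th> with its attributes, text, and child-element flag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._current = {"attrs": dict(attrs), "text": "", "has_element": False}
        elif self._current is not None:
            self._current["has_element"] = True  # any child element makes it non-empty

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "th" and self._current is not None:
            self.cells.append(self._current)
            self._current = None


def scan(html):
    scanner = HeaderCellScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.cells


def is_empty_cell(cell):
    """No child elements, and text content (if any) is only ASCII whitespace."""
    return not cell["has_element"] and all(ch in ASCII_WHITESPACE for ch in cell["text"])


cells = scan(EXAMPLE)
empties = [is_empty_cell(c) for c in cells]
print(empties)
```

In this sketch the `aria-label="Year"` header is classified as empty and would be dropped from the header list, while the `display:none` header is kept, which is the mismatch described above.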

annevk · 2019-08-23T16:27:45Z

I think we should have some kind of algorithm that does not rely on CSS for obtaining this information, as it's a rather intrinsic part of data tables.

The definition seems to match :empty.

It is a little weird that empty-cells uses a different definition of empty though, but that's not one we can use in HTML as it would make it dependent on layout.

fantasai · 2019-08-23T20:43:20Z

Agree with @annevk. I think the most important thing is to keep :empty and HTML aligned--neither of them should be depending on CSS (:empty literally can't be), and it would be good for them to match insofar as practical so that we have a consistent concept of “empty”.

Fwiw, empty-cells is ancient technology from the mid-90s; the dependency on white-space is a bit curious, but not something we need to be concerned about.

> Does anything other than a screen reader need to know about the association of cells and headers?

Yes, anything that's trying to parse out or transform the data needs to know that.

> If not, is the HTML spec the right place for a screen reader rendering algorithm?

No, but it's the right place to define the association of header and data cells.

annevk · 2019-08-27T12:12:03Z

@dd8 does #4860 now seem acceptable to you as well?

dd8 · 2019-08-27T12:14:14Z

Can you give me a couple of days on this? I'm investigating a related whitespace issue which might impact this.

annevk · 2019-08-27T12:24:01Z

Sure thing.

dd8 · 2019-08-29T15:40:28Z

I'm happy with the outcome in #4860 - it makes things more consistent.

I've been investigating the interop story for whitespace between different specs - it's not good:
https://gist.github.com/dd8/8a8149c2ec7093dcf8caae6b9645ac0b

For ASCII whitespace different specs disagree on whether these are whitespace:

U+000B Vertical Tab (white space in JavaScript, RegEx \s and most isspace() functions)
U+000C Form Feed (whitespace in HTML5, not whitespace in XHTML 1.0, and both whitespace and an unused code point in HTML 4.01 due to a DTD bug - using U+000C with an HTML 4 doctype reports validator error "non SGML character number 12")

There are definite implementation dangers to trap the unwary: using a RegEx pattern like \s or a built-in isspace() function may cause unexpected results in a tokenizer.
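
To illustrate the trap with Python's `re` module (a sketch; `split_tokens` and the two pattern names are hypothetical): `\s` also matches U+000B and U+00A0, whereas an explicit character class can be limited to the five ASCII whitespace characters HTML uses.

```python
import re

# Explicit class for HTML's ASCII whitespace: TAB, LF, FF, CR, SPACE.
HTML_ASCII_WHITESPACE = re.compile(r"[\t\n\f\r ]")
# Generic whitespace as understood by the regex engine.
GENERIC_WHITESPACE = re.compile(r"\s")


def split_tokens(text, pattern):
    """Split `text` on single characters matching `pattern`, dropping empty tokens."""
    return [token for token in pattern.split(text) if token]


sample = "a\x0bb c\u00a0d"  # contains VT, SPACE, and NO-BREAK SPACE
generic = split_tokens(sample, GENERIC_WHITESPACE)
html_like = split_tokens(sample, HTML_ASCII_WHITESPACE)
print(generic)
print(html_like)
```

The generic pattern splits on all three separators, while the explicit class splits only on the ASCII space, so the two tokenizers disagree on the same input.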

annevk · 2019-08-30T07:17:25Z

XHTML 1.0 and HTML 4.01 are obsolete and processed with an HTML5 processor so that's fortunately no longer an issue. JavaScript's whitespace definition is based on that of Unicode and definitely different therefore.

annevk added the topic: table label Aug 20, 2019

fantasai added a commit to w3c/csswg-drafts that referenced this issue Aug 23, 2019

[css-text-3] Clarify that Default_ignorable is ignored for text rende…

f0d7910

…ring, not for all of CSS. whatwg/html#4854

annevk added a commit that referenced this issue Aug 23, 2019

Stop using White_Space

3272ecb

annevk mentioned this issue Aug 23, 2019

Stop using White_Space #4860

Merged

dd8 mentioned this issue Aug 28, 2019

Some space characters missing from glossary/whitespace.md act-rules/act-rules.github.io#642

Open

annevk closed this as completed in #4860 Aug 30, 2019

annevk added a commit that referenced this issue Aug 30, 2019

Stop using White_Space

635a599

zcorpan pushed a commit that referenced this issue Nov 6, 2019

Stop using White_Space

b36d8eb
