Should the 'empty cell' definition ignore Default_ignoreable code points? #4854

Closed
dd8 opened this issue Aug 20, 2019 · 13 comments · Fixed by #4860

@dd8
Contributor

dd8 commented Aug 20, 2019

The definition used to discard empty table header cells doesn't discard cells containing only Default_ignorable characters:

A cell is said to be an empty cell if it contains no elements and its text content, if any, consists only of White_Space characters.

https://html.spec.whatwg.org/multipage/tables.html#empty-cell

... but the CSS spec says that:

As required by [UNICODE], unsupported Default_ignorable characters must be ignored for rendering.
https://drafts.csswg.org/css-text-3/#white-space-processing
http://unicode.org/faq/unsup_char.html

This leads to the following table having:

  • no rendered header content in any TH
  • the first two TH match the empty cells definition
  • the second two TH don't match the empty cells definition
<table>
<tr>
  <th>&#x0020;</th> <!-- ASCII space -->
  <th>&#x00A0;</th> <!-- no-break space (Unicode White_Space, but not ASCII whitespace) -->
  <th>&#x200B;</th> <!-- zero width space (not classified as Unicode White_Space) -->
  <th>&#xFEFF;</th> <!-- zero width no-break space (not classified as Unicode White_Space) -->
</tr>
</table>
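
The claim about the four header cells can be checked with a minimal sketch of the empty-cell test as quoted above (text consisting only of White_Space characters). The function name `is_empty_cell_white_space` and the hand-written `WHITE_SPACE` set are illustrative assumptions, not taken from any spec or library; the set lists the code points that carry the Unicode White_Space property.

```python
# Code points with the Unicode White_Space property, written out by hand
# for this sketch (an assumption of the example, not an imported table).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))           # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))         # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)


def is_empty_cell_white_space(text: str) -> bool:
    """Hypothetical helper: True if the text is only White_Space characters."""
    return all(ord(ch) in WHITE_SPACE for ch in text)


# Text content of the four TH cells from the table above.
cells = {
    "U+0020": "\u0020",
    "U+00A0": "\u00A0",
    "U+200B": "\u200B",
    "U+FEFF": "\uFEFF",
}

for name, text in cells.items():
    print(name, is_empty_cell_white_space(text))
```

Under this definition only the first two cells count as empty, even though none of the four renders any visible content.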

@domenic
Member

domenic commented Aug 22, 2019

Hmm. It's not clear to me what the right thing to do here is. These definitions serve rather different purposes. But, having them be aligned might be sort of nice?

In particular, the CSS definition is about what is to be rendered. It's saying that certain characters should not be rendered.

The HTML definition is about the data model. In particular the only usage of that definition is in the line "Remove all the empty cells from the header list." So it's trying to say that if you have an only-whitespace table header, then it's not a real table header; it's just kind of hanging out. For example the one in the corner of the following table:

         Good         Evil
Lawful   Superman     Darkseid
Chaotic  Green Arrow  The Joker

The rules for what should be rendered on the screen, and the rules for interpreting the data model, are distinct. Generally we separate style and content so there's no a-priori reason why they should be the same.

The more relevant question is, if someone puts a zero width space or zero width no-break space in that upper-left corner, should it still count as "not a real table header"? Probably, but it's not 100% obvious, since the author went and did something very specifically outside the beaten path of denoting not-real-headers using whitespace.

Additionally, if we're going to increase the set of characters that allow you to denote something as "not a real table header", is default-ignorables really the right cutoff? I feel like there's a whole bunch of Unicode characters which, if they were the sole occupants of a <th>, would make me think it was not really a header cell. For example, arguably control characters would fit in there.

So I'm not opposed to this change, but I'm also not sure it's worth tweaking this, and I find the reasoning so far a bit weaker than I'd like.

/cc @tabatkins @fantasai if they have thoughts from the CSS side.

@fantasai
Contributor

Wrt Default_Ignorable: that line isn't saying CSS shouldn't be paying attention to those characters ever, only that they are ignored when rendering the text. I'll clarify that point in css-text-3. (Also notice that this rule only applies to unsupported Default_ignorable codepoints.)

Wrt what HTML should say: imho the only characters that HTML should ignore as not being significant content are the document white space characters. This definition should be aligned with the Selectors definition of :empty. In particular, the Unicode definition of White_Space is not appropriate for this purpose.

fantasai added a commit to w3c/csswg-drafts that referenced this issue Aug 23, 2019
@dd8
Contributor Author

dd8 commented Aug 23, 2019

> Wrt what HTML should say: imho the only characters that HTML should ignore as not being significant content are the document white space characters. This definition should be aligned with the Selectors definition of :empty. In particular, the Unicode definition of White_Space is not appropriate for this purpose.

+1 for making them consistent. My reason for raising this is that there are inconsistencies in white space definitions between various specs, and this can lead to weird situations where one spec considers something empty, but another doesn't.

Accessibility APIs are among the main consumers of the headers algorithm that uses the 'empty cell' definition. They also use the accname algorithm (https://www.w3.org/TR/accname-1.1/) to compute the accessible name of various elements, including table headers.

Problems arise if various specs disagree on what empty or whitespace means in document content (e.g. :empty in CSS vs 'empty cell' in HTML vs empty accessible name in accname).

> The more relevant question is, if someone puts a zero width space or zero width no-break space in that upper-left corner, should it still count as "not a real table header"? Probably, but it's not 100% obvious, since the author went and did something very specifically outside the beaten path of denoting not-real-headers using whitespace.

U+FEFF Zero Width No-Break Space can appear without the author intending it: because it doubles as the UTF-16 BOM, it can sneak into the middle of content. For example, saving a blank Windows Notepad document produces a file containing only U+FEFF, which can end up in the middle of content through careless use of cat or something like <th><!--#include virtual="../notepad.txt" --></th>.

Edit: It's particularly hard to know this has happened because you usually need a hexdump to see the zero width character.
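
A minimal sketch of how a BOM can end up inside a cell, assuming three fragments that were each saved as UTF-16LE with a BOM and then concatenated byte-for-byte. The helper `save_as_utf16` is hypothetical and only stands in for an editor that always writes a BOM; the fragments are illustrative.

```python
import codecs


def save_as_utf16(text: str) -> bytes:
    """Hypothetical stand-in for an editor that always writes a UTF-16LE BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Three separately saved fragments; the middle one is a blank document.
parts = [save_as_utf16("<th>"), save_as_utf16(""), save_as_utf16("</th>")]

# Byte-level concatenation, as a careless `cat` would do.
combined = b"".join(parts)

# The decoder consumes only the leading BOM; later BOMs decode as U+FEFF.
decoded = combined.decode("utf-16")
cell_text = decoded[len("<th>"):-len("</th>")]

print([f"U+{ord(ch):04X}" for ch in cell_text])
print(cell_text.strip() == cell_text)  # whitespace stripping does not remove U+FEFF
```

The cell looks blank when rendered, yet its text content is not empty and is not removed by ordinary whitespace trimming.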

@annevk
Member

annevk commented Aug 23, 2019

I agree with @fantasai that HTML using White_Space rather than ASCII whitespace is suspect. Anything that is White_Space but not ASCII whitespace would already be outside the "beaten path".

Created #4860.

annevk added a commit that referenced this issue Aug 23, 2019
For semantics of something being empty (table cells in this case) we should only consider ASCII whitespace, as we do elsewhere.

Fixes #4854.
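
As a rough illustration of the text part of an ASCII-whitespace-only rule, the sketch below trims only TAB, LF, FF, CR and SPACE. The name `is_empty_cell_text` is an assumption made for this example, and the check on child elements is left out.

```python
# ASCII whitespace: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\f\r "


def is_empty_cell_text(text: str) -> bool:
    """Hypothetical helper: True if the text is empty or only ASCII whitespace."""
    return text.strip(ASCII_WHITESPACE) == ""


for label, text in [
    ("U+0020", "\u0020"),
    ("U+00A0", "\u00A0"),
    ("U+200B", "\u200B"),
    ("U+FEFF", "\uFEFF"),
]:
    print(label, is_empty_cell_text(text))
```

With this rule a cell holding only a no-break space is no longer treated as empty, so it joins the zero width characters on the "not empty" side.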
@dd8
Contributor Author

dd8 commented Aug 23, 2019

There's also the empty-cells: property to consider. It would seem odd if:

a) the 'empty cells' definition in HTML and the empty-cells: property don't match the same cells
b) cells matching the 'empty cells' definition in HTML don't match td:empty, th:empty

This might be difficult to resolve. Looking at the definitions it seems that empty-cells: and td:empty, th:empty don't always apply to the same set of cells (but I could be mis-reading the spec).

<table>
<tr><td> </td></tr>
<tr><td style="white-space:normal"> <!--collapsed white space --> </td></tr>
<tr><td style="white-space:pre">  <!--non-collapsed white space -->  </td></tr>
</table>

My reading of the spec is that all the TDs are matched by td:empty
https://drafts.csswg.org/selectors-4/#the-empty-pseudo

... but only the first and second TDs have the empty-cells: property applied:
https://www.w3.org/TR/CSS2/tables.html#propdef-empty-cells

Edit: I had mis-read the spec, but I think there's still a difference, and edited the example to show this

@dd8
Contributor Author

dd8 commented Aug 23, 2019

Some of these problems seem to be layering issues. The original intent of the header cells algorithm in HTML 4 was rendering by non-visual user agents (i.e. screen readers)
https://www.w3.org/TR/html401/struct/tables.html#h-11.4
https://www.w3.org/TR/html401/struct/tables.html#h-11.4.3

So the question is - is this algorithm in the correct layer? Does anything other than a screen reader need to know about the association of cells and headers? If not, is the HTML spec the right place for a screen reader rendering algorithm?

Screen readers do what they say on the tin and read out the screen, so they need to take screen rendering into account (e.g. CSS display:none content isn't usually read). They also pull in information from ARIA when deciding what to read for an element.

As currently specified the table headers algorithm discards empty cells that have been given a screen reader name by other means (e.g. HTML title attribute or aria-label attribute). It also includes headers that may have been hidden by display:none. As currently specified this doesn't match up with how other content presented via a screen reader is handled.

For example:

<table>
<tr>
<th aria-label="Year"></th>
<th>Sales</th>
<th style="display:none">Costs</th>
</tr>
</table>
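
The mismatch can be made concrete with a small sketch that applies an empty-cell test (no child elements, text only ASCII whitespace) to the table above, using Python's standard `html.parser`. The class `HeaderCellCollector` and the function `is_empty_cell` are hypothetical names for this example; the sketch deliberately ignores ARIA and CSS and does not implement the full header association algorithm.

```python
from html.parser import HTMLParser

ASCII_WHITESPACE = "\t\n\f\r "

TABLE = """
<table>
<tr>
<th aria-label="Year"></th>
<th>Sales</th>
<th style="display:none">Costs</th>
</tr>
</table>
"""


class HeaderCellCollector(HTMLParser):
    """Hypothetical collector: records attributes, text and child elements of each TH."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._current = {"attrs": dict(attrs), "text": "", "has_elements": False}
        elif self._current is not None:
            # Any element nested inside the TH makes the cell non-empty.
            self._current["has_elements"] = True

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "th" and self._current is not None:
            self.cells.append(self._current)
            self._current = None


def is_empty_cell(cell) -> bool:
    """No child elements and only ASCII whitespace as text."""
    return not cell["has_elements"] and cell["text"].strip(ASCII_WHITESPACE) == ""


collector = HeaderCellCollector()
collector.feed(TABLE)
for cell in collector.cells:
    print(cell["attrs"], repr(cell["text"]), is_empty_cell(cell))
```

The cell labelled via aria-label is classified as empty and would be dropped from the header list, while the display:none cell is kept, because the test looks only at elements and text content.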

@annevk
Member

annevk commented Aug 23, 2019

I think we should have some kind of algorithm that does not rely on CSS for obtaining this information, as it's a rather intrinsic part of data tables.

The definition seems to match :empty.

It is a little weird that empty-cells uses a different definition of empty though, but that's not one we can use in HTML as it would make it dependent on layout.

@fantasai
Contributor

Agree with @annevk. I think the most important thing is to keep :empty and HTML aligned--neither of them should be depending on CSS (:empty literally can't be), and it would be good for them to match insofar as practical so that we have a consistent concept of “empty”.

Fwiw, empty-cells is ancient technology from the mid-90s; the dependency on white-space is a bit curious, but not something we need to be concerned about.

> Does anything other than a screen reader need to know about the association of cells and headers?

Yes, anything that's trying to parse out or transform the data needs to know that.

> If not, is the HTML spec the right place for a screen reader rendering algorithm?

No, but it's the right place to define the association of header and data cells.

@annevk
Member

annevk commented Aug 27, 2019

@dd8 does #4860 now seem acceptable to you as well?

@dd8
Contributor Author

dd8 commented Aug 27, 2019

Can you give me a couple of days on this? I'm investigating a related whitespace issue which might impact this.

@annevk
Member

annevk commented Aug 27, 2019

Sure thing.

@dd8
Contributor Author

dd8 commented Aug 29, 2019

I'm happy with the outcome in #4860 - it makes things more consistent.

I've been investigating the interop story for whitespace between different specs - it's not good:
https://gist.github.com/dd8/8a8149c2ec7093dcf8caae6b9645ac0b

For ASCII whitespace different specs disagree on whether these are whitespace:

  • U+000B Vertical Tab (white space in JavaScript, RegEx \s and most isspace() functions)
  • U+000C Form Feed (whitespace in HTML5, not whitespace in XHTML 1.0, and both whitespace and an unused code point in HTML 4.01 due to a DTD bug - using U+000C with an HTML 4 doctype reports validator error "non SGML character number 12")

There are definite implementation dangers to trap the unwary: using a RegEx pattern like \s or a built-in isspace() function may cause unexpected results in a tokenizer.
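
The trap is easy to reproduce. The sketch below uses Python as one example of a language whose `\s` pattern and `str.isspace()` follow a broader notion of whitespace than the five ASCII whitespace characters; `is_ascii_whitespace` is a hypothetical helper written for this comparison, and other languages may classify the same characters differently.

```python
import re

# ASCII whitespace: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = frozenset("\t\n\f\r ")


def is_ascii_whitespace(ch: str) -> bool:
    """Hypothetical helper: membership in the five ASCII whitespace characters."""
    return ch in ASCII_WHITESPACE


samples = {
    "U+000B": "\x0b",  # vertical tab
    "U+000C": "\x0c",  # form feed
    "U+0020": " ",     # space
    "U+00A0": "\xa0",  # no-break space
}

for name, ch in samples.items():
    print(
        name,
        is_ascii_whitespace(ch),
        re.fullmatch(r"\s", ch) is not None,
        ch.isspace(),
    )
```

For U+000B and U+00A0 the built-ins report whitespace while the ASCII check does not, so a tokenizer that needs ASCII whitespace semantics is safer with an explicit character set.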

@annevk
Member

annevk commented Aug 30, 2019

XHTML 1.0 and HTML 4.01 are obsolete and processed with an HTML5 processor, so that's fortunately no longer an issue. JavaScript's whitespace definition is based on that of Unicode and is therefore definitely different.

annevk added a commit that referenced this issue Aug 30, 2019
zcorpan pushed a commit that referenced this issue Nov 6, 2019