Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser tests without error cases #102

Open
wycats opened this issue Aug 17, 2017 · 19 comments
Open

Parser tests without error cases #102

wycats opened this issue Aug 17, 2017 · 19 comments

Comments

@wycats
Copy link

wycats commented Aug 17, 2017

We attempt to implement a valid parser for a large subset of valid HTML in Glimmer (including SVG, including foreign elements), but unfortunately we cannot use these tests.

The reason is that the tests liberally make use of invalid HTML (missing doctypes, no closing tags, etc.) for convenience, but since Glimmer is an offline build tool, we report real errors in these cases.

Right now, we make a best-effort attempt at conformance (we fix bugs when they're reported) but I just noticed a bug in our table handling that makes me wish we could just use an exhaustive suite directly.

Is there any interest in avoiding gratuitous use of invalid HTML in the tests, or is it too late for that?

@inikulin
Copy link
Member

inikulin commented Aug 17, 2017

Is there any interest in avoiding gratuitous use of invalid HTML in the tests, or is it too late for that?

Testing of invalid HTML is one of reasons why this test suite exists in the first place. However, we are currently working on standardisation of parse errors both in spec and tests. Part for tokenisation was already landed and we just started work on tree construction stage. So eventually you will be able to get subset of test suite for valid HTML by just filtering out tests that have parse errors specified.

@wycats
Copy link
Author

wycats commented Aug 17, 2017

Testing of invalid HTML is one of reasons why these test suite exists in the first place.

I certainly want parse errors to be specified and in fact it's critical to me that I report the correct parse errors in my offline compiler.

However, we are currently working on standardisation of parse errors both in spec and tests. Part for tokenisation was already landed and we just started work on tree construction stage. So eventually you will be able to get subset of test for valid HTML by just filtering out tests that have parse errors specified.

Today, many tests use parse errors gratuitously. For example, it's hard (maybe impossible, but I can't tell) to find a test for the (valid) case of inserting a <tbody> if a <tr> is found directly inside of a <table>.

Here's an example test:

#data
<table><td>
#errors
(1,7): expected-doctype-but-got-start-tag
(1,11): unexpected-cell-in-table-body
(1,11): expected-closing-tag-but-got-eof
#document
| <html>
|   <head>
|   <body>
|     <table>
|       <tbody>
|         <tr>
|           <td>

This test is really testing a very simple, valid thing, but because of the way it happens to be tested, it ends up producing a lot of errors that I would have to somehow know how to ignore. It's not obvious to me how to ignore these errors reliably. Maybe there's a good trick?

Also, is the idea of the standardization to make tests like this one avoid gratuitously invalid content? That's more or less what I'm asking for.

@inikulin
Copy link
Member

OK, I'm a bit confused now. What's the problem with errors if your parser can recover from errors like any spec compatible parser do? Or I'm missing something?

@inikulin
Copy link
Member

inikulin commented Aug 17, 2017

Or you just want more test cases to be added for valid markup? If so, then it's up to you, we are happy to accept PRs.

@wycats
Copy link
Author

wycats commented Aug 17, 2017

What's the problem with errors if your parser can recover from errors like any spec compatible parser do?

According to the spec:

This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must abort processing at the first error that they encounter for which they do not wish to apply the rules described below.

We abort, in conformance with the spec, but the tests don't have a mode to test this conformant scenario.

@inikulin
Copy link
Member

Thank you for the clarification. The only thing that we can do at the moment is just to add more tests for valid scenarios. Any contribution is welcome.

@wycats
Copy link
Author

wycats commented Aug 17, 2017

@inikulin would you take patches to add missing doctypes and closing tags in existing tests that gratuitously do not include them?

@wycats
Copy link
Author

wycats commented Aug 17, 2017

@inikulin another possible refactoring could be to differentiate between real errors and errors that only arise as a result of early termination of the HTML document.

@inikulin
Copy link
Member

We would like to keep existing tests as is, since they are widely used by implementations that gracefully handle these errors (i.e. browsers). But we are open to accept new tests even if they are just valid versions of old one, you can place them in separate file, e.g. valid.dat and valid.json.

@wycats
Copy link
Author

wycats commented Aug 17, 2017

@inikulin would you be open to adding more metadata to existing tests, like indicating whether an error is due to early EOF? Can you think of a way to do that that wouldn't break existing consumers?

@inikulin
Copy link
Member

Well, parse errors is such metadata. It's well documented which tokenisation errors occur due to unexpected EOF.

@inikulin
Copy link
Member

inikulin commented Aug 17, 2017

Not implemented yet for tree construction stage, though

@wycats
Copy link
Author

wycats commented Aug 17, 2017

@inikulin what I'm saying is that from my perspective there are "errors I am specifically trying to assert occur" and "errors that occur as a side effect of the way we wrote the test".

Errors that occur due to unexpected EOF can be treated specially in my tests, because I have the partially constructed tree that I can compare.

Maybe all I really need (and maybe this is already true) is that all errors that occur due to unexpected EOF end in something like but-got-eof. It's probably even sufficient for me to enumerate the errors myself, but this would be brittle if things change and more but-got-eof style errors were added in the future.

@inikulin
Copy link
Member

It's probably even sufficient for me to enumerate the errors myself, but this would be brittle if things change and more but-got-eof style errors were added in the future.

It's very unlikely that there will be new errors added in near future and if they would it will be a significant change to HTML syntax itself, so you will need to update your parser anyway. Meanwhile, you can find standard parse errors and their descriptions in spec: https://html.spec.whatwg.org/multipage/parsing.html#parse-errors

@inikulin
Copy link
Member

inikulin commented Aug 17, 2017

(Apart from tree construction stage errors that are not in spec yet)

@inikulin
Copy link
Member

inikulin commented Aug 17, 2017

My vision is that if any implementation requires some subset of tests based on some assertions it should create these subset on it's side and we provide all required prerequisites for it, such as standard and well documented error codes. Or if it's missing some test cases we are open to accept new tests. I don't see any reason why some ad hoc solutions should get into this test suite.

@inikulin
Copy link
Member

inikulin commented Aug 17, 2017

Thinking of it a bit more, we can add support for implementations that bailout on parse errors. We can achieve this by inserting marker in expected markup that shows position of first parse error. So during testing it will be possible to skip comparison of AST after this marker. However, it might not work in all cases, such as when error recovery mechanism modify AST that is already parsed before error point, e.g. in case of foster parenting. But we can mark these tests somehow as well, so they can be completely skipped \cc @gsnedders

@wycats
Copy link
Author

wycats commented Aug 18, 2017

@inikulin if there's a strategy that you're happy with, I'm willing to do some work to update the tests in that way.

One issue with the current tests is that a lot of the cases of "first parse error" is "missing doctype".

I've been thinking more about the issue of keeping the existing suite working for existing impls. One approach might be to accept some improvements to the metadata description, but write a script to emit the original format. We could then confirm 0 diffs in the original format, but I could enrich the test definitions with more information.

@inikulin
Copy link
Member

inikulin commented Sep 1, 2017

@wycats Sorry, I've missed your last reply somehow. In my opinion for now the best solution will be the one that doesn't require any change in the existing test format. So, I vote for adding new tests in the separate file (they can be modified existing tests).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants