Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port HtmlParser over from amphtml #272

Open
schlessera opened this issue Jul 7, 2021 · 2 comments
Open

Port HtmlParser over from amphtml #272

schlessera opened this issue Jul 7, 2021 · 2 comments

Comments

@schlessera
Copy link
Collaborator

The validator cannot be built on top of the Dom\Document we are using for the sanitizer and optimizer, as it requires precise line/column/length coordinates for pinpointing validation issues in the source files.

The NodeJS validator uses a SAX parser to traverse the HTML, with the actual validation engine being a handler that gets triggered by the SAX events, i.e. startTag(), endTag(), ...

After looking at existing HTML SAX parsers in PHP, my conclusion is to port over the parser implementation from NodeJS instead of reusing an existing PHP HTML SAX parser, for the following reasons:

  • no implementation was recently maintained;
  • third-party dependencies should be avoided whenever we can for the toolbox;
  • a lot of the hard-coded logic of the parser is already found in the toolbox because we needed parts for other tools...
  • ...therefore it can be ported with only modest effort.
@westonruter
Copy link
Member

What about using Masterminds? https://github.com/Masterminds/html5-php

Cf. ampproject/amp-wp#3293

@schlessera
Copy link
Collaborator Author

As already discussed in the meeting, Masterminds HtmlParser doesn't fulfil the condition for solving the original issue ampproject/amp-wp#3293, and I'd like to avoid pulling it in as a dependency just for the SAX parser, knowing that there's no guarantee that the parsing will result in the exact same tokenization than the one from the NodeJS library. Given the porting work is not huge, avoiding this dependency seems preferable to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants