html5lib parses certain types of content extremely slowly #15076
Comments
Is this when running on the VM or via dart2js? Routing to VM because they might be interested in performance reports like this. (FWIW I've heard the same report when trying to parse a large SVG document. Fast via dart2js but slower via VM.) cc @jmesserly.
This comment was originally written by [email protected] Sorry I wasn't clear. This happens on the VM. Latest stable.
Can you please give a bit more context on what your example is doing? In particular, I am missing how you allocated the htmlParser object and where you got its class from. Since this is filed against parsing, I assume we can just download the example URL once and feed a test from the file, correct? Added NeedsInfo label.
This comment was originally written by [email protected] Sure, I don't see why we can't save it to a file. Here's a full code snippet to try against: library test; import 'package:http/http.dart' as http; void main() { It prints 11½ seconds for me. This happens consistently.
Here is a directory with all the pub setup ready to go. unzip Attachment:
Thanks for the reproduction instructions. It turns out that for the sample in question we spend a lot of time copying strings, because tokenizer.dart appends characters to a string with interpolation: currentStringToken.data = "${currentStringToken.data}-${data}"; Every such assignment builds a new string and copies everything accumulated so far, so the cost grows with the length of the token. Using a StringBuffer would be the right way to go in this situation. There are also many other places like this in this particular source file. Set owner to @jmesserly.
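To illustrate the difference described above, here is a minimal sketch that compares repeated interpolation with a StringBuffer. The StringToken class and both helper functions are hypothetical stand-ins written for this example; they are not the actual html5lib tokenizer classes, and the chunk count is arbitrary.

```dart
/// Hypothetical stand-in for the tokenizer's string token (not html5lib's real class).
class StringToken {
  final StringBuffer _buffer = StringBuffer();

  /// Appends a chunk without rebuilding the text accumulated so far.
  void add(String chunk) => _buffer.write(chunk);

  /// Materializes the string once, when the token is actually needed.
  String get data => _buffer.toString();
}

/// Slow pattern: every step builds a new string and copies all previous data.
String appendByInterpolation(List<String> chunks) {
  var data = '';
  for (final chunk in chunks) {
    data = '$data-$chunk';
  }
  return data;
}

/// Same result, accumulated in a StringBuffer instead.
String appendWithBuffer(List<String> chunks) {
  final token = StringToken();
  for (final chunk in chunks) {
    token.add('-');
    token.add(chunk);
  }
  return token.data;
}

void main() {
  // Arbitrary workload size, chosen only to make the difference visible.
  final chunks = List<String>.filled(20000, 'x');

  final slow = Stopwatch()..start();
  final a = appendByInterpolation(chunks);
  slow.stop();

  final fast = Stopwatch()..start();
  final b = appendWithBuffer(chunks);
  fast.stop();

  print('same result: ${a == b}');
  print('interpolation: ${slow.elapsedMilliseconds} ms');
  print('StringBuffer:  ${fast.elapsedMilliseconds} ms');
}
```

The actual timings depend on the machine and VM version, so none are quoted here. The point is only that the interpolation loop does work proportional to the square of the total length, while the buffer version stays roughly linear.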
Thanks for the analysis.
Nice tracking that down! Wow, turns out this is a really old bug :) Patches are welcome for this. Originally the Python code used tokens in a very untyped way (they were just maps, and "data" had a different meaning in different places depending on the tokenizer state). Now that it's sorted out into "currentStringToken" and the APIs are typed, it's probably pretty easy to find the .data concatenations. Removed the owner.
Added Pkg-Html5Lib label.
Removed Library-Html5lib label.
This issue has been moved to dart-lang/tools#1101.
This issue was originally filed by [email protected]
The following code demonstrates a very slow parsing speed:
http.read('https://www.facebook.com/MARCA').then((c) {
htmlParser.parse(c);
});
For most websites out there it's fast, but for a few, and this site in particular, parsing takes around 20-25 seconds.
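To measure this without depending on the network, the page could be saved once and the parse call timed on its own. The following is a rough sketch under a few assumptions: timeParse is a hypothetical helper, page.html is a made-up file name for the saved copy, and the closure passed in main is a placeholder that should be swapped for the real parser call (for example htmlParser.parse from the snippet above).

```dart
import 'dart:io';

/// Times a single call to [parse] on [input] and returns the elapsed time.
/// [parse] is a placeholder for whatever parser entry point is being measured.
Duration timeParse(void Function(String) parse, String input) {
  final watch = Stopwatch()..start();
  parse(input);
  watch.stop();
  return watch.elapsed;
}

void main(List<String> args) {
  // 'page.html' is a hypothetical name for the locally saved page.
  final path = args.isNotEmpty ? args.first : 'page.html';
  final html = File(path).readAsStringSync();

  // Placeholder closure: replace it with the real parser call.
  final elapsed = timeParse((s) => s.length, html);

  print('parsed ${html.length} chars in ${elapsed.inMilliseconds} ms');
}
```

Timing only the parse call keeps download time out of the number, which makes runs on different machines easier to compare.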