Incorrect spans generated for HTML with higher-plane unicode characters #1009

Open
filiph opened this issue Mar 21, 2018 · 4 comments

@filiph
Contributor

filiph commented Mar 21, 2018

When parsing HTML that includes characters like "🍋", the start and end FileLocations are generated incorrectly.

Here's a short repro:

import 'package:html/dom.dart';
import "package:html/parser.dart";
import "package:source_span/source_span.dart";

void main() {
  final dom = parse(contents, generateSpans: true);
  final Element element = dom.querySelectorAll("link").single;
  final span = element.sourceSpan;
  final spanCopy = new SourceSpan(span.start, span.end, contents);
}

const contents = """
<head>
    <meta charset="UTF-8">
    <title></title>
    <link rel="alternate" type="application/rss+xml" title="ArtLung &raquo; Limones 🍋 Comments Feed" href="subdirectory/other.html" />
</head>
""";

This will throw the following error:

Unhandled exception:
Invalid argument(s): Text "<head>
    <meta charset="UTF-8">
    <title></title>
    <link rel="alternate" type="application/rss+xml" title="ArtLung &raquo; Limones 🍋 Comments Feed" href="subdirectory/other.html" />
</head>
" must be 130 characters long.
#0      new SourceSpanBase (package:source_span/src/span.dart:85:7)
#1      new SourceSpan (package:source_span/src/span.dart:34:11)
#2      main (file:///Users/filiph/dev/linkcheck/test/source_span_bug.dart:9:24)
#3      _startIsolate.<anonymous closure> (dart:isolate-patch/isolate_patch.dart:265)
#4      _RawReceivePortImpl._handleMessage (dart:isolate-patch/isolate_patch.dart:151)
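
For anyone reading along, the 130 in the message above is consistent with the span being measured in Unicode code points rather than in the UTF-16 code units that Dart's String.length reports. That is an inference from the numbers, not something confirmed in this thread. A minimal sketch using only dart:core, with the <link> tag copied from the repro:

```dart
// The <link> tag from the repro, split across adjacent literals for width.
const linkTag =
    '<link rel="alternate" type="application/rss+xml" '
    'title="ArtLung &raquo; Limones 🍋 Comments Feed" '
    'href="subdirectory/other.html" />';

void main() {
  // String.length counts UTF-16 code units; the lemon needs a surrogate pair.
  print(linkTag.length); // 131
  // String.runes iterates Unicode code points; the lemon counts once.
  print(linkTag.runes.length); // 130
  print('🍋'.length); // 2
  print('🍋'.runes.length); // 1
}
```

The one-unit difference is exactly the extra code unit that the lemon takes up in UTF-16.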

This is not an issue with package:source_span — when I create the span manually, without parse(), copying it works okay.
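
If the offsets produced by parse() really are counted in code points (an assumption, see the length mismatch in the error), a caller could in principle translate them to UTF-16 offsets before building spans. The helper below is hypothetical and not part of package:html or package:source_span. It is only a sketch of the conversion and does not account for any other differences between the parsed text and the original string.

```dart
/// Hypothetical helper: converts an offset counted in Unicode code points
/// into the equivalent offset in UTF-16 code units for [text].
int codePointOffsetToUtf16(String text, int codePointOffset) {
  var utf16Offset = 0;
  var seen = 0;
  for (final rune in text.runes) {
    if (seen == codePointOffset) break;
    // Code points above U+FFFF take two UTF-16 code units (a surrogate pair).
    utf16Offset += rune > 0xFFFF ? 2 : 1;
    seen++;
  }
  if (seen < codePointOffset) {
    throw RangeError.range(codePointOffset, 0, seen, 'codePointOffset');
  }
  return utf16Offset;
}

void main() {
  const text = 'a🍋b';
  // 'b' is the third code point (offset 2) but sits at UTF-16 offset 3.
  final offset = codePointOffsetToUtf16(text, 2);
  print(offset); // 3
  print(text.substring(offset)); // b
}
```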

@filiph
Contributor Author

filiph commented May 27, 2019

Hi, friendly nudge. This prevents package:html from being used with HTML that includes Unicode characters in attributes, and such HTML is increasingly common (according to bugs reported to linkcheck).

filiph referenced this issue in filiph/linkcheck May 27, 2019
For later reference, the skip parameter now includes a link to https://github.com/dart-lang/html/issues/70 (I just spent 5 minutes looking for the bug).
@cvolzke4
Contributor

I've created a pull request with a fix: dart-archive/html#109

@cvolzke4
Contributor

Carriage returns also affect the file location start and end points.
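
To illustrate how that can happen: if offsets are computed on text whose "\r\n" line endings have been collapsed to "\n" (assumed here as the mechanism, not confirmed in this thread), every location after a carriage return lands one code unit too early relative to the original string. A small sketch using only dart:core, with made-up input and a hypothetical normalizeNewlines helper:

```dart
// Made-up input with Windows-style line endings.
const original = '<p>\r\n<a href="x">link</a>\r\n</p>';

/// Hypothetical helper: collapses CRLF pairs to a single LF.
String normalizeNewlines(String text) => text.replaceAll('\r\n', '\n');

void main() {
  final normalized = normalizeNewlines(original);
  // Offset of the <a> tag in the original string.
  print(original.indexOf('<a')); // 5
  // Offset of the same tag once the carriage return is gone.
  print(normalized.indexOf('<a')); // 4
  // Each CRLF removed shortens the text by one code unit.
  print(original.length - normalized.length); // 2
}
```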

@b4stien

b4stien commented Sep 27, 2019

Hey there, thanks for the great work. Now that the fix is merged, would it be possible to release a new version?

We're stuck with this issue downstream (see filiph/linkcheck#35).

mosuem transferred this issue from dart-archive/html on Oct 29, 2024