Bug: Vertical Bar Character as class name throws an Exception #1998

Closed
m-heider opened this issue Sep 14, 2023 · 2 comments

Consider the following HTML:

<!DOCTYPE html>
<html lang="en">
<head><title>Title</title></head>
<body> 
   <button class="|">Button</button>
</body>
</html>

This HTML is valid according to https://validator.w3.org/nu/#textarea.

If I try to get the CSS selector of the button, a SelectorParseException is thrown:

Document doc = Jsoup.parse("<!DOCTYPE html>\n" +
                           "<html lang=\"en\">\n" +
                           "<head><title>Title</title></head>\n" +
                           "<body> \n" +
                           "  <button class=\"|\">Button</button>\n" +
                           "</body>\n" +
                           "</html>");
Elements elements = doc.select("button");
String selector = elements.get(0).cssSelector();

Stack trace (latest stable jsoup version, 1.16.1):

org.jsoup.select.Selector$SelectorParseException: String must not be empty
	at org.jsoup.select.QueryParser.parse(QueryParser.java:47)
	at org.jsoup.select.QueryParser.combinator(QueryParser.java:90)
	at org.jsoup.select.QueryParser.parse(QueryParser.java:60)
	at org.jsoup.select.QueryParser.parse(QueryParser.java:45)
	at org.jsoup.select.Selector.select(Selector.java:98)
	at org.jsoup.nodes.Element.select(Element.java:418)
	at org.jsoup.nodes.Element.cssSelector(Element.java:858)

A real-world example of a class name that consists only of a vertical bar character can be found at https://www.mueller.at/ (search for .\| in the browser DevTools to find an example node).

If the vertical bar character appears inside a class name (e.g. a|b), a different error is thrown:

org.jsoup.select.Selector$SelectorParseException: Could not parse query 'button.a|b': unexpected token at '|b'
	at org.jsoup.select.QueryParser.findElements(QueryParser.java:226)
	at org.jsoup.select.QueryParser.parse(QueryParser.java:74)
	at org.jsoup.select.QueryParser.parse(QueryParser.java:45)
	at org.jsoup.select.QueryParser.combinator(QueryParser.java:90)
	at org.jsoup.select.QueryParser.parse(QueryParser.java:60)
	at org.jsoup.select.QueryParser.parse(QueryParser.java:45)
	at org.jsoup.select.Selector.select(Selector.java:98)
	at org.jsoup.nodes.Element.select(Element.java:418)
	at org.jsoup.nodes.Element.cssSelector(Element.java:858)

PeichengLiu commented Oct 3, 2023

Yes, | is a valid class name in HTML, but it is not a valid CSS identifier, so it has to be escaped before it is used in a selector. TokenQueue#escapeCssIdentifier does not currently do this correctly.
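To make the HTML-versus-CSS distinction concrete, here is a rough sketch of a check for whether a class name can be written in a selector without any escaping. The class and method names are hypothetical (they are not part of jsoup), and the pattern is a simplified approximation of the CSS identifier grammar: it ignores escape sequences and the `--` prefix.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class CssIdentCheck {
    // Simplified identifier shape: an optional '-', then a letter, '_' or
    // non-ASCII code point, then any run of letters, digits, '-', '_' or
    // non-ASCII code points.
    private static final Pattern PLAIN_IDENT = Pattern.compile(
        "-?[A-Za-z_\\x{80}-\\x{10FFFF}][A-Za-z0-9_\\-\\x{80}-\\x{10FFFF}]*");

    // True if the name could be used in a selector as-is, under the
    // simplified grammar above.
    static boolean isPlainCssIdentifier(String name) {
        return PLAIN_IDENT.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlainCssIdentifier("btn-primary"));
        System.out.println(isPlainCssIdentifier("|"));
        System.out.println(isPlainCssIdentifier("a|b"));
    }
}
```

Under this check, both class names from the report (| and a|b) fall outside the plain-identifier shape, which is why they need escaping.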

Inspired by mathiasbynens/CSS.escape, I wrote the following code:

    /*
    Given a CSS identifier (such as a tag, ID, or class), escape any CSS special characters that would otherwise not be
    valid in a selector.
     */
    public static String escapeCssIdentifier(String in) {
        // Guard: an empty identifier has no first code point to inspect.
        if (in.isEmpty()) {
            return in;
        }
        int[] codePoints = in.codePoints().toArray();
        int length = codePoints.length;
        int firstCodePoint = codePoints[0];
        // If the character is the first character and is a `-` (U+002D), and
        // there is no second character, […]
        if (length == 1 && firstCodePoint == 0x002D) {
            return ESC + in;
        }
        // Borrow the builder only after the early returns, so it is always released.
        StringBuilder result = StringUtil.borrowBuilder();
        int index = -1;
        while (++index < length) {
            int codePoint = codePoints[index];
            // Note: there’s no need to special-case astral symbols, surrogate
            // pairs, or lone surrogates.
            // If the character is NULL (U+0000), then the REPLACEMENT CHARACTER
            // (U+FFFD).
            if (codePoint == 0x0000) {
                result.append('\uFFFD');
                continue;
            }
            if (
                // If the character is in the range [\1-\1F] (U+0001 to U+001F) or is
                // U+007F, […]
                    (codePoint >= 0x0001 && codePoint <= 0x001F) || codePoint == 0x007F ||
                            // If the character is the first character and is in the range [0-9]
                            // (U+0030 to U+0039), […]
                            (index == 0 && codePoint >= 0x0030 && codePoint <= 0x0039) ||
                            // If the character is the second character and is in the range [0-9]
                            // (U+0030 to U+0039) and the first character is a `-` (U+002D), […]
                            (
                                    index == 1 &&
                                            codePoint >= 0x0030 && codePoint <= 0x0039 &&
                                            firstCodePoint == 0x002D
                            )
            ) {
                // https://drafts.csswg.org/cssom/#escape-a-character-as-code-point
                result.append(ESC).append(Integer.toHexString(codePoint)).append(' ');
                continue;
            }

            // If the character is not handled by one of the above rules and is
            // greater than or equal to U+0080, is `-` (U+002D) or `_` (U+005F), or
            // is in one of the ranges [0-9] (U+0030 to U+0039), [A-Z] (U+0041 to
            // U+005A), or [a-z] (U+0061 to U+007A), […]
            if (
                    codePoint >= 0x0080 ||
                            codePoint == 0x002D ||
                            codePoint == 0x005F ||
                            codePoint >= 0x0030 && codePoint <= 0x0039 ||
                            codePoint >= 0x0041 && codePoint <= 0x005A ||
                            codePoint >= 0x0061 && codePoint <= 0x007A
            ) {
                // the character itself
                result.append(new String(codePoints, index, 1));
                continue;
            }

            // Otherwise, the escaped character.
            // https://drafts.csswg.org/cssom/#escape-a-character
            result.append(ESC).append(new String(codePoints, index, 1));
        }
        return StringUtil.releaseBuilder(result);
    }

Then animal-sniffer-maven-plugin complained:

[ERROR] C:\Users\14517\Desktop\jsoup\src\main\java\org\jsoup\parser\TokenQueue.java:303: Undefined reference: java.util.stream.IntStream
[ERROR] C:\Users\14517\Desktop\jsoup\src\main\java\org\jsoup\parser\TokenQueue.java:303: Undefined reference: java.util.stream.IntStream String.codePoints()
[ERROR] C:\Users\14517\Desktop\jsoup\src\main\java\org\jsoup\parser\TokenQueue.java:303: Undefined reference: int[] java.util.stream.IntStream.toArray()

Making this work on Android looks a bit troublesome, because String.codePoints() and IntStream are not available there.
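One possible way around the IntStream dependency is to walk the string with String.codePointAt and Character.charCount, which do not use the stream API. The following is only a sketch under that assumption, not the change that was eventually merged: the class name is hypothetical, it uses a plain StringBuilder instead of jsoup's StringUtil helpers, and it applies the same escaping rules as the code above.

```java
// Hypothetical stand-alone sketch, not jsoup's implementation.
final class CssEscapeSketch {
    private static final char ESC = '\\';

    // Escapes a CSS identifier using the CSS.escape rules, without
    // String.codePoints() or java.util.stream.
    static String escapeCssIdentifier(String in) {
        if (in.isEmpty()) return in;
        if (in.equals("-")) return ESC + in; // a lone hyphen must be escaped

        StringBuilder out = new StringBuilder();
        boolean leadingHyphen = in.charAt(0) == '-';
        int index = 0; // position counted in code points, not chars
        for (int i = 0; i < in.length(); index++) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean digit = cp >= '0' && cp <= '9';

            if (cp == 0) {
                // NULL becomes the REPLACEMENT CHARACTER.
                out.append('\uFFFD');
            } else if ((cp >= 0x01 && cp <= 0x1F) || cp == 0x7F
                    || (index == 0 && digit)
                    || (index == 1 && digit && leadingHyphen)) {
                // Control characters and leading digits: escape as a hex code point.
                out.append(ESC).append(Integer.toHexString(cp)).append(' ');
            } else if (cp >= 0x80 || cp == '-' || cp == '_' || digit
                    || (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) {
                // Safe in an identifier: emit the character itself.
                out.appendCodePoint(cp);
            } else {
                // Everything else (including '|'): backslash-escape it.
                out.append(ESC).appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeCssIdentifier("|"));   // prints \|
        System.out.println(escapeCssIdentifier("a|b")); // prints a\|b
    }
}
```

With this shape, the two class names from the report would come out as \| and a\|b. Whether the selector parser accepts those escaped forms is a separate question (see the related issues below).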

I also found a few related issues:

  1. Element.cssSelector() generates invalid queries if the class or ID is invalid; needs escaping #1742
  2. CSS identifier escapes are not supported #838

@jhy jhy closed this as completed in d5debf8 Aug 27, 2024
@jhy jhy added this to the 1.18.2 milestone Aug 27, 2024
@jhy jhy removed this from the 1.18.2 milestone Aug 27, 2024
@jhy jhy added the fixed label Aug 27, 2024

jhy commented Aug 27, 2024

This was fixed in #2146.
