Support selecting by element's inner text/content #55

Open
noperator opened this issue Nov 2, 2022 · 4 comments

Comments

@noperator

noperator commented Nov 2, 2022

Love htmlq—use it daily. Thanks for the effort you've put into this tool.

I have a pretty frequent need to find an element based on the text that it contains. For example, I'd like to match on the first div (i.e., the one containing Item 1).

<div>Item 1</div>
<div>Item 2</div>
<div>Item 3</div>

There have been various attempts at updating the official CSS specification to support this kind of functionality, though I don't think any of them have actually made their way into the spec. Instead, various tools (Playwright, etc.) extend their own CSS selectors to support one of the following forms.

div:contains("Item 1")
div:has-text("Item 1")
div[innerText="Item 1"]
div[textContent="Item 1"]

It would be very useful if I could use the following CSS selector with htmlq to match on the previously noted element.

htmlq 'div:contains("Item 1")'
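For illustration only, the matching behavior requested here can be sketched outside of htmlq. The snippet below is a minimal Python sketch using the standard library, not htmlq or kuchiki code. The names `DivTextCollector` and `divs_containing` are hypothetical, and it assumes the simplest reading of `:contains()`: a div matches if its text content includes the given substring. It only tracks outermost div elements.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collects the text content of each outermost <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside <div> elements
        self.buffer = []  # text pieces of the div currently open
        self.divs = []    # text content of each completed outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_data(self, data):
        # Text outside any div (e.g. newlines between divs) is ignored.
        if self.depth > 0:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self.buffer))
                self.buffer = []


def divs_containing(html, needle):
    """Return the text of every outermost div whose text includes `needle`."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return [text for text in parser.divs if needle in text]


SAMPLE = "<div>Item 1</div>\n<div>Item 2</div>\n<div>Item 3</div>"
print(divs_containing(SAMPLE, "Item 1"))  # prints ['Item 1']
```

A substring test is only one possible semantics. Tools that offer this kind of selector differ on case sensitivity and on whether nested element text counts, which is part of why it is hard to standardize.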

To my understanding, htmlq relies on kuchiki for HTML parsing, which in turn relies on servo for CSS selection—so I think I'd need to request support for this upstream in servo. Does that seem right to you? Just wanted to run this idea past you in case you've thought about it already and/or have an opinion about it.

@harri-halttunen-aktia

I highly support the idea of selecting elements based on their content. While it is true that :contains() did not make its way into CSS3 (as it would break the whole idea of separating structure and content), it would be an extremely important feature for tools like htmlq or pup, as the content to be filtered cannot be controlled by the user of these tools.

Actually, pup has this feature, but it unfortunately does not have a general sibling selector, which htmlq has. It would be nice to have both features.

@noperator
Author

noperator commented Nov 28, 2022

Agree. I'm kind of doubtful that a widely used package like servo would accept a PR for a nonstandard CSS selector like :contains(). A more realistic option might be to find a way to implement it downstream in kuchiki, or directly within htmlq.

From the developer behind kuchiki:

It is possible [to support pseudo-class selectors]. I have no plan to work on this, however.

Later on, :contains() was also explicitly requested in kuchiki, which was met with the reply:

:contains is not part of CSS: https://drafts.csswg.org/selectors/. I’m not even sure what it’s supposed to do.

Wonder if kuchiki would accept a working PR for :contains(); it does support a number of valid pseudo-classes.


Also, for reference, the full list of selectors that pup implements.

@noperator
Author

If it's too hard to get this functionality implemented upstream as a pseudo-class selector, then we could alternatively add a CLI option instead:

OPTIONS:
    -a, --attribute <attribute>         Only return this attribute (if present) from selected elements
    -b, --base <base>                   Use this URL as the base for links
 👉 -c, --contains <REGEX>              Return only selected elements whose text nodes match this regular expression
    -f, --filename <FILE>               The input file. Defaults to stdin
    -o, --output <FILE>                 The output file. Defaults to stdout
    -r, --remove-nodes <SELECTOR>...    Remove nodes matching this expression before output. May be specified multiple

I'm suggesting -c, --contains since :contains() seems to be the most common form that the non-standard pseudo-class selector takes—but something like -m, --match could make sense, too. This seems to align with other options like --attribute and --remove-nodes that post-process the HTML with non-standard selection operations before finally returning.

@noperator
Author

Drafted a change as proposed above. Given the following HTML sample from https://lethain.com/company-team-self/:

<li class="mb2">
  <a href="/work-hard-work-smart/">
    Work hard / work smart.</a>
</li>
<li class="mb2">
  <a href="/mailbag-not-measurable-whether-hire-exec/">  👈
    Mailbag: What isn't measurable? To hire as exec or not?</a>
</li>
<li class="mb2">
  <a href="/reminiscing/">
    Reminiscing: the retreat to comforting work.</a>
</li>

You can find the hyperlink li.mb2 > a whose text matches the case-insensitive regex (?i)mailbag.*measure?, extract its href, and resolve it against the base URL https://example.com.

curl -s https://lethain.com/company-team-self/ |
    htmlq -c '(?i)mailbag.*measure?' -a href -b https://example.com 'li.mb2 > a'

https://example.com/mailbag-not-measurable-whether-hire-exec/
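The same select, filter, extract pipeline can be approximated in plain Python to show what the proposed -c, --contains option is meant to do. This is a rough sketch under stated assumptions, not the drafted htmlq change: `AnchorCollector` and `hrefs_matching` are hypothetical names, the selector `li.mb2 > a` is hard-coded rather than parsed, void elements are not handled, and Python's `re` stands in for whatever regex engine htmlq would use.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records (href, text) for each <a> that is a
    direct child of an <li> carrying the class "mb2"."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as (tag, list of classes)
        self.current = None  # [href, text pieces] for the anchor being read
        self.anchors = []    # completed (href, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1] if self.stack else None
        if tag == "a" and parent and parent[0] == "li" and "mb2" in parent[1]:
            self.current = [attrs.get("href"), []]
        self.stack.append((tag, (attrs.get("class") or "").split()))

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.anchors.append((self.current[0], "".join(self.current[1])))
            self.current = None
        # Pop back to the most recent matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def hrefs_matching(html, pattern, base):
    """Keep anchors whose text matches `pattern`; return hrefs joined to `base`."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    regex = re.compile(pattern)
    return [
        urljoin(base, href)
        for href, text in parser.anchors
        if href and regex.search(text)
    ]


SAMPLE = """
<li class="mb2">
  <a href="/work-hard-work-smart/">
    Work hard / work smart.</a>
</li>
<li class="mb2">
  <a href="/mailbag-not-measurable-whether-hire-exec/">
    Mailbag: What isn't measurable? To hire as exec or not?</a>
</li>
<li class="mb2">
  <a href="/reminiscing/">
    Reminiscing: the retreat to comforting work.</a>
</li>
"""

for url in hrefs_matching(SAMPLE, r"(?i)mailbag.*measure?", "https://example.com"):
    print(url)
```

Note the order of operations in this sketch: the regex is tested against the element's text before the attribute is extracted, so the filter sees the link label rather than the href.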

@noperator noperator changed the title Support CSS selector by element's inner text/content Support selecting by element's inner text/content Dec 1, 2022