Targeting text that spans an element boundary #72

tilgovi · 2019-11-29T18:21:22Z

From the find a target text algorithm steps (section 2.4.3), it seems as though the TreeWalker is used to iterate over text nodes and the search is performed on their innerText. This seems to mean that no direct search for text that spans an element boundary would ever match.

The easiest way to address this would be to say that this is intentional and that any such match should use a range fragment, specifying the start and end text.

There are maybe other ways to address it, but I think that they immediately bring complications with the innerText algorithm.

The text was updated successfully, but these errors were encountered:

bokand · 2020-02-19T18:06:27Z

The easiest way to address this would be to say that this is intentional and that any such match should use a range fragment, specifying the start and end text.

This was indeed our intent. That said, I think that, in general, exact matching should be preferred since it preserves the targeted context even if the page changes, making it less fragile [1]. However, in our brief usage it seems that there's often cases where an exact match should seem possible but isn't.

For example, it's often the case that text has some inline iconography or images. This breaks our current exact matching algorithm and requires using a range. As an experiment, I changed it to use an (internally) modified innerText algorithm: perform the exact search on document.documentElement.innerText where innerText returns a string (as today) as well as a mapping from positions in the string to corresponding positions in DOM. This lets us turn a substring of the innerText string to a DOM Range representing the nodes and text in the substring.

In my brief testing this actually worked really well for exactly matching across complex DOM and had the benefit of being very predictable and easy to generate links. In addition, innerText is specified and interoperable and handles lots of the edge-cases we worry about already so it seems like it'd be a good choice.

I wonder if this idea has been explored in the past or if there's any drawbacks I'm missing. Would appreciate any input here.

[1] There's some nuance here, a long exact match might be more fragile than an equivalent range. e.g. if the author fixes a single typo in the passage. In addition, ranges are also useful to keep link size down when the intent is to target a very long passage; 1000+ char URLs are rather unwieldy without support from tools.

tilgovi · 2020-02-19T21:38:36Z

For example, it's often the case that text has some inline iconography or images. This breaks our current exact matching algorithm and requires using a range. As an experiment, I changed it to use an (internally) modified innerText algorithm: perform the exact search on document.documentElement.innerText where innerText returns a string (as today) as well as a mapping from positions in the string to corresponding positions in DOM. This lets us turn a substring of the innerText string to a DOM Range representing the nodes and text in the substring.

I was planning on implementing this in https://github.com/apache/incubator-annotator in order to demonstrate a proof of concept for exactly this use case. Is your implementation in the browser, and not in JS? I'm very motivated to get a JS version of this to help polyfill this feature.

tilgovi · 2020-02-19T21:41:18Z

I wonder if this idea has been explored in the past or if there's any drawbacks I'm missing. Would appreciate any input here.

One that comes to mind is the world of editors built around content editable and document schemas. For example, ProseMirror will map a position to a node. Its positions don't track innerText, but the roughly track the text content in its internal document model.

bokand · 2020-02-19T22:20:54Z

Yeah, my implementation was in the browser. The main push back I've gotten from Blink experts in this area is that innerText doesn't account for Shadow DOM and some issues around Unicode text matching/conversion which I'm still trying to understand better.

bokand · 2020-03-13T03:14:40Z

I'm inclined to leave things as-is for now. I think there's room to improve the matching algorithm to use an "innerText-like" algorithm (e.g. converting table cells to " ", blocks to newlines, etc.) but I'm not sure how much value it'll add vs. the effort required.

The latest refresh of the spec I've done in #90 specifies the block element boundaries explicitly so I think there's nothing left to do here.

akavel · 2020-06-03T23:09:31Z

Will it allow linking to fragments of text with links or italics (i.e. inline nodes)? For example, can I link to:

https://wicg.github.io/scroll-to-text-fragment/#:~:text=for each string directive of directives

or will it be impossible?

In other words: if I write an article, then somebody saves a fragment link, then I "wikify" some words in the article or make them strong/italic, will the fragment link break?

bokand · 2020-06-04T00:22:46Z

The algorithm uses block elements as boundaries. That means

<div>for each string <i>directive</i> of directives</div>

will match in your example, but

<div>for each string <div>directive</div> of directives</div>

will not. However, a range based match can cross block boundaries. The second case above could be matched using :~:text=for each,of directives

nickburris self-assigned this Jan 20, 2020

bokand closed this as completed Mar 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Targeting text that spans an element boundary #72

Targeting text that spans an element boundary #72

tilgovi commented Nov 29, 2019

bokand commented Feb 19, 2020

tilgovi commented Feb 19, 2020 •

edited

Loading

tilgovi commented Feb 19, 2020

bokand commented Feb 19, 2020

bokand commented Mar 13, 2020

akavel commented Jun 3, 2020 •

edited

Loading

bokand commented Jun 4, 2020

Targeting text that spans an element boundary #72

Targeting text that spans an element boundary #72

Comments

tilgovi commented Nov 29, 2019

bokand commented Feb 19, 2020

tilgovi commented Feb 19, 2020 • edited Loading

tilgovi commented Feb 19, 2020

bokand commented Feb 19, 2020

bokand commented Mar 13, 2020

akavel commented Jun 3, 2020 • edited Loading

bokand commented Jun 4, 2020

tilgovi commented Feb 19, 2020 •

edited

Loading

akavel commented Jun 3, 2020 •

edited

Loading