Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Targeting text that spans an element boundary #72

Closed
tilgovi opened this issue Nov 29, 2019 · 7 comments
Closed

Targeting text that spans an element boundary #72

tilgovi opened this issue Nov 29, 2019 · 7 comments
Assignees

Comments

@tilgovi
Copy link

tilgovi commented Nov 29, 2019

From the find a target text algorithm steps (section 2.4.3), it seems as though the TreeWalker is used to iterate over text nodes and the search is performed on their innerText. This seems to mean that no direct search for text that spans an element boundary would ever match.

The easiest way to address this would be to say that this is intentional and that any such match should use a range fragment, specifying the start and end text.

There are maybe other ways to address it, but I think that they immediately bring complications with the innerText algorithm.

@nickburris nickburris self-assigned this Jan 20, 2020
@bokand
Copy link
Collaborator

bokand commented Feb 19, 2020

The easiest way to address this would be to say that this is intentional and that any such match should use a range fragment, specifying the start and end text.

This was indeed our intent. That said, I think that, in general, exact matching should be preferred since it preserves the targeted context even if the page changes, making it less fragile [1]. However, in our brief usage it seems that there's often cases where an exact match should seem possible but isn't.

For example, it's often the case that text has some inline iconography or images. This breaks our current exact matching algorithm and requires using a range. As an experiment, I changed it to use an (internally) modified innerText algorithm: perform the exact search on document.documentElement.innerText where innerText returns a string (as today) as well as a mapping from positions in the string to corresponding positions in DOM. This lets us turn a substring of the innerText string to a DOM Range representing the nodes and text in the substring.

In my brief testing this actually worked really well for exactly matching across complex DOM and had the benefit of being very predictable and easy to generate links. In addition, innerText is specified and interoperable and handles lots of the edge-cases we worry about already so it seems like it'd be a good choice.

I wonder if this idea has been explored in the past or if there's any drawbacks I'm missing. Would appreciate any input here.

[1] There's some nuance here, a long exact match might be more fragile than an equivalent range. e.g. if the author fixes a single typo in the passage. In addition, ranges are also useful to keep link size down when the intent is to target a very long passage; 1000+ char URLs are rather unwieldy without support from tools.

@tilgovi
Copy link
Author

tilgovi commented Feb 19, 2020

For example, it's often the case that text has some inline iconography or images. This breaks our current exact matching algorithm and requires using a range. As an experiment, I changed it to use an (internally) modified innerText algorithm: perform the exact search on document.documentElement.innerText where innerText returns a string (as today) as well as a mapping from positions in the string to corresponding positions in DOM. This lets us turn a substring of the innerText string to a DOM Range representing the nodes and text in the substring.

I was planning on implementing this in https://github.com/apache/incubator-annotator in order to demonstrate a proof of concept for exactly this use case. Is your implementation in the browser, and not in JS? I'm very motivated to get a JS version of this to help polyfill this feature.

@tilgovi
Copy link
Author

tilgovi commented Feb 19, 2020

I wonder if this idea has been explored in the past or if there's any drawbacks I'm missing. Would appreciate any input here.

One that comes to mind is the world of editors built around content editable and document schemas. For example, ProseMirror will map a position to a node. Its positions don't track innerText, but the roughly track the text content in its internal document model.

@bokand
Copy link
Collaborator

bokand commented Feb 19, 2020

Yeah, my implementation was in the browser. The main push back I've gotten from Blink experts in this area is that innerText doesn't account for Shadow DOM and some issues around Unicode text matching/conversion which I'm still trying to understand better.

@bokand
Copy link
Collaborator

bokand commented Mar 13, 2020

I'm inclined to leave things as-is for now. I think there's room to improve the matching algorithm to use an "innerText-like" algorithm (e.g. converting table cells to " ", blocks to newlines, etc.) but I'm not sure how much value it'll add vs. the effort required.

The latest refresh of the spec I've done in #90 specifies the block element boundaries explicitly so I think there's nothing left to do here.

@bokand bokand closed this as completed Mar 13, 2020
@akavel
Copy link

akavel commented Jun 3, 2020

Will it allow linking to fragments of text with links or italics (i.e. inline nodes)? For example, can I link to:

https://wicg.github.io/scroll-to-text-fragment/#:~:text=for each string directive of directives

or will it be impossible?

In other words: if I write an article, then somebody saves a fragment link, then I "wikify" some words in the article or make them strong/italic, will the fragment link break?

@bokand
Copy link
Collaborator

bokand commented Jun 4, 2020

The algorithm uses block elements as boundaries. That means

<div>for each string <i>directive</i> of directives</div>

will match in your example, but

<div>for each string <div>directive</div> of directives</div>

will not. However, a range based match can cross block boundaries. The second case above could be matched using :~:text=for each,of directives

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants