Elements to ignore by TOC extraction algorithm? #378
Comments
For one thing, I'm not even convinced all sectioning content elements should be ignored; especially given that |
For those who may not follow this issue, if a nav is subsectioned like this:
The algorithm will not extract anything, as we don't account for partitioning of navs. It's not that complicated to treat each nested section/nav element as a branch of the toc (i.e., like an I don't like that we don't descend into subsections, but if we descend into them and ignore them as branches of the table of contents (i.e., we just go in search of lists), then do we concatenate the lists we find as though they all represent children? Does that make sense if their grouping heading is gone? |
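To make the "each nested section is a branch" idea concrete, here is a minimal sketch in Python. It is not the WPUB extraction algorithm. The sample markup is hypothetical (the example from the thread is not reproduced here), and the names `BranchTocParser`, `extract_branches`, and `SAMPLE_TOC` are invented for illustration. It assumes each `section` carries at most one meaningful heading, and for brevity it does not track nested lists.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BranchTocParser(HTMLParser):
    """Hypothetical sketch: every <section> becomes a branch whose label
    is its first heading; every <a> becomes an entry of the open branch."""

    def __init__(self):
        super().__init__()
        self.root = {"label": None, "href": None, "children": []}
        self._stack = [self.root]  # open branches, innermost last
        self._capture = None       # node whose label text is being read

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            branch = {"label": None, "href": None, "children": []}
            self._stack[-1]["children"].append(branch)
            self._stack.append(branch)
        elif tag in HEADINGS:
            current = self._stack[-1]
            # Only the first heading of a section names the branch.
            if current is not self.root and current["label"] is None:
                current["label"] = ""
                self._capture = current
        elif tag == "a":
            entry = {"label": "", "href": dict(attrs).get("href"), "children": []}
            self._stack[-1]["children"].append(entry)
            self._capture = entry

    def handle_endtag(self, tag):
        if tag == "section" and len(self._stack) > 1:
            self._stack.pop()
        elif tag in HEADINGS or tag == "a":
            if self._capture is not None:
                self._capture["label"] = self._capture["label"].strip()
            self._capture = None

    def handle_data(self, data):
        if self._capture is not None:
            self._capture["label"] += data


def extract_branches(html):
    parser = BranchTocParser()
    parser.feed(html)
    parser.close()
    return parser.root


# Hypothetical markup, assumed for illustration only.
SAMPLE_TOC = """
<nav role="doc-toc">
  <section>
    <h2>Part 1</h2>
    <ol>
      <li><a href="c1.html">Chapter 1</a></li>
      <li><a href="c2.html">Chapter 2</a></li>
    </ol>
  </section>
  <section>
    <h2>Part 2</h2>
    <ol>
      <li><a href="c3.html">Chapter 3</a></li>
    </ol>
  </section>
</nav>
"""

tree = extract_branches(SAMPLE_TOC)
```

A section without a heading ends up as a branch whose label is `None`, which is the case where a processor would have to invent a placeholder label.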
I am afraid this would make all this way too complicated for no major gain. While the section elements are indeed very important in general, the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure. A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define. Authoring tools may impose their own content structure restrictions, and use a corresponding TOC generating algorithm, which can then be used as a WPUB TOC if needed (this is what respec does, after all). |
(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...) |
I'm not suggesting we try to parse the contents of publication, only that partitioning a toc into subsections isn't uncommon. It's often done inside a list with an unlinked label, but that's just one pattern. Inside each section we'd still expect a list of links:
Where it gets more complex is if people start using only headings, and don't use them consistently:
But just as we restrict what we accept now, we could probably restrict partitioning to the first example and if you stray from that you can't expect a usable toc. That includes if you insert a section element and don't provide a heading, you just get a placeholder (like a The latter example requires constructing an outline first before the intention can be meaningfully extracted. |
Indeed and this is just the beginning of a very very long list of edge cases that we'll encounter... If we move away from a sub-set of HTML like in EPUB, I see no end in sight to these discussions. |
In both Matt's examples above, making
Yes, 👍 of course.
Could we in this case merge the list descendants, as if they were one big list? I.e. we would totally ignore sections and headings, but at least the list content would get extracted? I'm aware this may not fully represent the author's intent in 100% of cases, but maybe it is better than nothing?
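A rough Python sketch of this "merge everything" fallback, for comparison. It is an illustration under stated assumptions, not the specified algorithm: the markup is hypothetical, the names `FlatTocParser`, `flatten_toc`, and `PARTITIONED_TOC` are invented, and for brevity it also flattens nested lists, which a real processor would presumably keep.

```python
from html.parser import HTMLParser


class FlatTocParser(HTMLParser):
    """Hypothetical sketch: collect (label, href) pairs from every link
    inside a list item, ignoring section, nav, and heading elements."""

    def __init__(self):
        super().__init__()
        self.entries = []    # flattened (label, href) pairs, document order
        self._li_depth = 0   # > 0 while inside at least one <li>
        self._href = None    # href of the currently open <a>, if any
        self._text = []      # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._li_depth += 1
        elif tag == "a" and self._li_depth > 0:
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_endtag(self, tag):
        if tag == "li" and self._li_depth > 0:
            self._li_depth -= 1
        elif tag == "a" and self._href is not None:
            self.entries.append(("".join(self._text).strip(), self._href))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def flatten_toc(html):
    parser = FlatTocParser()
    parser.feed(html)
    parser.close()
    return parser.entries


# Hypothetical markup, assumed for illustration only.
PARTITIONED_TOC = """
<nav role="doc-toc">
  <section>
    <h2>Part 1</h2>
    <ol>
      <li><a href="c1.html">Chapter 1</a></li>
      <li><a href="c2.html">Chapter 2</a></li>
    </ol>
  </section>
  <section>
    <h2>Part 2</h2>
    <ol>
      <li><a href="c3.html">Chapter 3</a></li>
    </ol>
  </section>
</nav>
"""

flat = flatten_toc(PARTITIONED_TOC)
```

In this sketch the "Part 1" and "Part 2" headings never reach the output, which is exactly the loss of grouping the earlier comments worry about.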
What the algorithm does right now isn't much different from what EPUB Nav Doc was doing, it's just slightly more permissive and is specified for the UA rather than for the Author, which is better for interoperability. |
Trying to second-guess @HadrienGardeur's statement:
I guess the issue is whether we move further away from the (relatively) simple HTML structure that the current draft specifies for TOC. If we are not careful, we may end up with an endless number of structural variations, i.e., very complex structures and algorithms. As a general statement (if this is what @HadrienGardeur meant, that is), I agree with this. Whether @mattgarrish's first example is within the bounds: maybe. The current algorithm would have to become a bit more complicated, because we have to account for the heading elements within the sections, too, which would probably mean that a heading element within a simple But I would think we should not go in the direction of the second example. I.e., we should definitely stop there unless real and widely used cases come to the fore. I must admit, however, that I was surprised by @mattgarrish's statement, whereby:
I have never seen such a TOC structure myself so far on the Web (but it may be used in e-books, I cannot say). I definitely yield to others if such structures are really common. But if they are not, we should not complicate our lives further... |
I don't follow here. EPUB used span to represent headings, so is there really any difference between the example I gave above and this:
EPUB forced you to follow the above pattern, but should we continue to rule out doing it with nested section/nav elements? It's certainly debatable that we should allow only one representation. But, to be clear, all I'm saying is if we do this, we should maintain strict rules on what we allow. I would argue for the following if we do:
And that's all we'll recognize. It will pose a question of whether to apply |
@mattgarrish do we need to raise this to WG or is this for you to solve? |
This is a question of how many possible representations of the markup we want the toc algorithm to account for, but no one has complained (yet) about the existing algorithm's expectations so maybe it's a non-issue. |
Thanks. We'll add it to discussion on Monday. |
As the person raising the original issue: I am perfectly fine if the answer to the question:
is a "no", and we close this issue with no further action. |
This issue was discussed in a meeting.
Tzviya Siegman: Issue 378
Tzviya Siegman: The last comment - how many possible ways do we want the TOC to account for. I apologize, I didn’t add this to the agenda so people may need time to think.
Ivan Herman: I plead guilty - I was the one who raised this issue, but looking at all the discussion, I am perfectly fine closing the issue with “no further action” and my question should be deemed unnecessary.
Benjamin Young: my question is about document structure - generally related to the TOC and processing and where those should live. The answer could be “tune in next week.” Is the core piece - the manifest thing - is the data model, and how do you get the spec?
… how do you end up with the data model?
Tzviya Siegman: We’re going to talk about the overall document structure next week. As for this - let’s bring it back to GitHub and discuss when Matt has a microphone.
Dave Cramer: This seems like the classic issue where we’re not going to know what needs to happen until we try a bunch of stuff and things go wrong. It’s hard to imagine a bunch of theoretical TOCs.
Ivan Herman: That’ll take a bit of time. The other way around would be to close this to give us peace of mind, and if we hit problems later, take it as we come…
Dave Cramer: this is why we have CR and implementation experience.
Matt Garrish: What we have right now closely mimics what we have in EPUB. Do we need to expand it more? It has worked well so far. Maybe we can live with it - it’s something we need actual implementation data on…
… it’s probably something we can close off until we have something specific to deal with or let it go dormant.
Tzviya Siegman: I have anecdotal evidence that people want more, but I can put that in the issue. |
anecdotal evidence about restrictions of EPUB (note I am not saying we should include these things in WPUB, just offering stories). Some of these can be addressed by best practices.
|
@TzviyaSiegman, to your note (though tangential to the conversation in general; apologies to Ivan):
We have real use cases where in-line chemistry and music (!) notation are also important and irreplaceable. Norton used SVGs in the past, against better judgment, but it worked in our reader, which parses and sets the TOC as HTML. I imagine other publishers would also expect stuff like MathML. More on topic: I would be happy to contribute samples of many of our various TOC "types", if that's a way forward to iterating on the parsing logic. Norton Anthologies, for example, could make good use of sectionable navigation docs. |
@mteixeira-wwn such examples would be really useful. We should use them to test the current extraction algorithm (@mattgarrish has a running implementation, afaik). Thx! |
This issue was discussed in a meeting.
ToC algorithm
Tzviya Siegman: #378
Tzviya Siegman: issue 378
… the issue is “what goes into ToC?”
… the proposal is to leave things as is, unless we have evidence it needs to be adjusted
… mateus said they need extra types (chemistry, music) in the ToC
Mateus Teixeira: I can provide examples from NN
Matt Garrish: there are two issues, one is allowing markup within the ToC labels (#414), but #378 is more about the various structures of ToC
… what kind of different structuring of the ToC should we try to account for
… maybe we should wait and see
Ivan Herman: my impression is that the possibility of putting advanced markup in label is different from #378
… the reason I raised it back then is that some structural things (e.g. section elements) are ignored by the algo, and I was wondering if other things should be ignored too
… my feeling is that the answer is no; but it doesn’t mean we can’t allow MathML
… it seems we can close #378 without much problems, regardless of what we decide for the markup of labels
Tzviya Siegman: I agree with that
Mateus Teixeira: +1, but I’ll share examples either way
Tzviya Siegman: mateus can provide examples for #414 then
Matt Garrish: right, we’re waiting for evidence for more table of contents
… we can close it and raise specific issues, specific kind of ToC when we have evidence or examples
Tzviya Siegman: the proposal is to leave the algo as is and close #378
Ivan Herman: yes, I have the impression that the structure of the ToC, as currently described, should be fine as-is.
Mateus Teixeira: true
Avneesh Singh: +1
Tzviya Siegman: ok, so overwhelming support for closing #378, and mateus will add comment to #414
Romain Deltour: +1
Charles LaPierre: +1
Wendy Reid: +1
Luc Audrain: +1
Mateus Teixeira: +1
Joshua Pyle: +1
Ivan Herman: +1
Resolution #3: overwhelming support for closing #378, and Mateus will add comment to #414 |
The HTML TOC structure and extraction ignores sectioning content and hidden elements from the TOC. Is there a need to ignore others?