-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problem with XML content requirements and external entities: make MathML and SVG1.1 invalid #1338
Comments
Note that, in w3c/epubcheck#1114 two possible ways forward were already proposed: |
No, it does not. The rules automatically add the DTD before validation.
The DTD heavily uses external parameter entities, which never occur in documents. But if memory is correct, it does not define any parsed entities. |
Checking it again, I stand corrected:-) The validity process requires the presence or the addition of the DTD. So my sentence:
should rather say:
But, taking into account that tools produce SVG files with the DTD, and they have the perfect right to do so, the problem is not less severe for us. (I have just ran into a different application, called draw.io, that can produce nice diagrams in SVG and adds the DTD.)
I believe you are right. But the presence of the external parameter entities in the DTD is enough to raise a red flag for EPUB. |
I do not understand. Since we do not allow external DTD subsets, external parameter entities are never encountered. I think that you would like to use the output of SVG software that always embed references to the DTD at the W3C web site. But such references have caused troubles to the W3C web site in the past. |
Indeed. Such software implementations are correct in doing so, they exist, and our users may want to use them. Adobe Illustrator is probably the most prominent, and certainly very important example. I do not think we can expect, say, Adobe Illustrator users to manually edit their generated SVG files by removing that DTD reference. I.e., this is a very practical deployment issue.
I do not know about the past, but the troubles are not in the present. My original issue came from the fact that most of the logos used in the W3C publications (e.g., all recommendations) in the past 3-4 years (at least) use SVG files that have those DTD-s, and all the Web Browsers accept those without further ado. Converting them into valid EPUB is not possibly without manually modifying the SVG files. I suspect that this issue never came up in EPUB land so far because Reading Systems tacitly accepted such SVG files (as the core Web engines accepted them). What brought it to the fore is the much stricter newer version of epubcheck that rejects those files. (This started the whole discussion in w3c/epubcheck#1114 after all...) I cc @rdeltour explicitly here, to call on the epubcheck implementers to give their opinion. |
I still do not think that this is a problem. Yes, human users or software always have to remove external DTD subsets, if any. IDPF prohibited them, since parsing documents containing external DTD subsets has caused unnecessarily network access. |
But why are we putting this requirement on authors and not reading systems? It's a reading system optimization pushed onto authors to carry out when a developer could suppress any resolution of DTD references. Isn't that pretty standard anyway, as it's been a while since I processed XML documents but W3C used to blacklist you if you tried to continuously resolve their DTDs. (And just for the record, I don't see that this affects MathML. MathML can only be embedded in HTML so can't have its own DTD declaration.) |
Well, this is where we disagree. I believe it is a problem that one of the official content format has an unnecessary restriction compared to its official definition. But, actually, I do not understand your note, @murata2makoto. In your comment in w3c/epubcheck#1114 (comment) you seem to say that there is no such problem, that content with such DTD is o.k., and the bug is by epubcheck. In which case I believe we have to add some explanation in the spec. |
I might have agreed if this proposal was made when EPUB 3.0 was being designed. But the charter for EPUB 3.3 does not allow any changes that might destroy existing reading systems. If this DOCTYPE statement occurs in an SVG document, some existing EPUB 3 reading systems might try to fetch an external DTD subset. This causes network access, which is slow or unavailable. |
My comment might not be clear. But I'm saying that the DOCTYPE statement in SVG must not be allowed by EPUB 3.3. |
But the only thing that stops this from happening now is the idea that only perfectly valid publications are ever loaded into reading systems. That seems unlikely given that there are often ways to side-load content. I agree we shouldn't cause backwards incompatibilities, but It would be interesting to survey whether moving the requirement to reading systems actually changes the way any have been implemented. |
@murata2makoto the backward compatibility issue is something we should check. You say:
and I am not sure such reading system actually exist. What I would think is that reading systems, so far, did not care about the DTD-s (surely all RS-s that relied on one of the web rendering engines) and this whole issue came to the fore because epubcheck reports this issue although it did not care about it before. In other words, I would expect there were a number of, technically, invalid EPUB content in circulation that were never flagged as such and that become invalid now. In other words, it may well be that the practical deployment of EPUB content as well as RS-s were actually both ignoring a constraint in the document... |
It may well exist. If my memory is correct, a key member of IDPF argued that the DOCTYPE statement for SVG must not be allowed for this reason. This happened during the development of EPUB3. |
I am not sure such reading system actually exist.
It may well exist. If my memory is correct, a key member of IDPF argued that the DOCTYPE statement for SVG must not be allowed for this reason. This happened during the development of EPUB3.
I'm confident that the overwhelming number of RS do not care about validating publication resources, like web browsers DO NOT validate web resources.
|
Sorry. That is irrelevant. Surely, valid SVG documents will be correctly handled by RSs that do not validate SVG. I'm saying that invalid SVG might be handled very differently by existing RSs. |
I have just realized today that epubcheck also refuses an SVG file (ie, an XML file) which starts with the standard <?xml version="1.0" encoding="UTF-8"?> Is that indeed forbidden by the standard? Or is it an epubcheck issue? |
What's the error message? This seems absurd. |
That it is invalid. But, if this is not an EPUB restriction then I should bring it to the epubcheck issues. I just wanted to check... |
How are you using the SVG? This SVG image validates fine in an EPUB (it's used as <?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 15.1.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" id="box" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="108px" height="108.391px" viewBox="0 0 108 108.391" enable-background="new 0 0 108 108.391" xml:space="preserve">
... |
Ya, I suspect you have some other issue. The SVG samples validate fine with xml declarations. Are you sure there isn't a curly quote or something in your actual source? There's definitely nothing in the spec that forbids an xml declaration, although they're generally pointless because version 1.0 and UTF-8 encoding are defaults. |
BTW, non-validating XML processors might fetch external DTD subsets (and external parameter entities) for obtaining entity declarations (and telling which whitespace is significant). But I do not know how many such processors are used. |
The error report of epubcheck is as follows:
it is a stand-along SVG file, used as a cover. If I remove the processing instruction, then things are accepted. I still wonder whether there is a problem somewhere in our spec that triggers this. Cc: @rdeltour @danielweck |
No, you're violating XML rules. If you put an xml declaration in a file, you can't have any whitespace before it (only a BOM is allowed before). Is there an extra hard return before your declaration by any chance? For reference, see the prolog requirements in the XML spec. |
Ouch! Touché :-) This was the problem. Sorry for the noise... |
I don't think this is is new in EPUBCheck (but haven't doubled checked with older versions). The conformance criteria dates back to before EPUB 3.2, and was already implemented in EPUBCheck if I remember correctly. What EPUBCheck 4.2.x brought are schema changes, which are orthogonal to this specific issue. |
I agree with @rdeltour here. Nothing has changed in a long time on this front. External identifiers are forbidden and have been since 3.0, so it's always been an issue. There really isn't a lot of ambiguity about what that means -- you can't reference external files, which would contain the external entities that would make some uses valid. It's not been much of an issue as most people don't use SVG, and HTML has dropped the external subset from its default doctype so the discussion isn't even relevant to that format. But in the interests of putting this issue to rest, I mostly agree with @llemeurfr comment in the other thread that we reformulate the restriction (but adding the RS ignoring is normative), so:
That would allow a doctype with an external identifier, but disallow content that relies on such. |
@mattgarrish just to be sure I understand what you propose:
You mean we add an extra item in EPUB 3.3. RS saying that external subsets must simply be ignored for all XML content? (At this moment there is no counterpart for XML in that document)
When you say "change" you mean removing the second item in EPUB 3.3. Core, right?
That sounds o.k. to me, but that does not answer #1338 (comment). |
I'm not following. Why does it matter if the DTD makes use of entities, and how is that problematic for authoring? If the entities only exist for parsing the DTD, and reading systems don't validate, then is the issue just theoretical? |
I don't think we're that far apart. That's why I'm suggesting we only note this in a security considerations section without making normative statements that potentially aren't followable. I expect browser cores already do the right thing, but it is still a security consideration. It's like banning external identifiers in authoring. It doesn't fully solve the problem, and any vendor could continue to ban these at ingestion if they really don't want them. But not using them is something we advise as a best practice (which could be part of where core media types go). We're arguably making a lot out of an issue that's never been questioned as problematic. I'd be surprised if any appreciable number of publishers use/define entities in a DTD subset, let alone the external, and I doubt anyone will rush to begin if we remove the restriction. |
From the start of this thread, it seems that two different notions are conflated:
Or maybe we conflate "external identifier" and "external entity". And also "document type declaration" and "document type definition". The current contraint in the spec (External identifiers MUST NOT appear in the document type declaration) forbids both. A sentence like "External entities MUST NOT be defined in a document type definition" would only forbid the second notion, which would be ok (especially is we also state the RS should/must not validate content) |
Just a note: I tried to include internal entities in the html doctype of a resource using Sigil and BlueGriffonEPUB : Sigil made a mess of the XHTML doc at save time, and BlueGriffon made another type of mess. So I tend to agree with you : entities in content documents must be really rare. |
I strongly believe that requirements on data are always preferrable since data conformance is very enforceable (e.g., by epubcheck). Moving things to requirements on applications is always less satisfactory, although it is sometimes inevitable. We should first make clear what is allowed as an SVG or MathML document.
Some publishers might if they have a reason. We shouldn't easily allow what is not needed. The currently proposed wording allows what is not needed. |
External identifiers of external DTD subsets may well be referenced by a non-validating XML processor. This is one of my points. See Clause 5 of the XML Rec.
I do not see any reasons to allow such external (parsed) entities in the case of SVG.
Yes, some terms are not easy to use correctly. I checked the definitions of these terms for discussions around this issue.
I would like to continue to disallow 2) in EPUB 3.3. I can live with small changes to 1) if they are needed to use the output of SVG editors. |
This thread is becoming quite confusing (for me at least), and I would love to see it converge somehow. Also, as it was mentioned several times, many of the situations in the thread are hypothetical as far as current EPUB deployments are concerned, because very few of our authors make use explicitly of remotely complex XML facilities; the only real practical issue seems to be that some tools would dump some DTDs into their output that make content unusable. With the future push to move away from XML altogether (e.g., with SVG-in-HTML and/or SVG 2 coming to the fore, or if, either in this version of EPUB or the next one, we move towards HTML5) this issue may become increasingly moot. I.e., I am not sure we have to thrive for something that is complete and exhaustive from an XML point of view… I have seen some concrete proposals in the thread (since the WG call); can we try to see if we can keep to one of them?
Did I miss/misinterpret something here? Would it be possible to see if we can converge towards one of these approaches and solve this issue? |
Notes on my previous comment
|
I don't think that we are ready to do so.
The door has been closed in the past. But the tentative resolution opens a can of worms. I don't think that we should open the can and hope no worms will cause any trouble. I think that we should divide this issue into smaller ones and close one at a time. My four issues (#1354, #1355, #1357, and #1358) are intended to solve easier subissues. |
I am not sure I agree and I am afraid that this issue may get out of hand. Those issues try to define a class of XML files that are acceptable for EPUB, looking at variations of possibilities with DTDs, entities, etc, that the XML specification offer. This is inherently a complex task. The question we have to ask ourselves is: who would really care? Is there really a user need behind these? All these possibilities around SVG would be unused by any SVG authoring tool I know, and I do not think that will change. The same would hold for SMIL or MathML, or other XML vocabularies we may care about. As I said above, with the future push to move away from XML altogether (e.g., with SVG-in-HTML and/or SVG 2 coming to the fore, or if, either in this version of EPUB or the next one, we move towards HTML5) this issue may become increasingly moot.
I agree with your concern. So maybe we want to go back to the original issue an stop there. Here is a proposal: The specification text should say
We may want to add the XHTML1.1 DTD-s tot the lot, too, but not necessarily. That is it. We do not really open the door, we just acknowledge the reality for our usage of W3C Recommendations. (We should be careful about the “Perfect is the enemy of good” effect…) |
I find it doubtful. No offense, as I grew up on SGML, but we're partying like it's 1999 in this thread... :) XSLT is the the more usual language of choice these days if publishers want to insert boilerplate, repeated strings, etc. Your proposal seems similar to where we were somewhere way back in this thread (although I'm not sure how we disagreed on the wording and I'm not going to try and locate why). External identifiers are disallowed except for a controlled list and reading systems ignore (that has to be separately in the RS spec, though). Works for me. |
I have just realized that we should not forget to add the ONIX DTD (or DTDs?) to our controlled list:
after all, it should be possible to add the ONIX metadata as part of the package... |
yep, we made a big circle... |
Maybe no harm, but not sure how critical it is. My understanding is they're pushing to move everyone to 3.0 now and 3.0 records aren't supposed to be distributed with a doctype. I don't believe ONIX records actually end up in EPUBs all that often, either. |
To be honest, I do not know. SOmething to be checked, possibly with Graham... |
ONIX files are out-of-band metadata that publishers send out separate from the EPUB. a lot of times aggregators and distributors get only the ONIX files to determine what EPUBs they want. Also publishers can send these ONIX files with updates say pricing changes etc. without updating the EPUB itself. |
If we go down the road of listing some 'acceptable' XML DTD-s (see #1338 (comment)) then adding an ONIX DTD is painless. I realize that it does not happen often, but we should not make it impossible to add an ONIX file into the EPUB container... |
Created another issue (#1373) for ONIX. |
+1 to have the solution that works for ONIX also. Most of times ONIX is external metadata but during ISO process of EPUB Accessibility, we observed interest in having ONIX accessibility metadata also as internal metadata. |
This article is interesting. It mentions a massive DDoS attack on the W3C web site all the time. |
This issue was discussed in a meeting.
View the transcriptWendy Reid: we had resolutions at the F2F, and further discussions on github… and came to a happy place Matt Garrish: #1368 Matt Garrish: where we ended up was… … we put in an allowance for a specific set of external identifiers that we have put in an appendix … we have SVG and MathML that are allowed to be used in content docs or in separate files … and we made a restriction against external entities in the internal DTD subset … so it prevents some security issues but eases authoring … so we’ll no longer force people to remove SVG DTDs from tool-generated files … I’m hoping this is it :) Ivan Herman: tech comment … in fact, the changes are such that … makes possible something that I’m not sure we really use … I can define as part of an internal entity something that won’t go out to the network … I’m not sure if this feature is in use … formal comment … there was a formal resolution on the previous version; this PR slightly changes that … can we get a formal resolution to merge, and also close a bunch of issues which were examples of the problem? Proposed resolution: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373 (Wendy Reid) Garth Conboy: +1 Matt Garrish: +1 Ivan Herman: +1 Charles LaPierre: +1 Matthew Chan: +1 Wendy Reid: +1 Brady Duga: +1 George Kerscher: +1 Laura Brady: +1 Bill Kasdorf: +1 Ben Schroeter: +1 Resolution #1: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373 |
The current text on XML conformance says that an XML-based media MUST meet:
This constraint apply regardless of whether the given Publication Resource is a Core Media Type Resource or a Foreign Resource.
However, that creates problems with SVG 1.1. The conformance rules of SVG1.1 require the presence of:
The DTD itself, as described in the relevant appendix, is making a heavy use of external entities (via its usage of DTD Modularization). As a consequence, per the current document, no conform SVG 1.1 file is valid for EPUB 3.3.
Although the upcoming SVG 2 removed this requirement, there are lots of perfectly valid SVG 1.1 out there (SVG 2 is not yet a Recommendation, b.t.w., only a CR). Most notably, Adobe Illustrator generates SVG 1.1 files with those DTD-s in them.
A similar problem may occur with MathML 3, whose official DTD relies on the same approach (although MathML is more permissive and it does not require the usage of that DTD).
This problem was originally raised in w3c/epubcheck#1114.
The text was updated successfully, but these errors were encountered: