Illegal characters in XML output #670

bitsgalore · 2016-01-14T15:56:17Z

While processing an encrypted EPUB with Epubcheck 4.0.1 (with output to XML format), I ended up with the following output file:

https://github.com/KBNLresearch/epubPolicyTests/blob/master/epubcheckout/4.0.1/epub20_encryption_binary_content.xml

Line 17 of the output file contains a warning about an illegal XHTML named entity. However, the offending character itself (Unicode code point U+000B / 0xb) is included verbatim in the output file (line 17, column 81), which makes the output itself invalid XML! This creates problems if the XML needs to be processed further down the line (in my case I want to run some Schematron rules on it).

For rights reasons I cannot share the original EPUB, but I created a synthetic file that reproduces the problem at:

https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_encryption_binary_content.epub?raw=true

(See also about control characters in XML: https://www.w3.org/International/questions/qa-controls.en.php)
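For anyone who wants to check a report for this kind of problem before feeding it to an XML parser, here is a minimal sketch in Java. It tests each code point against the `Char` production of XML 1.0 (tab, LF, CR, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The class and method names (`XmlCharCheck`, `isLegalXmlChar`, `firstIllegalIndex`) are hypothetical and not part of Epubcheck.

```java
final class XmlCharCheck {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index (in UTF-16 code units) of the first illegal
    // character in the text, or -1 if every character is legal.
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

U+000B (vertical tab) falls outside all of the allowed ranges, which is why a parser rejects the report even though the rest of the file is well-formed.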

rdeltour · 2016-01-14T19:37:21Z

Ah, good catch. Thanks for the sample file!

This commit changes the generation of XML reports to use the regular Java libraries, which avoids bad output. It also checks for non-UTF-8 characters and escapes them. Finally, it adds the list of media types included in the EPUB. The tests have been enhanced to better compare the actual and expected results. Some test cases have been added to test for encrypted or obfuscated EPUBs. Fixes w3c#670. Fixes w3c#517.
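To illustrate the general idea (this is not the code from the commit), here is a rough sketch of writing a message through the JDK's `javax.xml.stream` API while replacing characters that XML 1.0 does not allow. The class name `SafeXmlReport`, the `message` element, and the `[U+XXXX]` placeholder format are assumptions made for the example. The explicit escaping step is there because the stream writer escapes markup characters such as `<` and `&`, but should not be relied on to reject or escape control characters.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class SafeXmlReport {

    // Hypothetical helper: replaces every code point outside the XML 1.0
    // "Char" production with a visible placeholder such as [U+000B].
    static String escapeIllegalXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("[U+%04X]", cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Writes a single <message> element; the element name is an assumption
    // for illustration and does not mirror the real report schema.
    static String writeMessage(String message) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument();
        w.writeStartElement("message");
        // Sanitize first; writeCharacters then escapes <, > and &.
        w.writeCharacters(escapeIllegalXmlChars(message));
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        return out.toString();
    }
}
```

With this approach the report keeps a readable trace of the offending code point, and the output stays parseable by downstream tools such as Schematron validators.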

rdeltour self-assigned this Jan 14, 2016

rdeltour added this to the Next milestone Jan 14, 2016

rdeltour added the type: bug label Jan 14, 2016

tledoux mentioned this issue Feb 13, 2016

Generate XML with JVM libraries in javax.xml #673

Merged

tledoux closed this as completed in #673 Oct 4, 2016

tofi86 modified the milestones: Next, 4.0.2 Dec 11, 2016

iherman unassigned rdeltour Oct 10, 2018
