Illegal characters in XML output #670

Closed
bitsgalore opened this issue Jan 14, 2016 · 1 comment
Labels
type: bug The issue describes a bug
Milestone

Comments

@bitsgalore

While processing an encrypted EPUB with Epubcheck 4.0.1 (with output to XML format), I ended up with the following output file:

https://github.com/KBNLresearch/epubPolicyTests/blob/master/epubcheckout/4.0.1/epub20_encryption_binary_content.xml

Line 17 of the output file contains a warning about an illegal XHTML named entity. However, the offending character itself (Unicode code point U+000B / 0xb, a control character) is written verbatim into the output file (line 17, column 81), which means the output is itself not well-formed XML! This creates problems if the XML needs to be processed further down the line (in my case I want to run some Schematron rules on it).

For rights reasons I cannot share the original EPUB, but I created a synthetic file that reproduces the problem at:

https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_encryption_binary_content.epub?raw=true

(See also this note about control characters in XML: https://www.w3.org/International/questions/qa-controls.en.php)
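
To check a report for this kind of problem before handing it to an XML parser, a scan along the following lines may help. This is only a sketch and not part of Epubcheck: the function name `find_illegal_xml_chars` is made up for illustration, and the character ranges are taken from the `Char` production of XML 1.0.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything outside these ranges (such as U+000B) must not appear in a document.
ILLEGAL_XML10_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (line, column, code point) for each character illegal in XML 1.0.

    Hypothetical helper. Lines and columns are 1-based. The text is split on
    "\n" only, because str.splitlines() would also break lines at U+000B.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML10_CHAR.finditer(line):
            code_point = "U+%04X" % ord(match.group())
            hits.append((line_no, match.start() + 1, code_point))
    return hits
```

Reading the report with `open(path, encoding="utf-8", newline="")` and passing the contents to this function should list the position of each offending character, assuming the file is otherwise valid UTF-8.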

@rdeltour
Member

Ah, good catch. Thanks for the sample file!

@rdeltour rdeltour self-assigned this Jan 14, 2016
@rdeltour rdeltour added this to the Next milestone Jan 14, 2016
@rdeltour rdeltour added the type: bug The issue describes a bug label Jan 14, 2016
tledoux added a commit to tledoux/epubcheck that referenced this issue Feb 13, 2016
This commit changes the generation of XML reports to use regular Java
libraries, avoiding bad output.
It also checks for non-UTF-8 characters and escapes them.
Finally, it adds the list of media-types included in the epub.

The tests have been enhanced to better compare the actual and
expected results.
Some test cases have been added to test for encrypted or obfuscated
epubs.

Fixes w3c#670.
Fixes w3c#517.
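
The general idea of escaping such characters before they reach the report can be sketched as follows. This is an illustration in Python under stated assumptions, not the code from the commit above (which is Java): the helper name `xml_safe` and the `\uXXXX` placeholder format are hypothetical choices. Note that a numeric character reference such as `&#xB;` would not help, since XML 1.0 forbids references to characters outside the `Char` production as well.

```python
import re
from xml.sax.saxutils import escape

# Complement of the XML 1.0 "Char" production, i.e. characters that must
# not appear in a document, either literally or as a character reference.
ILLEGAL_XML10_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_safe(text):
    """Return text that can be embedded as XML 1.0 character data.

    Hypothetical helper: each illegal character is replaced by a visible
    placeholder such as \\u000B, then &, < and > are escaped as usual.
    """
    visible = ILLEGAL_XML10_CHAR.sub(
        lambda m: "\\u%04X" % ord(m.group()), text
    )
    return escape(visible)
```

With this, a message containing the vertical tab from the sample file would come out with a readable `\u000B` marker instead of the raw character, so the report stays parseable while still showing what was found.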
@tofi86 tofi86 modified the milestones: Next, 4.0.2 Dec 11, 2016

3 participants