Decode the _x<4 hex chars>_ escape notation in shared strings #584

Merged (1 commit), Jun 26, 2018

papandreou
Contributor

Hi!

I ran into a problem where a trailing carriage return in a cell (shared string) would come out as _x000D_ in the decoded worksheet. Turns out that OOXML has a weird escaping scheme intended for characters that cannot be represented in XML: https://www.robweir.com/blog/2008/03/ooxmls-out-of-control-characters.html

This PR makes shared strings with escapes come out right when reading .xlsx files. For completeness, we should also emit escape sequences when writing characters that XML disallows (for example U+0004 END OF TRANSMISSION, U+0006 ACKNOWLEDGE, U+0007 BELL, U+0008 BACKSPACE, U+0017 END OF TRANSMISSION BLOCK). We should also encode underscores as _x005F_ when they are part of literal text that could otherwise be interpreted as an escape.
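To illustrate the reading side, here is a minimal sketch of the decoding step. It is not the code from this PR: the function name `decodeOoxmlEscapes` is hypothetical, and accepting lowercase hex digits is an assumption on my part.

```javascript
// Hypothetical helper, not the actual exceljs implementation.
// Replaces every _xHHHH_ sequence with the UTF-16 code unit it names.
// Assumption: hex digits may be upper- or lowercase.
function decodeOoxmlEscapes(text) {
  return text.replace(/_x([0-9A-Fa-f]{4})_/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// A trailing carriage return stored as an escape:
console.log(JSON.stringify(decodeOoxmlEscapes('abc_x000D_')));

// An escaped underscore followed by text that merely looks like an escape:
console.log(decodeOoxmlEscapes('_x005F_x000D_'));
```

A single global `replace` pass does not rescan its own output, so the underscore produced from `_x005F_` is not combined with the following `x000D_` into a second escape.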

For the record, the enclosed test case was created in Excel by entering _x000D_ into a cell. In the shared strings table that comes out as:

<si><t>_x005F_x000D_</t></si>
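The writing side could look roughly like the sketch below. This is only an illustration of the idea, not something this PR implements: `encodeOoxmlEscapes` is a hypothetical name, and the set of escaped characters is assumed to be the C0 control characters that XML 1.0 does not allow (carriage return is not covered here).

```javascript
// Hypothetical helper, not part of exceljs.
function encodeOoxmlEscapes(text) {
  return text
    // Protect underscores that would otherwise start a literal _xHHHH_ sequence.
    .replace(/_(?=x[0-9A-Fa-f]{4}_)/g, '_x005F_')
    // Escape control characters that XML 1.0 cannot carry
    // (everything below U+0020 except tab, line feed and carriage return).
    .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, ch =>
      '_x' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0') + '_'
    );
}

// Literal text that looks like an escape, as in the test case above:
console.log(encodeOoxmlEscapes('_x000D_'));

// A BELL character in the middle of a string:
console.log(encodeOoxmlEscapes('a\x07b'));
```

The underscore pass runs first so that the escapes produced by the second pass are not escaped again.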

Similar libraries in other languages have been dealing with this as well, both when reading and when writing .xlsx, e.g.:

@guyonroche merged commit ec4cd23 into exceljs:master on Jun 26, 2018