Decode the _x<4 hex chars>_ escape notation in shared strings #584
Hi!
I ran into a problem where a trailing carriage return in a cell (shared string) would come out as _x000D_ in the decoded worksheet. Turns out that OOXML has a weird escaping scheme intended for characters that cannot be represented in XML: https://www.robweir.com/blog/2008/03/ooxmls-out-of-control-characters.html
This PR makes shared strings with escapes come out right when reading .xlsx files. For completeness we should also make sure to introduce escape sequences for characters disallowed by XML (such as U+0004 END OF TRANSMISSION, U+0006 ACKNOWLEDGE, U+0007 BELL, U+0008 BACKSPACE, U+0017 END OF TRANSMISSION BLOCK). Also, we should encode underscores as _x005F_ when they are part of literal text that could otherwise be interpreted as an escape.
For the record, the enclosed test case was created in Excel by entering _x000D_ into a cell. In the shared strings table that comes out as:
Similar libraries in other languages have been dealing with this as well, both while reading and writing .xlsx, e.g.:
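As a rough illustration of the escaping scheme described in this PR (separate from the libraries referenced above, and not the code in this PR), here is a minimal sketch in Python. The function names `decode_ooxml_escapes` and `encode_ooxml_escapes` are hypothetical. The sketch assumes that hex digits are accepted in either case, that only C0 control characters need escaping on write, and that carriage returns are escaped too because XML parsers normalize them to line feeds. It does not handle surrogate pairs or U+FFFE/U+FFFF.

```python
import re

# One escape of the form _xHHHH_ with exactly four hex digits.
# Assumption: both upper- and lower-case hex digits are accepted.
_ESCAPE_RE = re.compile(r"_x([0-9A-Fa-f]{4})_")

# A literal underscore that would otherwise be read as the start of an escape.
_AMBIGUOUS_UNDERSCORE_RE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")


def decode_ooxml_escapes(text: str) -> str:
    """Replace each _xHHHH_ escape with the character it encodes."""
    # re.sub scans left to right without rescanning replaced text, so
    # "_x005F_x000D_" becomes the literal string "_x000D_", not "\r".
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def _needs_escape(ch: str) -> bool:
    # Assumption: escape every C0 control except tab and line feed.
    # CR is legal in XML but is normalized to LF by parsers, so it is
    # escaped here as well to survive a round trip.
    return ord(ch) < 0x20 and ch not in "\t\n"


def encode_ooxml_escapes(text: str) -> str:
    """Escape control characters and ambiguous literal underscores."""
    # Protect literal underscores first, so the escapes generated in
    # the next step are not themselves re-escaped.
    text = _AMBIGUOUS_UNDERSCORE_RE.sub("_x005F_", text)
    return "".join(
        f"_x{ord(ch):04X}_" if _needs_escape(ch) else ch for ch in text
    )


if __name__ == "__main__":
    print(repr(decode_ooxml_escapes("line_x000D_")))  # 'line\r'
    print(encode_ooxml_escapes("_x000D_"))            # _x005F_x000D_
```

Escaping the underscore before the control characters keeps the two steps from interfering, and decoding in a single left-to-right pass is what lets a literal `_x000D_` typed into a cell survive a round trip.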