Node/Scalar/String.php's parseEscapeSequences() breaks binary strings for (xml) serialization #38

theseer · 2012-10-14T14:38:20Z

Due to the internal call to parseEscapeSequences() upon construction, the "original" string value is lost, causing invalid XML (invalid Unicode bytes) to be written when serializing the nodes.

There is also no way to actually overwrite the value of a string with binary data and still get "safe" XML output.

What I found to fix the issue is to bin2hex the data and prepend \x to each byte, e.g. something like this:
$field = '\x' . substr(chunk_split(bin2hex($value), 2, '\x'), 0, -2);

This would obviously encode all chars though, not only the original binary ones...

mvriel · 2012-10-15T10:39:12Z

Isn't this the same issue as #26?

nikic · 2012-10-15T18:37:51Z

Heh, this issue seems to crop up in every project trying to serialize to XML. Seen it a few times in various unit testing frameworks, but didn't realize that it applies here too.

Just disabling parsing of escape sequences won't really help here, as one could still have issues with malformed UTF-8 in strings (or not UTF-8 at all). PHP's strings are raw binary data after all.

The only real way to solve this is what you did. Or rather, one could selectively encode only those strings that contain invalid UTF-8.
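The selective variant could be sketched roughly as follows. This is only an illustration: encodeForXml() is a hypothetical helper, not part of the parser, and it additionally treats the control characters that XML 1.0 forbids as "binary", which is an assumption beyond what the thread states.

```php
<?php
// Hypothetical helper (not part of PHP-Parser): hex-escape a string only
// when it cannot be written to XML as-is.
function encodeForXml(string $value): string
{
    // With the /u modifier, preg_match() fails on malformed UTF-8 subjects.
    $isValidUtf8 = preg_match('//u', $value) === 1;

    // XML 1.0 forbids most C0 control characters, even in valid UTF-8.
    $hasControlChars = preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $value) === 1;

    if ($value === '' || ($isValidUtf8 && !$hasControlChars)) {
        return $value; // safe to serialize unchanged
    }

    // Same idea as the one-liner in the report: every byte becomes \xNN.
    return '\x' . implode('\x', str_split(bin2hex($value), 2));
}
```

A consumer of the XML would still need some marker (for example an attribute) to tell an escaped value apart from a string that literally contains the text \x41.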

By the way, what are you using the XML serialization for?

theseer · 2012-10-15T20:36:32Z

I don't think that that problem only occurs when serializing to xml - it's just the only processor that actually complains. If you try to save the source back to a new (modified) php file, you'll end up having the same issues: The original \xxx component is lost, making the result unreadable at best.

Regarding the UTF-8 issue you pointed out: I'm 'iconv'ing the PHP source file before parsing to avoid that issue and so far haven't had any problems.
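A minimal sketch of such a pre-parse conversion, assuming the iconv extension is available. The helper name toUtf8() and the ISO-8859-1 default are assumptions for illustration; the thread does not say which source encoding was involved.

```php
<?php
// Hypothetical helper: normalize a source file's contents to UTF-8
// before handing them to the parser.
// Assumption: the input encoding is known; ISO-8859-1 is only an example.
function toUtf8(string $source, string $fromEncoding = 'ISO-8859-1'): string
{
    $converted = iconv($fromEncoding, 'UTF-8', $source);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fromEncoding to UTF-8");
    }
    return $converted;
}
```

Note that this only covers bytes that appear literally in the file. An escape such as \xFF inside a double-quoted string is plain ASCII in the source and only turns into a raw byte once the escape sequence is parsed.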

The XML serialization is used in phpDox ( https://github.com/theseer/phpDox ). For now I have adopted your suggestion for issue #26.

nikic · 2012-10-31T17:41:10Z

@theseer Do you think it would make sense to move the static Scalar_*::parse() methods into the parser (as normal methods), so they can be overridden by extending it?

theseer · 2012-11-20T16:18:47Z

I'm not sure how that would fix the problem as it would defer the solution to a client implementation?
From a user perspective, I'd expect the Parser not to "interfere" by modifying or translating values when parsing.

A serializer may add, depending on the output format, whatever is needed to escape otherwise invalid chars, or translate them according to whatever makes sense, e.g. interpret \x?? as binary.

Do you really have to (by default!) translate the \x?? values from a string into their binary representation at parse time? How would you set a \x?? value at runtime and expect the same output as the source had before parsing?

I guess the only thing that really makes sense is to keep the "raw" value and translate it on demand.
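To make the "keep raw, translate on demand" idea concrete, here is a small sketch. RawStringLiteral is a hypothetical class, not the parser's node type, and for brevity it decodes only \\ and \xNN, not the full set of PHP double-quote escapes.

```php
<?php
// Hypothetical value holder: stores the literal exactly as written in the
// source and decodes escape sequences only when asked to.
final class RawStringLiteral
{
    /** @var string */
    private $raw;

    public function __construct(string $raw)
    {
        $this->raw = $raw;
    }

    // The untouched source text, suitable for writing back to a file.
    public function getRaw(): string
    {
        return $this->raw;
    }

    // The interpreted value. Simplification: only \\ and \xNN are handled.
    public function getValue(): string
    {
        return preg_replace_callback(
            '/\\\\(\\\\|x[0-9A-Fa-f]{1,2})/',
            function (array $m): string {
                if ($m[1] === '\\') {
                    return '\\'; // escaped backslash
                }
                return chr(hexdec(substr($m[1], 1))); // \xNN -> single byte
            },
            $this->raw
        );
    }
}
```

With this shape a pretty printer or XML serializer can use getRaw(), while analysis code that needs the runtime value calls getValue().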

nikic · 2012-11-21T21:43:31Z

@theseer The parser provides an abstract syntax tree, meaning that a lot of information is (intentionally) discarded, retaining only the parts that are relevant to the program's interpretation. String formatting is one of the things that are discarded. From PHP's point of view it does not make a difference whether a string is "Hello, World!" or whether it is "\x48\x65\x6C\x6C\x6F\x2C\x20\x57\x6F\x72\x6C\x64\x21". Interpreting the literal values allows working directly with these values, e.g. using them as lookups, comparing them, etc. This is not possible with encoded values because the same literal can have multiple representations (the simplest example is single vs double quotes).
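A quick illustration of that point in plain PHP, independent of the parser (the variable names are made up for the example):

```php
<?php
// Three different spellings in source code, one and the same runtime value.
$plain   = "Hello, World!";
$escaped = "\x48\x65\x6C\x6C\x6F\x2C\x20\x57\x6F\x72\x6C\x64\x21";
$single  = 'Hello, World!';

var_dump($plain === $escaped); // bool(true)
var_dump($plain === $single);  // bool(true)

// Single quotes do not interpret \x, so this literal is four characters long.
$notEscaped = '\x48';
var_dump(strlen($notEscaped)); // int(4)
```

An AST that stored the source spelling instead of the value would have to normalize all of these forms before any comparison or lookup.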

I see that this behavior is not appropriate for some use cases; these use cases simply weren't the ones I originally had in mind. My main motivation was a) static analysis and b) automated code changes where nobody ever has to read the generated code.

But in any case, ways to fully retain the file formatting are being discussed in issue #41, so this might soon be possible. Though it probably doesn't really apply to this particular problem, because here the solution is rather simple anyway :)

theseer · 2012-11-21T22:03:43Z

I do see your problem and your point. But considering your very example about automagic rewriting of existing source code, I - as a user - would expect it to NOT modify my string definition when writing it back as source.

But at least for me, the workaround of storing the raw version as an additional attribute works fine.

mvriel · 2012-11-22T10:01:26Z

@nikic perhaps an idea to make a wiki or FAQ entry with this information and the workaround?

nikic closed this as completed Nov 21, 2012
