Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node/Scalar/String.php's parseEscapeSequences() breaks binary strings for (xml) serialization #38

Closed
theseer opened this issue Oct 14, 2012 · 8 comments

Comments

@theseer
Copy link
Contributor

theseer commented Oct 14, 2012

Due to the internal call to parseEscapeSequences() uppon construction, the "original" string value is lost, causing invalid xml (invalid unicode bytes) to be written when serializing the nodes.

There is also no way to actually overwrite the value of a string with binary data and still get a "save" xml output.

What i found to be fixing the issue is to bin2hex the data and prepend \x to each byte, e.g. something like this:

$field= '\x' . substr(chunk_split(bin2hex($value),2,"\x"), 0, -2);

This would obviously encode all chars though, not only the original binary ones...

@mvriel
Copy link

mvriel commented Oct 15, 2012

Isn't this the same issue as #26?

@nikic
Copy link
Owner

nikic commented Oct 15, 2012

Heh, this issue seems to crop up in every project trying to serialize to XML. Seen it a few times in various unit testing frameworks, but didn't realize that it applies here too.

Just disabling parsing of escape sequences won't really help here, as one could still have issues with malformed UTF-8 in strings (or not UTF-8 at all). PHP's strings are raw binary data after all.

The only real way to solve this is what you did. Or rather one could only selectively encode strings containing invalid UTF-8.

By the way, what are you using the XML serialization for?

@theseer
Copy link
Contributor Author

theseer commented Oct 15, 2012

I don't think that that problem only occurs when serializing to xml - it's just the only processor that actually complains. If you try to save the source back to a new (modified) php file, you'll end up having the same issues: The original \xxx component is lost, making the result unreadable at best.

Regarding the UTF-8 issue you pointed out: I'm 'iconv'ing the php source file before parsing to avoid that issue and so far didn't have any problems.

The XML serialization is used in phpDox ( https://github.com/theseer/phpDox ). I for now adopted your suggestion for Issue #26

@nikic
Copy link
Owner

nikic commented Oct 31, 2012

@theseer Do you think it would make sense to move the static Scalar_*::parse() methods into the parser (as normal methods), so they can be overridden by extending it?

@theseer
Copy link
Contributor Author

theseer commented Nov 20, 2012

I'm not sure how that would fix the problem as it would defer the solution to a client implementation?
From a user perspective, I'd expect the Parser not to "interfere" by modifying or translating values when parsing.

A serialization my add - depending on the output format - whatever is needed to escape otherwise invalid chars or translate it according to whatever makes sense, e.g. interpret x?? as binary.

Do you really have to - by default! - translate the x??-Values from a string into their binary presentation at parse time? How would you set an x??-Value at runtime for it, expecting the same output as the source had before parsing?

I guess the only thing that makes really sense is to keep the "raw" value and, on demand, translate it if requested.

@nikic
Copy link
Owner

nikic commented Nov 21, 2012

@theseer The parser provides an abstract syntax tree, meaning that a lot of information is (intentionally) discarded, only retaining the parts that are relevant to the programs interpretation. String formatting is one of those things that are discarded. From PHP's point of view it does not make a difference whether a string is "Hello, World!" or whether it is "\x48\x65\x6C\x6C\x6F\x2C\x20\x77\x6F\x72\x6C\x64\x21". Interpreting the literal values allows to directly work with these values, e.g. use them as lookups, compare them, etc. This is not possible with encoded values because the same literal can have multiple representations (simplest example is single vs double quotes).

I see that this behavior is not appropriate for some use cases, these use cases simple weren't the ones I originally had in mind. My main motivation was a) static analysis and b) automated code changes where nobody ever has to read the generated code.

But in any case, ways to fully retain the file file formatting are being discussed in issue #41, so this might soon be possible. Though it probably doesn't really apply to this particular problem, because here the solution is rather simple anyway :)

@theseer
Copy link
Contributor Author

theseer commented Nov 21, 2012

I do see your problem and your point. But considering your very example about automagic rewriting of existing source code, I - as a user - would expect it to NOT modify my string definition when writing it back as source.

But at least for me, the workaround with storing the raw version as additional attribute works fine.

@nikic nikic closed this as completed Nov 21, 2012
@mvriel
Copy link

mvriel commented Nov 22, 2012

@nikic perhaps an idea to make a wiki or FAQ entry with this information and the workaround?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants