-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node/Scalar/String.php's parseEscapeSequences() breaks binary strings for (xml) serialization #38
Comments
Isn't this the same issue as #26? |
Heh, this issue seems to crop up in every project trying to serialize to XML. Seen it a few times in various unit testing frameworks, but didn't realize that it applies here too. Just disabling parsing of escape sequences won't really help here, as one could still have issues with malformed UTF-8 in strings (or not UTF-8 at all). PHP's strings are raw binary data after all. The only real way to solve this is what you did. Or rather one could only selectively encode strings containing invalid UTF-8. By the way, what are you using the XML serialization for? |
I don't think that that problem only occurs when serializing to xml - it's just the only processor that actually complains. If you try to save the source back to a new (modified) php file, you'll end up having the same issues: The original \xxx component is lost, making the result unreadable at best. Regarding the UTF-8 issue you pointed out: I'm 'iconv'ing the php source file before parsing to avoid that issue and so far didn't have any problems. The XML serialization is used in phpDox ( https://github.com/theseer/phpDox ). I for now adopted your suggestion for Issue #26 |
@theseer Do you think it would make sense to move the static |
I'm not sure how that would fix the problem as it would defer the solution to a client implementation? A serialization my add - depending on the output format - whatever is needed to escape otherwise invalid chars or translate it according to whatever makes sense, e.g. interpret x?? as binary. Do you really have to - by default! - translate the x??-Values from a string into their binary presentation at parse time? How would you set an x??-Value at runtime for it, expecting the same output as the source had before parsing? I guess the only thing that makes really sense is to keep the "raw" value and, on demand, translate it if requested. |
@theseer The parser provides an abstract syntax tree, meaning that a lot of information is (intentionally) discarded, only retaining the parts that are relevant to the programs interpretation. String formatting is one of those things that are discarded. From PHP's point of view it does not make a difference whether a string is I see that this behavior is not appropriate for some use cases, these use cases simple weren't the ones I originally had in mind. My main motivation was a) static analysis and b) automated code changes where nobody ever has to read the generated code. But in any case, ways to fully retain the file file formatting are being discussed in issue #41, so this might soon be possible. Though it probably doesn't really apply to this particular problem, because here the solution is rather simple anyway :) |
I do see your problem and your point. But considering your very example about automagic rewriting of existing source code, I - as a user - would expect it to NOT modify my string definition when writing it back as source. But at least for me, the workaround with storing the raw version as additional attribute works fine. |
@nikic perhaps an idea to make a wiki or FAQ entry with this information and the workaround? |
Due to the internal call to parseEscapeSequences() uppon construction, the "original" string value is lost, causing invalid xml (invalid unicode bytes) to be written when serializing the nodes.
There is also no way to actually overwrite the value of a string with binary data and still get a "save" xml output.
What i found to be fixing the issue is to bin2hex the data and prepend \x to each byte, e.g. something like this:
$field= '\x' . substr(chunk_split(bin2hex($value),2,"\x"), 0, -2);
This would obviously encode all chars though, not only the original binary ones...
The text was updated successfully, but these errors were encountered: