# haxe.Http class should not use Strings to represent binary data #2586
The same problem exists in the Haxe Socket class. This appears to be an endemic problem of the Haxe library design. It seems that it is impossible to create a String implementation that enforces that, when created from binary data, the binary data is properly formed UTF-8, because Strings are used all over the Haxe standard library to represent arbitrary binary data. May I say again what an awful, horrendous choice it is to use String to represent arbitrary binary data?
Thanks for reporting; we will investigate it. For the socket you can use the input/output, which allows you to read/write binary data. The String usage comes from historical reasons, from when we didn't have haxe.io.Bytes, as well as from the lack of binary data support on some platforms (JS only got typed arrays in Chrome 7 in late 2010 and Firefox 4 in mid-2011).
I think we can work around this problem. The basic mistaken assumption of our implementation is that Bytes.toString() and Bytes.readString() should assume UTF-8 encoding of the strings being read or written. That assumption is incorrect and, I think, leads to most other Haxe io-related APIs not working.
What actual API change would you suggest?
I think ideally String would not be used for binary data; but since that is a very far-reaching change, I have implemented a workaround in our local version of Haxe: I've added readStringUtf8() to haxe.io.Input and writeStringUtf8() to haxe.io.Output. This allows haxe.io.Input.readString() and haxe.io.Output.writeString() to read and write strings as raw byte values, and the Utf8() versions of these functions to read and write strings using UTF-8 encoding. The addition of these APIs then requires additional functions in Bytes: toStringUtf8() and ofStringUtf8().

What confuses me is that Bytes already appears to assume UTF-8 encoding of the input or output strings for some languages; for example, the cs version of Bytes.toString() has: `return cs.system.text.Encoding.UTF8.GetString(b, 0, length);`. I don't really understand how these other target platforms can encode and decode byte data into/from Strings using UTF-8 encoding while the Socket API and other APIs that assume Strings hold raw binary data continue to work. It's a mystery to me; but I am concerned primarily with the C++ target, where I can understand things a little better.

The only real problem with my approach is with Bytes.ofString(). Since our internal representation of Strings always uses UTF-8 encoding, when raw byte arrays are converted to Strings using Bytes.toString(), each byte is single-byte encoded if its value is < 0x80 but two-byte encoded if its value is between 0x80 and 0xFF, inclusive. When converting a String that is internally represented in this way back into raw binary bytes, there is no problem: since the String itself only contains characters in the range 0 to 0xFF inclusive, the conversion back to binary bytes can be exact. But if the String contains Unicode characters > 0xFF, then converting back to raw binary using Bytes.ofString() is problematic.
How do we represent a Unicode character value greater than 0xFF in a raw binary form that would make sense in Bytes.ofString()? I chose the simplest and most obvious solution: Unicode characters between 0x100 and 0xFFFF inclusive encode into two binary bytes (MSB first), and Unicode characters > 0xFFFF encode into four bytes. As long as Strings are always treated either as binary data or as actual strings, this scheme will work correctly. But it's awfully fragile.
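The widening scheme described above can be sketched in Python (the function name is invented for illustration; the thread's actual Haxe patch is not shown here). It also demonstrates the underlying friction: a UTF-8 string already stores U+00FF as two bytes, even though the scheme maps it back to the single raw byte 0xFF.

```python
def string_to_binary(s: str) -> bytes:
    """Hypothetical sketch of the scheme described above."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp <= 0xFF:
            # characters 0x00-0xFF map back to exactly one byte
            out.append(cp)
        elif cp <= 0xFFFF:
            # 0x100-0xFFFF: two bytes, most significant byte first
            out += cp.to_bytes(2, "big")
        else:
            # above 0xFFFF: four bytes
            out += cp.to_bytes(4, "big")
    return bytes(out)

# The fragility: a decoder cannot distinguish one two-byte sequence
# from two one-byte characters, so the mapping only works if every
# String is consistently treated as either text or binary, never both.
```

Note that `"\xff".encode("utf-8")` is the two-byte sequence `b"\xc3\xbf"`, which is exactly why a UTF-8 internal representation cannot hold raw bytes one-to-one.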
Yes, strings are "arrays of bytes", not "arrays of characters", in Haxe. It would be great for users if it was not possible to determine the …
We will entirely review the String encoding and the usage of binary data in 3.4; I'm assigning the issue to the corresponding milestone.
I propose the addition of an optional `encoding` parameter. This is backward compatible, allows the addition of other future encodings, and allows differing platforms to conditionally support encodings. It does leave a slightly unsightly …
It doesn't make a lot of sense for the user to specify the encoding, since HTTP has built-in ways for the server to specify that.
A lot of the time, specifying a format to receive the content in makes sense. For example, when requesting an HTML or JSON URL, text content is presumed; when requesting a downloadable asset, a binary encoding is presumed (you may wish to request a binary encoding even when downloading an ASCII file, so that it can be saved locally in a byte-identical form). In these cases, explicitly stating what format to receive makes a lot of sense. If this encoding is not possible for the data received, a suitable error should occur instead. An …
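The caller-specified-format idea above can be sketched as follows (Python, with invented names; this is not the proposal's actual API): binary requests hand back the payload untouched, while text requests fail loudly if the data is not valid for the stated encoding.

```python
from enum import Enum

class Encoding(Enum):
    RAW = "raw"    # hand back the bytes untouched, byte-identical
    UTF8 = "utf8"  # decode as text, erroring on malformed data

def decode_body(body: bytes, encoding: Encoding):
    # Hypothetical helper: the caller states the format it expects,
    # and invalid data raises instead of being silently corrupted.
    if encoding is Encoding.RAW:
        return body
    return body.decode("utf-8")  # raises UnicodeDecodeError if invalid
```

The design point is that the error surfaces at the request boundary, rather than producing a corrupted String deep inside application code.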
@Simn could you review the changes needed to have haxe.io.Bytes results for haxe.Http, vs. introducing a new haxe.io.Http that works the same as haxe.Http but with binary data?
Are you just talking about the results passed to onData/onError, or is this also about other parts which currently use strings (responseData, responseHeaders, headers, params)?
The protocol headers are supposedly "human readable", so String should do.
I agree the headers should be strings.
For reference: I pushed an update across to dionjwa/nodejs-std at the same time as making my original suggestion, as I need the functionality on a Haxe node project. You can view the pull requests here.
I agree we could keep headers as String; hopefully there will be no UTF-8-incompatible headers.
Well then it's just a question of what to do with the interface. We can't just change the type of onData/onError.
How about just leaving `onData` as it is and adding an `onBytes` callback? Other options: …
If we keep any part as a string, we should still make sure that an invalid …
@back2dos: Are you saying Http should call both onBytes and onData?
Yes, both. You could change them from …
I've unassigned myself because I don't feel comfortable working on this. If someone sends a pull request, we can review it accordingly.
What's the status of this issue? Is there any way of requesting binary content (for example, images)? The following works on neko but breaks on cpp and java:

```haxe
var http = new haxe.Http(url);
http.onData = function (data:String) {
    var bytes = haxe.io.Bytes.ofString(data);
    // I tried with Encoding.UTF8 and Encoding.RawNative, but no success.
};
http.request();
```

Any workaround to request binary content is most appreciated. Thanks!
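Why round-tripping binary data through a String breaks on UTF-8-based targets can be shown with a quick Python analogy (Python used only for illustration): pushing all 256 byte values through a UTF-8 string re-encodes the high half as two-byte sequences, so the payload is no longer byte-identical.

```python
raw = bytes(range(256))           # arbitrary binary payload

# Pretend the HTTP layer handed us this data as a "string" by mapping
# each byte value to the code point of the same value, which is what
# a bytes-as-string target effectively does.
as_string = raw.decode("latin-1")

# On a target whose Strings are UTF-8 internally, storing that string
# re-encodes it: bytes 0x80-0xFF become two bytes each, so the result
# is 128 + 128*2 = 384 bytes instead of the original 256.
reencoded = as_string.encode("utf-8")
```

The same expansion is what corrupts image data passed through `onData` on targets with a UTF-8 String representation.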
A small update to some of the old messages in this thread: there has not been a binary-safe refactor for `haxe.Http`.
This is how I managed to get binary content on sys targets. Unfortunately, although the customRequest method won't give compilation errors, it's not documented. (So a cheap solution could be just exposing that method, or an alias, and documenting it.) Hope it helps others:

```haxe
var req = new haxe.Http(url);
var responseBytes = new haxe.io.BytesOutput();
req.customRequest(false, responseBytes, null, "GET");
var bytes = responseBytes.getBytes();
```

This is a fetch() function that will work on all targets, including Node.js and the browser:

```haxe
public static function fetchResource(url:String, cb:(error:Dynamic, data:haxe.io.BytesData) -> Void) {
    #if js
    untyped if ((typeof(window) == 'undefined' || typeof(window.fetch) == 'undefined') && typeof(process) != 'undefined') {
        // Node.js: use the https module directly
        untyped require('https').get(url, resp -> {
            var data = [];
            resp.on('data', chunk -> {
                data.push(chunk);
            });
            resp.on('end', chunk -> {
                var body = Buffer.concat(data);
                cb(null, body);
                return null;
            });
        }).on("error", (err) -> {
            cb(err, null);
        });
    } else {
        // Browser: use window.fetch
        untyped js.Browser.window.fetch(url).then(response ->
            response.arrayBuffer().then(data -> cb(null, data)));
    }
    #else
    var req = new haxe.Http(url);
    var responseBytes = new haxe.io.BytesOutput();
    req.onError = function(err) {
        cb(err, null);
    };
    req.customRequest(false, responseBytes, null, "GET");
    cb(null, responseBytes.getBytes().getData());
    #end
}
```

I would be glad to create a cookbook example with these, but I feel it would be rejected since the method is not exposed. What do you think?
Also, some small feedback: you want to expose a signature based on …
4.1 (hopefully) will bring streams and a more Node.js-like API for HTTP; see HaxeFoundation/haxe-evolution#59.
We have modified the C++ String implementation to correctly support UTF-8 strings. Our implementation does not accept improperly encoded UTF-8 strings, and we have found that when using the haxe.Http class, the representation of binary data as Strings in functions like onData and setPostData causes problems.
Using a String as a container for binary data implies that the String internal representation is in bytes; but this is a problematic assumption to make when the String internal representation is UTF-8.
Another issue is the fact that binary data used in onData and setPostData may have embedded 0 values (nulls), and embedded nulls are in general at odds with string representations.
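Both points are easy to demonstrate (Python used purely for illustration): typical binary payloads are not valid UTF-8 and routinely contain NUL bytes.

```python
# The first bytes of a PNG file: 0x89 is a continuation byte in UTF-8
# and can never start a character, and the chunk-length field that
# follows the signature contains embedded NULs.
png_prefix = b"\x89PNG\r\n\x1a\n\x00\x00\x00\x0d"

def is_valid_utf8(data: bytes) -> bool:
    # A String type that enforces well-formed UTF-8 must reject this.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

So a strict UTF-8 String cannot even be constructed from such a payload, which is exactly the conflict described above.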
Why don't onData and setPostData use Bytes objects instead of Strings?