Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

haxe.Http class should not use Strings to represent binary data #2586

Closed
bjitivo opened this issue Jan 29, 2014 · 29 comments · Fixed by #8570
Closed

haxe.Http class should not use Strings to represent binary data #2586

bjitivo opened this issue Jan 29, 2014 · 29 comments · Fixed by #8570

Comments

@bjitivo
Copy link
Contributor

bjitivo commented Jan 29, 2014

We have modified the C++ String implementation to correctly support UTF-8 strings. Our implementation does not accept improperly encoded UTF-8 strings, and we have found that when using the haxe.Http class, the representation of binary data as Strings in functions like onData and setPostData causes problems.

Using a String as a container for binary data implies that the String internal representation is in bytes; but this is a problematic assumption to make when the String internal representation is UTF-8.

Another issue is the fact that binary data used in onData and setPostData may have embedded 0 values (nulls), and embedded nulls are in general at odds with string representations.

Why don't onData and setPostData use Bytes objects instead of Strings?

@bjitivo
Copy link
Contributor Author

bjitivo commented Jan 29, 2014

Same problem exists in the Haxe Socket class. This appears to be an endemic problem of the Haxe library design.

It seems that it is impossible to create a String implementation that enforces that when created from binary data, the binary data is properly formed UTF-8, because Strings are used all over the Haxe standard library to represent arbitrary binary data.

May I say again what an awful, horrendous choice it is to use String to represent arbitrary binary data?

@ncannasse
Copy link
Member

Thanks for reporting, we will investigate it.

For the socket you can use the input/output which allows you to read/write binary data.

The String usage comes from historical reasons, where we didn't have haxe.io.Bytes, as well as the lack of binary data support on some platforms (JS only got typed arrays in Chrome 7 in late 2010 and Firefox 4 in mid 2011)

@bjitivo
Copy link
Contributor Author

bjitivo commented Jan 29, 2014

I think we can work around this problem. The basic mis-assumption of our implementation is that Bytes.toString() and Bytes.readString() should assume UTF-8 encodings of the strings being read or written. That assumption is incorrect and I think leads to most other Haxe io related APIs not working.

@ncannasse
Copy link
Member

What actual API change would you suggest?

@bjitivo
Copy link
Contributor Author

bjitivo commented Jan 29, 2014

I think ideally String would not be used for binary data; but since that is a very far-reaching change.

I have implemented a workaround in our local version of Haxe: I've added readStringUtf8() to haxe.io.Input and writeStringUtf8() to haxe.io.Output. This allows haxe.io.Input.readString() and haxe.io.Output.writeString() to read and write strings as raw byte values, and the Utf8() versions of these functions to read and write strings using Utf8 encoding.

The additions of these APIs then require additional functions in Bytes: toStringUtf8() and ofStringUtf8().

What confuses me is that Bytes already appears to assume UTF-8 encoding of the input or output strings for some languages; for example the cs version of Bytes.toString() has:

return cs.system.text.Encoding.UTF8.GetString(b, 0, length);

I don't really understand how these other target platforms can encode and decode byte data into/from Strings using UTF-8 encoding, and yet the Socket API and other APIs which assume that Strings have raw binary data, continue to work. It's a mystery to me; but I am concerned primarily with the C++ target where I can understand things a little better.

The only real problem with my approach is with Bytes.fromString(). Since our internal representation of Strings always uses UTF-8 encoding, it means that when raw byte arrays are converted to Strings using Bytes.toString(), the bytes are single-byte encoded if the value is < 0x80 but two-byte encoded if the value is between 0x80 and 0xFF, inclusive. When converting a String that is internally represented in this way back into raw binary bytes, there is no problem; since the String itself only included characters in the range 0 - 0xFF inclusive, the conversion back to binary characters can be exact.

But if the String contained Unicode characters > 0xFF, then converting back to raw binary using Bytes.fromString() is problematic. How do we represent a unicode character value greater than 0xFF in a raw binary form that would make sense in Bytes.fromString()? I chose the simplest and most obvious solution: unicode characters between 0x100 and 0xFFFF inclusive encode into two binary bytes (MSB first), and unicode characters > 0xFFFF encode into 4 bytes.

As long as Strings are always treated either as binary data, or as actual strings, this scheme will work correctly. But it's awfully fragile.

@hughsando
Copy link
Member

Yes, strings are "array of bytes", not "arrays of characters" in haxe.
Characters entered in haxe code in the range 7f-ff are converted to 2
bytes. The rest, you are on your own.
I would like to see haxe string literals as utf8 in code. Let the backend
store it however it likes (I think flash uses utf16/32 ?).
String.toBytes() should convert from internal representation to utf8 (a
no-op if the string is stored as utf8). Things that do not deal with
printing character codes should by bytes. Reading/Writing a string to
stream should imply a utf8 stream and have nothing to do with the internal
representation.
Thus it should not be possible to determine the internal representation
without resorting to the index functions.

It would be great for users if it was not possible to determine the
internal representation externally, which means the index functions should
refer to character indices not bytes. But there are implementation issues
here if the internal representation is utf8.

Hugh

On Thu, Jan 30, 2014 at 2:54 AM, bjitivo [email protected] wrote:

I think ideally String would not be used for binary data; but since that
is a very far-reaching change.

I have implemented a workaround in our local version of Haxe: I've added
readStringUtf8() to haxe.io.Input and writeStringUtf8() to haxe.io.Output.
This allows haxe.io.Input.readString() and haxe.io.Output.writeString() to
read and write strings as raw byte values, and the Utf8() versions of these
functions to read and write strings using Utf8 encoding.

The additions of these APIs then require additional functions in Bytes:
toStringUtf8() and ofStringUtf8().

What confuses me is that Bytes already appears to assume UTF-8 encoding of
the input or output strings for some languages; for example the cs version
of Bytes.toString() has:

return cs.system.text.Encoding.UTF8.GetString(b, 0, length);

I don't really understand how these other target platforms can encode and
decode byte data into/from Strings using UTF-8 encoding, and yet the Socket
API and other APIs which assume that Strings have raw binary data, continue
to work. It's a mystery to me; but I am concerned primarily with the C++
target where I can understand things a little better.

The only real problem with my approach is with Bytes.fromString(). Since
our internal representation of Strings always uses UTF-8 encoding, it means
that when raw byte arrays are converted to Strings using Bytes.toString(),
the bytes are single-byte encoded if the value is < 0x80 but two-byte
encoded if the value is between 0x80 and 0xFF, inclusive. When converting a
String that is internally represented in this way back into raw binary
bytes, there is no problem; since the String itself only included
characters in the range 0 - 0xFF inclusive, the conversion back to binary
characters can be exact.

But if the String contained Unicode characters > 0xFF, then converting
back to raw binary using Bytes.fromString() is problematic. How do we
represent a unicode character value greater than 0xFF in a raw binary form
that would make sense in Bytes.fromString()? I chose the simplest and most
obvious solution: unicode characters between 0x100 and 0xFFFF inclusive
encode into two binary bytes (MSB first), and unicode characters > 0xFFFF
encode into 4 bytes.

As long as Strings are always treated either as binary data, or as actual
strings, this scheme will work correctly. But it's awfully fragile.

Reply to this email directly or view it on GitHubhttps://github.com//issues/2586#issuecomment-33616910
.

@ghost ghost assigned ncannasse Feb 1, 2014
@ncannasse
Copy link
Member

We will review entirely the String encoding and usage of binary data in 3.4, I'm assigning the issue to the corresponding milestone

@ncannasse ncannasse added this to the 3.4 milestone Feb 23, 2014
@singpolyma
Copy link
Contributor

Bytes is used internally in Http -- couldn't a onDataBytes method be added pretty easily that would expose the internal bytes without calling toString() ? This came up recently since it silently corrupts image downloads and similar in my use case :P

@ProPuke
Copy link
Contributor

ProPuke commented Feb 4, 2015

I propose the addition of an encoding enum (UTF8 or BINARY) parameter for Http. The default value being HttpEncoding.UTF8. The onData callback could be set to Dynamic and return either a String or Bytes depending on the value of this parameter.

This is backward compat, allows the addition of other future encodings, and allows differing platforms to conditionally support encodings. It does leave a slightly unsightly Dynamic, though.

@singpolyma
Copy link
Contributor

It doesn't make a lot of sense for the user to specify the encoding, since HTTP has built-in ways for the server to specify that.

@ProPuke
Copy link
Contributor

ProPuke commented Feb 5, 2015

A lot of the time when requesting content specifying a format to receive in makes a lot of sense. For example, if requesting an html or json url text content is presumed. If requesting a downloadable asset a binary encoding is presumed (you may wish to request a binary encoding even if downloading an ascii file, so that it can be saved locally in a byte-identical form). In these cases explicitly stating what format to receive makes a lot of sense. If this encoding is not available for the data received, a suitable error should occur instead.

An HttpEncoding.AUTO could be available for letting the server dictate the encoding if you wish to dynamically handle all cases.

@ncannasse ncannasse modified the milestones: 3.3, Long term Jun 10, 2015
@ncannasse
Copy link
Member

@Simn could you review the changes needed to have haxe.io.Bytes results for haxe.Http, VS introducing a new haxe.io.Http that works the same as haxe.Http but with binary data ?

@ncannasse ncannasse assigned Simn and unassigned ncannasse Jun 10, 2015
@Simn
Copy link
Member

Simn commented Jun 10, 2015

Are you just talking about the results passed to onData/onError or is this also about other parts which currently use strings (responseData, responseHeaders, headers, params)?

@back2dos
Copy link
Member

The protocol headers are supposedly "human readable", so String should do
the job there, while having a binary body would definitely would be useful.

On Wed, Jun 10, 2015 at 11:42 PM, Simon Krajewski [email protected]
wrote:

Are you just talking about the results passed to onData/onError or is this
also about other parts which currently use strings (responseData,
responseHeaders, headers, params)?


Reply to this email directly or view it on GitHub
#2586 (comment)
.

@hughsando
Copy link
Member

I agree the headers should be strings.
The body might be a good use case for abstract - if could return a
"ResponseData(haxe.io.Bytes)", which when assigned to a string does
utf8->string (throwing an error on invalid utf8 for security reasons), or
to a unicode (utf32?) string converts correctly, or to a Bytes passes
though automatically for maximal compatibility and performance. Such an
object might be useful in other ways too.

On Thu, Jun 11, 2015 at 6:42 AM, Juraj Kirchheim [email protected]
wrote:

The protocol headers are supposedly "human readable", so String should do
the job there, while having a binary body would definitely would be useful.

On Wed, Jun 10, 2015 at 11:42 PM, Simon Krajewski <
[email protected]>
wrote:

Are you just talking about the results passed to onData/onError or is
this
also about other parts which currently use strings (responseData,
responseHeaders, headers, params)?


Reply to this email directly or view it on GitHub
<
https://github.com/HaxeFoundation/haxe/issues/2586#issuecomment-110922990>

.


Reply to this email directly or view it on GitHub
#2586 (comment)
.

@ProPuke
Copy link
Contributor

ProPuke commented Jun 11, 2015

For reference: I pushed an update across to dionjwa/nodejs-std at the same as making my original suggestion, as I need the functionality on a haxe node project.

You can view the pull requests here.

@ncannasse
Copy link
Member

I agree we could keep headers as String, hopefully there will no UTF-8 incompatible header.

@Simn
Copy link
Member

Simn commented Jun 11, 2015

Well then it's just a question of what to do with the interface. We can't just change the type of onData/onError.

@back2dos
Copy link
Member

How about just leaving onError:String->Void and having an additional onBytes:Bytes->Void. Usually when you get an error body, it's human readable or just a white screen of death, in which case bytes vs. string doesn't make much of a difference either.

Other options:

  • Add an onErrorBytes callback if needed (personally I'd not add it unless someone actually asks for it)
  • Make onData and onError and abstract with @:from String->Void and @:from Bytes->Void (but this is probably going to break code that relies on inference for the callback argument type)

@waneck
Copy link
Member

waneck commented Jun 11, 2015

If we keep any part as a string, we should still make sure that an invalid
UTF string there won't crash the Http class in an unexpected way

On Thu, Jun 11, 2015, 07:40 Juraj Kirchheim [email protected]
wrote:

How about just leaving onError:String->Void and having an additional
onBytes:Bytes->Void. Usually when you get an error body, it's human
readable or just a white screen of death, in which case bytes vs. string
doesn't make much of a difference either.

Other options:

  • Add an onErrorBytes callback if needed (personally I'd not add it
    unless someone actually asks for it)
  • Make onData and onBytes and abstract with @:from String->Void and
    @:from Bytes->Void (but this is probably going to break code that relies on
    inference for the callback argument type)


Reply to this email directly or view it on GitHub
#2586 (comment)
.

@Simn
Copy link
Member

Simn commented Jun 21, 2015

@back2dos: Are you saying Http should call both onBytes and onData?

@back2dos
Copy link
Member

Yes, both. You could change them from dynamic function to var and check for null to not have to convert Bytes to String or vice versa on platforms where that's expensive.

@Simn Simn removed their assignment Nov 23, 2015
@Simn
Copy link
Member

Simn commented Nov 23, 2015

I've unassigned myself because I don't feel comfortable working on this. If someone sends a pull request we can review it accordingly.

@cancerberoSgx
Copy link
Contributor

cancerberoSgx commented Jul 19, 2019

What's the status of this issue ? is any way of requesting binary content (for example images) ? The following works on neko but breaks in cpp and java:

  var http = new haxe.Http(url);
  http.onData = function (data:String) {
    var bytes = haxe.io.Bytes.ofString(data);      
    // I tried with Encoding.UTF8 and Encoding.RawNative but not success.
  };
  http.request();

Any workaround to request binary content is most appreciated. Thanks!

@Aurel300
Copy link
Member

A small update to some of the old messages in this thread: String in Haxe is now only for textual, UTF-8 data, so it is not safe to use it to store binary data (which would be invalid UTF-8 in general). The only exception is neko, which is not Unicode-safe.

There has not been a binary-safe refactor for Http as of yet.

@cancerberoSgx
Copy link
Contributor

cancerberoSgx commented Jul 19, 2019

This is how I accomplished to get binary content on sys targets. Unfortunately customRequest method, although it won't give compilation errors, it's not documented. (So a cheap solution could be just exposing that method or an alias and document it). Hope it help others:

var req = new haxe.Http(url);
var responseBytes = new haxe.io.BytesOutput();
req.customRequest(false, responseBytes, null, "GET");
var bytes = responseBytes.getBytes();

This is a fetch() function that will work on all targets, including node.js and the browser:

public static function fetchResource(url:String, cb:(error:Dynamic, data:haxe.io.BytesData) -> Void) {
  #if js
  untyped if ((typeof(window) == 'undefined' || typeof(window.fetch) == 'undefined') && typeof(process) != 'undefined') {
    untyped require('https').get(url, resp -> {
      var data = [];
      resp.on('data', chunk -> {
        data.push(chunk);
      });
      resp.on('end', chunk -> {
        var body = Buffer.concat(data);
        cb(null, body);
        return null;
      });
    }).on("error", (err) -> {
      cb(err, null);
    });
  } else {
    untyped js.Browser.window.fetch(url).then(response -> response.arrayBuffer().then(data -> cb(null, data)));
  }
  #else
  var req = new haxe.Http(url);
  var responseBytes = new haxe.io.BytesOutput();
  req.onError = function(err) {
    cb(err, null);
  };
  req.customRequest(false, responseBytes, null, "GET");
  cb(null, responseBytes.getBytes().getData());
  #end
}

I would be glad to create a cookbook example with these, but I feel it will be rejected since the method is not exposed. What do you think?

@cancerberoSgx
Copy link
Contributor

Also a small feedback, you want to expose a signature based on haxe.io.Input and haxe.io.Output rather than in haxe.io.Bytes since they support more use cases like streams.

@Aurel300
Copy link
Member

4.1 (hopefully) will bring streams and a more Node.JS-like API for HTTP, see HaxeFoundation/haxe-evolution#59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

11 participants