haxe.Http class should not use Strings to represent binary data #2586

bjitivo · 2014-01-29T16:56:51Z

We have modified the C++ String implementation to correctly support UTF-8 strings. Our implementation does not accept improperly encoded UTF-8 strings, and we have found that when using the haxe.Http class, the representation of binary data as Strings in functions like onData and setPostData causes problems.

Using a String as a container for binary data implies that the String internal representation is in bytes; but this is a problematic assumption to make when the String internal representation is UTF-8.

Another issue is the fact that binary data used in onData and setPostData may have embedded 0 values (nulls), and embedded nulls are in general at odds with string representations.

Why don't onData and setPostData use Bytes objects instead of Strings?

bjitivo · 2014-01-29T17:26:22Z

Same problem exists in the Haxe Socket class. This appears to be an endemic problem of the Haxe library design.

It seems that it is impossible to create a String implementation that enforces that when created from binary data, the binary data is properly formed UTF-8, because Strings are used all over the Haxe standard library to represent arbitrary binary data.

May I say again what an awful, horrendous choice it is to use String to represent arbitrary binary data?

ncannasse · 2014-01-29T17:39:25Z

Thanks for reporting, we will investigate it.

For the socket you can use the input/output which allows you to read/write binary data.

The String usage comes from historical reasons, where we didn't have haxe.io.Bytes, as well as the lack of binary data support on some platforms (JS only got typed arrays in Chrome 7 in late 2010 and Firefox 4 in mid 2011)

bjitivo · 2014-01-29T17:42:33Z

I think we can work around this problem. The basic mis-assumption of our implementation is that Bytes.toString() and Bytes.readString() should assume UTF-8 encodings of the strings being read or written. That assumption is incorrect and I think leads to most other Haxe io related APIs not working.

ncannasse · 2014-01-29T18:41:07Z

What actual API change would you suggest?

bjitivo · 2014-01-29T18:54:20Z

I think ideally String would not be used for binary data; but since that is a very far-reaching change.

I have implemented a workaround in our local version of Haxe: I've added readStringUtf8() to haxe.io.Input and writeStringUtf8() to haxe.io.Output. This allows haxe.io.Input.readString() and haxe.io.Output.writeString() to read and write strings as raw byte values, and the Utf8() versions of these functions to read and write strings using Utf8 encoding.

The additions of these APIs then require additional functions in Bytes: toStringUtf8() and ofStringUtf8().

What confuses me is that Bytes already appears to assume UTF-8 encoding of the input or output strings for some languages; for example the cs version of Bytes.toString() has:

return cs.system.text.Encoding.UTF8.GetString(b, 0, length);

I don't really understand how these other target platforms can encode and decode byte data into/from Strings using UTF-8 encoding, and yet the Socket API and other APIs which assume that Strings have raw binary data, continue to work. It's a mystery to me; but I am concerned primarily with the C++ target where I can understand things a little better.

The only real problem with my approach is with Bytes.fromString(). Since our internal representation of Strings always uses UTF-8 encoding, it means that when raw byte arrays are converted to Strings using Bytes.toString(), the bytes are single-byte encoded if the value is < 0x80 but two-byte encoded if the value is between 0x80 and 0xFF, inclusive. When converting a String that is internally represented in this way back into raw binary bytes, there is no problem; since the String itself only included characters in the range 0 - 0xFF inclusive, the conversion back to binary characters can be exact.

But if the String contained Unicode characters > 0xFF, then converting back to raw binary using Bytes.fromString() is problematic. How do we represent a unicode character value greater than 0xFF in a raw binary form that would make sense in Bytes.fromString()? I chose the simplest and most obvious solution: unicode characters between 0x100 and 0xFFFF inclusive encode into two binary bytes (MSB first), and unicode characters > 0xFFFF encode into 4 bytes.

As long as Strings are always treated either as binary data, or as actual strings, this scheme will work correctly. But it's awfully fragile.

hughsando · 2014-01-30T04:02:09Z

Yes, strings are "array of bytes", not "arrays of characters" in haxe.
Characters entered in haxe code in the range 7f-ff are converted to 2
bytes. The rest, you are on your own.
I would like to see haxe string literals as utf8 in code. Let the backend
store it however it likes (I think flash uses utf16/32 ?).
String.toBytes() should convert from internal representation to utf8 (a
no-op if the string is stored as utf8). Things that do not deal with
printing character codes should by bytes. Reading/Writing a string to
stream should imply a utf8 stream and have nothing to do with the internal
representation.
Thus it should not be possible to determine the internal representation
without resorting to the index functions.

It would be great for users if it was not possible to determine the
internal representation externally, which means the index functions should
refer to character indices not bytes. But there are implementation issues
here if the internal representation is utf8.

Hugh

On Thu, Jan 30, 2014 at 2:54 AM, bjitivo [email protected] wrote:

I think ideally String would not be used for binary data; but since that
is a very far-reaching change.

I have implemented a workaround in our local version of Haxe: I've added
readStringUtf8() to haxe.io.Input and writeStringUtf8() to haxe.io.Output.
This allows haxe.io.Input.readString() and haxe.io.Output.writeString() to
read and write strings as raw byte values, and the Utf8() versions of these
functions to read and write strings using Utf8 encoding.

The additions of these APIs then require additional functions in Bytes:
toStringUtf8() and ofStringUtf8().

What confuses me is that Bytes already appears to assume UTF-8 encoding of
the input or output strings for some languages; for example the cs version
of Bytes.toString() has:

return cs.system.text.Encoding.UTF8.GetString(b, 0, length);

I don't really understand how these other target platforms can encode and
decode byte data into/from Strings using UTF-8 encoding, and yet the Socket
API and other APIs which assume that Strings have raw binary data, continue
to work. It's a mystery to me; but I am concerned primarily with the C++
target where I can understand things a little better.

The only real problem with my approach is with Bytes.fromString(). Since
our internal representation of Strings always uses UTF-8 encoding, it means
that when raw byte arrays are converted to Strings using Bytes.toString(),
the bytes are single-byte encoded if the value is < 0x80 but two-byte
encoded if the value is between 0x80 and 0xFF, inclusive. When converting a
String that is internally represented in this way back into raw binary
bytes, there is no problem; since the String itself only included
characters in the range 0 - 0xFF inclusive, the conversion back to binary
characters can be exact.

But if the String contained Unicode characters > 0xFF, then converting
back to raw binary using Bytes.fromString() is problematic. How do we
represent a unicode character value greater than 0xFF in a raw binary form
that would make sense in Bytes.fromString()? I chose the simplest and most
obvious solution: unicode characters between 0x100 and 0xFFFF inclusive
encode into two binary bytes (MSB first), and unicode characters > 0xFFFF
encode into 4 bytes.

As long as Strings are always treated either as binary data, or as actual
strings, this scheme will work correctly. But it's awfully fragile.

Reply to this email directly or view it on GitHubhttps://github.com//issues/2586#issuecomment-33616910
.

ncannasse · 2014-02-23T13:13:16Z

We will review entirely the String encoding and usage of binary data in 3.4, I'm assigning the issue to the corresponding milestone

singpolyma · 2014-03-04T22:47:46Z

Bytes is used internally in Http -- couldn't a onDataBytes method be added pretty easily that would expose the internal bytes without calling toString() ? This came up recently since it silently corrupts image downloads and similar in my use case :P

ProPuke · 2015-02-04T17:32:18Z

I propose the addition of an encoding enum (UTF8 or BINARY) parameter for Http. The default value being HttpEncoding.UTF8. The onData callback could be set to Dynamic and return either a String or Bytes depending on the value of this parameter.

This is backward compat, allows the addition of other future encodings, and allows differing platforms to conditionally support encodings. It does leave a slightly unsightly Dynamic, though.

singpolyma · 2015-02-05T13:08:49Z

It doesn't make a lot of sense for the user to specify the encoding, since HTTP has built-in ways for the server to specify that.

ProPuke · 2015-02-05T14:51:41Z

A lot of the time when requesting content specifying a format to receive in makes a lot of sense. For example, if requesting an html or json url text content is presumed. If requesting a downloadable asset a binary encoding is presumed (you may wish to request a binary encoding even if downloading an ascii file, so that it can be saved locally in a byte-identical form). In these cases explicitly stating what format to receive makes a lot of sense. If this encoding is not available for the data received, a suitable error should occur instead.

An HttpEncoding.AUTO could be available for letting the server dictate the encoding if you wish to dynamically handle all cases.

ncannasse · 2015-06-10T21:18:37Z

@Simn could you review the changes needed to have haxe.io.Bytes results for haxe.Http, VS introducing a new haxe.io.Http that works the same as haxe.Http but with binary data ?

Simn · 2015-06-10T21:42:48Z

Are you just talking about the results passed to onData/onError or is this also about other parts which currently use strings (responseData, responseHeaders, headers, params)?

back2dos · 2015-06-10T22:42:35Z

The protocol headers are supposedly "human readable", so String should do
the job there, while having a binary body would definitely would be useful.

On Wed, Jun 10, 2015 at 11:42 PM, Simon Krajewski [email protected]
wrote:

Are you just talking about the results passed to onData/onError or is this
also about other parts which currently use strings (responseData,
responseHeaders, headers, params)?

—
Reply to this email directly or view it on GitHub
#2586 (comment)
.

hughsando · 2015-06-11T01:33:22Z

I agree the headers should be strings.
The body might be a good use case for abstract - if could return a
"ResponseData(haxe.io.Bytes)", which when assigned to a string does
utf8->string (throwing an error on invalid utf8 for security reasons), or
to a unicode (utf32?) string converts correctly, or to a Bytes passes
though automatically for maximal compatibility and performance. Such an
object might be useful in other ways too.

On Thu, Jun 11, 2015 at 6:42 AM, Juraj Kirchheim [email protected]
wrote:

The protocol headers are supposedly "human readable", so String should do
the job there, while having a binary body would definitely would be useful.

On Wed, Jun 10, 2015 at 11:42 PM, Simon Krajewski <
[email protected]>
wrote:

Are you just talking about the results passed to onData/onError or is
this
also about other parts which currently use strings (responseData,
responseHeaders, headers, params)?

—
Reply to this email directly or view it on GitHub
<
https://github.com/HaxeFoundation/haxe/issues/2586#issuecomment-110922990>

.

—
Reply to this email directly or view it on GitHub
#2586 (comment)
.

ProPuke · 2015-06-11T01:34:40Z

For reference: I pushed an update across to dionjwa/nodejs-std at the same as making my original suggestion, as I need the functionality on a haxe node project.

You can view the pull requests here.

ncannasse · 2015-06-11T07:48:54Z

I agree we could keep headers as String, hopefully there will no UTF-8 incompatible header.

Simn · 2015-06-11T07:55:53Z

Well then it's just a question of what to do with the interface. We can't just change the type of onData/onError.

back2dos · 2015-06-11T10:40:01Z

How about just leaving onError:String->Void and having an additional onBytes:Bytes->Void. Usually when you get an error body, it's human readable or just a white screen of death, in which case bytes vs. string doesn't make much of a difference either.

Other options:

Add an onErrorBytes callback if needed (personally I'd not add it unless someone actually asks for it)
Make onData and onError and abstract with @:from String->Void and @:from Bytes->Void (but this is probably going to break code that relies on inference for the callback argument type)

waneck · 2015-06-11T14:02:02Z

If we keep any part as a string, we should still make sure that an invalid
UTF string there won't crash the Http class in an unexpected way

On Thu, Jun 11, 2015, 07:40 Juraj Kirchheim [email protected]
wrote:

How about just leaving onError:String->Void and having an additional
onBytes:Bytes->Void. Usually when you get an error body, it's human
readable or just a white screen of death, in which case bytes vs. string
doesn't make much of a difference either.

Other options:

Add an onErrorBytes callback if needed (personally I'd not add it
unless someone actually asks for it)

Make onData and onBytes and abstract with @:from String->Void and
@:from Bytes->Void (but this is probably going to break code that relies on
inference for the callback argument type)

—
Reply to this email directly or view it on GitHub
#2586 (comment)
.

Simn · 2015-06-21T09:03:58Z

@back2dos: Are you saying Http should call both onBytes and onData?

back2dos · 2015-06-21T11:21:34Z

Yes, both. You could change them from dynamic function to var and check for null to not have to convert Bytes to String or vice versa on platforms where that's expensive.

Simn · 2015-11-23T16:33:46Z

I've unassigned myself because I don't feel comfortable working on this. If someone sends a pull request we can review it accordingly.

cancerberoSgx · 2019-07-19T08:45:03Z

What's the status of this issue ? is any way of requesting binary content (for example images) ? The following works on neko but breaks in cpp and java:

  var http = new haxe.Http(url);
  http.onData = function (data:String) {
    var bytes = haxe.io.Bytes.ofString(data);      
    // I tried with Encoding.UTF8 and Encoding.RawNative but not success.
  };
  http.request();

Any workaround to request binary content is most appreciated. Thanks!

Aurel300 · 2019-07-19T08:52:57Z

A small update to some of the old messages in this thread: String in Haxe is now only for textual, UTF-8 data, so it is not safe to use it to store binary data (which would be invalid UTF-8 in general). The only exception is neko, which is not Unicode-safe.

There has not been a binary-safe refactor for Http as of yet.

cancerberoSgx · 2019-07-19T15:01:26Z

This is how I accomplished to get binary content on sys targets. Unfortunately customRequest method, although it won't give compilation errors, it's not documented. (So a cheap solution could be just exposing that method or an alias and document it). Hope it help others:

var req = new haxe.Http(url);
var responseBytes = new haxe.io.BytesOutput();
req.customRequest(false, responseBytes, null, "GET");
var bytes = responseBytes.getBytes();

This is a fetch() function that will work on all targets, including node.js and the browser:

public static function fetchResource(url:String, cb:(error:Dynamic, data:haxe.io.BytesData) -> Void) {
  #if js
  untyped if ((typeof(window) == 'undefined' || typeof(window.fetch) == 'undefined') && typeof(process) != 'undefined') {
    untyped require('https').get(url, resp -> {
      var data = [];
      resp.on('data', chunk -> {
        data.push(chunk);
      });
      resp.on('end', chunk -> {
        var body = Buffer.concat(data);
        cb(null, body);
        return null;
      });
    }).on("error", (err) -> {
      cb(err, null);
    });
  } else {
    untyped js.Browser.window.fetch(url).then(response -> response.arrayBuffer().then(data -> cb(null, data)));
  }
  #else
  var req = new haxe.Http(url);
  var responseBytes = new haxe.io.BytesOutput();
  req.onError = function(err) {
    cb(err, null);
  };
  req.customRequest(false, responseBytes, null, "GET");
  cb(null, responseBytes.getBytes().getData());
  #end
}

I would be glad to create a cookbook example with these, but I feel it will be rejected since the method is not exposed. What do you think?

cancerberoSgx · 2019-07-19T15:09:54Z

Also a small feedback, you want to expose a signature based on haxe.io.Input and haxe.io.Output rather than in haxe.io.Bytes since they support more use cases like streams.

Aurel300 · 2019-07-19T15:12:33Z

4.1 (hopefully) will bring streams and a more Node.JS-like API for HTTP, see HaxeFoundation/haxe-evolution#59

ghost assigned ncannasse Feb 1, 2014

ncannasse added this to the 3.4 milestone Feb 23, 2014

andyli mentioned this issue Nov 4, 2014

Null terminated string in Http request #3541

Closed

Simn added the haxe-foundation-partner label Feb 25, 2015

ncannasse modified the milestones: 3.3, Long term Jun 10, 2015

ncannasse assigned Simn and unassigned ncannasse Jun 10, 2015

Simn removed their assignment Nov 23, 2015

Simn modified the milestones: 3.4, 3.3.0-rc1 Apr 2, 2016

Simn modified the milestones: 3.4, 4.0 Jan 9, 2017

Simn modified the milestones: Release 4.0, Design Apr 17, 2018

mundusnine mentioned this issue Feb 27, 2019

Adds http bytes request(tested in hashlink) #7891

Closed

haxiomic mentioned this issue Feb 27, 2019

[js] should HttpBase's setPostData() parameter accept more than just String? #7782

Closed

RealyUniqueName mentioned this issue Jul 19, 2019

Bytes API for haxe.Http #8570

Merged

RealyUniqueName closed this as completed in #8570 Aug 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

haxe.Http class should not use Strings to represent binary data #2586

haxe.Http class should not use Strings to represent binary data #2586

bjitivo commented Jan 29, 2014

bjitivo commented Jan 29, 2014

ncannasse commented Jan 29, 2014

bjitivo commented Jan 29, 2014

ncannasse commented Jan 29, 2014

bjitivo commented Jan 29, 2014

hughsando commented Jan 30, 2014

ncannasse commented Feb 23, 2014

singpolyma commented Mar 4, 2014

ProPuke commented Feb 4, 2015

singpolyma commented Feb 5, 2015

ProPuke commented Feb 5, 2015

ncannasse commented Jun 10, 2015

Simn commented Jun 10, 2015

back2dos commented Jun 10, 2015

hughsando commented Jun 11, 2015

ProPuke commented Jun 11, 2015

ncannasse commented Jun 11, 2015

Simn commented Jun 11, 2015

back2dos commented Jun 11, 2015

waneck commented Jun 11, 2015

Simn commented Jun 21, 2015

back2dos commented Jun 21, 2015

Simn commented Nov 23, 2015

cancerberoSgx commented Jul 19, 2019 •

edited by Gama11

Loading

Aurel300 commented Jul 19, 2019

cancerberoSgx commented Jul 19, 2019 •

edited

Loading

cancerberoSgx commented Jul 19, 2019

Aurel300 commented Jul 19, 2019

haxe.Http class should not use Strings to represent binary data #2586

haxe.Http class should not use Strings to represent binary data #2586

Comments

bjitivo commented Jan 29, 2014

bjitivo commented Jan 29, 2014

ncannasse commented Jan 29, 2014

bjitivo commented Jan 29, 2014

ncannasse commented Jan 29, 2014

bjitivo commented Jan 29, 2014

hughsando commented Jan 30, 2014

ncannasse commented Feb 23, 2014

singpolyma commented Mar 4, 2014

ProPuke commented Feb 4, 2015

singpolyma commented Feb 5, 2015

ProPuke commented Feb 5, 2015

ncannasse commented Jun 10, 2015

Simn commented Jun 10, 2015

back2dos commented Jun 10, 2015

hughsando commented Jun 11, 2015

ProPuke commented Jun 11, 2015

ncannasse commented Jun 11, 2015

Simn commented Jun 11, 2015

back2dos commented Jun 11, 2015

waneck commented Jun 11, 2015

Simn commented Jun 21, 2015

back2dos commented Jun 21, 2015

Simn commented Nov 23, 2015

cancerberoSgx commented Jul 19, 2019 • edited by Gama11 Loading

Aurel300 commented Jul 19, 2019

cancerberoSgx commented Jul 19, 2019 • edited Loading

cancerberoSgx commented Jul 19, 2019

Aurel300 commented Jul 19, 2019

cancerberoSgx commented Jul 19, 2019 •

edited by Gama11

Loading

cancerberoSgx commented Jul 19, 2019 •

edited

Loading