Have DEHEX convert UTF-8 sequences from browsers as Unicode #1986

Closed

rebolbot opened this issue Mar 6, 2013 · 17 comments

Comments

rebolbot commented Mar 6, 2013

Submitted by: rebolek

Unicode characters sent in forms from web browsers are now commonly encoded as UTF-8 byte sequences in the %.. notation. For example, the character "ř" is encoded as %C5%99. DEHEX does not understand this yet and converts it incorrectly. See the example code.

>> dehex "%c5%99"
== "A?"

>> to binary! dehex "%c5%99"
== #{C385C299} ; bad sequence

>> to binary! "ř"
== #{C599} ; good sequence

CC - Data [ Version: r3 master Type: Wish Platform: All Category: Native Reproduce: Always Fixed-in:none ]
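
For reference, the result being asked for can be produced by decoding the hex pairs to raw bytes first and only then interpreting those bytes as UTF-8 (a minimal R3 console sketch; DEBASE is used here only to build the byte sequence):

>> to string! debase/base "C599" 16
== "ř"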


rebolbot commented Mar 6, 2013

Submitted by: rebolek

I see that curecode has its problems with Unicode too. That character should be r with háček (caron) - HTML entity ř.


rebolbot commented Mar 6, 2013

Submitted by: rebolek

This looks like a fix (done in 10 minutes so it may have problems on its own):

    dehex: funct [
        "Converts URL-style hex encoded (%xx) strings. Process Unicode characters properly."
        value [any-string!] "The string to dehex"
    ][
        value: to binary! value     ; work on bytes, not codepoints
        parse value [
            some [
                to #"%"             ; find the next %xx escape
                ; replace the three bytes "%xx" with the single byte they encode
                m: (change/part m load rejoin ["#{" to string! copy/part next m 2 "}"] 3)
                skip
            ]
            to end
        ]
        to string! value            ; decode the accumulated bytes as UTF-8
    ]
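
With this patched version the example from the report decodes as intended (a quick sketch of expected usage, not an exhaustive test):

>> dehex "%c5%99"
== "ř"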


rebolbot commented Mar 6, 2013

Submitted by: BrianH

Change "Process Unicode characters properly." to something like "Process UTF-8 sequences as Unicode." to be more specific and avoid arguments over what "properly" means without any context. Otherwise, good idea if this is both common behavior and according to the current relevant standards.

I don't think this is a bug as much as a wish, since it is working as designed, and this would be a change to its design (minor detail). I'll rephrase the ticket accordingly, though it's the same request either way.


rebolbot commented Mar 7, 2013

Submitted by: rebolek

It was really designed to do this?

>> to binary! dehex "%c5%99"
== #{C385C299}

That looks like a bug, not a design decision. If it is by design, it's a very bad one, because it's not predictable.


rebolbot commented Mar 7, 2013

Submitted by: rebolek

Grrr, that should have been "%c5%99" - curecode mangled it. This is a really stupid format...


rebolbot commented Mar 7, 2013

Submitted by: BrianH

DEHEX was designed to decode ASCII sequences before Unicode in a browser was even a thing, and before the URL-encoding standards even applied to Unicode characters. If browsers have started encoding Unicode characters that way then that is a relatively new development, and definitely one that we should adapt to.

Don't take offence at the "Wish" designation. In general "Wish" tickets are considered more important than "Bug" tickets. Maybe when we hit beta we will be able to put a higher priority on "Bug" tickets that aren't easy fixes or crashes.


rebolbot commented Mar 7, 2013

Submitted by: rgchris

URL-encoded strings refer to bytes, not codepoints; URL-encoded strings should be converted to binary before dehexing. The current 'dehex, converting hex-encoded pairs to codepoints, is broken.
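
To make the byte-versus-codepoint distinction concrete (an R3 console sketch): the current DEHEX treats "%c5" as the codepoint U+00C5, whereas in a URL-encoded UTF-8 string it is only the first byte of a two-byte sequence.

>> dehex "%c5"
== "Å"   ; codepoint U+00C5 - but the pair was meant as the first byte of #{C599}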


rebolbot commented Mar 7, 2013

Submitted by: rebolek

I did some research ( = read this Wikipedia entry: http://en.wikipedia.org/wiki/Percent-encoding ) and found that "The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986." So it's actually not that new a development, as it pre-dates R3.
No offence taken at the "Wish" designation; I'd just like to see this fixed in R3 sooner rather than later, as it's a basic requirement for using R3 as a CGI in the non-ASCII parts of the world.
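
A worked instance of the rule quoted above (a hedged sketch using R3's ENBASE): the character is first converted to its UTF-8 bytes, and each byte is then percent-encoded.

>> enbase/base to binary! "ř" 16
== "C599"
; prefixing each byte pair with "%" yields "%C5%99", which is what the browser sends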


rebolbot commented Mar 7, 2013

Submitted by: rebolek

+1 @rgchris


rebolbot commented Mar 7, 2013

Submitted by: BrianH

DEHEX hasn't been redesigned for Unicode yet, so it still has the design from R2. I don't want to argue whether that RFC predates R3, because then we'll just get into how many years it takes these standards to be adopted, and how telling it is that before today I had never heard of a browser implementing that standard (not as telling as you'd think, since I haven't done web development in a couple of years). When it comes down to it, it's a good idea to make this change for the reasons you gave above.


rebolbot commented Mar 7, 2013

Submitted by: abolka

"DEHEX was designed [..] before Unicode in a browser was even a thing."

I distinctly remember having to tweak a registry setting in Win95 to tinker with Netscape 3 and Unicode. Wikipedia says Netscape 3.0 was released in 1996; so maybe Unicode in a browser existed before REBOL was even a thing :)

More on topic: whether under the flag of Wish or Bug, this change is necessary (*) to support Unicode correctly and consistently. On the whole, current DEHEX in R3 is simply wrong. (If this wrongness can be argued to be "by design", then other changes in R3 simply invalidate the previous design).

(*: Alternative fixes are thinkable, but, I think, unreasonable: error out when DEHEX encounters a value >127. Or change the docstring of DEHEX to remove the "URL-style" reference and simply define DEHEX as an additional Rebol-internal method of escaping the first 256 Unicode codepoints.)

So: +1


rebolbot commented Mar 8, 2013

Submitted by: BrianH

So, looking through the code, it appears that DEHEX can operate on 8-bit-element binary series and 16-bit-element strings. I don't know when the 8-bit DEHEX is ever called (perhaps it's a left-over from when binary! was a part of any-string!), but one thing that is clear is that it doesn't have a way to change the series to 16-bit-elements mid-stream if it encounters a codepoint that is greater than 255. That means that we can't support this for DEHEX running in 8-bit mode, assuming it ever does. However, some theoretical DEHEX of a binary would just generate UTF-8 binaries for this input using the old method, so we'll probably be fine for now.

If we are willing to limit this to 16-bit DEHEX this can be done. If it's OK with you all I'll have it generate the invalid-character codepoint when it decodes a codepoint outside of the BMP, at least as a stopgap until we support those in R3 at all. And there is no reason to complain about overlong UTF-8 encodings since we're generating Unicode strings in the first place - the restriction against overlong encodings is for when you're working with encoded UTF data directly instead of decoding it first.
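
For concreteness (the byte sequences here are illustrative, not taken from the thread): a codepoint outside the BMP would arrive as something like "%F0%9F%98%80", the four UTF-8 bytes of U+1F600, which under the stopgap above would decode to the invalid-character codepoint; an overlong encoding would be something like "%C0%AF", which spells "/" (U+002F) in two bytes where one suffices, and is indeed harmless once the data has been decoded to an actual Unicode string.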


Submitted by: Ladislav

That #"ř" should be #"^(0159)" - as I see it works well in Curecode, so I wonder what exactly Bolek messed up? (probably his browser setting)


Submitted by: abolka

In the core-tests suite.

@hostilefork (Member)

"one thing that is clear is that it doesn't have a way to change the series to 16-bit-elements mid-stream if it encounters a codepoint that is greater than 255."

DEHEX creates a new series, so there wasn't really a need to be concerned about this... it uses the mold buffer.

Resolved in Ren-C, with a method which will be improved when UTF-8 Everywhere gets done. Any % in the input that isn't followed by two uppercase or lowercase hex digits will result in an error.

Also adds a corresponding ENHEX native which produces canonized uppercase hex digits, per the RFC's recommendation.

metaeducation/ren-c@6e7d81c
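
A sketch of the described Ren-C behavior (reconstructed from the description above, not a verified console transcript):

>> dehex "%c5%99"
== "ř"

>> enhex "ř"
== "%C5%99"

>> dehex "50% off"
** Error    ; a % not followed by two hex digits is rejected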


Oldes commented Apr 1, 2019

I think that, first of all, it is a mistake that dehex does not work on binary!

>> to-string dehex #{2F666F726D3F763D254335253939}
== "/form?v=ř"

>> to-string dehex to-binary "%c5%99"
== "ř"

Oldes added a commit to Oldes/Rebol3 that referenced this issue Apr 1, 2019
Oldes added a commit to Oldes/Rebol3 that referenced this issue Jun 6, 2019
Oldes added a commit to Oldes/Rebol3 that referenced this issue Jun 6, 2019
@hostilefork (Member)

DEHEX is fundamentally a string operation. Working on BINARY! with UTF-8 Everywhere is non-essential, since if the BINARY! represents valid UTF-8 it can be AS TEXT!'d and not use extra memory:

>> as text! #{2F666F726D3F763D254335253939}
== "/form?v=%C5%99"

>> dehex as text! #{2F666F726D3F763D254335253939}
== "/form?v=ř"

But adding that to tests.
