to string! can cause information change and information loss #1517

Siskin-Bot · 2020-02-15T17:03:44Z

Submitted by: sqlab

to string! (and to-string) changes #{0D} to #{0A}
and reduces #{0D0A} to #{0A}

>> to-string to-binary "^M"
== "^/"
>>  to-binary to-string  #{0D}
== #{0A}

>> to-string to-binary "^M^/"
== "^/"
>>  to-binary to-string  #{0D0A}
== #{0A}

Imported from: CureCode [ Version: alpha 97 Type: Bug Platform: All Category: Datatype Reproduce: Always Fixed-in: none ]
Imported from: metaeducation#1517

Comments:

Rebolbot commented on Mar 10, 2010:

Submitted by: meijeru

This is a feature, as far as I know.

Rebolbot commented on Mar 10, 2010:

Submitted by: sqlab

If I want to change line terminators, I can use enline and deline. But there is no way to convert to a string without changing the line terminators.

Rebolbot commented on Mar 11, 2010:

Submitted by: BrianH

I'm inclined to say that this is not a bug.

REBOL strings use "^/" as a line terminator internally. When you convert to REBOL strings, you convert to REBOL internal line termination. All of the REBOL functions that deal with strings expect REBOL line termination. Other line termination standards are an external matter, handled by the conversion routines that are used to format the strings in binary: WRITE if you want something platform-specific, TO-BINARY if you don't. And use DELINE and ENLINE if you need to work around this (though see Oldes/Rebol-wishes#42).

If you want binary information conserved then work in binary; don't convert to string. This will save you from the invalid UTF character conversion as well.

Hostilefork mentioned this issue on Mar 26, 2018:
TO-STRING TO-BINARY of a STRING! loses carriage returns

Rebolbot added the Type.bug label on Jan 12, 2016

Hostilefork commented on Mar 26, 2018:

This issue was also raised on the Atronix R3 repository. I wrote a blog entry that summarizes my opinions on why we should be thinking about living in a world without CR LF:

http://blog.hostilefork.com/death-to-carriage-return/

Of the bug, I said:

I think the big mistake here is trying to take a real/actual/concrete problem and make it "invisible"...thus losing data without warning.

You can't wish away complexity, but you can ask it to go away. I'd suggest that Rebol favor the universe that Unix/Posix/Linux (then OS/X, and now Windows) seem to be going for, with just LF. Look at the move to line-feeds-only as a vote for the future... like using UTF-8 as an exchange medium.

So consider files or binaries with carriage returns in them to be a foreign format. Don't read them or write them without a special codec, the same way you'd need for UCS-2 or anything else.
>> to-string #{4F6E650D0A54776F} 
 ** Error: Deprecated 0x0D "carriage return" byte (try DECODE as 'UTF8)
Then have the decoder have options to preserve CR bytes, discard them, give errors if they are found standalone vs. paired with an LF, in reverse order, etc. All the lovely issues you have from the two-character sequence.
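A minimal sketch of what such decoder options could look like, written in Python rather than Rebol purely for illustration. The function name `decode_utf8` and the policy names `"error"`, `"preserve"`, `"discard"`, and `"crlf"` are hypothetical; nothing here reflects an actual Rebol codec.

```python
def decode_utf8(data: bytes, cr: str = "error") -> str:
    """Decode UTF-8 bytes under an explicit carriage-return policy.

    Hypothetical policies (assumptions for illustration only):
      "error"    - refuse any input containing a CR (the strict default)
      "preserve" - keep CR characters exactly as they appear
      "discard"  - drop every CR, paired or not
      "crlf"     - turn CR LF pairs into LF, reject any other CR
    """
    # bytes.decode performs no newline translation, so CRs survive this step.
    text = data.decode("utf-8")

    if cr == "preserve":
        return text
    if cr == "discard":
        return text.replace("\r", "")
    if cr == "crlf":
        converted = text.replace("\r\n", "\n")
        # Any CR left over was standalone or in reverse (LF CR) order.
        if "\r" in converted:
            raise ValueError("standalone or misordered carriage return")
        return converted
    if cr == "error":
        if "\r" in text:
            raise ValueError("carriage return found; choose an explicit policy")
        return text
    raise ValueError(f"unknown CR policy: {cr!r}")
```

With the bytes from the example above, `decode_utf8(bytes.fromhex("4F6E650D0A54776F"))` raises an error under the default policy, while passing `cr="crlf"` yields a string with a single LF between the two words. The point of the design is that the caller has to name the lossy behavior it wants.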

It might seem tempting to just say that if you manage to get a string into the system with CR in it that you should write it out. But I'd say the UTF8 default encoder used and standardized by the system should be picky too. Given how much of Rebol's common assumption (and the assumption we'd like to be able to make systemically) is that newline is all you need, if you didn't filter your newlines out you will be getting a mixture most of the time.

So...

Make a strong decision about the default: LF is favored by everyone these days but Notepad, and it's better to help facilitate living in that world.

Standardize that when Rebol files are exchanged over the network they will not have CRLF in them. Don't load source containing CRLF unless a special command line switch or mode is set...default is OFF. (I feel the same way about tabs.) No matter what tolerance is given by these modes, do not let string literals have the "bad" characters in them.

If someone is working in a hybrid environment where their data files do have CR in them, be noisy. Don't read as strings or write back out with CR unless they really know what they are doing and demand it. Make it as easy as feasible to demand and give guidance...but make it clear that the native tongue is no-CR.

Other characters that should be excluded would be things like the BOM (Byte-Order-Mark), which is basically a bug if it appears in UTF-8 data, most of the time.

Hostilefork mentioned this issue on Mar 26, 2018:
ENLINE does not convert line endings to native OS format

Oldes mentioned this issue on Dec 18, 2018:
Rebol removes/converts CR character when doing binary to string conversion

Hostilefork added on Mar 26, 2018

Oldes · 2020-06-02T10:18:10Z

This was fixed so it is again compatible with Rebol2 and Red:

>> to-string to-binary "^M"
== "^M"

>> to-string to-binary "^/"
== "^/"

>> to-binary to-string  #{0D0A}
== #{0D0A}

Siskin-Bot added Ren.important Status.important Type.note labels Feb 15, 2020

Siskin-Bot mentioned this issue Feb 15, 2020

TO-STRING TO-BINARY of a STRING! loses carriage returns #2298

Closed

Oldes added the Test.written label Jun 2, 2020

Oldes closed this as completed Jun 2, 2020

Oldes added a commit to Oldes/Rebol3 that referenced this issue Jun 2, 2020

TEST: Oldes/Rebol-issues#1517

9cf326b

Siskin-Bot mentioned this issue Jul 5, 2022

ENLINE /with option Oldes/Rebol-wishes#42

Open
