to string! can cause information change and information loss #1517

Closed
Siskin-Bot opened this issue Feb 15, 2020 · 1 comment

Comments

Collaborator

Siskin-Bot commented Feb 15, 2020

Submitted by: sqlab

to string! (and to-string) changes #{0D} to #{0A} and reduces #{0D0A} to #{0A}:
>> to-string to-binary "^M"
== "^/"
>> to-binary to-string #{0D}
== #{0A}

>> to-string to-binary "^M^/"
== "^/"
>> to-binary to-string #{0D0A}
== #{0A}

Imported from: CureCode [ Version: alpha 97 Type: Bug Platform: All Category: Datatype Reproduce: Always Fixed-in:none ]
Imported from: metaeducation#1517

Comments:

Rebolbot commented on Mar 10, 2010:

Submitted by: meijeru

This is a feature, as far as I know.


Rebolbot commented on Mar 10, 2010:

Submitted by: sqlab

If I want to change line terminators, I can use enline and deline. But there is no way to convert to a string without the line terminators being changed.


Rebolbot commented on Mar 11, 2010:

Submitted by: BrianH

I'm inclined to say that this is not a bug.

REBOL strings use "^/" as a line terminator internally. When you convert to REBOL strings, you convert to REBOL internal line termination. All of the REBOL functions that deal with strings expect REBOL line termination. Other line termination standards are an external matter, handled by the conversion routines that are used to format the strings in binary: WRITE if you want something platform-specific, TO-BINARY if you don't. And use DELINE and ENLINE if you need to work around this (though see Oldes/Rebol-wishes#42).

If you want binary information conserved then work in binary; don't convert to string. This will save you from the invalid UTF character conversion as well.
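
To make the trade-off concrete, here is a minimal sketch in Python (not Rebol). The helper name `to_string_normalizing` is hypothetical; it only models the reported alpha 97 behavior of folding CR and CRLF into LF on conversion, and is not Rebol's actual implementation. It shows that a round trip through such a string conversion loses the original bytes, while staying in binary does not.

```python
def to_string_normalizing(data: bytes) -> str:
    """Model of a decoder that converts to LF-only line termination.

    Hypothetical helper: it mimics the reported behavior, not Rebol's code.
    """
    text = data.decode("utf-8")
    # Fold CRLF first so it becomes a single LF, then fold any lone CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")


original = bytes.fromhex("4F6E650D0A54776F")  # "One" CR LF "Two"

# Round trip through the normalizing string conversion: the CR is gone.
round_tripped = to_string_normalizing(original).encode("utf-8")
print(round_tripped.hex().upper())  # 4F6E650A54776F

# Staying in binary preserves every byte, including the CR.
print(original.hex().upper())  # 4F6E650D0A54776F
```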


Hostilefork mentioned this issue on Mar 26, 2018:
TO-STRING TO-BINARY of a STRING! loses carriage returns


Rebolbot added the Type.bug label on Jan 12, 2016


Hostilefork commented on Mar 26, 2018:

This issue was also raised on the Atronix R3 repository. I wrote a blog entry that summarizes my opinions on why we should be thinking about living in a world without CR LF:

http://blog.hostilefork.com/death-to-carriage-return/

Of the bug, I said:

I think the big mistake here is trying to take a real/actual/concrete problem and make it "invisible"...thus losing data without warning.

You can't wish away complexity, but you can ask it to go away. I'd suggest that Rebol favor the universe that Unix/Posix/Linux (then OS/X, and now Windows) seem to be going for, with just LF. Look at the move to line-feeds-only as a vote for the future... like using UTF-8 as an exchange medium.

So consider files or binaries with carriage returns in them to be a foreign format. Don't read them or write them without a special codec, the same way you'd need for UCS-2 or anything else.

>> to-string #{4F6E650D0A54776F}
** Error: Deprecated 0x0D "carriage return" byte (try DECODE as 'UTF8)

Then have the decoder have options to preserve CR bytes, discard them, give errors if they are found standalone vs. paired with an LF, in reverse order, etc. All the lovely issues you have from the two-character sequence.
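
One way such decoder options could look, sketched in Python rather than Rebol. Everything here is an assumption for illustration: the names `CRPolicy` and `decode_utf8`, the three policies, and the choice to make the strict policy the default. Under the strict policy a CR is accepted only when an LF follows it immediately, so a reverse-order LF CR pair is reported as a standalone CR.

```python
from enum import Enum


class CRPolicy(Enum):
    PRESERVE = "preserve"  # keep CR characters exactly as found
    DISCARD = "discard"    # drop every CR
    STRICT = "strict"      # accept CR only as part of CRLF, else raise


def decode_utf8(data: bytes, cr: CRPolicy = CRPolicy.STRICT) -> str:
    """Hypothetical decoder with an explicit carriage-return policy."""
    text = data.decode("utf-8")
    if cr is CRPolicy.PRESERVE:
        return text
    if cr is CRPolicy.DISCARD:
        return text.replace("\r", "")
    # STRICT: every CR must be immediately followed by LF.
    for i, ch in enumerate(text):
        if ch == "\r" and text[i + 1 : i + 2] != "\n":
            raise ValueError(f"standalone CR at character offset {i}")
    return text.replace("\r\n", "\n")


data = bytes.fromhex("4F6E650D0A54776F")  # "One" CR LF "Two"
print(repr(decode_utf8(data, CRPolicy.PRESERVE)))  # 'One\r\nTwo'
print(repr(decode_utf8(data, CRPolicy.DISCARD)))   # 'One\nTwo'
print(repr(decode_utf8(data, CRPolicy.STRICT)))    # 'One\nTwo'
```

The point of the explicit policy argument is that the caller has to state what should happen to CR, instead of the conversion deciding silently.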

It might seem tempting to say that if you manage to get a string into the system with CR in it, you should be able to write it out. But I'd say the default UTF8 encoder used and standardized by the system should be picky too. Rebol's common assumption (and the assumption we'd like to be able to make systemically) is that newline is all you need, so if you didn't filter your line endings you will be getting a mixture most of the time.

So...

  • Make a strong decision about the default: LF is favored by everyone these days but Notepad, and it's better to help facilitate living in that world.
  • Standardize that when Rebol files are exchanged over the network they will not have CRLF in them. Don't load source containing CRLF unless a special command line switch or mode is set (default is OFF). (I feel the same way about tabs.) No matter what tolerance is given by these modes, do not let string literals have the "bad" characters in them.
  • If someone is working in a hybrid environment where their data files do have CR in them, be noisy. Don't read as strings or write back out with CR unless they really know what they are doing and demand it. Make it as easy as feasible to demand and give guidance...but make it clear that the native tongue is no-CR.

Other characters that should be excluded would be things like the BOM (Byte-Order-Mark), which is basically a bug if it appears in UTF-8 data, most of the time.
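
A picky default encoder along these lines might be sketched as follows, again in Python rather than Rebol. The function name `encode_utf8_strict` and its error messages are hypothetical. Rejecting a BOM anywhere in the string is a simplifying assumption of this sketch, stricter than "most of the time".

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as EF BB BF


def encode_utf8_strict(text: str) -> bytes:
    """Hypothetical strict encoder: refuses CR and BOM instead of writing them."""
    if "\r" in text:
        raise ValueError("carriage return in string; filter it or use a codec that allows CR")
    if BOM in text:
        raise ValueError("byte-order mark in string; strip it before encoding")
    return text.encode("utf-8")


print(encode_utf8_strict("One\nTwo").hex().upper())  # 4F6E650A54776F
```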


Hostilefork mentioned this issue on Mar 26, 2018:
ENLINE does not convert line endings to native OS format


Oldes mentioned this issue on Dec 18, 2018:
Rebol removes/converts CR character when doing binary to string conversion


Hostilefork added on Mar 26, 2018


Owner

Oldes commented Jun 2, 2020

This was fixed so it is again compatible with Rebol2 and Red:

>> to-string to-binary "^M"
== "^M"

>> to-string to-binary "^/"
== "^/"

>> to-binary to-string #{0D0A}
== #{0D0A}

Oldes closed this as completed on Jun 2, 2020
Oldes added a commit to Oldes/Rebol3 that referenced this issue Jun 2, 2020