Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/csv: stop preserving \r in multiline quoted strings #22746

Closed
rsc opened this issue Nov 15, 2017 · 17 comments
Closed

encoding/csv: stop preserving \r in multiline quoted strings #22746

rsc opened this issue Nov 15, 2017 · 17 comments

Comments

@rsc
Copy link
Contributor

rsc commented Nov 15, 2017

In #21201, @alexhultman filed a fairly blunt bug report that “I want the verbatim data as in the CSV file, not some made up corrupt data.” because the CSV reader changes \r\n to \n inside multiline quoted strings.

@nussjustin said on the issue that this behavior was unexpected, and @pborman seemed to agree that it was wrong, but I cannot figure out why. Later in the issue @pborman says “You have pointed out a situation in which the CSV package is not following the RFC.” but I can't see anything in the RFC that says \r must be preserved in this case.

About carriage returns the RFC says (in total!):

“Each record is located on a separate line, delimited by a line break (CRLF).”

“The last record in the file may or may not have an ending line break.”

“Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.”

“As per section 4.1.1. of RFC 2046 [3], this media type uses CRLF to denote line breaks. However, implementors should be aware that some implementations may use other values.”

The entire ABNF grammar is:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE =  %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

Nothing in the RFC text or the grammar says that CRLF inside a quoted string must not be shortened to LF. What the RFC does say is that CRLF is the way line breaks are denoted and that some implementations use other ones. Consistent with that, the Go library rewrites all CRLF to just LF, because that is the standard form inside Go programs.

@alexhultman didn't give a rationale for keeping the \r, and I can't think of one.

Everything else in Go uses \n as the line separator. It makes sense to treat inputs here using \r\n as the line separator as an external encoding to be converted and to apply the \r\n → \n conversion to all lines, including quoted strings. CSV files are a special format text file. If you "convert" a CSV file from Unix line endings to Windows line endings, a convenient property would be that reading the CSV data produces the same result for both versions of the file. Go 1.9 and earlier provided that. We do the same processing for \r inside multiline backquote strings in Go source code, for the same reason.

I believe that if we keep the change in CL 52810, this will surprise users who have CSV data from some generator that uses \r\n line endings and now have unexpected \r characters in their multiline values that they will need to postprocess away.

I think we should revert this CL.

@ghost
Copy link

ghost commented Nov 15, 2017

This is the most stupid thing I've heard in my entire life and it is nothing more than stubborn refusal to accept you made a mistake. This behavior makes the Go CSV parser MODIFY data that are in the middle of a string! The standard probably does not say anything about preservibg Swedish character Ö either so why not shorten it to O following this ridiculous nonsense. For me however I'm now using libcsv which is about 7x more performant and does not corrupt data so I don't care personally anymore. Golang is dead for me.

@pborman
Copy link
Contributor

pborman commented Nov 15, 2017

It appears the text and the BNF differ. The text says:

Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

This example seems to imply that a line break is CRLF.

The BNF states:

   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
   TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
   COMMA = %x2C
   CR = %x0D ;as per section 6.1 of RFC 2234 [2]
   LF = %x0A ;as per section 6.1 of RFC 2234 [2]
   DQUOTE =  %x22 ;as per section 6.1 of RFC 2234 [2]

Expanded, TEXTDATA includes CR, LF, and all printable characters (0x20-0x73) excluding " unless it is doubled. TEXTDATA may contain a lone CR or a lone LF.

The RFC does not cover two types of data: binary and UNICODE. A strict reading of the RFC limits a string to all printable ASCII characters plus a space, linefeed, and carriage return. The csv package is a superset of the RFC (and I think it needs to be, e.g., it should allow UNICODE).

I do not believe either interpretation is wrong and it ends up being a choice of convention.

I agree with Russ that man programs/operating systems feel quite free to convert CRLF to LF and a lone LF to CRLF. If an operating system or program does this then it is quite possible the text seen by the CSV parser is not exactly the original text.

The main problem as I see it is aside from RFC 4180 there is no real CSV standard. Many common use cases do not follow RFC 4180. The csv package tries hard to both honor 4180 as well as accept many common use cases (is the space after a comma part of the field or part of the separator, can a quote exist in an unquoted field, must the number of fields be the same for every record). I seem to recall that Excel's CSV format was one of the variants beyond RFC 4180.

The existing package does documentation does describe the default behavior.

It is unfortunate there is more than one "standard" end of line sequence for text files (LF, CRLF, and I believe I have seen CR as well). Some programs/operating systems distinguish between text and binary files (FTP does, for example).

In my previous comment I was erring on the side of the BNF. Russ errs on the side of principle of least surprise to most people. I believe both are reasonable positions to take, but perhaps the principle of least surprise in this case, due to the ambiguity of end of line sequences, is a better choice than a pedantic reading of the BNF.

@ghost
Copy link

ghost commented Nov 15, 2017

We handle customer data that may come in CSV format and to corrupt customer data like this should be a straight execution penalty!

@pborman
Copy link
Contributor

pborman commented Nov 15, 2017

Alex, can you point to the CSV format description that your files are adhering to? I am honestly interested since back when I wrote the package I looked for every definition I could find and tried to address as many of them as possible (hence the options).

Some text editors on Windows/MS-DOS will "corrupt customer data" simply by reading in and writing out a text file (they change \n to \r\n). Can you provide more detail on what type of data this is (is it even ASCII/Unicode?) and why equating \r\n with \n corrupts the meaning of the data?

If there is a different package that satisfies your requirements then that is the right one to use. I do not doubt it is more performant. The standard csv package is quite old and predates Go 1.0. I would likely write it differently today.

@ghost
Copy link

ghost commented Nov 15, 2017

I don't care what Microsoft do or don't, they are incompetent people. I would never intentionally corrupt customer data like this and blame some paragraph jerkfest as the reason. This kind of reasonging is the reasoning of someone with pants up their armpits and star wars socks. A besserwisser. You can either do the obvious or you can argue current behavior is better, for me I don't care anymore I will never use such broken low quality garbage ever again. Corrupt customer data! Pfft!!

@dsnet
Copy link
Member

dsnet commented Nov 15, 2017

This is the most stupid thing I've heard in my entire life and it is nothing more than stubborn refusal to accept you made a mistake.

@alexhultman. I understand if you disagree. This issue would benefit more from a civil discussion on what the right behavior is. Please see https://golang.org/conduct

The idea to strip "\r" is not unprecedented. For example, in Javascript template literals, "\r\n" is always converted to "\n".

As for the behavior of csv.Reader. I don't have a strong opinion of what the "right" interpretation is. I recently re-wrote the implementation of Reader to be significantly faster. In terms of performance, it is faster to not fold "\r" into "\n" since you don't have to check for it. The new implementation does not parse CSV files on a rune-by-rune basis.

@bourbonspecial
Copy link

For what it's worth, as a user this seems like a strange change. CSV is usually used as a format for passing around data and it seems weird to start messing around with the data itself. I don't think it matters that "Everything else in Go uses \n as the line separator." This is just data passing through Go programs, not some aspect of Go itself.

@rsc
Copy link
Contributor Author

rsc commented Nov 16, 2017

@bourbonspecial, suppose we read this file with two records:

a,b,c<CR><LF>
a,"b1<CR><LF>
b2",c<CR><LF>

Everyone seems to agree with (at least no one is making an argument against) reporting the third field in both records as "c" not "c\r". Why then should we report the middle field of the second record as "b1\r\nb2" instead of "b1\nb2"? The quotes are there to allow the field to span multiple lines, not to say "all of a sudden s are important right here but nowhere else in the file".

I am genuinely interested to hear of a setting where stripping the on lines 1 and 3 of that file is correct and appropriate and desirable but stripping the on line 2 is incorrect, inappropriate, and undesirable. As I said above, I can't think of any.

There is also much weight to be given to preserving existing behavior. Inevitably every behavioral change, even when everyone agrees the old behavior was a mistake, breaks some users. If we can help them see that the old behavior was wrong, that goes a long way to making clear that we're not just breaking things without considering the consequences. Here I think many users will be broken who actually do not want this new behavior and will not agree that the new behavior is desirable. Hence my interest in hearing about concrete instances when a user would want the new behavior.

But I think I'm beating a dead horse at this point. Without a compelling example of why preserving this one in that file is important, the scales clearly tip in favor of preserving existing behavior and therefore not breaking existing uses. I will send a CL.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/78295 mentions this issue: encoding/csv: restore Go 1.9 quoted \r\n handling in Reader

@rsc
Copy link
Contributor Author

rsc commented Nov 16, 2017

CL sent. I also noticed while working on the CL that the earlier change made Reader incompatible with Writer. If you are writing with a Writer with UseCRLF set to true, it encodes "a\nb" as "a\r\nb", so that UseCRLF means every line ending in the output is CRLF. Reader correctly undid that in Go 1.9 so that data round-tripped. Reverting the earlier change preserves that property.

@bourbonspecial
Copy link

bourbonspecial commented Nov 16, 2017

@rsc suppose you're processing some data that includes the following within a field "hello\r\nfriend", how would I preserve that sequence of characters if I have to read it through and write it out of the CSV package?

In your example, the difference is that lines 1 and 3 \r\n denotes the end of a record whereas in line 2 \r\n is data in an escaped field. The RFC says you're allowed to include a certain set of characters (including \r) in a field so I think stripping characters out of a field is a very surprising behaviour.

@pborman
Copy link
Contributor

pborman commented Nov 16, 2017

@bourbonspecial the issue is that CSV files are text files. Text files are not binary. The two text files "foo\r\nbar" and "foo\nbar" are logically equivalent even though they are different from a binary perspective. Another example, the following two one line CSV files are equivalent:

foo,bar

"foo","bar"

but they physically differ.

I personally do not believe there is a single correct answer to how to deal with carriage returns or newlines embedded in a string. We could have a "preserve carriage returns in fields" option, but that doesn't deal with other text handling code in a system that might canonicalize line endings in text files (FTP ASCII mode transfers are one such system).

We have had one very loud complaint about end-of-line handling by the csv package, but very little information on the specifics of why this was a problem, if the problem was in fact other software that was fragile, or even what kind of data it was. If the data was binary then I would suggest that using CSV is a poor choice and the file format is one that makes a different compromise.

We have a concern that the change of preserving carriage returns will cause existing programs to break, which violates the backwards compatibility guarantee of Go 1.

I accept that I made a mistake when I said “You have pointed out a situation in which the CSV package is not following the RFC.” I now believe the RFC is ambiguous on this point. The prose and the BNF are in conflict. RFC4180 is in the context of MIME, and the file is a series of lines of text. A CSV field can span multiple lines of text. Lines of text do not contain CRLF or bare LF sequences.

I can still make the argument for preserving CR's in fields, but absence any other information than what has been presented, the argument that they are lines of text and preserving the backwards compatibility guarantee tilts the resolution in favor of existing behavior.

@bourbonspecial
Copy link

@pborman fair enough.

My as-a-humble-user point of view is that there's some thought involved in "foo\r\nbar" == "foo\nbar", and if there's an opportunity to distance yourself from worrying about how other people might interpret those characters then it seems safer to not get involved. Totally accept that the risk of breaking existing programs is ever present so it's a trade off.

@rsc
Copy link
Contributor Author

rsc commented Nov 16, 2017

@pborman FWIW I don't even think the BNF conflicts with any of this. The BNF just says that in order to be a valid CSV, any mid-value CR or LF bytes must be quoted. It does not say how they must be interpreted.

@soliloqui
Copy link

Assertions
When one uses escaping one expects the escaped data to survive parsing.
The original bug complains purely about escaped data being changed.

My 2p
If my shell decided to ignore escaping and altered data I had explicitly escaped I would consider it broken irrespective of the default line ending in use on my system.

The RFC detail for implementors about CRLF for line breaks being open to different line breaks is about line breaks delimiting records, it is not about escaped data.

So I would expect:

file = [header BR] record *(BR record) [BR]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE =  %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
BR = [CR] LF ;as per section 6.1 of RFC 2234 [2] and as per section 4.1.1. of RFC 2046 [3]
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

I think a parser so generated would leave an escaped field containing \r\n unmodified.

When dealing with end users the de-facto standard for CSV is not the IETF RFC it is 'whatever Excel does'. I think (hope) most developers who have a need to use CSV wish to do so because their end users are using it in Excel or they are doing some database ETL process (in which case you still want the escaped \r\n preserved).

@noangel
Copy link

noangel commented Jun 5, 2018

I trying to use Go for my csv processing tool.
go version go1.10.2 windows/amd64
My tool must merge some data change from one CSV file to one or more CSV files.
My tool first requirement is: if source file is not changed output CSV must look like it not changed too.
When I using standard module from encoding/csv, it breaks many new line characters everywhere!
So current solution is create own CSV reader and writer which will keep new line characters.
I'm new to Go, but I wonder why standard modules should have such limitations? If it's difficult to decide keep new line inside quotes or not, please give us an option or create separate function, please support both cases.
In real world there are a lot of CSV files in mixed format: someone uses UTF-8, someone prefers Shift-JIS, someone uses Mac OS and saves a file with \n, another guy saves his file from Windows with \r\n. I know it's not easy but universal module which will support all these cases smartly will be super helpful. For this what we have now - it looks like incomplete toy...

@ianlancetaylor
Copy link
Member

@noangel This issue is closed. We don't use issues for discussion. Please see https://golang.org/wiki/Questions . Thanks.

The short answer to your question is that it turns out that there are hundreds of variants of CSV and there are hundreds of different ways that they are used. Supporting them all in the standard library would mean a package that is very difficult to understand and use correctly. So we've decided going forward to restrict the package only to the format described in RFC 4180. We understand that this means that the package can't work for everyone. The code is there and available for modification for your specific use case.

neeral added a commit to neeral/cockroach that referenced this issue Aug 2, 2018
See cockroachdb#25344.

It appears this is being caused by the behaviour of Golang's
encoding/csv library, which folds \r\n into \n when reading. This was
fixed in golang/go#21201 but then reverted golang/go#22746. It appears
based on that second issue that Go is unlikely to change that behavior.

Check in the stdlib `encoding/csv` into `pkg/util` with
golang/go#22746 reverted.

Release note:
`\r\n` characters in CSV files were silently converted into `\n`. This
causes imported data to be different. This is now fixed.
neeral added a commit to neeral/cockroach that referenced this issue Aug 2, 2018
See cockroachdb#25344.

It appears this is being caused by the behaviour of Golang's
encoding/csv library, which folds \r\n into \n when reading. This was
fixed in golang/go#21201 but then reverted golang/go#22746. It appears
based on that second issue that Go is unlikely to change that behavior.

Check in the stdlib `encoding/csv` into `pkg/util` with
golang/go#22746 reverted.

Release note:
`\r\n` characters in CSV files were silently converted into `\n`. This
causes imported data to be different. This is now fixed.
dt pushed a commit to dt/cockroach that referenced this issue Aug 6, 2018
See cockroachdb#25344.

It appears this is being caused by the behaviour of Golang's
encoding/csv library, which folds \r\n into \n when reading. This was
fixed in golang/go#21201 but then reverted golang/go#22746. It appears
based on that second issue that Go is unlikely to change that behavior.

Check in the stdlib `encoding/csv` into `pkg/util` with
golang/go#22746 reverted.

Release note:
`\r\n` characters in CSV files were silently converted into `\n`. This
causes imported data to be different. This is now fixed.
@golang golang locked and limited conversation to collaborators Jun 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants